Skip to content

bug: uncaught StreamStateError when drain event fires on closing stream #3415

@lodekeeper

Description

@lodekeeper

Description

A race condition in abstract-message-stream.ts causes an uncaught exception when a drain event fires on a stream whose underlying transport is already closing. The StreamStateError is thrown inside an event listener callback with no surrounding try/catch, so it propagates as an uncaught exception to the Node.js process level.

Environment

  • @libp2p/utils@7.0.13 (latest)
  • @chainsafe/libp2p-noise@17.0.0
  • @libp2p/mplex@12.0.14
  • Lodestar v1.41.0 (Ethereum consensus client)
  • Node.js (TCP transport)

Stack Trace

[network] error: uncaughtException: Cannot write to a stream that is closing
StreamStateError: Cannot write to a stream that is closing
    at TCPSocketMultiaddrConnection.send (abstract-message-stream.js:112)
    at stream.send (@libp2p/prometheus-metrics/index.js:183)
    at EncryptedMessageStream.sendData (@chainsafe/libp2p-noise/utils.js:169)
    at EncryptedMessageStream.processSendQueue (abstract-message-stream.js:355)
    at EncryptedMessageStream.continueSendingOnDrain (abstract-message-stream.js:56)

Root Cause Analysis

The race:

  1. A TCP connection begins closing → the underlying stream's writeStatus transitions to 'closing'
  2. Before the close completes, a drain event fires on the underlying TCP stream (the OS/Node.js TCP socket signals buffer space is available)
  3. The noise layer's noiseOnDrain handler (libp2p-noise/utils.ts) relays this as safeDispatchEvent('drain') on the encrypted stream
  4. The continueSendingOnDrain listener in abstract-message-stream.ts (line 97-104) calls processSendQueue()
  5. processSendQueue() (line 428) has guards for writableNeedsDrain, empty buffer, and sendingData — but no guard for writeStatus
  6. It calls sendData() → noise's sendData() calls this.stream.send() on the underlying (now-closing) TCP stream
  7. send() (line 169-171) throws StreamStateError because writeStatus === 'closing'
  8. The throw propagates through the event listener call stack with no try/catch → uncaught exception

Key observation: The continueSendingOnDrain callback at line 97 only checks this.writableNeedsDrain before calling processSendQueue(). Neither it nor processSendQueue() verifies that the stream is still writable.

Impact

  • On Lodestar mainnet: 6 uncaught exceptions in 7 days (3 distinct timestamps, occurring in bursts of 2-3 when a connection tears down)
  • The error is logged as [network] error: uncaughtException — it reaches the Node.js process.on('uncaughtException') handler
  • Currently non-fatal (Lodestar catches and logs it) but any uncaughtException is a process stability risk
  • Affects any js-libp2p consumer using noise over TCP when connections tear down while the send queue is draining

Proposed Fix

Add a writeStatus guard in processSendQueue(), which is the single chokepoint for all send paths:

// In abstract-message-stream.ts, processSendQueue() — add after existing bail-out checks:

protected processSendQueue (): boolean {
    // bail if the underlying transport is full
    if (this.writableNeedsDrain) { /* ... existing ... */ }

    // bail if there is no data to send
    if (this.writeBuffer.byteLength === 0) { /* ... existing ... */ }

    // bail if we are already sending data
    if (this.sendingData) { /* ... existing ... */ }

    // bail if the stream is no longer writable — a drain event can
    // arrive after writeStatus transitions to 'closing'/'closed'
    if (this.writeStatus !== 'writable') {
      this.log.trace('not processing send queue as stream is %s', this.writeStatus)
      return false
    }

    // ... rest of method unchanged
}

This is safe because:

  • It's a single guard at the beginning of the method, consistent with the existing bail-out pattern
  • Data remaining in writeBuffer when the stream closes will be discarded as part of the normal close/abort cleanup
  • The send() method already enforces this invariant by throwing — this just converts the throw into a graceful no-op at the right level

Alternative (less preferred): Add a try/catch around the sendData() call in processSendQueue() to swallow StreamStateError during close. This would be more defensive but masks the root cause.

Reproduction

This is a timing-dependent race that occurs during TCP connection teardown. It's most likely to manifest under load when many connections are being established and torn down simultaneously. In our case, a mainnet Ethereum beacon node with ~200 peers.

The simplest conceptual reproduction:

  1. Establish a noise-over-TCP connection with mplex
  2. Send enough data to fill the write buffer (trigger writableNeedsDrain)
  3. Close/abort the connection while the buffer is still draining
  4. The OS delivers a drain event after the close has started → boom

Reported from Lodestar — an Ethereum consensus client built on js-libp2p.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions