bug: uncaught StreamStateError when drain event fires on closing stream #3415
Description
A race condition in abstract-message-stream.ts causes an uncaught exception when a drain event fires on a stream whose underlying transport is already closing. The StreamStateError is thrown inside an event listener callback with no surrounding try/catch, so it propagates as an uncaught exception to the Node.js process level.
Environment
- @libp2p/utils@7.0.13 (latest)
- @chainsafe/libp2p-noise@17.0.0
- @libp2p/mplex@12.0.14
- Lodestar v1.41.0 (Ethereum consensus client)
- Node.js (TCP transport)
Stack Trace
```
[network] error: uncaughtException: Cannot write to a stream that is closing
StreamStateError: Cannot write to a stream that is closing
    at TCPSocketMultiaddrConnection.send (abstract-message-stream.js:112)
    at stream.send (@libp2p/prometheus-metrics/index.js:183)
    at EncryptedMessageStream.sendData (@chainsafe/libp2p-noise/utils.js:169)
    at EncryptedMessageStream.processSendQueue (abstract-message-stream.js:355)
    at EncryptedMessageStream.continueSendingOnDrain (abstract-message-stream.js:56)
```
Root Cause Analysis
The race:
- A TCP connection begins closing → the underlying stream's `writeStatus` transitions to `'closing'`
- Before the close completes, a `drain` event fires on the underlying TCP stream (the OS/Node.js TCP socket signals buffer space is available)
- The noise layer's `noiseOnDrain` handler (libp2p-noise/utils.ts) relays this as `safeDispatchEvent('drain')` on the encrypted stream
- The `continueSendingOnDrain` listener in `abstract-message-stream.ts` (lines 97-104) calls `processSendQueue()`
- `processSendQueue()` (line 428) has guards for `writableNeedsDrain`, empty buffer, and `sendingData`, but no guard for `writeStatus`
- It calls `sendData()` → noise's `sendData()` calls `this.stream.send()` on the underlying (now-closing) TCP stream
- `send()` (lines 169-171) throws `StreamStateError` because `writeStatus === 'closing'`
- The throw propagates through the event listener call stack with no try/catch → uncaught exception
Key observation: The `continueSendingOnDrain` callback at line 97 only checks `this.writableNeedsDrain` before calling `processSendQueue()`. Neither it nor `processSendQueue()` verifies that the stream is still writable.
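The unguarded path can be sketched with a minimal stand-in. The class below is illustrative, not the real `abstract-message-stream.ts`, but it mirrors the guard set described above: `continueSendingOnDrain` only checks `writableNeedsDrain`, so a late drain on a closing stream reaches `send()` and throws.

```typescript
type WriteStatus = 'writable' | 'closing' | 'closed'

class StreamStateError extends Error {}

// Illustrative stand-in for abstract-message-stream.ts (not the real class)
class MessageStream {
  writeStatus: WriteStatus = 'writable'
  writableNeedsDrain = false
  sendingData = false
  writeBuffer = new Uint8Array([1, 2, 3])

  // mirrors the unguarded listener: only checks writableNeedsDrain
  continueSendingOnDrain (): void {
    if (this.writableNeedsDrain) {
      return
    }
    this.processSendQueue()
  }

  processSendQueue (): void {
    if (this.writeBuffer.byteLength === 0 || this.sendingData) {
      return
    }
    this.send(this.writeBuffer)
  }

  send (data: Uint8Array): void {
    if (this.writeStatus !== 'writable') {
      throw new StreamStateError('Cannot write to a stream that is closing')
    }
    // ... actually write data to the transport ...
  }
}

// Reproduce the race: teardown begins, then a late drain arrives
const stream = new MessageStream()
stream.writeStatus = 'closing'

try {
  stream.continueSendingOnDrain() // the late drain event's callback
} catch (err) {
  // In the real code path there is no try/catch here, so the
  // StreamStateError escapes as an uncaughtException
  console.log((err as Error).message) // "Cannot write to a stream that is closing"
}
```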
Impact
- On Lodestar mainnet: 6 uncaught exceptions in 7 days (3 distinct timestamps, occurring in bursts of 2-3 when a connection tears down)
- The error is logged as `[network] error: uncaughtException`: it reaches the Node.js `process.on('uncaughtException')` handler
- Currently non-fatal (Lodestar catches and logs it), but any `uncaughtException` is a process stability risk
- Affects any js-libp2p consumer using noise over TCP when connections tear down while the send queue is draining
Proposed Fix
Add a `writeStatus` guard in `processSendQueue()`, which is the single chokepoint for all send paths:
```typescript
// In abstract-message-stream.ts, processSendQueue(): add after the existing bail-out checks
protected processSendQueue (): boolean {
  // bail if the underlying transport is full
  if (this.writableNeedsDrain) { /* ... existing ... */ }

  // bail if there is no data to send
  if (this.writeBuffer.byteLength === 0) { /* ... existing ... */ }

  // bail if we are already sending data
  if (this.sendingData) { /* ... existing ... */ }

  // bail if the stream is no longer writable: a drain event can
  // arrive after writeStatus transitions to 'closing'/'closed'
  if (this.writeStatus !== 'writable') {
    this.log.trace('not processing send queue as stream is %s', this.writeStatus)
    return false
  }

  // ... rest of method unchanged
}
```

This is safe because:
- It's a single guard at the beginning of the method, consistent with the existing bail-out pattern
- Data remaining in `writeBuffer` when the stream closes will be discarded as part of the normal close/abort cleanup
- The `send()` method already enforces this invariant by throwing; this just converts the throw into a graceful no-op at the right level
Alternative (less preferred): Add a try/catch around the `sendData()` call in `processSendQueue()` to swallow `StreamStateError` during close. This would be more defensive but masks the root cause.
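A minimal sketch of the guard's effect, again using an illustrative stand-in class rather than the real js-libp2p internals: with the guard in place, a late drain on a closing stream becomes a logged no-op instead of a throw.

```typescript
type WriteStatus = 'writable' | 'closing' | 'closed'

class StreamStateError extends Error {}

// Illustrative stand-in with the proposed guard applied
class GuardedMessageStream {
  writeStatus: WriteStatus = 'writable'
  writableNeedsDrain = false
  sendingData = false
  writeBuffer = new Uint8Array([1, 2, 3])

  processSendQueue (): boolean {
    if (this.writableNeedsDrain) { return false }
    if (this.writeBuffer.byteLength === 0) { return false }
    if (this.sendingData) { return false }

    // proposed guard: bail gracefully instead of letting send() throw
    if (this.writeStatus !== 'writable') {
      console.log('not processing send queue as stream is %s', this.writeStatus)
      return false
    }

    this.send(this.writeBuffer)
    return true
  }

  send (data: Uint8Array): void {
    if (this.writeStatus !== 'writable') {
      throw new StreamStateError('Cannot write to a stream that is closing')
    }
    // ... actually write data to the transport ...
  }
}

const stream = new GuardedMessageStream()
stream.writeStatus = 'closing' // teardown has begun

// late drain: with the guard this is a no-op, not an uncaughtException
console.log(stream.processSendQueue()) // false
```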
Reproduction
This is a timing-dependent race that occurs during TCP connection teardown. It's most likely to manifest under load, when many connections are being established and torn down simultaneously. In our case, that was a mainnet Ethereum beacon node with ~200 peers.
The simplest conceptual reproduction:
- Establish a noise-over-TCP connection with mplex
- Send enough data to fill the write buffer (trigger `writableNeedsDrain`)
- Close/abort the connection while the buffer is still draining
- The OS delivers a drain event after the close has started → boom
Reported from Lodestar — an Ethereum consensus client built on js-libp2p.