SSE streaming hangs indefinitely (no timeout) + ESC cannot fully cancel (queue auto-restart) — root cause analysis with fix proposals #33949

@kolkov

Summary

These two bugs have been plaguing users for months (see #26224 — 28 comments, #6836 — 150+ reports), with no root cause analysis from the team. After yet another day of babysitting Claude Code and pressing ESC every few minutes to revive a hung agent, we decided to conduct our own deep investigation — reverse-engineering cli.js across 12 npm package versions and analyzing 1,571 session JSONL files containing 148,444 tool calls.

Here are the exact root causes and proposed fixes.

Claude Code hangs indefinitely when an SSE streaming connection silently dies. There is no client-side timeout or heartbeat detection, so the process waits forever for events that will never arrive. ESC partially works around this by aborting the dead connection, but the queue auto-restart mechanism (queue.length > 0 → n()) immediately starts the next queued prompt instead of returning control to the user.

Root cause identified in source code — two separate issues in cli.js:

  1. No streaming timeout: The messages.stream() call has no timeout. If the SSE connection dies silently (TCP half-open), the client waits forever.
  2. Queue auto-restart after abort: After ESC aborts a hung request, if (queue.length > 0) { n(); return; } immediately starts the next queued prompt. The user cannot fully cancel.

Environment

  • Claude Code: 2.1.74 (also confirmed on 2.1.50–2.1.73)
  • OS: Windows 10, Git Bash
  • Model: Opus 4.6
  • API: Anthropic direct (not Bedrock/Vertex)

Reproduction

  1. Start a Claude Code session
  2. Submit a prompt → agent starts processing
  3. Wait for a hang (0 tokens, timer running, no progress); this happens on ~10-15% of prompts
  4. Submit another prompt while hung → goes to queue
  5. Press ESC
  6. Expected: Cancel everything, return to
  7. Actual: Cancels the hung prompt, immediately starts the queued one

Frequency

Measured across 1,571 sessions using a custom JSONL analyzer tool:

Period     Versions        Orphan rate (lost tool calls)
Dec 2025   2.0.72–2.1.2    6–14%
Jan 2026   2.1.5–2.1.23    5–10%
Feb 2026   2.1.29–2.1.56   3–8%
Mar 2026   2.1.69–2.1.74   2.4–4%

The hang frequency has been increasing over time: hangs were rare in fall 2025 and now affect ~10-15% of prompts.

Source Code Analysis

Analyzed cli.js extracted from npm pack @anthropic-ai/claude-code across versions 2.0.72 through 2.1.74.

Issue 1: No streaming timeout

The API call at approximately offset 2,553,870 in cli.js (v2.1.74):

client.beta.messages.stream({...params}, options)

There is no timeout parameter, no keepalive check, and no heartbeat detection. The Anthropic streaming API sends periodic ping events, but the client does not monitor for their absence.

When the TCP connection silently dies (common on Windows, WiFi, VPN, or after laptop sleep), the Node.js HTTP client has no way to know the connection is dead. The AbortController signal is never triggered because no error event fires.

Evidence: Packet inspection by other reporters confirms the client is stuck waiting for SSE events that never arrive. Token count stays at 0. ESC + re-submit creates a new connection that works immediately.

Issue 2: Queue auto-restart prevents full cancellation

The main processing loop (offset ~11,400,559 in v2.1.74):

n = async () => {
  if (M) return;       // running guard
  M = true;
  // ... prepare input, call API, process response ...
}

After completion or abort, the finally block runs and is followed directly by the queue check (offset ~11,406,174):

finally {
  M = false;           // clear running guard
  W6.start();          // restart idle timer
}
if (c36()) {           // c36() = yY.length > 0 = queue not empty?
  n();                 // YES → immediately restart with queued message!
  return;              // without returning control to user!
}

Historical analysis of the npm packages confirms that this pattern has existed since v2.1.50 (as queue.length > 0) and was refactored to c36() in v2.1.74.

Issue 3: JSONL writer race condition (related)

The session writer class LZq (offset ~10,549,000) has a non-atomic insertMessageChain() that writes assistant (tool_use) and user (tool_result) messages one at a time in a loop:

async insertMessageChain(A, q, K, Y, z) {
  return this.trackWrite(async () => {
    // A is the message chain; H is the current message
    for (let H of A) {
      await this.appendEntry(H);  // each message written separately!
    }
  });
}

If the process is interrupted between writing tool_use and tool_result, the tool_use becomes orphaned. This is the root cause of issue #6836.

Proposed Fixes

Fix 1: Streaming timeout (critical)

Add a client-side timeout that aborts and retries if no SSE events are received within N seconds:

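// Pseudocode: `stream` and `abortController` belong to the in-flight request
const STREAM_IDLE_TIMEOUT_MS = 30_000;
let lastEventTime = Date.now();

// Any SSE event, including a ping, counts as proof of life
stream.on('event', () => { lastEventTime = Date.now(); });

const watchdog = setInterval(() => {
  if (Date.now() - lastEventTime > STREAM_IDLE_TIMEOUT_MS) {
    clearInterval(watchdog);
    abortController.abort();
    // retry with new connection
  }
}, 5_000);

// Stop the watchdog once the stream finishes normally or fails
stream.on('end', () => clearInterval(watchdog));
stream.on('error', () => clearInterval(watchdog));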

The Anthropic streaming API sends ping events periodically. Monitoring for these would detect stale connections without false positives.
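The same idea can also be expressed without a polling interval. The following is a minimal sketch, assuming the stream can be consumed as an async iterable of events; withIdleTimeout, StreamIdleTimeoutError, and the onTimeout callback are hypothetical names for illustration, not identifiers from cli.js or the SDK.

```javascript
// Hypothetical error type used to distinguish an idle timeout from other failures.
class StreamIdleTimeoutError extends Error {}

// Hypothetical helper: re-yields items from `source`, but throws if the gap
// before the next item exceeds `idleMs`.
async function* withIdleTimeout(source, idleMs, onTimeout = () => {}) {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      let timer;
      const timeout = new Promise((_, reject) => {
        timer = setTimeout(
          () => reject(new StreamIdleTimeoutError(`no event for ${idleMs} ms`)),
          idleMs,
        );
      });
      let result;
      try {
        // Whichever settles first wins: the next event or the idle timer.
        result = await Promise.race([iterator.next(), timeout]);
      } finally {
        clearTimeout(timer); // the timer is re-armed for every event
      }
      if (result.done) return;
      yield result.value;
    }
  } catch (err) {
    // Let the caller tear down the dead connection, e.g. abortController.abort().
    if (err instanceof StreamIdleTimeoutError) onTimeout();
    throw err;
  }
}
```

A caller would wrap the event stream in withIdleTimeout(events, 30_000, () => abortController.abort()) and treat StreamIdleTimeoutError as a retryable failure. The sketch deliberately does not await the stalled iterator during cleanup, because a read that never settles would block that cleanup too.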

Fix 2: ESC should clear the queue

When the user presses ESC during a hang, the queue should be cleared (or the user should be asked):

// After abort, before checking queue:
if (userInitiatedAbort && c36()) {
  // Option A: Clear queue entirely
  clearQueue();
  return; // back to prompt

  // Option B: Ask user
  // "You have N queued messages. Clear queue? (y/n)"
}
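To make Option A concrete, here is a small self-contained model of a prompt queue in which a user-initiated cancel also drops everything still queued. It is a sketch under assumptions: createPromptRunner, handlePrompt, submit, cancelAll, and pending are hypothetical names, and the real loop in cli.js is structured differently.

```javascript
// Hypothetical model of the processing loop; not the code from cli.js.
function createPromptRunner(handlePrompt) {
  const queue = [];
  let running = false;      // plays the role of the running guard `M`
  let controller = null;    // AbortController of the in-flight prompt
  let userAborted = false;  // set only by an explicit user cancel (ESC)

  async function drain() {
    if (running) return;
    running = true;
    try {
      while (queue.length > 0) {
        const prompt = queue.shift();
        controller = new AbortController();
        try {
          await handlePrompt(prompt, controller.signal);
        } catch (err) {
          // Swallow errors caused by our own abort; rethrow anything else.
          if (!controller.signal.aborted) throw err;
        }
        if (userAborted) {
          queue.length = 0; // Option A: drop everything still queued
          break;            // and hand control back to the user
        }
      }
    } finally {
      running = false;
      userAborted = false;
      controller = null;
    }
  }

  return {
    submit(prompt) { queue.push(prompt); return drain(); },
    cancelAll() {
      if (!running) { queue.length = 0; return; }
      userAborted = true;
      controller.abort();
    },
    pending: () => queue.length,
  };
}
```

The key design choice is that the loop distinguishes a user cancel from a normal completion: only the former clears the queue, so prompts queued behind a request that finishes normally still run.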

Fix 3: Atomic message chain writes

insertMessageChain() should serialize the entire chain as a single appendToFile() call:

async insertMessageChain(messages) {
  const serialized = messages.map(m => JSON.stringify(m)).join('\n') + '\n';
  await this.appendToFile(sessionFile, serialized);
}

Note: history.jsonl already uses proper-lockfile for file locking — the same approach should be applied to session JSONL files when multiple agents write concurrently.
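A runnable version of the single-write idea, using only Node's built-in fs module, might look like the sketch below. appendMessageChain is a hypothetical helper name, and the message objects are assumed to be plain JSON-serializable values.

```javascript
const fs = require('node:fs/promises');

// Hypothetical helper: serializes the whole chain to JSONL first, then
// appends it with one write call instead of one call per message.
async function appendMessageChain(sessionFile, messages) {
  if (messages.length === 0) return;
  const payload = messages.map((m) => JSON.stringify(m)).join('\n') + '\n';
  await fs.appendFile(sessionFile, payload, 'utf8');
}
```

A single append narrows the window in which an interruption can separate a tool_use from its tool_result, but it is not a guarantee of atomicity on every platform, which is why the locking approach mentioned above still matters for concurrent writers.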

Methodology

Analysis performed using:

  • ccdiag: Custom Go CLI tool that parses JSONL session files, detects orphaned tool calls, analyzes timing, and scans multiple sessions
  • Source analysis: cli.js extracted from npm packages across 12 versions (2.0.72 through 2.1.74), searched for queue/abort/streaming patterns
  • Session data: 1,571 sessions, 148,444 tool calls, 8,007 orphaned
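For readers who want to reproduce the orphan count without ccdiag, the detection logic can be sketched in a few lines. This is not the ccdiag implementation (which is written in Go); findOrphanedToolCalls is a hypothetical function, and the entry shape, with content blocks under message.content, is an assumption about the session JSONL layout. The tool_use and tool_result block fields follow the Anthropic Messages API.

```javascript
// Hypothetical detector: a tool call is "orphaned" when a tool_use block
// has no tool_result block that references its id.
// Assumed entry shape: { message: { content: [ ...content blocks ] } }
function findOrphanedToolCalls(jsonlText) {
  const called = new Map();   // tool_use id -> tool name
  const answered = new Set(); // ids referenced by some tool_result
  for (const line of jsonlText.split('\n')) {
    if (line.trim() === '') continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip truncated or corrupt lines
    }
    const blocks = entry?.message?.content;
    if (!Array.isArray(blocks)) continue;
    for (const block of blocks) {
      if (block?.type === 'tool_use') called.set(block.id, block.name);
      if (block?.type === 'tool_result') answered.add(block.tool_use_id);
    }
  }
  const orphaned = [...called.keys()].filter((id) => !answered.has(id));
  const rate = called.size > 0 ? orphaned.length / called.size : 0;
  return { total: called.size, orphaned, rate };
}
```

Applied to the totals reported above, 8,007 orphaned out of 148,444 tool calls corresponds to an overall orphan rate of about 5.4%.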
