Bug: watchdog non-streaming fallback is unreachable dead code (v2.1.84/v2.1.85) #39755
Description
Bug: watchdog non-streaming fallback appears to be unreachable dead code (v2.1.84/v2.1.85)
Environment
- Claude Code: v2.1.85 (also verified in v2.1.84)
- Analysis method: reverse-engineering minified cli.js via npm pack
Disclaimer
This analysis is based on reverse-engineering 12 MB of minified JavaScript. Variable names are obfuscated, control flow is compressed into single lines of 10,000–25,000 characters, and scoping has to be traced by counting brace depth at character offsets. We've done our best to reconstruct the logic accurately, but without access to the original source code, there may be nuances we're missing. If any part of this analysis is incorrect, we'd welcome corrections — ideally with a pointer to the relevant source.
Summary
The streaming idle watchdog (CLAUDE_ENABLE_STREAM_WATCHDOG=1) aborts hanging streams but appears to fail to trigger the non-streaming fallback. The fallback code exists and has telemetry (fallback_cause: "watchdog") — but based on our tracing, it's unreachable due to an early throw in the error handling chain.
Users see a generic "Request timed out" error instead of a transparent retry via the non-streaming path.
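For readers unfamiliar with the mechanism, a stream idle watchdog can be sketched roughly as follows. This is a minimal illustration, not code recovered from cli.js: `createIdleWatchdog`, `arm`, `disarm`, `wasFired` and the `timers` parameter are hypothetical names, and the 1000 ms idle limit is arbitrary. It assumes the watchdog aborts a controller that is separate from the user's cancel signal, which is what the `signal.aborted` check in the pseudocode under "Root cause" suggests.

```javascript
// Hypothetical sketch of a stream idle watchdog; none of these names come from cli.js.
// `timers` is injectable so the behaviour can be checked without real delays.
function createIdleWatchdog(streamController, idleMs, timers = globalThis) {
  let handle = null;
  let fired = false;

  // Cancel any pending idle timer.
  function disarm() {
    if (handle !== null) {
      timers.clearTimeout(handle);
      handle = null;
    }
  }

  // (Re)start the idle timer; call once at stream start and again on every chunk.
  function arm() {
    disarm();
    handle = timers.setTimeout(() => {
      fired = true;             // remember that the watchdog caused the abort
      streamController.abort(); // abort the stream, not the user's cancel signal
    }, idleMs);
  }

  return { arm, disarm, wasFired: () => fired };
}

// Usage: one controller per streaming request.
const streamController = new AbortController();
const watchdog = createIdleWatchdog(streamController, 1000);
watchdog.arm();    // stream started
watchdog.arm();    // a chunk arrived, so the idle timer restarts
watchdog.disarm(); // stream finished normally
```

Keeping a separate `fired` flag matters for the rest of this report: after an abort, the error handler needs it to tell a watchdog abort apart from any other abort.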
Root cause
In the inner catch block of the streaming event loop (v2.1.85, line 7682, char offset ~8979):
catch (O6) {
  clearTimers();
  if (watchdogFired) { /* log telemetry */ }
  if (O6 instanceof AbortError) {                   // ← watchdog calls AbortController.abort(), which creates an AbortError
    if (signal.aborted)
      throw O6;                                     // user ESC → cancel (correct)
    else
      throw new TimeoutError("Request timed out");  // ← THROWN TO OUTER CATCH!
  }
  // ⚠️ UNREACHABLE for watchdog abort: both paths above throw
  if (DISABLE_NONSTREAMING_FALLBACK) throw ...;
  // Non-streaming fallback (DEAD CODE for watchdog):
  log("falling back to non-streaming mode");
  fallbackFlag = true;
  telemetry("tengu_streaming_fallback_to_non_streaming", {
    fallback_cause: watchdogFired ? "watchdog" : "other"  // ← never reached
  });
  yield* nonStreamingRequest(...);                  // ← never called
}
The outer catch doesn't know about the watchdog. It treats TimeoutError as a generic API failure, yields an error message to the UI, and returns.
Expected behavior
Watchdog fires → abort stream → detect watchdog (not user ESC) → fall through to non-streaming fallback → user gets response transparently.
Actual behavior
Watchdog fires → abort stream → throw TimeoutError → outer catch → "Request timed out" → done. No retry. No fallback.
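The actual behavior can be reproduced in a small standalone model. This is a sketch built from our reading of the minified code, not the real implementation: `AbortError` and `TimeoutError` are hypothetical stand-ins for the bundle's error classes, `currentInnerCatch` and `outcome` are illustrative names, and `userAborted` plays the role of `signal.aborted`. The `DISABLE_NONSTREAMING_FALLBACK` check is left out because what it throws is elided in our trace.

```javascript
// Hypothetical stand-ins for the error classes seen in the minified bundle.
class AbortError extends Error {}
class TimeoutError extends Error {}

// Models the inner catch as traced in this issue. Returns the fallback cause
// if the non-streaming path would run; otherwise throws like the original.
function currentInnerCatch(err, { userAborted, watchdogFired }) {
  if (err instanceof AbortError) {
    if (userAborted) throw err;                  // user ESC: cancel
    throw new TimeoutError("Request timed out"); // every other abort, watchdog included
  }
  return { fallbackCause: watchdogFired ? "watchdog" : "other" };
}

// Reports either the fallback cause or the class name of the thrown error.
function outcome(err, state) {
  try {
    return currentInnerCatch(err, state).fallbackCause;
  } catch (e) {
    return e.constructor.name;
  }
}

console.log(outcome(new AbortError("aborted"), { userAborted: false, watchdogFired: true }));
// prints "TimeoutError": the watchdog abort never reaches the fallback
console.log(outcome(new AbortError("aborted"), { userAborted: true, watchdogFired: false }));
// prints "AbortError": user cancel
console.log(outcome(new Error("stream broke"), { userAborted: false, watchdogFired: false }));
// prints "other": only non-abort errors reach the fallback
```

In this model, as long as a watchdog abort always surfaces as an `AbortError`, no input produces the `"watchdog"` fallback cause.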
Suggested fix
Don't throw on watchdog-triggered AbortError — let it fall through to the existing fallback code:
catch (O6) {
  clearTimers();
  if (O6 instanceof AbortError) {
    if (signal.aborted) throw O6;        // user ESC → cancel
    if (!watchdogFired) {                // unknown SDK abort
      throw new TimeoutError("Request timed out");
    }
    // watchdog abort → fall through to non-streaming fallback below
  }
  if (DISABLE_NONSTREAMING_FALLBACK) throw ...;
  // ... existing fallback code works as intended ...
}
Evidence: the fallback code was intentionally written for watchdog
The telemetry in the unreachable fallback path explicitly checks the watchdog flag:
fallback_cause: watchdogFired ? "watchdog" : "other"
The fallback was clearly intended for watchdog scenarios, but the AbortError instanceof check above it was likely added (or refactored) later without considering this interaction. This is the kind of subtle control-flow regression that's easy to miss in a 12 MB single-file codebase, especially when the code is being generated or refactored at scale.
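To check that the suggested fix would make the `"watchdog"` telemetry value reachable, here is a runnable model of the patched handler. It is a sketch under assumptions, not a patch against the real source: `fixedInnerCatch`, the stub `nonStreamingRequest`, the `telemetry` array and the error classes are hypothetical stand-ins, `userAborted` plays the role of `signal.aborted`, and the `DISABLE_NONSTREAMING_FALLBACK` branch is omitted because its thrown value is elided in our trace.

```javascript
// Hypothetical stand-ins for the error classes seen in the minified bundle.
class AbortError extends Error {}
class TimeoutError extends Error {}

// Hypothetical stub for the non-streaming request generator.
function* nonStreamingRequest() {
  yield "response via non-streaming path";
}

// Models the inner catch with the suggested fix applied.
function* fixedInnerCatch(err, { userAborted, watchdogFired }, telemetry) {
  if (err instanceof AbortError) {
    if (userAborted) throw err;                                     // user ESC: cancel
    if (!watchdogFired) throw new TimeoutError("Request timed out"); // unknown SDK abort
    // watchdog abort: fall through to the fallback below
  }
  telemetry.push({
    event: "tengu_streaming_fallback_to_non_streaming",
    fallback_cause: watchdogFired ? "watchdog" : "other",
  });
  yield* nonStreamingRequest();
}

const telemetry = [];
const chunks = [
  ...fixedInnerCatch(
    new AbortError("aborted"),
    { userAborted: false, watchdogFired: true },
    telemetry
  ),
];
console.log(chunks[0]);                   // prints "response via non-streaming path"
console.log(telemetry[0].fallback_cause); // prints "watchdog"
```

User cancels and aborts of unknown origin still throw exactly as before; only the watchdog case changes.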
Impact
- The watchdog feature (added ~v2.1.50, configurable since v2.1.84) is fundamentally broken: it aborts hanging streams but doesn't recover
- Users who enable CLAUDE_ENABLE_STREAM_WATCHDOG=1 get "Request timed out" errors instead of the intended transparent retry
- This may be the reason the watchdog is disabled by default: it appears non-functional in testing because the fallback doesn't work, but the root cause is this unreachable code path, not a design problem with the watchdog itself
Request for source access
We've been reverse-engineering cli.js across 11 versions (v2.1.74–v2.1.85) by grepping through 12 MB of minified code and counting brace depth to trace scoping. We've found multiple issues this way — the streaming hang root cause (#33949), JSONL writer race conditions (#31328), and now this fallback bug — but the process is extremely slow. Tracing a single code path (like the one in this issue) takes hours of node -e scripts and manual character-offset arithmetic.
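To give a sense of what "counting brace depth at character offsets" involves, here is a simplified sketch of that kind of helper. It is an illustration only, not one of the scripts we actually ran: `braceDepthAt` is a hypothetical name, and the function deliberately ignores regex literals, comments, and braces inside template-literal interpolation, all of which a real tracer has to handle.

```javascript
// Naive brace-depth counter for minified JavaScript (illustrative only).
// Skips quoted strings, but does NOT understand regex literals, comments,
// or ${...} interpolation inside template literals.
function braceDepthAt(source, offset) {
  let depth = 0;
  let quote = null; // the quote character of the string we are inside, if any
  for (let i = 0; i < offset && i < source.length; i++) {
    const ch = source[i];
    if (quote !== null) {
      if (ch === "\\") i++;                 // skip the escaped character
      else if (ch === quote) quote = null;  // string closed
    } else if (ch === '"' || ch === "'" || ch === "`") {
      quote = ch;                           // string opened
    } else if (ch === "{") {
      depth++;
    } else if (ch === "}") {
      depth--;
    }
  }
  return depth;
}

const sample = 'try{for(;;){a("}")}}catch(O6){b()}';
console.log(braceDepthAt(sample, sample.indexOf("b()"))); // prints 1
```

Comparing the depth at two offsets, for example at a `throw` and at a later fallback call, shows whether both sit in the same block. Doing this by hand across lines of 10,000+ characters is where the hours go.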
With access to the original source code, we could verify findings like this in minutes instead of hours, and catch bugs we're currently missing because minification obscures the control flow. Given the complexity of issues the community is hitting (#6836: 150+ orphaned tool reports, #26224: agent hangs, #30137/#32870: system deadlocks), having even one community researcher with source access would meaningfully accelerate debugging.
Our track record:
- github.com/kolkov — open source maintainer, 35+ public repos
- dev.to/kolkov — technical articles on developer tooling
- 11 versions of cli.js reverse-engineered with documented methodology
- Root cause analysis for streaming hangs (SSE streaming hangs indefinitely (no timeout) + ESC cannot fully cancel (queue auto-restart) — root cause analysis with fix proposals #33949, 👍12, 21 comments)
- Bun runtime crash analysis (Auto-updater memory leak crashed 12-hour session (v2.1.76) — 13.81 GB committed, Bun panic #35171, Bun runtime unsuitable for long sessions: mimalloc crashes after 12-24h, ~1GB/h memory growth, system-wide stuttering #36132)
We're happy to work under NDA, read-only access, or whatever arrangement makes sense. The goal is the same — making Claude Code more reliable for everyone.
Why open-sourcing Claude Code makes business sense in 2026
Keeping cli.js closed-source may have made sense in early 2025 when Claude Code launched and had first-mover advantage. But in 2026, with Cursor, Codex, Windsurf, Aider, and dozens of open-source alternatives — the secrecy provides no competitive advantage while actively harming product quality.
Consider the facts:
- Anthropic's revenue comes from model API access, not from selling Claude Code as software. The CLI is a funnel to the API — the more reliable it is, the more tokens users consume.
- The "secret" is already out. The entire architecture is recoverable from the minified source: we've mapped the streaming pipeline, error classes, retry logic, telemetry events, and env vars across 11 versions. Anyone with npm pack and a weekend can do the same. It's security through obscurity, and it's not working.
- Bugs like this one sit undiscovered for months because the community can't effectively review 12 MB of minified code. This specific dead-code bug means the watchdog feature (5+ months in the codebase) has never worked as intended. With readable source, someone would have caught this in a PR review.
- The community is already doing the work. #33949 ("SSE streaming hangs indefinitely (no timeout) + ESC cannot fully cancel (queue auto-restart) — root cause analysis with fix proposals") has root cause analysis from reverse engineering. #31328 ("Session becomes unresumable after JSONL writer drops assistant entry during parallel tool calls") identified JSONL race conditions. @yichao-mt decompiled the watchdog timer. @VRDate submitted PR #35710 ("fix(critical): Add tool-mutex plugin to prevent Wof.sys BSOD caused by parallel fs enumeration") for tool mutex. We're all working blind; give us the source and we'll find bugs 10x faster.
- Open source would accelerate, not threaten. Recreating a CLI wrapper around the Anthropic API is straightforward — the hard part (the models) stays proprietary. What open source gives you is a community that catches regressions, proposes fixes, and builds trust. The current trajectory — 150+ unresolved bug reports, zero team responses, community threatening to leave for Codex — is far more dangerous to the business than open-sourcing a CLI tool.
We're not asking for model weights or internal infrastructure. Just the TypeScript source for a CLI tool that wraps your public API. The ROI is obvious: faster bug discovery, community PRs, and users who feel invested in the product rather than frustrated by it.
CC: @bcherny @ant-kurt @fvolcic @ashwin-ant @bogini @OctavianGuzu @hackyon-anthropic @chrislloyd @ThariqS @catherinewu @whyuan-cc @dhollman @rboyce-ant @dicksontsai @wolffiex @ddworken @km-anthropic — open to discussing any of this privately or publicly.