fix(web): prevent browser crash from timer leaks, DOM growth, SSE buffer (#2406)#2441
Conversation
…fer (#2406) Extended sessions with heavy bot interactions caused Chrome's "Pages Unresponsive" dialog due to accumulated browser resources that were never cleaned up. Fixes: - Add cleanupConnectionState() to clear leaked setInterval/setTimeout timers on SSE reconnect, tab visibility change, and page unload - Cap DOM at 200 message nodes via pruneOldMessages() with streaming- aware pruning (skips data-streaming elements, called at turn boundaries and after loadHistory) - Cap jobEvents Map at 50 entries with LRU eviction (excludes current job from eviction scan) - Increase SSE broadcast buffer from 256 to 1024 (configurable via SSE_BROADCAST_BUFFER env var, with zero-guard to prevent panic) Closes #2406 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… fix E2E timer test Move SSE_BROADCAST_BUFFER env var from direct std::env::var() in sse.rs to GatewayConfig in config/channels.rs, following the convention that all gateway env vars flow through structured config. Add MAX_BROADCAST_BUFFER (65,536) clamp to prevent OOM from misconfiguration. Fix E2E timer leak test to install setInterval monkey-patch via page.add_init_script() before navigation so initialization timers are tracked. Add test_dom_resource_limits.py to E2E CLAUDE.md scenario table. Add unit test for buffer config parsing, zero-rejection, and clamp. [skip-regression-check] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gs, use constants
- Add gatewayStatusInterval to cleanupConnectionState() so it is cleared
on reconnect/tab-hide/unload; add guard in startGatewayStatusPolling()
to prevent double-start; restart polling on tab visibility restore
- Call pruneOldMessages() after addMessage('user', ...) in sendMessage()
so DOM stays bounded even during rapid user input
- Replace hardcoded broadcast_buffer: 1024 with DEFAULT_BROADCAST_BUFFER
in all test construction sites (5 occurrences across 4 files)
- Document in from_sender() doc comment why broadcast_buffer is absent
- Tighten E2E timer leak assertion from baseline+1 to baseline
[skip-regression-check]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
Pull request overview
Addresses Web UI stability issues from #2406 by bounding client-side resource usage (timers, DOM growth) and improving SSE resilience via a configurable broadcast buffer.
Changes:
- Add connection-state cleanup and DOM pruning in the Web UI to prevent timer leaks and unbounded message-node growth.
- Make SSE broadcast buffer configurable (with sensible defaults and clamping) and plumb the new config through gateway initialization and tests.
- Add E2E coverage for DOM pruning, streaming-message preservation, and interval leak regression.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/scenarios/test_dom_resource_limits.py | New E2E scenarios covering DOM pruning, reconnect interval leak detection, and streaming preservation. |
| tests/e2e/CLAUDE.md | Documents the new E2E scenario file. |
| crates/ironclaw_gateway/static/app.js | Adds cleanupConnectionState(), pruneOldMessages(), jobEvents LRU cap, and polling guard to reduce resource leaks. |
| src/config/channels.rs | Introduces broadcast_buffer in GatewayConfig, env override via SSE_BROADCAST_BUFFER, clamping, and tests. |
| src/channels/web/sse.rs | Adds DEFAULT_BROADCAST_BUFFER and a new SseManager constructor supporting configurable buffer size. |
| src/channels/web/mod.rs | Wires GatewayConfig.broadcast_buffer into SseManager creation. |
| src/tunnel/mod.rs | Updates tunnel tests to provide the new GatewayConfig.broadcast_buffer field. |
| src/channels/web/tests/rebuild_state_preserves_fields.rs | Updates gateway test config to include broadcast_buffer. |
| src/channels/web/tests/no_silent_drop.rs | Updates gateway test config to include broadcast_buffer. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
henrypark133
left a comment
There was a problem hiding this comment.
Review: Timer/DOM/SSE resource leak fixes (Risk: medium)
Well-structured fix addressing real browser resource exhaustion. The centralized cleanupConnectionState(), DOM cap via pruneOldMessages(), configurable SSE buffer, and job-map eviction are all sound approaches. E2E test coverage is strong — the setInterval monkey-patch via add_init_script is well done.
Positives:
- Timer cleanup centralized in one function with clear ownership comment explaining why
_doneWithoutResponseTimeris excluded startGatewayStatusPolling()idempotency guard prevents the interval accumulation bug- SSE buffer config follows existing patterns: env var → clamped config → module factory
data-streamingguard in pruning prevents mid-stream breakage
Concerning: pruneOldMessages() in pagination path evicts just-loaded history [Logic]
File: crates/ironclaw_gateway/static/app.js:3037-3044
When the user scrolls up, loadHistory(before) prepends older messages at container.firstChild (line 3037). Then pruneOldMessages() runs (line 3044) and removes elements from the front of the container — exactly the items just prepended. The user scrolls up to see history and it's immediately pruned. The fresh-load path (lines 2978-2991) is fine since it clears the container first.
Suggested fix: Skip pruneOldMessages() when isPaginating is true. Pruning on sendMessage() and assistant-response receipt is sufficient to bound live-session DOM growth; history pagination loads a bounded batch (50 turns) that doesn't need immediate pruning.
Convention notes:
src/channels/web/CLAUDE.md:224says "Broadcast buffer: 256 events" — should be updated to reflect the new 1024 default andSSE_BROADCAST_BUFFERenv varMAX_DOM_MESSAGES(line 1859) is defined near its function rather than with other constants at lines 92-93 — minor style inconsistency_loadThreadsTimeris a similar debounce timeout tojobListRefreshTimerbut is not cleared incleanupConnectionState()— benign but asymmetric
- Remove pruneOldMessages() from loadHistory() pagination path to avoid immediately evicting just-prepended older messages - Move MAX_DOM_MESSAGES constant to top-level constants block - Add _loadThreadsTimer to cleanupConnectionState() for consistency - Add assert!(broadcast_buffer > 0) to SseManager constructor with panic doc (tokio broadcast channel requires capacity > 0) - Use Set-based interval tracking in E2E test to prevent counter underflow from double-clear - Update CLAUDE.md broadcast buffer docs (256 → 1024, SSE_BROADCAST_BUFFER) [skip-regression-check] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
All items from your review addressed in a96c62a and this follow-up push:
|
Replace assert!(broadcast_buffer > 0) with a doc comment noting the precondition. GatewayConfig already rejects 0 at the config layer. [skip-regression-check] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ReviewSound scope and clean Rust side. Request changes on a few correctness issues, and — importantly — the tests here are unit/local only. Please add Playwright coverage under Correctness
v2 coverageAll changes are in the web channel / SSE layer; not engine-specific. No v2-vs-v1 split here. ✅ Tests — please add e2e
Rust side
No |
serrrfirat
left a comment
There was a problem hiding this comment.
Paranoid Security Review — APPROVE
PR: #2441 — Prevent browser crash from timer leaks, DOM growth, SSE buffer
Reviewer context: Automated deep security review
Findings
Net security improvement. No new attack surfaces introduced.
- No XSS vectors:
pruneOldMessages()only calls.remove()on existing DOM nodes. NoinnerHTML, no dynamic content creation. ExistingDOMPurify.sanitize()pipeline unchanged. - Timer DoS surface reduced:
cleanupConnectionState()clears five connection-scoped timers on every reconnect path.startGatewayStatusPolling()guard prevents double-interval accumulation. - All exit paths covered: Cleanup on
beforeunload,visibilitychange(hidden), andconnectSSE()reconnection. - DOM growth bounded: 200-message cap with streaming-node preservation (
data-streaming="true"skip). - Job events bounded: 50-entry LRU cap. O(50) eviction — negligible.
- SSE buffer capped: Configurable up to 65,536. Only server operators can set env vars — no external DoS vector.
- No new
.unwrap()/.expect()in production Rust code. Config validation preventsbroadcast::channel(0)panic. - E2E tests well-structured: Timer leak test uses
add_init_script()monkey-patching before page load.
Optional follow-up
Consider calling pruneOldMessages() after loadHistory() to cover the history-scroll-back growth path.
LGTM.
serrrfirat
left a comment
There was a problem hiding this comment.
lets add playwright tests here
…#2406) Correctness: - pruneOldMessages: clean up orphaned leading time-separators after pruning - jobEvents LRU: replace O(n) scan with O(1) Map insertion-order eviction - Document degenerate all-streaming under-prune case Playwright e2e tests: - Tab hide/restore: no duplicate gateway status polling intervals - DOM cap + streaming: 260 elements prune to ≤200, streaming preserved, no orphan separators - jobEvents bounded: 60 jobs stay capped at ≤50 via LRU eviction - Fix assertion selector to match pruneOldMessages superset, tighten lower bound Rust: - Unit test: SseManager buffer size parameter actually controls lag behavior - Document MAX_BROADCAST_BUFFER memory impact (65K×100×200B ≈ 1.3 GB) - Move "capacity baked into tx" comment from from_sender to rebuild_state Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
) The O(1) LRU eviction skips the job that just received an event (moved to end via delete+set), but did not protect the job the user is actively viewing in the detail panel (currentJobId). If the user views a quiet job while 50+ other jobs fire events, the viewed job's events would be evicted and the activity tab would appear empty. Add a currentJobId guard to the eviction loop and a Playwright e2e test that verifies the actively-viewed job survives LRU pressure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
) The O(1) LRU eviction skips the job that just received an event (moved to end via delete+set), but did not protect the job the user is actively viewing in the detail panel (currentJobId). If the user views a quiet job while 50+ other jobs fire events, the viewed job's events would be evicted and the activity tab would appear empty. Add a currentJobId guard to the eviction loop and a Playwright e2e test that verifies the actively-viewed job survives LRU pressure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
cleanupConnectionState()clears all connection-scoped timers (includinggatewayStatusInterval) on SSE reconnect, tab hide, and page unload. Guard added tostartGatewayStatusPolling()to prevent double-start; polling restarts on tab restore.pruneOldMessages()— called after assistant responses, user messages, and history loads. Streaming messages (data-streaming="true") are preserved.SSE_BROADCAST_BUFFERenv var inGatewayConfig, clamped to 65,536 max, zero rejected).gatewayStatusIntervalincleanupConnectionState(), prune after user messages, replace hardcoded1024withDEFAULT_BROADCAST_BUFFERin tests, documentfrom_sender()buffer invariant, tighten E2E assertion.Closes #2406
Test plan
cargo fmt— cleancargo clippy --all --benches --tests --examples --all-features— zero warningscargo test— all passpytest tests/e2e/scenarios/test_dom_resource_limits.py— DOM pruning, timer leak, streaming preservation🤖 Generated with Claude Code