
fix(web): prevent browser crash from timer leaks, DOM growth, SSE buffer (#2406)#2441

Merged
henrypark133 merged 8 commits into staging from worktree-fix-2406-web-ui-crash
Apr 14, 2026
Conversation

@henrypark133
Collaborator

Summary

  • Timer leak fix: cleanupConnectionState() clears all connection-scoped timers (including gatewayStatusInterval) on SSE reconnect, tab hide, and page unload. Guard added to startGatewayStatusPolling() to prevent double-start; polling restarts on tab restore.
  • DOM pruning: Cap DOM at 200 message nodes via pruneOldMessages() — called after assistant responses, user messages, and history loads. Streaming messages (data-streaming="true") are preserved.
  • jobEvents LRU: Cap tracked jobs at 50 with LRU eviction (current job excluded).
  • SSE buffer: Increase broadcast buffer from 256 to 1024 (configurable via SSE_BROADCAST_BUFFER env var in GatewayConfig, clamped to 65,536 max, zero rejected).
  • Review fixes (3rd commit): clean gatewayStatusInterval in cleanupConnectionState(), prune after user messages, replace hardcoded 1024 with DEFAULT_BROADCAST_BUFFER in tests, document from_sender() buffer invariant, tighten E2E assertion.
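The timer-leak fix in the first bullet can be sketched as follows. This is a minimal, DOM-free model of the pattern — `gatewayStatusInterval`, `reconnectTimer`, and `pollingActive` are illustrative names standing in for the real app.js state, not the exact identifiers:

```javascript
// Connection-scoped timer state (illustrative names).
let gatewayStatusInterval = null;
let reconnectTimer = null;
let pollingActive = false;

function startGatewayStatusPolling(pollFn, intervalMs) {
  if (pollingActive) return;                // guard: prevent double-start
  pollingActive = true;
  gatewayStatusInterval = setInterval(pollFn, intervalMs);
}

// Called on SSE reconnect, tab hide, and page unload.
function cleanupConnectionState() {
  if (gatewayStatusInterval !== null) {
    clearInterval(gatewayStatusInterval);
    gatewayStatusInterval = null;
  }
  if (reconnectTimer !== null) {
    clearTimeout(reconnectTimer);
    reconnectTimer = null;
  }
  pollingActive = false;                    // allow restart on tab restore
}
```

Because every start path goes through the guard and every exit path goes through the cleanup, a reconnect loop cannot accumulate intervals.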

Closes #2406

Test plan

  • cargo fmt — clean
  • cargo clippy --all --benches --tests --examples --all-features — zero warnings
  • cargo test — all pass
  • pytest tests/e2e/scenarios/test_dom_resource_limits.py — DOM pruning, timer leak, streaming preservation
  • Manual: open web UI, send 200+ messages, verify DOM stays bounded and no "Pages Unresponsive" dialog

🤖 Generated with Claude Code

henrypark133 and others added 3 commits April 13, 2026 22:11
…fer (#2406)

Extended sessions with heavy bot interactions caused Chrome's "Pages
Unresponsive" dialog due to accumulated browser resources that were
never cleaned up.

Fixes:
- Add cleanupConnectionState() to clear leaked setInterval/setTimeout
  timers on SSE reconnect, tab visibility change, and page unload
- Cap DOM at 200 message nodes via pruneOldMessages() with streaming-
  aware pruning (skips data-streaming elements, called at turn
  boundaries and after loadHistory)
- Cap jobEvents Map at 50 entries with LRU eviction (excludes current
  job from eviction scan)
- Increase SSE broadcast buffer from 256 to 1024 (configurable via
  SSE_BROADCAST_BUFFER env var, with zero-guard to prevent panic)

Closes #2406

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… fix E2E timer test

Move SSE_BROADCAST_BUFFER env var from direct std::env::var() in sse.rs
to GatewayConfig in config/channels.rs, following the convention that all
gateway env vars flow through structured config. Add MAX_BROADCAST_BUFFER
(65,536) clamp to prevent OOM from misconfiguration.

Fix E2E timer leak test to install setInterval monkey-patch via
page.add_init_script() before navigation so initialization timers are
tracked. Add test_dom_resource_limits.py to E2E CLAUDE.md scenario table.

Add unit test for buffer config parsing, zero-rejection, and clamp.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gs, use constants

- Add gatewayStatusInterval to cleanupConnectionState() so it is cleared
  on reconnect/tab-hide/unload; add guard in startGatewayStatusPolling()
  to prevent double-start; restart polling on tab visibility restore
- Call pruneOldMessages() after addMessage('user', ...) in sendMessage()
  so DOM stays bounded even during rapid user input
- Replace hardcoded broadcast_buffer: 1024 with DEFAULT_BROADCAST_BUFFER
  in all test construction sites (5 occurrences across 4 files)
- Document in from_sender() doc comment why broadcast_buffer is absent
- Tighten E2E timer leak assertion from baseline+1 to baseline

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 14, 2026 05:12
@github-actions github-actions bot added labels Apr 14, 2026: scope: channel/web (Web gateway channel), scope: docs (Documentation), size: L (200-499 changed lines), risk: medium (business logic, config, or moderate-risk modules), contributor: core (20+ merged PRs)
Contributor

Copilot AI left a comment


Pull request overview

Addresses Web UI stability issues from #2406 by bounding client-side resource usage (timers, DOM growth) and improving SSE resilience via a configurable broadcast buffer.

Changes:

  • Add connection-state cleanup and DOM pruning in the Web UI to prevent timer leaks and unbounded message-node growth.
  • Make SSE broadcast buffer configurable (with sensible defaults and clamping) and plumb the new config through gateway initialization and tests.
  • Add E2E coverage for DOM pruning, streaming-message preservation, and interval leak regression.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • tests/e2e/scenarios/test_dom_resource_limits.py — New E2E scenarios covering DOM pruning, reconnect interval leak detection, and streaming preservation.
  • tests/e2e/CLAUDE.md — Documents the new E2E scenario file.
  • crates/ironclaw_gateway/static/app.js — Adds cleanupConnectionState(), pruneOldMessages(), jobEvents LRU cap, and polling guard to reduce resource leaks.
  • src/config/channels.rs — Introduces broadcast_buffer in GatewayConfig, env override via SSE_BROADCAST_BUFFER, clamping, and tests.
  • src/channels/web/sse.rs — Adds DEFAULT_BROADCAST_BUFFER and a new SseManager constructor supporting configurable buffer size.
  • src/channels/web/mod.rs — Wires GatewayConfig.broadcast_buffer into SseManager creation.
  • src/tunnel/mod.rs — Updates tunnel tests to provide the new GatewayConfig.broadcast_buffer field.
  • src/channels/web/tests/rebuild_state_preserves_fields.rs — Updates gateway test config to include broadcast_buffer.
  • src/channels/web/tests/no_silent_drop.rs — Updates gateway test config to include broadcast_buffer.


Collaborator Author

@henrypark133 left a comment


Review: Timer/DOM/SSE resource leak fixes (Risk: medium)

Well-structured fix addressing real browser resource exhaustion. The centralized cleanupConnectionState(), DOM cap via pruneOldMessages(), configurable SSE buffer, and job-map eviction are all sound approaches. E2E test coverage is strong — the setInterval monkey-patch via add_init_script is well done.

Positives:

  • Timer cleanup centralized in one function with clear ownership comment explaining why _doneWithoutResponseTimer is excluded
  • startGatewayStatusPolling() idempotency guard prevents the interval accumulation bug
  • SSE buffer config follows existing patterns: env var → clamped config → module factory
  • data-streaming guard in pruning prevents mid-stream breakage

Concerning: pruneOldMessages() in pagination path evicts just-loaded history [Logic]

File: crates/ironclaw_gateway/static/app.js:3037-3044
When the user scrolls up, loadHistory(before) prepends older messages at container.firstChild (line 3037). Then pruneOldMessages() runs (line 3044) and removes elements from the front of the container — exactly the items just prepended. The user scrolls up to see history and it's immediately pruned. The fresh-load path (lines 2978-2991) is fine since it clears the container first.
Suggested fix: Skip pruneOldMessages() when isPaginating is true. Pruning on sendMessage() and assistant-response receipt is sufficient to bound live-session DOM growth; history pagination loads a bounded batch (50 turns) that doesn't need immediate pruning.
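A minimal sketch of the suggested guard — `finishHistoryLoad`, `isPaginating`, and `prune` are stand-ins for the real app.js state and pruneOldMessages(), used here only to make the control flow concrete:

```javascript
// Run at the end of loadHistory(): prune only on fresh loads.
function finishHistoryLoad(isPaginating, prune) {
  // Pagination prepends a bounded batch of older messages; pruning
  // here would evict exactly what was just loaded, so skip it.
  if (isPaginating) return false;
  prune();  // fresh load: the container was cleared first, safe to prune
  return true;
}
```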

Convention notes:

  • src/channels/web/CLAUDE.md:224 says "Broadcast buffer: 256 events" — should be updated to reflect the new 1024 default and SSE_BROADCAST_BUFFER env var
  • MAX_DOM_MESSAGES (line 1859) is defined near its function rather than with other constants at lines 92-93 — minor style inconsistency
  • _loadThreadsTimer is a similar debounce timeout to jobListRefreshTimer but is not cleared in cleanupConnectionState() — benign but asymmetric

- Remove pruneOldMessages() from loadHistory() pagination path to avoid
  immediately evicting just-prepended older messages
- Move MAX_DOM_MESSAGES constant to top-level constants block
- Add _loadThreadsTimer to cleanupConnectionState() for consistency
- Add assert!(broadcast_buffer > 0) to SseManager constructor with
  panic doc (tokio broadcast channel requires capacity > 0)
- Use Set-based interval tracking in E2E test to prevent counter
  underflow from double-clear
- Update CLAUDE.md broadcast buffer docs (256 → 1024, SSE_BROADCAST_BUFFER)

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@henrypark133
Collaborator Author

All items from your review addressed in a96c62a and this follow-up push:

  1. pruneOldMessages() in pagination path — Removed from loadHistory(). Pruning on sendMessage() and assistant-response receipt is sufficient.
  2. CLAUDE.md broadcast buffer docs — Updated from '256 events' to 'SSE_BROADCAST_BUFFER env var (default 1024, clamped to 65,536 max)'.
  3. MAX_DOM_MESSAGES placement — Moved to top-level constants block with JOB_EVENTS_CAP etc.
  4. _loadThreadsTimer — Added to cleanupConnectionState() for consistency.
  5. assert!() CI failure — Removed the assert (no-panics-in-production check), replaced with doc comment noting the precondition.

Replace assert!(broadcast_buffer > 0) with a doc comment noting the
precondition. GatewayConfig already rejects 0 at the config layer.

[skip-regression-check]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 14, 2026 05:49
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.



@ilblackdragon
Member

Review

Sound scope and clean Rust side. Request changes on a few correctness issues, and — importantly — the tests here are unit/local only. Please add Playwright coverage under tests/e2e/ so we exercise the real browser behavior (timer cleanup on tab hide / SSE reconnect, DOM cap under sustained streaming, LRU eviction of jobEvents) end-to-end, not just via JSDOM-style unit asserts.

Correctness

  • web/static/app.js:1867 — pruneOldMessages iterates .message, .activity-group, .time-separator. Pruning by flat index can remove the message following a .time-separator while leaving the separator orphaned. Over long sessions this is a (smaller) unbounded leak. Clean up leading orphan separators after the loop.
  • tests/e2e/.../test_dom_resource_limits.py:21 — asserts on .message only, but pruneOldMessages counts a superset. The lower bound >= 100 masks the mismatch. Either assert on the same selector the prune uses, or rename the assertion message.
  • app.js:1071-1080 — the jobEvents LRU scans all entries on every event. Map preserves insertion order; jobEvents.keys().next() is O(1). Simpler and faster.
  • app.js:1871 — the degenerate case where every node is data-streaming="true" silently under-prunes. Minor.
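The first correctness point — prune by flat index, then sweep orphaned separators — can be modeled without a DOM. `nodes` below is an ordered list of `{ kind, streaming }` records standing in for .message / .activity-group / .time-separator elements; the cap and function name mirror the PR but the array model is purely illustrative:

```javascript
const MAX_DOM_MESSAGES = 200;

function pruneOldMessages(nodes, max = MAX_DOM_MESSAGES) {
  let excess = nodes.length - max;
  let i = 0;
  while (i < nodes.length && excess > 0) {
    if (nodes[i].streaming) { i++; continue; }  // never cut mid-stream
    nodes.splice(i, 1);                         // prune from the front
    excess--;
  }
  // Sweep orphaned separators: a time-separator no longer followed by
  // a non-separator node introduces nothing and can be removed.
  // Iterating backwards lets runs of separators cascade away.
  for (let j = nodes.length - 1; j >= 0; j--) {
    const next = nodes[j + 1];
    if (nodes[j].kind === "time-separator" &&
        (next === undefined || next.kind === "time-separator")) {
      nodes.splice(j, 1);
    }
  }
  return nodes;
}
```

Without the sweep, a separator whose trailing message was pruned would linger forever; with it, the container holds only separators that still introduce something.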

v2 coverage

All changes are in the web channel / SSE layer; not engine-specific. No v2-vs-v1 split here. ✅

Tests — please add e2e

  • Playwright: open two tabs, trigger many messages, hide one tab for > polling interval, restore — assert no duplicate timers firing (count requests to /api/gateway/status before/after).
  • Playwright: send > 250 messages in a session, assert DOM .message count stays ≤ 200 AND streaming messages are preserved mid-stream.
  • Playwright: trigger > 50 jobs, assert jobEvents map stays bounded via an instrumentation hook or via memory-pressure proxy.
  • Rust: unit test that SseManager::with_max_connections_and_buffer actually honors the buffer size (receiver count + send-past-capacity sanity).

Rust side

  • channels.rs:51 — document that memory = buffer × receivers × event_size. At 65536 × 100 conn × 200B ≈ 1.3GB worst case under lag; ceiling deserves a comment.
  • sse.rs:76-77 — the "capacity is already baked into tx" comment is under with_max_connections_and_buffer's doc but describes rebuild_state. Move it.

No .unwrap() introduced. parse_optional_env + ConfigError::InvalidValue on zero is the right pattern. 👍
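The memory ceiling and the env-var handling described above can be checked with quick arithmetic. This sketch mirrors the PR's constants (default 1024, clamp at 65,536, zero rejected) but is written in JavaScript purely for illustration; the real logic lives in Rust in config/channels.rs, and avgEventBytes is the reviewer's assumed figure, not a measurement:

```javascript
const DEFAULT_BROADCAST_BUFFER = 1024;
const MAX_BROADCAST_BUFFER = 65536;

// Resolve SSE_BROADCAST_BUFFER the way the config layer is described
// to: fall back when unset, reject zero, clamp to the maximum.
function resolveBroadcastBuffer(raw) {
  if (raw === undefined) return DEFAULT_BROADCAST_BUFFER;
  const n = Number.parseInt(raw, 10);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error("SSE_BROADCAST_BUFFER must be a positive integer");
  }
  return Math.min(n, MAX_BROADCAST_BUFFER);
}

// Worst case under lag: every receiver retains a full buffer.
function worstCaseBytes(buffer, receivers, avgEventBytes) {
  return buffer * receivers * avgEventBytes;
}
```

At the clamp ceiling with 100 connections and ~200-byte events, `worstCaseBytes(65536, 100, 200)` is about 1.3 GB, which is the figure the review asks to be documented next to MAX_BROADCAST_BUFFER.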

Collaborator

@serrrfirat left a comment


Paranoid Security Review — APPROVE

PR: #2441 — Prevent browser crash from timer leaks, DOM growth, SSE buffer
Reviewer context: Automated deep security review

Findings

Net security improvement. No new attack surfaces introduced.

  • No XSS vectors: pruneOldMessages() only calls .remove() on existing DOM nodes. No innerHTML, no dynamic content creation. Existing DOMPurify.sanitize() pipeline unchanged.
  • Timer DoS surface reduced: cleanupConnectionState() clears five connection-scoped timers on every reconnect path. startGatewayStatusPolling() guard prevents double-interval accumulation.
  • All exit paths covered: Cleanup on beforeunload, visibilitychange (hidden), and connectSSE() reconnection.
  • DOM growth bounded: 200-message cap with streaming-node preservation (data-streaming="true" skip).
  • Job events bounded: 50-entry LRU cap. O(50) eviction — negligible.
  • SSE buffer capped: Configurable up to 65,536. Only server operators can set env vars — no external DoS vector.
  • No new .unwrap()/.expect() in production Rust code. Config validation prevents broadcast::channel(0) panic.
  • E2E tests well-structured: Timer leak test uses add_init_script() monkey-patching before page load.

Optional follow-up

Consider calling pruneOldMessages() after loadHistory() to cover the history-scroll-back growth path.

LGTM.

@serrrfirat serrrfirat self-requested a review April 14, 2026 08:10
Collaborator

@serrrfirat left a comment


Let's add Playwright tests here.

…#2406)

Correctness:
- pruneOldMessages: clean up orphaned leading time-separators after pruning
- jobEvents LRU: replace O(n) scan with O(1) Map insertion-order eviction
- Document degenerate all-streaming under-prune case

Playwright e2e tests:
- Tab hide/restore: no duplicate gateway status polling intervals
- DOM cap + streaming: 260 elements prune to ≤200, streaming preserved, no orphan separators
- jobEvents bounded: 60 jobs stay capped at ≤50 via LRU eviction
- Fix assertion selector to match pruneOldMessages superset, tighten lower bound

Rust:
- Unit test: SseManager buffer size parameter actually controls lag behavior
- Document MAX_BROADCAST_BUFFER memory impact (65K×100×200B ≈ 1.3 GB)
- Move "capacity baked into tx" comment from from_sender to rebuild_state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
serrrfirat added a commit that referenced this pull request Apr 14, 2026
)

The O(1) LRU eviction skips the job that just received an event (moved
to end via delete+set), but did not protect the job the user is actively
viewing in the detail panel (currentJobId). If the user views a quiet
job while 50+ other jobs fire events, the viewed job's events would be
evicted and the activity tab would appear empty.

Add a currentJobId guard to the eviction loop and a Playwright e2e test
that verifies the actively-viewed job survives LRU pressure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
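The eviction rule this commit describes can be sketched as follows. `recordJobEvent` and `JOB_EVENTS_CAP` are illustrative names: a Map used as an LRU via delete+set, with both the just-touched job and the job open in the detail panel (`currentJobId`) protected from eviction:

```javascript
const JOB_EVENTS_CAP = 50;

function recordJobEvent(jobEvents, jobId, event, currentJobId) {
  const events = jobEvents.get(jobId) ?? [];
  events.push(event);
  jobEvents.delete(jobId);      // delete+set bumps the job to most-recent
  jobEvents.set(jobId, events);
  while (jobEvents.size > JOB_EVENTS_CAP) {
    let evicted = false;
    // Map iterates in insertion order, so the first unprotected key is
    // the least recently updated: effectively O(1) per event.
    for (const key of jobEvents.keys()) {
      if (key === jobId || key === currentJobId) continue;  // protected
      jobEvents.delete(key);
      evicted = true;
      break;
    }
    if (!evicted) break;  // everything protected: under-evict, don't break the UI
  }
}
```

The `currentJobId` guard is what keeps a quiet-but-viewed job's activity tab populated while 50+ other jobs fire events.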
Copilot AI review requested due to automatic review settings April 14, 2026 10:24
Contributor

Copilot AI left a comment


Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

@github-actions github-actions bot added the size: XL (500+ changed lines) label and removed size: L (200-499 changed lines) Apr 14, 2026
This was referenced Apr 16, 2026

Labels

contributor: core (20+ merged PRs), risk: medium (business logic, config, or moderate-risk modules), scope: channel/web (Web gateway channel), scope: docs (Documentation), size: XL (500+ changed lines)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[QA] Pages Unresponsive dialog and black screen crashes

4 participants