Every recent UI desync report in the web gateway is the same class: the UI is driven by a stream of deltas without ever reconciling against an authoritative state. Two sources of truth diverge, and nothing forces them back together.
Concrete drift surfaces in play today:
Engine event log vs. AppEvent broadcast stream — thread_event_to_app_events in src/bridge/router.rs translates 8 of 18 engine variants (tracking: complete engine→AppEvent coverage (gaps after #2571 / #2530) #2654); plan tool, gate manager, and reasoning tool also emit AppEvents without a corresponding engine event, so the engine log and the broadcast stream are not the same projection.
Broadcast stream vs. UI state — BroadcastStream silently drops on lag (src/channels/web/platform/sse.rs:188-202); subscribe_raw (WebSocket + Responses API) has no Last-Event-ID replay at all.
Fixing each symptom individually is linear work with no upper bound. The structural fix below makes the class finite.
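A toy bounded buffer can illustrate why a silent drop is a correctness problem and how a per-event counter makes the gap detectable. This sketch is not the gateway's BroadcastStream: `Buffer`, `ReadError`, and `read_since` are hypothetical names, and the eviction policy (drop the oldest entry when full) is an assumption made for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a bounded in-memory broadcast buffer.
pub struct Buffer {
    capacity: usize,
    next_counter: u64,
    events: VecDeque<(u64, String)>,
}

#[derive(Debug, PartialEq)]
pub enum ReadError {
    /// The reader fell behind; `missed` events were evicted before it saw them.
    Lagged { missed: u64 },
}

impl Buffer {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be at least 1");
        Buffer { capacity, next_counter: 1, events: VecDeque::new() }
    }

    /// Append an event, evicting the oldest one when the buffer is full.
    pub fn push(&mut self, payload: &str) {
        if self.events.len() == self.capacity {
            self.events.pop_front();
        }
        self.events.push_back((self.next_counter, payload.to_string()));
        self.next_counter += 1;
    }

    /// Return events newer than `last_seen`, or report the gap explicitly
    /// instead of handing back a stream with a hole in it.
    pub fn read_since(&self, last_seen: u64) -> Result<Vec<String>, ReadError> {
        let oldest = self
            .events
            .front()
            .map(|(counter, _)| *counter)
            .unwrap_or(self.next_counter);
        if oldest > last_seen + 1 {
            return Err(ReadError::Lagged { missed: oldest - last_seen - 1 });
        }
        Ok(self
            .events
            .iter()
            .filter(|(counter, _)| *counter > last_seen)
            .map(|(_, payload)| payload.clone())
            .collect())
    }
}
```

With a capacity of 2 and three pushes, a reader that last saw counter 1 still gets a contiguous tail, while a reader that saw nothing gets `Lagged { missed: 1 }` and knows it must refetch rather than trust the deltas.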
Target invariants
One source of truth per concept, everything else is a projection. Turn progression → engine event log. Gates → pending-gate store. Files → workspace. Every other surface (SSE, WS, history API, frontend state) is a pure derivation. No tool or handler may emit an AppEvent without writing an engine event first.
Every UI concept has exactly one GET that returns canonical truth. Events are accelerators, not contracts. If the stream is silent for an hour, the UI must converge by calling one documented endpoint. Most GETs already exist (/api/chat/history.pending_gate, .in_progress, /api/chat/threads.active_thread) — the rule is enforcement, not new surface.
Client and server share a monotonic cursor per stream. HistoryResponse, ThreadsResponse, and key status GETs return the same boot_id:counter cursor that SSE emits. On reconnect, tab focus, or cursor mismatch, the GET is canonical; the UI refetches the slice.
One transport with one envelope. SSE and WS share event-ID semantics and reconciliation protocol, or one is deprecated for chat. The current asymmetry (IDs on SSE, none on WS; two envelope shapes) is free drift.
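The shared-cursor invariant can be sketched as a small parse-and-compare routine. The issue only fixes the `boot_id:counter` shape; the `Cursor` and `Reconcile` types, the `u64` counter, and the rule that a client already at or ahead of the GET's counter counts as up to date are assumptions made for illustration.

```rust
/// Hypothetical cursor type; only the `boot_id:counter` shape comes from the issue.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cursor {
    pub boot_id: String,
    pub counter: u64,
}

impl Cursor {
    /// Parse `boot_id:counter`, splitting on the last `:` so a boot ID that
    /// itself contains colons still parses.
    pub fn parse(raw: &str) -> Option<Cursor> {
        let (boot_id, counter) = raw.rsplit_once(':')?;
        if boot_id.is_empty() {
            return None;
        }
        Some(Cursor {
            boot_id: boot_id.to_string(),
            counter: counter.parse().ok()?,
        })
    }
}

/// Outcome of comparing the client's last seen event ID with a GET cursor.
#[derive(Debug, PartialEq, Eq)]
pub enum Reconcile {
    UpToDate,
    Refetch,
}

/// Assumed policy: refetch when the client has no cursor, the server has
/// rebooted (different boot ID), or the client's counter is behind.
pub fn compare(local: Option<&Cursor>, server: &Cursor) -> Reconcile {
    match local {
        Some(l) if l.boot_id == server.boot_id && l.counter >= server.counter => {
            Reconcile::UpToDate
        }
        _ => Reconcile::Refetch,
    }
}
```

Under this policy a mismatched boot ID always forces a refetch, because counters from different server lifetimes are not comparable.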
Program of work (ordered)
Phase 1 — close the bridge (prereq for everything else)
Add a pre-commit / CI check: broadcast( and broadcast_for_user( outside bridge::thread_event_to_app_events fail unless annotated with a trailing // projection-exempt: <reason>. Pattern mirrors dispatch-exempt in scripts/pre-commit-safety.sh.
Migrate current non-engine AppEvent emit sites (plan tool, gate manager, reasoning tool, any direct channel broadcasts) to record an engine event first and derive the AppEvent via the bridge.
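The rule behind the Phase 1 check can be sketched as a line scanner. The real check presumably lives next to scripts/pre-commit-safety.sh as a shell script, so the Rust below only illustrates the logic. The function names are hypothetical, and two simplifications are assumptions: the whole bridge file is treated as allowed (the issue scopes the allowance to the function), and full-line comments are skipped.

```rust
/// Call patterns and exemption marker named in the issue.
const PATTERNS: [&str; 2] = ["broadcast(", "broadcast_for_user("];
const EXEMPT_MARKER: &str = "// projection-exempt:";
/// Assumption: the whole bridge file is allowed, not just one function.
const BRIDGE_FILE: &str = "src/bridge/router.rs";

/// Return the 1-based line numbers in `source` that violate the rule.
pub fn violations(path: &str, source: &str) -> Vec<usize> {
    if path.ends_with(BRIDGE_FILE) {
        return Vec::new();
    }
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| calls_broadcast(line) && !is_exempt(line))
        .map(|(idx, _)| idx + 1)
        .collect()
}

/// True when a non-comment line contains one of the broadcast call patterns.
fn calls_broadcast(line: &str) -> bool {
    !line.trim_start().starts_with("//") && PATTERNS.iter().any(|p| line.contains(*p))
}

/// True when the line carries the marker followed by a non-empty reason.
fn is_exempt(line: &str) -> bool {
    match line.split_once(EXEMPT_MARKER) {
        Some((_, reason)) => !reason.trim().is_empty(),
        None => false,
    }
}
```

Requiring a non-empty reason keeps the annotation from becoming a rubber stamp: a bare `// projection-exempt:` still fails the check.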
Phase 2 — cursor + reconcile protocol
Add cursor: String (same boot_id:counter shape as SSE event IDs) to HistoryResponse, ThreadsResponse, ExtensionsResponse, and the status GETs the frontend polls.
Frontend reconciler (crates/ironclaw_gateway/static/js/core/...): on (a) SSE reconnect, (b) tab focus, (c) lastEventId vs. GET cursor mismatch — refetch the canonical slice and apply as authoritative, overriding local optimistic state.
Documentation rule: every AppEvent variant names the GET that returns the same truth. Reviewers enforce.
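The documentation rule could also be backed by the compiler instead of relying on reviewers alone. In the sketch below the variant names are hypothetical (the real AppEvent has a different and larger set); only the endpoint paths and field names come from the issue.

```rust
/// Illustrative subset with made-up variant names.
#[derive(Debug, Clone, Copy)]
pub enum AppEvent {
    GateRequired,
    TurnStarted,
    ThreadSwitched,
}

impl AppEvent {
    /// The GET that returns the same truth this event accelerates.
    /// There is deliberately no wildcard arm: adding a variant without
    /// naming its endpoint fails to compile.
    pub fn canonical_get(self) -> &'static str {
        match self {
            AppEvent::GateRequired => "/api/chat/history (pending_gate)",
            AppEvent::TurnStarted => "/api/chat/history (in_progress)",
            AppEvent::ThreadSwitched => "/api/chat/threads (active_thread)",
        }
    }
}
```

The design choice is the exhaustive `match`: the mapping cannot fall out of date silently, since a new variant is a build error until someone decides which endpoint is canonical for it.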
Phase 3 — replay from the engine log
GET /api/chat/events/replay?thread={id}&since_cursor={N} — re-derive AppEvents from persisted engine events, not from the in-memory broadcast channel. Reconnect path switches to this.
Consequence: BroadcastStream silent drop becomes a performance concern (stale buffer), not a correctness bug (lost event).
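A minimal sketch of the replay idea, assuming the persisted log is an ordered list of `(counter, event)` pairs and the bridge is a pure function. The event variants and the `project` and `replay` names are hypothetical, not the engine's actual types.

```rust
/// Hypothetical engine events as they would be persisted.
#[derive(Debug, Clone, PartialEq)]
pub enum EngineEvent {
    TurnStarted,
    ToolCalled { name: String },
    TurnCompleted,
}

/// Hypothetical UI-facing projections.
#[derive(Debug, Clone, PartialEq)]
pub enum AppEvent {
    Processing(bool),
    ToolActivity(String),
}

/// Stand-in for the bridge: a pure function from one engine event to
/// zero or more projections.
fn project(ev: &EngineEvent) -> Vec<AppEvent> {
    match ev {
        EngineEvent::TurnStarted => vec![AppEvent::Processing(true)],
        EngineEvent::ToolCalled { name } => vec![AppEvent::ToolActivity(name.clone())],
        EngineEvent::TurnCompleted => vec![AppEvent::Processing(false)],
    }
}

/// Re-derive every projection with a counter greater than `since` from the
/// persisted log. `log` is assumed to be in ascending counter order.
pub fn replay(log: &[(u64, EngineEvent)], since: u64) -> Vec<(u64, AppEvent)> {
    log.iter()
        .filter(|(counter, _)| *counter > since)
        .flat_map(|(counter, ev)| project(ev).into_iter().map(move |app| (*counter, app)))
        .collect()
}
```

Because `replay` reads only the log, the result does not depend on what the in-memory channel still holds, which is what turns a dropped broadcast into a latency cost rather than a lost event.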
Phase 4 — unify transports
Either add event-ID replay semantics to subscribe_raw (WS path) or deprecate WS for the chat surface. Document the choice in src/channels/web/CLAUDE.md and remove the loser in the same PR.
Phase 5 — delete legacy forks
Retire v1 pending_auth: delete /api/chat/auth-token, /api/chat/auth-cancel, clear_auth_mode* in platform/legacy_auth.rs, and the no-request_id branch in static/js/core/onboarding.js. Gate on engine v1 retirement.
E2E regression covering the protocol: disconnect SSE for longer than the broadcast buffer, reconnect, assert the UI converges to backend state without a manual refresh. Blocks on Phase 3.
Phase 2 touches every mutation handler. Small per-handler, nontrivial total surface.
Phase 3 assumes every UI-visible event has a corresponding engine event. That's exactly what Phase 1 guarantees — run in order.
Tab-focus reconcile could flood the backend at 100+ concurrent users. Cursor compare keeps most reconciles 304-equivalent; measure before Phase 2 frontend lands.
Scope creep into engine v2. This epic deliberately does not touch engine internals beyond "emit engine events where they should have existed." If a variant requires engine-loop surgery, file a sub-issue and defer.
Problem
Every recent UI desync report in the web gateway is the same class: the UI is driven by a stream of deltas without ever reconciling against an authoritative state. Two sources of truth diverge, and nothing forces them back together.
Concrete drift surfaces in play today:
Engine event log vs. AppEvent broadcast stream — thread_event_to_app_events in src/bridge/router.rs translates 8 of 18 engine variants (tracking: complete engine→AppEvent coverage (gaps after #2571 / #2530) #2654); plan tool, gate manager, and reasoning tool also emit AppEvents without a corresponding engine event, so the engine log and the broadcast stream are not the same projection.
Broadcast stream vs. UI state — BroadcastStream silently drops on lag (src/channels/web/platform/sse.rs:188-202); subscribe_raw (WebSocket + Responses API) has no Last-Event-ID replay at all.
v1 pending_auth vs. v2 gate_required — two live gate protocols; the stale approval modal in #2534 (Web UI: Service connection flow broken (stale approval + lost OAuth callback)) is exactly a modal resolved on one side while chat state stayed on the other.
WorkspacePool cached by user_id with token-specific scopes applied post-cache (#2633 item 1, Gateway: pre-existing correctness/perf issues surfaced during #2628 platform extraction); seed_if_empty runs after the cache insert and a seed failure stays cached forever (#2633 item 2).
Fixing each symptom individually is linear work with no upper bound. The structural fix below makes the class finite.
Target invariants
One source of truth per concept, everything else is a projection. Turn progression → engine event log. Gates → pending-gate store. Files → workspace. Every other surface (SSE, WS, history API, frontend state) is a pure derivation. No tool or handler may emit an AppEvent without writing an engine event first.
Every UI concept has exactly one GET that returns canonical truth. Events are accelerators, not contracts. If the stream is silent for an hour, the UI must converge by calling one documented endpoint. Most GETs already exist (/api/chat/history.pending_gate, .in_progress, /api/chat/threads.active_thread) — the rule is enforcement, not new surface.
Client and server share a monotonic cursor per stream. HistoryResponse, ThreadsResponse, and key status GETs return the same boot_id:counter cursor that SSE emits. On reconnect, tab focus, or cursor mismatch, the GET is canonical; the UI refetches the slice.
One transport with one envelope. SSE and WS share event-ID semantics and reconciliation protocol, or one is deprecated for chat. The current asymmetry (IDs on SSE, none on WS; two envelope shapes) is free drift.
Program of work (ordered)
Phase 1 — close the bridge (prereq for everything else)
thread_event_to_app_events coverage for the 10 dropped engine variants and the 7 missing semantic variants.
Add a pre-commit / CI check: broadcast( and broadcast_for_user( outside bridge::thread_event_to_app_events fail unless annotated with a trailing // projection-exempt: <reason>. Pattern mirrors dispatch-exempt in scripts/pre-commit-safety.sh.
Migrate current non-engine AppEvent emit sites (plan tool, gate manager, reasoning tool, any direct channel broadcasts) to record an engine event first and derive the AppEvent via the bridge.
Phase 2 — cursor + reconcile protocol
Add cursor: String (same boot_id:counter shape as SSE event IDs) to HistoryResponse, ThreadsResponse, ExtensionsResponse, and the status GETs the frontend polls.
Frontend reconciler (crates/ironclaw_gateway/static/js/core/...): on (a) SSE reconnect, (b) tab focus, (c) lastEventId vs. GET cursor mismatch — refetch the canonical slice and apply as authoritative, overriding local optimistic state.
Documentation rule: every AppEvent variant names the GET that returns the same truth. Reviewers enforce.
Phase 3 — replay from the engine log
GET /api/chat/events/replay?thread={id}&since_cursor={N} — re-derive AppEvents from persisted engine events, not from the in-memory broadcast channel. Reconnect path switches to this.
Consequence: BroadcastStream silent drop becomes a performance concern (stale buffer), not a correctness bug (lost event).
Phase 4 — unify transports
Either add event-ID replay semantics to subscribe_raw (WS path) or deprecate WS for the chat surface. Document the choice in src/channels/web/CLAUDE.md and remove the loser in the same PR.
Phase 5 — delete legacy forks
Retire v1 pending_auth: delete /api/chat/auth-token, /api/chat/auth-cancel, clear_auth_mode* in platform/legacy_auth.rs, and the no-request_id branch in static/js/core/onboarding.js. Gate on engine v1 retirement.
WorkspacePool cache key must include a stable hash of applied scopes (or apply scopes outside the cache); seed-before-insert (or in-flight seed marker) closes the divergence window.
Phase 6 — forcing functions
TurnCancelled engine event to exist, which forces the reconcile path to exist.
Related issues this class subsumes
Processing state
WorkspacePool cache vs. truth
Non-goals
SseManager or the AppEvent wire format. Phases 1-3 are additive.
Risks
broadcast directly" will be broken within a week without a pre-commit check. Phase 1 must land the check.