Epic: Gateway state convergence — eliminate UI/backend state drift #2792

@ilblackdragon

Problem

Every recent UI desync report in the web gateway belongs to the same class: the UI is driven by a stream of deltas and never reconciles against an authoritative state. Two sources of truth diverge, and nothing forces them back together.

Concrete drift surfaces in play today:

Fixing each symptom individually is linear work with no upper bound. The structural fix below makes the class finite.

Target invariants

  1. One source of truth per concept, everything else is a projection. Turn progression → engine event log. Gates → pending-gate store. Files → workspace. Every other surface (SSE, WS, history API, frontend state) is a pure derivation. No tool or handler may emit an AppEvent without writing an engine event first.

  2. Every UI concept has exactly one GET that returns canonical truth. Events are accelerators, not contracts. If the stream is silent for an hour, the UI must converge by calling one documented endpoint. Most GETs already exist (/api/chat/history.pending_gate, .in_progress, /api/chat/threads.active_thread) — the rule is enforcement, not new surface.

  3. Client and server share a monotonic cursor per stream. HistoryResponse, ThreadsResponse, and key status GETs return the same boot_id:counter cursor that SSE emits. On reconnect, tab focus, or cursor mismatch, the GET is canonical; the UI refetches the slice.

  4. One transport with one envelope. SSE and WS share event-ID semantics and reconciliation protocol, or one is deprecated for chat. The current asymmetry (IDs on SSE, none on WS; two envelope shapes) is free drift.
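Invariant 3 depends on both sides agreeing on how a boot_id:counter cursor is read and ordered. A minimal sketch follows, assuming the counter is an unsigned integer and that counters are only comparable within one boot. The Cursor type and its methods are hypothetical names for illustration, not existing gateway code.

```rust
use std::cmp::Ordering;

/// Hypothetical in-memory form of a `boot_id:counter` cursor.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Cursor {
    pub boot_id: String,
    pub counter: u64,
}

impl Cursor {
    /// Parse `boot_id:counter`. Splits at the last ':' so a boot id
    /// containing colons still parses (an assumption about the format).
    pub fn parse(s: &str) -> Option<Cursor> {
        let (boot_id, counter) = s.rsplit_once(':')?;
        if boot_id.is_empty() {
            return None;
        }
        Some(Cursor {
            boot_id: boot_id.to_string(),
            counter: counter.parse().ok()?,
        })
    }

    /// Order two cursors. Returns None across different boots, because a
    /// counter from a previous process lifetime says nothing about this one.
    pub fn compare(&self, other: &Cursor) -> Option<Ordering> {
        if self.boot_id == other.boot_id {
            Some(self.counter.cmp(&other.counter))
        } else {
            None
        }
    }
}
```

A None from compare is the case where the client has no basis for trusting its local state and would fall back to the canonical GET.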

Program of work (ordered)

Phase 1 — close the bridge (prereq for everything else)

  • #2654 (tracking: complete engine→AppEvent coverage, gaps after #2571 / #2530): finish thread_event_to_app_events coverage for the 10 dropped engine variants and the 7 missing semantic variants.
  • Add a pre-commit / CI check: calls to broadcast( and broadcast_for_user( outside bridge::thread_event_to_app_events fail the check unless annotated with a trailing // projection-exempt: <reason>. The pattern mirrors dispatch-exempt in scripts/pre-commit-safety.sh.
  • Migrate current non-engine AppEvent emit sites (plan tool, gate manager, reasoning tool, any direct channel broadcasts) to record an engine event first and derive the AppEvent via the bridge.
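The line-level rule behind the proposed check can be sketched as below. This is only an illustration of the matching logic, written in Rust for consistency with the codebase; the real check would presumably live alongside scripts/pre-commit-safety.sh. It assumes the caller has already skipped the bridge module, and find_violations is a hypothetical name.

```rust
/// Return the 1-based line numbers that call a broadcast function without a
/// trailing `// projection-exempt: <reason>` annotation.
///
/// Simplifications in this sketch: matching is by substring (so a definition
/// such as `fn broadcast(` or a name ending in `broadcast(` also matches),
/// and everything after the first `//` on a line is treated as a comment.
pub fn find_violations(source: &str) -> Vec<usize> {
    const CALLS: [&str; 2] = ["broadcast(", "broadcast_for_user("];
    const MARKER: &str = "// projection-exempt:";

    source
        .lines()
        .enumerate()
        .filter_map(|(idx, line)| {
            // Only look for calls in the code part of the line.
            let code = line.split("//").next().unwrap_or("");
            let has_call = CALLS.iter().any(|call| code.contains(call));

            // The annotation only counts when a non-empty reason follows it.
            let exempt = line
                .split_once(MARKER)
                .map_or(false, |(_, reason)| !reason.trim().is_empty());

            if has_call && !exempt {
                Some(idx + 1)
            } else {
                None
            }
        })
        .collect()
}
```

Requiring a non-empty reason keeps the annotation from becoming a silent opt-out.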

Phase 2 — cursor + reconcile protocol

  • Add cursor: String (same boot_id:counter shape as SSE event IDs) to HistoryResponse, ThreadsResponse, ExtensionsResponse, and the status GETs the frontend polls.
  • Frontend reconciler (crates/ironclaw_gateway/static/js/core/...): on (a) SSE reconnect, (b) tab focus, (c) lastEventId vs. GET cursor mismatch — refetch the canonical slice and apply as authoritative, overriding local optimistic state.
  • Documentation rule: every AppEvent variant names the GET that returns the same truth. Reviewers enforce.
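The "apply as authoritative" step of the reconciler can be sketched as follows. The actual reconciler is frontend JavaScript; this Rust model only illustrates the state transition. Slice and ClientState are hypothetical types, and the in_progress and pending_gate fields are borrowed from the /api/chat/history fields named above with assumed types.

```rust
/// Hypothetical canonical slice returned by a GET, carrying its cursor.
#[derive(Debug, Clone, PartialEq)]
pub struct Slice {
    pub cursor: String,
    pub in_progress: bool,
    pub pending_gate: Option<String>,
}

/// Hypothetical client-side state: the last event ID seen on the stream plus
/// the view built from deltas and optimistic updates.
pub struct ClientState {
    pub last_event_id: Option<String>,
    pub view: Slice,
}

impl ClientState {
    /// Apply a freshly fetched canonical slice. The GET always wins: local
    /// state is overwritten and the stream position is reset to the GET
    /// cursor. Returns true when the visible state changed, so the caller
    /// knows whether a re-render is needed.
    pub fn reconcile(&mut self, canonical: Slice) -> bool {
        let changed = self.view != canonical;
        self.last_event_id = Some(canonical.cursor.clone());
        self.view = canonical;
        changed
    }
}
```

Returning false when nothing changed is what would let most tab-focus reconciles be cheap no-ops on the client side.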

Phase 3 — replay from the engine log

  • GET /api/chat/events/replay?thread={id}&since_cursor={N} — re-derive AppEvents from persisted engine events, not from the in-memory broadcast channel. Reconnect path switches to this.
  • Consequence: BroadcastStream silent drop becomes a performance concern (stale buffer), not a correctness bug (lost event).
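A rough sketch of the replay idea: filter the persisted engine events past the client's cursor and re-run the bridge over them, rather than reading the broadcast buffer. Everything here is illustrative. The variants other than TurnCancelled are made up, bridge_stub stands in for thread_event_to_app_events (whose real signature is not shown in this issue), and since_cursor is assumed to reduce to a numeric counter.

```rust
/// Hypothetical engine event variants (only TurnCancelled is named in this epic).
#[derive(Debug, Clone, PartialEq)]
pub enum EngineEvent {
    TurnStarted,
    TurnCancelled,
    InternalBookkeeping,
}

/// Hypothetical UI-facing events derived from engine events.
#[derive(Debug, Clone, PartialEq)]
pub enum AppEvent {
    TurnStarted,
    TurnCancelled,
}

/// A persisted engine event with its position in the log.
pub struct PersistedEvent {
    pub counter: u64,
    pub event: EngineEvent,
}

/// Stand-in for the bridge: one engine event projects to zero or more AppEvents.
pub fn bridge_stub(event: &EngineEvent) -> Vec<AppEvent> {
    match event {
        EngineEvent::TurnStarted => vec![AppEvent::TurnStarted],
        EngineEvent::TurnCancelled => vec![AppEvent::TurnCancelled],
        EngineEvent::InternalBookkeeping => Vec::new(),
    }
}

/// Re-derive the AppEvents the client missed: everything strictly after
/// `since_counter`, each tagged with the counter of its source engine event.
pub fn replay_since(log: &[PersistedEvent], since_counter: u64) -> Vec<(u64, AppEvent)> {
    log.iter()
        .filter(|p| p.counter > since_counter)
        .flat_map(|p| {
            bridge_stub(&p.event)
                .into_iter()
                .map(move |app| (p.counter, app))
        })
        .collect()
}
```

Because the output is a pure function of the durable log, a client that was disconnected for longer than the broadcast buffer still receives the same events it would have seen live.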

Phase 4 — unify transports

  • Either add event-ID replay semantics to subscribe_raw (WS path) or deprecate WS for the chat surface. Document the choice in src/channels/web/CLAUDE.md and remove the loser in the same PR.

Phase 5 — delete legacy forks

  • Retire v1 pending_auth: delete /api/chat/auth-token, /api/chat/auth-cancel, clear_auth_mode* in platform/legacy_auth.rs, and the no-request_id branch in static/js/core/onboarding.js. Gate on engine v1 retirement.
  • #2633 (Gateway: pre-existing correctness/perf issues surfaced during #2628 platform extraction), items 1-2: the WorkspacePool cache key must include a stable hash of applied scopes (or apply scopes outside the cache); seed-before-insert (or an in-flight seed marker) closes the divergence window.
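One way to read "stable hash of applied scopes" is a hash that ignores scope order and duplicates and does not depend on a per-process seed. The sketch below uses a hand-written FNV-1a for that reason; the choice of FNV-1a, the PoolKey shape, and the user_id field are all assumptions for illustration, not the actual WorkspacePool design.

```rust
/// Fold bytes into a running FNV-1a 64-bit hash.
fn fnv1a(mut hash: u64, bytes: &[u8]) -> u64 {
    for b in bytes {
        hash ^= u64::from(*b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    hash
}

/// Hash a scope set so that order and duplicates do not matter.
pub fn scopes_hash(scopes: &[&str]) -> u64 {
    let mut sorted: Vec<&str> = scopes.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV 64-bit offset basis
    for scope in sorted {
        hash = fnv1a(hash, scope.as_bytes());
        // Separator byte so ["ab", "c"] and ["a", "bc"] feed different input.
        hash = fnv1a(hash, &[0]);
    }
    hash
}

/// Hypothetical cache key: the same user with different applied scopes must
/// not share a cached workspace.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PoolKey {
    pub user_id: String,
    pub scopes_hash: u64,
}

impl PoolKey {
    pub fn new(user_id: &str, scopes: &[&str]) -> PoolKey {
        PoolKey {
            user_id: user_id.to_string(),
            scopes_hash: scopes_hash(scopes),
        }
    }
}
```

Keeping the sorted scope list itself in the key would avoid hash collisions entirely, at the cost of a larger key.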

Phase 6 — forcing functions

  • #2121 (feat(web): visible Stop control and hard cancel for active chat turns) hard-cancel endpoint. Not for its own sake: it forces a TurnCancelled engine event to exist, which forces the reconcile path to exist.
  • E2E regression covering the protocol: disconnect SSE for longer than the broadcast buffer, reconnect, assert the UI converges to backend state without a manual refresh. Blocks on Phase 3.

Related issues this class subsumes

Non-goals

  • Rewriting SseManager or the AppEvent wire format. Phases 1-3 are additive.
  • Replacing the broadcast channel with a durable queue. Replay comes from the engine event log, which is already durable; broadcast stays the fast path.
  • Engine v1 work. Phase 5's legacy deletions gate on v1 retirement elsewhere.
  • Frontend rendering of newly-bridged events: follow-ups per variant, tracked under #2654 (tracking: complete engine→AppEvent coverage, gaps after #2571 / #2530).

Risks

  • Invariant #1 has to be enforced by a script. A convention that says "don't call broadcast directly" will be broken within a week without a pre-commit check. Phase 1 must land the check.
  • Phase 2 touches every mutation handler. Small per-handler, nontrivial total surface.
  • Phase 3 assumes every UI-visible event has a corresponding engine event. That's exactly what Phase 1 guarantees — run in order.
  • Tab-focus reconcile could flood the backend at 100+ concurrent users. Cursor compare keeps most reconciles 304-equivalent; measure before Phase 2 frontend lands.
  • Scope creep into engine v2. This epic deliberately does not touch engine internals beyond "emit engine events where they should have existed." If a variant requires engine-loop surgery, file a sub-issue and defer.

Metadata

Labels: enhancement (New feature or request); scope: agent (Agent core: agent loop, router, scheduler); scope: channel/web (Web gateway channel)
