Skip to content

fix(gate): handle orphaned approval gates when thread deleted#2347

Open
zmanian wants to merge 2 commits intostagingfrom
fix/2323-orphaned-approval-gates
Open

fix(gate): handle orphaned approval gates when thread deleted#2347
zmanian wants to merge 2 commits intostagingfrom
fix/2323-orphaned-approval-gates

Conversation

@zmanian
Copy link
Copy Markdown
Collaborator

@zmanian zmanian commented Apr 11, 2026

Summary

Fixes #2323. When a thread is deleted while an approval gate is pending, the gate becomes orphaned -- unresolvable on backend and undismissable on frontend.

  • Backend: execute_pending_gate_action now detects missing threads and emits gate_resolved with resolution: "expired" instead of returning an opaque error. The gate (already taken from the store) is gracefully dismissed.
  • Backend: PendingGateStore::discard_for_thread(thread_id) added for bulk cleanup of all gates tied to a thread, usable from conversation clear and future thread deletion paths.
  • Frontend: handleGateResolved now handles resolution: "expired" to remove stale approval/auth cards and re-enable chat input.

Test plan

  • cargo test -p ironclaw --lib gate -- 22 tests pass, including 2 new regression tests:
    • test_discard_for_thread_removes_all_gates -- verifies bulk cleanup removes all gates for a thread across users
    • test_discard_for_thread_returns_empty_when_no_gates -- verifies no-op on absent thread
  • cargo clippy --all --benches --tests --examples --all-features -- zero warnings
  • cargo fmt --check -- clean
  • Manual: delete a thread with a pending approval gate, verify the card disappears on the frontend

Generated with Claude Code

@github-actions github-actions bot added size: M 50-199 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Apr 11, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd11a6164c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/bridge/router.rs Outdated
.ok_or_else(|| engine_err("load thread", "thread not found"))?;
let thread = match state.store.load_thread(pending.thread_id).await {
Ok(Some(t)) => t,
Ok(None) | Err(_) => {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Propagate thread-load errors instead of expiring gate

The new Ok(None) | Err(_) match arm in execute_pending_gate_action treats any storage failure as a deleted thread and emits an expired resolution. If load_thread fails transiently (DB/network/backend error), we now dismiss the gate as if it were stale, even though the thread may still exist. Because resolve_gate has already removed the gate via take_verified, the user cannot retry approval and the waiting thread is left unresolved; this should only map Ok(None) to expired and keep Err(e) as an error path.

Useful? React with 👍 / 👎.

Comment thread src/bridge/router.rs Outdated
},
);
}
return Ok(Some("Thread no longer exists. Approval dismissed.".into()));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fail stale-gate path to trigger auto-approve rollback

Returning success on missing-thread gate execution breaks the existing rollback contract for “approve always.” In resolve_gate, auto-approve is enabled before calling this helper and only revoked when the result is Err; with this new Ok(...) return, approving a stale/deleted gate can persist tool auto-approval even though no action was resumed. This should remain an error (or explicitly revoke) so stale approvals do not silently grant lasting permission.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces logic to handle expired pending gates, specifically when a thread is deleted or becomes unreachable. Key changes include updating the frontend to process the 'expired' resolution, adding a discard_for_thread method to the gate store for bulk cleanup, and modifying the bridge router to emit expiration events when a thread cannot be found. Feedback was provided regarding the error handling in the router, noting that grouping database errors with missing threads could lead to transient system failures being incorrectly treated as permanent gate expirations.

Comment thread src/bridge/router.rs Outdated
Comment on lines +364 to +391
Ok(None) | Err(_) => {
// Thread was deleted or became unreachable while the gate was
// pending. Emit a "gate_resolved" event with resolution "expired"
// so the frontend can dismiss the stale approval card.
tracing::debug!(
thread_id = %pending.thread_id,
gate = %pending.gate_name,
action = %pending.action_name,
"thread not found for pending gate; emitting expired resolution"
);
if let Some(ref sse) = state.sse {
sse.broadcast_for_user(
&message.user_id,
AppEvent::GateResolved {
request_id: pending.request_id.to_string(),
gate_name: pending.gate_name.clone(),
tool_name: pending.action_name.clone(),
resolution: "expired".into(),
message: "Thread no longer exists.".into(),
thread_id: pending
.scope_thread_id
.clone()
.or_else(|| Some(pending.thread_id.to_string())),
},
);
}
return Ok(Some("Thread no longer exists. Approval dismissed.".into()));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The Err(_) case from load_thread is grouped with Ok(None). This swallows potential system errors (e.g., database connection issues) and treats them as if the thread was deleted. Transient errors should be propagated to allow for retries rather than permanently dismissing the gate with an 'expired' resolution.

        Ok(None) => {
            // Thread was deleted or became unreachable while the gate was
            // pending. Emit a "gate_resolved" event with resolution "expired"
            // so the frontend can dismiss the stale approval card.
            tracing::debug!(
                thread_id = %pending.thread_id,
                gate = %pending.gate_name,
                action = %pending.action_name,
                "thread not found for pending gate; emitting expired resolution"
            );
            if let Some(ref sse) = state.sse {
                sse.broadcast_for_user(
                    &message.user_id,
                    AppEvent::GateResolved {
                        request_id: pending.request_id.to_string(),
                        gate_name: pending.gate_name.clone(),
                        tool_name: pending.action_name.clone(),
                        resolution: "expired".into(),
                        message: "Thread no longer exists.".into(),
                        thread_id: pending
                            .scope_thread_id
                            .clone()
                            .or_else(|| Some(pending.thread_id.to_string())),
                    },
                );
            }
            return Ok(Some("Thread no longer exists. Approval dismissed.".into()));
        }
        Err(e) => return Err(engine_err("load thread", e)),
References
  1. Avoid silently swallowing database errors by using .ok().flatten() on results from database operations. Instead, handle errors explicitly using .map_err()? or transpose() to ensure failures are properly surfaced.
  2. Create specific error variants for different failure modes to provide semantically correct and clear error messages.

Comment thread src/bridge/router.rs Outdated
.ok_or_else(|| engine_err("load thread", "thread not found"))?;
let thread = match state.store.load_thread(pending.thread_id).await {
Ok(Some(t)) => t,
Ok(None) | Err(_) => {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — Ok(None) | Err(_) conflation permanently discards gate on transient DB error

This match arm collapses two fundamentally different situations: Ok(None) (thread genuinely deleted) vs Err(_) (database unreachable, deserialization error, transient failure).

By the time this code runs, take_verified at line ~1500 has already removed the gate from the store. If load_thread fails due to a transient DB error, the gate is silently discarded — the user gets "Thread no longer exists. Approval dismissed." when the thread still exists. The tool action is permanently lost.

Suggested fix: Split the arms:

Ok(None) => {
    // Thread genuinely deleted — emit expired resolution
    // ... existing expired logic ...
}
Err(e) => {
    // Transient failure — propagate error so caller can retry
    return Err(engine_err("load thread", e));
    // Or: re-insert the gate into the store as a compensating transaction
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f82b60d. Split the Ok(None) | Err(_) arm into two separate arms:

  • Ok(None) -- thread genuinely deleted, emits GateResolved with "expired" resolution (preserves the original UX behavior)
  • Err(e) -- transient DB failure, propagates the error via engine_err("load thread", e) so the caller can retry instead of permanently discarding the gate

The gate remains in the PendingGateStore on transient failure and can be retried on the next approval attempt.

Comment thread src/gate/store.rs
/// Returns the gates that were removed. Used when a thread is deleted or
/// becomes unreachable while gates are still pending — prevents orphaned
/// gates that can never be resolved.
pub async fn discard_for_thread(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — discard_for_thread() is dead code

This new public method is never called from production code. The PR description says it is "usable from conversation clear and future thread deletion paths," but the conversation clear path at router.rs:2134-2140 still uses per-key discard() with only the requesting user's ID.

If a thread has gates from multiple users (as the test at line ~711 demonstrates with "alice" and "bob"), the existing clear path only clears the current user's gate — the other user's gate remains orphaned. The new method solves this but is never wired in.

Suggested fix: Either wire discard_for_thread into clear_engine_conversation (replacing the per-key loop), or remove the method and add it when it's actually needed. Dead public API surface creates maintenance burden and false confidence.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in f82b60d. Wired discard_for_thread() into the conversation clear path (clear_engine_conversation), replacing the per-user discard(&PendingGateKey{...}) call.

The per-user discard was insufficient because a thread can have pending gates from multiple users. discard_for_thread() removes all gates for the thread regardless of user, which is the correct behavior when a thread is being torn down.

Comment thread src/bridge/router.rs Outdated
.ok_or_else(|| engine_err("load thread", "thread not found"))?;
let thread = match state.store.load_thread(pending.thread_id).await {
Ok(Some(t)) => t,
Ok(None) | Err(_) => {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Err(_) wildcard discards the actual error

The Err(_) arm discards the error value with a wildcard pattern. The tracing::debug! logs thread_id, gate_name, and action_name, but not the error itself. If this was a database connection failure, an operator would see "thread not found for pending gate" with no indication the real problem was a connection timeout or deserialization failure.

Suggested fix: At minimum, bind the error: Err(e) => { tracing::debug!(error = %e, ...); ... }. Better yet, handle Err separately as noted in finding #1.

data.resolution === 'credential_provided'
|| data.resolution === 'cancelled'
|| data.resolution === 'external_callback'
|| data.resolution === 'expired'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — expired resolution globally clears ALL auth cards

When resolution === 'expired', the code removes the approval card by request_id (correct), then calls removeAuthCard() (which removes all auth cards regardless of request ID) and enableChatInput(). But for resolutions like 'approved', the function only removes the specific approval card and does NOT call enableChatInput().

If two gates are pending simultaneously (different tools), resolving one as expired would dismiss the other's auth card too.

Suggested fix: Make expired only remove the specific card (which the first line already does) and skip the removeAuthCard() + enableChatInput() call, OR document why expired needs the aggressive cleanup.

Comment thread src/bridge/router.rs
.await
.map_err(|e| engine_err("load thread", e))?
.ok_or_else(|| engine_err("load thread", "thread not found"))?;
let thread = match state.store.load_thread(pending.thread_id).await {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — TOCTOU between take_verified and load_thread

There is a race window between take_verified (which removes the gate from the store) and load_thread. If the thread is deleted in this window, the gate is gone from the store but was never executed.

The thread-deleting user gets no notification that they caused a gate to be orphaned — only the gate-resolving user's SSE gets the expired event.

Suggested fix: Consider emitting a notification/warning to the user who deleted the thread (if different from the gate resolver).

Copy link
Copy Markdown
Collaborator

@serrrfirat serrrfirat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Paranoid Architect Review — REQUEST CHANGES

2 High, 3 Medium findings.

The orphaned-gate handling is a real problem that needs solving, and the expired resolution + SSE broadcast approach is reasonable. However:

  • High: Ok(None) | Err(_) conflation permanently discards gates on transient DB failures. A transient load_thread error should not be treated the same as a genuinely deleted thread — especially since take_verified has already removed the gate from the store.
  • High: discard_for_thread() is dead code — tested but never called from production paths. The existing clear_engine_conversation still uses per-user discard() which can't clean cross-user orphans.
  • Medium: Error value discarded in wildcard, frontend expired handling is overly aggressive (clears all auth cards), and no notification to thread-deleting user.

Ship-blocker: Split the Ok(None) | Err(_) arm. Transient errors must propagate, not silently discard gates.

zmanian added a commit that referenced this pull request Apr 14, 2026
…o production

Address review feedback on #2347:

1. Split the conflated `Ok(None) | Err(_)` match arm in
   `execute_pending_gate_action`: `Ok(None)` means the thread was
   genuinely deleted (emit "expired" gate resolution), while `Err(_)`
   indicates a transient DB failure (propagate the error so the caller
   can retry instead of permanently discarding the gate).

2. Wire `discard_for_thread()` into the conversation clear path
   (`clear_engine_conversation`), replacing the per-user `discard()`
   call. This ensures all pending gates for a thread are cleaned up
   regardless of which user created them, preventing orphaned gates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zmanian zmanian requested a review from serrrfirat April 14, 2026 08:40
zmanian added a commit that referenced this pull request Apr 16, 2026
…o production

Address review feedback on #2347:

1. Split the conflated `Ok(None) | Err(_)` match arm in
   `execute_pending_gate_action`: `Ok(None)` means the thread was
   genuinely deleted (emit "expired" gate resolution), while `Err(_)`
   indicates a transient DB failure (propagate the error so the caller
   can retry instead of permanently discarding the gate).

2. Wire `discard_for_thread()` into the conversation clear path
   (`clear_engine_conversation`), replacing the per-user `discard()`
   call. This ensures all pending gates for a thread are cleaned up
   regardless of which user created them, preventing orphaned gates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zmanian zmanian force-pushed the fix/2323-orphaned-approval-gates branch from f82b60d to d382f09 Compare April 16, 2026 16:14
@github-actions github-actions bot added scope: channel/cli TUI / CLI channel scope: channel/web Web gateway channel scope: tool/builtin Built-in tools scope: llm LLM integration scope: setup Onboarding / setup size: L 200-499 changed lines risk: high Safety, secrets, auth, or critical infrastructure and removed size: M 50-199 changed lines risk: low Changes to docs, tests, or low-risk modules labels Apr 16, 2026
zmanian added a commit that referenced this pull request Apr 17, 2026
…o production

Address review feedback on #2347:

1. Split the conflated `Ok(None) | Err(_)` match arm in
   `execute_pending_gate_action`: `Ok(None)` means the thread was
   genuinely deleted (emit "expired" gate resolution), while `Err(_)`
   indicates a transient DB failure (propagate the error so the caller
   can retry instead of permanently discarding the gate).

2. Wire `discard_for_thread()` into the conversation clear path
   (`clear_engine_conversation`), replacing the per-user `discard()`
   call. This ensures all pending gates for a thread are cleaned up
   regardless of which user created them, preventing orphaned gates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zmanian zmanian force-pushed the fix/2323-orphaned-approval-gates branch from 7931225 to ab37519 Compare April 17, 2026 07:34
@github-actions github-actions bot added size: M 50-199 changed lines risk: low Changes to docs, tests, or low-risk modules scope: channel/wasm WASM channel runtime risk: medium Business logic, config, or moderate-risk modules and removed size: L 200-499 changed lines risk: high Safety, secrets, auth, or critical infrastructure risk: low Changes to docs, tests, or low-risk modules labels Apr 17, 2026
zmanian and others added 2 commits April 19, 2026 01:18
When a thread is deleted while an approval gate is pending, the gate
becomes orphaned -- unresolvable on backend and undismissable on
frontend. This fixes three root causes:

1. execute_pending_gate_action now detects missing threads and emits a
   gate_resolved event with resolution "expired" instead of erroring,
   so the frontend can dismiss the stale card.

2. PendingGateStore gains discard_for_thread() for bulk cleanup of all
   gates tied to a specific thread, regardless of user.

3. Frontend handleGateResolved now handles "expired" resolution to
   remove stale approval cards and re-enable chat input.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o production

Address review feedback on #2347:

1. Split the conflated `Ok(None) | Err(_)` match arm in
   `execute_pending_gate_action`: `Ok(None)` means the thread was
   genuinely deleted (emit "expired" gate resolution), while `Err(_)`
   indicates a transient DB failure (propagate the error so the caller
   can retry instead of permanently discarding the gate).

2. Wire `discard_for_thread()` into the conversation clear path
   (`clear_engine_conversation`), replacing the per-user `discard()`
   call. This ensures all pending gates for a thread are cleaned up
   regardless of which user created them, preventing orphaned gates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zmanian zmanian force-pushed the fix/2323-orphaned-approval-gates branch from b2a2cb4 to a5eb886 Compare April 19, 2026 01:35
@github-actions github-actions bot added risk: low Changes to docs, tests, or low-risk modules and removed risk: medium Business logic, config, or moderate-risk modules labels Apr 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules scope: channel/cli TUI / CLI channel scope: channel/wasm WASM channel runtime scope: channel/web Web gateway channel scope: dependencies Dependency updates scope: llm LLM integration scope: setup Onboarding / setup scope: tool/builtin Built-in tools size: M 50-199 changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Orphaned approval gates when thread is deleted but mission still active

2 participants