fix(gate): handle orphaned approval gates when thread deleted#2347
fix(gate): handle orphaned approval gates when thread deleted#2347
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd11a6164c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| .ok_or_else(|| engine_err("load thread", "thread not found"))?; | ||
| let thread = match state.store.load_thread(pending.thread_id).await { | ||
| Ok(Some(t)) => t, | ||
| Ok(None) | Err(_) => { |
There was a problem hiding this comment.
Propagate thread-load errors instead of expiring gate
The new Ok(None) | Err(_) match arm in execute_pending_gate_action treats any storage failure as a deleted thread and emits an expired resolution. If load_thread fails transiently (DB/network/backend error), we now dismiss the gate as if it were stale, even though the thread may still exist. Because resolve_gate has already removed the gate via take_verified, the user cannot retry approval and the waiting thread is left unresolved; this should only map Ok(None) to expired and keep Err(e) as an error path.
Useful? React with 👍 / 👎.
| }, | ||
| ); | ||
| } | ||
| return Ok(Some("Thread no longer exists. Approval dismissed.".into())); |
There was a problem hiding this comment.
Fail stale-gate path to trigger auto-approve rollback
Returning success on missing-thread gate execution breaks the existing rollback contract for “approve always.” In resolve_gate, auto-approve is enabled before calling this helper and only revoked when the result is Err; with this new Ok(...) return, approving a stale/deleted gate can persist tool auto-approval even though no action was resumed. This should remain an error (or explicitly revoke) so stale approvals do not silently grant lasting permission.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request introduces logic to handle expired pending gates, specifically when a thread is deleted or becomes unreachable. Key changes include updating the frontend to process the 'expired' resolution, adding a discard_for_thread method to the gate store for bulk cleanup, and modifying the bridge router to emit expiration events when a thread cannot be found. Feedback was provided regarding the error handling in the router, noting that grouping database errors with missing threads could lead to transient system failures being incorrectly treated as permanent gate expirations.
| Ok(None) | Err(_) => { | ||
| // Thread was deleted or became unreachable while the gate was | ||
| // pending. Emit a "gate_resolved" event with resolution "expired" | ||
| // so the frontend can dismiss the stale approval card. | ||
| tracing::debug!( | ||
| thread_id = %pending.thread_id, | ||
| gate = %pending.gate_name, | ||
| action = %pending.action_name, | ||
| "thread not found for pending gate; emitting expired resolution" | ||
| ); | ||
| if let Some(ref sse) = state.sse { | ||
| sse.broadcast_for_user( | ||
| &message.user_id, | ||
| AppEvent::GateResolved { | ||
| request_id: pending.request_id.to_string(), | ||
| gate_name: pending.gate_name.clone(), | ||
| tool_name: pending.action_name.clone(), | ||
| resolution: "expired".into(), | ||
| message: "Thread no longer exists.".into(), | ||
| thread_id: pending | ||
| .scope_thread_id | ||
| .clone() | ||
| .or_else(|| Some(pending.thread_id.to_string())), | ||
| }, | ||
| ); | ||
| } | ||
| return Ok(Some("Thread no longer exists. Approval dismissed.".into())); | ||
| } |
There was a problem hiding this comment.
The Err(_) case from load_thread is grouped with Ok(None). This swallows potential system errors (e.g., database connection issues) and treats them as if the thread was deleted. Transient errors should be propagated to allow for retries rather than permanently dismissing the gate with an 'expired' resolution.
Ok(None) => {
// Thread was deleted or became unreachable while the gate was
// pending. Emit a "gate_resolved" event with resolution "expired"
// so the frontend can dismiss the stale approval card.
tracing::debug!(
thread_id = %pending.thread_id,
gate = %pending.gate_name,
action = %pending.action_name,
"thread not found for pending gate; emitting expired resolution"
);
if let Some(ref sse) = state.sse {
sse.broadcast_for_user(
&message.user_id,
AppEvent::GateResolved {
request_id: pending.request_id.to_string(),
gate_name: pending.gate_name.clone(),
tool_name: pending.action_name.clone(),
resolution: "expired".into(),
message: "Thread no longer exists.".into(),
thread_id: pending
.scope_thread_id
.clone()
.or_else(|| Some(pending.thread_id.to_string())),
},
);
}
return Ok(Some("Thread no longer exists. Approval dismissed.".into()));
}
Err(e) => return Err(engine_err("load thread", e)),References
- Avoid silently swallowing database errors by using .ok().flatten() on results from database operations. Instead, handle errors explicitly using .map_err()? or transpose() to ensure failures are properly surfaced.
- Create specific error variants for different failure modes to provide semantically correct and clear error messages.
dd0a652 to
39f390e
Compare
| .ok_or_else(|| engine_err("load thread", "thread not found"))?; | ||
| let thread = match state.store.load_thread(pending.thread_id).await { | ||
| Ok(Some(t)) => t, | ||
| Ok(None) | Err(_) => { |
There was a problem hiding this comment.
High — Ok(None) | Err(_) conflation permanently discards gate on transient DB error
This match arm collapses two fundamentally different situations: Ok(None) (thread genuinely deleted) vs Err(_) (database unreachable, deserialization error, transient failure).
By the time this code runs, take_verified at line ~1500 has already removed the gate from the store. If load_thread fails due to a transient DB error, the gate is silently discarded — the user gets "Thread no longer exists. Approval dismissed." when the thread still exists. The tool action is permanently lost.
Suggested fix: Split the arms:
Ok(None) => {
// Thread genuinely deleted — emit expired resolution
// ... existing expired logic ...
}
Err(e) => {
// Transient failure — propagate error so caller can retry
return Err(engine_err("load thread", e));
// Or: re-insert the gate into the store as a compensating transaction
}There was a problem hiding this comment.
Fixed in f82b60d. Split the Ok(None) | Err(_) arm into two separate arms:
Ok(None)-- thread genuinely deleted, emitsGateResolvedwith"expired"resolution (preserves the original UX behavior)Err(e)-- transient DB failure, propagates the error viaengine_err("load thread", e)so the caller can retry instead of permanently discarding the gate
The gate remains in the PendingGateStore on transient failure and can be retried on the next approval attempt.
| /// Returns the gates that were removed. Used when a thread is deleted or | ||
| /// becomes unreachable while gates are still pending — prevents orphaned | ||
| /// gates that can never be resolved. | ||
| pub async fn discard_for_thread( |
There was a problem hiding this comment.
High — discard_for_thread() is dead code
This new public method is never called from production code. The PR description says it is "usable from conversation clear and future thread deletion paths," but the conversation clear path at router.rs:2134-2140 still uses per-key discard() with only the requesting user's ID.
If a thread has gates from multiple users (as the test at line ~711 demonstrates with "alice" and "bob"), the existing clear path only clears the current user's gate — the other user's gate remains orphaned. The new method solves this but is never wired in.
Suggested fix: Either wire discard_for_thread into clear_engine_conversation (replacing the per-key loop), or remove the method and add it when it's actually needed. Dead public API surface creates maintenance burden and false confidence.
There was a problem hiding this comment.
Fixed in f82b60d. Wired discard_for_thread() into the conversation clear path (clear_engine_conversation), replacing the per-user discard(&PendingGateKey{...}) call.
The per-user discard was insufficient because a thread can have pending gates from multiple users. discard_for_thread() removes all gates for the thread regardless of user, which is the correct behavior when a thread is being torn down.
| .ok_or_else(|| engine_err("load thread", "thread not found"))?; | ||
| let thread = match state.store.load_thread(pending.thread_id).await { | ||
| Ok(Some(t)) => t, | ||
| Ok(None) | Err(_) => { |
There was a problem hiding this comment.
Medium — Err(_) wildcard discards the actual error
The Err(_) arm discards the error value with a wildcard pattern. The tracing::debug! logs thread_id, gate_name, and action_name, but not the error itself. If this was a database connection failure, an operator would see "thread not found for pending gate" with no indication the real problem was a connection timeout or deserialization failure.
Suggested fix: At minimum, bind the error: Err(e) => { tracing::debug!(error = %e, ...); ... }. Better yet, handle Err separately as noted in finding #1.
| data.resolution === 'credential_provided' | ||
| || data.resolution === 'cancelled' | ||
| || data.resolution === 'external_callback' | ||
| || data.resolution === 'expired' |
There was a problem hiding this comment.
Medium — expired resolution globally clears ALL auth cards
When resolution === 'expired', the code removes the approval card by request_id (correct), then calls removeAuthCard() (which removes all auth cards regardless of request ID) and enableChatInput(). But for resolutions like 'approved', the function only removes the specific approval card and does NOT call enableChatInput().
If two gates are pending simultaneously (different tools), resolving one as expired would dismiss the other's auth card too.
Suggested fix: Make expired only remove the specific card (which the first line already does) and skip the removeAuthCard() + enableChatInput() call, OR document why expired needs the aggressive cleanup.
| .await | ||
| .map_err(|e| engine_err("load thread", e))? | ||
| .ok_or_else(|| engine_err("load thread", "thread not found"))?; | ||
| let thread = match state.store.load_thread(pending.thread_id).await { |
There was a problem hiding this comment.
Medium — TOCTOU between take_verified and load_thread
There is a race window between take_verified (which removes the gate from the store) and load_thread. If the thread is deleted in this window, the gate is gone from the store but was never executed.
The thread-deleting user gets no notification that they caused a gate to be orphaned — only the gate-resolving user's SSE gets the expired event.
Suggested fix: Consider emitting a notification/warning to the user who deleted the thread (if different from the gate resolver).
serrrfirat
left a comment
There was a problem hiding this comment.
Paranoid Architect Review — REQUEST CHANGES
2 High, 3 Medium findings.
The orphaned-gate handling is a real problem that needs solving, and the expired resolution + SSE broadcast approach is reasonable. However:
- High:
Ok(None) | Err(_)conflation permanently discards gates on transient DB failures. A transientload_threaderror should not be treated the same as a genuinely deleted thread — especially sincetake_verifiedhas already removed the gate from the store. - High:
discard_for_thread()is dead code — tested but never called from production paths. The existingclear_engine_conversationstill uses per-userdiscard()which can't clean cross-user orphans. - Medium: Error value discarded in wildcard, frontend
expiredhandling is overly aggressive (clears all auth cards), and no notification to thread-deleting user.
Ship-blocker: Split the Ok(None) | Err(_) arm. Transient errors must propagate, not silently discard gates.
…o production Address review feedback on #2347: 1. Split the conflated `Ok(None) | Err(_)` match arm in `execute_pending_gate_action`: `Ok(None)` means the thread was genuinely deleted (emit "expired" gate resolution), while `Err(_)` indicates a transient DB failure (propagate the error so the caller can retry instead of permanently discarding the gate). 2. Wire `discard_for_thread()` into the conversation clear path (`clear_engine_conversation`), replacing the per-user `discard()` call. This ensures all pending gates for a thread are cleaned up regardless of which user created them, preventing orphaned gates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o production Address review feedback on #2347: 1. Split the conflated `Ok(None) | Err(_)` match arm in `execute_pending_gate_action`: `Ok(None)` means the thread was genuinely deleted (emit "expired" gate resolution), while `Err(_)` indicates a transient DB failure (propagate the error so the caller can retry instead of permanently discarding the gate). 2. Wire `discard_for_thread()` into the conversation clear path (`clear_engine_conversation`), replacing the per-user `discard()` call. This ensures all pending gates for a thread are cleaned up regardless of which user created them, preventing orphaned gates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
f82b60d to
d382f09
Compare
…o production Address review feedback on #2347: 1. Split the conflated `Ok(None) | Err(_)` match arm in `execute_pending_gate_action`: `Ok(None)` means the thread was genuinely deleted (emit "expired" gate resolution), while `Err(_)` indicates a transient DB failure (propagate the error so the caller can retry instead of permanently discarding the gate). 2. Wire `discard_for_thread()` into the conversation clear path (`clear_engine_conversation`), replacing the per-user `discard()` call. This ensures all pending gates for a thread are cleaned up regardless of which user created them, preventing orphaned gates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
7931225 to
ab37519
Compare
When a thread is deleted while an approval gate is pending, the gate becomes orphaned -- unresolvable on backend and undismissable on frontend. This fixes three root causes: 1. execute_pending_gate_action now detects missing threads and emits a gate_resolved event with resolution "expired" instead of erroring, so the frontend can dismiss the stale card. 2. PendingGateStore gains discard_for_thread() for bulk cleanup of all gates tied to a specific thread, regardless of user. 3. Frontend handleGateResolved now handles "expired" resolution to remove stale approval cards and re-enable chat input. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…o production Address review feedback on #2347: 1. Split the conflated `Ok(None) | Err(_)` match arm in `execute_pending_gate_action`: `Ok(None)` means the thread was genuinely deleted (emit "expired" gate resolution), while `Err(_)` indicates a transient DB failure (propagate the error so the caller can retry instead of permanently discarding the gate). 2. Wire `discard_for_thread()` into the conversation clear path (`clear_engine_conversation`), replacing the per-user `discard()` call. This ensures all pending gates for a thread are cleaned up regardless of which user created them, preventing orphaned gates. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b2a2cb4 to
a5eb886
Compare
Summary
Fixes #2323. When a thread is deleted while an approval gate is pending, the gate becomes orphaned -- unresolvable on backend and undismissable on frontend.
execute_pending_gate_actionnow detects missing threads and emitsgate_resolvedwithresolution: "expired"instead of returning an opaque error. The gate (already taken from the store) is gracefully dismissed.PendingGateStore::discard_for_thread(thread_id)added for bulk cleanup of all gates tied to a thread, usable from conversation clear and future thread deletion paths.handleGateResolvednow handlesresolution: "expired"to remove stale approval/auth cards and re-enable chat input.Test plan
cargo test -p ironclaw --lib gate-- 22 tests pass, including 2 new regression tests:test_discard_for_thread_removes_all_gates-- verifies bulk cleanup removes all gates for a thread across userstest_discard_for_thread_returns_empty_when_no_gates-- verifies no-op on absent threadcargo clippy --all --benches --tests --examples --all-features-- zero warningscargo fmt --check-- cleanGenerated with Claude Code