Skip to content

fix(bridge): sanitize orchestrator failures before showing them to users (#2546)#2747

Merged
serrrfirat merged 2 commits intostagingfrom
fix/2546-orchestrator-502-error-message
Apr 20, 2026
Merged

fix(bridge): sanitize orchestrator failures before showing them to users (#2546)#2747
serrrfirat merged 2 commits intostagingfrom
fix/2546-orchestrator-502-error-message

Conversation

@ilblackdragon
Copy link
Copy Markdown
Member

Summary

  • Introduces src/bridge/user_facing_errors.rs with a FailureCategory classifier that maps raw ThreadOutcome::Failed errors (Rust error chains, Python tracebacks, upstream HTTP bodies) to short user-safe messages.
  • Routes ThreadOutcome::Failed through bridge_outcome_for_failed_thread in src/bridge/router.rs, which logs the full raw error server-side and returns a sanitized user-facing message.
  • Prevents HTTP 502 Bad Gateway, Python tracebacks (file paths, line 907, RuntimeError), and internal Rust error chains from leaking into the chat UI.

Fixes #2546.

Test plan

Notes

Aligns with the new .claude/rules/error-handling.md rule ("Error Boundaries at the Channel Edge") that landed on staging in #2714 and which explicitly references this issue.

The status-badge-stuck-on-"Loading..." symptom mentioned in the issue is a separate frontend concern (gateway SSE/UI) and is not addressed here.

🤖 Generated with Claude Code

…ers (#2546)

When the engine returned `ThreadOutcome::Failed { error }` the raw string
reached the user verbatim through `BridgeOutcome::Respond(format!("Error: {error}"))`.
The string includes multiple layers of wrapping (`Orchestrator error: effect
execution error: ...`), the Monty-hosted Python traceback with internal file
paths (`File "orchestrator.py", line 907`), and the upstream HTTP body
(`HTTP 502 Bad Gateway`). This is what the QA bug bash reported: a 502 from
the LLM provider surfaced the whole stack to a user on staging.

Add a shared `bridge::user_facing_errors` module that classifies failure
strings into intent-level categories (LLM unavailable, rate-limited, context
too large, auth failure, iteration limit, unknown) and returns a short,
user-safe message for each. Route `ThreadOutcome::Failed` through a named
helper (`bridge_outcome_for_failed_thread`) that logs the full raw error
server-side via `tracing::warn!` and responds with the sanitized text.

Extensive unit tests cover both the classifier and the router-side helper,
including the exact 502 traceback from the issue as a regression fixture
plus 413 (#2276) and context-length (#2408) variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 20, 2026 15:24
@github-actions github-actions Bot added size: L 200-499 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Apr 20, 2026
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a sanitization mechanism for engine errors to ensure that internal diagnostic information, such as Python tracebacks and raw HTTP responses, is not exposed to users. A new user_facing_errors module classifies raw error strings into categories and returns user-friendly messages, while the bridge router is updated to log the full error server-side before sanitization. Review feedback suggests narrowing several keyword matching patterns in the classification logic—specifically 'tokens used', 'unauthorized', and 'request failed'—to prevent potential false positives and ensure more accurate error reporting.

Comment thread src/bridge/user_facing_errors.rs Outdated
Comment on lines +108 to +109
|| lower.contains("context_length_exceeded")
|| lower.contains("tokens used")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The pattern "tokens used" is overly broad and may match informational messages or unrelated errors that happen to mention token usage. Since "context length exceeded" and "http 413" already cover the standard failure modes for context limits (including the variant in issue #2408), this pattern can be safely removed to avoid false positives.

Suggested change
|| lower.contains("context_length_exceeded")
|| lower.contains("tokens used")
|| lower.contains("context_length_exceeded")
References
  1. When detecting keywords in a string, avoid simple substring containment of broad phrases to prevent false positives.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — agreed. Removed the "tokens used" substring match in 925c47b. The explicit context length exceeded / context_length_exceeded / http 413 / payload too large markers already cover the real failure modes (including #2408's fixture). Existing context_length_exceeded_is_context_too_large test still passes via the context length exceeded prefix.

Comment thread src/bridge/user_facing_errors.rs Outdated
Comment on lines +120 to +121
|| lower.contains("session renewal failed")
|| lower.contains("unauthorized")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Matching on the generic string "unauthorized" might catch tool-level permission errors (e.g., a tool failing to access a specific resource) and incorrectly classify them as LLM provider authentication failures. This would lead to misleading user guidance suggesting they "reconnect the provider". It is safer to rely on more specific strings like "session renewal failed" or explicit HTTP status codes.

Suggested change
|| lower.contains("session renewal failed")
|| lower.contains("unauthorized")
|| lower.contains("session renewal failed")
References
  1. Avoid generic string matching that could lead to incorrect classification and misleading user guidance.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — agreed. Tightened the AuthFailure arm in 925c47b: replaced bare "unauthorized" with specific markers (401 unauthorized, invalid api key, invalid_api_key). The existing http 401 / http 403 / authentication failed / session expired / session renewal failed markers are kept. Added bare_unauthorized_is_not_auth_failure to prevent regression — a tool-level "unauthorized to access /etc/shadow" error no longer misclassifies as LLM provider auth failure.

Comment thread src/bridge/user_facing_errors.rs Outdated
Comment on lines +136 to +138
|| lower.contains("llm call failed")
|| lower.contains("provider nearai")
|| lower.contains("request failed")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The patterns "provider nearai" and "request failed" are very generic and likely to cause false positives. "request failed" in particular could match almost any network or logic error, leading to an incorrect LlmUnavailable classification and potentially misleading retry advice. Since the regression cases from issues #2546 and #2276 are already covered by more specific patterns like "http 502" and "llm call failed", these broad keywords should be removed.

Suggested change
|| lower.contains("llm call failed")
|| lower.contains("provider nearai")
|| lower.contains("request failed")
|| lower.contains("llm call failed")
References
  1. Use specific patterns instead of broad keywords to avoid false positives in error detection and classification.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — agreed. Removed both "provider nearai" and "request failed" from the LlmUnavailable arm in 925c47b. The arm now relies on explicit markers: http 502/503/504, bad gateway, service unavailable, gateway timeout, upstream connect error, upstream, provider temporarily unavailable, and llm call failed. The #2546 regression fixture still classifies correctly because it contains HTTP 502 Bad Gateway (verified by the existing issue_2546_* tests).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a bridge-level error boundary to prevent raw orchestrator/engine failure strings (tracebacks, internal paths, upstream HTTP bodies) from being shown to end users, replacing them with short sanitized messages while keeping full detail in server logs.

Changes:

  • Introduce bridge::user_facing_errors to classify raw ThreadOutcome::Failed strings and generate user-safe messages.
  • Route ThreadOutcome::Failed through a new router helper that logs the raw error and responds with the sanitized message.
  • Add unit tests for classification/sanitization and caller-level router wiring regression tests.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/bridge/user_facing_errors.rs New sanitizer + classifier for failed-thread error strings, with unit tests.
src/bridge/router.rs Wires failed thread outcomes through sanitizer and adds caller-level regression tests.
src/bridge/mod.rs Registers the new user_facing_errors module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +126 to +141
// Provider unavailability. Matches the exact shape of the issue
// #2546 traceback (`HTTP 502 Bad Gateway`) and also covers 503/504
// and common upstream-timeout phrasings.
if lower.contains("http 502")
|| lower.contains("http 503")
|| lower.contains("http 504")
|| lower.contains("bad gateway")
|| lower.contains("service unavailable")
|| lower.contains("gateway timeout")
|| lower.contains("upstream connect error")
|| lower.contains("llm call failed")
|| lower.contains("provider nearai")
|| lower.contains("request failed")
{
return FailureCategory::LlmUnavailable;
}
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

classify_failure’s LlmUnavailable branch matches on very broad substrings ("request failed", and to a lesser extent "provider nearai"). This will misclassify non-5xx provider failures (e.g. "Provider openai_codex request failed: HTTP 400 Bad Request ...") as LlmUnavailable, showing the user the wrong guidance. Consider removing "request failed" from this branch (and relying on 502/503/504 + gateway/service-unavailable keywords), or gating it behind an explicit 5xx/timeout indicator (e.g. http 5, timeout, connect error) so 4xx/other failures fall back to Unknown/more specific categories.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — addressed in 925c47b. Removed both "request failed" and "provider nearai" from the LlmUnavailable substring set. A string like Provider openai_codex request failed: HTTP 400 Bad Request now correctly falls through to Unknown instead of being labeled as transient unavailability. Added non_5xx_provider_failure_is_not_llm_unavailable using exactly the fixture you suggested as regression coverage. The #2546 502 path still matches via the explicit http 502 / bad gateway markers.

Comment thread src/bridge/user_facing_errors.rs Outdated
.into()
}
FailureCategory::AuthFailure => {
"The AI model could not authenticate. Please reconnect the provider and try again."
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AuthFailure user-facing text says “Please reconnect the provider…”, but this router is used across channels (e.g. web/telegram). “Reconnect” is web-UI-specific and may be unclear elsewhere. Consider rewording to a channel-agnostic instruction like “Please re-authenticate the provider and try again.”

Suggested change
"The AI model could not authenticate. Please reconnect the provider and try again."
"The AI model could not authenticate. Please re-authenticate the provider and try again."

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks — good catch. The router is shared across web / telegram / CLI so "reconnect" was too web-specific. Changed to "The AI model could not authenticate. Please re-authenticate the provider and try again." in 925c47b. Added auth_failure_message_is_channel_agnostic to assert the copy never regresses to channel-specific verbs like "reconnect".

…view

- Drop overly broad substring matches ("tokens used", "unauthorized",
  "request failed", "provider nearai") that caused misclassification.
- Reword AuthFailure user message to be channel-agnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size: XL 500+ changed lines and removed size: L 200-499 changed lines labels Apr 20, 2026
@serrrfirat serrrfirat merged commit 336bdb1 into staging Apr 20, 2026
18 checks passed
@serrrfirat serrrfirat deleted the fix/2546-orchestrator-502-error-message branch April 20, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor: core 20+ merged PRs risk: low Changes to docs, tests, or low-risk modules size: XL 500+ changed lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug Bash 4/16] Orchestrator surfaces raw HTTP 502 Bad Gateway error to user

3 participants