fix(bridge): sanitize orchestrator failures before showing them to users (#2546) by ilblackdragon · Pull Request #2747 · nearai/ironclaw

ilblackdragon · 2026-04-20T15:24:10Z

Summary

Introduces src/bridge/user_facing_errors.rs with a FailureCategory classifier that maps raw ThreadOutcome::Failed errors (Rust error chains, Python tracebacks, upstream HTTP bodies) to short user-safe messages.
Routes ThreadOutcome::Failed through bridge_outcome_for_failed_thread in src/bridge/router.rs, which logs the full raw error server-side and returns a sanitized user-facing message.
Prevents HTTP 502 Bad Gateway, Python tracebacks (file paths, line 907, RuntimeError), and internal Rust error chains from leaking into the chat UI.

Test plan

cargo fmt clean
cargo clippy --all --benches --tests --examples --all-features — zero warnings
21 new tests (18 unit tests in user_facing_errors, 3 caller-level tests in router::tests)
Includes the exact [Bug Bash 4/16] Orchestrator surfaces raw HTTP 502 Bad Gateway error to user #2546 traceback as a regression fixture; 413 ([QA] Orchestrator crashes with HTTP 413 Payload Too Large from nearai_chat provider #2276) and context-length ([QA] Context length overflow: input exceeds model's 200K token limit #2408) variants also covered
All 348 bridge tests pass

Notes

Aligns with the new .claude/rules/error-handling.md rule ("Error Boundaries at the Channel Edge") that landed on staging in #2714 and which explicitly references this issue.

The status-badge-stuck-on-"Loading..." symptom mentioned in the issue is a separate frontend concern (gateway SSE/UI) and is not addressed here.

🤖 Generated with Claude Code

…ers (#2546) When the engine returned `ThreadOutcome::Failed { error }` the raw string reached the user verbatim through `BridgeOutcome::Respond(format!("Error: {error}"))`. The string includes multiple layers of wrapping (`Orchestrator error: effect execution error: ...`), the Monty-hosted Python traceback with internal file paths (`File "orchestrator.py", line 907`), and the upstream HTTP body (`HTTP 502 Bad Gateway`). This is what the QA bug bash reported: a 502 from the LLM provider surfaced the whole stack to a user on staging. Add a shared `bridge::user_facing_errors` module that classifies failure strings into intent-level categories (LLM unavailable, rate-limited, context too large, auth failure, iteration limit, unknown) and returns a short, user-safe message for each. Route `ThreadOutcome::Failed` through a named helper (`bridge_outcome_for_failed_thread`) that logs the full raw error server-side via `tracing::warn!` and responds with the sanitized text. Extensive unit tests cover both the classifier and the router-side helper, including the exact 502 traceback from the issue as a regression fixture plus 413 (#2276) and context-length (#2408) variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request implements a sanitization mechanism for engine errors to ensure that internal diagnostic information, such as Python tracebacks and raw HTTP responses, is not exposed to users. A new user_facing_errors module classifies raw error strings into categories and returns user-friendly messages, while the bridge router is updated to log the full error server-side before sanitization. Review feedback suggests narrowing several keyword matching patterns in the classification logic—specifically 'tokens used', 'unauthorized', and 'request failed'—to prevent potential false positives and ensure more accurate error reporting.

gemini-code-assist · 2026-04-20T15:26:41Z

+        || lower.contains("context_length_exceeded")
+        || lower.contains("tokens used")


The pattern "tokens used" is overly broad and may match informational messages or unrelated errors that happen to mention token usage. Since "context length exceeded" and "http 413" already cover the standard failure modes for context limits (including the variant in issue #2408), this pattern can be safely removed to avoid false positives.

Suggested change

|| lower.contains("context_length_exceeded")

|| lower.contains("tokens used")

|| lower.contains("context_length_exceeded")

References

When detecting keywords in a string, avoid simple substring containment of broad phrases to prevent false positives.

Thanks — agreed. Removed the "tokens used" substring match in 925c47b. The explicit context length exceeded / context_length_exceeded / http 413 / payload too large markers already cover the real failure modes (including #2408's fixture). Existing context_length_exceeded_is_context_too_large test still passes via the context length exceeded prefix.

gemini-code-assist · 2026-04-20T15:26:41Z

+        || lower.contains("session renewal failed")
+        || lower.contains("unauthorized")


Matching on the generic string "unauthorized" might catch tool-level permission errors (e.g., a tool failing to access a specific resource) and incorrectly classify them as LLM provider authentication failures. This would lead to misleading user guidance suggesting they "reconnect the provider". It is safer to rely on more specific strings like "session renewal failed" or explicit HTTP status codes.

Suggested change

|| lower.contains("session renewal failed")

|| lower.contains("unauthorized")

|| lower.contains("session renewal failed")

References

Avoid generic string matching that could lead to incorrect classification and misleading user guidance.

Thanks — agreed. Tightened the AuthFailure arm in 925c47b: replaced bare "unauthorized" with specific markers (401 unauthorized, invalid api key, invalid_api_key). The existing http 401 / http 403 / authentication failed / session expired / session renewal failed markers are kept. Added bare_unauthorized_is_not_auth_failure to prevent regression — a tool-level "unauthorized to access /etc/shadow" error no longer misclassifies as LLM provider auth failure.

gemini-code-assist · 2026-04-20T15:26:41Z

+        || lower.contains("llm call failed")
+        || lower.contains("provider nearai")
+        || lower.contains("request failed")


The patterns "provider nearai" and "request failed" are very generic and likely to cause false positives. "request failed" in particular could match almost any network or logic error, leading to an incorrect LlmUnavailable classification and potentially misleading retry advice. Since the regression cases from issues #2546 and #2276 are already covered by more specific patterns like "http 502" and "llm call failed", these broad keywords should be removed.

Suggested change

|| lower.contains("llm call failed")

|| lower.contains("provider nearai")

|| lower.contains("request failed")

|| lower.contains("llm call failed")

References

Use specific patterns instead of broad keywords to avoid false positives in error detection and classification.

Thanks — agreed. Removed both "provider nearai" and "request failed" from the LlmUnavailable arm in 925c47b. The arm now relies on explicit markers: http 502/503/504, bad gateway, service unavailable, gateway timeout, upstream connect error, upstream, provider temporarily unavailable, and llm call failed. The #2546 regression fixture still classifies correctly because it contains HTTP 502 Bad Gateway (verified by the existing issue_2546_* tests).

Copilot

Pull request overview

Adds a bridge-level error boundary to prevent raw orchestrator/engine failure strings (tracebacks, internal paths, upstream HTTP bodies) from being shown to end users, replacing them with short sanitized messages while keeping full detail in server logs.

Changes:

Introduce bridge::user_facing_errors to classify raw ThreadOutcome::Failed strings and generate user-safe messages.
Route ThreadOutcome::Failed through a new router helper that logs the raw error and responds with the sanitized message.
Add unit tests for classification/sanitization and caller-level router wiring regression tests.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
src/bridge/user_facing_errors.rs	New sanitizer + classifier for failed-thread error strings, with unit tests.
src/bridge/router.rs	Wires failed thread outcomes through sanitizer and adds caller-level regression tests.
src/bridge/mod.rs	Registers the new `user_facing_errors` module.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-20T15:29:42Z

+    // Provider unavailability. Matches the exact shape of the issue
+    // #2546 traceback (`HTTP 502 Bad Gateway`) and also covers 503/504
+    // and common upstream-timeout phrasings.
+    if lower.contains("http 502")
+        || lower.contains("http 503")
+        || lower.contains("http 504")
+        || lower.contains("bad gateway")
+        || lower.contains("service unavailable")
+        || lower.contains("gateway timeout")
+        || lower.contains("upstream connect error")
+        || lower.contains("llm call failed")
+        || lower.contains("provider nearai")
+        || lower.contains("request failed")
+    {
+        return FailureCategory::LlmUnavailable;
+    }


classify_failure’s LlmUnavailable branch matches on very broad substrings ("request failed", and to a lesser extent "provider nearai"). This will misclassify non-5xx provider failures (e.g. "Provider openai_codex request failed: HTTP 400 Bad Request ...") as LlmUnavailable, showing the user the wrong guidance. Consider removing "request failed" from this branch (and relying on 502/503/504 + gateway/service-unavailable keywords), or gating it behind an explicit 5xx/timeout indicator (e.g. http 5, timeout, connect error) so 4xx/other failures fall back to Unknown/more specific categories.

Thanks — addressed in 925c47b. Removed both "request failed" and "provider nearai" from the LlmUnavailable substring set. A string like Provider openai_codex request failed: HTTP 400 Bad Request now correctly falls through to Unknown instead of being labeled as transient unavailability. Added non_5xx_provider_failure_is_not_llm_unavailable using exactly the fixture you suggested as regression coverage. The #2546 502 path still matches via the explicit http 502 / bad gateway markers.

Copilot · 2026-04-20T15:29:43Z

+                .into()
+        }
+        FailureCategory::AuthFailure => {
+            "The AI model could not authenticate. Please reconnect the provider and try again."


The AuthFailure user-facing text says “Please reconnect the provider…”, but this router is used across channels (e.g. web/telegram). “Reconnect” is web-UI-specific and may be unclear elsewhere. Consider rewording to a channel-agnostic instruction like “Please re-authenticate the provider and try again.”

Suggested change

"The AI model could not authenticate. Please reconnect the provider and try again."

"The AI model could not authenticate. Please re-authenticate the provider and try again."

Thanks — good catch. The router is shared across web / telegram / CLI so "reconnect" was too web-specific. Changed to "The AI model could not authenticate. Please re-authenticate the provider and try again." in 925c47b. Added auth_failure_message_is_channel_agnostic to assert the copy never regresses to channel-specific verbs like "reconnect".

…view - Drop overly broad substring matches ("tokens used", "unauthorized", "request failed", "provider nearai") that caused misclassification. - Reword AuthFailure user message to be channel-agnostic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 20, 2026 15:24

github-actions Bot added size: L 200-499 changed lines risk: low Changes to docs, tests, or low-risk modules contributor: core 20+ merged PRs labels Apr 20, 2026

Copilot started reviewing on behalf of ilblackdragon April 20, 2026 15:24 View session

gemini-code-assist Bot reviewed Apr 20, 2026

View reviewed changes

Copilot AI reviewed Apr 20, 2026

View reviewed changes

github-actions Bot added size: XL 500+ changed lines and removed size: L 200-499 changed lines labels Apr 20, 2026

serrrfirat merged commit 336bdb1 into staging Apr 20, 2026
18 checks passed

serrrfirat deleted the fix/2546-orchestrator-502-error-message branch April 20, 2026 18:46

github-actions Bot mentioned this pull request Apr 20, 2026

chore: promote staging to staging-promote/e35099de-24682241364 (2026-04-20 19:04 UTC) #2758

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(bridge): sanitize orchestrator failures before showing them to users (#2546)#2747

fix(bridge): sanitize orchestrator failures before showing them to users (#2546)#2747
serrrfirat merged 2 commits intostagingfrom
fix/2546-orchestrator-502-error-message

ilblackdragon commented Apr 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

ilblackdragon Apr 20, 2026

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

ilblackdragon Apr 20, 2026

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Uh oh!

ilblackdragon Apr 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

ilblackdragon Apr 20, 2026

Uh oh!

Copilot AI Apr 20, 2026

Uh oh!

ilblackdragon Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		\|\| lower.contains("context_length_exceeded")
		\|\| lower.contains("tokens used")

		\|\| lower.contains("session renewal failed")
		\|\| lower.contains("unauthorized")

	"The AI model could not authenticate. Please reconnect the provider and try again."
	"The AI model could not authenticate. Please re-authenticate the provider and try again."

Conversation

ilblackdragon commented Apr 20, 2026

Summary

Test plan

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

ilblackdragon Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants