[OPIK-5665] [OPIK-5386] [FE] fix: Ollie sidebar error state, retry, and keepalive recovery #6135
Conversation
…nd keepalive recovery

- Add 30s keepalive health polling after pod is ready (was stopped forever)
- Auto-retry on pod death via podState machine (ready → down → recovery)
- Retry resets both compute + health queries for fresh pod provisioning
- Error UI with 3 states: first-try, escalated (Slack link), collapsed
- Extract AssistantErrorState component for maintainability
- Export SLACK_LINK constant for reuse
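The "ready → down → recovery" machine named in the commit message can be sketched as a plain transition function. This is a hypothetical model of the transitions described above, not the PR's actual `podStateRef` implementation; the event names are invented for illustration.

```typescript
// Illustrative pod state machine (names are assumptions, not the PR's code).
type PodState = "ready" | "down" | "recovery";
type PodEvent = "healthFail" | "retryStarted" | "healthOk";

function nextPodState(state: PodState, event: PodEvent): PodState {
  if (state === "ready" && event === "healthFail") return "down";      // keepalive detects dead pod
  if (state === "down" && event === "retryStarted") return "recovery"; // queries reset, pod re-provisioning
  if (state === "recovery" && event === "healthOk") return "ready";    // new pod is healthy
  if (state === "recovery" && event === "healthFail") return "down";   // recovery attempt failed
  return state; // all other events leave the state unchanged
}

console.log(nextPodState("ready", "healthFail")); // "down"
```

Modeling the transitions as a pure function like this keeps the recovery logic testable independent of React rendering.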
📋 PR Linter Failed
❌ Missing Section (×5). The description is missing the …
```typescript
const retry = useCallback(() => {
  if (isComputeLoadingRef.current) return;

  setRetryCount((c) => c + 1);
  setIsTimedOut(false);
  healthPollStartRef.current = 0;
```
retry() increments retryCount on automatic keepalive-triggered calls, causing AssistantErrorState to escalate transient failures — should we only increment retryCount for user-initiated retries and gate or separate the manual retry path?
Finding type: Logical Bugs | Severity: 🟠 Medium
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 139-144, the
retry() callback unconditionally increments retryCount causing automatic
keepalive-triggered retries to count as user retries. Refactor by separating manual vs
automatic retries: create an internalRetry() that performs the
reset/queryClient.resetQueries() logic without touching retryCount, and expose a
manualRetry() (or change retry() to accept an isUserInitiated boolean) that increments
retryCount and then calls internalRetry(). Update the automatic keepalive call at lines
~202-211 to call internalRetry() (or retry(false)), and keep UI-exposed retry to
increment retryCount. Ensure types and all call sites are adjusted accordingly.
Commit 2699669 addressed this comment by introducing a resetBackend helper that performs the keepalive recovery without touching retryCount and having the UI-facing retry increment the counter before calling it; automatic keepalive failures now invoke resetBackend (not retry), so only user retries escalate the error state.
Already addressed in commit 2699669 — resetBackend() handles auto-retry without incrementing retryCount, retry() is user-initiated only.
Got it, thanks for clarifying the resetBackend() vs retry() behavior—I'll save this to memory after the PR is merged.
…t UI escalation

Auto-retry (keepalive failure) now uses resetBackend(), which doesn't increment retryCount. Only user-initiated retry increments the counter, so users always see "retry now" first after auto-recovery fails.
```typescript
// Core reset — used by both auto-retry and manual retry.
// Does not touch retryCount so auto-recovery doesn't escalate the UI.
const resetBackend = useCallback(() => {
  if (isComputeLoadingRef.current) return;

  setIsTimedOut(false);
  healthPollStartRef.current = 0;
  trackedBaseUrlRef.current = null;
  // resetQueries (not invalidateQueries): clears data immediately so
  // dead pod URL doesn't linger while refetch is in-flight.
  queryClient.resetQueries({ queryKey: ["assistant-health"] });
  queryClient.resetQueries({ queryKey: ["assistant-compute"] });
```
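The inline comment's distinction between `resetQueries` and `invalidateQueries` rests on a cache-semantics difference that a plain-TS model can illustrate. This is an illustrative stand-in, not TanStack Query's real implementation: invalidation keeps stale data visible while a refetch is in flight, while a reset clears it immediately.

```typescript
// Toy cache model (illustrative only): why reset, not invalidate, is
// used to drop a dead pod's baseUrl without it lingering mid-refetch.
type CacheEntry = { data: string | null; stale: boolean };
const cache = new Map<string, CacheEntry>();

cache.set("assistant-compute", { data: "http://dead-pod", stale: false });

function invalidate(key: string) {
  const entry = cache.get(key);
  if (entry) entry.stale = true; // data lingers while refetch is in flight
}

function reset(key: string) {
  cache.set(key, { data: null, stale: false }); // cleared immediately
}

invalidate("assistant-compute");
console.log(cache.get("assistant-compute")!.data); // "http://dead-pod" (still visible)
reset("assistant-compute");
console.log(cache.get("assistant-compute")!.data); // null
```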
After resetBackend nulls trackedBaseUrlRef, a stale computeResult.baseUrl can retrigger the compare-and-seed and immediately set healthPollStartRef.current — should we avoid reseeding healthPollStartRef while resetBackend is in-flight and keep it null until computeResult.baseUrl is null?
Finding type: Logical Bugs | Severity: 🔴 High
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 139 to 157, the
resetBackend logic clears trackedBaseUrlRef and healthPollStartRef immediately which
allows the compare-and-seed branch (elsewhere) to re-seed healthPollStartRef from a
still-cached computeResult.baseUrl. Modify resetBackend to avoid re-seeding while a
reset is in-flight: add a short-lived inFlightReset flag/ref set to true at the start of
resetBackend and cleared only after computeResult reports baseUrl === null (or when
compute query is refreshed). Update the compare-and-seed branch to check this
inFlightReset flag and skip assigning healthPollStartRef.current while the flag is true.
This prevents restarting the timeout/health polling against a stale URL.
False positive. resetBackend is called via queueMicrotask, and resetQueries clears computeResult to undefined in the cache synchronously. React batches the updates into one re-render where baseUrl is null — so null !== null is false and the compare-and-seed block doesn't fire. Even in a hypothetical transient render with the old baseUrl, it self-corrects next render when computeResult clears, and the health query is simultaneously reset so no polling happens against a stale URL.
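The ordering argument (microtask callbacks run only after the current synchronous block finishes, and later-queued microtasks observe earlier synchronous writes) can be checked with a minimal standalone sketch. The cache variable here is an illustrative stand-in for the query cache, not the real one.

```typescript
// Standalone demonstration of the microtask ordering claim above.
async function demo(): Promise<string[]> {
  const order: string[] = [];
  let cachedBaseUrl: string | null = "http://stale-pod";

  queueMicrotask(() => {
    cachedBaseUrl = null; // synchronous clear, like resetQueries
    order.push("reset");
  });

  order.push("sync"); // the current task always runs to completion first

  queueMicrotask(() => {
    // queued after the reset, so it observes the cleared cache
    order.push(`seed sees ${cachedBaseUrl}`);
  });

  await Promise.resolve(); // resume only after the queued microtasks ran
  return order;
}

demo().then((o) => console.log(o.join(" -> "))); // sync -> reset -> seed sees null
```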
Got it, thanks for clarifying the interaction between queueMicrotask, resetQueries, and React’s batching—this makes the earlier concern a false positive in this context. I’ll save this to memory after the PR is merged.
```typescript
// User-initiated retry — increments retryCount for UI escalation
const retry = useCallback(() => {
  setRetryCount((c) => c + 1);
  resetBackend();
}, [resetBackend]);
```
useAssistantBackend adds retryCount and retry() and auto-recovery/timeout/escalation logic but has no tests; should we add tests for retryCount increments, resetBackend recovery, and timeout/error transitions?
Finding type: AI Coding Guidelines | Severity: 🟠 Medium
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 153 to 157, the
new retry() logic and surrounding auto-recovery/timeout behavior are untested. Add tests
that (1) call the exported retry() and assert retryCount increments and resetBackend is
invoked, (2) simulate an auto-recovery path that triggers resetBackend without changing
retryCount, and (3) simulate the health-poll timeout and compute-error cases to assert
the hook returns the notReady result with the correct retry function and retryCount
value. Implement these tests in the hook's test file (or create tests in the same test
folder), using jest fake timers and msw or mocked fetch/cometApi responses to control
compute/health endpoints and to verify UI-escalation behavior.
Valid suggestion. No existing test infrastructure for Comet plugin hooks — out of scope for this PR.
Understood, thanks for clarifying that adding test infrastructure for the Comet plugin hooks is out of scope for this PR. The rationale makes sense given the current project boundaries.
🔄 Test environment deployment process has started. Phase 1: Deploying base version. You can monitor the progress here.
✅ Test environment is now available! To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml).

Access Information

The deployment has completed successfully and the version has been verified.
Details
The Ollie assistant sidebar never recovers when the pod dies — health polling stops forever after ready, and compute is cached with `staleTime: Infinity`. The only fix is a hard page refresh.

This PR adds keepalive health polling (30s after ready), auto-recovery via a `podStateRef` state machine, and a proper error UI matching the Figma design with retry and escalation.

Summary

- Keepalive health polling + auto-recovery via the `podStateRef` state machine — stale tabs recover without a hard refresh
- New `AssistantErrorState` component with `CollapsedError`, `EscalatedError`, and `FirstRetryError` sub-components

Change checklist
Issues
AI-WATERMARK: yes
Testing
Blocked `/health/ready` → wait 30s → auto-recovery without user action

Documentation
No changes
🤖 Generated with Claude Code