
[OPIK-5665] [OPIK-5386] [FE] fix: Ollie sidebar error state, retry, and keepalive recovery#6135

Merged
awkoy merged 2 commits into main from awkoy/OPIK-5665
Apr 8, 2026

Conversation

Contributor

@awkoy awkoy commented Apr 8, 2026

Details

The Ollie assistant sidebar never recovers when the pod dies: health polling stops permanently once the pod reports ready, and the compute result is cached with staleTime: Infinity. The only way to recover is a hard page refresh.

This PR adds keepalive health polling (30s after ready), auto-recovery via a podStateRef state machine, and a proper error UI matching the Figma design with retry and escalation.

Summary

  • Add 30s keepalive health polling after pod is ready (was stopped forever), with auto-recovery on pod death via podStateRef state machine — stale tabs recover without hard refresh
  • Error UI with three states: first-try retry, escalated (Slack link), and collapsed vertical text
  • Extract AssistantErrorState component with CollapsedError, EscalatedError, FirstRetryError sub-components
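The exact transitions of the podStateRef state machine aren't spelled out here; a minimal framework-free sketch, assuming the ready → down → recovery flow named in the commit message (state names follow the commit message, everything else is illustrative and not the actual hook code):

```typescript
// Illustrative sketch of the pod-state machine driving auto-recovery.
// Each keepalive health-check result feeds one transition.
type PodState = "provisioning" | "ready" | "down" | "recovery";

function nextPodState(state: PodState, healthy: boolean): PodState {
  switch (state) {
    case "provisioning":
      // Initial health polling until the pod first reports ready.
      return healthy ? "ready" : "provisioning";
    case "ready":
      // Keepalive polling continues after ready; a failed check
      // means the pod died mid-session.
      return healthy ? "ready" : "down";
    case "down":
      // Entering "down" triggers a reset of the compute + health
      // queries and moves straight into recovery.
      return "recovery";
    case "recovery":
      // Recovery polls until a fresh pod reports healthy again.
      return healthy ? "ready" : "recovery";
  }
}
```

With transitions like these, a stale tab whose pod died simply observes a failed keepalive check, falls into recovery, and returns to ready once a new pod comes up, with no hard refresh.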

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5665
  • OPIK-5386

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code

Testing

  • Initial load failure (expanded): AlertCircle + "Unable to load assistant" + "retry now"
  • Initial load failure (collapsed): AlertCircle + vertical text, clickable
  • Click retry → loader reappears → sidebar loads
  • Retry fails again → escalated: "currently unavailable" + Slack link + de-emphasized retry
  • Recovery resets escalation (retryCount back to 0)
  • Mid-session pod death: block /health/ready → wait 30s → auto-recovery without user action
  • Verify 30s keepalive polling in Network tab after sidebar loads

Documentation

No changes

🤖 Generated with Claude Code

…nd keepalive recovery

- Add 30s keepalive health polling after pod is ready (was stopped forever)
- Auto-retry on pod death via podState machine (ready → down → recovery)
- Retry resets both compute + health queries for fresh pod provisioning
- Error UI with 3 states: first-try, escalated (Slack link), collapsed
- Extract AssistantErrorState component for maintainability
- Export SLACK_LINK constant for reuse
@awkoy awkoy requested a review from a team as a code owner April 8, 2026 14:40
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

📋 PR Linter Failed

Missing Section. The description is missing the ## Details section.


Missing Section. The description is missing the ## Change checklist section.


Missing Section. The description is missing the ## Issues section.


Missing Section. The description is missing the ## Testing section.


Missing Section. The description is missing the ## Documentation section.


Comment on lines +139 to +144
const retry = useCallback(() => {
if (isComputeLoadingRef.current) return;

setRetryCount((c) => c + 1);
setIsTimedOut(false);
healthPollStartRef.current = 0;
Contributor

retry() increments retryCount on automatic keepalive-triggered calls, causing AssistantErrorState to escalate transient failures — should we only increment retryCount for user-initiated retries and gate or separate the manual retry path?

Finding type: Logical Bugs | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 139-144, the
retry() callback unconditionally increments retryCount causing automatic
keepalive-triggered retries to count as user retries. Refactor by separating manual vs
automatic retries: create an internalRetry() that performs the
reset/queryClient.resetQueries() logic without touching retryCount, and expose a
manualRetry() (or change retry() to accept an isUserInitiated boolean) that increments
retryCount and then calls internalRetry(). Update the automatic keepalive call at lines
~202-211 to call internalRetry() (or retry(false)), and keep UI-exposed retry to
increment retryCount. Ensure types and all call sites are adjusted accordingly.

Contributor

Commit 2699669 addressed this comment by introducing a resetBackend helper that performs the keepalive recovery without touching retryCount and having the UI-facing retry increment the counter before calling it; automatic keepalive failures now invoke resetBackend (not retry), so only user retries escalate the error state.

Contributor Author

Already addressed in commit 2699669. resetBackend() handles auto-retry without incrementing retryCount; retry() is user-initiated only.

Contributor

Got it, thanks for clarifying the resetBackend() vs retry() behavior—I'll save this to memory after the PR is merged.

…t UI escalation

Auto-retry (keepalive failure) now uses resetBackend() which doesn't
increment retryCount. Only user-initiated retry increments the counter,
so users always see "retry now" first after auto-recovery fails.
Comment on lines +139 to +150
// Core reset — used by both auto-retry and manual retry.
// Does not touch retryCount so auto-recovery doesn't escalate the UI.
const resetBackend = useCallback(() => {
if (isComputeLoadingRef.current) return;

setIsTimedOut(false);
healthPollStartRef.current = 0;
trackedBaseUrlRef.current = null;
// resetQueries (not invalidateQueries): clears data immediately so
// dead pod URL doesn't linger while refetch is in-flight.
queryClient.resetQueries({ queryKey: ["assistant-health"] });
queryClient.resetQueries({ queryKey: ["assistant-compute"] });
Contributor

After resetBackend nulls trackedBaseUrlRef, a stale computeResult.baseUrl can retrigger the compare-and-seed and immediately set healthPollStartRef.current — should we avoid reseeding healthPollStartRef while resetBackend is in-flight and keep it null until computeResult.baseUrl is null?

Finding type: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 139 to 157, the
resetBackend logic clears trackedBaseUrlRef and healthPollStartRef immediately which
allows the compare-and-seed branch (elsewhere) to re-seed healthPollStartRef from a
still-cached computeResult.baseUrl. Modify resetBackend to avoid re-seeding while a
reset is in-flight: add a short-lived inFlightReset flag/ref set to true at the start of
resetBackend and cleared only after computeResult reports baseUrl === null (or when
compute query is refreshed). Update the compare-and-seed branch to check this
inFlightReset flag and skip assigning healthPollStartRef.current while the flag is true.
This prevents restarting the timeout/health polling against a stale URL.

Contributor Author

False positive. resetBackend is called via queueMicrotask, and resetQueries clears computeResult to undefined in the cache synchronously. React batches the updates into one re-render where baseUrl is null — so null !== null is false and the compare-and-seed block doesn't fire. Even in a hypothetical transient render with the old baseUrl, it self-corrects next render when computeResult clears, and the health query is simultaneously reset so no polling happens against a stale URL.

Contributor

Got it, thanks for clarifying the interaction between queueMicrotask, resetQueries, and React’s batching—this makes the earlier concern a false positive in this context. I’ll save this to memory after the PR is merged.

Comment on lines +153 to +157
// User-initiated retry — increments retryCount for UI escalation
const retry = useCallback(() => {
setRetryCount((c) => c + 1);
resetBackend();
}, [resetBackend]);
Contributor

useAssistantBackend adds retryCount and retry() and auto-recovery/timeout/escalation logic but has no tests; should we add tests for retryCount increments, resetBackend recovery, and timeout/error transitions?

Finding type: AI Coding Guidelines | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 153 to 157, the
new retry() logic and surrounding auto-recovery/timeout behavior are untested. Add tests
that (1) call the exported retry() and assert retryCount increments and resetBackend is
invoked, (2) simulate an auto-recovery path that triggers resetBackend without changing
retryCount, and (3) simulate the health-poll timeout and compute-error cases to assert
the hook returns the notReady result with the correct retry function and retryCount
value. Implement these tests in the hook's test file (or create tests in the same test
folder), using jest fake timers and msw or mocked fetch/cometApi responses to control
compute/health endpoints and to verify UI-escalation behavior.

Contributor Author

Valid suggestion. No existing test infrastructure for Comet plugin hooks — out of scope for this PR.

Contributor

Understood, thanks for clarifying that adding test infrastructure for the Comet plugin hooks is out of scope for this PR. The rationale makes sense given the current project boundaries.

@thiagohora thiagohora added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Apr 8, 2026
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Test environment deployment process has started

Phase 1: Deploying base version 1.10.61-4835 (from main branch) if environment doesn't exist
Phase 2: Building new images from PR branch awkoy/OPIK-5665
Phase 3: Will deploy newly built version after build completes

You can monitor the progress here.

@CometActions
Collaborator

Test environment is now available!

To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml).

Access Information

The deployment has completed successfully and the version has been verified.

Collaborator

@aadereiko aadereiko left a comment


Looks good, thanks!

@awkoy awkoy merged commit f24e153 into main Apr 8, 2026
26 checks passed
@awkoy awkoy deleted the awkoy/OPIK-5665 branch April 8, 2026 15:43