
[OPIK-5665] [OPIK-5386] [FE] fix: Ollie sidebar error state, retry, and keepalive recovery#6135

Merged
awkoy merged 2 commits into main from awkoy/OPIK-5665
Apr 8, 2026

Conversation

Contributor

@awkoy awkoy commented Apr 8, 2026

Details

The Ollie assistant sidebar never recovers when the pod dies: health polling stops permanently once the pod reports ready, and the compute result is cached with staleTime: Infinity. The only way to recover is a hard page refresh.

This PR adds keepalive health polling (30s after ready), auto-recovery via a podStateRef state machine, and a proper error UI matching the Figma design with retry and escalation.

Summary

  • Add 30s keepalive health polling after pod is ready (was stopped forever), with auto-recovery on pod death via podStateRef state machine — stale tabs recover without hard refresh
  • Error UI with three states: first-try retry, escalated (Slack link), and collapsed vertical text
  • Extract AssistantErrorState component with CollapsedError, EscalatedError, FirstRetryError sub-components
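The exact transitions of the podStateRef state machine aren't spelled out here; a minimal framework-free sketch, assuming the ready → down → recovery flow named in the commit message (state names follow the commit message, everything else is illustrative and not the actual hook code):

```typescript
// Illustrative sketch of the pod-state machine driving auto-recovery.
// Each keepalive health-check result feeds one transition.
type PodState = "provisioning" | "ready" | "down" | "recovery";

function nextPodState(state: PodState, healthy: boolean): PodState {
  switch (state) {
    case "provisioning":
      // Initial health polling until the pod first reports ready.
      return healthy ? "ready" : "provisioning";
    case "ready":
      // Keepalive polling continues after ready; a failed check
      // means the pod died mid-session.
      return healthy ? "ready" : "down";
    case "down":
      // Entering "down" triggers a reset of the compute + health
      // queries and moves straight into recovery.
      return "recovery";
    case "recovery":
      // Recovery polls until a fresh pod reports healthy again.
      return healthy ? "ready" : "recovery";
  }
}
```

With transitions like these, a stale tab whose pod died simply observes a failed keepalive check, falls into recovery, and returns to ready once a new pod comes up, with no hard refresh.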

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5665
  • OPIK-5386

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code

Testing

  • Initial load failure (expanded): AlertCircle + "Unable to load assistant" + "retry now"
  • Initial load failure (collapsed): AlertCircle + vertical text, clickable
  • Click retry → loader reappears → sidebar loads
  • Retry fails again → escalated: "currently unavailable" + Slack link + de-emphasized retry
  • Recovery resets escalation (retryCount back to 0)
  • Mid-session pod death: block /health/ready → wait 30s → auto-recovery without user action
  • Verify 30s keepalive polling in Network tab after sidebar loads

Documentation

No changes

🤖 Generated with Claude Code

…nd keepalive recovery

- Add 30s keepalive health polling after pod is ready (was stopped forever)
- Auto-retry on pod death via podState machine (ready → down → recovery)
- Retry resets both compute + health queries for fresh pod provisioning
- Error UI with 3 states: first-try, escalated (Slack link), collapsed
- Extract AssistantErrorState component for maintainability
- Export SLACK_LINK constant for reuse
@awkoy awkoy requested a review from a team as a code owner April 8, 2026 14:40
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

📋 PR Linter Failed

Missing Section. The description is missing the ## Details section.


Missing Section. The description is missing the ## Change checklist section.


Missing Section. The description is missing the ## Issues section.


Missing Section. The description is missing the ## Testing section.


Missing Section. The description is missing the ## Documentation section.


Comment on lines +139 to +144
const retry = useCallback(() => {
if (isComputeLoadingRef.current) return;

setRetryCount((c) => c + 1);
setIsTimedOut(false);
healthPollStartRef.current = 0;
Contributor

retry() increments retryCount on automatic keepalive-triggered calls, causing AssistantErrorState to escalate transient failures — should we only increment retryCount for user-initiated retries and gate or separate the manual retry path?

Finding type: Logical Bugs | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 139-144, the
retry() callback unconditionally increments retryCount causing automatic
keepalive-triggered retries to count as user retries. Refactor by separating manual vs
automatic retries: create an internalRetry() that performs the
reset/queryClient.resetQueries() logic without touching retryCount, and expose a
manualRetry() (or change retry() to accept an isUserInitiated boolean) that increments
retryCount and then calls internalRetry(). Update the automatic keepalive call at lines
~202-211 to call internalRetry() (or retry(false)), and keep UI-exposed retry to
increment retryCount. Ensure types and all call sites are adjusted accordingly.

Contributor

Commit 2699669 addressed this comment by introducing a resetBackend helper that performs the keepalive recovery without touching retryCount and having the UI-facing retry increment the counter before calling it; automatic keepalive failures now invoke resetBackend (not retry), so only user retries escalate the error state.

Contributor Author

Already addressed in commit 2699669. resetBackend() handles auto-retry without incrementing retryCount; retry() is user-initiated only.

Contributor

Got it, thanks for clarifying the resetBackend() vs retry() behavior—I'll save this to memory after the PR is merged.

…t UI escalation

Auto-retry (keepalive failure) now uses resetBackend() which doesn't
increment retryCount. Only user-initiated retry increments the counter,
so users always see "retry now" first after auto-recovery fails.
Comment on lines +139 to +150
// Core reset — used by both auto-retry and manual retry.
// Does not touch retryCount so auto-recovery doesn't escalate the UI.
const resetBackend = useCallback(() => {
if (isComputeLoadingRef.current) return;

setIsTimedOut(false);
healthPollStartRef.current = 0;
trackedBaseUrlRef.current = null;
// resetQueries (not invalidateQueries): clears data immediately so
// dead pod URL doesn't linger while refetch is in-flight.
queryClient.resetQueries({ queryKey: ["assistant-health"] });
queryClient.resetQueries({ queryKey: ["assistant-compute"] });
Contributor

After resetBackend nulls trackedBaseUrlRef, a stale computeResult.baseUrl can retrigger the compare-and-seed and immediately set healthPollStartRef.current — should we avoid reseeding healthPollStartRef while resetBackend is in-flight and keep it null until computeResult.baseUrl is null?

Finding type: Logical Bugs | Severity: 🔴 High



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 139 to 157, the
resetBackend logic clears trackedBaseUrlRef and healthPollStartRef immediately which
allows the compare-and-seed branch (elsewhere) to re-seed healthPollStartRef from a
still-cached computeResult.baseUrl. Modify resetBackend to avoid re-seeding while a
reset is in-flight: add a short-lived inFlightReset flag/ref set to true at the start of
resetBackend and cleared only after computeResult reports baseUrl === null (or when
compute query is refreshed). Update the compare-and-seed branch to check this
inFlightReset flag and skip assigning healthPollStartRef.current while the flag is true.
This prevents restarting the timeout/health polling against a stale URL.

Contributor Author

False positive. resetBackend is called via queueMicrotask, and resetQueries clears computeResult to undefined in the cache synchronously. React batches the updates into one re-render where baseUrl is null — so null !== null is false and the compare-and-seed block doesn't fire. Even in a hypothetical transient render with the old baseUrl, it self-corrects next render when computeResult clears, and the health query is simultaneously reset so no polling happens against a stale URL.

Contributor

Got it, thanks for clarifying the interaction between queueMicrotask, resetQueries, and React’s batching—this makes the earlier concern a false positive in this context. I’ll save this to memory after the PR is merged.

Comment on lines +153 to +157
// User-initiated retry — increments retryCount for UI escalation
const retry = useCallback(() => {
setRetryCount((c) => c + 1);
resetBackend();
}, [resetBackend]);
Contributor

useAssistantBackend adds retryCount and retry() and auto-recovery/timeout/escalation logic but has no tests; should we add tests for retryCount increments, resetBackend recovery, and timeout/error transitions?

Finding type: AI Coding Guidelines | Severity: 🟠 Medium



Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-frontend/src/plugins/comet/useAssistantBackend.ts around lines 153 to 157, the
new retry() logic and surrounding auto-recovery/timeout behavior are untested. Add tests
that (1) call the exported retry() and assert retryCount increments and resetBackend is
invoked, (2) simulate an auto-recovery path that triggers resetBackend without changing
retryCount, and (3) simulate the health-poll timeout and compute-error cases to assert
the hook returns the notReady result with the correct retry function and retryCount
value. Implement these tests in the hook's test file (or create tests in the same test
folder), using jest fake timers and msw or mocked fetch/cometApi responses to control
compute/health endpoints and to verify UI-escalation behavior.

Contributor Author

Valid suggestion. No existing test infrastructure for Comet plugin hooks — out of scope for this PR.

Contributor

Understood, thanks for clarifying that adding test infrastructure for the Comet plugin hooks is out of scope for this PR. The rationale makes sense given the current project boundaries.

@thiagohora thiagohora added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Apr 8, 2026
@github-actions
Contributor

github-actions bot commented Apr 8, 2026

🔄 Test environment deployment process has started

Phase 1: Deploying base version 1.10.61-4835 (from main branch) if environment doesn't exist
Phase 2: Building new images from PR branch awkoy/OPIK-5665
Phase 3: Will deploy newly built version after build completes

You can monitor the progress here.

@CometActions
Collaborator

Test environment is now available!

To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml).

Access Information

The deployment has completed successfully and the version has been verified.

Collaborator

@aadereiko aadereiko left a comment


Looks good, thanks!

@awkoy awkoy merged commit f24e153 into main Apr 8, 2026
26 checks passed
@awkoy awkoy deleted the awkoy/OPIK-5665 branch April 8, 2026 15:43