
[OPIK-4727] [FE] feat: optimizer screens face-lift#5554

Merged
YarivHashaiComet merged 92 commits into main from itamar/optimizer-screens-face-lift
Mar 17, 2026

Conversation


@itamargolan itamargolan commented Mar 6, 2026

Details

Comprehensive face-lift of the optimizer screens UI with improvements spanning both frontend and backend. This PR introduces new KPI cards with metric comparison (baseline vs current), configuration diff views for comparing trial prompts/settings, an optimization progress chart with candidate lineage visualization, trial status indicators, and various UI polish improvements including theme-aware styling, consistent iconography, and lowercase labels. The backend adds dataset_item_count support and a new MUTATION experiment type for GEPA crossover tracking. The optimizer SDK (GEPA v2) gains reflection-based prompt evolution with multi-parent crossover support.

  • New components: MetricComparisonCell, ConfigurationDiffContent, OptimizationProgressChart, TrialKPICards, CompareOptimizationsPage views
  • Backend: dataset_item_count on Experiment/Optimization APIs, MUTATION experiment type, optimization search criteria
  • Optimizer SDK: GEPA v2 with reflection-based evolution, candidate tracking, crossover rendering
  • UI polish: theme-aware diff badges, correct column type icons, darkened success green, lowercase labels, chart tooltip fixes
  • Backward compatibility: studio_config fallback for older optimizations, feature toggle restoration

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-4727
  • OPIK-4687

AI-WATERMARK

AI-WATERMARK: yes

  • Tools: Claude Code
  • Model(s): Claude Opus 4.6
  • Scope: Full implementation with human review
  • Human verification: yes

Testing

  • Frontend lint: cd apps/opik-frontend && npm run lint
  • Frontend build: cd apps/opik-frontend && npm run build
  • Manual verification of optimizer screens UI in browser
  • Verified KPI cards, chart tooltips, diff views, column icons, trial status badges

Documentation

N/A - internal UI improvements

itamargolan and others added 30 commits February 26, 2026 10:40
Implements a new optimization framework (`apps/opik-optimizer`) that
decouples optimizer algorithms from experiment execution, persistence,
and UI concerns. Integrates via the existing optimization studio pipeline
(Redis queue → Python backend → subprocess).

Key components:
- Orchestrator: central lifecycle controller with sampler, validator,
  materializer, result aggregator, and event emitter
- StupidOptimizer: 2-step test optimizer (3 candidates → best → 2 more)
- EvaluationAdapter: wraps SDK evaluate_optimization_suite_trial()
- Backend integration: new Redis queue, framework_optimizer job processor,
  framework_runner subprocess entry point

Also adds evaluate_optimization_suite_trial() to the Python SDK, combining
optimization trial linkage with evaluation suite behavior (evaluators and
execution policy from the dataset).

53 unit + integration tests passing. Verified end-to-end against Comet cloud
with real LLM calls, UI progress chart, prompt display, and score tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix AttributeError in framework_runner.py: dataset.get_items() returns
  dicts, use item["id"] instead of item.id
- Fix hard-coded hex color in TrialPassedCell.tsx: use text-success CSS
  class instead of text-[#12B76A] for proper dark theme support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
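The dict-vs-attribute fix above can be sketched as follows (the item shape is illustrative, assuming `get_items()` yields plain dicts):

```python
# Illustrative sketch of the framework_runner fix: dataset.get_items()
# yields plain dicts, so fields must be read with subscript access,
# not attribute access.
items = [{"id": "item-1", "input": "q1"}, {"id": "item-2", "input": "q2"}]

# Before (raises AttributeError: 'dict' object has no attribute 'id'):
# ids = [item.id for item in items]

# After:
ids = [item["id"] for item in items]
print(ids)  # ['item-1', 'item-2']
```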
- Add opik:optimizer-framework to default RQ queue names so framework
  jobs actually get consumed by workers
- Add dataset size guard in orchestrator before sample_split to provide
  a clear error message for datasets with fewer than 2 items
- Extract shared optimizer_job_helper.py to deduplicate identical logic
  between optimizer.py and framework_optimizer.py
- Extract checkIsEvaluationSuite helper in optimizations.ts to
  deduplicate predicate shared between CompareTrialsPage and
  useCompareOptimizationsData
- Fix hardcoded "pass_rate" in experiment_executor.py to use the actual
  metric_type parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-item datasets

Splits the combined feedback/experiment scores into distinct fields in the
Optimization API and DAO so the frontend can fall back to experiment_scores
when feedback_scores lack the objective. Allows single-item datasets by
returning a train-only split instead of raising. Extracts shared runner
environment setup into runner_common.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The baseline was evaluated on split.validation_item_ids, which with an
80/20 split ratio meant only 1 out of 5 items was used. This gave an
unrepresentative baseline score. Now uses the full dataset_item_ids list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
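The baseline-scope problem can be illustrated with a small sketch (variable names are illustrative, not the actual SDK API): with an 80/20 split over 5 items, the validation side holds a single item, so a baseline scored there is unrepresentative.

```python
# Illustrative only: why scoring the baseline on the validation split
# of a small dataset gives an unrepresentative score.
dataset_item_ids = ["a", "b", "c", "d", "e"]
split_ratio = 0.8
cut = int(len(dataset_item_ids) * split_ratio)

train_item_ids = dataset_item_ids[:cut]       # 4 items
validation_item_ids = dataset_item_ids[cut:]  # 1 item

# Before: baseline scored on a single validation item.
# baseline_items = validation_item_ids
# After: baseline scored on the full dataset.
baseline_items = dataset_item_ids
print(len(validation_item_ids), len(baseline_items))  # 1 5
```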
Add rich metadata to each experiment so the UI can aggregate and
visualize the optimization trajectory. Key changes:

- step_index increments only when candidate changes (not per eval)
- candidate_id is stable across re-evaluations of the same prompt
- parent_candidate_ids always set correctly for derived candidates
- New metadata fields: batch_index, num_items, capture_traces, eval_purpose
- Refactor optimizer package: protocol + factory pattern for registration
- Add GEPA adapter bridging GEPA callbacks to framework metadata
- Fix BE tests for experimentScores null and queue routing
- Add docs: ADDING_AN_OPTIMIZER.md and GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove register_optimizer public API and OptimizerFactory class;
  replace with a simple dict in _load_registry()
- framework_runner: avoid holding full dataset items in memory
- Update docs and tests to match simplified factory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in CompareTrialsPage.tsx: keep both workspaceName
(for useExperimentsList) and canViewDatasets permission guard from main,
plus isEvaluationSuite prop from our branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iments

- Replace sequential step_index counter with parent-lineage derivation
  (max parent step + 1), so all re-evaluations of the same candidate
  share the same step_index
- Ensure every non-baseline experiment carries parent_candidate_ids,
  enabling the UI to draw lineage graphs
- Pass batch_index, num_items, capture_traces, and eval_purpose through
  to experiment metadata for richer visualization
- Revert runner scripts to direct invocation (remove runner_common.py)
- Update unit tests to match new metadata contract

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
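The parent-lineage derivation of step_index can be sketched like this (hypothetical candidate records; the real metadata contract lives in the optimizer framework):

```python
# Hypothetical sketch: derive step_index from parent lineage
# (max parent step + 1) instead of a sequential counter, so all
# re-evaluations of the same candidate share one step_index.
candidates = {
    "baseline": {"parent_candidate_ids": []},
    "c1": {"parent_candidate_ids": ["baseline"]},
    "c2": {"parent_candidate_ids": ["baseline"]},
    "c3": {"parent_candidate_ids": ["c1", "c2"]},  # crossover child
}

def step_index(candidate_id: str) -> int:
    parents = candidates[candidate_id]["parent_candidate_ids"]
    if not parents:
        return 0  # baseline sits at step 0
    return max(step_index(p) for p in parents) + 1

print({cid: step_index(cid) for cid in candidates})
# {'baseline': 0, 'c1': 1, 'c2': 1, 'c3': 2}
```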
- Remove canonical_config_hash from Candidate and TrialResult types,
  candidate_materializer, experiment_executor, and all tests
- Delete util/hashing.py module (unused — GEPA does minibatching so
  config-hash dedup would block valid re-evaluations)
- Merge SdkEventEmitter and LoggingEventEmitter into a single
  EventEmitter class with optional optimization_id
- Update GEPA_IMPLEMENTATION.md to reflect parent_ids tracking fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…through context

- Replace CandidateConfig dataclass with dict[str, Any] type alias
- Add baseline_config field to OptimizationContext (caller-provided, opaque)
- Orchestrator passes baseline_config through without knowing its structure
- Optimizers copy baseline_config and override prompt_messages only
- Remove result_aggregator module (inlined into evaluation_adapter)
- Move gepa imports to runtime (lazy) for optional dependency
- Fix protocol.py training_set/validation_set types to list[dict]
- Update ADDING_AN_OPTIMIZER.md to reflect all changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dependency on gepa

The gepa tests patch gepa.core.adapter.EvaluationBatch and gepa.optimize,
requiring the optional gepa package at import time. Moving them to
tests/library_integration/gepa/ with pytest.importorskip("gepa") keeps
the unit suite fast and dependency-free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep progress

Optimizers no longer receive or call event_emitter directly. The
EvaluationAdapter now auto-detects step_index changes during evaluate()
and emits on_step_started internally. GEPAProgressCallback simplified
to only forward GEPA events to the adapter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
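The adapter-side step detection can be sketched as follows (class and method names are illustrative): the optimizer never touches the emitter; the adapter compares each incoming step_index against the last one it saw and emits on_step_started exactly once per new step.

```python
# Illustrative sketch of auto-detected step progress inside an
# evaluation adapter: emits on_step_started only when the step changes.
class EvaluationAdapter:
    def __init__(self, emitter):
        self._emitter = emitter
        self._last_step = None

    def evaluate(self, candidate, step_index):
        if step_index != self._last_step:
            self._emitter.on_step_started(step_index)
            self._last_step = step_index
        return len(candidate)  # stand-in for a real evaluation score

class RecordingEmitter:
    def __init__(self):
        self.steps = []
    def on_step_started(self, step_index):
        self.steps.append(step_index)

emitter = RecordingEmitter()
adapter = EvaluationAdapter(emitter)
for step, cand in [(0, "base"), (1, "v1"), (1, "v1b"), (2, "v2")]:
    adapter.evaluate(cand, step)
print(emitter.steps)  # [0, 1, 2]
```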
Use caplog to verify logger.info output includes optimization ID and
event details, instead of just checking calls don't crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… UI detection

evaluate_optimization_suite_trial was creating experiments without
evaluation_method="evaluation_suite", causing the backend to default
to "dataset". The frontend checkIsEvaluationSuite now uses the explicit
evaluation_method field instead of heuristic score detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Adds a guard to evaluate_suite and evaluate_optimization_suite_trial that
checks dataset.dataset_type == "evaluation_suite" before proceeding. This
prevents silently running an ineffective suite trial on a plain dataset
with no scoring rules.

- Add dataset_type param to Dataset constructor, populated at all call sites
- Add dataset_type property to Dataset
- Add _validate_dataset_is_evaluation_suite in evaluator.py
- Update tests and add rejection test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on flow

evaluate_suite and evaluate_optimization_suite_trial had their entire body
duplicated. Extract shared logic into _run_suite_evaluation, parameterized
by optimization_id and dataset filters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive face-lift for optimizer screens including new KPI cards,
metric comparison cells, configuration diff views, progress charts,
trial status indicators, and backend dataset_item_count support.
Also adds backward compatibility for SDK-based optimizations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Dataset name column: hover icon instead of clickable link
- Split Accuracy into Pass rate + Accuracy columns with compact metric display
- Conditionally hide Accuracy column when no old-type optimizations exist
- Remove Logs/Configuration tabs from single optimization page
- Fall back to studio_config for configuration display on old optimizations
- Chart tooltip: remove pass rate percentage background color
- Fix dataset hover icon vertical centering
- Restore feature toggle for optimization studio

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "evaluating" status for scored candidates not yet selected
- Best candidate always "passed" (never "evaluating")
- Score < best → immediately pruned
- Sibling with children (including ghost) → pruned
- Pulsing animation on last passed candidate at highest step
  (not always on best score)
- After completion: descendants or best = passed, rest = pruned

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
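Two of the status rules above can be sketched in a simplified form (this is not the real component logic, just the decision order for scored candidates):

```python
# Simplified sketch of two trial-status rules: the best candidate is
# always "passed"; a scored candidate below the best score is pruned
# immediately; anything else stays "evaluating" until selected.
def candidate_status(score, best_score, is_best):
    if is_best:
        return "passed"      # best candidate is never "evaluating"
    if score < best_score:
        return "pruned"      # score < best -> immediately pruned
    return "evaluating"      # scored but not yet selected

print(candidate_status(0.9, 0.9, is_best=True))   # passed
print(candidate_status(0.7, 0.9, is_best=False))  # pruned
print(candidate_status(0.9, 0.9, is_best=False))  # evaluating
```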
itamargolan and others added 2 commits March 16, 2026 22:16
- Ghost dot participates in same overlap grouping as regular dots
- Remove dead GHOST_OVERLAP_OFFSET_PER_DOT constant
- Move GHOST_ID to module scope
- Add null guard in computeInProgressStatus (remove non-null assertion)
- Show trend arrow when baseline is 0% (icon-only for infinite change)
- Refactor computeCandidateStatuses into focused helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eChartData

14 tests covering: baseline, running, evaluating, passed, pruned states,
ghost parent pruning, descendant detection, in-progress vs completed logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CometActions
Collaborator

🌙 Nightly cleanup: The test environment for this PR (pr-5554) has been cleaned up to free cluster resources. PVCs are preserved — re-deploy to restore the environment.

@CometActions CometActions removed the test-environment Deploy Opik adhoc environment label Mar 17, 2026
Candidates at the same step but from different parents are independent
branches and should not prune each other. Changed grouping key from
stepIndex to sorted parentCandidateIds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
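The grouping-key change can be sketched with illustrative data: candidates at the same step but with different parents land in different groups, so independent branches no longer prune each other.

```python
# Illustrative sketch: group candidates for pruning by their sorted
# parent IDs instead of by step index.
from collections import defaultdict

candidates = [
    {"id": "c1", "step": 2, "parent_candidate_ids": ["a"]},
    {"id": "c2", "step": 2, "parent_candidate_ids": ["b"]},
    {"id": "c3", "step": 2, "parent_candidate_ids": ["a"]},
]

groups = defaultdict(list)
for cand in candidates:
    key = tuple(sorted(cand["parent_candidate_ids"]))  # was: cand["step"]
    groups[key].append(cand["id"])

print(dict(groups))  # {('a',): ['c1', 'c3'], ('b',): ['c2']}
```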
@itamargolan itamargolan added the test-environment Deploy Opik adhoc environment label Mar 17, 2026
@github-actions
Contributor

🔄 Test environment deployment process has started

Phase 1: Deploying base version 1.10.40 (from main branch) if environment doesn't exist
Phase 2: Building new images from PR branch itamar/optimizer-screens-face-lift
Phase 3: Will deploy newly built version after build completes

You can monitor the progress here.

@CometActions
Collaborator

Test environment is now available!

To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml)

Access Information

The deployment has completed successfully and the version has been verified.

- Show objective metric name instead of "Accuracy" in KPI card and table
- Pass baselineCandidate via column meta instead of scanning per cell
  (reduces O(3*R*N) to O(1) baseline lookups per render)
- Rename buildDescendantsSet → buildAncestorSet (matches traversal direction)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shared helper for objective column/card label used by both
useOptimizationColumns and getMetricKPICardConfigs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itamargolan itamargolan added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Mar 17, 2026
@github-actions
Contributor

🔄 Test environment deployment process has started

Phase 1: Deploying base version 1.10.40 (from main branch) if environment doesn't exist
Phase 2: Building new images from PR branch itamar/optimizer-screens-face-lift
Phase 3: Will deploy newly built version after build completes

You can monitor the progress here.

@CometActions
Collaborator

Test environment is now available!

To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml)

Access Information

The deployment has completed successfully and the version has been verified.

JetoPistola previously approved these changes Mar 17, 2026
Contributor

@JetoPistola JetoPistola left a comment


BE changes look good ✅

All review findings have been addressed — either fixed in code or acknowledged with rationale:

  • Sentinel nil UUID → replaced with early return
  • MATERIALIZE INDEX → added to migration
  • Aggregation field test coverage → added
  • datasetItemCount redundancy → field removed
  • LIKE metacharacter escaping → fixed
  • Blocking JDBI call → acknowledged, deferred to follow-up
  • argMax tiebreaking → accepted as intentional
  • Weighted p50 approximation → acknowledged
  • Score clamping → intentional (pass rates are 0-1)
  • Feature flag removal → intentional (dataset versioning rollout)

🤖 Review posted via /review-github-pr

aadereiko previously approved these changes Mar 17, 2026
Collaborator

@aadereiko aadereiko left a comment


The FE parts look good to go. The remaining comments are nits; feel free to fix them in a follow-up PR.

…mizations

- Extract hover tooltip into ChartTooltip.tsx component
- Move getOptimizationMetadata, aggregateCandidates, mergeExperimentScores
  from useOptimizationExperiments to lib/optimizations.ts
- Move CANDIDATE_SORT_FIELD_MAP, sortCandidates to lib/optimizations.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itamargolan itamargolan dismissed stale reviews from aadereiko and JetoPistola via 962d311 March 17, 2026 18:04
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@aadereiko aadereiko left a comment


LGTM, thanks for fixing comments!

@YarivHashaiComet
Collaborator

@itamargolan i'm merging this one for you

Member

@andrescrz andrescrz left a comment


As agreed, I left my post-merge feedback. My main feedback goes around:

  • Experiment aggregations coverage.
  • Dataset version coverage.
  • Query performance, mostly traversing workspaces.

The rest is minor considerations.

Optional<Dataset> findByName(@Bind("workspace_id") String workspaceId, @Bind("name") String name);

@SqlQuery("SELECT id FROM datasets WHERE workspace_id = :workspace_id AND name LIKE CONCAT('%', :name, '%') ESCAPE '\\\\'")
List<UUID> findIdsByPartialName(@Bind("workspace_id") String workspaceId, @Bind("name") String name);
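The escaping concern behind the `ESCAPE` clause can be illustrated with a short sketch (not the actual backend code, which is Java): `%` and `_` are LIKE wildcards, so a raw search term must have them, and the escape character itself, escaped before being wrapped in `CONCAT('%', :name, '%')`.

```python
# Illustrative sketch of LIKE-metacharacter escaping: '%' matches any
# run of characters and '_' matches any single character, so a literal
# search term must escape them (and the escape character itself) first.
def escape_like(term: str, escape: str = "\\") -> str:
    for ch in (escape, "%", "_"):
        term = term.replace(ch, escape + ch)
    return term

print(escape_like("100%_done"))  # 100\%\_done
```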
Member


There's an existing method for this functionality that should be reused instead of adding a new one.

In addition, this is the old table. Currently dataset versioning is enabled, you need to check if that versioning is covered in your PR. Otherwise, your changes might not work completely.

For dataset_versions, you need to cover indexing on workspaceId and name, I believe that's not the case.

Contributor Author


Investigated this. dataset_versions does not have a name column — version names ("v1", "v2") are computed via ROW_NUMBER() in queries, not stored. Versions inherit names from the parent datasets table via the dataset_id FK. So DatasetDAO.findIdsByPartialName on the datasets table is correct — it returns the dataset IDs whose names match, and those same IDs are the parent of any versions.

If you had a different table/method in mind for reuse, happy to look into it — but the current approach seems correct given the schema.

Member

@andrescrz andrescrz Mar 20, 2026


Yes, I meant the find method in this class, which also finds by partial name matching. Additionally, it's paginated, limited, and sorted. The only difference is that it returns all fields, but in MySQL that's fine; you can just ignore everything but the id.

The idea is not to duplicate existing methods, and to always limit and paginate unbounded methods.


Labels

  • Backend
  • dependencies (Pull requests that update a dependency file)
  • Frontend
  • java (Pull requests that update Java code)
  • Python SDK
  • python (Pull requests that update Python code)
  • test-environment (Deploy Opik adhoc environment)
  • tests (Including test files, or tests related like configuration)
  • typescript (*.ts, *.tsx)

10 participants