
[OPIK-4727] [FE] feat: optimizer screens face-lift#5554

Merged
YarivHashaiComet merged 92 commits into main from itamar/optimizer-screens-face-lift
Mar 17, 2026

Conversation


@itamargolan itamargolan commented Mar 6, 2026

Details

Comprehensive face-lift of the optimizer screens UI with improvements spanning both frontend and backend. This PR introduces new KPI cards with metric comparison (baseline vs current), configuration diff views for comparing trial prompts/settings, an optimization progress chart with candidate lineage visualization, trial status indicators, and various UI polish improvements including theme-aware styling, consistent iconography, and lowercase labels. The backend adds dataset_item_count support and a new MUTATION experiment type for GEPA crossover tracking. The optimizer SDK (GEPA v2) gains reflection-based prompt evolution with multi-parent crossover support.

  • New components: MetricComparisonCell, ConfigurationDiffContent, OptimizationProgressChart, TrialKPICards, CompareOptimizationsPage views
  • Backend: dataset_item_count on Experiment/Optimization APIs, MUTATION experiment type, optimization search criteria
  • Optimizer SDK: GEPA v2 with reflection-based evolution, candidate tracking, crossover rendering
  • UI polish: theme-aware diff badges, correct column type icons, darkened success green, lowercase labels, chart tooltip fixes
  • Backward compatibility: studio_config fallback for older optimizations, feature toggle restoration

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-4727
  • OPIK-4687

AI-WATERMARK

AI-WATERMARK: yes

  • Tools: Claude Code
  • Model(s): Claude Opus 4.6
  • Scope: Full implementation with human review
  • Human verification: yes

Testing

  • Frontend lint: cd apps/opik-frontend && npm run lint
  • Frontend build: cd apps/opik-frontend && npm run build
  • Manual verification of optimizer screens UI in browser
  • Verified KPI cards, chart tooltips, diff views, column icons, trial status badges

Documentation

N/A - internal UI improvements

itamargolan and others added 30 commits February 26, 2026 10:40
Implements a new optimization framework (`apps/opik-optimizer`) that
decouples optimizer algorithms from experiment execution, persistence,
and UI concerns. Integrates via the existing optimization studio pipeline
(Redis queue → Python backend → subprocess).

Key components:
- Orchestrator: central lifecycle controller with sampler, validator,
  materializer, result aggregator, and event emitter
- StupidOptimizer: 2-step test optimizer (3 candidates → best → 2 more)
- EvaluationAdapter: wraps SDK evaluate_optimization_suite_trial()
- Backend integration: new Redis queue, framework_optimizer job processor,
  framework_runner subprocess entry point

Also adds evaluate_optimization_suite_trial() to the Python SDK, combining
optimization trial linkage with evaluation suite behavior (evaluators and
execution policy from the dataset).

53 unit + integration tests passing. Verified end-to-end against Comet cloud
with real LLM calls, UI progress chart, prompt display, and score tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix AttributeError in framework_runner.py: dataset.get_items() returns
  dicts, use item["id"] instead of item.id
- Fix hard-coded hex color in TrialPassedCell.tsx: use text-success CSS
  class instead of text-[#12B76A] for proper dark theme support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
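The dict-vs-attribute fix above can be sketched as follows (the item shape is illustrative, assuming `get_items()` yields plain dicts):

```python
# Illustrative sketch of the framework_runner fix: dataset.get_items()
# yields plain dicts, so fields must be read with subscript access,
# not attribute access.
items = [{"id": "item-1", "input": "q1"}, {"id": "item-2", "input": "q2"}]

# Before (raises AttributeError: 'dict' object has no attribute 'id'):
# ids = [item.id for item in items]

# After:
ids = [item["id"] for item in items]
print(ids)  # ['item-1', 'item-2']
```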
- Add opik:optimizer-framework to default RQ queue names so framework
  jobs actually get consumed by workers
- Add dataset size guard in orchestrator before sample_split to provide
  a clear error message for datasets with fewer than 2 items
- Extract shared optimizer_job_helper.py to deduplicate identical logic
  between optimizer.py and framework_optimizer.py
- Extract checkIsEvaluationSuite helper in optimizations.ts to
  deduplicate predicate shared between CompareTrialsPage and
  useCompareOptimizationsData
- Fix hardcoded "pass_rate" in experiment_executor.py to use the actual
  metric_type parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-item datasets

Splits the combined feedback/experiment scores into distinct fields in the
Optimization API and DAO so the frontend can fall back to experiment_scores
when feedback_scores lack the objective. Allows single-item datasets by
returning a train-only split instead of raising. Extracts shared runner
environment setup into runner_common.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The baseline was evaluated on split.validation_item_ids, which with an
80/20 split ratio meant only 1 out of 5 items was used. This gave an
unrepresentative baseline score. Now uses the full dataset_item_ids list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
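The baseline-scope problem can be illustrated with a small sketch (variable names are illustrative, not the actual SDK API): with an 80/20 split over 5 items, the validation side holds a single item, so a baseline scored there is unrepresentative.

```python
# Illustrative only: why scoring the baseline on the validation split
# of a small dataset gives an unrepresentative score.
dataset_item_ids = ["a", "b", "c", "d", "e"]
split_ratio = 0.8
cut = int(len(dataset_item_ids) * split_ratio)

train_item_ids = dataset_item_ids[:cut]       # 4 items
validation_item_ids = dataset_item_ids[cut:]  # 1 item

# Before: baseline scored on a single validation item.
# baseline_items = validation_item_ids
# After: baseline scored on the full dataset.
baseline_items = dataset_item_ids
print(len(validation_item_ids), len(baseline_items))  # 1 5
```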
Add rich metadata to each experiment so the UI can aggregate and
visualize the optimization trajectory. Key changes:

- step_index increments only when candidate changes (not per eval)
- candidate_id is stable across re-evaluations of the same prompt
- parent_candidate_ids always set correctly for derived candidates
- New metadata fields: batch_index, num_items, capture_traces, eval_purpose
- Refactor optimizer package: protocol + factory pattern for registration
- Add GEPA adapter bridging GEPA callbacks to framework metadata
- Fix BE tests for experimentScores null and queue routing
- Add docs: ADDING_AN_OPTIMIZER.md and GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove register_optimizer public API and OptimizerFactory class;
  replace with a simple dict in _load_registry()
- framework_runner: avoid holding full dataset items in memory
- Update docs and tests to match simplified factory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in CompareTrialsPage.tsx: keep both workspaceName
(for useExperimentsList) and canViewDatasets permission guard from main,
plus isEvaluationSuite prop from our branch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iments

- Replace sequential step_index counter with parent-lineage derivation
  (max parent step + 1), so all re-evaluations of the same candidate
  share the same step_index
- Ensure every non-baseline experiment carries parent_candidate_ids,
  enabling the UI to draw lineage graphs
- Pass batch_index, num_items, capture_traces, and eval_purpose through
  to experiment metadata for richer visualization
- Revert runner scripts to direct invocation (remove runner_common.py)
- Update unit tests to match new metadata contract

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
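The parent-lineage derivation of step_index can be sketched like this (hypothetical candidate records; the real metadata contract lives in the optimizer framework):

```python
# Hypothetical sketch: derive step_index from parent lineage
# (max parent step + 1) instead of a sequential counter, so all
# re-evaluations of the same candidate share one step_index.
candidates = {
    "baseline": {"parent_candidate_ids": []},
    "c1": {"parent_candidate_ids": ["baseline"]},
    "c2": {"parent_candidate_ids": ["baseline"]},
    "c3": {"parent_candidate_ids": ["c1", "c2"]},  # crossover child
}

def step_index(candidate_id: str) -> int:
    parents = candidates[candidate_id]["parent_candidate_ids"]
    if not parents:
        return 0  # baseline sits at step 0
    return max(step_index(p) for p in parents) + 1

print({cid: step_index(cid) for cid in candidates})
# {'baseline': 0, 'c1': 1, 'c2': 1, 'c3': 2}
```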
- Remove canonical_config_hash from Candidate and TrialResult types,
  candidate_materializer, experiment_executor, and all tests
- Delete util/hashing.py module (unused — GEPA does minibatching so
  config-hash dedup would block valid re-evaluations)
- Merge SdkEventEmitter and LoggingEventEmitter into a single
  EventEmitter class with optional optimization_id
- Update GEPA_IMPLEMENTATION.md to reflect parent_ids tracking fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…through context

- Replace CandidateConfig dataclass with dict[str, Any] type alias
- Add baseline_config field to OptimizationContext (caller-provided, opaque)
- Orchestrator passes baseline_config through without knowing its structure
- Optimizers copy baseline_config and override prompt_messages only
- Remove result_aggregator module (inlined into evaluation_adapter)
- Move gepa imports to runtime (lazy) for optional dependency
- Fix protocol.py training_set/validation_set types to list[dict]
- Update ADDING_AN_OPTIMIZER.md to reflect all changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dependency on gepa

The gepa tests patch gepa.core.adapter.EvaluationBatch and gepa.optimize,
requiring the optional gepa package at import time. Moving them to
tests/library_integration/gepa/ with pytest.importorskip("gepa") keeps
the unit suite fast and dependency-free.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep progress

Optimizers no longer receive or call event_emitter directly. The
EvaluationAdapter now auto-detects step_index changes during evaluate()
and emits on_step_started internally. GEPAProgressCallback simplified
to only forward GEPA events to the adapter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
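The adapter-side step detection can be sketched as follows (class and method names are illustrative): the optimizer never touches the emitter; the adapter compares each incoming step_index against the last one it saw and emits on_step_started exactly once per new step.

```python
# Illustrative sketch of auto-detected step progress inside an
# evaluation adapter: emits on_step_started only when the step changes.
class EvaluationAdapter:
    def __init__(self, emitter):
        self._emitter = emitter
        self._last_step = None

    def evaluate(self, candidate, step_index):
        if step_index != self._last_step:
            self._emitter.on_step_started(step_index)
            self._last_step = step_index
        return len(candidate)  # stand-in for a real evaluation score

class RecordingEmitter:
    def __init__(self):
        self.steps = []
    def on_step_started(self, step_index):
        self.steps.append(step_index)

emitter = RecordingEmitter()
adapter = EvaluationAdapter(emitter)
for step, cand in [(0, "base"), (1, "v1"), (1, "v1b"), (2, "v2")]:
    adapter.evaluate(cand, step)
print(emitter.steps)  # [0, 1, 2]
```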
Use caplog to verify logger.info output includes optimization ID and
event details, instead of just checking calls don't crash.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… UI detection

evaluate_optimization_suite_trial was creating experiments without
evaluation_method="evaluation_suite", causing the backend to default
to "dataset". The frontend checkIsEvaluationSuite now uses the explicit
evaluation_method field instead of heuristic score detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Adds a guard to evaluate_suite and evaluate_optimization_suite_trial that
checks dataset.dataset_type == "evaluation_suite" before proceeding. This
prevents silently running an ineffective suite trial on a plain dataset
with no scoring rules.

- Add dataset_type param to Dataset constructor, populated at all call sites
- Add dataset_type property to Dataset
- Add _validate_dataset_is_evaluation_suite in evaluator.py
- Update tests and add rejection test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on flow

evaluate_suite and evaluate_optimization_suite_trial had their entire body
duplicated. Extract shared logic into _run_suite_evaluation, parameterized
by optimization_id and dataset filters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive face-lift for optimizer screens including new KPI cards,
metric comparison cells, configuration diff views, progress charts,
trial status indicators, and backend dataset_item_count support.
Also adds backward compatibility for SDK-based optimizations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Dataset name column: hover icon instead of clickable link
- Split Accuracy into Pass rate + Accuracy columns with compact metric display
- Conditionally hide Accuracy column when no old-type optimizations exist
- Remove Logs/Configuration tabs from single optimization page
- Fall back to studio_config for configuration display on old optimizations
- Chart tooltip: remove pass rate percentage background color
- Fix dataset hover icon vertical centering
- Restore feature toggle for optimization studio

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add "evaluating" status for scored candidates not yet selected
- Best candidate always "passed" (never "evaluating")
- Score < best → immediately pruned
- Sibling with children (including ghost) → pruned
- Pulsing animation on last passed candidate at highest step
  (not always on best score)
- After completion: descendants or best = passed, rest = pruned

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
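Two of the status rules above can be sketched in a simplified form (this is not the real component logic, just the decision order for scored candidates):

```python
# Simplified sketch of two trial-status rules: the best candidate is
# always "passed"; a scored candidate below the best score is pruned
# immediately; anything else stays "evaluating" until selected.
def candidate_status(score, best_score, is_best):
    if is_best:
        return "passed"      # best candidate is never "evaluating"
    if score < best_score:
        return "pruned"      # score < best -> immediately pruned
    return "evaluating"      # scored but not yet selected

print(candidate_status(0.9, 0.9, is_best=True))   # passed
print(candidate_status(0.7, 0.9, is_best=False))  # pruned
print(candidate_status(0.9, 0.9, is_best=False))  # evaluating
```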
itamargolan and others added 2 commits March 16, 2026 22:16
- Ghost dot participates in same overlap grouping as regular dots
- Remove dead GHOST_OVERLAP_OFFSET_PER_DOT constant
- Move GHOST_ID to module scope
- Add null guard in computeInProgressStatus (remove non-null assertion)
- Show trend arrow when baseline is 0% (icon-only for infinite change)
- Refactor computeCandidateStatuses into focused helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eChartData

14 tests covering: baseline, running, evaluating, passed, pruned states,
ghost parent pruning, descendant detection, in-progress vs completed logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CometActions
Collaborator

🌙 Nightly cleanup: The test environment for this PR (pr-5554) has been cleaned up to free cluster resources. PVCs are preserved — re-deploy to restore the environment.

@CometActions CometActions removed the test-environment Deploy Opik adhoc environment label Mar 17, 2026
Candidates at the same step but from different parents are independent
branches and should not prune each other. Changed grouping key from
stepIndex to sorted parentCandidateIds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
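The grouping-key change can be sketched with illustrative data: candidates at the same step but with different parents land in different groups, so independent branches no longer prune each other.

```python
# Illustrative sketch: group candidates for pruning by their sorted
# parent IDs instead of by step index.
from collections import defaultdict

candidates = [
    {"id": "c1", "step": 2, "parent_candidate_ids": ["a"]},
    {"id": "c2", "step": 2, "parent_candidate_ids": ["b"]},
    {"id": "c3", "step": 2, "parent_candidate_ids": ["a"]},
]

groups = defaultdict(list)
for cand in candidates:
    key = tuple(sorted(cand["parent_candidate_ids"]))  # was: cand["step"]
    groups[key].append(cand["id"])

print(dict(groups))  # {('a',): ['c1', 'c3'], ('b',): ['c2']}
```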
@itamargolan itamargolan added the test-environment Deploy Opik adhoc environment label Mar 17, 2026
@github-actions
Contributor

🔄 Test environment deployment process has started

Phase 1: Deploying base version 1.10.40 (from main branch) if environment doesn't exist
Phase 2: Building new images from PR branch itamar/optimizer-screens-face-lift
Phase 3: Will deploy newly built version after build completes

You can monitor the progress here.

@CometActions
Collaborator

Test environment is now available!

To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml)

Access Information

The deployment has completed successfully and the version has been verified.

- Show objective metric name instead of "Accuracy" in KPI card and table
- Pass baselineCandidate via column meta instead of scanning per cell
  (reduces O(3*R*N) to O(1) baseline lookups per render)
- Rename buildDescendantsSet → buildAncestorSet (matches traversal direction)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shared helper for objective column/card label used by both
useOptimizationColumns and getMetricKPICardConfigs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itamargolan itamargolan added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Mar 17, 2026
@github-actions
Contributor

🔄 Test environment deployment process has started

Phase 1: Deploying base version 1.10.40 (from main branch) if environment doesn't exist
Phase 2: Building new images from PR branch itamar/optimizer-screens-face-lift
Phase 3: Will deploy newly built version after build completes

You can monitor the progress here.

@CometActions
Collaborator

Test environment is now available!

To configure additional environment variables for your environment, run the [Deploy Opik AdHoc Environment workflow](https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml)

Access Information

The deployment has completed successfully and the version has been verified.

JetoPistola previously approved these changes Mar 17, 2026
Contributor

@JetoPistola JetoPistola left a comment


BE changes look good ✅

All review findings have been addressed — either fixed in code or acknowledged with rationale:

  • Sentinel nil UUID → replaced with early return
  • MATERIALIZE INDEX → added to migration
  • Aggregation field test coverage → added
  • datasetItemCount redundancy → field removed
  • LIKE metacharacter escaping → fixed
  • Blocking JDBI call → acknowledged, deferred to follow-up
  • argMax tiebreaking → accepted as intentional
  • Weighted p50 approximation → acknowledged
  • Score clamping → intentional (pass rates are 0-1)
  • Feature flag removal → intentional (dataset versioning rollout)

🤖 Review posted via /review-github-pr

aadereiko previously approved these changes Mar 17, 2026
Collaborator

@aadereiko aadereiko left a comment


The FE parts look good to go. The remaining comments are nits; feel free to fix them in a follow-up PR.

…mizations

- Extract hover tooltip into ChartTooltip.tsx component
- Move getOptimizationMetadata, aggregateCandidates, mergeExperimentScores
  from useOptimizationExperiments to lib/optimizations.ts
- Move CANDIDATE_SORT_FIELD_MAP, sortCandidates to lib/optimizations.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@itamargolan itamargolan dismissed stale reviews from aadereiko and JetoPistola via 962d311 March 17, 2026 18:04
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator

@aadereiko aadereiko left a comment


LGTM, thanks for fixing comments!

@YarivHashaiComet
Collaborator

@itamargolan i'm merging this one for you

Member

@andrescrz andrescrz left a comment


As agreed, I left my post-merge feedback. My main feedback goes around:

  • Experiment aggregations coverage.
  • Dataset version coverage.
  • Query performance, mostly traversing workspaces.

The rest is minor considerations.

Optional<Dataset> findByName(@Bind("workspace_id") String workspaceId, @Bind("name") String name);

@SqlQuery("SELECT id FROM datasets WHERE workspace_id = :workspace_id AND name LIKE CONCAT('%', :name, '%') ESCAPE '\\\\'")
List<UUID> findIdsByPartialName(@Bind("workspace_id") String workspaceId, @Bind("name") String name);
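The escaping concern behind the `ESCAPE` clause can be illustrated with a short sketch (not the actual backend code, which is Java): `%` and `_` are LIKE wildcards, so a raw search term must have them, and the escape character itself, escaped before being wrapped in `CONCAT('%', :name, '%')`.

```python
# Illustrative sketch of LIKE-metacharacter escaping: '%' matches any
# run of characters and '_' matches any single character, so a literal
# search term must escape them (and the escape character itself) first.
def escape_like(term: str, escape: str = "\\") -> str:
    for ch in (escape, "%", "_"):
        term = term.replace(ch, escape + ch)
    return term

print(escape_like("100%_done"))  # 100\%\_done
```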
Member


There's an existing method for this functionality that should be reused instead of adding a new one.

In addition, this is the old table. Currently dataset versioning is enabled, you need to check if that versioning is covered in your PR. Otherwise, your changes might not work completely.

For dataset_versions, you need to cover indexing on workspaceId and name, I believe that's not the case.

Contributor Author


Investigated this. dataset_versions does not have a name column — version names ("v1", "v2") are computed via ROW_NUMBER() in queries, not stored. Versions inherit names from the parent datasets table via the dataset_id FK. So DatasetDAO.findIdsByPartialName on the datasets table is correct — it returns the dataset IDs whose names match, and those same IDs are the parent of any versions.

If you had a different table/method in mind for reuse, happy to look into it — but the current approach seems correct given the schema.

Member

@andrescrz andrescrz Mar 20, 2026


Yes, I meant the find method in this class, which also finds by partial name matching. Additionally, it's paginated, limited, and sorted. The only difference is that it returns all fields, but in MySQL that's fine; you can just ignore everything but the id.

The idea is not to duplicate existing methods, and to always limit and paginate unbounded methods.


Labels

  • Backend
  • dependencies (Pull requests that update a dependency file)
  • Frontend
  • java (Pull requests that update Java code)
  • Python SDK
  • python (Pull requests that update Python code)
  • test-environment (Deploy Opik adhoc environment)
  • tests (Including test files, or tests related like configuration)
  • typescript (*.ts, *.tsx)

10 participants