[OPIK-4727] [FE] feat: optimizer screens face-lift #5554
YarivHashai merged 92 commits into main
Conversation
Implements a new optimization framework (`apps/opik-optimizer`) that decouples optimizer algorithms from experiment execution, persistence, and UI concerns. Integrates via the existing optimization studio pipeline (Redis queue → Python backend → subprocess).

Key components:
- Orchestrator: central lifecycle controller with sampler, validator, materializer, result aggregator, and event emitter
- StupidOptimizer: 2-step test optimizer (3 candidates → best → 2 more)
- EvaluationAdapter: wraps the SDK's evaluate_optimization_suite_trial()
- Backend integration: new Redis queue, framework_optimizer job processor, framework_runner subprocess entry point

Also adds evaluate_optimization_suite_trial() to the Python SDK, combining optimization trial linkage with evaluation suite behavior (evaluators and execution policy come from the dataset).

53 unit + integration tests passing. Verified end-to-end against Comet cloud with real LLM calls, the UI progress chart, prompt display, and score tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix AttributeError in framework_runner.py: dataset.get_items() returns dicts, so use item["id"] instead of item.id
- Fix the hard-coded hex color in TrialPassedCell.tsx: use the text-success CSS class instead of text-[#12B76A] for proper dark-theme support

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add opik:optimizer-framework to the default RQ queue names so framework jobs actually get consumed by workers
- Add a dataset-size guard in the orchestrator before sample_split to provide a clear error message for datasets with fewer than 2 items
- Extract a shared optimizer_job_helper.py to deduplicate identical logic between optimizer.py and framework_optimizer.py
- Extract a checkIsEvaluationSuite helper in optimizations.ts to deduplicate the predicate shared between CompareTrialsPage and useCompareOptimizationsData
- Fix the hardcoded "pass_rate" in experiment_executor.py to use the actual metric_type parameter

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e-item datasets

Splits the combined feedback/experiment scores into distinct fields in the Optimization API and DAO so the frontend can fall back to experiment_scores when feedback_scores lack the objective. Allows single-item datasets by returning a train-only split instead of raising. Extracts shared runner environment setup into runner_common.py.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
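The split behavior described above can be sketched roughly as follows. This is a hypothetical illustration: the `SampleSplit`/`sample_split` names and signature are assumptions, not the framework's actual API.

```python
import random
from dataclasses import dataclass, field


@dataclass
class SampleSplit:
    # Hypothetical container for the train/validation item id split.
    train_item_ids: list[str]
    validation_item_ids: list[str] = field(default_factory=list)


def sample_split(
    item_ids: list[str], validation_ratio: float = 0.2, seed: int = 42
) -> SampleSplit:
    """Split dataset item ids into train/validation sets.

    Single-item datasets get a train-only split instead of raising,
    matching the behavior described in the commit above.
    """
    if not item_ids:
        raise ValueError("Cannot optimize over an empty dataset")
    if len(item_ids) < 2:
        # Train-only split: no items left over for validation.
        return SampleSplit(train_item_ids=list(item_ids))
    shuffled = list(item_ids)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_ratio))
    return SampleSplit(
        train_item_ids=shuffled[n_val:], validation_item_ids=shuffled[:n_val]
    )
```

With a 5-item dataset and the default 0.2 ratio, this yields a 4/1 split, which is also why evaluating a baseline only on the validation side (as fixed later in this PR) would use just one item.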
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The baseline was evaluated on split.validation_item_ids, which with an 80/20 split ratio meant only 1 out of 5 items was used. This gave an unrepresentative baseline score. Now it uses the full dataset_item_ids list.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add rich metadata to each experiment so the UI can aggregate and visualize the optimization trajectory.

Key changes:
- step_index increments only when the candidate changes (not per eval)
- candidate_id is stable across re-evaluations of the same prompt
- parent_candidate_ids is always set correctly for derived candidates
- New metadata fields: batch_index, num_items, capture_traces, eval_purpose
- Refactor the optimizer package: protocol + factory pattern for registration
- Add a GEPA adapter bridging GEPA callbacks to framework metadata
- Fix BE tests for experimentScores null and queue routing
- Add docs: ADDING_AN_OPTIMIZER.md and GEPA_IMPLEMENTATION.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove the register_optimizer public API and OptimizerFactory class; replace with a simple dict in _load_registry()
- framework_runner: avoid holding full dataset items in memory
- Update docs and tests to match the simplified factory

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflict in CompareTrialsPage.tsx: keep both workspaceName (for useExperimentsList) and canViewDatasets permission guard from main, plus isEvaluationSuite prop from our branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iments

- Replace the sequential step_index counter with parent-lineage derivation (max parent step + 1), so all re-evaluations of the same candidate share the same step_index
- Ensure every non-baseline experiment carries parent_candidate_ids, enabling the UI to draw lineage graphs
- Pass batch_index, num_items, capture_traces, and eval_purpose through to experiment metadata for richer visualization
- Revert runner scripts to direct invocation (remove runner_common.py)
- Update unit tests to match the new metadata contract

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
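The parent-lineage derivation can be sketched as below. Function and variable names are illustrative, not the framework's actual identifiers:

```python
def derive_step_index(
    candidate_id: str,
    parent_ids: list[str],
    step_by_candidate: dict[str, int],
) -> int:
    """Derive step_index from lineage rather than a sequential counter.

    The baseline (no parents) is step 0; any derived candidate sits at
    max(parent step) + 1. Re-evaluations of an already-seen candidate
    reuse its recorded step_index instead of incrementing a counter.
    """
    if candidate_id in step_by_candidate:
        return step_by_candidate[candidate_id]
    step = 0 if not parent_ids else max(step_by_candidate[p] for p in parent_ids) + 1
    step_by_candidate[candidate_id] = step
    return step
```

This is what lets the UI place all re-evaluations of one candidate on the same x-position while still drawing lineage edges between steps.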
- Remove canonical_config_hash from the Candidate and TrialResult types, candidate_materializer, experiment_executor, and all tests
- Delete the util/hashing.py module (unused: GEPA does minibatching, so config-hash dedup would block valid re-evaluations)
- Merge SdkEventEmitter and LoggingEventEmitter into a single EventEmitter class with an optional optimization_id
- Update GEPA_IMPLEMENTATION.md to reflect the parent_ids tracking fixes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…through context

- Replace the CandidateConfig dataclass with a dict[str, Any] type alias
- Add a baseline_config field to OptimizationContext (caller-provided, opaque)
- The orchestrator passes baseline_config through without knowing its structure
- Optimizers copy baseline_config and override prompt_messages only
- Remove the result_aggregator module (inlined into evaluation_adapter)
- Move gepa imports to runtime (lazy) for the optional dependency
- Fix protocol.py training_set/validation_set types to list[dict]
- Update ADDING_AN_OPTIMIZER.md to reflect all changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…dependency on gepa
The gepa tests patch gepa.core.adapter.EvaluationBatch and gepa.optimize,
requiring the optional gepa package at import time. Moving them to
tests/library_integration/gepa/ with pytest.importorskip("gepa") keeps
the unit suite fast and dependency-free.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ep progress

Optimizers no longer receive or call event_emitter directly. The EvaluationAdapter now auto-detects step_index changes during evaluate() and emits on_step_started internally. GEPAProgressCallback is simplified to only forward GEPA events to the adapter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
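The auto-detection described above amounts to comparing each candidate's step_index against the last one seen. A minimal sketch, with hypothetical shapes (the real adapter wraps evaluate_optimization_suite_trial and takes richer candidate objects):

```python
class EvaluationAdapter:
    """Sketch of step auto-detection; names and shapes are illustrative."""

    def __init__(self, emitter) -> None:
        self._emitter = emitter
        self._last_step: int | None = None

    def evaluate(self, candidate: dict) -> dict:
        step = candidate["step_index"]
        # Emit on_step_started only when the step changes, so optimizers
        # never need to touch the event emitter themselves.
        if step != self._last_step:
            self._last_step = step
            self._emitter.on_step_started(step)
        # ... the real adapter would run the suite evaluation here ...
        return {"candidate_id": candidate["id"], "step_index": step}


class RecordingEmitter:
    """Test double that records which steps were started."""

    def __init__(self) -> None:
        self.started_steps: list[int] = []

    def on_step_started(self, step: int) -> None:
        self.started_steps.append(step)
```

Evaluating two candidates at step 0 and one at step 1 emits on_step_started exactly twice, once per distinct step.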
Use caplog to verify logger.info output includes optimization ID and event details, instead of just checking calls don't crash. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… UI detection

evaluate_optimization_suite_trial was creating experiments without evaluation_method="evaluation_suite", causing the backend to default to "dataset". The frontend checkIsEvaluationSuite now uses the explicit evaluation_method field instead of heuristic score detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Adds a guard to evaluate_suite and evaluate_optimization_suite_trial that checks dataset.dataset_type == "evaluation_suite" before proceeding. This prevents silently running an ineffective suite trial on a plain dataset with no scoring rules.

- Add a dataset_type param to the Dataset constructor, populated at all call sites
- Add a dataset_type property to Dataset
- Add _validate_dataset_is_evaluation_suite in evaluator.py
- Update tests and add a rejection test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on flow

evaluate_suite and evaluate_optimization_suite_trial had their entire bodies duplicated. Extract the shared logic into _run_suite_evaluation, parameterized by optimization_id and dataset filters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comprehensive face-lift for optimizer screens including new KPI cards, metric comparison cells, configuration diff views, progress charts, trial status indicators, and backend dataset_item_count support. Also adds backward compatibility for SDK-based optimizations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Dataset name column: hover icon instead of a clickable link
- Split Accuracy into Pass rate + Accuracy columns with a compact metric display
- Conditionally hide the Accuracy column when no old-type optimizations exist
- Remove the Logs/Configuration tabs from the single optimization page
- Fall back to studio_config for configuration display on old optimizations
- Chart tooltip: remove the pass-rate percentage background color
- Fix dataset hover icon vertical centering
- Restore the feature toggle for optimization studio

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add an "evaluating" status for scored candidates not yet selected
- The best candidate is always "passed" (never "evaluating")
- Score < best → immediately pruned
- Sibling with children (including ghost) → pruned
- Pulsing animation on the last passed candidate at the highest step (not always on the best score)
- After completion: descendants or best = passed, rest = pruned

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
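A much-simplified Python sketch of just the first three rules above. The real logic lives in optimizationChartUtils.ts and also handles siblings, ghost parents, and post-completion states; this illustration assumes a flat list of scored candidates:

```python
def compute_in_progress_statuses(candidates: list[dict]) -> dict[str, str]:
    """Simplified subset of the status rules: the best-scoring candidate
    is 'passed', anything scoring strictly below the best is 'pruned',
    and candidates tied with the best remain 'evaluating'."""
    best = max(candidates, key=lambda c: c["score"])
    statuses: dict[str, str] = {}
    for c in candidates:
        if c["id"] == best["id"]:
            statuses[c["id"]] = "passed"
        elif c["score"] < best["score"]:
            statuses[c["id"]] = "pruned"
        else:
            statuses[c["id"]] = "evaluating"
    return statuses
```

The "evaluating" bucket is what keeps a newly scored candidate from being prematurely marked pruned before the optimizer has decided whether to branch from it.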
- The ghost dot participates in the same overlap grouping as regular dots
- Remove the dead GHOST_OVERLAP_OFFSET_PER_DOT constant
- Move GHOST_ID to module scope
- Add a null guard in computeInProgressStatus (remove the non-null assertion)
- Show the trend arrow when the baseline is 0% (icon-only for infinite change)
- Refactor computeCandidateStatuses into focused helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…eChartData

14 tests covering: baseline, running, evaluating, passed, and pruned states; ghost parent pruning; descendant detection; in-progress vs completed logic

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Candidates at the same step but from different parents are independent branches and should not prune each other. Changed grouping key from stepIndex to sorted parentCandidateIds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
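The grouping-key change can be illustrated as follows. Names are hypothetical Python stand-ins for the TypeScript chart utilities:

```python
from collections import defaultdict


def group_by_parent_lineage(candidates: list[dict]) -> dict[tuple, list[dict]]:
    """Group candidates by their sorted parent ids rather than step index,
    so candidates at the same step but from different parents land in
    independent branches and never prune each other."""
    groups: dict[tuple, list[dict]] = defaultdict(list)
    for c in candidates:
        # Sorting makes the key order-insensitive for multi-parent candidates.
        key = tuple(sorted(c.get("parent_candidate_ids", [])))
        groups[key].append(c)
    return dict(groups)
```

Two step-2 candidates with parents p1 and p2 respectively end up in separate groups, whereas keying on step_index alone would have merged them into one pruning pool.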
🔄 Test environment deployment process has started. Phase 1: Deploying base version. You can monitor the progress here.
✅ Test environment is now available! To configure additional environment variables for your environment, run the Deploy Opik AdHoc Environment workflow (https://github.com/comet-ml/comet-deployment/actions/workflows/deploy_opik_adhoc_env.yaml). Access Information
The deployment has completed successfully and the version has been verified.
- Show the objective metric name instead of "Accuracy" in the KPI card and table
- Pass baselineCandidate via column meta instead of scanning per cell (reduces O(3·R·N) to O(1) baseline lookups per render)
- Rename buildDescendantsSet → buildAncestorSet (matches the traversal direction)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shared helper for objective column/card label used by both useOptimizationColumns and getMetricKPICardConfigs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BE changes look good ✅
All review findings have been addressed — either fixed in code or acknowledged with rationale:
- Sentinel nil UUID → replaced with early return
- MATERIALIZE INDEX → added to migration
- Aggregation field test coverage → added
- datasetItemCount redundancy → field removed
- LIKE metacharacter escaping → fixed
- Blocking JDBI call → acknowledged, deferred to follow-up
- argMax tiebreaking → accepted as intentional
- Weighted p50 approximation → acknowledged
- Score clamping → intentional (pass rates are 0-1)
- Feature flag removal → intentional (dataset versioning rollout)
🤖 Review posted via /review-github-pr
aadereiko
left a comment
The FE parts look good to go; the remaining comments are nits, feel free to fix them in a follow-up PR.
…mizations

- Extract the hover tooltip into a ChartTooltip.tsx component
- Move getOptimizationMetadata, aggregateCandidates, mergeExperimentScores from useOptimizationExperiments to lib/optimizations.ts
- Move CANDIDATE_SORT_FIELD_MAP, sortCandidates to lib/optimizations.ts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aadereiko
left a comment
LGTM, thanks for fixing comments!
@itamargolan I'm merging this one for you
andrescrz
left a comment
As agreed, I left my post-merge feedback. My main feedback centers on:
- Experiment aggregations coverage.
- Dataset version coverage.
- Query performance, mostly traversing workspaces.
The rest is minor considerations.
```java
Optional<Dataset> findByName(@Bind("workspace_id") String workspaceId, @Bind("name") String name);

@SqlQuery("SELECT id FROM datasets WHERE workspace_id = :workspace_id AND name LIKE CONCAT('%', :name, '%') ESCAPE '\\\\'")
List<UUID> findIdsByPartialName(@Bind("workspace_id") String workspaceId, @Bind("name") String name);
```
There's already an existing method for this functionality that should be reused instead of adding a new one.
In addition, this is the old table. Dataset versioning is currently enabled, so you need to check whether versioning is covered in your PR; otherwise, your changes might not work completely.
For dataset_versions, you need to cover indexing on workspaceId and name; I believe that's not the case.
Investigated this. dataset_versions does not have a name column — version names ("v1", "v2") are computed via ROW_NUMBER() in queries, not stored. Versions inherit names from the parent datasets table via the dataset_id FK. So DatasetDAO.findIdsByPartialName on the datasets table is correct — it returns the dataset IDs whose names match, and those same IDs are the parent of any versions.
If you had a different table/method in mind for reuse, happy to look into it — but the current approach seems correct given the schema.
Yes, I meant the find method in this class, which also finds by partial name matching. Additionally, it's paginated, limited, and sorted. The only difference is that it returns all fields, but in MySQL that's fine; you can just ignore everything but the id.
The idea is not to duplicate existing methods and to always limit and paginate unbounded queries.
Details
Comprehensive face-lift of the optimizer screens UI with improvements spanning both frontend and backend. This PR introduces new KPI cards with metric comparison (baseline vs current), configuration diff views for comparing trial prompts/settings, an optimization progress chart with candidate lineage visualization, trial status indicators, and various UI polish improvements including theme-aware styling, consistent iconography, and lowercase labels. The backend adds dataset_item_count support and a new MUTATION experiment type for GEPA crossover tracking. The optimizer SDK (GEPA v2) gains reflection-based prompt evolution with multi-parent crossover support.

Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
- cd apps/opik-frontend && npm run lint
- cd apps/opik-frontend && npm run build

Documentation
N/A - internal UI improvements