[OPIK-5650] [BE/FE] feat: support evaluation suites in playground#6092
itamargolan merged 41 commits into main
Add backend experiment execution endpoint and frontend eval suite flow so playground can run evaluation suite datasets with server-side assertion processing and poll-based progress tracking. Made-with: Cursor
📋 PR Linter Failed❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the ❌ Missing Section. The description is missing the |
Backend Tests - Integration Group 16: 242 tests, 242 ✅, 5m 57s ⏱️. Results for commit eee9991.
- Switch to @RequiredArgsConstructor convention in EvalSuiteAssertionSampler and ExperimentItemProcessor
- Remove SDK references from comments, rename methods (fetchDatasetEvaluators, getMetadataString, toLangChain4jMessage, etc.)
- Fix log patterns: pass exception as last param instead of e.getMessage()
- Split catch: UncheckedIOException for deserialization, Exception for other errors
- Replace generateDeterministicId with IdGenerator.generateId() (UUID v7)
- Pre-process evaluators outside trace loop via PreparedEvaluator record
- Add dataset version filtering to DatasetItemStreamRequest
- Add null validation for datasetId with BadRequestException
- Extract buildMessagesInput/buildLlmOutput helpers to deduplicate trace/span creation
- Simplify buildTemplateContext using forEach
- Add backward-compatibility comment on OnlineScoringLlmAsJudgeScorer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend Tests - Unit Tests: 1 638 tests, 1 636 ✅, 1m 1s ⏱️. Results for commit 7142907.
…trace Collect unique dataset item IDs upfront, fetch and prepare evaluators once per item, then look up from a map inside the trace loop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
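The prefetch described in this commit can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code from the PR: `Trace`, `PreparedEvaluator` (its fields), and the `fetchAndPrepare` function are hypothetical stand-ins, and the real implementation is presumably reactive.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the PR's types; names and fields are assumptions.
record Trace(UUID id, UUID datasetItemId) {}

record PreparedEvaluator(UUID datasetItemId, String description) {}

final class EvaluatorPrefetchSketch {

    // Collect the unique dataset item IDs, prepare evaluators once per ID,
    // and return a lookup map for use inside the trace loop.
    static Map<UUID, PreparedEvaluator> prefetch(
            List<Trace> traces, Function<UUID, PreparedEvaluator> fetchAndPrepare) {
        Set<UUID> uniqueItemIds = traces.stream()
                .map(Trace::datasetItemId)
                .collect(Collectors.toCollection(LinkedHashSet::new));
        return uniqueItemIds.stream()
                .collect(Collectors.toMap(Function.identity(), fetchAndPrepare));
    }

    public static void main(String[] args) {
        UUID itemA = UUID.randomUUID();
        UUID itemB = UUID.randomUUID();
        List<Trace> traces = List.of(
                new Trace(UUID.randomUUID(), itemA),
                new Trace(UUID.randomUUID(), itemA), // second trace for the same item
                new Trace(UUID.randomUUID(), itemB));

        Map<UUID, PreparedEvaluator> byItem = prefetch(
                traces, id -> new PreparedEvaluator(id, "evaluators for " + id));

        // Inside the loop, each trace does a map lookup instead of a fetch.
        for (Trace trace : traces) {
            PreparedEvaluator evaluator = byItem.get(trace.datasetItemId());
            System.out.println(trace.id() + " -> " + evaluator.description());
        }
    }
}
```

With three traces over two dataset items, the fetch function runs twice rather than three times; the saving grows with the number of traces per item.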
…aluators Pass the userName from TracesCreated event through to the reactive context instead of hardcoding "system". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h and fix unit tests

- Rename metadata key to eval_suite_dataset_version_hash across BE, FE, and tests
- Fix ExperimentExecutionServiceTest: add datasetId to test requests to match the null-safety validation added earlier
- Update test for missing datasetId to assert BadRequestException

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add getItemEvaluatorsByDatasetId to DAO/service for single-query batch fetch of all item evaluators in a dataset version
- Refactor EvalSuiteAssertionSampler to use batch fetch instead of per-item reactive calls
- Update RunOnDatasetDialog to reflect dataset/evaluation suite choice with dynamic button text and labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The FE-orchestrated path (createLogPlaygroundProcessor) is only used for regular datasets. Eval suites use the BE-orchestrated path exclusively, so evalSuiteDatasetId, evalSuiteVersionHash, and evaluationMethod fields on LogQueueParams were never set and the related trace metadata / experiment blocks were unreachable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The prefetchItemEvaluators method was missing USER_NAME in contextWrite, causing makeFluxContextAware to throw NoSuchElementException (silently caught), which meant item-level assertions were never calculated. Also fixes test config construction to use getJsonNodeFromString instead of readTree to properly parse evaluator config JSON. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add UUID validation for eval_suite_dataset_item_id in EvalSuiteAssertionSampler
- Fix ClickHouse dedup ordering in DatasetItemVersionDAO (filter after LIMIT 1 BY)
- Add EXPERIMENT_STATUS enum and use constants instead of string literals
- Add two-phase polling (running → evaluating) for eval suite experiments
- Extract nested ternary into helper function in RunOnDatasetDialog
- Add progress indicator with phase-aware display (running/evaluating)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…stead of UserMessage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove broad catch(Exception) that silently swallowed runtime errors. Keep only UncheckedIOException for deserialization failures and add evaluator config to the log for debuggability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
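A rough sketch of the narrowed error handling this commit describes, under stated assumptions: `parseConfig` and `tryParse` are hypothetical names, the "parser" is a toy stand-in for real JSON deserialization, and `System.err` stands in for the project's logger.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;
import java.util.Optional;

final class NarrowCatchSketch {

    // Toy stand-in for deserialization: malformed input surfaces as
    // UncheckedIOException, while a null argument is a programming error.
    static String parseConfig(String rawConfig) {
        Objects.requireNonNull(rawConfig, "rawConfig");
        if (!rawConfig.startsWith("{")) {
            throw new UncheckedIOException(new IOException("malformed evaluator config"));
        }
        return rawConfig.trim();
    }

    static Optional<String> tryParse(String rawConfig) {
        try {
            return Optional.of(parseConfig(rawConfig));
        } catch (UncheckedIOException e) {
            // Deserialization failure: report it with the offending config
            // and skip this evaluator.
            System.err.println("Failed to deserialize evaluator config '" + rawConfig + "': " + e);
            return Optional.empty();
        }
        // Deliberately no catch (Exception e): other runtime errors,
        // such as the NullPointerException above, propagate to the caller.
    }

    public static void main(String[] args) {
        System.out.println(tryParse("{\"threshold\": 1}").isPresent());
        System.out.println(tryParse("not json").isPresent());
    }
}
```

The point of the narrower catch is that unexpected failures stay visible instead of being logged and dropped alongside ordinary bad input.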
…torMapper Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
thiagohora left a comment
Thanks for addressing all my comments. Only a few remaining issues and I'll approve. Let's address mainly the reactive issues
…afe score routing Replace implicit string-based categoryName check with a ScoreDestination enum to make score routing (feedback_scores vs assertion_results) explicit and type-safe. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code quality

- Wrap blocking LLM call in Mono.fromCallable with boundedElastic scheduler
- Remove redundant Mono.defer and subscribeOn from subscriber
- Add TTL to Redis failure counter to prevent memory leaks
- Extract PersistenceContext record to reduce parameter count
- Add projectName to ExperimentItem creation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sionHash, parallelize trace+span creation

- Distinguish user errors (invalid versionHash) from transient DB failures in fetchDatasetExecutionPolicy
- Run trace and span creation in parallel via Mono.when()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Backend Tests - Integration Group 1: 23 files, 23 suites, 3m 0s ⏱️. Results for commit 3d28528.
    private record LlmCallResult(
            ChatCompletionResponse response,
            String errorType,
            String errorMessage,
            Instant startTime,
            Instant endTime) {
LlmCallResult lacks @Builder(toBuilder = true) and is instantiated directly. Should we add that annotation and switch instantiations to LlmCallResult.builder()...build()?
new LlmCallResult(...) => LlmCallResult.builder()...build()
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentItemProcessor.java
around lines 28-33, the record LlmCallResult is defined without a builder; annotate it
with @Builder(toBuilder = true) (and add the Lombok import if missing) so it becomes a
proper DTO per the project convention. Then update all instantiations around lines 47-53
(the places returning new LlmCallResult(...)) to use
LlmCallResult.builder().response(...).errorType(...).errorMessage(...).startTime(...).endTime(...).build()
(set only the fields used in each branch), replacing positional constructors so callers
get a toBuilder() hook and comply with the documented pattern.
    @DisplayName("null categoryName resolves to FEEDBACK_SCORES")
    void nullCategoryResolvesToFeedbackScores() {
        ScoreDestination destination = SUITE_ASSERTION_CATEGORY.equals(null)
                ? ScoreDestination.ASSERTION_RESULTS
                : ScoreDestination.FEEDBACK_SCORES;

        assertThat(destination).isEqualTo(ScoreDestination.FEEDBACK_SCORES);
    }

    @Test
    @DisplayName("arbitrary categoryName resolves to FEEDBACK_SCORES")
    void arbitraryCategoryResolvesToFeedbackScores() {
nullCategoryResolvesToFeedbackScores and arbitraryCategoryResolvesToFeedbackScores duplicate the same assertion — should we collapse them into a single @ParameterizedTest with @NullSource and @ValueSource(strings = "some_other_category")?
@ParameterizedTest
@NullSource
@ValueSource(strings = "some_other_category")
void categoryResolvesToFeedbackScores(String category) { /* ... */ }
Finding type: Use parameterized tests | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/test/java/com/comet/opik/api/ScoreDestinationTest.java around
lines 26-43, the methods `nullCategoryResolvesToFeedbackScores` and
`arbitraryCategoryResolvesToFeedbackScores` duplicate the same logic (calling
SUITE_ASSERTION_CATEGORY.equals(...) and asserting FEEDBACK_SCORES) with only the input
literal differing. Replace both tests with a single parameterized test: annotate a new
method with @ParameterizedTest, add @NullSource and @ValueSource(strings =
"some_other_category"), accept a String parameter for the category, compute the
destination the same way, and assert it equals ScoreDestination.FEEDBACK_SCORES. Also
add the necessary imports for ParameterizedTest, NullSource, and ValueSource and remove
the two original test methods.
    @lombok.Builder
    record PersistenceContext(
            @NonNull UUID traceId,
            @NonNull String projectName,
            @NonNull ExperimentExecutionRequest.PromptVariant prompt,
            @NonNull List<ExperimentExecutionRequest.PromptVariant.Message> renderedMessages,
            ChatCompletionResponse llmResponse,
            String errorType,
            String errorMessage,
            @NonNull Instant startTime,
            @NonNull Instant endTime,
            @NonNull UUID experimentId,
PersistenceContext uses @lombok.Builder without toBuilder = true — should we switch to @lombok.Builder(toBuilder = true) to match the DTO convention in .agents/skills/opik-backend/SKILL.md?
Finding type: AI Coding Guidelines | Severity: 🟢 Low
Prompt for AI Agents:
Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentTracePersistence.java
around lines 38 to 53, the PersistenceContext record is annotated with @lombok.Builder
but missing toBuilder = true. Update the annotation to @lombok.Builder(toBuilder = true)
on the PersistenceContext record declaration so callers can call toBuilder() to
clone/modify instances. Make no other changes to the record.
Backend Tests - Integration Group 14: 242 tests, 242 ✅, 8m 17s ⏱️. Results for commit 3d28528.
Python SDK E2E Tests Results (Python 3.12): 365 tests ±0, 363 ✅ ±0, 14m 41s ⏱️ -4s. Results for commit 3d28528. ± Comparison against base commit bde0715. This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.13): 365 tests ±0, 363 ✅ ±0, 14m 28s ⏱️ -3s. Results for commit 3d28528. ± Comparison against base commit bde0715. This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.14): 365 tests ±0, 363 ✅ ±0, 14m 12s ⏱️ -30s. Results for commit 3d28528. ± Comparison against base commit bde0715. This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.10): 365 tests ±0, 363 ✅ ±0, 14m 34s ⏱️ -10s. Results for commit 3d28528. ± Comparison against base commit bde0715. This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
…instead of storing it Make scoreDestination() a derived method on FeedbackScoreItem that computes routing from categoryName, eliminating the stored field. This ensures correct routing for all entry points (JSON API, internal scorer, builder) with a single source of truth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    private static final String SUITE_ASSERTION_CATEGORY = "suite_assertion";

    public static ScoreDestination fromCategoryName(String categoryName) {
        return SUITE_ASSERTION_CATEGORY.equals(categoryName)
                ? ASSERTION_RESULTS
SUITE_ASSERTION_CATEGORY is duplicated in ScoreDestination and EvalSuiteAssertionSampler. Should we centralize it in ScoreDestination and have the sampler reuse it?
Finding type: Code Dedup and Conventions | Severity: 🟢 Low
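One way the suggested centralization could look, as a sketch only. The enum follows the snippet quoted in this comment, with the `FEEDBACK_SCORES` fallback taken from the tests quoted earlier in the thread. Making the constant `public` is an assumption, and `AssertionSamplerSketch` is a hypothetical stand-in for EvalSuiteAssertionSampler, not its real shape.

```java
// Sketch only: the constant's visibility and the sampler class are assumptions.
enum ScoreDestination {
    FEEDBACK_SCORES,
    ASSERTION_RESULTS;

    // Single definition of the category string, exposed so other classes can reuse it.
    public static final String SUITE_ASSERTION_CATEGORY = "suite_assertion";

    public static ScoreDestination fromCategoryName(String categoryName) {
        // Constant-first equals is null-safe: a null category yields FEEDBACK_SCORES.
        return SUITE_ASSERTION_CATEGORY.equals(categoryName)
                ? ASSERTION_RESULTS
                : FEEDBACK_SCORES;
    }
}

// Hypothetical stand-in for the sampler: it references the enum's constant
// instead of declaring its own copy of the string.
final class AssertionSamplerSketch {
    static String categoryForSuiteAssertions() {
        return ScoreDestination.SUITE_ASSERTION_CATEGORY;
    }

    public static void main(String[] args) {
        System.out.println(ScoreDestination.fromCategoryName(categoryForSuiteAssertions()));
        System.out.println(ScoreDestination.fromCategoryName(null));
    }
}
```

With a single definition, the category the sampler writes and the category the router matches cannot drift apart.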
thiagohora left a comment
Thanks for the patience, great work!
Details
Adds BE-orchestrated experiment execution for evaluation suites in the playground. When a user runs a dataset that is an evaluation suite, the frontend delegates to a new backend endpoint (`POST /v1/private/experiments/execute`) instead of doing client-side streaming. The backend creates experiments per prompt variant, processes dataset items asynchronously (LLM calls, trace/span creation), and triggers assertion evaluation via the existing online scoring pipeline.

Key changes:
- `ExperimentExecutionService` orchestrates experiment creation, async item processing, and status transitions
- `ExperimentItemProcessor` handles per-item LLM calls and trace/span/experiment-item creation
- `EvalSuiteAssertionSampler` listens for completed traces and enqueues evaluators for LLM-as-judge scoring with `categoryName = "suite_assertion"` so results route to `assertion_results`
- `OnlineScoringLlmAsJudgeScorer` applies optional score-name mapping and category (backward compatible: null values preserve existing behavior)

Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
- `cd apps/opik-backend && mvn compile -DskipTests` (backend compiles cleanly)
- `cd apps/opik-backend && mvn spotless:apply` (formatting passes)
- `OnlineScoringLlmAsJudgeScorer` preserves original score names and null categoryName when `scoreNameMapping`/`categoryName` are null (regular online scoring path)

Documentation
N/A — internal backend changes, no public API documentation updates needed.
Demo video
Screen.Recording.2026-04-06.at.23.07.56.mov