Skip to content

[OPIK-5650] [BE/FE] feat: support evaluation suites in playground#6092

Merged
itamargolan merged 41 commits intomainfrom
itamarg/OPIK-5650/support-eval-suites-in-playground
Apr 9, 2026
Merged

[OPIK-5650] [BE/FE] feat: support evaluation suites in playground#6092
itamargolan merged 41 commits intomainfrom
itamarg/OPIK-5650/support-eval-suites-in-playground

Conversation

@itamargolan
Copy link
Copy Markdown
Contributor

@itamargolan itamargolan commented Apr 6, 2026

Details

Adds BE-orchestrated experiment execution for evaluation suites in the playground. When a user runs a dataset that is an evaluation suite, the frontend delegates to a new backend endpoint (POST /v1/private/experiments/execute) instead of doing client-side streaming. The backend creates experiments per prompt variant, processes dataset items asynchronously (LLM calls, trace/span creation), and triggers assertion evaluation via the existing online scoring pipeline.

Key changes:

  • ExperimentExecutionService orchestrates experiment creation, async item processing, and status transitions
  • ExperimentItemProcessor handles per-item LLM calls and trace/span/experiment-item creation
  • EvalSuiteAssertionSampler listens for completed traces and enqueues evaluators for LLM-as-judge scoring with categoryName = "suite_assertion" so results route to assertion_results
  • OnlineScoringLlmAsJudgeScorer applies optional score-name mapping and category (backward compatible — null values preserve existing behavior)
  • Frontend shows assertion pass/fail status and pass rate in playground output cells for eval suite runs

Change checklist

  • User facing
  • Documentation update

Issues

  • OPIK-5650

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4.6
    • Scope: PR review comment fixes (code style, naming, log patterns, null safety, code deduplication, Javadoc)
    • Human verification: yes — all changes reviewed

Testing

  • cd apps/opik-backend && mvn compile -DskipTests — backend compiles cleanly
  • cd apps/opik-backend && mvn spotless:apply — formatting passes
  • Manual testing: run eval suite experiment in playground, verify pass/fail tags and pass rate display
  • Backward compatibility verified by code inspection: OnlineScoringLlmAsJudgeScorer preserves original score names and null categoryName when scoreNameMapping/categoryName are null (regular online scoring path)

Documentation

N/A — internal backend changes, no public API documentation updates needed.

Demo video

Screen.Recording.2026-04-06.at.23.07.56.mov

Add backend experiment execution endpoint and frontend eval suite
flow so playground can run evaluation suite datasets with server-side
assertion processing and poll-based progress tracking.

Made-with: Cursor
@github-actions github-actions bot added dependencies Pull requests that update a dependency file java Pull requests that update Java code Frontend Backend tests Including test files, or tests related like configuration. typescript *.ts *.tsx labels Apr 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 6, 2026

📋 PR Linter Failed

Missing Section. The description is missing the ## Details section.


Missing Section. The description is missing the ## Change checklist section.


Missing Section. The description is missing the ## Testing section.


Missing Section. The description is missing the ## Documentation section.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 6, 2026

📋 PR Linter Failed

Missing Section. The description is missing the ## Details section.


Missing Section. The description is missing the ## Change checklist section.


Missing Section. The description is missing the ## Testing section.


Missing Section. The description is missing the ## Documentation section.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 6, 2026

Backend Tests - Integration Group 16

242 tests   242 ✅  5m 57s ⏱️
 10 suites    0 💤
 10 files      0 ❌

Results for commit eee9991.

♻️ This comment has been updated with latest results.

- Switch to @requiredargsconstructor convention in EvalSuiteAssertionSampler and ExperimentItemProcessor
- Remove SDK references from comments, rename methods (fetchDatasetEvaluators, getMetadataString, toLangChain4jMessage, etc.)
- Fix log patterns: pass exception as last param instead of e.getMessage()
- Split catch: UncheckedIOException for deserialization, Exception for other errors
- Replace generateDeterministicId with IdGenerator.generateId() (UUID v7)
- Pre-process evaluators outside trace loop via PreparedEvaluator record
- Add dataset version filtering to DatasetItemStreamRequest
- Add null validation for datasetId with BadRequestException
- Extract buildMessagesInput/buildLlmOutput helpers to deduplicate trace/span creation
- Simplify buildTemplateContext using forEach
- Add backward-compatibility comment on OnlineScoringLlmAsJudgeScorer

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 6, 2026

Backend Tests - Unit Tests

1 638 tests   1 636 ✅  1m 1s ⏱️
  209 suites      2 💤
  209 files        0 ❌

Results for commit 7142907.

♻️ This comment has been updated with latest results.

itamargolan and others added 2 commits April 6, 2026 18:40
…trace

Collect unique dataset item IDs upfront, fetch and prepare evaluators
once per item, then look up from a map inside the trace loop.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…aluators

Pass the userName from TracesCreated event through to the reactive
context instead of hardcoding "system".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…h and fix unit tests

- Rename metadata key to eval_suite_dataset_version_hash across BE, FE, and tests
- Fix ExperimentExecutionServiceTest: add datasetId to test requests to
  match the null-safety validation added earlier
- Update test for missing datasetId to assert BadRequestException

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add getItemEvaluatorsByDatasetId to DAO/service for single-query
  batch fetch of all item evaluators in a dataset version
- Refactor EvalSuiteAssertionSampler to use batch fetch instead of
  per-item reactive calls
- Update RunOnDatasetDialog to reflect dataset/evaluation suite choice
  with dynamic button text and labels

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
itamargolan and others added 3 commits April 6, 2026 21:36
The FE-orchestrated path (createLogPlaygroundProcessor) is only used for
regular datasets. Eval suites use the BE-orchestrated path exclusively,
so evalSuiteDatasetId, evalSuiteVersionHash, and evaluationMethod fields
on LogQueueParams were never set and the related trace metadata / experiment
blocks were unreachable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The prefetchItemEvaluators method was missing USER_NAME in contextWrite,
causing makeFluxContextAware to throw NoSuchElementException (silently
caught), which meant item-level assertions were never calculated.

Also fixes test config construction to use getJsonNodeFromString instead
of readTree to properly parse evaluator config JSON.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add UUID validation for eval_suite_dataset_item_id in EvalSuiteAssertionSampler
- Fix ClickHouse dedup ordering in DatasetItemVersionDAO (filter after LIMIT 1 BY)
- Add EXPERIMENT_STATUS enum and use constants instead of string literals
- Add two-phase polling (running → evaluating) for eval suite experiments
- Extract nested ternary into helper function in RunOnDatasetDialog
- Add progress indicator with phase-aware display (running/evaluating)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot removed the dependencies Pull requests that update a dependency file label Apr 7, 2026
@itamargolan itamargolan marked this pull request as ready for review April 7, 2026 02:59
@itamargolan itamargolan requested a review from a team as a code owner April 7, 2026 02:59
Address baz reviewer comment: use Lombok @builder(toBuilder = true) and
@nonnull on DatasetEvaluatorsResult and PreparedEvaluator records per
project DTO conventions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
itamargolan and others added 2 commits April 8, 2026 15:13
…stead of UserMessage

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove broad catch(Exception) that silently swallowed runtime errors.
Keep only UncheckedIOException for deserialization failures and add
evaluator config to the log for debuggability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…torMapper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@thiagohora thiagohora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all my comments. Only a few remaining issues and I'll approve. Let's address mainly the reactive issues

Copy link
Copy Markdown
Contributor

@thiagohora thiagohora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor items from the latest pass.

itamargolan and others added 3 commits April 9, 2026 07:30
…afe score routing

Replace implicit string-based categoryName check with a ScoreDestination enum
to make score routing (feedback_scores vs assertion_results) explicit and type-safe.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…code quality

- Wrap blocking LLM call in Mono.fromCallable with boundedElastic scheduler
- Remove redundant Mono.defer and subscribeOn from subscriber
- Add TTL to Redis failure counter to prevent memory leaks
- Extract PersistenceContext record to reduce parameter count
- Add projectName to ExperimentItem creation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sionHash, parallelize trace+span creation

- Distinguish user errors (invalid versionHash) from transient DB failures
  in fetchDatasetExecutionPolicy
- Run trace and span creation in parallel via Mono.when()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@itamargolan itamargolan requested a review from thiagohora April 9, 2026 12:24
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Backend Tests - Integration Group 1

 23 files   23 suites   3m 0s ⏱️
413 tests 413 ✅ 0 💤 0 ❌
340 runs  340 ✅ 0 💤 0 ❌

Results for commit 3d28528.

♻️ This comment has been updated with latest results.

Comment on lines +28 to +33
private record LlmCallResult(
ChatCompletionResponse response,
String errorType,
String errorMessage,
Instant startTime,
Instant endTime) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LlmCallResult lacks @Builder(toBuilder = true) and is instantiated directly, should we add that annotation and switch instantiations to LlmCallResult.builder()...build()?

new LlmCallResult(...) => LlmCallResult.builder()...build()

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentItemProcessor.java
around lines 28-33, the record LlmCallResult is defined without a builder; annotate it
with @Builder(toBuilder = true) (and add the Lombok import if missing) so it becomes a
proper DTO per the project convention. Then update all instantiations around lines 47-53
(the places returning new LlmCallResult(...)) to use
LlmCallResult.builder().response(...).errorType(...).errorMessage(...).startTime(...).endTime(...).build()
(set only the fields used in each branch), replacing positional constructors so callers
get a toBuilder() hook and comply with the documented pattern.

Comment on lines +26 to +37
@DisplayName("null categoryName resolves to FEEDBACK_SCORES")
void nullCategoryResolvesToFeedbackScores() {
ScoreDestination destination = SUITE_ASSERTION_CATEGORY.equals(null)
? ScoreDestination.ASSERTION_RESULTS
: ScoreDestination.FEEDBACK_SCORES;

assertThat(destination).isEqualTo(ScoreDestination.FEEDBACK_SCORES);
}

@Test
@DisplayName("arbitrary categoryName resolves to FEEDBACK_SCORES")
void arbitraryCategoryResolvesToFeedbackScores() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nullCategoryResolvesToFeedbackScores and arbitraryCategoryResolvesToFeedbackScores duplicate the same assertion — should we collapse them into a single @ParameterizedTest with @NullSource and @ValueSource(strings = "some_other_category")?

@ParameterizedTest
@NullSource
@valuesource(strings = "some_other_category")
void categoryResolvesToFeedbackScores(String category) { /* ... */ }

Finding type: Use parameterized tests | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/test/java/com/comet/opik/api/ScoreDestinationTest.java around
lines 26-43, the methods `nullCategoryResolvesToFeedbackScores` and
`arbitraryCategoryResolvesToFeedbackScores` duplicate the same logic (calling
SUITE_ASSERTION_CATEGORY.equals(...) and asserting FEEDBACK_SCORES) with only the input
literal differing. Replace both tests with a single parameterized test: annotate a new
method with @ParameterizedTest, add @NullSource and @ValueSource(strings =
"some_other_category"), accept a String parameter for the category, compute the
destination the same way, and assert it equals ScoreDestination.FEEDBACK_SCORES. Also
add the necessary imports for ParameterizedTest, NullSource, and ValueSource and remove
the two original test methods.

Comment on lines +38 to +49
@lombok.Builder
record PersistenceContext(
@NonNull UUID traceId,
@NonNull String projectName,
@NonNull ExperimentExecutionRequest.PromptVariant prompt,
@NonNull List<ExperimentExecutionRequest.PromptVariant.Message> renderedMessages,
ChatCompletionResponse llmResponse,
String errorType,
String errorMessage,
@NonNull Instant startTime,
@NonNull Instant endTime,
@NonNull UUID experimentId,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PersistenceContext uses @lombok.Builder without toBuilder = true — should we switch to @lombok.Builder(toBuilder = true) to match the DTO convention in .agents/skills/opik-backend/SKILL.md?

Finding type: AI Coding Guidelines | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Other fix methods

Fix in Cursor

Prompt for AI Agents:

Before applying, verify this suggestion against the current code. In
apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentTracePersistence.java
around lines 38 to 53, the PersistenceContext record is annotated with @lombok.Builder
but missing toBuilder = true. Update the annotation to @lombok.Builder(toBuilder = true)
on the PersistenceContext record declaration so callers can call toBuilder() to
clone/modify instances. Make no other changes to the record.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK Compatibility V1 E2E Tests Results (Python 3.11)

119 tests  ±0   119 ✅ ±0   3m 51s ⏱️ -1s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Backend Tests - Integration Group 14

242 tests   242 ✅  8m 17s ⏱️
 23 suites    0 💤
 23 files      0 ❌

Results for commit 3d28528.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK Compatibility V1 E2E Tests Results (Python 3.12)

119 tests  ±0   119 ✅ ±0   4m 5s ⏱️ +12s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK Compatibility V1 E2E Tests Results (Python 3.14)

119 tests  ±0   119 ✅ ±0   3m 57s ⏱️ +5s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK Compatibility V1 E2E Tests Results (Python 3.13)

119 tests  ±0   119 ✅ ±0   3m 38s ⏱️ ±0s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK Compatibility V1 E2E Tests Results (Python 3.10)

119 tests  ±0   119 ✅ ±0   3m 52s ⏱️ -1s
  1 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK E2E Tests Results (Python 3.12)

365 tests  ±0   363 ✅ ±0   14m 41s ⏱️ -4s
  1 suites ±0     2 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d6d29-12dc-77db-a092-74dd8ab795d8]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d725c-1e3b-7e5d-835e-0e5d3b608d9a]

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK E2E Tests Results (Python 3.13)

365 tests  ±0   363 ✅ ±0   14m 28s ⏱️ -3s
  1 suites ±0     2 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d6d28-c583-7483-a48d-c6b29ad9a99d]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d725c-56a4-7d6c-8f48-94ff43cf2fa8]

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK E2E Tests Results (Python 3.14)

365 tests  ±0   363 ✅ ±0   14m 12s ⏱️ -30s
  1 suites ±0     2 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d6d28-b8d6-7bba-8c33-1f926d8a2612]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d725c-5c1c-79b6-b7d0-c3de82f345c3]

♻️ This comment has been updated with latest results.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

Python SDK E2E Tests Results (Python 3.10)

365 tests  ±0   363 ✅ ±0   14m 34s ⏱️ -10s
  1 suites ±0     2 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 3d28528. ± Comparison against base commit bde0715.

This pull request removes 1 and adds 1 tests. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d6d28-f32a-7009-a6fa-3b49dc246710]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019d725c-66bc-736e-94ae-87e32580cf53]

♻️ This comment has been updated with latest results.

…instead of storing it

Make scoreDestination() a derived method on FeedbackScoreItem that computes routing
from categoryName, eliminating the stored field. This ensures correct routing for all
entry points (JSON API, internal scorer, builder) with a single source of truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +13 to +17
private static final String SUITE_ASSERTION_CATEGORY = "suite_assertion";

public static ScoreDestination fromCategoryName(String categoryName) {
return SUITE_ASSERTION_CATEGORY.equals(categoryName)
? ASSERTION_RESULTS
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SUITE_ASSERTION_CATEGORY is duplicated in ScoreDestination and EvalSuiteAssertionSampler, should we centralize it in ScoreDestination and have the sampler reuse it?

Finding type: Code Dedup and Conventions | Severity: 🟢 Low


Want Baz to fix this for you? Activate Fixer

Copy link
Copy Markdown
Contributor

@thiagohora thiagohora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patience, great work!

@itamargolan itamargolan merged commit 4dc8015 into main Apr 9, 2026
79 checks passed
@itamargolan itamargolan deleted the itamarg/OPIK-5650/support-eval-suites-in-playground branch April 9, 2026 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Backend Frontend java Pull requests that update Java code tests Including test files, or tests related like configuration. typescript *.ts *.tsx

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants