[issue-3764] [P SDK] [FE] [BE] [Docs] Introduce experiment scoring functions by jverre · Pull Request #3989 · comet-ml/opik

jverre · 2025-11-07T18:20:39Z

Details

Introduces the concept of experiment scores, which let you log experiment-level scores computed from experiment results. This makes it possible to log aggregate metrics such as f1-score or recall.

from typing import List
from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Hallucination, score_result

# Define an experiment score function
def compute_hallucination_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute the maximum hallucination score across all test results."""
    hallucination_scores = [
        result.score_results[0].value
        for result in test_results
        if result.score_results
    ]
    
    if not hallucination_scores:
        return []
    
    return [
        score_result.ScoreResult(
            name="hallucination_metric (max)",
            value=max(hallucination_scores),
            reason=f"Maximum hallucination score across {len(hallucination_scores)} test cases"
        )
    ]

# Run evaluation with experiment scoring functions
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    experiment_scoring_functions=[compute_hallucination_max],
    experiment_name="My experiment"
)

# Access experiment scores from the result
print(f"Experiment scores: {evaluation.experiment_scores}")
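
The example above reads `score_results[0]`, which only works when a single metric is configured. As a minimal sketch of an alternative, the function below selects the per-item score by name and returns both a mean and a max. The `ScoreResult` and `TestResult` classes here are hypothetical stand-ins limited to the fields used above, so the snippet runs without the SDK. The per-item score name `hallucination_metric` and the use of `functools.partial` to bind it are assumptions, not confirmed SDK behavior.

```python
from dataclasses import dataclass, field
from functools import partial
from statistics import mean
from typing import List, Optional


# Hypothetical stand-ins for opik's ScoreResult and TestResult, reduced to the
# fields used in the example above so the sketch runs without the SDK.
@dataclass
class ScoreResult:
    name: str
    value: float
    reason: Optional[str] = None


@dataclass
class TestResult:
    score_results: List[ScoreResult] = field(default_factory=list)


def summarize_metric(
    test_results: List[TestResult], metric_name: str
) -> List[ScoreResult]:
    """Return the mean and max of one named per-item metric across all test results."""
    values = [
        score.value
        for result in test_results
        for score in result.score_results
        if score.name == metric_name
    ]
    if not values:
        return []
    reason = f"Aggregated over {len(values)} test cases"
    return [
        ScoreResult(name=f"{metric_name} (mean)", value=mean(values), reason=reason),
        ScoreResult(name=f"{metric_name} (max)", value=max(values), reason=reason),
    ]


# Bind the metric name so the callable takes only `test_results`,
# like compute_hallucination_max above (the metric name is an assumption).
summarize_hallucination = partial(summarize_metric, metric_name="hallucination_metric")

results = [
    TestResult([ScoreResult("hallucination_metric", 0.0)]),
    TestResult([ScoreResult("hallucination_metric", 1.0), ScoreResult("other", 0.5)]),
    TestResult([]),  # a test case with no scores is skipped
]
experiment_scores = summarize_hallucination(results)
print([(s.name, s.value) for s in experiment_scores])
```

Selecting by name keeps the aggregate stable if more entries are later added to `scoring_metrics`, at the cost of depending on the metric's reported name.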

In the FE, the following places have been updated:

Evaluation table in home page
Experiment list page: the chart and table were updated, with special care taken to support groups and sorting
Single experiment page: the tags at the top of the page and the feedback scores table were updated

The documentation was also updated to include this feature.

Change checklist

User facing
Documentation update

Issues

Resolves [FR]: Support for population based metrics #3764
OPIK-2884

Testing

SDK and BE tests were added. Manual testing was also completed.

Documentation

Documentation was updated

github-actions · 2025-11-07T18:20:54Z

📋 PR Linter Failed

❌ Invalid Title Format. Your PR title must include a ticket/issue number and may optionally include component tags ([FE], [BE], etc.).

Internal contributors: Open a JIRA ticket and link to it: [OPIK-xxxx] or [CUST-xxxx] or [DEV-xxxx] [COMPONENT] Your change
External contributors: Open a Github Issue and link to it via its number: [issue-xxxx] [COMPONENT] Your change
No ticket: Use [NA] [COMPONENT] Your change (Issues section not required)

Example: [issue-3108] [BE] [FE] Fix authentication bug or [OPIK-1234] Fix bug or [NA] Update README

Copilot

Pull Request Overview

This PR introduces experiment scores functionality, allowing users to log aggregate metrics (like f1-score, recall, or custom statistics) at the experiment level. These scores are computed from test results and stored separately from per-trace feedback scores, enabling better experiment-level analytics.

Key changes include:

Python SDK support for computing and logging experiment scores via the evaluate() function
Backend storage and retrieval of experiment scores in ClickHouse
Frontend display of experiment scores alongside feedback scores in experiment lists, comparison views, and charts
TypeScript SDK type definitions for experiment scores

Reviewed Changes

Copilot reviewed 68 out of 68 changed files in this pull request and generated 6 comments.

File	Description
`sdks/python/src/opik/evaluation/evaluator.py`	Added `experiment_scores` parameter to evaluation functions and logic to compute/log scores
`sdks/python/src/opik/evaluation/types.py`	Defined `ExperimentScoreFunction` type for score computation functions
`sdks/python/src/opik/rest_api/types/experiment_score*.py`	Auto-generated Pydantic models for experiment scores
`apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentScore.java`	Java model for experiment scores with validation
`apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java`	Database operations for storing/retrieving experiment scores
`apps/opik-backend/src/main/resources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql`	Database migration adding experiment_scores column
`apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx`	Frontend logic to merge and display feedback scores and experiment scores
`sdks/typescript/src/opik/rest_api/api/types/ExperimentScore*.ts`	Auto-generated TypeScript types for experiment scores
`apps/opik-frontend/src/lib/sorting.ts`	Sorting support for experiment_scores columns
`sdks/python/tests/unit/evaluation/test_evaluate.py`	Unit tests for experiment scores functionality
`apps/opik-backend/src/test/java/com/comet/opik/api/resources/v1/priv/ExperimentsResourceTest.java`	Backend integration tests for experiment scores CRUD operations

sdks/python/src/opik/evaluation/evaluator.py

sdks/python/examples/evaluation_example.py

apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx

apps/opik-backend/src/main/java/com/comet/opik/domain/sorting/SortingQueryBuilder.java

apps/opik-backend/src/main/java/com/comet/opik/domain/FeedbackScoreDAO.java

apps/opik-backend/src/test/java/com/comet/opik/api/resources/utils/ExperimentsTestUtils.java

github-actions · 2025-11-07T18:27:10Z

SDK E2E Tests Results

108 tests, 1 suite, 1 file: 107 passed ✅, 0 skipped 💤, 1 failed ❌ (duration 5m 33s ⏱️)

For more details on these failures, see this check.

Results for commit 035b08b.

♻️ This comment has been updated with latest results.

github-actions · 2025-11-07T18:50:41Z

Backend Tests Results

351 files (±0), 351 suites (±0), duration 55m 48s ⏱️ (+6m 34s)
5 884 tests (+13): 5 877 passed ✅ (+13), 7 skipped 💤 (±0), 0 failed ❌ (±0)
5 857 runs (+1 184): 5 850 passed ✅ (+1 184), 7 skipped 💤 (±0), 0 failed ❌ (±0)

Results for commit 05cfbe1. ± Comparison against base commit 0ee8b93.

♻️ This comment has been updated with latest results.

sdks/python/src/opik/evaluation/evaluator.py

sdks/python/src/opik/evaluation/rest_operations.py

alexkuzmik

One more comment - we need at least one e2e test for this feature in the SDK.

andrescrz

I left many comments, but focus on the important ones:

DAO query approach in service side aggregations: should be moved to the DB level.
Exceptions in sorting logic: better harmonise the query.

apps/opik-backend/src/main/java/com/comet/opik/api/Experiment.java

apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentUpdate.java

apps/opik-backend/src/main/java/com/comet/opik/api/FeedbackScoreNames.java

apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java

apps/opik-backend/src/main/java/com/comet/opik/domain/FeedbackScoreService.java

apps/opik-backend/src/main/java/com/comet/opik/domain/sorting/SortingQueryBuilder.java

...ources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql

apps/opik-backend/src/test/java/com/comet/opik/api/resources/utils/ExperimentsTestUtils.java

github-actions · 2025-11-20T14:41:30Z

🔄 Test environment deployment started

Building images for PR #3989...

You can monitor the build progress here.

github-actions · 2025-11-20T14:42:47Z

SDK Unit Tests Results

0 tests 0 ✅ 0s ⏱️
0 suites 0 💤
0 files 0 ❌

Results for commit be64446.

♻️ This comment has been updated with latest results.

github-actions · 2025-12-02T09:50:22Z

🌿 Preview your docs: https://opik-preview-1c9da394-08b4-45cd-914d-245075e1f7fb.docs.buildwithfern.com/docs/opik

No broken links found

github-actions · 2025-12-02T11:17:43Z

🌿 Preview your docs: https://opik-preview-e78daf90-7c18-4b7e-a2c3-a26e2d520f33.docs.buildwithfern.com/docs/opik

No broken links found

github-actions · 2025-12-02T11:47:41Z

🌿 Preview your docs: https://opik-preview-85880fde-31e9-482f-a4eb-02047c90871b.docs.buildwithfern.com/docs/opik

No broken links found

github-actions · 2025-12-02T13:06:44Z

🔄 Test environment deployment started

Building images for PR #3989...

You can monitor the build progress here.

github-actions · 2025-12-02T13:08:42Z

🌿 Preview your docs: https://opik-preview-1dcee891-f2b1-4812-87b5-c257c8176e44.docs.buildwithfern.com/docs/opik

No broken links found

github-actions · 2025-12-02T13:08:53Z

🔄 Test environment deployment started

Building images for PR #3989...

You can monitor the build progress here.

github-actions · 2025-12-02T13:10:33Z

🌿 Preview your docs: https://opik-preview-9ef8631b-dd0d-4feb-80b4-909dff1b2736.docs.buildwithfern.com/docs/opik

No broken links found

CometActions · 2025-12-02T13:14:28Z

✅ Test environment is now available!

Access Information

URL: https://pr-3989.dev.comet.com
Cluster: comet-ml-development
Namespace: pr-3989
Version: 1.9.37-3989-merge-616
Application logs: View in Grafana

The deployment has completed successfully and the version has been verified.

CometActions · 2025-12-02T13:16:21Z

✅ Test environment is now available!

Access Information

URL: https://pr-3989.dev.comet.com
Cluster: comet-ml-development
Namespace: pr-3989
Version: 1.9.37-3989-merge-617
Application logs: View in Grafana

The deployment has completed successfully and the version has been verified.

aadereiko · 2025-12-02T13:34:41Z

apps/opik-frontend/src/components/shared/DataTableCells/FeedbackScoreListCell.tsx

-  const sortedList = feedbackScoreList.sort((c1, c2) =>
-    c1.name.localeCompare(c2.name),
-  );
+  const sortedList = scoreList.sort((c1, c2) => c1.name.localeCompare(c2.name));


`sort` mutates the original array, so sorting a copy is way safer:

Suggested change

-  const sortedList = scoreList.sort((c1, c2) => c1.name.localeCompare(c2.name));
+  const sortedList = scoreList.slice().sort((c1, c2) => c1.name.localeCompare(c2.name));

Good idea to avoid having multiple references, like sortedList (When it's the same as scoreList after sorting)

Let's do it in a future PR

aadereiko

Good job! Thanks for your patience in handling all the back and forth comments

jverre added 5 commits November 7, 2025 16:26

Hide experiment_scores columns in the single experiment table

1d00408

Add SDK support for experiment_scores

81fe6c8

Add SDK support for experiment_scores

21c4aef

Add BE functionality

2f1231e

Typescript autogenerated code

4f1b508

Copilot AI review requested due to automatic review settings November 7, 2025 18:20

jverre requested review from a team as code owners November 7, 2025 18:20

github-actions bot assigned jverre Nov 7, 2025

jverre changed the title ~~[#3764] [FE] [BE] [Docs] Introduce experiment scores~~ [issue-3764] [FE] [BE] [Docs] Introduce experiment scores Nov 7, 2025

Copilot AI reviewed Nov 7, 2025

Documentation and FE update

54510fa

alexkuzmik requested changes Nov 10, 2025

sdks/python/src/opik/evaluation/evaluator.py (outdated)

sdks/python/src/opik/evaluation/rest_operations.py (outdated)

alexkuzmik requested changes Nov 10, 2025

jverre added 2 commits November 12, 2025 13:10

Address PR comments

7fede23

Address PR comments

e8adcdf

andrescrz requested changes Nov 13, 2025

Fix PR comments

d07c998

jverre added the test-environment Deploy Opik adhoc environment label Nov 20, 2025

comet-ml deleted a comment from github-actions bot Nov 20, 2025

jverre added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Nov 20, 2025

github-actions bot assigned idoberko2 Dec 2, 2025

Merge branch 'main' into jacques/experiment-precomputed-metrics

9de1d25

github-actions bot assigned Lothiraldan Dec 2, 2025

Merge branch 'main' into jacques/experiment-precomputed-metrics

035b08b

github-actions bot assigned YarivHashaiComet Dec 2, 2025

Refactor score handling in various components to unify feedback and experiment score logic. Removed experiment score references and updated feedback score components to handle aggregated scores. Adjusted column definitions and metadata across multiple pages for consistency.

021a8a8

JetoPistola added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Dec 2, 2025

JetoPistola added 2 commits December 2, 2025 15:07

Merge branch 'main' of https://github.com/comet-ml/opik into jacques/experiment-precomputed-metrics

994d47a

Add migration to include experiment_scores column in experiments table

05cfbe1

JetoPistola removed the test-environment Deploy Opik adhoc environment label Dec 2, 2025

github-actions bot added the test-environment Deploy Opik adhoc environment label Dec 2, 2025

JetoPistola removed the test-environment Deploy Opik adhoc environment label Dec 2, 2025

JetoPistola added the test-environment Deploy Opik adhoc environment label Dec 2, 2025

JetoPistola requested a review from aadereiko December 2, 2025 13:11

aadereiko reviewed Dec 2, 2025

andrescrz approved these changes Dec 2, 2025

aadereiko approved these changes Dec 2, 2025

JetoPistola merged commit b7a3134 into main Dec 2, 2025
136 of 138 checks passed

JetoPistola deleted the jacques/experiment-precomputed-metrics branch December 2, 2025 13:44

dsblank mentioned this pull request Mar 31, 2026

[FR]: Support for classification metrics (Precision, Recall, F1) with dataset-level evaluation #5988

Open
