[issue-3764] [P SDK] [FE] [BE] [Docs] Introduce experiment scoring functions #3989

Merged
JetoPistola merged 50 commits into main from jacques/experiment-precomputed-metrics on Dec 2, 2025

Conversation

@jverre jverre commented Nov 7, 2025

Details

Introduces the concept of experiment scores, which let you log experiment-level scores computed from the experiment's results. This lets you log metrics such as F1 score, recall or last, for example.

from typing import List
from opik.evaluation import evaluate, test_result
from opik.evaluation.metrics import Hallucination, score_result

# Define an experiment score function
def compute_hallucination_max(
    test_results: List[test_result.TestResult],
) -> List[score_result.ScoreResult]:
    """Compute the maximum hallucination score across all test results."""
    # Only Hallucination() is configured below, so the first score of each
    # test case is taken to be the hallucination score.
    hallucination_scores = [
        result.score_results[0].value
        for result in test_results
        if result.score_results
    ]

    if not hallucination_scores:
        return []
    
    return [
        score_result.ScoreResult(
            name="hallucination_metric (max)",
            value=max(hallucination_scores),
            reason=f"Maximum hallucination score across {len(hallucination_scores)} test cases"
        )
    ]

# Run evaluation with experiment scoring functions
# (dataset and evaluation_task are assumed to be defined elsewhere)
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination()],
    experiment_scoring_functions=[compute_hallucination_max],
    experiment_name="My experiment"
)

# Access experiment scores from the result
print(f"Experiment scores: {evaluation.experiment_scores}")
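
The same pattern extends to metrics that only make sense across the whole experiment, such as the F1 score mentioned above. The following is a minimal sketch, not the SDK's implementation: `StubScoreResult` and `StubTestResult` are hypothetical stand-ins that mirror only the attributes used in the example above (`score_results`, `name`, `value`, `reason`), so the block runs without opik installed. The per-test-case metric names `predicted_positive` and `actual_positive` are also assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for opik's ScoreResult and TestResult, limited to
# the attributes used in the example above, so this sketch runs without opik.
@dataclass
class StubScoreResult:
    name: str
    value: float
    reason: Optional[str] = None


@dataclass
class StubTestResult:
    score_results: List[StubScoreResult] = field(default_factory=list)


def _score_by_name(result: StubTestResult, name: str) -> Optional[float]:
    """Return the value of the score called `name`, or None if it is absent."""
    for score in result.score_results:
        if score.name == name:
            return score.value
    return None


def compute_f1(test_results: List[StubTestResult]) -> List[StubScoreResult]:
    """Compute an experiment-level F1 score from per-test-case 0/1 scores.

    Assumes two hypothetical per-test-case metrics, "predicted_positive"
    and "actual_positive", each with a value of 0 or 1.
    """
    tp = fp = fn = 0
    for result in test_results:
        predicted = _score_by_name(result, "predicted_positive")
        actual = _score_by_name(result, "actual_positive")
        if predicted is None or actual is None:
            continue  # skip test cases that are missing either score
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif actual == 1:
            fn += 1

    if tp + fp + fn == 0:
        return []  # F1 is undefined without any positive prediction or label

    f1 = 2 * tp / (2 * tp + fp + fn)
    return [
        StubScoreResult(
            name="f1_score",
            value=f1,
            reason=f"tp={tp}, fp={fp}, fn={fn}",
        )
    ]
```

Looking scores up by name rather than by position keeps the function correct when more than one scoring metric is configured. With the real SDK types in place of the stubs, a function of this shape would presumably be passed through `experiment_scoring_functions`, as in the example above.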

In the FE, the following places have been updated:

  1. Evaluation table on the home page
  2. Experiment list page: the chart and table were updated, with special care taken to support groups and sorting
  3. Single experiment page: the tags at the top of the page and the feedback scores table were updated

The documentation was also updated to include this feature.

Change checklist

  • User facing
  • Documentation update

Issues

Testing

SDK and BE tests were added. Manual testing was also completed.

Documentation

Documentation was updated

Copilot AI review requested due to automatic review settings November 7, 2025 18:20
@jverre jverre requested review from a team as code owners November 7, 2025 18:20

github-actions bot commented Nov 7, 2025

📋 PR Linter Failed

Invalid Title Format. Your PR title must include a ticket/issue number and may optionally include component tags ([FE], [BE], etc.).

  • Internal contributors: Open a JIRA ticket and link to it: [OPIK-xxxx] or [CUST-xxxx] or [DEV-xxxx] [COMPONENT] Your change
  • External contributors: Open a Github Issue and link to it via its number: [issue-xxxx] [COMPONENT] Your change
  • No ticket: Use [NA] [COMPONENT] Your change (Issues section not required)

Example: [issue-3108] [BE] [FE] Fix authentication bug or [OPIK-1234] Fix bug or [NA] Update README

@jverre jverre changed the title from "[#3764] [FE] [BE] [Docs] Introduce experiment scores" to "[issue-3764] [FE] [BE] [Docs] Introduce experiment scores" Nov 7, 2025

Copilot AI left a comment

Pull Request Overview

This PR introduces experiment scores functionality, allowing users to log aggregate metrics (like f1-score, recall, or custom statistics) at the experiment level. These scores are computed from test results and stored separately from per-trace feedback scores, enabling better experiment-level analytics.

Key changes include:

  • Python SDK support for computing and logging experiment scores via the evaluate() function
  • Backend storage and retrieval of experiment scores in ClickHouse
  • Frontend display of experiment scores alongside feedback scores in experiment lists, comparison views, and charts
  • TypeScript SDK type definitions for experiment scores

Reviewed Changes

Copilot reviewed 68 out of 68 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| sdks/python/src/opik/evaluation/evaluator.py | Added experiment_scores parameter to evaluation functions and logic to compute/log scores |
| sdks/python/src/opik/evaluation/types.py | Defined ExperimentScoreFunction type for score computation functions |
| sdks/python/src/opik/rest_api/types/experiment_score*.py | Auto-generated Pydantic models for experiment scores |
| apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentScore.java | Java model for experiment scores with validation |
| apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java | Database operations for storing/retrieving experiment scores |
| apps/opik-backend/src/main/resources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql | Database migration adding experiment_scores column |
| apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx | Frontend logic to merge and display feedback scores and experiment scores |
| sdks/typescript/src/opik/rest_api/api/types/ExperimentScore*.ts | Auto-generated TypeScript types for experiment scores |
| apps/opik-frontend/src/lib/sorting.ts | Sorting support for experiment_scores columns |
| sdks/python/tests/unit/evaluation/test_evaluate.py | Unit tests for experiment scores functionality |
| apps/opik-backend/src/test/java/com/comet/opik/api/resources/v1/priv/ExperimentsResourceTest.java | Backend integration tests for experiment scores CRUD operations |

github-actions bot commented Nov 7, 2025

SDK E2E Tests Results

108 tests   107 ✅  5m 33s ⏱️
  1 suites    0 💤
  1 files      1 ❌

For more details on these failures, see this check.

Results for commit 035b08b.

♻️ This comment has been updated with latest results.

github-actions bot commented Nov 7, 2025

Backend Tests Results

  351 files  ±    0    351 suites  ±0   55m 48s ⏱️ + 6m 34s
5 884 tests +   13  5 877 ✅ +   13  7 💤 ±0  0 ❌ ±0 
5 857 runs  +1 184  5 850 ✅ +1 184  7 💤 ±0  0 ❌ ±0 

Results for commit 05cfbe1. ± Comparison against base commit 0ee8b93.

♻️ This comment has been updated with latest results.

@alexkuzmik alexkuzmik left a comment

One more comment - we need at least one e2e test for this feature in the SDK.

@andrescrz andrescrz left a comment

I left many comments, but focus on the important ones:

  1. DAO query approach with service-side aggregations: the aggregation should be moved to the DB level.
  2. Exceptions in the sorting logic: better to harmonise the query.

@jverre jverre added the test-environment Deploy Opik adhoc environment label Nov 20, 2025
@comet-ml comet-ml deleted a comment from github-actions bot Nov 20, 2025
@jverre jverre added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Nov 20, 2025
@github-actions

🔄 Test environment deployment started

Building images for PR #3989...

You can monitor the build progress here.

github-actions bot commented Nov 20, 2025

SDK Unit Tests Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit be64446.

♻️ This comment has been updated with latest results.

github-actions bot commented Dec 2, 2025

🌿 Preview your docs: https://opik-preview-1c9da394-08b4-45cd-914d-245075e1f7fb.docs.buildwithfern.com/docs/opik

No broken links found

github-actions bot commented Dec 2, 2025

🌿 Preview your docs: https://opik-preview-e78daf90-7c18-4b7e-a2c3-a26e2d520f33.docs.buildwithfern.com/docs/opik

No broken links found

github-actions bot commented Dec 2, 2025

🌿 Preview your docs: https://opik-preview-85880fde-31e9-482f-a4eb-02047c90871b.docs.buildwithfern.com/docs/opik

No broken links found

…xperiment score logic. Removed experiment score references and updated feedback score components to handle aggregated scores. Adjusted column definitions and metadata across multiple pages for consistency.
@JetoPistola JetoPistola added test-environment Deploy Opik adhoc environment and removed test-environment Deploy Opik adhoc environment labels Dec 2, 2025

github-actions bot commented Dec 2, 2025

🔄 Test environment deployment started

Building images for PR #3989...

You can monitor the build progress here.

@JetoPistola JetoPistola removed the test-environment Deploy Opik adhoc environment label Dec 2, 2025
@github-actions github-actions bot added the test-environment Deploy Opik adhoc environment label Dec 2, 2025
@JetoPistola JetoPistola removed the test-environment Deploy Opik adhoc environment label Dec 2, 2025

github-actions bot commented Dec 2, 2025

🌿 Preview your docs: https://opik-preview-1dcee891-f2b1-4812-87b5-c257c8176e44.docs.buildwithfern.com/docs/opik

No broken links found

@JetoPistola JetoPistola added the test-environment Deploy Opik adhoc environment label Dec 2, 2025

github-actions bot commented Dec 2, 2025

🔄 Test environment deployment started

Building images for PR #3989...

You can monitor the build progress here.

github-actions bot commented Dec 2, 2025

🌿 Preview your docs: https://opik-preview-9ef8631b-dd0d-4feb-80b4-909dff1b2736.docs.buildwithfern.com/docs/opik

No broken links found

@JetoPistola JetoPistola requested a review from aadereiko December 2, 2025 13:11
@CometActions

Test environment is now available!

Access Information

The deployment has completed successfully and the version has been verified.

- const sortedList = feedbackScoreList.sort((c1, c2) =>
-   c1.name.localeCompare(c2.name),
- );
+ const sortedList = scoreList.sort((c1, c2) => c1.name.localeCompare(c2.name));

sort mutates the original array, so

Suggested change
- const sortedList = scoreList.sort((c1, c2) => c1.name.localeCompare(c2.name));
+ const sortedList = scoreList.slice().sort((c1, c2) => c1.name.localeCompare(c2.name));

is way safer

@JetoPistola JetoPistola Dec 2, 2025

Good idea: it avoids having multiple references to the same array, like sortedList (which is the same as scoreList after sorting).

Let's do it in a future PR

@aadereiko aadereiko left a comment

Good job! Thanks for your patience in handling all the back-and-forth comments.

@JetoPistola JetoPistola merged commit b7a3134 into main Dec 2, 2025
136 of 138 checks passed
@JetoPistola JetoPistola deleted the jacques/experiment-precomputed-metrics branch December 2, 2025 13:44
Labels

  • Backend
  • documentation: Improvements or additions to documentation
  • Frontend
  • java: Pull requests that update Java code
  • python: Pull requests that update Python code
  • Python-SDK
  • test-environment: Deploy Opik adhoc environment
  • tests: Including test files, or tests related like configuration.
  • typescript: *.ts *.tsx
  • Typescript-SDK

Development

Successfully merging this pull request may close these issues.

[FR]: Support for population based metrics

10 participants