[issue-3764] [P SDK] [FE] [BE] [Docs] Introduce experiment scoring functions#3989
JetoPistola merged 50 commits into main
📋 PR Linter Failed: ❌ Invalid Title Format. Your PR title must include a ticket/issue number and may optionally include component tags.
Pull Request Overview
This PR introduces experiment scores functionality, allowing users to log aggregate metrics (like f1-score, recall, or custom statistics) at the experiment level. These scores are computed from test results and stored separately from per-trace feedback scores, enabling better experiment-level analytics.
Key changes include:
- Python SDK support for computing and logging experiment scores via the `evaluate()` function
- Backend storage and retrieval of experiment scores in ClickHouse
- Frontend display of experiment scores alongside feedback scores in experiment lists, comparison views, and charts
- TypeScript SDK type definitions for experiment scores
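To make the difference between per-trace feedback scores and experiment-level scores concrete, here is a minimal sketch of an aggregate computed once over all test results. The feature itself lives in the Python SDK; this TypeScript version is only an illustration, and `TestResult`, `ExperimentScore` and `computeExperimentScores` are hypothetical names, not the Opik API.

```typescript
// Hypothetical shapes for illustration only; these are not the Opik SDK types.
interface TestResult {
  expected: boolean; // ground-truth label for one dataset item
  predicted: boolean; // model output for the same item
}

interface ExperimentScore {
  name: string;
  value: number;
}

// Compute experiment-level recall, precision and F1 from all per-item results.
// These metrics need every result at once, so they cannot be logged per trace.
function computeExperimentScores(results: TestResult[]): ExperimentScore[] {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const r of results) {
    if (r.predicted && r.expected) tp += 1;
    else if (r.predicted && !r.expected) fp += 1;
    else if (!r.predicted && r.expected) fn += 1;
  }
  // Guard against division by zero when a class is absent.
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);
  return [
    { name: "recall", value: recall },
    { name: "precision", value: precision },
    { name: "f1-score", value: f1 },
  ];
}

const scores = computeExperimentScores([
  { expected: true, predicted: true },
  { expected: true, predicted: false },
  { expected: false, predicted: true },
  { expected: false, predicted: false },
]);
console.log(scores);
```

Each returned entry is a single name/value pair for the whole experiment, which is why it is stored separately from the feedback scores attached to individual traces.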
Reviewed Changes
Copilot reviewed 68 out of 68 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `sdks/python/src/opik/evaluation/evaluator.py` | Added `experiment_scores` parameter to evaluation functions and logic to compute/log scores |
| `sdks/python/src/opik/evaluation/types.py` | Defined `ExperimentScoreFunction` type for score computation functions |
| `sdks/python/src/opik/rest_api/types/experiment_score*.py` | Auto-generated Pydantic models for experiment scores |
| `apps/opik-backend/src/main/java/com/comet/opik/api/ExperimentScore.java` | Java model for experiment scores with validation |
| `apps/opik-backend/src/main/java/com/comet/opik/domain/ExperimentDAO.java` | Database operations for storing/retrieving experiment scores |
| `apps/opik-backend/src/main/resources/liquibase/db-app-analytics/migrations/000046_add_experiment_scores_to_experiments.sql` | Database migration adding `experiment_scores` column |
| `apps/opik-frontend/src/components/pages/ExperimentsPage/ExperimentsPage.tsx` | Frontend logic to merge and display feedback scores and experiment scores |
| `sdks/typescript/src/opik/rest_api/api/types/ExperimentScore*.ts` | Auto-generated TypeScript types for experiment scores |
| `apps/opik-frontend/src/lib/sorting.ts` | Sorting support for `experiment_scores` columns |
| `sdks/python/tests/unit/evaluation/test_evaluate.py` | Unit tests for experiment scores functionality |
| `apps/opik-backend/src/test/java/com/comet/opik/api/resources/v1/priv/ExperimentsResourceTest.java` | Backend integration tests for experiment scores CRUD operations |
SDK E2E Tests Results: 108 tests, 107 ✅, 5m 33s ⏱️. For more details on these failures, see this check. Results for commit 035b08b. ♻️ This comment has been updated with latest results.
alexkuzmik left a comment
One more comment - we need at least one e2e test for this feature in the SDK.
andrescrz left a comment
I left many comments, but focus on the important ones:
- DAO query approach in service side aggregations: should be moved to the DB level.
- Exceptions in sorting logic: better harmonise the query.
SDK Unit Tests Results: 0 tests, 0 ✅, 0s ⏱️. Results for commit be64446. ♻️ This comment has been updated with latest results.
🌿 Preview your docs: https://opik-preview-1c9da394-08b4-45cd-914d-245075e1f7fb.docs.buildwithfern.com/docs/opik No broken links found
🌿 Preview your docs: https://opik-preview-e78daf90-7c18-4b7e-a2c3-a26e2d520f33.docs.buildwithfern.com/docs/opik No broken links found
🌿 Preview your docs: https://opik-preview-85880fde-31e9-482f-a4eb-02047c90871b.docs.buildwithfern.com/docs/opik No broken links found
…xperiment score logic. Removed experiment score references and updated feedback score components to handle aggregated scores. Adjusted column definitions and metadata across multiple pages for consistency.
🌿 Preview your docs: https://opik-preview-1dcee891-f2b1-4812-87b5-c257c8176e44.docs.buildwithfern.com/docs/opik No broken links found
🌿 Preview your docs: https://opik-preview-9ef8631b-dd0d-4feb-80b4-909dff1b2736.docs.buildwithfern.com/docs/opik No broken links found
✅ Test environment is now available!
The deployment has completed successfully and the version has been verified.
    - const sortedList = feedbackScoreList.sort((c1, c2) =>
    -   c1.name.localeCompare(c2.name),
    - );
    + const sortedList = scoreList.sort((c1, c2) => c1.name.localeCompare(c2.name));
`sort` mutates the original array, so this suggested change

    - const sortedList = scoreList.sort((c1, c2) => c1.name.localeCompare(c2.name));
    + const sortedList = scoreList.slice().sort((c1, c2) => c1.name.localeCompare(c2.name));

is way safer.
Good idea: it avoids having multiple references to the same array, like `sortedList` (which is the same array as `scoreList` after sorting).
Let's do it in a future PR.
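The aliasing concern in this thread can be reproduced in isolation. The sketch below uses made-up data and a hypothetical `NamedScore` type; it only demonstrates the standard behaviour of `Array.prototype.sort` and `Array.prototype.slice`, not the actual frontend code.

```typescript
// Hypothetical score shape and sample data, for illustration only.
interface NamedScore {
  name: string;
  value: number;
}

const scoreList: NamedScore[] = [
  { name: "recall", value: 0.8 },
  { name: "f1-score", value: 0.7 },
];

const byName = (c1: NamedScore, c2: NamedScore): number =>
  c1.name.localeCompare(c2.name);

// Copy first, then sort: scoreList keeps its original order.
const sortedCopy = scoreList.slice().sort(byName);
const orderAfterCopySort = scoreList.map((s) => s.name);

// Sorting in place returns the very same array object, now reordered.
const sortedInPlace = scoreList.sort(byName);
const sameReference = sortedInPlace === scoreList;
const orderAfterInPlaceSort = scoreList.map((s) => s.name);

console.log(orderAfterCopySort, orderAfterInPlaceSort, sameReference);
```

With `slice()` the caller's array is left untouched, so any other code holding `scoreList` does not see its order change as a side effect.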
aadereiko left a comment
Good job! Thanks for your patience in handling all the back and forth comments
Details
Introduces the concept of experiment scores, which let you log experiment-level scores computed from experiment results. This allows you to log metrics like `f1-score`, `recall` or `last`, for example.
In the FE, the following places have been updated:
The documentation was also updated to include this feature.
Change checklist
Issues
Testing
SDK and BE tests were added. Manual testing was also completed.
Documentation
Documentation was updated