
feat(datasets): Add OpikEvaluationDataset to experimental datasets#1364

Open
lrcouto wants to merge 18 commits into main from opik-evaluation-dataset

Conversation


lrcouto commented Mar 31, 2026

Description

Introduces a production-ready OpikEvaluationDataset to the experimental datasets in kedro-plugins, closing issue #1322. This implementation is the Opik counterpart to LangfuseEvaluationDataset (#1347) and follows the same design principles and API surface.

Key Features

Implementation changes:

  • Validates credentials (api_key required), sync policy, and filepath extension at initialization
  • Explicit sync modes: local (default, local file as source of truth — upserts all items to remote on every load()) and remote (Opik as source of truth, no local file interaction)
  • Items are deduplicated by content hash on the remote side (Opik SDK behavior); items with id fields are also deduplicated locally during save() merge
  • Warnings issued for items without id fields that cannot be tracked across syncs
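As a rough illustration of the save() merge behaviour described above, here is a hypothetical sketch (not the plugin's actual code; function and field names are assumptions):

```python
import warnings

def merge_items(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Merge incoming items into existing ones, deduplicating by 'id'.

    Items sharing an 'id' are replaced (last write wins); items without
    an 'id' cannot be tracked across syncs, so a warning is issued and
    they are kept as-is.
    """
    by_id = {item["id"]: item for item in existing if "id" in item}
    untracked = [item for item in existing if "id" not in item]
    for item in incoming:
        if "id" in item:
            by_id[item["id"]] = item  # replace earlier item with same id
        else:
            warnings.warn("Item without 'id' field cannot be tracked across syncs")
            untracked.append(item)
    return list(by_id.values()) + untracked
```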

Opik-specific behaviour:

  • Opik requires item IDs to be valid UUIDs. Human-readable IDs from local files are stripped before upload, since Opik auto-generates its own UUIDs. Deduplication is content hash-based and is not affected by the id field.
  • metadata is accepted as a constructor param so the API is the same as in LangfuseEvaluationDataset, but Opik's create_dataset() does not accept a metadata argument. The value is stored locally and returned by _describe() but is not propagated to the remote dataset.
  • No snapshot versioning: Opik does not support pinning load() to a historical snapshot (unlike Langfuse's ISO 8601 version param).
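A minimal sketch of the upload preparation this implies, assuming content hashing over everything except the id field (hypothetical helper names, not the plugin's or the Opik SDK's API):

```python
import hashlib
import json

def content_hash(item: dict) -> str:
    """Hash item content, excluding 'id' so it never affects deduplication."""
    payload = {k: v for k, v in item.items() if k != "id"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def prepare_for_upload(items: list[dict]) -> list[dict]:
    """Strip human-readable ids before upload; Opik assigns its own UUIDs."""
    return [{k: v for k, v in item.items() if k != "id"} for item in items]
```

Two items that differ only in their id field therefore hash identically and deduplicate to one remote item.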

Design philosophy:

Lifecycle operations (delete dataset, delete items, clear) are delegated to the native Opik SDK rather than exposed through OpikEvaluationDataset. This matches what LangfuseEvaluationDataset does and keeps the dataset class focused on the local/remote sync. Item deletion requires Opik's internal UUIDs, which are not tracked in the local file, making a thin wrapper of limited value.

Testing & Documentation

  • Unit tests included in kedro_datasets_experimental/tests/opik (mirroring LangfuseEvaluationDataset test structure)
  • Class docstring with sync policy explanation, item format, catalog YAML examples, and Python API example
  • opik/README.md extended with an OpikEvaluationDataset section and a Langfuse to Opik migration guide covering catalog changes, credential key differences, experiment runner API differences, scorer/task signature differences, and known limitations.

Development notes

How to test

Add Opik and OpenAI credentials to the project (e.g. in conf/local/credentials.yml):

opik_credentials:
  api_key: "opik-api-key"
  workspace: "workspace-name"

openai:
  openai_api_key: "openai-api-key"
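For reference, a catalog entry wiring the dataset to those credentials might look like the following sketch. The type path, dataset_name, and sync_policy field names are illustrative assumptions based on this description; check the class docstring for the exact options:

```yaml
# Hypothetical catalog entry -- field names are assumptions, not verified API
intent_evaluation_data:
  type: kedro_datasets_experimental.opik.OpikEvaluationDataset
  filepath: data/intent_detection/evaluation/intent_evaluation.json
  dataset_name: evaluations/intent_agent_evaluation
  credentials: opik_credentials
  sync_policy: local  # local file is the source of truth; upserts on load()
```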

Then run the pipeline and check the result on the Opik dashboard:

kedro run --pipeline intent_detection_evaluation_opik --params user_model=gpt-4o

Expected in the Opik dashboard:

  • A new dataset named evaluations/intent_agent_evaluation appears under Datasets
  • Items are visible with auto-generated UUIDs (human-readable IDs are stripped, this is expected)
  • An experiment run appears under Experiments with scoring results per item

Run the pipeline a second time without changing the local file. Because of content hash deduplication, no new items are created in Opik and the item count stays the same.

Expected: dataset item count unchanged; a new experiment run is recorded.


Then, add a new item to data/intent_detection/evaluation/intent_evaluation.json and rerun. The new item should appear in the remote dataset after load.


A note on versioning:

In the Langfuse evaluation pipeline, experiment names are derived from a stable integer version number pinned in the catalog (load_args.version). This makes experiment runs reproducible and traceable to a specific prompt version.

The Opik equivalent is not yet available. OpikPromptDataset currently loads the latest prompt commit by default and exposes its commit hash at runtime, so the kedro-academy pipeline derives the experiment name from that hash (e.g. intent_eval_a3f2c1b0_model_gpt-4o). This works in practice but has the drawback of not having version pinning in the catalog. There is no way to declaratively pin OpikPromptDataset to a specific prompt version the way Langfuse does with load_args.version: 1. Swapping prompt versions requires editing the local file and re-syncing, rather than a catalog change.
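The derivation itself is simple enough to sketch (hypothetical helper; the actual kedro-academy pipeline code may differ):

```python
def experiment_name(commit_hash: str, model: str) -> str:
    """Derive an Opik experiment name from the prompt commit hash and model."""
    return f"intent_eval_{commit_hash[:8]}_model_{model}"
```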

Versioning can still be exercised directly through the Opik SDK, but to get something equivalent to what LangfuseEvaluationDataset offers we would have to implement version pinning for OpikPromptDataset. There is already an open issue for this (#1348).
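To make the gap concrete, the following sketch contrasts the two catalog styles described above (class and field names are illustrative assumptions, not verified against the plugins):

```yaml
# Langfuse: prompt version pinned declaratively in the catalog
intent_prompt:
  type: kedro_datasets_experimental.langfuse.LangfusePromptDataset
  load_args:
    version: 1  # stable integer pin; experiment names derive from this

# Opik: no equivalent pin today -- the latest prompt commit is always
# loaded and its commit hash is only known at runtime (see #1348)
intent_prompt_opik:
  type: kedro_datasets_experimental.opik.OpikPromptDataset
```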

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Laura Couto <laurarccouto@gmail.com>