
feat(datasets): Add OpikEvaluationDataset to experimental datasets#1364

Open
lrcouto wants to merge 18 commits into main from opik-evaluation-dataset

Conversation


lrcouto commented Mar 31, 2026

Description

Introduces a production-ready OpikEvaluationDataset to the experimental datasets in kedro-plugins, closing issue #1322. This implementation is the Opik counterpart to LangfuseEvaluationDataset (#1347) and follows the same design principles and API surface.

Key Features

Implementation changes:

  • Validates credentials (api_key required), sync policy, and filepath extension at initialization
  • Explicit sync modes: local (default, local file as source of truth — upserts all items to remote on every load()) and remote (Opik as source of truth, no local file interaction)
  • Items are deduplicated by content hash on the remote side (Opik SDK behavior); items with id fields are also deduplicated locally during save() merge
  • Warnings issued for items without id fields that cannot be tracked across syncs
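As a rough illustration of the save() merge behaviour described above, here is a hypothetical sketch (not the plugin's actual code; function and field names are assumptions):

```python
import warnings

def merge_items(existing: list[dict], incoming: list[dict]) -> list[dict]:
    """Merge incoming items into existing ones, deduplicating by 'id'.

    Items sharing an 'id' are replaced (last write wins); items without
    an 'id' cannot be tracked across syncs, so a warning is issued and
    they are kept as-is.
    """
    by_id = {item["id"]: item for item in existing if "id" in item}
    untracked = [item for item in existing if "id" not in item]
    for item in incoming:
        if "id" in item:
            by_id[item["id"]] = item  # replace earlier item with same id
        else:
            warnings.warn("Item without 'id' field cannot be tracked across syncs")
            untracked.append(item)
    return list(by_id.values()) + untracked
```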

Opik-specific behaviour:

  • Opik requires item IDs to be valid UUIDs. Human-readable IDs from local files are stripped before upload, since Opik auto-generates its own UUIDs. Deduplication is content hash-based and is not affected by the id field.
  • metadata is accepted as a constructor param so the API is the same as in LangfuseEvaluationDataset, but Opik's create_dataset() does not accept a metadata argument. The value is stored locally and returned by _describe() but is not propagated to the remote dataset.
  • No snapshot versioning: Opik does not support pinning load() to a historical snapshot (unlike Langfuse's ISO 8601 version param).
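A minimal sketch of the upload preparation this implies, assuming content hashing over everything except the id field (hypothetical helper names, not the plugin's or the Opik SDK's API):

```python
import hashlib
import json

def content_hash(item: dict) -> str:
    """Hash item content, excluding 'id' so it never affects deduplication."""
    payload = {k: v for k, v in item.items() if k != "id"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def prepare_for_upload(items: list[dict]) -> list[dict]:
    """Strip human-readable ids before upload; Opik assigns its own UUIDs."""
    return [{k: v for k, v in item.items() if k != "id"} for item in items]
```

Two items that differ only in their id field therefore hash identically and deduplicate to one remote item.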

Design philosophy:

Lifecycle operations (delete dataset, delete items, clear) are delegated to the native Opik SDK rather than exposed through OpikEvaluationDataset. This matches what LangfuseEvaluationDataset does and keeps the dataset class focused on the local/remote sync. Item deletion requires Opik's internal UUIDs, which are not tracked in the local file, making a thin wrapper of limited value.

Testing & Documentation

  • Unit tests included in kedro_datasets_experimental/tests/opik (mirroring LangfuseEvaluationDataset test structure)
  • Class docstring with sync policy explanation, item format, catalog YAML examples, and Python API example
  • opik/README.md extended with an OpikEvaluationDataset section and a Langfuse to Opik migration guide covering catalog changes, credential key differences, experiment runner API differences, scorer/task signature differences, and known limitations.

Development notes

How to test

Add Opik and OpenAI credentials to the project (e.g. in conf/local/credentials.yml):

opik_credentials:
  api_key: "opik-api-key"
  workspace: "workspace-name"

openai:
  openai_api_key: "openai-api-key"
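For reference, a catalog entry wiring the dataset to those credentials might look like the following sketch. The type path, dataset_name, and sync_policy field names are illustrative assumptions based on this description; check the class docstring for the exact options:

```yaml
# Hypothetical catalog entry -- field names are assumptions, not verified API
intent_evaluation_data:
  type: kedro_datasets_experimental.opik.OpikEvaluationDataset
  filepath: data/intent_detection/evaluation/intent_evaluation.json
  dataset_name: evaluations/intent_agent_evaluation
  credentials: opik_credentials
  sync_policy: local  # local file is the source of truth; upserts on load()
```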

Then run the pipeline and check the result on the Opik dashboard:

kedro run --pipeline intent_detection_evaluation_opik --params user_model=gpt-4o

Expected in the Opik dashboard:

  • A new dataset named evaluations/intent_agent_evaluation appears under Datasets
  • Items are visible with auto-generated UUIDs (human-readable IDs are stripped, this is expected)
  • An experiment run appears under Experiments with scoring results per item

Run the pipeline a second time without changing the local file. Because of content hash deduplication, no new items are created in Opik and the item count stays the same.

Expected: dataset item count unchanged; a new experiment run is recorded.


Then, add a new item to data/intent_detection/evaluation/intent_evaluation.json and rerun. The new item should appear in the remote dataset after load.


A note on versioning:

In the Langfuse evaluation pipeline, experiment names are derived from a stable integer version number pinned in the catalog (load_args.version). This makes experiment runs reproducible and traceable to a specific prompt version.

The Opik equivalent is not yet available. OpikPromptDataset currently loads the latest prompt commit by default and exposes its commit hash at runtime, so the kedro-academy pipeline derives the experiment name from that hash (e.g. intent_eval_a3f2c1b0_model_gpt-4o). This works in practice but has the drawback of not having version pinning in the catalog. There is no way to declaratively pin OpikPromptDataset to a specific prompt version the way Langfuse does with load_args.version: 1. Swapping prompt versions requires editing the local file and re-syncing, rather than a catalog change.
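The derivation itself is simple enough to sketch (hypothetical helper; the actual kedro-academy pipeline code may differ):

```python
def experiment_name(commit_hash: str, model: str) -> str:
    """Derive an Opik experiment name from the prompt commit hash and model."""
    return f"intent_eval_{commit_hash[:8]}_model_{model}"
```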

Versioning can still be exercised directly through the Opik SDK, but to get something equivalent to what LangfuseEvaluationDataset offers we would have to implement version pinning for OpikPromptDataset. There is already an open issue for this (#1348).
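To make the gap concrete, the following sketch contrasts the two catalog styles described above (class and field names are illustrative assumptions, not verified against the plugins):

```yaml
# Langfuse: prompt version pinned declaratively in the catalog
intent_prompt:
  type: kedro_datasets_experimental.langfuse.LangfusePromptDataset
  load_args:
    version: 1  # stable integer pin; experiment names derive from this

# Opik: no equivalent pin today -- the latest prompt commit is always
# loaded and its commit hash is only known at runtime (see #1348)
intent_prompt_opik:
  type: kedro_datasets_experimental.opik.OpikPromptDataset
```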

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

Signed-off-by: Laura Couto <laurarccouto@gmail.com>