
feat(datasets): Shorten pyproject.toml extra names for langfuse, opik, and langchain datasets#1365

Draft
ElenaKhaustova wants to merge 32 commits into main from feat/rename-genai-extras

Conversation

@ElenaKhaustova
Contributor

@ElenaKhaustova ElenaKhaustova commented Mar 31, 2026

Description

Context: #1347 (comment)

Development notes

  • Removes the redundant package-family prefix from dataset-specific extras (e.g. langfuse-langfusepromptdataset → langfuse-promptdataset), making install commands shorter and more consistent with other extras.
  • Meta-extras (langfuse, opik, langchain) are updated to reference the new names. The installed dependencies are unchanged.
  • Updates all references in the langfuse and opik READMEs to match the new names.
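
The renaming rule in the first note can be sketched in Python as follows. This is a minimal illustration, not the code in this PR: `shorten_extra` is a hypothetical helper, it assumes extras are named `<family>-<datasetname>` in lowercase, and only the `langfuse-langfusepromptdataset` example comes from the description above.

```python
def shorten_extra(name: str) -> str:
    """Drop a dataset-name prefix that repeats the package family.

    Hypothetical helper for illustration only; extras are assumed to be
    named "<family>-<datasetname>", all lowercase.
    """
    family, sep, dataset = name.partition("-")
    if sep and dataset.startswith(family) and len(dataset) > len(family):
        return f"{family}-{dataset[len(family):]}"
    return name  # meta-extras such as "langfuse" are left untouched


print(shorten_extra("langfuse-langfusepromptdataset"))  # langfuse-promptdataset
print(shorten_extra("langfuse"))  # langfuse
```

Extras whose dataset part does not repeat the family name pass through unchanged, which matches the note that only the redundant prefix is removed.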

Test plan

  • pip install kedro-datasets[langfuse-promptdataset] resolves correctly
  • pip install kedro-datasets[langfuse] still installs all langfuse deps
  • CI passes (no functional code changes)
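
One possible way to check the first two items is to inspect which extras the installed distribution declares. The sketch below uses `importlib.metadata` from the standard library; `normalize_extra`, `declared_extras`, and `missing_extras` are hypothetical helper names, the normalization follows the rule described in PEP 685, and the commented usage assumes `kedro-datasets` has been installed from this branch.

```python
import re
from importlib.metadata import metadata


def normalize_extra(name: str) -> str:
    """Normalize an extra name as described in PEP 685."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_extras(dist_name: str) -> set[str]:
    """Return the normalized extras declared by an installed distribution."""
    extras = metadata(dist_name).get_all("Provides-Extra") or []
    return {normalize_extra(extra) for extra in extras}


def missing_extras(declared: set[str], expected: list[str]) -> list[str]:
    """List the expected extras that the distribution does not declare."""
    return [e for e in expected if normalize_extra(e) not in declared]


# Usage, assuming kedro-datasets is installed from this branch:
# missing_extras(declared_extras("kedro-datasets"),
#                ["langfuse", "langfuse-promptdataset"])
```

An empty result only shows that the extras are declared; it does not confirm that their dependencies resolve, so it complements the `pip install` checks rather than replacing them.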

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Updated jsonschema/kedro-catalog-X.XX.json if necessary
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes
  • Received approvals from at least half of the TSC (required for adding a new, non-experimental dataset)

ElenaKhaustova and others added 30 commits March 16, 2026 15:16
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review March 31, 2026 19:39
Signed-off-by: Elena Khaustova <ymax70rus@gmail.com>
@ElenaKhaustova ElenaKhaustova marked this pull request as draft March 31, 2026 19:43
@ElenaKhaustova ElenaKhaustova marked this pull request as ready for review March 31, 2026 19:44
@ElenaKhaustova ElenaKhaustova requested review from SajidAlamQB, ankatiyar, merelcht and ravi-kumar-pilla and removed request for SajidAlamQB and merelcht March 31, 2026 19:44
Contributor

@ravi-kumar-pilla ravi-kumar-pilla left a comment


LGTM

@ElenaKhaustova ElenaKhaustova requested a review from lrcouto March 31, 2026 19:55
Contributor

@SajidAlamQB SajidAlamQB left a comment


Thank you @ElenaKhaustova!

Member

@merelcht merelcht left a comment


Much better 👍

@ElenaKhaustova ElenaKhaustova self-assigned this Apr 1, 2026
@ElenaKhaustova ElenaKhaustova moved this to In Review in Kedro 🔶 Apr 1, 2026
Base automatically changed from feat/langfuse-evaluation-dataset to main April 1, 2026 10:08
@ankatiyar
Contributor

The shorter names look good but I'm worried it strays from the convention we have set for the dependencies - the dataset after all is called langchain.LangchainPromptDataset etc. The dependency names are too long but they follow the same standard across all datasets.
These datasets are all experimental of course so maybe this is okay.
When they graduate, we could also consider renaming them opik.OpikPromptDataset & opik.OpikTraceDataset to opik.PromptDataset and opik.TraceDataset but that might be too disruptive for the users who already have adopted them. 🤔 In the case of choosing between renaming datasets or having long dependency names, it would be better to stick with long dependencies in my opinion.
I'm not opposed to this change but just checking if we'll be setting a precedent where the dataset and dependency name might be slightly different and users have to double check in the pyproject.toml here to make sure which one it is!

@ElenaKhaustova
Contributor Author

ElenaKhaustova commented Apr 1, 2026

> The shorter names look good but I'm worried it strays from the convention we have set for the dependencies - the dataset after all is called langchain.LangchainPromptDataset etc. The dependency names are too long but they follow the same standard across all datasets.

Yes, that is a valid point, and I think we should also rename langchain.LangchainPromptDataset to langchain.PromptDataset. The problem is that if we go ahead with this for the already-released datasets, we will probably need to add the short name and keep the old name as an alias with a deprecation warning, to give users a migration window. For example, from kedro_datasets_experimental.langfuse import PromptDataset works, and LangfusePromptDataset still works but logs a deprecation warning.
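
A minimal sketch of one way such an alias could behave is shown below. It is an assumption for illustration, not the kedro-datasets implementation: both classes are stand-ins with a hypothetical `prompt_name` parameter, and the old name is kept as a thin subclass that warns on instantiation.

```python
import warnings


class PromptDataset:
    """Stand-in for the renamed dataset class (hypothetical body)."""

    def __init__(self, prompt_name: str) -> None:
        self.prompt_name = prompt_name


class LangfusePromptDataset(PromptDataset):
    """Old name kept as a thin subclass for the migration window."""

    def __init__(self, *args, **kwargs) -> None:
        warnings.warn(
            "LangfusePromptDataset is deprecated; use PromptDataset instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        super().__init__(*args, **kwargs)
```

With this shape, code that instantiates the old name keeps working and still passes `isinstance` checks against `PromptDataset`. Python's default warning filters hide `DeprecationWarning` in most non-test contexts, so a real implementation might choose a more visible category.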

And I see the following pros for renaming:

  • shorter names for dependencies
  • langfuse.EvaluationDataset is arguably clearer than langfuse.LangfuseEvaluationDataset since the package already tells you it's Langfuse.
  • It also matches how the core LangChain datasets work — langchain.ChatOpenAIDataset, not langchain.LangChainChatOpenAIDataset.

So we probably should either rename both dependencies and datasets or leave them as is. My question: do you think it's worth it? @merelcht, @ankatiyar

@ankatiyar
Contributor

Ideally I would also like the dataset names to be langfuse.PromptDataset and langfuse.TraceDataset (similarly for opik etc.); then the dependency names could also be short and follow the convention. We have precedent for this as well with the various CSVDataset/JSONDataset/ParquetDataset classes (pandas and dask). We also have some datasets that repeat the package name (svmlight.SVMLightDataset, netcdf.NetCDFDataset).

Since these are experimental datasets, we could update the names (with or without a deprecation warning, but for user experience aliasing might be good). We would also have to update the projects in kedro-academy, the starter and maybe blog posts? We can do it now (while they're still experimental and maybe not many people use them in the wild) or when we consider graduating them (people will have to migrate anyway, so we could rename the datasets at the same time).
However, if we're not updating the names, maybe we could stick with the longer dependencies for now? Will defer to @merelcht's judgement

@merelcht
Member

merelcht commented Apr 1, 2026

I was thinking about this too, thanks for raising it @ankatiyar. I think we should take advantage of these being experimental and just do the rename without a transition period. Normally I would definitely be against that, but the whole point of these being experimental is that we allow slightly less solid datasets to be released, so breaking changes can happen while we polish the datasets between releases.

As an extra check we can have a look at telemetry to see if many people are using these datasets and make a decision based on that if we see adoption is already high.

@ElenaKhaustova
Contributor Author

We had a discussion with @merelcht and decided to do the full renaming, but hold it until #1364 is completed

@ElenaKhaustova ElenaKhaustova marked this pull request as draft April 1, 2026 15:28

Labels

None yet

Projects

Status: In Review

5 participants