update datasets dependency to >=3.0.2,<4.0.0 (results in 3.6.0)#213

Draft
ArneBinder wants to merge 19 commits into main from update-datasets-dependency

ArneBinder (Owner) commented Oct 7, 2025

We require datasets>=3.0.2 because of huggingface/datasets#7234; without that fix, Brat datasets break (this has been the case since datasets>=2.16.0).
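
The constraint from the title, >=3.0.2,<4.0.0, is a half-open version range. A minimal sketch of that check, assuming plain major.minor.patch version strings: the helper names `parse_version` and `satisfies_datasets_constraint` are hypothetical (not part of pie_datasets or datasets), and pre-release suffixes are ignored.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def satisfies_datasets_constraint(version: str) -> bool:
    """Return True if the version lies in the range >=3.0.2,<4.0.0."""
    return (3, 0, 2) <= parse_version(version) < (4, 0, 0)


print(satisfies_datasets_constraint("3.6.0"))   # True: the version this PR resolves to
print(satisfies_datasets_constraint("2.16.0"))  # False: below the lower bound
```

In a real project the resolver (here, poetry) performs this check; the sketch only spells out which versions fall inside the range.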

related: #93, #212

requires:

dataset scripts require:

Changes:

  • use trust_remote_code=True in tests (required since datasets 2.20.0)
    • add a workaround to test_load_dataset_conll2003(_single_split): pass base_dataset_kwargs=dict(trust_remote_code=True). This should no longer be necessary once the conll2003 PIE dataset script handles it
  • set trust_remote_code=True by default in pie_datasets.load_dataset and in pie_datasets.DatasetDict.load_dataset to stay backwards compatible
  • adjust the builder tests test_builder_class_with_kwargs_wrong_parameter and test_builder_class_with_base_dataset_kwargs_wrong_parameter, since the exception types and messages changed
  • also test DatasetDict.load_dataset in test_load_dataset_conll2003
  • remove the dependency restrictions numpy = "<2.0.0", pyarrow = "^13", and fsspec = "<2023.9.0"; they should no longer be required
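
The backwards-compatible default can be sketched as follows. This is an illustration only, not the code from this PR: `resolve_load_kwargs` is a hypothetical helper showing how trust_remote_code=True can be applied as a default while an explicit value from the caller still takes precedence.

```python
from typing import Any, Dict


def resolve_load_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Return a copy of the kwargs with trust_remote_code defaulting to True."""
    resolved = dict(kwargs)
    # setdefault only fills the key if the caller did not pass it explicitly.
    resolved.setdefault("trust_remote_code", True)
    return resolved


# Caller passes nothing: the default applies.
default_kwargs = resolve_load_kwargs()

# Caller opts out explicitly: the explicit value is kept.
opt_out_kwargs = resolve_load_kwargs(trust_remote_code=False)

# Kwargs in the shape of the conll2003 test workaround from the list above.
test_kwargs = resolve_load_kwargs(base_dataset_kwargs=dict(trust_remote_code=True))
```

Defaulting instead of forcing the value keeps existing callers working while still letting security-conscious users disable remote code.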

Upgrade instructions: If the base dataset of a PIE dataset script requires trust_remote_code, adjust the script in one of two ways:

  1. add an entry {"trust_remote_code": True} to BASE_CONFIG_KWARGS_DICT for every config name and for the key None, or
  2. switch to a Parquet-based base dataset (the HF Hub often has a branch with the converted dataset). However, this may require deriving the dataset builder class from pie_datasets.ArrowBasedBuilder (instead of pie_datasets.GeneratorBasedBuilder).
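
Option 1 can be sketched as below. The config names are hypothetical placeholders, and the dict is shown standalone; we assume that in an actual dataset script it is assigned to the BASE_CONFIG_KWARGS_DICT attribute of the builder class.

```python
from typing import Any, Dict, List, Optional

# Hypothetical config names; replace with those of the actual dataset script.
CONFIG_NAMES: List[str] = ["default", "merged"]

# One entry per config name, plus the None key.
BASE_CONFIG_KWARGS_DICT: Dict[Optional[str], Dict[str, Any]] = {
    name: {"trust_remote_code": True} for name in [None, *CONFIG_NAMES]
}

for config_name, base_kwargs in BASE_CONFIG_KWARGS_DICT.items():
    print(config_name, base_kwargs)
```

Building the dict from the list of config names avoids forgetting an entry when a new config is added later.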

TODO:

Follow-ups:

  • update all dependencies to latest version via poetry update
  • update dataset scripts

ArneBinder self-assigned this Oct 7, 2025
ArneBinder changed the title from "update datasets dependency" to "update datasets dependency" Oct 7, 2025
ArneBinder marked this pull request as draft October 7, 2025 14:02
ArneBinder force-pushed the update-datasets-dependency branch 3 times, most recently from 13d3fc2 to 9f383b7 October 8, 2025 15:42

codecov bot commented Oct 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.62%. Comparing base (f687772) to head (3e42d27).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   97.47%   93.62%   -3.85%     
==========================================
  Files           5       10       +5     
  Lines         396      989     +593     
==========================================
+ Hits          386      926     +540     
- Misses         10       63      +53     

ArneBinder added the "breaking" (Breaking Changes) label Oct 8, 2025
ArneBinder changed the title from "update datasets dependency" to "update datasets dependency to 3.6.0" Oct 8, 2025
ArneBinder changed the title from "update datasets dependency to 3.6.0" to "update datasets dependency to >=3.0.2,<4.0.0 (results in 3.6.0)" Oct 8, 2025
ArneBinder force-pushed the update-datasets-dependency branch from 371b3bb to 707b06e October 10, 2025 14:33