Skip to content

dtypes support when reading csv files into Polars#598

Closed
grofte wants to merge 1 commit intokedro-org:mainfrom
grofte:fix/dtypes-load-polars
Closed

dtypes support when reading csv files into Polars#598
grofte wants to merge 1 commit intokedro-org:mainfrom
grofte:fix/dtypes-load-polars

Conversation

@grofte
Copy link
Copy Markdown

@grofte grofte commented Mar 7, 2024

Description

Polars needs Polars dtypes, rather than strings, when reading a csv if you want to set the dtypes rather than depend on schema inference. So this takes a string argument like "pl.Int64" in a sequence or mapping and substitutes it with the actual pl.Int64.

Development notes

I've tested it in the spaceflights tutorial, both that it works and fails as expected and that it works for all three ways of loading a CSV into Polars. For CSVDataset and Eager I did a .to_pandas() in the first line processing companies and for Lazy I did .collect().to_pandas().
I got the same number of passes, fails, and errors running make test_no_spark before and after adding the new code.
The code is basically the same for all three classes and could possibly be abstracted away. But I'm not sure about your preferences there.
I haven't written a test because I wasn't sure about requirements and standards. If you give me some hints I will try to write some tests.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@astrojuanlu
Copy link
Copy Markdown
Contributor

@grofte
Copy link
Copy Markdown
Author

grofte commented Mar 7, 2024

I will have a look, thank you.

@grofte
Copy link
Copy Markdown
Author

grofte commented Mar 7, 2024

Doh

my_polars_dataset:
  type: polars.CSVDataset
  filepath: data/01_raw/my_dataset.csv
  load_args:
    dtypes:
      product_age: "${polars:Float64}"
      group_identifier: "${polars:Utf8}"
    try_parse_dates: true

@noklam
Copy link
Copy Markdown
Contributor

noklam commented Mar 20, 2024

@grofte Is this PR still relevant? custom resolver should be the default solution for non-primitive type.

@astrojuanlu
Copy link
Copy Markdown
Contributor

Wondering: is there a way to ship a custom resolver with a plugin? So that users do pip install kedro-datasets[polars] and have the $polars resolver already available.

This is clearly the preferred option but I foresee this might become a common point of friction, maybe we can alleviate it somehow.

@grofte
Copy link
Copy Markdown
Author

grofte commented Mar 21, 2024

I'll delete it - definitely not relevant.
Maybe I'll make a PR on the docstring to add a link to the stuff about the omega config loader. There's already a link in load arguments to Polars so I assume it would be okay to add?

@merelcht
Copy link
Copy Markdown
Member

Hi @grofte are you still interested in doing something with this PR or can I close it?

@astrojuanlu
Copy link
Copy Markdown
Contributor

I think this PR is safe to close.

The open questions are:

  • How can we make this easier for users. The PR stems from some confusion that is cleared in the docs, but can still surprise people. From my comment above:

Wondering: is there a way to ship a custom resolver with a plugin? So that users who do pip install kedro-datasets[polars] already have the $polars resolver available and they don't need to define it themselves.

  • One outstanding question from @grofte:

Maybe I'll make a PR on the docstring to add a link to the stuff about the omega config loader. There's already a link in load arguments to Polars so I assume it would be okay to add?

@astrojuanlu
Copy link
Copy Markdown
Contributor

Closing and opened #687

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants