dtypes support when reading csv files into Polars #598
grofte wants to merge 1 commit into kedro-org:main
Conversation
|
Thanks for the PR, @grofte! Are you aware of this, though? https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#how-to-use-resolvers-in-the-omegaconfigloader |
|
I will have a look, thank you. |
|
Doh
my_polars_dataset:
  type: polars.CSVDataset
  filepath: data/01_raw/my_dataset.csv
  load_args:
    dtypes:
      product_age: "${polars:Float64}"
      group_identifier: "${polars:Utf8}"
    try_parse_dates: true |
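For the `${polars:...}` entries in a catalog like this to resolve, a resolver named `polars` has to be registered with the config loader. A minimal sketch of what that could look like in a project's `settings.py`, assuming `OmegaConfigLoader` is the active config loader; the helper name `polars_dtype` is made up for illustration and the lazy import is a design choice, not something the thread prescribes:

```python
# src/<package_name>/settings.py (usual location in a Kedro project)


def polars_dtype(name: str):
    """Map a name such as "Float64" to the attribute of that name on the polars module."""
    # Imported lazily (an assumption of this sketch) so that loading
    # settings.py does not itself require Polars to be importable.
    import polars as pl

    return getattr(pl, name)


# Keyword arguments handed to the config loader. "custom_resolvers" maps a
# resolver name to the callable that runs for "${polars:...}" in the catalog.
CONFIG_LOADER_ARGS = {
    "custom_resolvers": {
        "polars": polars_dtype,
    }
}
```

With this in place, `"${polars:Float64}"` in the catalog would be replaced by whatever `polars_dtype("Float64")` returns before the dataset is constructed.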
|
@grofte Is this PR still relevant? A custom resolver should be the default solution for non-primitive types. |
|
Wondering: is there a way to ship a custom resolver with a plugin? So that users do This is clearly the preferred option but I foresee this might become a common point of friction, maybe we can alleviate it somehow. |
|
I'll delete it - definitely not relevant. |
|
Hi @grofte are you still interested in doing something with this PR or can I close it? |
|
I think this PR is safe to close. The open questions are:
|
|
Closing and opened #687 |
Description
Polars needs actual Polars dtypes, rather than strings, when reading a CSV if you want to set the dtypes explicitly rather than depend on schema inference. So this PR takes a string argument like "pl.Int64" in a sequence or mapping and substitutes it with the actual pl.Int64.
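The substitution described here can be illustrated roughly as follows. This is not the code from the PR, only a sketch of the idea: the function name `resolve_dtypes`, the `prefix` parameter, and the `fake_pl` stand-in are all hypothetical, and the namespace is passed in as an argument so the example runs without Polars installed. In real use the `polars` module itself would be passed.

```python
from collections.abc import Mapping, Sequence
from types import SimpleNamespace
from typing import Any


def resolve_dtypes(dtypes: Any, namespace: Any, prefix: str = "pl.") -> Any:
    """Swap strings such as "pl.Int64" for the attribute of that name on `namespace`."""

    def lookup(value: Any) -> Any:
        if isinstance(value, str) and value.startswith(prefix):
            name = value[len(prefix):]
            if not hasattr(namespace, name):
                raise ValueError(f"Unknown dtype string: {value!r}")
            return getattr(namespace, name)
        return value  # already a dtype object, or not a prefixed string

    if isinstance(dtypes, Mapping):
        return {column: lookup(value) for column, value in dtypes.items()}
    if isinstance(dtypes, Sequence) and not isinstance(dtypes, str):
        return [lookup(value) for value in dtypes]
    return dtypes  # e.g. None: nothing to substitute


# Stand-in for the polars module, used only to demonstrate the substitution.
fake_pl = SimpleNamespace(Int64="<Int64>", Utf8="<Utf8>")
resolved = resolve_dtypes(
    {"product_age": "pl.Int64", "group_identifier": "pl.Utf8"}, fake_pl
)
```

Compared with a config-level resolver, this approach keeps the catalog free of interpolation syntax but requires every dataset class to repeat the substitution, which is the duplication the development notes mention.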
Development notes
I've tested it in the spaceflights tutorial, both that it works and fails as expected and that it works for all three ways of loading a CSV into Polars. For CSVDataset and Eager I did a .to_pandas() in the first line processing companies, and for Lazy I did .collect().to_pandas().
I got the same number of passes, fails, and errors running make test_no_spark before and after adding the new code.
The code is basically the same for all three classes and could possibly be abstracted away. But I'm not sure about your preferences there.
I haven't written a test because I wasn't sure about requirements and standards. If you give me some hints I will try to write some tests.
Checklist
RELEASE.md file