
added PyarrowTableResult #830

Merged
skrawcz merged 1 commit into main from feat/pyarrow-result-builder
Apr 22, 2024

@zilto zilto commented Apr 17, 2024

Passing to.SAVER(dependencies=["NODE_NAME"], combine=PyarrowTableResult()) converts the specified node to a pyarrow.Table before materialization. The initial motivation was to let the dlt DataSaver plugin support more than pd.DataFrame and pyarrow.Table. More generally, it can be useful for platform teams that want a "single way to store parquet files" that is independent of any specific library's API (e.g., pandas, polars).

see #829 for more details

Changes

  • added h_pyarrow and tests
  • updated the dlt plugin example notebook

How I tested this

  • added 2 tests

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

Comment on lines +15 to +21
for example:
- pandas
- polars
- dask
- vaex
- ibis
- duckdb results

@skrawcz skrawcz Apr 22, 2024

it would be nice to be stricter on types...

@skrawcz skrawcz Apr 22, 2024

e.g.

def input_types(self) -> List[Type[Type]]:
    """Gives the applicable types to this result builder.
    This is optional for backwards compatibility, but is recommended.

    :return: A list of types that this can apply to.
    """
    _types = []
    try:
        import ...
    except ...
    return _types
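
One way the elided try/import pattern above could be filled in is sketched below. It is only an illustration: the candidate libraries (pandas, polars) are assumptions rather than the list the reviewer had in mind, `_CANDIDATES` is a hypothetical name, and the function is written as a free function instead of a result-builder method so that it runs on its own.

```python
import importlib
from typing import List, Type

# Hypothetical (module, attribute) pairs; the original comment elides these.
_CANDIDATES = [("pandas", "DataFrame"), ("polars", "DataFrame")]


def input_types() -> List[Type]:
    """Collect dataframe classes from whichever optional libraries are installed."""
    _types: List[Type] = []
    for module_name, attr in _CANDIDATES:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Library not installed: skip it instead of failing.
            continue
        candidate = getattr(module, attr, None)
        if isinstance(candidate, type):
            _types.append(candidate)
    return _types
```

The returned list depends on the environment: it is empty when neither library is installed, which is one reason an explicit list may be harder to maintain than a protocol check.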

@zilto zilto Apr 22, 2024

In that case, the real check is whether the object implements __dataframe__(), which is done through pyarrow.interchange.from_dataframe() inside build_result(). PyarrowTableResult serves a slightly different role, that of a "universal adapter", which lets us avoid maintaining an explicit list of types (which is bound to grow). I opted not to include input_types() if it would only return Any.
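
The protocol-based check described in this comment can be sketched roughly as follows. This is not the code from the PR: `supports_interchange` and `to_arrow_table` are hypothetical helper names, and the only library call used is `pyarrow.interchange.from_dataframe()`, which assumes a pyarrow version that ships the interchange module.

```python
from typing import Any


def supports_interchange(obj: Any) -> bool:
    """Return True if obj exposes the dataframe interchange protocol."""
    return callable(getattr(obj, "__dataframe__", None))


def to_arrow_table(obj: Any):
    """Convert an interchange-compatible dataframe to a pyarrow.Table."""
    if not supports_interchange(obj):
        raise TypeError(
            f"{type(obj).__name__} does not implement __dataframe__()"
        )
    # Imported lazily so the protocol check above works without pyarrow.
    from pyarrow.interchange import from_dataframe

    return from_dataframe(obj)
```

Any library whose dataframe object implements `__dataframe__()` would pass this check, so nothing here has to name pandas, polars, or other libraries explicitly.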

@skrawcz skrawcz merged commit 26bc1cc into main Apr 22, 2024
@skrawcz skrawcz deleted the feat/pyarrow-result-builder branch April 22, 2024 03:32