Skip to content

refactor: Add CompliantDataFrame.from_arrow#2306

Merged
dangotbanned merged 26 commits intomainfrom
df-from-arrow
Mar 29, 2025
Merged

refactor: Add CompliantDataFrame.from_arrow#2306
dangotbanned merged 26 commits intomainfrom
df-from-arrow

Conversation

@dangotbanned
Copy link
Copy Markdown
Member

@dangotbanned dangotbanned commented Mar 28, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

  • Refine the error handling for each backend
    • Most of it can live in ArrowDataFrame.from_arrow
  • Widen to accept pa.Table
    • since we support pyarrow>=11, I think this function can still provide some value for those earlier versions
    • The Arrow* impl has similar logic to pa.table in pyarrow==13.0.0

- Defined `ArrowStreamExportable` at runtime here
- Currently, just a typing-only def in `nw.functions`
Marked as draft since I'll do another refactor pass
Couldn't understand the `TypeVar` that `_PandasConvertible` has resolved when partially binding the method
Makes more sense for the logic to stay there
- Still need to refine the error handling for each backend
- Most of it can live in `ArrowDataFrame.from_arrow`
- Also want to widen this to accept `pa.Table`
  - since we support `pyarrow>=11`, I think this function still provide some value for those earlier versions
@dangotbanned dangotbanned added internal pyarrow Issue is related to pyarrow backend labels Mar 28, 2025
@dangotbanned dangotbanned mentioned this pull request Mar 28, 2025
6 tasks
@dangotbanned
Copy link
Copy Markdown
Member Author

@MarcoGorelli I'm close to finishing, just wanted to check in and see if you had any thoughts on supporting pa.Table here?

I left some notes in the description on why it makes sense to me 🙂

Would be very easy to support across all versions

@MarcoGorelli
Copy link
Copy Markdown
Member

thanks @dangotbanned ! sure, should be good to accept pa.table pre-pycapsule

@dangotbanned
Copy link
Copy Markdown
Member Author

dangotbanned commented Mar 28, 2025

sure, should be good to accept pa.table pre-pycapsule

Perfect, thanks @MarcoGorelli

Still need to test how pa.table responds to pa.Table for post-pycapsule.
I think it would use __arrow_c_stream__ but there's only one way to find out

Update

Yeah, all good!

pyarrow>=14

import pyarrow as pa

data = {"ab": [1, 2, 3], "ba": [4, 5, 6]}
df = pa.table(data)
>>> df
pyarrow.Table
ab: int64
ba: int64
----
ab: [[1,2,3]]
ba: [[4,5,6]]

repeat = pa.table(df)
>>> repeat
pyarrow.Table
ab: int64
ba: int64
----
ab: [[1,2,3]]
ba: [[4,5,6]]

So we can just document support for both - and what I have already adds support for pa.Table w/ pyarrow<14:

IntoArrowTable: TypeAlias = ArrowStreamExportable | pa.Table

- Following the new changes, this path should work (adb6bf1)
- Unclear on whether the `pandas` import will occur though
monkeypatch.delitem(sys.modules, "pandas")
df = nw.from_native(tbl, eager_only=True)
result = nw.from_arrow(df, backend=pl)
if PYARROW_VERSION < (14,):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/narwhals-dev/narwhals/actions/runs/14134927029/job/39604239875?pr=2306

Feel like this would be good to have coverage for, but not sure how feasible that is with the current CI.
Adding a # pragma: no cover seems wrong

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it means the previous xfail(s) were never triggered?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli I was planning to adapt the rest of the tests in a similar way following

Do you have a preference on what to do here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a no cover for now in (7bf8b27)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no strong preference

@dangotbanned dangotbanned changed the title refactor(DRAFT): Add CompliantDataFrame.from_arrow refactor: Add CompliantDataFrame.from_arrow Mar 28, 2025
backend_version = context._backend_version
if isinstance(data, pa.Table):
native = data
elif backend_version >= (14,) or isinstance(data, Collection):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supporting Collection here might seem strange.

My thinking is that were a list | tuple | dict of pyarrow objects passed - pa.table would still be capable of turning that into a pa.Table:
https://github.com/apache/arrow/blob/7c1b96e26910dd553ee0ccf0792e1854af6b0021/python/pyarrow/table.pxi#L4993-L5000

@dangotbanned dangotbanned marked this pull request as ready for review March 28, 2025 18:43
@dangotbanned dangotbanned mentioned this pull request Mar 29, 2025
10 tasks
Copy link
Copy Markdown
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks @dangotbanned !

Comment on lines +132 to +133
msg = "congratulations, you entered unreachable code - please report a bug"
raise AssertionError(msg)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, can't believe i wrote this, we may want to make a utility with a consistent message for these at some point

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly made me smile when I copied it over 😄

@dangotbanned dangotbanned merged commit a9e06d9 into main Mar 29, 2025
28 checks passed
@dangotbanned dangotbanned deleted the df-from-arrow branch March 29, 2025 22:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

internal pyarrow Issue is related to pyarrow backend

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants