Skip to content

fix: don't downcast large_string to string unnecessarily in concat_str for PyArrow#2176

Merged
MarcoGorelli merged 6 commits into
narwhals-dev:mainfrom
MarcoGorelli:large-string
Mar 14, 2025
Merged

fix: don't downcast large_string to string unnecessarily in concat_str for PyArrow#2176
MarcoGorelli merged 6 commits into
narwhals-dev:mainfrom
MarcoGorelli:large-string

Conversation

@MarcoGorelli
Copy link
Copy Markdown
Member

@MarcoGorelli MarcoGorelli commented Mar 9, 2025

closes #2097

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #<issue number>

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@MarcoGorelli MarcoGorelli force-pushed the large-string branch 2 times, most recently from 82c17c3 to 44ca7f6 Compare March 9, 2025 16:30
@MarcoGorelli
Copy link
Copy Markdown
Member Author

this would break a plotly test

FAILED tests/test_optional/test_px/test_px_hover.py::test_sunburst_hoverdict_color[pyarrow] - pyarrow.lib.ArrowInvalid: Schema at index 1 was different: 
labels: large_string
parent: large_string
id: large_string
pop: double
country: large_string
continent: large_string
lifeExp: double
vs
labels: large_string
parent: string
id: large_string
pop: double
country: large_string
continent: large_string
lifeExp: double

I don't want to rush this then, leaving it out of tomorrow's release, we'll think about it for the next one

@MarcoGorelli
Copy link
Copy Markdown
Member Author

The plotly test could be fixed by making nw.lit('a') default to large_string in PyArrow. But...should we?

will continue on the issue

@MarcoGorelli MarcoGorelli changed the title fix: map nw.String to pa.large_string (instead of pa.string) and nw.List to pa.large_list instead of pa.list_ fix: don't downcast large_string to string unnecessarily in concat_str Mar 14, 2025
@MarcoGorelli MarcoGorelli changed the title fix: don't downcast large_string to string unnecessarily in concat_str fix: don't downcast large_string to string unnecessarily in concat_str for PyArrow Mar 14, 2025
@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 14, 2025 09:39
Comment thread narwhals/_arrow/namespace.py Outdated
Comment thread narwhals/_arrow/utils.py Outdated
Comment thread narwhals/_arrow/utils.py Outdated
Comment thread narwhals/_arrow/utils.py
return series._from_native_series(concat), offset_left + offset_right


def cast_to_comparable_string_types(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli I think I can shorten this and avoid the # type: ignore[arg-type].

Mind if I add a commit?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure thaknks

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli I think I can shorten this and avoid the # type: ignore[arg-type].

I managed to get there in fewer characters - but LOC is bound by the name chunked_arrays 😄

(4dfc6a3)

I'll conclude my round of code golf there for today

Comment on lines +2231 to +2232
schema: The DataFrame schema as Schema or dict of {name: type}. If not
specified, the schema will be inferred by the native library.
Copy link
Copy Markdown
Member

@dangotbanned dangotbanned Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a question than a suggestion.

Is there a preference between these two?

If not provided

and

If not specified

I probably wouldn't have noticed this if the lit(..., dtype=...) doc hadn't shown up in the diff

dtype: The data type of the literal value. If not provided, the data type will
be inferred by the native library.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dunno, don't really mind

@MarcoGorelli
Copy link
Copy Markdown
Member Author

thanks Dan!

@MarcoGorelli MarcoGorelli merged commit df38225 into narwhals-dev:main Mar 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

api: use large_string / large_list when converting Narwhals types -> PyArrow?

2 participants