feat: add multivalue replacement in .str.replace[_all]#2886
feat: add multivalue replacement in .str.replace[_all]#2886MarcoGorelli merged 18 commits intonarwhals-dev:mainfrom
.str.replace[_all]#2886Conversation
|
Also want a review for the error propagation and the tests. There are a decent number of backends that do not support this feature so I raised an error before calling into it. In the tests I just marked as xfail, but perhaps it makes more sense to check that the underlying TypeError is being raised? |
I'd say we should be pretty explicit on which backends will support it Is it just Apologies if I'm misreading the tests 🫣 |
Nope I think you've nailed it. Polars and |
- propagate from backend errors - allow for backend patches to automatically override behavior
|
Added some more explicit error checks, and delayed triggering errors unless the backend fails (this should force our tests to begin failing if the backend does support this feature in the future). After aggregating this behavior, here is a table showing the support for each feature by backend.
edit: @dangotbanned added Ibis support for |
narwhals/_pandas_like/series_str.py
Outdated
| _, other_native = align_and_extract_native(self.compliant, value) | ||
| values: list[str] | ||
| if literal: | ||
| values = [ | ||
| string.replace(pattern, repl, n) | ||
| for string, repl in zip(self.native, value) | ||
| ] | ||
| else: | ||
| regex = re.compile(pattern) | ||
| values = [ | ||
| regex.sub(repl, string, count=max(n, 0)) | ||
| for string, repl in zip(self.native, value) | ||
| ] | ||
| constructor = self.compliant.__native_namespace__().Series | ||
| series = constructor(values, dtype=self.native.dtype, name=self.native.name) | ||
| return self.with_native(series) |
There was a problem hiding this comment.
I feel like this is going to be a footgun for performance, but could be convinced with a benchmark or two
How do the pandas, polars + duckdb versions scale for these lengths?
1_00010_000100_0001_000_000
There was a problem hiding this comment.
This will definitely be a slower implementation since it's pure Python, however the decision point would be between this and no support at all since the built-in .str.replace doesn't support the multivalue feature. So I was leaning towards "let's support this at the pandas level even if it is slow (akin to the pandas object dtype, which strings still default to)"
If you're still interested I'll code up a benchmark to show you the requested values.
There was a problem hiding this comment.
If you're still interested I'll code up a benchmark to show you the requested values.
Indeed I am! 🙂
If we're happy with the trade-off, then I think we can do the same for pyarrow as well?
Two other options:
- Emit a warning like we do in
PandasLikeGroupBywhen usingapply - Try to find some way to do this using more of the
pandasAPI.
My main concern is the pandas docs are full of warnings against iterating over Series/DataFrame - and currently this pays the cost twice by using zip 🤔
There was a problem hiding this comment.
thanks for looking into this!
between this and no support at all
I'm very strongly on the "no support at all" side
My general philosophy is that we should be translating to a library's own API - if they then follow slow paths internally, then whatever, that's up to them
There was a problem hiding this comment.
I'm in line with the opinions here, and I'm more than happy to declare the multivalue replacement in pandas as not supported and push up another commit removing that conditional branch to land this.
Just wanted to share my reasoning for incl. the pandas code above
- pandas defaults to storing strings as
"object"dtype backed, which under the hood is operating at the Python level. - we're adding a feature that is supported by 2 backends and not supported by 6+ other backends. Is this a problem, fine, as expected? Just thinking that as an end-user it would stink to see that a feature has such limited support.
Curious to everyone's thoughts on the above especially around point 2.?
There was a problem hiding this comment.
- sure, and that's pandas decision. if/when they switch to something better, then so long as we stick to the pandas api, then we'll also get those improvements. if we implement anything ourselves in python, then we won't
- i'm quite interested in enabling narwhals to be a nice front-end to duckdb, so tbh i'm ok with it so long as these conditions are met (which, as far as i can tell, they are):
- it's supported by at least polars lazy and duckdb
- it's very simple to implement
There was a problem hiding this comment.
Curious to everyone's thoughts on the above especially around point 2.?
i'm quite interested in enabling narwhals to be a nice front-end to duckdb,
it's supported by at least polars lazy and duckdb
My preference would be to not support it, since it can only serve a narrow section of users.
I'm fine with most of this - but the narrower we go - the more likely we are to push compatibility paths back to users:
👍 eager-only
👍 lazy-only
👍 eager + some lazy
👍 lazy excluding dask
👍 most of lazy
👎 only duckdb + polars
The second line in the docs is https://narwhals-dev.github.io/narwhals/
Full API support: cuDF, Modin, pandas, Polars, PyArrow.
There should probably be a * in there, but generally I'd like to push for features that can meet that claim 🙂
There was a problem hiding this comment.
One way that I'm trying to think of Narwhals is "what is a good dataframe API, which doesn't make too many assumptions about its backends, meant to look like?"
To me, such a good API generally supports expressified arguments
If in this case it's just a matter of passing things down as expressified arguments, then I think it should be fine
There was a problem hiding this comment.
To me, such a good API generally supports expressified arguments
@MarcoGorelli it seems like you've been spared my thoughts of what defines a good API for now 😏
We now have:
I'm more than happy to move forward on a feature supported by everything except: pandas_like, pyarrow, dask
There was a problem hiding this comment.
i'm confused by what the first sentence means sorry
narwhals/_spark_like/expr_str.py
Outdated
| if isinstance(value, str): | ||
|
|
||
| return self.compliant._with_elementwise(func) | ||
| def func(expr: Column) -> Column: | ||
| return replace_all(expr, pattern_, F.lit(value)) | ||
|
|
||
| return self.compliant._with_elementwise(func) | ||
| return self.compliant._with_elementwise( | ||
| lambda expr, value: replace_all(expr, pattern_, value), value=value | ||
| ) |
There was a problem hiding this comment.
@camriddell I had a read through the pyspark docs and was curious how you reached the conclusions in (#2886 (comment))?
This one was failing because pyspark.sql.functions.lit doesn't accept narwhals._spark_like.expr.SparkLikeExpr
All I've done here is rewritten it in the style of what you had for duckdb and it seems to work 🙂
There was a problem hiding this comment.
I was going to look into this today, I managed to break my local pyspark so had to first resolve that.
There was a problem hiding this comment.
Ahh I had pyspark working at some point, but haven't had much luck for a while
sqlframe has been fine for me though and installing with this should get you everything except pyspark and cudf 🙏
uv pip install -e ".[dask,modin]" --group "local-dev" --reinstall
narwhals/_polars/series.py
Outdated
| def replace( | ||
| self, | ||
| pattern: str, | ||
| value: str | PolarsSeries, | ||
| *, | ||
| literal: bool = False, | ||
| n: int = 1, | ||
| ) -> PolarsSeries: | ||
| if isinstance(value, Series): | ||
| value = value.to_native() | ||
|
|
||
| return self._compliant_series._with_native( | ||
| self._compliant_series.native.str.replace( | ||
| pattern, | ||
| value, # type: ignore[arg-type] | ||
| literal=literal, | ||
| n=n, | ||
| ) | ||
| ) | ||
|
|
||
| def replace_all( | ||
| self, pattern: str, value: str | PolarsSeries, *, literal: bool = False | ||
| ) -> PolarsSeries: | ||
| if isinstance(value, Series): | ||
| value = value.to_native() | ||
|
|
||
| return self._compliant_series._with_native( | ||
| self._compliant_series.native.str.replace_all( | ||
| pattern, | ||
| value, # type: ignore[arg-type] | ||
| literal=literal, | ||
| ) | ||
| ) |
There was a problem hiding this comment.
So there's a few things going on here:
- The annotation is
str | PolarsSeries(Compliant), but the runtime check is onSeries(Narwhals) (https://github.com/camriddell/narwhals/blob/e422e99e3dac9e73df8fcf980b4f42ece76c4058/narwhals/_polars/series.py#L17)
i. I can't think of a case where we would have anw.Serieshere - According to
polars.Series.str.replace_all, this wouldn't be supported?
i. It could be a documentation issue forpolars, since it looks like this uses@expr_dispatch
ii. Otherwise, we should do something likePolarsExpr.from_native(pl.lit(value.native)), and re-use thePolarExpr*version like inzfillnarwhals/narwhals/_polars/series.py
Lines 702 to 706 in 277881a
There was a problem hiding this comment.
I believe both 1. and 2. are related to the below, let me know if this does not address your concerns.
Polars supports passing a Series to value in polars.Series.str.replace_all so I wanted to capture that behavior here as well. My intuition was that a user should be able to do the following
nw.Series(…).str.replace(…, nw.Series(…))However because the Polars layer is thin, we never coerce passed arguments to their native types to be handled within the _polars.series layer (at least when I used the function signature above, I would get a nw.Series passed into the function).
I was actually planning to open an issue on Polars because the Series objects also support more than just a replacement str (disagreeing with the type hint of value in polars.Series.str.replace_all).
>>> import polars as pl # 1.30.0
>>> import numpy as np
>>> pl.Series(['abc', 'ab def']).str.replace('ab', pl.Series(['def', 'GHI'])) # replace_all works the same
shape: (2,)
Series: '' [str]
[
"defc"
"GHI def"
]
>>> pl.Series(['abc', 'ab def']).str.replace('ab', np.array(['def', 'GHI']))
shape: (2,)
Series: '' [str]
[
"defc"
"GHI def"
]There was a problem hiding this comment.
i. I can't think of a case where we would have a
nw.Serieshere
Okay I had a closer look and what you've written is correct from a typing perspective.
The issue is we really should've passed the CompliantSeries down instead, because then we can just use the __getattr__ instead of adding new methods
I'm gonna push a fix for that now 🥳
Updated
fix: Pass down CompliantSeries not nw.Series
Weirdly, the same thing has the wrong name on nw.Series 🤔
Lines 376 to 381 in 0e80656
…ddell/narwhals into pr/camriddell/2886
Pretty much the same as narwhals-dev#2886 (comment) I've added annotations for all the other ones that would accept the same
|
@MarcoGorelli and @dangotbanned I believe this is ready for a more formal review. Thanks for all of your help getting this to the finish line! |
|
thanks @camriddell for bearing with this one, could you resolve the conflicts please and we can try to get it in? |
|
@MarcoGorelli resolved the conflicts aka incorporated the new generic |
|
Nit: would using elementwise instead of multivalue be a better fit for the title? |
I'm neutral on this, I think elementwise alone is a little ambiguous since replacement already occurs elementwise (on the input). |
Not sure if I'm misreading this or the test (or if it's out of date), but it looks like you've got this supported for Ibis too now? |
MarcoGorelli
left a comment
There was a problem hiding this comment.
Looks great, thank you so much @camriddell , and @dangotbanned for review
Before merging, please remember to include 'cudf' in the xfails (same places as pandas/pyarrow/etc)
It's out of date, @dangotbanned took excellent care of the Ibis support! |
|
gentle ping on this, i think it's really close to the finish line |
Thanks for the nudge, I had missed your earlier comment about |
|
@MarcoGorelli I believe this is ready for a final look through! |
|
thanks, shipping it got a feeling there's gonna be some cudf failures but we need to do some cudf fixups anyway further expressifications welcome! |

What type of PR is this? (check all applicable)
Related issues
Checklist
If you have comments or can explain your changes, please do so below
@MarcoGorelli any thoughts on the docs for this feature? Is it worth adding examples or do you think just the update to the typing here suffices?