Skip to content

feat: add multivalue replacement in .str.replace[_all]#2886

Merged
MarcoGorelli merged 18 commits intonarwhals-dev:mainfrom
camriddell:enh/str-replace_multivalue
Aug 28, 2025
Merged

feat: add multivalue replacement in .str.replace[_all]#2886
MarcoGorelli merged 18 commits intonarwhals-dev:mainfrom
camriddell:enh/str-replace_multivalue

Conversation

@camriddell
Copy link
Member

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

@MarcoGorelli any thoughts on the docs for this feature? Is it worth adding examples or do you think just the update to the typing here suffices?

@camriddell
Copy link
Member Author

Also want a review for the error propagation and the tests. There are a decent number of backends that do not support this feature so I raised an error before calling into it. In the tests I just marked as xfail, but perhaps it makes more sense to check that the underlying TypeError is being raised?

@dangotbanned
Copy link
Member

dangotbanned commented Jul 25, 2025

any thoughts on the docs for this feature?

I'd say we should be pretty explicit on which backends will support it

Is it just polars and duckdb?

Apologies if I'm misreading the tests 🫣

@camriddell
Copy link
Member Author

any thoughts on the docs for this feature?

I'd say we should be pretty explicit on which backends will support it

Is it just polars and duckdb?

Apologies if I'm misreading the tests 🫣

Nope I think you've nailed it. Polars and duckdb support it natively. I added pandas via manual iteration (since pandas drops down to slow iteration for a number of ops anyways).

@camriddell
Copy link
Member Author

camriddell commented Jul 28, 2025

Added some more explicit error checks, and delayed triggering errors unless the backend fails (this should force our tests to begin failing if the backend does support this feature in the future). After aggregating this behavior, here is a table showing the support for each feature by backend.

Backend str.replace[scalar] str.replace[multivalue] str.replace_all[scalar] str.replace_all[multivalue]
pandas ✅ (pure Python) ✅ (pure Python)
polars[eager]
polars[lazy]
duckdb
spark_like
ibis
pyarrow
dask

edit: @dangotbanned added Ibis support for replace_all[multivalue]

Comment on lines +34 to +49
_, other_native = align_and_extract_native(self.compliant, value)
values: list[str]
if literal:
values = [
string.replace(pattern, repl, n)
for string, repl in zip(self.native, value)
]
else:
regex = re.compile(pattern)
values = [
regex.sub(repl, string, count=max(n, 0))
for string, repl in zip(self.native, value)
]
constructor = self.compliant.__native_namespace__().Series
series = constructor(values, dtype=self.native.dtype, name=self.native.name)
return self.with_native(series)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is going to be a footgun for performance, but could be convinced with a benchmark or two

How do the pandas, polars + duckdb versions scale for these lengths?

  1. 1_000
  2. 10_000
  3. 100_000
  4. 1_000_000

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will definitely be a slower implementation since it's pure Python, however the decision point would be between this and no support at all since the built-in .str.replace doesn't support the multivalue feature. So I was leaning towards "let's support this at the pandas level even if it is slow (akin to the pandas object dtype, which strings still default to)"

If you're still interested I'll code up a benchmark to show you the requested values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're still interested I'll code up a benchmark to show you the requested values.

Indeed I am! 🙂

If we're happy with the trade-off, then I think we can do the same for pyarrow as well?

Two other options:

  1. Emit a warning like we do in PandasLikeGroupBy when using apply
  2. Try to find some way to do this using more of the pandas API.

My main concern is the pandas docs are full of warnings against iterating over Series/DataFrame - and currently this pays the cost twice by using zip 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for looking into this!

between this and no support at all

I'm very strongly on the "no support at all" side

My general philosophy is that we should be translating to a library's own API - if they then follow slow paths internally, then whatever, that's up to them

Copy link
Member Author

@camriddell camriddell Jul 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in line with the opinions here, and I'm more than happy to declare the multivalue replacement in pandas as not supported and push up another commit removing that conditional branch to land this.

Just wanted to share my reasoning for incl. the pandas code above

  1. pandas defaults to storing strings as "object" dtype backed, which under the hood is operating at the Python level.
  2. we're adding a feature that is supported by 2 backends and not supported by 6+ other backends. Is this a problem, fine, as expected? Just thinking that as an end-user it would stink to see that a feature has such limited support.

Curious to everyone's thoughts on the above especially around point 2.?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. sure, and that's pandas decision. if/when they switch to something better, then so long as we stick to the pandas api, then we'll also get those improvements. if we implement anything ourselves in python, then we won't
  2. i'm quite interested in enabling narwhals to be a nice front-end to duckdb, so tbh i'm ok with it so long as these conditions are met (which, as far as i can tell, they are):
  • it's supported by at least polars lazy and duckdb
  • it's very simple to implement

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@camriddell

Curious to everyone's thoughts on the above especially around point 2.?

@MarcoGorelli

i'm quite interested in enabling narwhals to be a nice front-end to duckdb,
it's supported by at least polars lazy and duckdb

My preference would be to not support it, since it can only serve a narrow section of users.
I'm fine with most of this - but the narrower we go - the more likely we are to push compatibility paths back to users:

👍 eager-only
👍 lazy-only
👍 eager + some lazy
👍 lazy excluding dask
👍 most of lazy
👎 only duckdb + polars

The second line in the docs is https://narwhals-dev.github.io/narwhals/

Full API support: cuDF, Modin, pandas, Polars, PyArrow.

There should probably be a * in there, but generally I'd like to push for features that can meet that claim 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way that I'm trying to think of Narwhals is "what is a good dataframe API, which doesn't make too many assumptions about its backends, meant to look like?"

To me, such a good API generally supports expressified arguments

If in this case it's just a matter of passing things down as expressified arguments, then I think it should be fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, such a good API generally supports expressified arguments

@MarcoGorelli it seems like you've been spared my thoughts of what defines a good API for now 😏

We now have:

I'm more than happy to move forward on a feature supported by everything except: pandas_like, pyarrow, dask

Copy link
Member

@MarcoGorelli MarcoGorelli Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm confused by what the first sentence means sorry

Comment on lines +33 to +41
if isinstance(value, str):

return self.compliant._with_elementwise(func)
def func(expr: Column) -> Column:
return replace_all(expr, pattern_, F.lit(value))

return self.compliant._with_elementwise(func)
return self.compliant._with_elementwise(
lambda expr, value: replace_all(expr, pattern_, value), value=value
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@camriddell I had a read through the pyspark docs and was curious how you reached the conclusions in (#2886 (comment))?

This one was failing because pyspark.sql.functions.lit doesn't accept narwhals._spark_like.expr.SparkLikeExpr

All I've done here is rewritten it in the style of what you had for duckdb and it seems to work 🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to look into this today, I managed to break my local pyspark so had to first resolve that.

Copy link
Member

@dangotbanned dangotbanned Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I had pyspark working at some point, but haven't had much luck for a while

sqlframe has been fine for me though and installing with this should get you everything except pyspark and cudf 🙏

uv pip install -e ".[dask,modin]" --group "local-dev"  --reinstall

Comment on lines +708 to +740
def replace(
self,
pattern: str,
value: str | PolarsSeries,
*,
literal: bool = False,
n: int = 1,
) -> PolarsSeries:
if isinstance(value, Series):
value = value.to_native()

return self._compliant_series._with_native(
self._compliant_series.native.str.replace(
pattern,
value, # type: ignore[arg-type]
literal=literal,
n=n,
)
)

def replace_all(
self, pattern: str, value: str | PolarsSeries, *, literal: bool = False
) -> PolarsSeries:
if isinstance(value, Series):
value = value.to_native()

return self._compliant_series._with_native(
self._compliant_series.native.str.replace_all(
pattern,
value, # type: ignore[arg-type]
literal=literal,
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there's a few things going on here:

  1. The annotation is str | PolarsSeries (Compliant), but the runtime check is on Series (Narwhals) (https://github.com/camriddell/narwhals/blob/e422e99e3dac9e73df8fcf980b4f42ece76c4058/narwhals/_polars/series.py#L17)
    i. I can't think of a case where we would have a nw.Series here
  2. According to polars.Series.str.replace_all, this wouldn't be supported?
    i. It could be a documentation issue for polars, since it looks like this uses @expr_dispatch
    ii. Otherwise, we should do something like PolarsExpr.from_native(pl.lit(value.native)), and re-use the PolarExpr* version like in zfill
    def zfill(self, width: int) -> PolarsSeries:
    series = self._compliant_series
    name = series.name
    ns = series.__narwhals_namespace__()
    return series.to_frame().select(ns.col(name).str.zfill(width)).get_column(name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe both 1. and 2. are related to the below, let me know if this does not address your concerns.

Polars supports passing a Series to value in polars.Series.str.replace_all so I wanted to capture that behavior here as well. My intuition was that a user should be able to do the following

nw.Series(…).str.replace(…, nw.Series(…))

However because the Polars layer is thin, we never coerce passed arguments to their native types to be handled within the _polars.series layer (at least when I used the function signature above, I would get a nw.Series passed into the function).


I was actually planning to open an issue on Polars because the Series objects also support more than just a replacement str (disagreeing with the type hint of value in polars.Series.str.replace_all).

>>> import polars as pl # 1.30.0
>>> import numpy as np
>>> pl.Series(['abc', 'ab def']).str.replace('ab', pl.Series(['def', 'GHI'])) # replace_all works the same
shape: (2,)
Series: '' [str]
[
	"defc"
	"GHI def"
]
>>> pl.Series(['abc', 'ab def']).str.replace('ab', np.array(['def', 'GHI']))
shape: (2,)
Series: '' [str]
[
	"defc"
	"GHI def"
]

Copy link
Member

@dangotbanned dangotbanned Jul 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i. I can't think of a case where we would have a nw.Series here

Okay I had a closer look and what you've written is correct from a typing perspective.

The issue is we really should've passed the CompliantSeries down instead, because then we can just use the __getattr__ instead of adding new methods

I'm gonna push a fix for that now 🥳

Updated

fix: Pass down CompliantSeries not nw.Series

Weirdly, the same thing has the wrong name on nw.Series 🤔

narwhals/narwhals/series.py

Lines 376 to 381 in 0e80656

def _extract_native(self, arg: Any) -> Any:
from narwhals.series import Series
if isinstance(arg, Series):
return arg._compliant_series
return arg

@dangotbanned dangotbanned added the enhancement New feature or request label Aug 1, 2025
@camriddell camriddell marked this pull request as ready for review August 5, 2025 16:09
@camriddell
Copy link
Member Author

@MarcoGorelli and @dangotbanned I believe this is ready for a more formal review.

Thanks for all of your help getting this to the finish line!

@MarcoGorelli
Copy link
Member

thanks @camriddell for bearing with this one, could you resolve the conflicts please and we can try to get it in?

@camriddell
Copy link
Member Author

@MarcoGorelli resolved the conflicts aka incorporated the new generic _sql backend.

@dangotbanned
Copy link
Member

Nit: would using elementwise instead of multivalue be a better fit for the title?

@camriddell
Copy link
Member Author

camriddell commented Aug 13, 2025

Nit: would using elementwise instead of multivalue be a better fit for the title?

I'm neutral on this, I think elementwise alone is a little ambiguous since replacement already occurs elementwise (on the input).
That said, I'm not tied to multivalue either because I also think its not entirely clear what is being talked about. Definitely open to refining the terms used here!

@MarcoGorelli
Copy link
Member

Added some more explicit error checks, and delayed triggering errors unless the backend fails (this should force our tests to begin failing if the backend does support this feature in the future). After aggregating this behavior, here is a table showing the support for each feature by backend.

Backend str.replace[scalar] str.replace[multivalue] str.replace_all[scalar] str.replace_all[multivalue]
pandas ✅ ✅ (pure Python) ✅ ✅ (pure Python)
polars[eager] ✅ ✅ ✅ ✅
polars[lazy] ✅ ✅ ✅ ✅
duckdb ❌ ❌ ✅ ✅
spark_like ❌ ❌ ✅ ✅
ibis ❌ ❌ ✅ ❌
pyarrow ✅ ❌ ✅ ❌
dask ✅ ❌ ✅ ❌

Not sure if I'm misreading this or the test (or if it's out of date), but it looks like you've got this supported for Ibis too now?

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you so much @camriddell , and @dangotbanned for review

Before merging, please remember to include 'cudf' in the xfails (same places as pandas/pyarrow/etc)

@camriddell
Copy link
Member Author

Added some more explicit error checks, and delayed triggering errors unless the backend fails (this should force our tests to begin failing if the backend does support this feature in the future). After aggregating this behavior, here is a table showing the support for each feature by backend.
Backend str.replace[scalar] str.replace[multivalue] str.replace_all[scalar] str.replace_all[multivalue]
pandas ✅ ✅ (pure Python) ✅ ✅ (pure Python)
polars[eager] ✅ ✅ ✅ ✅
polars[lazy] ✅ ✅ ✅ ✅
duckdb ❌ ❌ ✅ ✅
spark_like ❌ ❌ ✅ ✅
ibis ❌ ❌ ✅ ❌
pyarrow ✅ ❌ ✅ ❌
dask ✅ ❌ ✅ ❌

Not sure if I'm misreading this or the test (or if it's out of date), but it looks like you've got this supported for Ibis too now?

It's out of date, @dangotbanned took excellent care of the Ibis support!
I'll update the table

@MarcoGorelli
Copy link
Member

gentle ping on this, i think it's really close to the finish line

@camriddell
Copy link
Member Author

gentle ping on this, i think it's really close to the finish line

Thanks for the nudge, I had missed your earlier comment about cudf

@camriddell
Copy link
Member Author

@MarcoGorelli I believe this is ready for a final look through!

@MarcoGorelli
Copy link
Member

thanks, shipping it

got a feeling there's gonna be some cudf failures but we need to do some cudf fixups anyway

further expressifications welcome!

@MarcoGorelli MarcoGorelli merged commit bfb61bf into narwhals-dev:main Aug 28, 2025
31 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enh]: Allow .str.replace to receive Expr as value

3 participants