-
Notifications
You must be signed in to change notification settings - Fork 185
refactor: rewrite expression-parsing for the 1,432th time #2478
Copy link
Copy link
Closed
Labels
Description
- I've got big plans
- please, no more refactors!
- big plans
I think the expression parsing metadata needs rewriting, yet again, in light of:
- error reporting: unify error message for
nw.col('a').fill_null(3).over('b')#2444 - feat: add
rankfor Lazy backends #2310 (comment)
First, what we need to ensure:
All the following should be allowed
nw.col('a').mean().over('b')nw.col('a').rank()nw.col('a').rank().over('b')
All the following should be disallowed:
nw.col('a').mean().over('b', order_by='i')nw.col('a').rank().over('b', order_by='i')nw.col('a').rank().abs().over('b')
All the following should be marked as order-dependent so that they can be disallowed for LazyFrame:
nw.col('a').diff()nw.col('a').diff().abs().over(order_by='i')(diffwasn't immediately followed byover(order_by=...))
Solution
First, ExprKind gets removed, we don't need it
ExprMetadata contain the following:
last_node_is_window: whether the last node was a window (diff,shift,rolling_*,cum_*)last_node_is_partitionable_window: whether the last node was a partitionable window (rank,is_unique)is_partitionable: whetherover(partition_by='a')can be applied to the current expression.Falseby default, asnw.col('a').over('b')doesn't make senseis_partitioned: whether the expression has already has.overapplied to it (whether withpartition_byset, or with the implicit global (nw.lit(1)) partition).Falseby defaultis_orderable: whetherover(order_by='i')can be applied to the current expression.Falseby default, asnw.col('a').over(order_by='i')doesn't make senserequires_ordering: whether the expression depends on physical row order.Falseby defaultpreserves_length: whether it preserves the original expression's lengthis_scalar_like: whether it produces a single value (aggregation or literal)changes_length: whether it changes length but without aggregating (e.g.Expr.drop_nulls)
Composition (anything non-noted stays unchangedD):
- if an elementwise operation (e.g.
.abs) is applied all attributes exceptlast_node_is_*are preserved - if an aggregation is applied, then:
is_partitionabletoTrueis_orderabletoFalse(nw.col('a').mean().over(order_by='i')doesn't make sense and is almost certainly a user error)changes_lengthtoFalseis_scalar_liketoTruepreserves_lengthtoFalse
- if a window function, like
rolling_*/diff/cum_*/shift, is applied, then set:last_node_is_windowset toTrueis_partitionabletoTrueis_orderabletoTrueis_scalar_liketoTruepreserves_lengthtoFalse
- if a non-elementwise expression is applied (
rank,is_unique), then set:last_node_is_partitionable_windowset toTrueis_partitionabletoFalseis_partitionedtoTrueis_orderabletoFalse
- if a filtration is applied (
drop_nulls,filter), then set:is_partitionabletoFalseis_orderabletoFalseis_scalar_liketoFalse, raise ifis_scalar_likeis True (can't donw.col('a').mean().drop_nulls())preserves_lengthtoFalse
- if
overis applied, set:is_partitionabletoFalse, raise if it wasFalseandpartition_bywas specified UNLESSlast_node_is_partitionable_window, in which case the currentpartition_byis combined with the last expression'spartition_byis_partitionedtoTrue, raise if it's alreadyTrue(that would be a double-over, not allowed) UNLESSlast_node_is_partitionable_window, in which case the currentpartition_byis combined with the last expression'spartition_byis_orderabletoFalse, raise if it wasFalseandorder_bywas specifiedchanges_lengthtoFalse, raise if it wasTruerequires_orderingtoFalseiflast_node_is_windowandorder_bywas specified andis_orderableisTrue
- for n-ary operations, set:
is_partitionabletoTrueif any input is partitionableis_partitionedtoTrueif any input is partitionedis_orderabletoTrueif any input is orderablerequires_orderingtoTrueif any input requires orderingpreserves_lengthtoTrueif any input preserves lengthis_scalar_liketoTrueif all inputs are scalar-likechanges_lengthtoFalse, and raise if any input changes length
How this achieves the desired disallowments:
nw.col('a').rank().abs().over('b'):ranksetsis_partitionedtoTrue, thenabssetslast_node_is_partitioned_windowtoFalse, andoverfinds thatis_partitionedisTrueandlast_node_is_partitioned_windowisFalseand so disallows the operation
There's likely some mistakes / things I'm missing and overlooking here, but I just wanted to start by trying to collect my thoughts
EDIT
changes_lengthis unnecessary, it's determined bypreserves_lengthandis_scalar_likepartitionable_window=>unorderable_window- also need to preserve that
.diff().shift().over(order_by=...)requires physical row order
Reactions are currently unavailable