Skip to content

refactor: rewrite expression-parsing for the 1,432th time #2478

@MarcoGorelli

Description

@MarcoGorelli
  • I've got big plans
  • please, no more refactors!
  • big plans

I think the expression parsing metadata needs rewriting, yet again, in light of:

First, what we need to ensure:

All the following should be allowed

  • nw.col('a').mean().over('b')
  • nw.col('a').rank()
  • nw.col('a').rank().over('b')

All the following should be disallowed:

  • nw.col('a').mean().over('b', order_by='i')
  • nw.col('a').rank().over('b', order_by='i')
  • nw.col('a').rank().abs().over('b')

All the following should be marked as order-dependent so that they can be disallowed for LazyFrame:

  • nw.col('a').diff()
  • nw.col('a').diff().abs().over(order_by='i') (diff wasn't immediately followed by over(order_by=...))

Solution

First, ExprKind gets removed, we don't need it

ExprMetadata contain the following:

  • last_node_is_window: whether the last node was a window (diff, shift, rolling_*, cum_*)
  • last_node_is_partitionable_window: whether the last node was a partitionable window (rank, is_unique)
  • is_partitionable: whether over(partition_by='a') can be applied to the current expression. False by default, as nw.col('a').over('b') doesn't make sense
  • is_partitioned: whether the expression has already has .over applied to it (whether with partition_by set, or with the implicit global (nw.lit(1)) partition). False by default
  • is_orderable: whether over(order_by='i') can be applied to the current expression. False by default, as nw.col('a').over(order_by='i') doesn't make sense
  • requires_ordering: whether the expression depends on physical row order. False by default
  • preserves_length: whether it preserves the original expression's length
  • is_scalar_like: whether it produces a single value (aggregation or literal)
  • changes_length: whether it changes length but without aggregating (e.g. Expr.drop_nulls)

Composition (anything non-noted stays unchangedD):

  • if an elementwise operation (e.g. .abs) is applied all attributes except last_node_is_* are preserved
  • if an aggregation is applied, then:
    • is_partitionable to True
    • is_orderable to False (nw.col('a').mean().over(order_by='i') doesn't make sense and is almost certainly a user error)
    • changes_length to False
    • is_scalar_like to True
    • preserves_length to False
  • if a window function, like rolling_* / diff / cum_* / shift, is applied, then set:
    • last_node_is_window set to True
    • is_partitionable to True
    • is_orderable to True
    • is_scalar_like to True
    • preserves_length to False
  • if a non-elementwise expression is applied (rank, is_unique), then set:
    • last_node_is_partitionable_window set to True
    • is_partitionable to False
    • is_partitioned to True
    • is_orderable to False
  • if a filtration is applied (drop_nulls, filter), then set:
    • is_partitionable to False
    • is_orderable to False
    • is_scalar_like to False, raise if is_scalar_like is True (can't do nw.col('a').mean().drop_nulls())
    • preserves_length to False
  • if over is applied, set:
    • is_partitionable to False, raise if it was False and partition_by was specified UNLESS last_node_is_partitionable_window, in which case the current partition_by is combined with the last expression's partition_by
    • is_partitioned to True, raise if it's already True (that would be a double-over, not allowed) UNLESS last_node_is_partitionable_window, in which case the current partition_by is combined with the last expression's partition_by
    • is_orderable to False, raise if it was False and order_by was specified
    • changes_length to False, raise if it was True
    • requires_ordering to False if last_node_is_window and order_by was specified and is_orderable is True
  • for n-ary operations, set:
    • is_partitionable to True if any input is partitionable
    • is_partitioned to True if any input is partitioned
    • is_orderable to True if any input is orderable
    • requires_ordering to True if any input requires ordering
    • preserves_length to True if any input preserves length
    • is_scalar_like to True if all inputs are scalar-like
    • changes_length to False, and raise if any input changes length

How this achieves the desired disallowments:

  • nw.col('a').rank().abs().over('b'): rank sets is_partitioned to True, then abs sets last_node_is_partitioned_window to False, and over finds that is_partitioned is True and last_node_is_partitioned_window is False and so disallows the operation

There's likely some mistakes / things I'm missing and overlooking here, but I just wanted to start by trying to collect my thoughts

EDIT

  • changes_length is unnecessary, it's determined by preserves_length and is_scalar_like
  • partitionable_window => unorderable_window
  • also need to preserve that .diff().shift().over(order_by=...) requires physical row order

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions