
[Story] Use datasource statistics in cudf-polars #19388

@rjzamora

Description

In streaming cudf-polars, we do not yet use datasource statistics (row-count and unique-value estimates) to inform the physical plan. We use the average file size (per column) to set the partition size for Scan/DataFrameScan operations, but we do not leverage sampled statistics to choose between shuffling and tree reductions, or to repartition after Join or GroupBy operations.

Now that #19276 is in, we have the necessary classes to store and track the statistics needed for these potential optimizations.
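
To make the discussion concrete, here is a minimal sketch of what the statistics containers referenced below might look like. The class names `ColumnStats` and `RowCount` come from this issue, but the fields shown (`UniqueStats`, `count`, `fraction`, `value`, `exact`) are illustrative assumptions, not the actual cudf-polars API:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UniqueStats:
    """Hypothetical unique-value estimate: absolute count and
    fraction of the (estimated) row count."""
    count: int
    fraction: float


@dataclass
class ColumnStats:
    """Per-column statistics. unique_stats stays None until the
    column is actually sampled (see step [1] below)."""
    name: str
    unique_stats: UniqueStats | None = None


@dataclass
class RowCount:
    """Row-count estimate for an IR node; exact=True would mean the
    count came directly from source metadata rather than sampling."""
    value: int
    exact: bool = False
```

The plan below populates `dict[IR, dict[str, ColumnStats]]` and `dict[IR, RowCount]` mappings keyed by IR node, so these containers intentionally carry no reference back to the IR graph.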

Next Steps:

  • [0] Make the number of sampled Parquet files and row-groups configurable ([FEA] Make max_file_samples and max_rg_samples configurable in cudf-polars #19389).
  • [1] Implement a post_traversal pass over the un-lowered IR graph to populate dict[IR, dict[str, ColumnStats]] and dict[IR, RowCount] data structures with base (i.e. source) statistics ([FEA] Use post_traversal to populate "base" column statistics #19390).
    • This traversal will not update the ColumnStats.unique_stats attribute for each column yet.
    • The goal of this traversal is to make sure DataSourceInfo and source-based row-count estimates are fully propagated.
    • We can also use this traversal to call add_unique_stats_column for known GroupBy and Distinct key columns. This way, the first call to DataSourceInfo.unique_stats(*) (expected during a later traversal) will collect row-group information for all known GroupBy/Distinct keys.
  • [2] Leverage DataSourceInfo.unique_stats(*) statistics during lowering to avoid the need for the unique_fraction user configuration ([FEA] Leverage column statistics to replace unique_fraction configuration #19391).
    • NOTE: This step can also be implemented after (4). Steps (3)-(5) do not depend on this feature.
    • This will require us to attach the dict[IR, dict[str, ColumnStats]] data structure to the caching-visitor state (so it is accessible to the lowering logic for each IR node).
  • [3] Update the traversal in (1) to also collect join and join-key information in other data structures ([FEA] Collect join-key information while gathering statistics in cudf-polars #19392).
  • [4] Implement a second post_traversal pass over the un-lowered IR graph to leverage the join heuristics in (3) to adjust the dict[IR, RowCount] values and populate the ColumnStats.unique_stats attributes ([FEA] Add a second IR-statistics traversal for cardinality estimates #19393).
  • [5] Use the features from (1)-(4) to inject repartitioning after operations that lead to a drop in the cardinality estimate. To do this well, we will probably need to refine the changes made in (0)-(4). If row-count estimates are insufficient, this optimization may be ineffective and/or risky.
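
The core mechanic in step [1] is a post-order walk that propagates source statistics from leaf (Scan/DataFrameScan) nodes up to their parents. A minimal sketch of that idea follows; Node, its children tuple, and the max-of-children propagation rule are stand-ins for illustration, not the real cudf-polars IR or its cardinality heuristics:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Stand-in for an un-lowered IR node (hashable, so it can key
    the per-node statistics dictionaries)."""
    name: str
    children: tuple["Node", ...] = ()


def post_traversal(root: Node) -> list[Node]:
    """Return nodes in post order (every child before its parent),
    computed as the reverse of an iterative pre-order walk."""
    stack, preorder = [root], []
    while stack:
        node = stack.pop()
        preorder.append(node)
        stack.extend(node.children)
    return list(reversed(preorder))


def populate_base_row_counts(
    root: Node, source_rows: dict[str, int]
) -> dict[Node, int]:
    """Populate a dict[Node, int] of row-count estimates: leaves take
    the sampled source estimate; internal nodes use a crude upper
    bound (max of child estimates) as a placeholder heuristic."""
    counts: dict[Node, int] = {}
    for node in post_traversal(root):
        if not node.children:
            counts[node] = source_rows.get(node.name, 0)
        else:
            counts[node] = max(counts[c] for c in node.children)
    return counts
```

Step [4] would then be a second pass of the same shape that revisits these estimates once unique-value statistics are available, adjusting the counts after Join/GroupBy/Distinct nodes instead of taking a simple maximum.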

Labels

Python (Affects Python cuDF API) · cudf-polars (Issues specific to cudf-polars) · feature request (New feature or request)
