[Story] Use datasource statistics in cudf-polars #19388
6 of 6 issues completed
Labels: Python (Affects Python cuDF API), cudf-polars (Issues specific to cudf-polars), feature request (New feature or request)
Description
In streaming cudf-polars, we do not yet use datasource statistics (row-count and unique-value estimates) to inform the physical plan. We use the average file size (per column) to set the partition size for Scan/DataFrameScan operations, but we don't leverage sampled statistics to choose between shuffling and tree reductions, or to repartition after Join or GroupBy operations.
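To make the current behavior concrete, here is a hypothetical sketch of a partition-sizing heuristic driven only by average file size, as described above. All names and the 256 MiB default are illustrative assumptions, not the actual cudf-polars implementation.

```python
import math


def choose_partition_count(
    total_bytes: int,  # summed on-disk size of the projected columns
    num_files: int,  # number of input files for the Scan
    target_partition_bytes: int = 256 * 1024**2,  # assumed target size
) -> int:
    """Pick a partition count for a Scan from average file size alone.

    No sampled statistics (row counts, unique-value estimates) are
    consulted -- this is exactly the limitation the issue describes.
    """
    avg_file_bytes = total_bytes / max(num_files, 1)
    files_per_partition = max(
        int(target_partition_bytes // max(avg_file_bytes, 1)), 1
    )
    return max(math.ceil(num_files / files_per_partition), 1)
```

For example, sixteen 64 MiB files with a 256 MiB target collapse into four partitions of four files each, regardless of how many rows or distinct keys those files actually contain.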
Now that #19276 is in, we have the necessary classes to store and track the statistics needed for these potential optimizations.
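As a rough mental model of those containers, the sketch below shows the shape of the data being tracked. The real `ColumnStats`, `RowCount`, and `DataSourceInfo` classes added in #19276 live in cudf-polars and differ in detail; everything here is simplified for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class UniqueStats:
    """Unique-value estimates for one column (illustrative only)."""

    count: int | None = None  # estimated number of unique values
    fraction: float | None = None  # estimated unique fraction


@dataclass
class ColumnStats:
    name: str
    # Left unset by the first traversal; populated by a later pass.
    unique_stats: UniqueStats | None = None


@dataclass
class RowCount:
    value: int | None = None  # estimated row count (None if unknown)
    exact: bool = False  # whether the estimate is known to be exact


@dataclass
class DataSourceInfo:
    paths: list[str] = field(default_factory=list)
    # Columns whose unique-value statistics should be sampled lazily
    # (e.g. known GroupBy/Distinct keys registered up front).
    _unique_stats_columns: set[str] = field(default_factory=set)

    def add_unique_stats_column(self, name: str) -> None:
        self._unique_stats_columns.add(name)
```

The point of registering key columns eagerly is that the first (lazy) statistics sample can then cover every known key in one pass over the row-group metadata.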
Next Steps:
- [0] Make the number of sampled parquet files and parquet row-groups configurable ([FEA] Make `max_file_samples` and `max_rg_samples` configurable in cudf-polars #19389).
- [1] Implement a `post_traversal` pass over the un-lowered IR graph to populate `dict[IR, dict[str, ColumnStats]]` and `dict[IR, RowCount]` data structures with base (i.e. source) statistics ([FEA] Use `post_traversal` to populate "base" column statistics #19390).
  - This traversal will not update the `ColumnStats.unique_stats` attribute for each column yet.
  - The goal of this traversal is to make sure `DataSourceInfo` and source-based row-count estimates are fully propagated.
  - We can also use this traversal to call `add_unique_stats_column` for known `GroupBy` and `Distinct` key columns. This way, the first call to `DataSourceInfo.unique_stats(*)` (expected during a later traversal) will collect row-group information for all known `GroupBy`/`Distinct` keys.
- [2] We leverage `DataSourceInfo.unique_stats(*)` statistics during lowering to avoid the need for the `unique_fraction` user configuration ([FEA] Leverage column statistics to replace `unique_fraction` configuration #19391).
  - NOTE: This step can also be implemented after (4). Steps (3)-(5) do not depend on this feature.
  - This will require us to attach the `dict[IR, dict[str, ColumnStats]]` data structure to the caching-visitor `state` (so it is accessible to the lowering logic for each IR node).
- [3] We update the traversal in (1) to also collect join and join-key information in other data structures ([FEA] Collect join-key information while gathering statistics in cudf-polars #19392).
- [4] Implement a second `post_traversal` pass over the un-lowered IR graph to leverage the join heuristics in (3) to adjust the `dict[IR, RowCount]` values and populate the `ColumnStats.unique_stats` attributes ([FEA] Add a second IR-statistics traversal for cardinality estimates #19393).
- [5] Use the features from (1)-(4) to inject repartitioning after operations that lead to a drop in the cardinality estimate. In order to do this well, we probably need to refine the changes made in (0)-(4). If row-count estimates are insufficient, this optimization may be ineffective and/or risky.
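The base-statistics pass in step (1) could look roughly like the self-contained sketch below. `IR`, `Scan`, `ColumnStats`, and `post_traversal` here are simplified stand-ins for the cudf-polars internals, and the propagation for non-source nodes is deliberately naive (real per-node logic, and all cardinality refinement, is deferred to the second pass in step (4)).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ColumnStats:
    name: str
    unique_stats: object | None = None  # left unset by this first pass


@dataclass(frozen=True)
class IR:
    """Stand-in for an un-lowered IR node (frozen, so usable as a dict key)."""

    children: tuple[IR, ...] = ()


@dataclass(frozen=True)
class Scan(IR):
    columns: tuple[str, ...] = ()
    sampled_rows: int = 0  # row-count estimate sampled from the source


def post_traversal(root: IR):
    """Yield IR nodes in post-order (children before parents)."""
    for child in root.children:
        yield from post_traversal(child)
    yield root


def populate_base_stats(root: IR):
    column_stats: dict[IR, dict[str, ColumnStats]] = {}
    row_counts: dict[IR, int] = {}
    for node in post_traversal(root):
        if isinstance(node, Scan):
            # Base statistics come straight from the data source.
            column_stats[node] = {c: ColumnStats(c) for c in node.columns}
            row_counts[node] = node.sampled_rows
        else:
            # Naively propagate child statistics upward unchanged.
            merged: dict[str, ColumnStats] = {}
            for child in node.children:
                merged.update(column_stats.get(child, {}))
            column_stats[node] = merged
            row_counts[node] = max(
                (row_counts.get(c, 0) for c in node.children), default=0
            )
    return column_stats, row_counts
```

Because the traversal is post-order, every node sees fully populated child statistics, which is what makes the single pass sufficient for propagating source-based estimates.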