Add API to "initialize" column statistics #19447
rapids-bot[bot] merged 26 commits into rapidsai:branch-25.10
Conversation
Retargeted to 25.10 since we're in burndown
TomAugspurger
left a comment
Thanks, gave a quick pass.
One thing that would be helpful to me at least is a high-level overview of how the components (base stats, `ColumnStats`, `extract_base_stats`, etc.) piece together in `overview.md`.
TomAugspurger
left a comment
Started to look through the `extract_base_stats` implementations. I think they're starting to make sense, but I need to look at `Join` more closely.
```python
column_stats[name] = primary_child_stats.get(
    name, ColumnStats(name=name)
).copy()
```
Is it possible for stats to only be available on one child (one of `primary`, `other`)? And if so, do we want to attempt to fall back to `other` when `name` isn't in `primary_child_stats`?
And how much of this `primary`, `other` thing is going to survive long term? Will we eventually be somehow combining these two statistics to propagate selectivity / cardinality estimates?
> Is it possible for stats to only be available on one child (one of `primary`, `other`)?
My intention is to store a `ColumnStats` object for all columns in the schema for each IR node, even if `ColumnStat.value` is `None`.
> And how much of this `primary`, `other` thing is going to survive long term? Will we eventually be somehow combining these two statistics to propagate selectivity / cardinality estimates?
The terms "primary" and "other" are only meant to establish the origin of each column in this scenario. I think it's a bit different from the primary-vs-foreign-key concept, but you are correct that we may be propagating the "wrong" `ColumnStats` for the key columns.
My hope is that we can update `StatsCollector` to look something like this in a follow-up (#19392):
```python
class JoinKey:
    """Basic Join-key information."""

    def __init__(self, ir: IR, names: list[str]) -> None:
        self.ir = ir
        self.names = names


class StatsCollector:
    """Column statistics collector."""

    def __init__(self) -> None:
        self.row_count: dict[IR, ColumnStat[int]] = {}
        self.column_stats: dict[IR, dict[str, ColumnStats]] = {}
        self.join_keys: defaultdict[JoinKey, set[JoinKey]] = defaultdict(set[JoinKey])
        self.joins: dict[IR, tuple[JoinKey, JoinKey]] = {}
```

This way, we can:
- Use the same `collect_base_stats` traversal to also populate `join_keys`/`joins` for all `Join` nodes.
- Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).
- Use equivalence-set information to set "correct" unique-value statistics for each node (including `Join`s) in a subsequent traversal.
> Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).
When you say "same total unique count" do you mean close enough to inform planning decisions?
Matt711
left a comment
Thanks Rick, mainly questions nothing blocking.
python/cudf_polars/docs/overview.md (Outdated)
> - `ColumnStats`: This class is used to group together the "base"
>   `DataSourceInfo` reference and the current `UniqueStats` estimates
>   for a specific IR + column combination.
Can you explain why we have them bundled together? I think it gives you a way to tell how the statistics changed over different operations like filters and joins.
This is just so we don't need to complicate `StatsCollector` with a distinct `dict[IR, foo]` mapping for everything we want to track. We just track all per-column statistics in a single `dict[IR, dict[str, ColumnStats]]` mapping that can be modified in the future (if needed). I added a brief note that the bundling is to "simplify the design and maintenance of `StatsCollector`".
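As a minimal sketch of that single-mapping design: all per-column statistics live in one nested dict keyed by IR node and then column name, rather than in a separate `dict[IR, ...]` per statistic. The `IR` and `ColumnStats` classes below are simplified stand-ins with assumed fields.

```python
class IR:
    """Stand-in for an IR node (used only as a dict key here)."""


class ColumnStats:
    """Simplified stand-in bundling per-column statistics (fields assumed)."""

    def __init__(self, name):
        self.name = name
        self.unique_stats = None  # assumed field: current UniqueStats estimate
        self.source_info = None   # assumed field: "base" data-source reference


node = IR()
column_stats: dict[IR, dict[str, ColumnStats]] = {
    node: {"a": ColumnStats("a"), "b": ColumnStats("b")}
}

# One lookup path serves every per-column statistic:
stats_a = column_stats[node]["a"]
```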
> - `UniqueStats`: Since we usually sample both the unique-value
>   **count** and the unique-value **fraction** of a column at once,
Just curious: Are these the only stats that will be included in `UniqueStats`?
Probably yes. However, other sub-statistics could be added if needed (do you have something in mind?).
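The count/fraction pairing is natural because a single pass over a sample yields both. The sketch below is illustrative only; the field names on `UniqueStats` are assumptions, not the actual cudf-polars definition.

```python
from dataclasses import dataclass


@dataclass
class UniqueStats:
    """Assumed shape: estimated unique count plus unique fraction."""

    count: int       # estimated number of unique values
    fraction: float  # unique values / sampled rows


def sample_unique_stats(sample: list) -> UniqueStats:
    # One pass over the sample gives both sub-statistics at once.
    n_unique = len(set(sample))
    return UniqueStats(count=n_unique, fraction=n_unique / len(sample))


stats = sample_unique_stats([1, 1, 2, 3, 3, 3])
```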
python/cudf_polars/docs/overview.md (Outdated)

> the partition count when a Parquet-based `Scan` node is lowered
> by the `cudf-polars` streaming executor.
>
> These statistics will also be used for other purposes in the future.
Probably can add some examples of how they will be used. Also okay if you don't since I imagine these docs are going to change frequently.
Changed to:

> In the future, these statistics will also be used for
> parallel-algorithm selection and intermediate repartitioning.
```python
)


class StatsCollector:
```
Have we discussed tracking statistics for expression nodes? Curious if you see any value there
For now, we only care about tracking statistics for pre-lowered IR nodes. In the near term, we are mostly interested in accounting for `Join` and `GroupBy`. However, it is true that we may need to traverse an `Expr` graph within a `Select` node to estimate changes in cardinality and unique count. We shouldn't need to keep track of anything in `StatsCollector` for this, but the exact design is TBD.
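The per-IR-node dispatch pattern described in this PR (registering IR-specific logic for `initialize_column_stats`) can be sketched with `functools.singledispatch`. This is a hypothetical illustration with toy IR classes and string placeholder stats; the real registration mechanism and return types in cudf-polars may differ.

```python
from functools import singledispatch


class IR:
    """Toy IR base class with an assumed schema attribute."""

    schema = {"a": int, "b": int}


class Scan(IR):
    pass


class Join(IR):
    pass


@singledispatch
def initialize_column_stats(ir: IR) -> dict[str, str]:
    # Default: an empty placeholder stat for every column in the schema.
    return {name: "unknown" for name in ir.schema}


@initialize_column_stats.register
def _(ir: Scan) -> dict[str, str]:
    # Scan nodes could seed stats from the data source (e.g. Parquet metadata).
    return {name: "from-source" for name in ir.schema}


scan_stats = initialize_column_stats(Scan())
join_stats = initialize_column_stats(Join())  # falls back to the default
```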
@TomAugspurger @Matt711 @wence- - Thanks for your help with this. I believe this is ready for a final review.
TomAugspurger
left a comment
I think all my earlier questions have been addressed. I think it'd be useful to keep this moving along so that we can see how things look when everything is wired together.
Yeah, it's definitely worth stating that nothing here is set in stone. We will most likely revise some of these decisions during/after the implementation of the remaining "story". It's not really possible for anyone to perform a comprehensive review until more of the pieces are in place.
/merge
Closes rapidsai#19390

- Adds simple `StatsCollector` API
- Adds `collect_base_stats` API (tested in this PR, but not *used* anywhere internally yet)
- Adds `initialize_column_stats` dispatch functions and registers IR-specific logic for various IR sub-classes (this dispatch function is used by `collect_base_stats`).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Matthew Murray (https://github.com/Matt711)
- Tom Augspurger (https://github.com/TomAugspurger)

URL: rapidsai#19447