Add API to "initialize" column statistics by rjzamora · Pull Request #19447 · rapidsai/cudf

rjzamora · 2025-07-21T20:25:58Z

Description

Closes #19390

- Adds simple `StatsCollector` API
- Adds `collect_base_stats` API (tested in this PR, but not *used* anywhere internally yet)
- Adds `initialize_column_stats` dispatch functions and registers IR-specific logic for various IR sub-classes (this dispatch function is used by `collect_base_stats`).
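To make the dispatch idea concrete, here is a minimal sketch of type-based dispatch over IR sub-classes. It assumes `functools.singledispatch` as the mechanism; the `IR`, `Scan`, and `ColumnStats` classes below are simplified, hypothetical stand-ins and not the classes added in this PR.

```python
from dataclasses import dataclass, field
from functools import singledispatch
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in: statistics for one column of one IR node."""

    name: str
    unique_count: Optional[int] = None


@dataclass
class IR:
    """Hypothetical stand-in for a logical-plan node."""

    schema: list
    children: list = field(default_factory=list)


@dataclass
class Scan(IR):
    """Leaf node; `known_unique` mimics metadata read from a data source."""

    known_unique: dict = field(default_factory=dict)


@singledispatch
def initialize_column_stats(ir):
    # Default rule: inherit from the first child when there is one,
    # otherwise start every column with empty statistics.
    child = initialize_column_stats(ir.children[0]) if ir.children else {}
    return {name: child.get(name, ColumnStats(name=name)) for name in ir.schema}


@initialize_column_stats.register(Scan)
def _(ir):
    # Scan-specific rule: seed statistics from the (assumed) source metadata.
    return {name: ColumnStats(name, ir.known_unique.get(name)) for name in ir.schema}
```

Calling `initialize_column_stats` on a generic node falls through to the default rule, while a `Scan` node hits its registered override.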

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vyasr · 2025-07-21T20:40:55Z

Retargeted to 25.10 since we're in burndown

TomAugspurger

Thanks, gave a quick pass.

One thing that would be helpful to me at least is a high-level overview of how the components (base stats, ColumnStats, extract_base_stats, etc.) piece together in overview.md

python/cudf_polars/cudf_polars/experimental/statistics.py

python/cudf_polars/tests/experimental/test_stats.py

TomAugspurger

Started to look through the extract_base_stats implementations. I think they're starting to make sense, but I need to look at Join more closely.

python/cudf_polars/tests/experimental/test_stats.py

python/cudf_polars/cudf_polars/experimental/base.py

python/cudf_polars/cudf_polars/experimental/statistics.py

TomAugspurger · 2025-07-23T19:21:30Z

python/cudf_polars/cudf_polars/experimental/statistics.py

+            column_stats[name] = primary_child_stats.get(
+                name, ColumnStats(name=name)
+            ).copy()


Is it possible for stats to only be available on one child (one of primary, other)? And if so, do we want to attempt to fall back to other when name isn't in primary_child_stats?

And how much of this primary, other thing is going to survive long term? Will we eventually be somehow combining these two statistics to propagate selectivity / cardinality estimates?

Is it possible for stats to only be available on one child (one of primary, other)?

My intention is to store a ColumnStats object for all columns in the schema for each IR node, even if ColumnStat.value is None.
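A small sketch of that invariant, under stated assumptions: the `ColumnStats` class and the `join_column_stats` helper below are hypothetical, and the fall-back to `other` only illustrates the question raised above; it is not claimed to be the behavior merged in this PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for the per-column statistics container."""

    name: str
    unique_count: Optional[int] = None

    def copy(self):
        # Shallow copy so a parent node never mutates its child's stats.
        return replace(self)


def join_column_stats(schema, primary, other=None):
    """Return a ColumnStats for every column in `schema`.

    Lookup order: `primary`, then `other`, then an empty placeholder.
    """
    other = other or {}
    out = {}
    for name in schema:
        if name in primary:
            source = primary[name]
        elif name in other:
            source = other[name]
        else:
            source = ColumnStats(name=name)  # no statistics known yet
        out[name] = source.copy()
    return out
```

Every column in the output schema ends up with an entry, so downstream code can index by name without checking for missing keys.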

And how much of this primary, other thing is going to survive long term? Will we eventually be somehow combining these two statistics to propagate selectivity / cardinality estimates?

The terms "primary" and "other" are only meant to establish the origin of each column in this scenario. I think it's a bit different from the primary-vs-foreign key concept, but you are correct that we may be propagating the "wrong" ColumnStats for the key columns.

My hope is that we can update StatsCollector to look something like this in a follow-up (#19392):

    class JoinKey:
        """Basic Join-key information."""

        def __init__(self, ir: IR, names: list[str]) -> None:
            self.ir = ir
            self.names = names


    class StatsCollector:
        """Column statistics collector."""

        def __init__(self) -> None:
            self.row_count: dict[IR, ColumnStat[int]] = {}
            self.column_stats: dict[IR, dict[str, ColumnStats]] = {}
            self.join_keys: defaultdict[JoinKey, set[JoinKey]] = defaultdict(set[JoinKey])
            self.joins: dict[IR, tuple[JoinKey, JoinKey]] = {}

This way, we can:

Use the same collect_base_stats traversal to also populate join_keys/joins for all Join nodes.

Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).

Use equivalence-set information to set "correct" unique-value statistics for each node (including Joins) in a subsequent traversal.
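One way the "equivalence sets" step could be sketched is as connected components over the join-key graph. This is only an illustration: `JoinKey` here is a hashable stand-in that uses a string in place of an IR node, and `add_join` and `equivalence_sets` are hypothetical helpers, not part of the proposed follow-up.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinKey:
    """Hashable stand-in: a node label plus the key-column names."""

    ir: str
    names: tuple


def add_join(join_keys, left, right):
    # Record the join in both directions so the graph is symmetric.
    join_keys[left].add(right)
    join_keys[right].add(left)


def equivalence_sets(join_keys):
    """Group join keys that are linked, directly or transitively, by a join."""
    seen = set()
    groups = []
    for start in join_keys:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk of one connected component
            key = stack.pop()
            if key in group:
                continue
            group.add(key)
            stack.extend(join_keys.get(key, ()))
        seen |= group
        groups.append(group)
    return groups
```

With joins `a-b`, `b-c`, and `d-e`, this yields two groups, `{a, b, c}` and `{d, e}`; a later traversal could then assign one shared unique-count estimate per group.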

Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).

When you say "same total unique count" do you mean close enough to inform planning decisions?

python/cudf_polars/cudf_polars/experimental/statistics.py

python/cudf_polars/docs/overview.md

copy-pr-bot · 2025-07-24T14:09:44Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

python/cudf_polars/cudf_polars/experimental/base.py

Matt711

Thanks Rick, mainly questions, nothing blocking.

python/cudf_polars/docs/overview.md

Matt711 · 2025-07-24T14:14:17Z

python/cudf_polars/cudf_polars/experimental/statistics.py

+            column_stats[name] = primary_child_stats.get(
+                name, ColumnStats(name=name)
+            ).copy()


Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).

When you say "same total unique count" do you mean close enough to inform planning decisions?

python/cudf_polars/cudf_polars/experimental/base.py

Matt711 · 2025-07-29T14:12:02Z

python/cudf_polars/docs/overview.md

+- `ColumnStats`: This class is used to group together the "base"
+`DataSourceInfo` reference and the current `UniqueStats` estimates
+for a specific IR + column combination.


Can you explain why we have them bundled together? I think it gives you a way to tell how the statistics changed over different operations like filters and joins.

This is just so we don't need to complicate StatsCollector with a distinct dict[IR, foo] mapping for everything we want to track. We just track all per-column statistics in a single dict[IR, dict[str, ColumnStats]] mapping that can be modified in the future (if needed). I added a brief note that the bundling is to "simplify the design and maintenance of StatsCollector".
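A rough sketch of the bundling described here, with simplified stand-in classes (field names are assumptions, and plain strings stand in for hashable IR nodes):

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class UniqueStats:
    """Current unique-value estimates for one column."""

    count: Optional[int] = None
    fraction: Optional[float] = None


@dataclass
class ColumnStats:
    """Bundle the "base" source reference with the current estimates."""

    name: str
    source_info: Any = None  # stand-in for a DataSourceInfo reference
    unique_stats: UniqueStats = field(default_factory=UniqueStats)


class StatsCollector:
    """One nested mapping instead of one dict[IR, ...] per tracked quantity."""

    def __init__(self) -> None:
        # Outer key: IR node (any hashable here); inner key: column name.
        self.column_stats: dict = {}
```

Two nodes can then share the same `source_info` object while carrying different `unique_stats`, and adding a new per-column quantity later only means adding a field to `ColumnStats`.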

Matt711 · 2025-07-29T14:17:11Z

python/cudf_polars/docs/overview.md

+- `UniqueStats`: Since we usually sample both the unique-value
+**count** and the unique-value **fraction** of a column at once,


Just curious: Are these the only stats that will be included in UniqueStats?

Probably yes. However, other sub-statistics could be added if needed (do you have something in mind?).
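For illustration only, this is how a count and a fraction might be derived together from a single sample. The `sample_unique_stats` helper is hypothetical, and the linear extrapolation is a deliberately naive assumption (it can badly overestimate low-cardinality columns); it is not claimed to be how cudf-polars samples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UniqueStats:
    """Hypothetical stand-in holding both unique-value estimates."""

    count: Optional[int] = None
    fraction: Optional[float] = None


def sample_unique_stats(sample, total_rows):
    """Estimate unique-value count and fraction from one sample of values."""
    if not sample:
        return UniqueStats()  # nothing sampled, nothing known
    # Fraction of distinct values observed in the sample.
    fraction = len(set(sample)) / len(sample)
    # Naive assumption: the same fraction holds for the whole column.
    return UniqueStats(count=round(fraction * total_rows), fraction=fraction)
```

Both numbers fall out of the same pass over the sample, which is the motivation for keeping them in one object.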

Matt711 · 2025-07-29T14:23:21Z

python/cudf_polars/docs/overview.md

+the partition count when a Parquet-based `Scan` node is lowered
+by the `cudf-polars` streaming executor.
+
+These statistics will also be used for other purposes in the future.


Probably can add some examples of how they will be used. Also okay if you don't since I imagine these docs are going to change frequently.

Changed to:

In the future, these statistics will also be used for
parallel-algorithm selection and intermediate repartitioning.

Matt711 · 2025-07-29T14:36:23Z

python/cudf_polars/cudf_polars/experimental/base.py

+        )
+
+
+class StatsCollector:


Have we discussed tracking statistics for expression nodes? Curious if you see any value there

For now, we only care about tracking statistics for pre-lowered IR nodes. In the near term, we are mostly interested in accounting for Join and GroupBy. However, it is true that we may need to traverse an Expr graph within a Select node to estimate changes in cardinality and unique count. We shouldn't need to keep track of anything in StatsCollector for this, but the exact design is TBD.

rjzamora · 2025-08-18T12:43:59Z

@TomAugspurger @Matt711 @wence- - Thanks for your help with this. I believe this is ready for a final review.

TomAugspurger

I think all my earlier questions have been addressed. I think it'd be useful to keep this moving along so that we can see how things look when everything is wired together.

rjzamora · 2025-08-18T13:32:31Z

I think it'd be useful to keep this moving along so that we can see how things look when everything is wired together.

Yeah, it's definitely worth stating that nothing here is set in stone. We will most likely revise some of these decisions during/after the implementation of the remaining "story".

It's not really possible for anyone to perform a comprehensive review until more of the pieces are in place.

rjzamora · 2025-08-18T17:03:38Z

/merge

Closes rapidsai#19390

- Adds simple `StatsCollector` API
- Adds `collect_base_stats` API (tested in this PR, but not *used* anywhere internally yet)
- Adds `initialize_column_stats` dispatch functions and registers IR-specific logic for various IR sub-classes (this dispatch function is used by `collect_base_stats`).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Matthew Murray (https://github.com/Matt711)
- Tom Augspurger (https://github.com/TomAugspurger)

URL: rapidsai#19447

expand test coverage

c7feddf

rjzamora self-assigned this Jul 21, 2025

rjzamora requested a review from a team as a code owner July 21, 2025 20:25

rjzamora added the feature request (New feature or request), 2 - In Progress (Currently a work in progress), and non-breaking (Non-breaking change) labels Jul 21, 2025

rjzamora requested review from Matt711 and vyasr July 21, 2025 20:26

rjzamora added cudf-polars Issues specific to cudf-polars cudf.polars labels Jul 21, 2025

github-project-automation bot added this to cuDF Python Jul 21, 2025

github-actions bot added the Python Affects Python cuDF API. label Jul 21, 2025

GPUtester moved this to In Progress in cuDF Python Jul 21, 2025

Merge branch 'branch-25.08' into base-stats-traversal

477d57f

vyasr changed the base branch from branch-25.08 to branch-25.10 July 21, 2025 20:41

rjzamora added 4 commits July 21, 2025 16:00

Merge branch 'branch-25.10' into base-stats-traversal

1d404c6

consolidate stats tests

9f51936

add distinct coverage

cb3ff97

don't need coverage for multi-child fall-back for now

0e80add

TomAugspurger reviewed Jul 22, 2025

rjzamora added 5 commits July 22, 2025 12:44

Merge remote-tracking branch 'upstream/branch-25.10' into base-stats-…

1cd6bc7

…traversal

drop unnecessary _extract_base_stats_preserve

7d6ea69

Fix Join behavior

8bea521

Merge remote-tracking branch 'upstream/branch-25.10' into base-stats-…

74ad4f1

…traversal

drop unnecessary dispatch functions

4627854

TomAugspurger reviewed Jul 23, 2025

python/cudf_polars/tests/experimental/test_stats.py (resolved)

python/cudf_polars/cudf_polars/experimental/base.py (resolved)

python/cudf_polars/cudf_polars/experimental/statistics.py (outdated, resolved)

update docs and change 'rename' to 'copy'

19eff91

TomAugspurger reviewed Jul 23, 2025

update names

7ee10ef

Merge remote-tracking branch 'upstream/branch-25.10' into base-stats-…

ff39a1c

…traversal

rjzamora changed the title ~~[WIP] Add API to populate "base" column statistics~~ [WIP][DNM] Add API to populate "base" column statistics Jul 24, 2025

rjzamora marked this pull request as draft July 24, 2025 14:09

rjzamora added 2 commits July 24, 2025 09:18

track child-ColumnStats in ColumnStats

e8fa8d6

Merge remote-tracking branch 'upstream/branch-25.10' into base-stats-…

08402fb

…traversal

rjzamora marked this pull request as ready for review July 24, 2025 16:19

rjzamora changed the title ~~[WIP][DNM] Add API to populate "base" column statistics~~ [WIP] Add API to "initialize" column statistics Jul 24, 2025

rjzamora commented Jul 24, 2025

python/cudf_polars/cudf_polars/experimental/base.py (resolved)

rjzamora changed the title ~~[WIP] Add API to "initialize" column statistics~~ Add API to "initialize" column statistics Jul 28, 2025

rjzamora added the 3 - Ready for Review (Ready for review by team) label and removed the 2 - In Progress (Currently a work in progress) label Jul 28, 2025

rjzamora added 3 commits July 28, 2025 08:31

Merge branch 'branch-25.10' into base-stats-traversal

b17b1c3

Merge branch 'branch-25.10' into base-stats-traversal

fc959ec

Merge branch 'branch-25.10' into base-stats-traversal

1c83d45

Matt711 approved these changes Jul 29, 2025

rjzamora added 4 commits August 14, 2025 10:07

Merge branch 'branch-25.10' into base-stats-traversal

7b0befa

Merge remote-tracking branch 'upstream/branch-25.10' into base-stats-…

9a3b8a9

…traversal

update overview.md

b35cbaa

Merge branch 'branch-25.10' into base-stats-traversal

6cce566

TomAugspurger approved these changes Aug 18, 2025

rjzamora added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 3 - Ready for Review (Ready for review by team) label Aug 18, 2025

rapids-bot bot merged commit fd7e082 into rapidsai:branch-25.10 Aug 18, 2025
116 checks passed

github-project-automation bot moved this from In Progress to Done in cuDF Python Aug 18, 2025

rjzamora deleted the base-stats-traversal branch August 18, 2025 17:04


5 participants
