Skip to content

Add API to "initialize" column statistics#19447

Merged
rapids-bot[bot] merged 26 commits intorapidsai:branch-25.10from
rjzamora:base-stats-traversal
Aug 18, 2025
Merged

Add API to "initialize" column statistics#19447
rapids-bot[bot] merged 26 commits intorapidsai:branch-25.10from
rjzamora:base-stats-traversal

Conversation

@rjzamora
Copy link
Copy Markdown
Member

@rjzamora rjzamora commented Jul 21, 2025

Description

Closes #19390

  • Adds simple StatsCollector API
  • Adds collect_base_stats API (tested in this PR, but not used anywhere internally yet)
  • Adds initialize_column_stats dispatch functions and registers IR-specific logic for various IR sub-classes (this dispatch function is used by collect_base_stats).

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Jul 21, 2025
@rjzamora rjzamora requested a review from a team as a code owner July 21, 2025 20:25
@rjzamora rjzamora added feature request New feature or request 2 - In Progress Currently a work in progress non-breaking Non-breaking change labels Jul 21, 2025
@rjzamora rjzamora requested review from Matt711 and vyasr July 21, 2025 20:26
@rjzamora rjzamora added cudf-polars Issues specific to cudf-polars cudf.polars labels Jul 21, 2025
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jul 21, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jul 21, 2025
@vyasr
Copy link
Copy Markdown
Contributor

vyasr commented Jul 21, 2025

Retargeted to 25.10 since we're in burndown

@vyasr vyasr changed the base branch from branch-25.08 to branch-25.10 July 21, 2025 20:41
Copy link
Copy Markdown
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, gave a quick pass.

One thing that would be helpful to me at least is a high-level overview of how the components (base stats, ColumnStats, extract_base_stats, etc.) piece together in overview.md

Copy link
Copy Markdown
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started to look through the extract_base_stats implementations. I think there starting to make sense, but I need to look at Join more closely.

Comment on lines +119 to +121
column_stats[name] = primary_child_stats.get(
name, ColumnStats(name=name)
).copy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for stats to only be available on one child (one of primary, other)? And if so, do we want to attempt to fall back to other when name isn't in primary_child_stats?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And how much of thie primary, other thing is going to survive long term? Will we eventually be somehow combining these two statistics to propagate selectivity / cardinality estimates?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for stats to only be available on one child (one of primary, other)?

My intention is to store a ColumnStats object for all columns in the schema for each IR node - Even if ColumnStat.value is None.

And how much of thie primary, other thing is going to survive long term? Will we eventually be somehow combining these two statistics to propagate selectivity / cardinality estimates?

The terms "primary" and "other" are only meant to establish the origin of each column in this scenario. I think it's a bit different from the primary-vs-foreign key concept, but you are correct that we may be propagating the "wrong" ColumnStats for the key columns.

My hope is that we can update StatsCollector to look something like this in a follow-up (#19392):

class JoinKey:
    """Basic Join-key information."""
    def __init__(self, ir: IR, names: list[str]) -> None:
        self.ir = ir
        self.names = names

class StatsCollector:
    """Column statistics collector."""
    def __init__(self) -> None:
        self.row_count: dict[IR, ColumnStat[int]] = {}
        self.column_stats: dict[IR, dict[str, ColumnStats]] = {}
        self.join_keys: defaultdict[JoinKey, set[JoinKey]] = defaultdict(set[JoinKey])
        self.joins: dict[IR, tuple[JoinKey, JoinKey]] = {}

This way, we can:

  1. Use the same collect_base_stats traversal to also populate join_keys/joins for all Join nodes.
  2. Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).
  3. Use equivalence-set information to set "correct" unique-value statistics for each node (including Joins) in a subsequent traversal.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).

When you say "same total unique count" do you mean close enough to inform planning decisions?

@rjzamora rjzamora changed the title [WIP] Add API to populate "base" column statistics [WIP][DNM] Add API to populate "base" column statistics Jul 24, 2025
@rjzamora rjzamora marked this pull request as draft July 24, 2025 14:09
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented Jul 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rjzamora rjzamora marked this pull request as ready for review July 24, 2025 16:19
@rjzamora rjzamora changed the title [WIP][DNM] Add API to populate "base" column statistics [WIP] Add API to "initialize" column statistics Jul 24, 2025
@rjzamora rjzamora changed the title [WIP] Add API to "initialize" column statistics Add API to "initialize" column statistics Jul 28, 2025
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 28, 2025
Copy link
Copy Markdown
Contributor

@Matt711 Matt711 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Rick, mainly questions nothing blocking.

Comment on lines +119 to +121
column_stats[name] = primary_child_stats.get(
name, ColumnStats(name=name)
).copy()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calculate "equivalence sets" (i.e. sets of columns that have the same total unique-count for a PK-FK join).

When you say "same total unique count" do you mean close enough to inform planning decisions?

Comment on lines +420 to +422
- `ColumnStats`: This class is used to group together the "base"
`DataSourceInfo` reference and the current `UniqueStats` estimates
for a specific IR + column combination.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we have them bundled together? I think it gives you a way to tell how the statistics changed over different operations like filters and joins.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just so we don't need to complicate StatsCollector with a distinct dict[IR, foo] mapping for everything we want to track. We just track all per-column statistics in a single dict[IR, dict[str, ColumnStats]] mapping that can be modified in the future (if needed). I added a brief note that the bundling is to "simplify the design and maintenance of StatsCollector".

Comment on lines +409 to +410
- `UniqueStats`: Since we usually sample both the unique-value
**count** and the unique-value **fraction** of a column at once,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: Are these the only stats that will be included in UniqueStats?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably yes. However, other sub-statisics could be added if needed (do you have something in mind?).

the partition count when a Parquet-based `Scan` node is lowered
by the `cudf-polars` streaming executor.

These statistics will also be used for other purposes in the future.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably can add some examples of how they will be used. Also okay if you don't since I imagine these docs are going to change frequently.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to:

In the future, these statistics will also be used for
parallel-algorithm selection and intermediate repartitioning.

)


class StatsCollector:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we discussed tracking statistics for expression nodes? Curious if you see any value there

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, we only care about tracking statistics for pre-lowered IR nodes. In the near term, we are mostly interested in accounting for Join and GroupBy. However, it is true that we may need to traverse an Expr graph within a Select node to estimate changes in cardinality and unique count. We shouldn't need to keep track of anything in StatsCollector for this, but the exact design is TBD.

@rjzamora
Copy link
Copy Markdown
Member Author

@TomAugspurger @Matt711 @wence- - Thanks for your help with this. I believe this is ready for a final review.

Copy link
Copy Markdown
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all my earlier questions have been addressed. I think it'd be useful to keep this moving along so that we can see how things look when everything is wired together.

@rjzamora
Copy link
Copy Markdown
Member Author

I think it'd be useful to keep this moving along so that we can see how things look when everything is wired together.

Yeah, it's definitely worth stating that nothing here is set in stone. We will most-likely revise some of these decisions during/after the implementation of the remaining "story".

It's not really possible for anyone to perform a comprehensive review until more of the pieces are in place.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 18, 2025
@rjzamora
Copy link
Copy Markdown
Member Author

/merge

@rapids-bot rapids-bot bot merged commit fd7e082 into rapidsai:branch-25.10 Aug 18, 2025
116 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Aug 18, 2025
@rjzamora rjzamora deleted the base-stats-traversal branch August 18, 2025 17:04
galipremsagar pushed a commit to galipremsagar/cudf that referenced this pull request Aug 18, 2025
Closes rapidsai#19390

- Adds simple `StatsCollector` API
- Adds `collect_base_stats` API (tested in this PR, but not *used* anywhere internally yet)
- Adds `initialize_column_stats` dispatch functions and registers IR-specific logic for various IR sub-classes (this dispatch function is used by `collect_base_stats`).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Tom Augspurger (https://github.com/TomAugspurger)

URL: rapidsai#19447
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

5 - Ready to Merge Testing and reviews complete, ready to merge cudf-polars Issues specific to cudf-polars feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[FEA] Use post_traversal to populate "base" column statistics

5 participants