Rename "cardinality_factor" configuration to "unique_fraction" #19273

Merged
rapids-bot[bot] merged 4 commits into rapidsai:branch-25.08 from rjzamora:rename-cardinality-factor on Jul 3, 2025

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Jul 2, 2025

Description

This PR splits off some of the changes used by the ongoing column-statistics work (e.g. #19130).

  • Renames "cardinality_factor" to "unique_fraction", because the original name does not describe what the option holds (an estimate of the fraction of unique values in each column).
    • I've been trying to rename this config for a while, and would really like to get it done in 25.08.
    • I don't think we need backwards compatibility for "cardinality_factor", but this PR adds it anyway, to be safe.
  • Adds a central _get_unique_fractions utility that extracts the unique-value statistics for a specific subset of columns. This logic is currently repeated in several places, and incorporating sampled statistics (in a follow-up) will be much easier once it all lives in one place.
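The two changes above can be illustrated with a minimal, standalone sketch. It is not the code in this PR: `normalize_executor_options` and `select_unique_fractions` are hypothetical names, the options are modeled as a plain dict, and the choice of `FutureWarning` is an assumption. The sketch only shows the general pattern of accepting a deprecated key under its new name and of keeping the per-column lookup in one helper.

```python
import warnings
from typing import Any


def normalize_executor_options(options: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of ``options`` with the deprecated key renamed.

    Hypothetical helper: accepts "cardinality_factor" for backwards
    compatibility and stores its value under "unique_fraction".
    """
    options = dict(options)  # do not mutate the caller's dict
    if "cardinality_factor" in options:
        if "unique_fraction" in options:
            raise ValueError(
                "Specify 'unique_fraction' only; 'cardinality_factor' is deprecated."
            )
        warnings.warn(
            "'cardinality_factor' is deprecated; use 'unique_fraction' instead.",
            FutureWarning,
            stacklevel=2,
        )
        options["unique_fraction"] = options.pop("cardinality_factor")
    return options


def select_unique_fractions(
    column_names: list[str],
    user_unique_fractions: dict[str, float],
) -> dict[str, float]:
    """Return the unique-value fractions for the requested columns only.

    Hypothetical stand-in for a central lookup helper: columns without a
    user-provided fraction are simply omitted from the result.
    """
    return {
        name: user_unique_fractions[name]
        for name in column_names
        if name in user_unique_fractions
    }


if __name__ == "__main__":
    opts = normalize_executor_options({"cardinality_factor": {"a": 0.1, "b": 0.5}})
    print(select_unique_fractions(["a", "c"], opts["unique_fraction"]))
```

Keeping the lookup in a single function means a later source of statistics (for example, sampled values) would only need to be wired in at one call site.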

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora self-assigned this Jul 2, 2025
@rjzamora rjzamora added the 2 - In Progress Currently a work in progress label Jul 2, 2025
@rjzamora rjzamora requested a review from a team as a code owner July 2, 2025 15:24
@rjzamora rjzamora added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jul 2, 2025
@rjzamora rjzamora requested review from Matt711 and mroeschke July 2, 2025 15:24
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jul 2, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jul 2, 2025
Co-authored-by: Tom Augspurger <tom.augspurger88@gmail.com>
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Jul 3, 2025
@rjzamora
Member Author

rjzamora commented Jul 3, 2025

/merge

@rapids-bot rapids-bot bot merged commit 369d060 into rapidsai:branch-25.08 Jul 3, 2025
93 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF Python Jul 3, 2025
@rjzamora rjzamora deleted the rename-cardinality-factor branch July 3, 2025 13:04
@rapids-bot
Contributor

rapids-bot bot commented Jul 3, 2025

Failed to merge PR using squash strategy.

rapids-bot bot pushed a commit that referenced this pull request Jul 16, 2025
Probably supersedes #19130

The goal of this PR is to define the classes needed to store column statistics for an `IR` node. Some criteria:

- We need the statistics for a column to contain a reference to the underlying datasource information (e.g. unique-value statistics, row-count, and average storage/file size). 
- We want caching for each datasource and column.
- We want the option to perform metadata/data sampling lazily on the datasource.
- We want our Parquet partitioning logic to use the same infrastructure (to avoid redundant sampling).
- We want to record when a specific statistic is "exact" (rather than estimated).
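
A rough sketch of how these criteria could fit together is shown below. It is illustrative only and does not reproduce the classes from the referenced PR: `Statistic`, `DataSourceInfo`, `ColumnStats`, and the `sample_row_count` callable are all hypothetical names. The sketch shows a statistic carrying an "exact" flag, a datasource object that samples lazily and caches the result, and column-level objects that share one datasource reference.

```python
from dataclasses import dataclass
from functools import cached_property
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Statistic(Generic[T]):
    """A single statistic plus a flag saying whether it is exact (hypothetical)."""

    value: Optional[T] = None
    exact: bool = False


class DataSourceInfo:
    """Hypothetical datasource wrapper that samples lazily and caches the result."""

    def __init__(self, sample_row_count: Callable[[], int]) -> None:
        # The sampler is stored here and is not called until first access.
        self._sample_row_count = sample_row_count

    @cached_property
    def row_count(self) -> Statistic[int]:
        # Runs the (possibly expensive) sampler once; later reads hit the cache.
        return Statistic(self._sample_row_count(), exact=False)


@dataclass
class ColumnStats:
    """Hypothetical per-column record holding a reference to its datasource."""

    name: str
    source: DataSourceInfo


if __name__ == "__main__":
    source = DataSourceInfo(lambda: 1_000)
    a, b = ColumnStats("a", source), ColumnStats("b", source)
    # Both columns read the same cached estimate from the shared datasource.
    print(a.source.row_count, b.source.row_count)
```

Because every column points at the same datasource object, any other consumer of that object would reuse the cached sample rather than triggering a second one.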

Also related:
- #19258
- #19273

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)
  - Matthew Murray (https://github.com/Matt711)

URL: #19276

Labels

  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • cudf-polars: Issues specific to cudf-polars
  • improvement: Improvement / enhancement to an existing function
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.

Projects

Archived in project


3 participants