Merged
35 commits
a7295f9
add basic stats classes
rjzamora Jul 2, 2025
962b0df
cleanup
rjzamora Jul 2, 2025
ce1f735
add test coverage
rjzamora Jul 2, 2025
7249b02
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 2, 2025
3bb6cc6
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 3, 2025
c307f21
cleanup
rjzamora Jul 3, 2025
115ba93
adjust test coverage
rjzamora Jul 3, 2025
4756216
address partial code review
rjzamora Jul 3, 2025
0d0d04f
more cleanup
rjzamora Jul 3, 2025
79d9d61
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 3, 2025
77e0836
further test coverage
rjzamora Jul 3, 2025
30c061a
adjust coverage further
rjzamora Jul 3, 2025
8376f94
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 7, 2025
08c95ee
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 8, 2025
e82e499
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 9, 2025
e027966
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 9, 2025
06a7274
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 10, 2025
585aa87
small refactor
rjzamora Jul 10, 2025
47bf2dc
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 10, 2025
b3f5449
rename source to source_info for clarity
rjzamora Jul 10, 2025
36de941
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 14, 2025
acc987f
use dataclasses
rjzamora Jul 14, 2025
9158e18
remove comment
rjzamora Jul 14, 2025
8903708
remove more comments
rjzamora Jul 14, 2025
6072ed1
refactor unique stats under UniqueStats class
rjzamora Jul 14, 2025
da23a23
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 14, 2025
d6594fe
add test coverage for csv
rjzamora Jul 14, 2025
4cc694a
use ParquetMetadata cache
rjzamora Jul 14, 2025
bd784d3
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 14, 2025
b356dd0
update docstring
rjzamora Jul 14, 2025
ecb9c8c
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 14, 2025
185322d
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 15, 2025
593d79d
Merge remote-tracking branch 'upstream/branch-25.08' into stats-classes
rjzamora Jul 15, 2025
7fcda5d
address code review
rjzamora Jul 15, 2025
e84e8b1
Merge branch 'branch-25.08' into stats-classes
rjzamora Jul 15, 2025
109 changes: 108 additions & 1 deletion python/cudf_polars/cudf_polars/experimental/base.py
Member Author

Note that I am adding the "base" stats/info classes to the base module because these classes should not have any type dependencies, and should be usable from other modules without circular-dependency worries. We may want to add a dedicated statistics module for the IR-specific logic that populates and propagates these statistics, but I'll leave that decision for a follow-up.

@@ -4,7 +4,8 @@

from __future__ import annotations

-from typing import TYPE_CHECKING, Any
+import dataclasses
+from typing import TYPE_CHECKING, Any, Generic, TypeVar

if TYPE_CHECKING:
    from collections.abc import Generator, Iterator
@@ -44,3 +45,109 @@ def __rich_repr__(self) -> Generator[Any, None, None]:
def get_key_name(node: Node) -> str:
    """Generate the key name for a Node."""
    return f"{type(node).__name__.lower()}-{hash(node)}"


T = TypeVar("T")


@dataclasses.dataclass
class ColumnStat(Generic[T]):
    """
    Generic column-statistic.

    Parameters
    ----------
    value
        Statistic value. Value will be None
        if the statistic is unknown.
    exact
        Whether the statistic is known exactly.
    """

    value: T | None = None
    exact: bool = False


@dataclasses.dataclass
class UniqueStats:
    """
    Unique-value statistics.

    Parameters
    ----------
    count
        Unique-value count.
    fraction
        Unique-value fraction. This corresponds to the total
        number of unique values (count) divided by the total
        number of rows.
    """

    count: ColumnStat[int] = dataclasses.field(default_factory=ColumnStat[int])
    fraction: ColumnStat[float] = dataclasses.field(default_factory=ColumnStat[float])
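
A minimal sketch of how these two dataclasses might be used, assuming the definitions above are copied as-is. The helper `unique_stats_from_counts` is hypothetical and not part of this PR; it only illustrates the documented relationship between `count` and `fraction`.

```python
from __future__ import annotations

import dataclasses
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclasses.dataclass
class ColumnStat(Generic[T]):
    value: T | None = None
    exact: bool = False


@dataclasses.dataclass
class UniqueStats:
    count: ColumnStat[int] = dataclasses.field(default_factory=ColumnStat[int])
    fraction: ColumnStat[float] = dataclasses.field(default_factory=ColumnStat[float])


def unique_stats_from_counts(n_unique: int, n_rows: int, *, exact: bool) -> UniqueStats:
    """Hypothetical helper: fraction = unique count / total row count."""
    if n_rows <= 0:
        # Nothing to divide by, so report both statistics as unknown.
        return UniqueStats()
    return UniqueStats(
        count=ColumnStat[int](n_unique, exact),
        fraction=ColumnStat[float](n_unique / n_rows, exact),
    )


# Default construction means "unknown": value is None and exact is False.
unknown = UniqueStats()

# An estimate taken from a sample is recorded with exact=False.
sampled = unique_stats_from_counts(250, 1000, exact=False)
```

Because the fields use `default_factory`, each `UniqueStats()` gets its own `ColumnStat` instances rather than sharing one mutable default.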


class DataSourceInfo:
Member Author

I believe we will want to record/track RowCountInfo, UniqueInfo, and StorageSizeInfo for the underlying datasources used in our query. In order to manage this "source" information in one place, I'm introducing a DataSourceInfo class. This class is designed with "lazy" metadata/data sampling in mind.

We define Parquet- and DataFrame-specific sub-classes for DataSourceInfo in experimental.io.

Contributor

I think DataSourceInfo can be an abstract base class

Member Author

You will get a DataSourceInfo object for CSV and JSON data, so it can't be an abstract base class unless I add another EmptySourceInfo class for them to use (or duplicate the logic in CsvSourceInfo and JsonSourceInfo).

Contributor

Oh right makes sense
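
The trade-off discussed in this thread can be sketched as follows. This is an illustration only: `AbstractSourceInfo` and `ConcreteSourceInfo` are hypothetical names, and the single method is a simplified stand-in for the real interface.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class AbstractSourceInfo(ABC):
    """Hypothetical ABC variant: sub-classes must implement storage_size."""

    @abstractmethod
    def storage_size(self, column: str) -> int | None: ...


class ConcreteSourceInfo:
    """Hypothetical concrete variant: the default answer is 'unknown'."""

    def storage_size(self, column: str) -> int | None:
        return None


def can_instantiate(cls: type) -> bool:
    """Return True if cls() succeeds without raising TypeError."""
    try:
        cls()
    except TypeError:
        return False
    return True


abstract_ok = can_instantiate(AbstractSourceInfo)
concrete_ok = can_instantiate(ConcreteSourceInfo)
```

With the abstract variant, sources that have no statistics to offer would need their own empty sub-class; the concrete variant can be instantiated directly for them.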

    """
    Datasource information.

    Notes
    -----
    This class should be sub-classed for specific
    datasource types (e.g. Parquet, DataFrame, etc.).
    The required properties/methods enable lazy
    sampling of the underlying datasource.
    """

    @property
    def row_count(self) -> ColumnStat[int]:
        """Data source row-count estimate."""
        return ColumnStat[int]()  # pragma: no cover

    def unique_stats(self, column: str) -> UniqueStats:
        """Return unique-value statistics for a column."""
        return UniqueStats()  # pragma: no cover

    def storage_size(self, column: str) -> ColumnStat[int]:
        """Return the average column size for a single file."""
        return ColumnStat[int]()

    def add_unique_stats_column(self, column: str) -> None:
        """Add a column needing unique-value information."""
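
A rough sketch of what "lazy sampling" could look like in a sub-class. `InMemorySourceInfo` is hypothetical (the PR's real sub-classes live in `experimental.io` and are not shown here); it assumes the source is a plain dict of column name to list of values, and it restates the base classes so the example runs on its own.

```python
from __future__ import annotations

import dataclasses
from typing import Any, Generic, TypeVar

T = TypeVar("T")


@dataclasses.dataclass
class ColumnStat(Generic[T]):
    value: T | None = None
    exact: bool = False


@dataclasses.dataclass
class UniqueStats:
    count: ColumnStat[int] = dataclasses.field(default_factory=ColumnStat[int])
    fraction: ColumnStat[float] = dataclasses.field(default_factory=ColumnStat[float])


class DataSourceInfo:
    """Base class from the diff: every query returns an 'unknown' statistic."""

    @property
    def row_count(self) -> ColumnStat[int]:
        return ColumnStat[int]()

    def unique_stats(self, column: str) -> UniqueStats:
        return UniqueStats()

    def storage_size(self, column: str) -> ColumnStat[int]:
        return ColumnStat[int]()

    def add_unique_stats_column(self, column: str) -> None:
        """Add a column needing unique-value information."""


class InMemorySourceInfo(DataSourceInfo):
    """Hypothetical source backed by a dict of column name -> list of values."""

    def __init__(self, columns: dict[str, list[Any]]) -> None:
        self._columns = columns
        self._requested: set[str] = set()
        self._cache: dict[str, UniqueStats] = {}
        self.scans = 0  # counts how often column data is actually read

    @property
    def row_count(self) -> ColumnStat[int]:
        n_rows = len(next(iter(self._columns.values()), []))
        return ColumnStat[int](n_rows, exact=True)

    def add_unique_stats_column(self, column: str) -> None:
        # Only record interest; no data is read here.
        self._requested.add(column)

    def unique_stats(self, column: str) -> UniqueStats:
        if column not in self._requested or column not in self._columns:
            return UniqueStats()  # unknown, same as the base class
        if column not in self._cache:
            self.scans += 1
            values = self._columns[column]
            n_unique = len(set(values))
            fraction = n_unique / len(values) if values else None
            self._cache[column] = UniqueStats(
                count=ColumnStat[int](n_unique, exact=True),
                fraction=ColumnStat[float](fraction, exact=True),
            )
        return self._cache[column]


source = InMemorySourceInfo({"a": [1, 1, 2, 3], "b": [5, 5, 5, 5]})
before = source.unique_stats("a")  # not requested yet, so still unknown
source.add_unique_stats_column("a")
after = source.unique_stats("a")  # the first real scan happens here
```

In this sketch, `add_unique_stats_column` only registers a column, and the data is read (and cached) the first time `unique_stats` is called for it. Whether the real sub-classes defer and cache in the same way is an assumption.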


class ColumnStats:
    """
    Column statistics.

    Parameters
    ----------
    name
        Column name.
    source_info
        Datasource information.
    source_name
        Source-column name.
    unique_stats
        Unique-value statistics.
    """

    __slots__ = ("name", "source_info", "source_name", "unique_stats")

    name: str
    source_info: DataSourceInfo
    source_name: str
    unique_stats: UniqueStats

    def __init__(
        self,
        name: str,
        *,
        source_info: DataSourceInfo | None = None,
        source_name: str | None = None,
        unique_stats: UniqueStats | None = None,
    ) -> None:
        self.name = name
        self.source_info = source_info or DataSourceInfo()
        self.source_name = source_name or name
        self.unique_stats = unique_stats or UniqueStats()
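
A small usage sketch for `ColumnStats`, showing the defaults applied by `__init__`. The `UniqueStats` and `DataSourceInfo` classes below are simplified stand-ins (an assumption made to keep the example short), and the column names are made up.

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass
class UniqueStats:
    """Simplified stand-in: a bare count instead of ColumnStat fields."""

    count: int | None = None


class DataSourceInfo:
    """Simplified stand-in for the base class in the diff."""


class ColumnStats:
    __slots__ = ("name", "source_info", "source_name", "unique_stats")

    name: str
    source_info: DataSourceInfo
    source_name: str
    unique_stats: UniqueStats

    def __init__(
        self,
        name: str,
        *,
        source_info: DataSourceInfo | None = None,
        source_name: str | None = None,
        unique_stats: UniqueStats | None = None,
    ) -> None:
        self.name = name
        self.source_info = source_info or DataSourceInfo()
        self.source_name = source_name or name
        self.unique_stats = unique_stats or UniqueStats()


# With no keyword arguments, source_name falls back to the column name.
scanned = ColumnStats("amount")

# A renamed column can keep a pointer to its name in the underlying source.
aliased = ColumnStats("total", source_name="amount", unique_stats=UniqueStats(count=10))
```

Note that the `or` fallbacks treat any falsy argument as missing, so an empty-string `source_name` would also be replaced by `name`.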