[Data] - Add Dataset Summary API #58862
Signed-off-by: Goutam <goutam@anyscale.com>
Code Review
This pull request introduces a Dataset.summary() API to compute column-wise statistics based on data types. The implementation is well-structured, with most of the logic in a new ray.data.stats module. The design is extensible, allowing users to override default aggregations. The changes include new base functionality in AggregateFnV2 to support formatted statistics, a new DatasetSummary class to hold and present the results, and comprehensive tests. My review found a couple of areas for improvement: one for robustness in type comparison and another for documentation clarity. Overall, this is a great addition to Ray Data.
/gemini summary
Summary of Changes
This pull request introduces a powerful new
Highlights
Changelog
Activity
python/ray/data/aggregate.py
Outdated
    }

    # Fallback to scalar result for non-list values or unexpandable Nones
    return {stat_name: (value, agg_type)}
Do we want to return this if value and labels are None?
python/ray/data/aggregate.py
Outdated
def get_result_labels(self) -> Optional[List[str]]:
    """Return labels for list-valued results.

    For aggregators that return list results (e.g., quantiles), this method
    returns meaningful labels for each element in the list. If the aggregator
    returns a scalar result or doesn't have meaningful labels, returns None.

    Returns:
        List of string labels for each element in the result list, or None.
    """
    return None

def format_stats(
    self, value: Any, agg_type: "pa.DataType", original_type: "pa.DataType"
) -> Dict[str, Tuple[Any, "pa.DataType"]]:
    """Format aggregation result into stat entries.

    Takes the raw aggregation result and formats it into one or more stat
    entries. For scalar results, returns a single entry. For list results,
    expands into multiple indexed entries.

    Args:
        value: The aggregation result value
        agg_type: PyArrow type of the aggregation result
        original_type: PyArrow type of the original column

    Returns:
        Dictionary mapping stat names to (value, type) tuples
Why is this part of aggregation? This has nothing to do with Aggregations themselves
Each aggregate will have its own labels and formatting structure.
get_result_labels(): the aggregator knows what its list elements mean (e.g., the quantile aggregator knows that indices [0, 1, 2] represent ["0.25", "0.5", "0.75"]).
I can move format_stats out of aggregate.py.
The output of each aggregation is fixed / configured -- if you passed quantiles [0.5, 0.99] you'd expect to get back a list of 2 values, right?
Then it will appear as approx_quantile[{idx}] instead of approx_quantile[{quantiles[idx]}], which is also fine.
Moved the labels function out of aggregates.
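For readers following the thread, the two naming schemes discussed above (index-based versus label-based keys) can be sketched in plain Python. This is only an illustration under assumptions: `expand_stat` is a hypothetical helper, not the function in this PR, and it ignores the PyArrow type that the real `format_stats` carries alongside each value.

```python
from typing import Any, Dict, List, Optional


def expand_stat(
    stat_name: str, value: Any, labels: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: flatten one aggregation result into named entries."""
    if not isinstance(value, list):
        # Scalar results (including None) stay as a single entry.
        return {stat_name: value}
    if labels is not None and len(labels) == len(value):
        # Label-based keys, e.g. approx_quantile[0.5].
        keys = labels
    else:
        # Index-based keys, e.g. approx_quantile[0].
        keys = [str(i) for i in range(len(value))]
    return {f"{stat_name}[{key}]": item for key, item in zip(keys, value)}


indexed = expand_stat("approx_quantile", [25.0, 30.0])
labeled = expand_stat("approx_quantile", [25.0, 30.0], labels=["0.5", "0.99"])
scalar = expand_stat("mean", 18.5)
```

With labels omitted, the keys fall back to list positions, which matches the `approx_quantile[{idx}]` form mentioned in the reply above.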
except (TypeError, ValueError, pa.ArrowInvalid):
    # Cast problematic columns to null type
    null_col = pa.nulls(len(col), type=pa.null())
    result_data[col_name] = null_col.to_pandas()
When would this occur?
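import pyarrow as pa
from ray.air.util.tensor_extensions.arrow import ArrowTensorType  # import path may vary by Ray version

tensor_type = ArrowTensorType(shape=(2, 2), dtype=pa.float32())
table = pa.Table.from_pydict({
    "image_col": pa.array([None, None], type=tensor_type)
})
col = table.column("image_col")
col.to_pandas()  # the conversion in question for this all-null tensor column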
This is one such example.
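The fallback pattern under discussion (try the conversion, and substitute an all-null column of the same length if it fails) can be sketched without Ray or PyArrow. Everything here is a stand-in: `convert_columns` and `to_floats` are hypothetical, plain lists replace Arrow columns, and the real code also catches `pa.ArrowInvalid`.

```python
from typing import Any, Callable, Dict, List


def convert_columns(
    columns: Dict[str, List[Any]],
    convert: Callable[[List[Any]], List[Any]],
) -> Dict[str, List[Any]]:
    """Hypothetical helper: convert each column, nulling out any that fail."""
    result: Dict[str, List[Any]] = {}
    for name, col in columns.items():
        try:
            result[name] = convert(col)
        except (TypeError, ValueError):
            # Keep the column but replace its contents with nulls,
            # preserving the row count so the table stays rectangular.
            result[name] = [None] * len(col)
    return result


def to_floats(col: List[Any]) -> List[float]:
    # float(None) raises TypeError, standing in for a failing to_pandas().
    return [float(v) for v in col]


converted = convert_columns(
    {"age": [25, 30], "image_col": [None, None]}, to_floats
)
```

The point of the design is that one unconvertible column does not prevent the rest of the summary from being rendered.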
if not self.is_arrow_type():
    return False
## Description
Add a `Dataset.summary()` API that computes statistics on each column
based on its underlying pyarrow dtype.
Users can override the default statistics computed for a given dtype
if needed.
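As a rough illustration of per-dtype defaults with an optional override, here is a minimal sketch in plain Python. `DEFAULT_STATS` and `resolve_stats` are hypothetical names, not part of Ray Data, and the default lists are only read off the example output in this description rather than taken from the implementation.

```python
from typing import Dict, List, Optional

# Assumed defaults, inferred from which rows are populated per column
# in the example output (numeric: age/salary, string: name/city).
DEFAULT_STATS: Dict[str, List[str]] = {
    "numeric": [
        "count", "mean", "min", "max", "std",
        "approx_quantile", "missing_pct", "zero_pct",
    ],
    "string": ["count", "missing_pct", "approx_topk"],
}


def resolve_stats(
    kind: str, overrides: Optional[Dict[str, List[str]]] = None
) -> List[str]:
    """Hypothetical helper: pick the stats for a dtype kind, honoring overrides."""
    if overrides and kind in overrides:
        return list(overrides[kind])
    return list(DEFAULT_STATS[kind])


default_numeric = resolve_stats("numeric")
custom_numeric = resolve_stats("numeric", {"numeric": ["count", "mean"]})
default_string = resolve_stats("string", {"numeric": ["count", "mean"]})
```

An override replaces the list for that dtype kind only; other kinds keep their defaults.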
Example Usage:
```
import ray

ds = ray.data.from_items([
    {"age": 25, "salary": 50000, "name": "Alice", "city": "NYC"},
    {"age": 30, "salary": 60000, "name": None, "city": "LA"},
    {"age": 0, "salary": None, "name": "Bob", "city": None},
])
summary = ds.summary()
summary.to_pandas()
```
OUTPUT
```
   statistic           age        city                         name                           salary
0  approx_quantile[0]  25.000000  None                         None                           60000.000000
1  approx_topk[0]      NaN        {'city': 'LA', 'count': 1}   {'count': 1, 'name': 'Bob'}    NaN
2  approx_topk[1]      NaN        {'city': 'NYC', 'count': 1}  {'count': 1, 'name': 'Alice'}  NaN
3  count               3.000000   3                            3                              3.000000
4  max                 30.000000  NaN                          NaN                            60000.000000
5  mean                18.333333  None                         None                           55000.000000
6  min                 0.000000   NaN                          NaN                            50000.000000
7  missing_pct         0.000000   33.333333                    33.333333                      33.333333
8  std                 13.123346  None                         None                           5000.000000
9  zero_pct            33.333333  None                         None                           0.000000
```
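The numeric rows above can be cross-checked by hand. The sketch below recomputes them with the standard library under stated assumptions: `numeric_summary` is a hypothetical helper, `count` includes null rows (as the `salary` column suggests), `std` is the population standard deviation (which matches 13.123346 for `age`), and `zero_pct` is taken over non-null values (the example data cannot distinguish this from dividing by the total row count).

```python
import statistics
from typing import Dict, List, Optional


def numeric_summary(values: List[Optional[float]]) -> Dict[str, float]:
    """Hypothetical helper: recompute the numeric rows of the table above."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "count": total,  # assumed to include null rows
        "mean": statistics.fmean(present),
        "std": statistics.pstdev(present),  # population std (ddof=0)
        "min": min(present),
        "max": max(present),
        "missing_pct": 100.0 * (total - len(present)) / total,
        "zero_pct": 100.0 * sum(v == 0 for v in present) / len(present),
    }


age = numeric_summary([25, 30, 0])
salary = numeric_summary([50000, 60000, None])
```

For `age`, the squared deviations from the mean 18.333333 sum to about 516.67; dividing by 3 and taking the square root gives about 13.1233, consistent with the `std` row.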
---------
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>