[Data] - Add Dataset Summary API by goutamvenkat-anyscale · Pull Request #58862 · ray-project/ray

goutamvenkat-anyscale · 2025-11-20T23:44:19Z

Description

Add functionality to Dataset that computes statistics on columns based on their underlying PyArrow dtype.

Users can override the default aggregations for a given dtype when they need different statistics.

Example Usage:

import ray

ds = ray.data.from_items([
    {"age": 25, "salary": 50000, "name": "Alice", "city": "NYC"},
    {"age": 30, "salary": 60000, "name": None, "city": "LA"},
    {"age": 0, "salary": None, "name": "Bob", "city": None},
])

summary = ds.summary()
summary.to_pandas()

Output:

            statistic        age                         city                           name        salary
0  approx_quantile[0]  25.000000                         None                           None  60000.000000
1      approx_topk[0]        NaN   {'city': 'LA', 'count': 1}    {'count': 1, 'name': 'Bob'}           NaN
2      approx_topk[1]        NaN  {'city': 'NYC', 'count': 1}  {'count': 1, 'name': 'Alice'}           NaN
3               count   3.000000                            3                              3      3.000000
4                 max  30.000000                          NaN                            NaN  60000.000000
5                mean  18.333333                         None                           None  55000.000000
6                 min   0.000000                          NaN                            NaN  50000.000000
7         missing_pct   0.000000                    33.333333                      33.333333     33.333333
8                 std  13.123346                         None                           None   5000.000000
9            zero_pct  33.333333                         None                           None      0.000000
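As a cross-check on the table above, the numeric rows for age and salary can be reproduced in plain Python. This is only a sketch: the helper column_stats is hypothetical and not part of Ray Data. It assumes that count includes nulls, that mean, min, and max skip nulls, and that both percentages are taken over all rows. Those assumptions agree with the sample output but are not confirmed by the PR text.

```python
from typing import Optional, Sequence


def column_stats(values: Sequence[Optional[float]]) -> dict:
    """Recompute a few summary rows for one numeric column (hypothetical helper)."""
    total = len(values)
    present = [v for v in values if v is not None]
    return {
        # Assumption: count covers every row, including nulls.
        "count": total,
        "min": min(present),
        "max": max(present),
        "mean": sum(present) / len(present),
        # Assumption: percentages use the total row count as the denominator.
        "missing_pct": 100.0 * (total - len(present)) / total,
        "zero_pct": 100.0 * sum(1 for v in present if v == 0) / total,
    }


age = column_stats([25, 30, 0])
salary = column_stats([50000, 60000, None])
print(age)
print(salary)
```

The std row is left out here on purpose; it depends on the ddof convention raised later in the review.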

Signed-off-by: Goutam <goutam@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces a Dataset.summary() API to compute column-wise statistics based on data types. The implementation is well-structured, with most of the logic in a new ray.data.stats module. The design is extensible, allowing users to override default aggregations. The changes include new base functionality in AggregateFnV2 to support formatted statistics, a new DatasetSummary class to hold and present the results, and comprehensive tests. My review found a couple of areas for improvement: one for robustness in type comparison and another for documentation clarity. Overall, this is a great addition to Ray Data.

python/ray/data/dataset.py

python/ray/data/stats.py

goutamvenkat-anyscale · 2025-11-20T23:46:14Z

/gemini summary

gemini-code-assist · 2025-11-20T23:46:35Z

Summary of Changes

This pull request introduces a powerful new summary() API to Ray Data, significantly enhancing its data profiling capabilities. The API automatically generates a statistical overview of a dataset's columns, intelligently applying different aggregations based on data types. It provides a structured, customizable, and easily consumable output, making it much simpler for users to understand the characteristics of their data. The changes involve substantial refactoring of the aggregation system to support type-aware statistic generation and robust handling of various PyArrow data types.

Highlights

New Dataset Summary API: Introduced a new summary() API to ray.data.Dataset that computes comprehensive statistical summaries of columns based on their PyArrow dtypes. This API is designed to provide quick insights into dataset characteristics.
Customizable Aggregations: The summary() API allows users to override default aggregations for specific data types using the override_dtype_agg_mapping parameter, offering flexibility in statistical analysis.
Enhanced Aggregator Functionality: The AggregateFnV2 class has been extended with new methods (get_stat_name, get_result_labels, format_stats) to better handle and format aggregation results, particularly for list-valued outputs like quantiles and top-k.
Robust Dtype Matching: New utility functions (_matches_dtype, _get_aggregators_for_dtype) have been added to ray.data.datatype and ray.data.stats to enable more flexible and robust matching of column data types to appropriate aggregators, including pattern matching for logical types (e.g., temporal types).
Structured Output and Pandas Integration: The summary results are encapsulated in a new DatasetSummary object, which provides methods to convert the summary to a Pandas DataFrame (to_pandas()) and access statistics for individual columns (get_column_stats()). It also includes logic for safely converting problematic PyArrow extension types to Pandas.

Changelog

python/ray/data/__init__.py
- Imported DatasetSummary and added it to the __all__ export list.
python/ray/data/aggregate.py
- Added Tuple to imports.
- Imported pyarrow as pa for type checking.
- Introduced _stat_name attribute to AggregateFnV2 to store the base stat name.
- Added get_stat_name(), get_result_labels(), and format_stats() methods to AggregateFnV2 for improved handling and formatting of aggregation results.
- Overrode get_result_labels() in ApproximateQuantile to provide quantile values as labels for list results.
python/ray/data/dataset.py
- Imported AggregateFnV2 and DataType.
- Imported DatasetSummary, _build_summary_table, _dtype_aggregators_for_dataset, and _parse_summary_stats from ray.data.stats.
- Added the new summary() method to the Dataset class, which orchestrates the computation and presentation of dataset statistics using the new type-aware aggregation logic.
python/ray/data/datatype.py
- Added _matches_dtype function to compare column dtypes with mapping keys, supporting both exact and pattern matching for logical types.
python/ray/data/stats.py
- Imported pandas as pd and convert_to_pyarrow_array.
- Introduced DatasetSummary dataclass with to_pandas() and get_column_stats() methods, including logic for safe Pandas conversion.
- Introduced DtypeAggregators dataclass to hold column-to-dtype mapping and aggregators.
- Refactored aggregator generation by removing categorical_aggregators, vector_aggregators, and FeatureAggregators.
- Added temporal_aggregators for temporal types and basic_aggregators as a general fallback.
- Introduced default_dtype_aggregators() to provide a mapping from DataType to aggregator factory functions.
- Added _get_fallback_aggregators for heuristic-based type detection and _get_aggregators_for_dtype for selecting aggregators based on dtype and custom mappings.
- Replaced feature_aggregators_for_dataset with _dtype_aggregators_for_dataset to utilize the new dtype-based aggregation logic.
- Added _parse_summary_stats to process raw aggregation results into schema-matching and schema-changing categories.
- Added _create_pyarrow_array and _build_summary_table for constructing PyArrow tables from parsed statistics, handling type inference and preservation.
python/ray/data/tests/test_dataset_stats.py
- Updated imports to reflect changes in ray.data.stats.
- Replaced TestFeatureAggregatorsForDataset with TestDtypeAggregatorsForDataset to test the new _dtype_aggregators_for_dataset logic.
- Added TestIndividualAggregatorFunctions to test numerical_aggregators, temporal_aggregators, and basic_aggregators.
- Added TestDefaultDtypeAggregators to verify the default mappings.
- Introduced TestDatasetSummary to thoroughly test the new Dataset.summary() API, covering various scenarios including custom mappings, column filtering, and detailed value verification.

Activity

goutamvenkat-anyscale requested a summary of the pull request.
A review comment suggested clarifying in the docstring that the std statistic in summary() computes population standard deviation (ddof=0), unlike Dataset.std() which defaults to sample standard deviation (ddof=1).
A review comment suggested using direct pyarrow.DataType equality comparison (agg_type == original_type) instead of string representation comparison (str(agg_type) == str(original_type)) for robustness in type checking.
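To make the ddof distinction from the std comment above concrete, the sketch below uses Python's statistics module (not Ray Data) on the age values from the PR description. The population form reproduces the 13.123346 shown in the sample output, while the sample form gives a larger value.

```python
import statistics

ages = [25, 30, 0]

population_std = statistics.pstdev(ages)  # divides by N (ddof=0)
sample_std = statistics.stdev(ages)       # divides by N - 1 (ddof=1)

print(f"population std: {population_std:.6f}")
print(f"sample std:     {sample_std:.6f}")
```

With only three rows the two conventions differ by roughly three units, which is why the docstring note matters for small datasets.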

goutamvenkat-anyscale · 2025-11-21T21:30:06Z

/gemini summary

gemini-code-assist · 2025-11-21T21:30:31Z

Summary of Changes

This pull request introduces a new summary() API to Ray Data, providing a powerful and flexible way to generate statistical profiles of datasets. It intelligently applies different aggregations based on column data types, offers customization options for these aggregations, and presents the results in a structured, easily consumable format, including conversion to Pandas DataFrames. This enhancement significantly improves data understanding and exploration within Ray Data.

Highlights

New Dataset Summary API: Introduced a new summary() API to ray.data.Dataset that computes comprehensive statistical summaries of columns based on their PyArrow dtypes, providing quick insights into dataset characteristics.
Type-Aware Aggregation: The API intelligently applies different aggregations based on column data types (e.g., numerical, string, temporal), with default statistics tailored for each type.
Customizable Aggregations: Users can override default aggregations for specific data types using the override_dtype_agg_mapping parameter, offering flexibility in statistical analysis.
Enhanced Aggregator Functionality: The AggregateFnV2 class has been extended with new methods (get_stat_name, get_result_labels, format_stats) to better handle and format complex aggregation results, particularly for list-valued outputs like quantiles and top-k.
Structured Output and Pandas Integration: Summary results are encapsulated in a new DatasetSummary object, providing methods to convert the summary to a Pandas DataFrame (to_pandas()) and access statistics for individual columns (get_column_stats()), including robust handling for PyArrow extension types.
Improved Dtype Matching: New utility functions (_matches_dtype) enable more flexible and robust matching of column data types to appropriate aggregators, including pattern matching for logical types (e.g., temporal types).

Changelog

python/ray/data/BUILD.bazel
- Increased the test size for test_dataset_stats from 'small' to 'large'.
python/ray/data/__init__.py
- Imported DatasetSummary and added it to the module's __all__ export list.
python/ray/data/aggregate.py
- Added Tuple to imports and pyarrow for type checking.
- Introduced _stat_name attribute to AggregateFnV2 to store the base stat name.
- Added get_stat_name(), get_result_labels(), and format_stats() methods to AggregateFnV2 for improved handling and formatting of aggregation results.
- Overrode get_result_labels() in ApproximateQuantile to provide quantile values as labels for list results.
python/ray/data/dataset.py
- Imported AggregateFnV2, DataType, DatasetSummary, and several helper functions from ray.data.stats.
- Added the new summary() method to the Dataset class, which orchestrates the computation and presentation of dataset statistics using the new type-aware aggregation logic.
python/ray/data/datatype.py
- Added _matches_dtype function to compare column dtypes with mapping keys, supporting both exact and pattern matching for logical types.
python/ray/data/stats.py
- Introduced the DatasetSummary dataclass with to_pandas() and get_column_stats() methods, including logic for safe Pandas conversion.
- Introduced the DtypeAggregators dataclass to hold column-to-dtype mapping and aggregators.
- Refactored aggregator generation by removing categorical_aggregators, vector_aggregators, and FeatureAggregators.
- Added temporal_aggregators for temporal types and basic_aggregators as a general fallback.
- Introduced default_dtype_aggregators() to provide a mapping from DataType to aggregator factory functions.
- Added _get_fallback_aggregators for heuristic-based type detection and _get_aggregators_for_dtype for selecting aggregators based on dtype and custom mappings.
- Replaced feature_aggregators_for_dataset with _dtype_aggregators_for_dataset to utilize the new dtype-based aggregation logic.
- Added _parse_summary_stats to process raw aggregation results into schema-matching and schema-changing categories.
- Added _create_pyarrow_array and _build_summary_table for constructing PyArrow tables from parsed statistics, handling type inference and preservation.
python/ray/data/tests/test_dataset_stats.py
- Updated imports to reflect changes in ray.data.stats.
- Replaced TestFeatureAggregatorsForDataset with TestDtypeAggregatorsForDataset to test the new _dtype_aggregators_for_dataset logic.
- Added TestIndividualAggregatorFunctions to test numerical_aggregators, temporal_aggregators, and basic_aggregators.
- Added TestDefaultDtypeAggregators to verify the default mappings.
- Introduced TestDatasetSummary to thoroughly test the new Dataset.summary() API, covering various scenarios including custom mappings, column filtering, and detailed value verification.

Activity

The pull request author, goutamvenkat-anyscale, requested a summary of the pull request.
A bot review suggested clarifying in the summary() docstring that the std statistic uses population standard deviation (ddof=0), unlike Dataset.std()'s default sample standard deviation.
Another bot review recommended using direct pyarrow.DataType equality comparison (agg_type == original_type) instead of string comparison for robustness in type checking.
A bot review identified a potential bug where zip(labels, value) could silently drop statistics if label and value lengths mismatch.
A bot review pointed out an issue with format_stats not properly handling empty lists, treating them as scalar results.
A bot review highlighted a bug in schema matching where the decision was made based on the overall list type rather than individual formatted stat types.
A bot review noted a missing null check in format_stats for list-type aggregations returning None.
A bot review found an incomplete list type check in format_stats that incorrectly defaults to pa.float64() for large_list or fixed_size_list types.
A bot review identified a bug where user-defined dtype overrides could be shadowed by default pattern-matching types due to the order of mapping application.
A bot review reported a missing case for TENSOR types in the _matches_dtype pattern matching logic.
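The zip(labels, value) concern in the list above is easy to reproduce outside Ray. The following sketch shows the silent truncation and one possible guard; pair_labels is a hypothetical helper, not the fix that was applied in the PR.

```python
from typing import Any, Dict, List


def pair_labels(labels: List[str], values: List[Any]) -> Dict[str, Any]:
    """Pair labels with values, refusing to drop entries (hypothetical helper)."""
    if len(labels) != len(values):
        raise ValueError(f"expected {len(labels)} values, got {len(values)}")
    return dict(zip(labels, values))


labels = ["0.25", "0.5", "0.75"]
values = [10.0, 20.0]  # one value short

# Plain zip stops at the shorter input, so the "0.75" entry vanishes silently.
truncated = dict(zip(labels, values))
print(truncated)

try:
    pair_labels(labels, values)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

On Python 3.10 and later, zip(labels, values, strict=True) raises ValueError on a length mismatch and can replace the explicit check.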

cem-anyscale · 2025-11-22T00:26:36Z

python/ray/data/aggregate.py

+                }
+
+        # Fallback to scalar result for non-list values or unexpandable Nones
+        return {stat_name: (value, agg_type)}


Do we want to return this if both value and labels are None?

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale · 2025-11-24T19:19:40Z

/gemini summary

gemini-code-assist · 2025-11-24T19:20:07Z

Summary of Changes

This pull request introduces a powerful new summary() API to Ray Data, significantly enhancing its data profiling capabilities. The API automatically generates a statistical overview of a dataset's columns, intelligently applying different aggregations based on data types. It provides a structured, customizable, and easily consumable output, making it much simpler for users to understand the characteristics of their data. The changes involve substantial refactoring of the aggregation system to support type-aware statistic generation and robust handling of various PyArrow data types, addressing several identified edge cases and bugs.

Highlights

New Dataset Summary API: Introduced a new summary() API to ray.data.Dataset that computes comprehensive statistical summaries of columns based on their PyArrow dtypes, providing quick insights into dataset characteristics.
Type-Aware Aggregation: The API intelligently applies different aggregations based on column data types (e.g., numerical, string, temporal), with default statistics tailored for each type.
Flexible Customization: Users can override default aggregations for specific data types using the override_dtype_agg_mapping parameter, offering flexibility in statistical analysis.
Enhanced Aggregator Functionality: The AggregateFnV2 class has been extended with new methods (get_stat_name, get_result_labels, format_stats) to better handle and format complex, list-valued aggregation results (e.g., quantiles, top-k), including robust handling for nulls and empty lists.
Robust Type Matching: Improved utility functions (_matches_dtype) for flexible and robust matching of column data types, including pattern matching for logical types like temporal or tensor types, ensuring user-defined overrides take precedence.
Structured Output: Summary results are encapsulated in a new DatasetSummary object, providing methods to convert the summary to a Pandas DataFrame (to_pandas()) and access statistics for individual columns (get_column_stats()), with safe handling for PyArrow extension types.

Changelog

doc/source/data/api/_autogen.rst
- Added stats.DatasetSummary to the autogenerated API documentation.
doc/source/data/api/dataset.rst
- Included DatasetSummary in the API documentation.
python/ray/data/BUILD.bazel
- Increased the test size for test_dataset_stats from 'small' to 'large'.
python/ray/data/__init__.py
- Imported DatasetSummary and added it to the module's __all__ export list.
python/ray/data/aggregate.py
- Added Tuple to imports and pyarrow for type checking.
- Introduced _stat_name attribute to AggregateFnV2 to store the base stat name.
- Added get_stat_name(), get_result_labels(), and format_stats() methods to AggregateFnV2 for improved handling and formatting of aggregation results, including robust handling for list-valued outputs, nulls, and empty lists.
- Overrode get_result_labels() in ApproximateQuantile to provide quantile values as labels for list results.
python/ray/data/dataset.py
- Imported AggregateFnV2, DataType, DatasetSummary, and several helper functions from ray.data.stats.
- Added the new summary() method to the Dataset class, which orchestrates the computation and presentation of dataset statistics using the new type-aware aggregation logic.
python/ray/data/datatype.py
- Added _matches_dtype function to compare column dtypes with mapping keys, supporting both exact and pattern matching for logical types including TENSOR.
python/ray/data/stats.py
- Introduced the DatasetSummary dataclass with to_pandas() and get_column_stats() methods, including logic for safe Pandas conversion using _safe_convert_table.
- Introduced the _DtypeAggregators dataclass to hold column-to-dtype mapping and aggregators.
- Refactored aggregator generation logic, replacing older functions with _numerical_aggregators, _temporal_aggregators, _basic_aggregators, _default_dtype_aggregators, _get_fallback_aggregators, and _get_aggregators_for_dtype.
- Ensured user-provided dtype_agg_mapping takes precedence over default patterns.
- Added _parse_summary_stats to process raw aggregation results into schema-matching and schema-changing categories, correctly classifying based on individual stat types.
- Added _create_pyarrow_array and _build_summary_table for constructing PyArrow tables from parsed statistics, handling type inference and preservation, and robustly creating arrays from potentially problematic data.
python/ray/data/tests/test_dataset_stats.py
- Updated imports to reflect changes in ray.data.stats.
- Replaced TestFeatureAggregatorsForDataset with TestDtypeAggregatorsForDataset to test the new _dtype_aggregators_for_dataset logic.
- Added TestIndividualAggregatorFunctions to test _numerical_aggregators, _temporal_aggregators, and _basic_aggregators.
- Added TestDefaultDtypeAggregators to verify the default mappings.
- Introduced TestDatasetSummary to thoroughly test the new Dataset.summary() API, covering various scenarios including custom mappings, column filtering, and detailed value verification.

Activity

goutamvenkat-anyscale requested a summary of the pull request.
A review comment suggested clarifying in the summary() docstring that the std statistic computes population standard deviation (ddof=0), unlike Dataset.std() which defaults to sample standard deviation (ddof=1).
A review comment suggested using direct pyarrow.DataType equality comparison (agg_type == original_type) instead of string representation comparison (str(agg_type) == str(original_type)) for robustness in type checking.
A bot review identified a potential bug where zip(labels, value) could silently drop statistics if label and value lengths mismatch.
A bot review pointed out an issue with format_stats not properly handling empty lists, treating them as scalar results.
A bot review highlighted a bug in schema matching where the decision was made based on the overall list type rather than individual formatted stat types.
A bot review noted a missing null check in format_stats for list-type aggregations returning None.
A bot review found an incomplete list type check in format_stats that incorrectly defaults to pa.float64() for large_list or fixed_size_list types.
A bot review identified a bug where user-defined dtype overrides could be shadowed by default pattern-matching types due to the order of mapping application.
A bot review reported a missing case for TENSOR types in the _matches_dtype pattern matching logic.
A bot review noted that format_stats incorrectly infers scalar type for None values when is_list_type is true but agg_type is not a list.
A bot review pointed out that _extract_column_from_table was not using the safe pandas conversion method, bypassing error handling.
cem-anyscale asked for clarification on the return behavior when both value and labels are None.
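The override-shadowing item above comes down to lookup order. The sketch below illustrates that ordering with simplified stand-ins: dtypes are plain strings, aggregators are lists of stat names, and resolve_aggregators and DEFAULTS are hypothetical names. The real code matches DataType objects and supports pattern matching, which this sketch does not attempt.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins: dtype names as strings, aggregators as stat-name lists.
AggFactory = Callable[[], List[str]]

DEFAULTS: Dict[str, AggFactory] = {
    "int64": lambda: ["count", "mean", "min", "max", "std"],
    "string": lambda: ["count", "approx_topk"],
}


def resolve_aggregators(
    dtype: str, overrides: Optional[Dict[str, AggFactory]] = None
) -> List[str]:
    """Pick aggregators for a dtype, consulting user overrides first."""
    for mapping in (overrides or {}, DEFAULTS):
        if dtype in mapping:
            return mapping[dtype]()
    # Fallback for dtypes that neither mapping covers.
    return ["count"]


default_stats = resolve_aggregators("int64")
custom_stats = resolve_aggregators("int64", {"int64": lambda: ["count", "max"]})
unknown_stats = resolve_aggregators("binary")
print(default_stats, custom_stats, unknown_stats)
```

Checking the user mapping before the defaults is what keeps an override from being shadowed by a broader default entry.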

python/ray/data/aggregate.py

alexeykudinkin · 2025-12-03T21:06:04Z

python/ray/data/aggregate.py

+    def get_result_labels(self) -> Optional[List[str]]:
+        """Return labels for list-valued results.
+
+        For aggregators that return list results (e.g., quantiles), this method
+        returns meaningful labels for each element in the list. If the aggregator
+        returns a scalar result or doesn't have meaningful labels, returns None.
+
+        Returns:
+            List of string labels for each element in the result list, or None.
+        """
+        return None
+
+    def format_stats(
+        self, value: Any, agg_type: "pa.DataType", original_type: "pa.DataType"
+    ) -> Dict[str, Tuple[Any, "pa.DataType"]]:
+        """Format aggregation result into stat entries.
+
+        Takes the raw aggregation result and formats it into one or more stat
+        entries. For scalar results, returns a single entry. For list results,
+        expands into multiple indexed entries.
+
+        Args:
+            value: The aggregation result value
+            agg_type: PyArrow type of the aggregation result
+            original_type: PyArrow type of the original column
+
+        Returns:
+            Dictionary mapping stat names to (value, type) tuples


Why is this part of aggregation? This has nothing to do with the aggregations themselves.

So each aggregate will have its own labels and formatting structure.

get_result_labels(): The aggregator knows what its list elements mean (e.g., the quantile aggregator knows that indices [0, 1, 2] represent ["0.25", "0.5", "0.75"]).

I can move format_stats out of aggregate.py.

The output of each aggregation is fixed by its configuration: if you passed quantiles [0.5, 0.99], you'd expect to get back a list of 2 values, right?

Then it will appear as approx_quantile[{idx}] instead of approx_quantile[{quantiles[idx]}], which is also fine.

Moved the labels function out of aggregates.
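The two naming schemes discussed in this thread can be compared with a small sketch. expand_stat is a hypothetical helper written for illustration; it is not the format_stats implementation from the PR and ignores the PyArrow type bookkeeping that the real method carries.

```python
from typing import Any, Dict, List, Optional


def expand_stat(
    stat_name: str, value: Any, labels: Optional[List[str]] = None
) -> Dict[str, Any]:
    """Expand a list-valued result into one entry per element (hypothetical)."""
    if not isinstance(value, list):
        # Scalars (and None) stay as a single entry under the base name.
        return {stat_name: value}
    if labels is None:
        # No labels supplied: fall back to positional indices.
        labels = [str(i) for i in range(len(value))]
    if len(labels) != len(value):
        raise ValueError("labels and value must have the same length")
    return {f"{stat_name}[{label}]": v for label, v in zip(labels, value)}


indexed = expand_stat("approx_quantile", [25.0, 30.0])
labeled = expand_stat("approx_quantile", [25.0, 30.0], labels=["0.5", "0.99"])
scalar = expand_stat("count", 3)
print(indexed)
print(labeled)
print(scalar)
```

The indexed form needs no knowledge of the aggregator's configuration, which is the trade-off the thread settles on.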

python/ray/data/stats.py

alexeykudinkin · 2025-12-03T21:10:43Z

python/ray/data/stats.py

+                except (TypeError, ValueError, pa.ArrowInvalid):
+                    # Cast problematic columns to null type
+                    null_col = pa.nulls(len(col), type=pa.null())
+                    result_data[col_name] = null_col.to_pandas()


When would this occur?

One such example:

import pyarrow as pa
from ray.air.util.tensor_extensions.arrow import ArrowTensorType

tensor_type = ArrowTensorType(shape=(2, 2), dtype=pa.float32())
table = pa.Table.from_pydict({
    "image_col": pa.array([None, None], type=tensor_type)
})
col = table.column("image_col")
col.to_pandas()

alexeykudinkin · 2025-12-04T22:50:12Z

python/ray/data/datatype.py

+            if not self.is_arrow_type():
+                return False


Seems redundant

Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>

goutamvenkat-anyscale added 2 commits November 20, 2025 15:32

[Data] - Add Dataset Summary API

1497874

Signed-off-by: Goutam <goutam@anyscale.com>

bug fix

972dcdf

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale requested a review from a team as a code owner November 20, 2025 23:44

gemini-code-assist bot reviewed Nov 20, 2025

python/ray/data/dataset.py

python/ray/data/stats.py (outdated)

cursor bot reviewed Nov 20, 2025

python/ray/data/stats.py (outdated)

python/ray/data/aggregate.py (outdated)

Couple of fixes

5317586

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 21, 2025

python/ray/data/aggregate.py (outdated)

python/ray/data/stats.py

ray-gardener bot added the usability, docs, and data labels Nov 21, 2025

Fix comments

c95787b

Signed-off-by: Goutam <goutam@anyscale.com>

goutamvenkat-anyscale added the go label (add ONLY when ready to merge, run all tests) Nov 21, 2025

Doc tests

8c06cae

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 21, 2025

python/ray/data/aggregate.py (outdated)

cursor bot reviewed Nov 21, 2025

python/ray/data/aggregate.py (outdated)

python/ray/data/stats.py (outdated)

Some clean up

289e6ac

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 21, 2025

python/ray/data/datatype.py (outdated)

More comments

f020d1b

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 21, 2025

python/ray/data/aggregate.py (outdated)

Falsy

c61ab57

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 21, 2025

python/ray/data/stats.py

goutamvenkat-anyscale added 2 commits November 21, 2025 13:47

One more fix

7d452d9

Signed-off-by: Goutam <goutam@anyscale.com>

Doc

7a088d1

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Nov 21, 2025

python/ray/data/aggregate.py (outdated)

One more fix

2e6f373

Signed-off-by: Goutam <goutam@anyscale.com>

cem-anyscale approved these changes Nov 22, 2025

goutamvenkat-anyscale added 2 commits November 21, 2025 17:20

One more doctest

b48dfe4

Signed-off-by: Goutam <goutam@anyscale.com>

idk

c5f7f36

Signed-off-by: Goutam <goutam@anyscale.com>

Merge branch 'master' into goutam/summary_stuff

4602646

gvspraveen requested a review from alexeykudinkin December 2, 2025 23:05

alexeykudinkin requested changes Dec 3, 2025

Address comments

8e56f82

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Dec 4, 2025

python/ray/data/dataset.py

python/ray/data/stats.py (outdated)

Couple more bugs

04f55bc

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Dec 4, 2025

python/ray/data/datatype.py

goutamvenkat-anyscale added 4 commits December 3, 2025 22:24

Couple more fixes

b08301a

Signed-off-by: Goutam <goutam@anyscale.com>

Fix docs

af07007

Signed-off-by: Goutam <goutam@anyscale.com>

one more

a3f1d5d

Signed-off-by: Goutam <goutam@anyscale.com>

Fix tests

f2e84ed

Signed-off-by: Goutam <goutam@anyscale.com>

cursor bot reviewed Dec 4, 2025

python/ray/data/dataset.py

Merge branch 'master' into goutam/summary_stuff

cc755ff

alexeykudinkin approved these changes Dec 4, 2025

alexeykudinkin merged commit 35c3933 into ray-project:master Dec 5, 2025
6 checks passed

goutamvenkat-anyscale deleted the goutam/summary_stuff branch December 5, 2025 19:23
