[Data] - Add Dataset Summary API#58862

Merged
alexeykudinkin merged 21 commits into ray-project:master from goutamvenkat-anyscale:goutam/summary_stuff
Dec 5, 2025
Conversation

@goutamvenkat-anyscale
Contributor

@goutamvenkat-anyscale goutamvenkat-anyscale commented Nov 20, 2025

Description

Add functionality to Dataset that computes statistics on columns based on the underlying PyArrow dtype.

Allows the user to compute different statistics per dtype (the default aggregations can be overridden if needed).

Example Usage:

ds = ray.data.from_items([
    {"age": 25, "salary": 50000, "name": "Alice", "city": "NYC"},
    {"age": 30, "salary": 60000, "name": None, "city": "LA"},
    {"age": 0, "salary": None, "name": "Bob", "city": None},
])

summary = ds.summary()
summary.to_pandas()

OUTPUT

             statistic        age                         city                           name        salary
0   approx_quantile[0]  25.000000                         None                           None  60000.000000
1       approx_topk[0]        NaN   {'city': 'LA', 'count': 1}    {'count': 1, 'name': 'Bob'}           NaN
2       approx_topk[1]        NaN  {'city': 'NYC', 'count': 1}  {'count': 1, 'name': 'Alice'}           NaN
3                count   3.000000                            3                              3      3.000000
4                  max  30.000000                          NaN                            NaN  60000.000000
5                 mean  18.333333                         None                           None  55000.000000
6                  min   0.000000                          NaN                            NaN  50000.000000
7          missing_pct   0.000000                    33.333333                      33.333333     33.333333
8                  std  13.123346                         None                           None   5000.000000
9             zero_pct  33.333333                         None                           None      0.000000
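The numbers in the table can be reproduced with the standard library; note that `std` here is the population standard deviation (ddof=0), computed over non-null values only:

```python
import statistics

# Sample data from the example above; None marks a missing value.
age = [25, 30, 0]
salary = [50000, 60000, None]

salary_non_null = [v for v in salary if v is not None]

mean_salary = statistics.mean(salary_non_null)        # 55000.0
std_salary = statistics.pstdev(salary_non_null)       # 5000.0 (population std, ddof=0)
missing_pct = 100 * salary.count(None) / len(salary)  # 33.333...
zero_pct_age = 100 * age.count(0) / len(age)          # 33.333...
std_age = statistics.pstdev(age)                      # 13.123346...

print(mean_salary, std_salary, round(missing_pct, 6), round(std_age, 6))
```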

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale goutamvenkat-anyscale requested a review from a team as a code owner November 20, 2025 23:44
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a Dataset.summary() API to compute column-wise statistics based on data types. The implementation is well-structured, with most of the logic in a new ray.data.stats module. The design is extensible, allowing users to override default aggregations. The changes include new base functionality in AggregateFnV2 to support formatted statistics, a new DatasetSummary class to hold and present the results, and comprehensive tests. My review found a couple of areas for improvement: one for robustness in type comparison and another for documentation clarity. Overall, this is a great addition to Ray Data.

@goutamvenkat-anyscale
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces a powerful new summary() API to Ray Data, significantly enhancing its data profiling capabilities. The API automatically generates a statistical overview of a dataset's columns, intelligently applying different aggregations based on data types. It provides a structured, customizable, and easily consumable output, making it much simpler for users to understand the characteristics of their data. The changes involve substantial refactoring of the aggregation system to support type-aware statistic generation and robust handling of various PyArrow data types.

Highlights

  • New Dataset Summary API: Introduced a new summary() API to ray.data.Dataset that computes comprehensive statistical summaries of columns based on their PyArrow dtypes. This API is designed to provide quick insights into dataset characteristics.
  • Customizable Aggregations: The summary() API allows users to override default aggregations for specific data types using the override_dtype_agg_mapping parameter, offering flexibility in statistical analysis.
  • Enhanced Aggregator Functionality: The AggregateFnV2 class has been extended with new methods (get_stat_name, get_result_labels, format_stats) to better handle and format aggregation results, particularly for list-valued outputs like quantiles and top-k.
  • Robust Dtype Matching: New utility functions (_matches_dtype, _get_aggregators_for_dtype) have been added to ray.data.datatype and ray.data.stats to enable more flexible and robust matching of column data types to appropriate aggregators, including pattern matching for logical types (e.g., temporal types).
  • Structured Output and Pandas Integration: The summary results are encapsulated in a new DatasetSummary object, which provides methods to convert the summary to a Pandas DataFrame (to_pandas()) and access statistics for individual columns (get_column_stats()). It also includes logic for safely converting problematic PyArrow extension types to Pandas.
Changelog
  • python/ray/data/__init__.py
    • Imported DatasetSummary and added it to the __all__ export list.
  • python/ray/data/aggregate.py
    • Added Tuple to imports.
    • Imported pyarrow as pa for type checking.
    • Introduced _stat_name attribute to AggregateFnV2 to store the base stat name.
    • Added get_stat_name(), get_result_labels(), and format_stats() methods to AggregateFnV2 for improved handling and formatting of aggregation results.
    • Overrode get_result_labels() in ApproximateQuantile to provide quantile values as labels for list results.
  • python/ray/data/dataset.py
    • Imported AggregateFnV2 and DataType.
    • Imported DatasetSummary, _build_summary_table, _dtype_aggregators_for_dataset, and _parse_summary_stats from ray.data.stats.
    • Added the new summary() method to the Dataset class, which orchestrates the computation and presentation of dataset statistics using the new type-aware aggregation logic.
  • python/ray/data/datatype.py
    • Added _matches_dtype function to compare column dtypes with mapping keys, supporting both exact and pattern matching for logical types.
  • python/ray/data/stats.py
    • Imported pandas as pd and convert_to_pyarrow_array.
    • Introduced DatasetSummary dataclass with to_pandas() and get_column_stats() methods, including logic for safe Pandas conversion.
    • Introduced DtypeAggregators dataclass to hold column-to-dtype mapping and aggregators.
    • Refactored aggregator generation by removing categorical_aggregators, vector_aggregators, and FeatureAggregators.
    • Added temporal_aggregators for temporal types and basic_aggregators as a general fallback.
    • Introduced default_dtype_aggregators() to provide a mapping from DataType to aggregator factory functions.
    • Added _get_fallback_aggregators for heuristic-based type detection and _get_aggregators_for_dtype for selecting aggregators based on dtype and custom mappings.
    • Replaced feature_aggregators_for_dataset with _dtype_aggregators_for_dataset to utilize the new dtype-based aggregation logic.
    • Added _parse_summary_stats to process raw aggregation results into schema-matching and schema-changing categories.
    • Added _create_pyarrow_array and _build_summary_table for constructing PyArrow tables from parsed statistics, handling type inference and preservation.
  • python/ray/data/tests/test_dataset_stats.py
    • Updated imports to reflect changes in ray.data.stats.
    • Replaced TestFeatureAggregatorsForDataset with TestDtypeAggregatorsForDataset to test the new _dtype_aggregators_for_dataset logic.
    • Added TestIndividualAggregatorFunctions to test numerical_aggregators, temporal_aggregators, and basic_aggregators.
    • Added TestDefaultDtypeAggregators to verify the default mappings.
    • Introduced TestDatasetSummary to thoroughly test the new Dataset.summary() API, covering various scenarios including custom mappings, column filtering, and detailed value verification.
Activity
  • goutamvenkat-anyscale requested a summary of the pull request.
  • A review comment suggested clarifying in the docstring that the std statistic in summary() computes population standard deviation (ddof=0), unlike Dataset.std() which defaults to sample standard deviation (ddof=1).
  • A review comment suggested using direct pyarrow.DataType equality comparison (agg_type == original_type) instead of string representation comparison (str(agg_type) == str(original_type)) for robustness in type checking.
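The ddof distinction flagged in the review comments above is easy to see with the standard library:

```python
import statistics

data = [25, 30, 0]  # the "age" column from the example

# summary() reports population standard deviation (ddof=0) ...
pop_std = statistics.pstdev(data)    # ~13.1233
# ... while Dataset.std() defaults to sample standard deviation (ddof=1).
sample_std = statistics.stdev(data)  # ~16.0728

print(round(pop_std, 4), round(sample_std, 4))
```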

@ray-gardener ray-gardener bot added usability docs An issue or change related to documentation data Ray Data-related issues labels Nov 21, 2025
@goutamvenkat-anyscale goutamvenkat-anyscale added the go add ONLY when ready to merge, run all tests label Nov 21, 2025
@goutamvenkat-anyscale
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces a new summary() API to Ray Data, providing a powerful and flexible way to generate statistical profiles of datasets. It intelligently applies different aggregations based on column data types, offers customization options for these aggregations, and presents the results in a structured, easily consumable format, including conversion to Pandas DataFrames. This enhancement significantly improves data understanding and exploration within Ray Data.

Highlights

  • New Dataset Summary API: Introduced a new summary() API to ray.data.Dataset that computes comprehensive statistical summaries of columns based on their PyArrow dtypes, providing quick insights into dataset characteristics.
  • Type-Aware Aggregation: The API intelligently applies different aggregations based on column data types (e.g., numerical, string, temporal), with default statistics tailored for each type.
  • Customizable Aggregations: Users can override default aggregations for specific data types using the override_dtype_agg_mapping parameter, offering flexibility in statistical analysis.
  • Enhanced Aggregator Functionality: The AggregateFnV2 class has been extended with new methods (get_stat_name, get_result_labels, format_stats) to better handle and format complex aggregation results, particularly for list-valued outputs like quantiles and top-k.
  • Structured Output and Pandas Integration: Summary results are encapsulated in a new DatasetSummary object, providing methods to convert the summary to a Pandas DataFrame (to_pandas()) and access statistics for individual columns (get_column_stats()), including robust handling for PyArrow extension types.
  • Improved Dtype Matching: New utility functions (_matches_dtype) enable more flexible and robust matching of column data types to appropriate aggregators, including pattern matching for logical types (e.g., temporal types).
Changelog
  • python/ray/data/BUILD.bazel
    • Increased the test size for test_dataset_stats from 'small' to 'large'.
  • python/ray/data/__init__.py
    • Imported DatasetSummary and added it to the module's __all__ export list.
  • python/ray/data/aggregate.py
    • Added Tuple to imports and pyarrow for type checking.
    • Introduced _stat_name attribute to AggregateFnV2 to store the base stat name.
    • Added get_stat_name(), get_result_labels(), and format_stats() methods to AggregateFnV2 for improved handling and formatting of aggregation results.
    • Overrode get_result_labels() in ApproximateQuantile to provide quantile values as labels for list results.
  • python/ray/data/dataset.py
    • Imported AggregateFnV2, DataType, DatasetSummary, and several helper functions from ray.data.stats.
    • Added the new summary() method to the Dataset class, which orchestrates the computation and presentation of dataset statistics using the new type-aware aggregation logic.
  • python/ray/data/datatype.py
    • Added _matches_dtype function to compare column dtypes with mapping keys, supporting both exact and pattern matching for logical types.
  • python/ray/data/stats.py
    • Introduced the DatasetSummary dataclass with to_pandas() and get_column_stats() methods, including logic for safe Pandas conversion.
    • Introduced the DtypeAggregators dataclass to hold column-to-dtype mapping and aggregators.
    • Refactored aggregator generation by removing categorical_aggregators, vector_aggregators, and FeatureAggregators.
    • Added temporal_aggregators for temporal types and basic_aggregators as a general fallback.
    • Introduced default_dtype_aggregators() to provide a mapping from DataType to aggregator factory functions.
    • Added _get_fallback_aggregators for heuristic-based type detection and _get_aggregators_for_dtype for selecting aggregators based on dtype and custom mappings.
    • Replaced feature_aggregators_for_dataset with _dtype_aggregators_for_dataset to utilize the new dtype-based aggregation logic.
    • Added _parse_summary_stats to process raw aggregation results into schema-matching and schema-changing categories.
    • Added _create_pyarrow_array and _build_summary_table for constructing PyArrow tables from parsed statistics, handling type inference and preservation.
  • python/ray/data/tests/test_dataset_stats.py
    • Updated imports to reflect changes in ray.data.stats.
    • Replaced TestFeatureAggregatorsForDataset with TestDtypeAggregatorsForDataset to test the new _dtype_aggregators_for_dataset logic.
    • Added TestIndividualAggregatorFunctions to test numerical_aggregators, temporal_aggregators, and basic_aggregators.
    • Added TestDefaultDtypeAggregators to verify the default mappings.
    • Introduced TestDatasetSummary to thoroughly test the new Dataset.summary() API, covering various scenarios including custom mappings, column filtering, and detailed value verification.
Activity
  • The pull request author, goutamvenkat-anyscale, requested a summary of the pull request.
  • A bot review suggested clarifying in the summary() docstring that the std statistic uses population standard deviation (ddof=0), unlike Dataset.std()'s default sample standard deviation.
  • Another bot review recommended using direct pyarrow.DataType equality comparison (agg_type == original_type) instead of string comparison for robustness in type checking.
  • A bot review identified a potential bug where zip(labels, value) could silently drop statistics if label and value lengths mismatch.
  • A bot review pointed out an issue with format_stats not properly handling empty lists, treating them as scalar results.
  • A bot review highlighted a bug in schema matching where the decision was made based on the overall list type rather than individual formatted stat types.
  • A bot review noted a missing null check in format_stats for list-type aggregations returning None.
  • A bot review found an incomplete list type check in format_stats that incorrectly defaults to pa.float64() for large_list or fixed_size_list types.
  • A bot review identified a bug where user-defined dtype overrides could be shadowed by default pattern-matching types due to the order of mapping application.
  • A bot review reported a missing case for TENSOR types in the _matches_dtype pattern matching logic.
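The silent-truncation hazard of `zip(labels, value)` mentioned above can be demonstrated, and guarded against, in plain Python (the `expand` helper is a hypothetical illustration, not the merged code):

```python
def expand(labels, values):
    # Hypothetical guarded version: fail loudly on a length mismatch
    # instead of letting zip() drop entries.
    if len(labels) != len(values):
        raise ValueError(
            f"label/value length mismatch: {len(labels)} vs {len(values)}"
        )
    return {f"approx_quantile[{l}]": v for l, v in zip(labels, values)}

labels = ["0.25", "0.5", "0.75"]
values = [10.0, 20.0]  # one value short

# Plain zip() silently stops at the shorter iterable, dropping "0.75":
silently_truncated = dict(zip(labels, values))
print(len(silently_truncated))  # 2, not 3 -- the bug the review flagged

# The guarded version raises instead (zip(..., strict=True) on
# Python 3.10+ achieves the same):
try:
    expand(labels, values)
except ValueError as e:
    print("caught:", e)
```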

}

# Fallback to scalar result for non-list values or unexpandable Nones
return {stat_name: (value, agg_type)}
Contributor


do we want to return this if value and labels are None?

@goutamvenkat-anyscale
Contributor Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces a powerful new summary() API to Ray Data, significantly enhancing its data profiling capabilities. The API automatically generates a statistical overview of a dataset's columns, intelligently applying different aggregations based on data types. It provides a structured, customizable, and easily consumable output, making it much simpler for users to understand the characteristics of their data. The changes involve substantial refactoring of the aggregation system to support type-aware statistic generation and robust handling of various PyArrow data types, addressing several identified edge cases and bugs.

Highlights

  • New Dataset Summary API: Introduced a new summary() API to ray.data.Dataset that computes comprehensive statistical summaries of columns based on their PyArrow dtypes, providing quick insights into dataset characteristics.
  • Type-Aware Aggregation: The API intelligently applies different aggregations based on column data types (e.g., numerical, string, temporal), with default statistics tailored for each type.
  • Flexible Customization: Users can override default aggregations for specific data types using the override_dtype_agg_mapping parameter, offering flexibility in statistical analysis.
  • Enhanced Aggregator Functionality: The AggregateFnV2 class has been extended with new methods (get_stat_name, get_result_labels, format_stats) to better handle and format complex, list-valued aggregation results (e.g., quantiles, top-k), including robust handling for nulls and empty lists.
  • Robust Type Matching: Improved utility functions (_matches_dtype) for flexible and robust matching of column data types, including pattern matching for logical types like temporal or tensor types, ensuring user-defined overrides take precedence.
  • Structured Output: Summary results are encapsulated in a new DatasetSummary object, providing methods to convert the summary to a Pandas DataFrame (to_pandas()) and access statistics for individual columns (get_column_stats()), with safe handling for PyArrow extension types.
Changelog
  • doc/source/data/api/_autogen.rst
    • Added stats.DatasetSummary to the autogenerated API documentation.
  • doc/source/data/api/dataset.rst
    • Included DatasetSummary in the API documentation.
  • python/ray/data/BUILD.bazel
    • Increased the test size for test_dataset_stats from 'small' to 'large'.
  • python/ray/data/__init__.py
    • Imported DatasetSummary and added it to the module's __all__ export list.
  • python/ray/data/aggregate.py
    • Added Tuple to imports and pyarrow for type checking.
    • Introduced _stat_name attribute to AggregateFnV2 to store the base stat name.
    • Added get_stat_name(), get_result_labels(), and format_stats() methods to AggregateFnV2 for improved handling and formatting of aggregation results, including robust handling for list-valued outputs, nulls, and empty lists.
    • Overrode get_result_labels() in ApproximateQuantile to provide quantile values as labels for list results.
  • python/ray/data/dataset.py
    • Imported AggregateFnV2, DataType, DatasetSummary, and several helper functions from ray.data.stats.
    • Added the new summary() method to the Dataset class, which orchestrates the computation and presentation of dataset statistics using the new type-aware aggregation logic.
  • python/ray/data/datatype.py
    • Added _matches_dtype function to compare column dtypes with mapping keys, supporting both exact and pattern matching for logical types including TENSOR.
  • python/ray/data/stats.py
    • Introduced the DatasetSummary dataclass with to_pandas() and get_column_stats() methods, including logic for safe Pandas conversion using _safe_convert_table.
    • Introduced the _DtypeAggregators dataclass to hold column-to-dtype mapping and aggregators.
    • Refactored aggregator generation logic, replacing older functions with _numerical_aggregators, _temporal_aggregators, _basic_aggregators, _default_dtype_aggregators, _get_fallback_aggregators, and _get_aggregators_for_dtype.
    • Ensured user-provided dtype_agg_mapping takes precedence over default patterns.
    • Added _parse_summary_stats to process raw aggregation results into schema-matching and schema-changing categories, correctly classifying based on individual stat types.
    • Added _create_pyarrow_array and _build_summary_table for constructing PyArrow tables from parsed statistics, handling type inference and preservation, and robustly creating arrays from potentially problematic data.
  • python/ray/data/tests/test_dataset_stats.py
    • Updated imports to reflect changes in ray.data.stats.
    • Replaced TestFeatureAggregatorsForDataset with TestDtypeAggregatorsForDataset to test the new _dtype_aggregators_for_dataset logic.
    • Added TestIndividualAggregatorFunctions to test _numerical_aggregators, _temporal_aggregators, and _basic_aggregators.
    • Added TestDefaultDtypeAggregators to verify the default mappings.
    • Introduced TestDatasetSummary to thoroughly test the new Dataset.summary() API, covering various scenarios including custom mappings, column filtering, and detailed value verification.
Activity
  • goutamvenkat-anyscale requested a summary of the pull request.
  • A review comment suggested clarifying in the summary() docstring that the std statistic computes population standard deviation (ddof=0), unlike Dataset.std() which defaults to sample standard deviation (ddof=1).
  • A review comment suggested using direct pyarrow.DataType equality comparison (agg_type == original_type) instead of string representation comparison (str(agg_type) == str(original_type)) for robustness in type checking.
  • A bot review identified a potential bug where zip(labels, value) could silently drop statistics if label and value lengths mismatch.
  • A bot review pointed out an issue with format_stats not properly handling empty lists, treating them as scalar results.
  • A bot review highlighted a bug in schema matching where the decision was made based on the overall list type rather than individual formatted stat types.
  • A bot review noted a missing null check in format_stats for list-type aggregations returning None.
  • A bot review found an incomplete list type check in format_stats that incorrectly defaults to pa.float64() for large_list or fixed_size_list types.
  • A bot review identified a bug where user-defined dtype overrides could be shadowed by default pattern-matching types due to the order of mapping application.
  • A bot review reported a missing case for TENSOR types in the _matches_dtype pattern matching logic.
  • A bot review noted that format_stats incorrectly infers scalar type for None values when is_list_type is true but agg_type is not a list.
  • A bot review pointed out that _extract_column_from_table was not using the safe pandas conversion method, bypassing error handling.
  • cem-anyscale asked for clarification on the return behavior when both value and labels are None.

Comment on lines +237 to +264
    def get_result_labels(self) -> Optional[List[str]]:
        """Return labels for list-valued results.

        For aggregators that return list results (e.g., quantiles), this method
        returns meaningful labels for each element in the list. If the aggregator
        returns a scalar result or doesn't have meaningful labels, returns None.

        Returns:
            List of string labels for each element in the result list, or None.
        """
        return None

    def format_stats(
        self, value: Any, agg_type: "pa.DataType", original_type: "pa.DataType"
    ) -> Dict[str, Tuple[Any, "pa.DataType"]]:
        """Format aggregation result into stat entries.

        Takes the raw aggregation result and formats it into one or more stat
        entries. For scalar results, returns a single entry. For list results,
        expands into multiple indexed entries.

        Args:
            value: The aggregation result value
            agg_type: PyArrow type of the aggregation result
            original_type: PyArrow type of the original column

        Returns:
            Dictionary mapping stat names to (value, type) tuples
Contributor


Why is this part of aggregation? This has nothing to do with Aggregations themselves

Contributor Author

@goutamvenkat-anyscale goutamvenkat-anyscale Dec 4, 2025


So each aggregate will have its own labels and formatting structure.

get_result_labels(): the aggregator knows what its list elements mean (e.g., the quantile aggregator knows indices [0, 1, 2] represent ["0.25", "0.5", "0.75"])

I can move format_stats out of aggregate.py
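The labeling scheme being discussed, quantile values as labels rather than positional indices, might look like this sketch (illustrative helper, not the merged implementation):

```python
from typing import Any, Dict, List, Optional

def format_list_stat(
    stat_name: str, value: List[Any], labels: Optional[List[str]]
) -> Dict[str, Any]:
    """Expand a list-valued aggregation into indexed stat entries.

    With labels (e.g. the configured quantiles), entries read
    approx_quantile[0.5]; without them, fall back to positional indices.
    """
    if labels is not None and len(labels) == len(value):
        keys = labels
    else:
        keys = [str(i) for i in range(len(value))]
    return {f"{stat_name}[{k}]": v for k, v in zip(keys, value)}

print(format_list_stat("approx_quantile", [10.0, 42.0], ["0.5", "0.99"]))
# {'approx_quantile[0.5]': 10.0, 'approx_quantile[0.99]': 42.0}
print(format_list_stat("approx_quantile", [10.0, 42.0], None))
# {'approx_quantile[0]': 10.0, 'approx_quantile[1]': 42.0}
```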

Contributor


The output of each aggregation is fixed / configured -- if you passed quantiles [0.5, 0.99], you'd expect to get back a list of 2 values, right?

Contributor Author


Then it will appear as approx_quantile[{idx}] instead of approx_quantile[{quantiles[idx]}], which is also fine.

Contributor Author


Moved the labels function out of aggregates.

Comment on lines +73 to +76
        except (TypeError, ValueError, pa.ArrowInvalid):
            # Cast problematic columns to null type
            null_col = pa.nulls(len(col), type=pa.null())
            result_data[col_name] = null_col.to_pandas()
Contributor


When would this occur?

Contributor Author


import pyarrow as pa
from ray.data.extensions import ArrowTensorType

tensor_type = ArrowTensorType(shape=(2, 2), dtype=pa.float32())

table = pa.Table.from_pydict({
    "image_col": pa.array([None, None], type=tensor_type)
})
col = table.column("image_col")
col.to_pandas()  # raises for the all-null extension-typed column

Contributor Author


is one such example

Comment on lines +141 to +142
if not self.is_arrow_type():
return False
Contributor


Seems redundant

@alexeykudinkin alexeykudinkin merged commit 35c3933 into ray-project:master Dec 5, 2025
6 checks passed
@goutamvenkat-anyscale goutamvenkat-anyscale deleted the goutam/summary_stuff branch December 5, 2025 19:23
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026


Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

3 participants