[Data] [1/n] - Iceberg Schema Evolution #59210
alexeykudinkin merged 18 commits into ray-project:master
Conversation
Code Review
This pull request introduces a significant improvement to the Iceberg datasink by adding support for schema evolution. The changes are well-structured, and the refactoring of IcebergDatasink makes it more robust and easier to understand. The addition of comprehensive tests for schema evolution is also a great contribution.
I have a couple of suggestions for minor optimizations in the on_write_complete method to improve efficiency by reducing redundant operations. Overall, this is an excellent pull request.
```python
"""
Update the table schema to accommodate incoming data using union-by-name semantics.

This is called from the driver after reconciling all schemas.
"""
```
Let's make it clear that this can only be called from the driver.
(Also think about how we can assert that it's only called from the driver.)
`is_driver = ray.get_runtime_context().worker.mode != WORKER_MODE` should work
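A minimal sketch of the suggested guard. Note the assumptions: `WORKER_MODE` and `get_runtime_context().worker` are Ray internals (private API, may change between Ray versions), and the `get_mode` hook is hypothetical, added only so the check can be exercised without a running Ray cluster:

```python
def assert_on_driver(get_mode=None) -> None:
    """Raise unless the caller is running in the driver process.

    `get_mode` is an injectable hook (hypothetical, for testability);
    by default it reads the current process's mode from Ray internals.
    """
    if get_mode is None:
        def get_mode():
            # Assumption: these are private Ray APIs, per the reviewer's
            # suggestion; they may move between Ray versions.
            import ray
            from ray._private.worker import WORKER_MODE

            return ray.get_runtime_context().worker.mode, WORKER_MODE

    mode, worker_mode = get_mode()
    if mode == worker_mode:
        raise RuntimeError(
            "_update_schema performs schema evolution and must only be "
            "called from the driver process"
        )
```

`_update_schema` could then call `assert_on_driver()` at entry; in unit tests, `get_mode` can be stubbed to simulate either process type.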
Signed-off-by: Goutam <goutam@anyscale.com>
bb2f196 to ac28c03
Signed-off-by: Goutam <goutam@anyscale.com>
Bug: Callback skipped for small datasets in all_inputs_done
The _on_first_input_callback is only invoked in _add_input_inner() but not in all_inputs_done(). When processing small datasets where all bundles don't meet the min_rows_per_bundle threshold during normal processing, the bundles are deferred to all_inputs_done(), which calls _add_bundled_input() directly without invoking the callback. For IcebergDatasink, this means on_write_start() (which handles schema evolution) is never called for small datasets, potentially causing write failures when incoming data has new columns.
ray/python/ray/data/_internal/execution/operators/map_operator.py, lines 566 to 576 in 318f40c
ray/python/ray/data/_internal/planner/plan_write_op.py, lines 136 to 145 in 318f40c
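A toy model of the reported control flow (class and method names below are simplified, hypothetical stand-ins for the real `MapOperator` code, not the Ray Data implementation): on the normal path the first-input callback fires only when a bundle is actually submitted, so a dataset smaller than `min_rows_per_bundle` reaches the end-of-input flush without ever triggering it. The sketch also shows the fix: fire the callback on the flush path too.

```python
class BundlingOperator:
    """Simplified stand-in for MapOperator's bundling logic (hypothetical)."""

    def __init__(self, min_rows_per_bundle, on_first_input):
        self._min_rows = min_rows_per_bundle
        self._on_first_input = on_first_input  # e.g. triggers on_write_start()
        self._fired = False
        self._pending = []
        self.submitted = []

    def _maybe_fire_callback(self):
        if not self._fired:
            self._fired = True
            self._on_first_input()

    def add_input(self, rows):
        # Normal path: the callback fires only when a bundle is submitted.
        self._pending.extend(rows)
        if len(self._pending) >= self._min_rows:
            self._maybe_fire_callback()
            self.submitted.append(self._pending)
            self._pending = []

    def all_inputs_done(self):
        # Deferred path: without the _maybe_fire_callback() call below,
        # a small dataset would skip on_write_start() entirely (the bug).
        if self._pending:
            self._maybe_fire_callback()
            self.submitted.append(self._pending)
            self._pending = []
```

With the call in `all_inputs_done()`, the callback is guaranteed to run before any bundle is submitted on either path.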
Signed-off-by: Goutam <goutam@anyscale.com>
alexeykudinkin left a comment
LGTM, minor comments
Signed-off-by: Goutam <goutam@anyscale.com>
/gemini summary
Summary of Changes

This pull request significantly enhances Ray Data's integration with Apache Iceberg by upgrading the underlying PyIceberg library and introducing robust schema evolution capabilities. Users can now append data to Iceberg tables with varying schemas: the system automatically adapts the table's schema to accommodate new columns and promotes types as needed, simplifying data ingestion workflows.

Highlights
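Union-by-name semantics can be illustrated with a plain-Python model. This is an illustrative sketch, not the PyIceberg implementation, and the promotion table covers only two of the widenings Iceberg permits: fields are matched by name, new fields are appended, and an existing field's type may be widened.

```python
# Illustrative model of union-by-name schema merging (not PyIceberg's code).
# Schemas are modeled as dicts of field name -> type name.
PROMOTIONS = {
    ("int32", "int64"): "int64",        # int can widen to long
    ("float32", "float64"): "float64",  # float can widen to double
}

def union_by_name(table_schema, incoming_schema):
    merged = dict(table_schema)
    for name, typ in incoming_schema.items():
        if name not in merged:
            merged[name] = typ  # new column: append it
        elif merged[name] != typ:
            promoted = PROMOTIONS.get((merged[name], typ))
            if promoted is None:
                raise TypeError(f"cannot promote {merged[name]} to {typ} for {name!r}")
            merged[name] = promoted  # existing column: widen the type
    return merged
```

For example, merging a table schema `{"id": "int32"}` with incoming data shaped `{"id": "int64", "name": "string"}` both promotes `id` and appends `name`.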
Signed-off-by: Goutam <goutam@anyscale.com>
```python
def _update_schema(self, incoming_schema: "pa.Schema") -> None:
    """
    Update the table schema to accommodate incoming data using union-by-name semantics.

    .. warning::
        This method must only be called from the driver process.
        It performs schema evolution which requires exclusive table access.

    Args:
        incoming_schema: The PyArrow schema to merge with the table schema
    """
    with self._table.update_schema() as update:
        update.union_by_name(incoming_schema)
    # Succeeded, reload to get latest table version and exit.
    self._reload_table()
```

Surrounding context from the same diff:

```python
property_as_bool = PropertyUtil.property_as_bool

catalog = self._get_catalog()
table = catalog.load_table(self.table_identifier)
self._txn = table.transaction()
self._io = self._txn._table.io
self._table_metadata = self._txn.table_metadata
self._uuid = uuid.uuid4()

if unsupported_partitions := [
    field
    for field in self._table_metadata.spec().fields
    if not field.transform.supports_pyarrow_transform
]:
    raise ValueError(
        f"Not all partition types are supported for writes. "
        f"Following partitions cannot be written using pyarrow: {unsupported_partitions}."
    )

self._manifest_merge_enabled = property_as_bool(
    self._table_metadata.properties,
    TableProperties.MANIFEST_MERGE_ENABLED,
    TableProperties.MANIFEST_MERGE_ENABLED_DEFAULT,
)
```
This is used in one place; let's inline it.
```python
# Reload table to get latest metadata
self._reload_table()
```
Why do we need to reload?
Can remove this.
```python
    )
    assert rows_same(result_df, expected)


def test_multiple_schema_evolutions(self, clean_table):
```
Let's add a test promoting a type (as a separate test).
Added `test_schema_evolution_type_promotion`
Signed-off-by: Goutam <goutam@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>