[Data] Add Checkpointing to Ray Data#59409

Merged
raulchen merged 8 commits into ray-project:master from owenowenisme:data/add-checkpoint-in-ray-data
Jan 2, 2026
Conversation

@owenowenisme
Member

@owenowenisme owenowenisme commented Dec 12, 2025

Description

Add checkpointing for data pipelines. Currently, only pipelines that start with `Read` and end with `Write` are supported.

Example Script:

```py
import os

import ray
import pandas as pd
from ray.data.checkpoint import CheckpointConfig

# Setup paths
base_dir = "/tmp/ray_checkpoint_demo"
input_path = os.path.join(base_dir, "input")
output_path = os.path.join(base_dir, "output")
checkpoint_path = os.path.join(base_dir, "checkpoint")
os.makedirs(input_path, exist_ok=True)

# Create sample data (10 rows with unique IDs)
df = pd.DataFrame({"id": range(10), "value": [f"row_{i}" for i in range(10)]})
df.to_parquet(os.path.join(input_path, "data.parquet"), index=False)
print("Created 10 rows of sample data")

def run_pipeline(fail_on_id_gt=None):
    """Run the pipeline with an optional simulated failure."""
    ctx = ray.data.DataContext.get_current()
    ctx.checkpoint_config = CheckpointConfig(
        id_column="id",
        checkpoint_path=checkpoint_path,
        delete_checkpoint_on_success=False,
    )

    def process_batch(batch):
        if fail_on_id_gt is not None and max(batch["id"]) > fail_on_id_gt:
            raise RuntimeError(f"Simulated failure at id > {fail_on_id_gt}")
        if fail_on_id_gt is not None:
            batch["info"] = ["checkpointed from first run"] * len(batch["id"])
        else:
            batch["info"] = ["not checkpointed from first run"] * len(batch["id"])
        return batch

    ds = ray.data.read_parquet(input_path)
    ds = ds.map_batches(process_batch, batch_size=5)
    ds.write_parquet(output_path)

# Run 1: Fail after processing some rows
print("\n=== Run 1: Pipeline with simulated failure ===")
try:
    run_pipeline(fail_on_id_gt=5)
except Exception as e:
    print(f"Failed as expected: {e}")

# Run 2: Resume from checkpoint
print("\n=== Run 2: Resume from checkpoint ===")
run_pipeline(fail_on_id_gt=None)  # No failure

# Verify results
print("\n=== Results ===")
result = ray.data.read_parquet(output_path)
print(f"Total rows in output: {result.count()}")
print(f"Result: {result.take_all()}")
```

```
{'id': 0, 'value': 'row_0', 'info': 'checkpointed from first run'}
{'id': 1, 'value': 'row_1', 'info': 'checkpointed from first run'}
{'id': 2, 'value': 'row_2', 'info': 'checkpointed from first run'}
{'id': 3, 'value': 'row_3', 'info': 'checkpointed from first run'}
{'id': 4, 'value': 'row_4', 'info': 'checkpointed from first run'}
{'id': 5, 'value': 'row_5', 'info': 'not checkpointed from first run'}
{'id': 6, 'value': 'row_6', 'info': 'not checkpointed from first run'}
{'id': 7, 'value': 'row_7', 'info': 'not checkpointed from first run'}
{'id': 8, 'value': 'row_8', 'info': 'not checkpointed from first run'}
{'id': 9, 'value': 'row_9', 'info': 'not checkpointed from first run'}
```
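The resume behavior above hinges on the `id_column`: on the second run, rows whose ids were already written are skipped rather than reprocessed. A minimal, Ray-free sketch of that filtering idea (the helper name `filter_unprocessed` and the in-memory id set are illustrative, not Ray Data internals):

```python
import pandas as pd

def filter_unprocessed(batch: pd.DataFrame, checkpointed_ids: set) -> pd.DataFrame:
    """Keep only rows whose id has not been checkpointed yet (illustrative)."""
    return batch[~batch["id"].isin(checkpointed_ids)]

# ids 0-4 were persisted by the failed first run
checkpointed = {0, 1, 2, 3, 4}
batch = pd.DataFrame({"id": range(10)})

remaining = filter_unprocessed(batch, checkpointed)
print(sorted(remaining["id"]))  # only rows 5..9 still need processing
```

In the real feature the checkpointed ids live under `checkpoint_path` rather than in memory, which is why a fresh process can resume after a crash.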

Related issues

Closes #55008

Additional information

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme requested a review from a team as a code owner December 12, 2025 18:32
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust checkpointing mechanism to Ray Data, a significant feature for ensuring fault tolerance and efficient recovery in data pipelines. The implementation is comprehensive, touching upon planning, execution, configuration, and testing. Key additions include a CheckpointConfig for setup, dynamic plan adjustments in the Planner to inject checkpointing logic, and specific operators for filtering processed rows and writing checkpoint data. The use of an ExecutionCallback for managing the checkpoint lifecycle is a clean approach. The code is well-structured, and the tests are thorough, covering various success and failure scenarios. I have a few suggestions to enhance maintainability, improve memory efficiency, and make the configuration handling more robust. Overall, this is a very strong and valuable addition to Ray Data.
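The review above mentions an `ExecutionCallback` managing the checkpoint lifecycle. That pattern can be pictured roughly as follows; this is a standalone sketch with invented class and method names, not the actual Ray Data API:

```python
import os
import shutil
import tempfile

class CheckpointLifecycleCallback:
    """Illustrative lifecycle hook: clean up the checkpoint dir only on success."""

    def __init__(self, checkpoint_path: str, delete_on_success: bool = True):
        self.checkpoint_path = checkpoint_path
        self.delete_on_success = delete_on_success

    def after_execution_succeeds(self):
        # The job finished, so the checkpoint is no longer needed.
        if self.delete_on_success:
            shutil.rmtree(self.checkpoint_path, ignore_errors=True)

    def after_execution_fails(self, error: Exception):
        # Keep the checkpoint so the next run can resume from it.
        pass

ckpt_dir = tempfile.mkdtemp()
cb = CheckpointLifecycleCallback(ckpt_dir, delete_on_success=True)
cb.after_execution_succeeds()
print(os.path.exists(ckpt_dir))  # the directory was removed on success
```

Setting `delete_checkpoint_on_success=False`, as the example script does, corresponds to skipping the cleanup branch so the checkpoint can be inspected after the run.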

@ray-gardener ray-gardener bot added the data Ray Data-related issues label Dec 12, 2025
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme force-pushed the data/add-checkpoint-in-ray-data branch from 6e66b8f to b7da48d Compare December 13, 2025 08:20
@wxwmd
Contributor

wxwmd commented Dec 23, 2025

good feature. exactly what i need

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
@owenowenisme owenowenisme added the go add ONLY when ready to merge, run all tests label Dec 30, 2025
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Contributor

@raulchen raulchen left a comment

stamping. already reviewed offline

@raulchen raulchen merged commit 566a5fa into ray-project:master Jan 2, 2026
6 checks passed
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
lee1258561 pushed a commit to pinterest/ray that referenced this pull request Feb 3, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
bveeramani added a commit that referenced this pull request Feb 11, 2026
This PR documents Ray Data job-level checkpointing functionality added
in #59409.

Adds documentation explaining job-level checkpointing and its
application to offline batch inference, including configuration
examples.

**Sections modified:**
- Execution Configurations
- End-to-end: Offline Batch Inference

Supersedes #60289

Fixes #60250

---------

Signed-off-by: Abhishek Kumar <anonyomoushunter@gmail.com>
Signed-off-by: “Alex <alexchien130@gmail.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Abhishek Kumar <anonyomoushunter@gmail.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
preneond pushed a commit to preneond/ray that referenced this pull request Feb 15, 2026
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Feb 17, 2026
preneond pushed a commit to preneond/ray that referenced this pull request Feb 17, 2026
MuhammadSaif700 pushed a commit to MuhammadSaif700/ray that referenced this pull request Feb 17, 2026
Kunchd pushed a commit to Kunchd/ray that referenced this pull request Feb 17, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Labels

data — Ray Data-related issues; go — add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] Add Checkpoint/Resume Support for Ray Data Pipelines

3 participants