[data][llm][doc] Add in resiliency section and refine doc code #60594

Merged: kouroshHakha merged 4 commits into master from fault-tolerance-doc on Feb 3, 2026

@jeffreywang-anyscale
Contributor

Description

  1. Add a resiliency section that explains row-level and actor-level fault tolerance and the checkpointing feature
  2. Restore the VLM / omni model batch inference examples removed by "[docs][data][llm] Batch inference docs reorg + update to reflect per-stage config refactor" (#59214)
  3. Adjust the doc code examples to align with master's behavior (e.g. prefer chat_template_stage=True over apply_chat_template=True)

@jeffreywang-anyscale jeffreywang-anyscale requested review from a team as code owners January 29, 2026 23:48
@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@jeffreywang-anyscale jeffreywang-anyscale added the go label (add ONLY when ready to merge, run all tests) Jan 29, 2026
@jeffreywang-anyscale
Contributor Author

New resiliency section:
[Image: Screenshot 2026-01-29 at 4 03 43 PM]

…behavior

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
ds = ray.data.read_parquet(input_path)
ds = processor(ds)
ds.write_parquet(output_path)
# __checkpoint_usage_example_end__

Checkpoint demo runs during module import

High Severity

The new checkpoint example executes at import time: it deletes and recreates /tmp/llm_checkpoint_demo/*, sets global ray.data.DataContext checkpoint config, then calls ray.data.read_parquet(input_path) and write_parquet(output_path) without creating any input data. This can fail CI/docs builds and introduces unexpected filesystem and global state side effects.
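One way to address this kind of issue, sketched here under assumptions and not taken from the PR's actual fix, is to move the demo into a function so that importing the module does nothing, to create the input data before reading it, and to use a fresh temporary directory instead of a fixed /tmp path. The function name `run_checkpoint_demo`, its `processor` parameter, and the sample rows are hypothetical; the checkpoint configuration on `ray.data.DataContext` is left out because its exact API does not appear in this excerpt.

```python
import tempfile
from pathlib import Path


def run_checkpoint_demo(processor=None):
    """Run the demo on demand; importing this module has no side effects.

    `processor` is a hypothetical callable that maps a Dataset to a Dataset
    (for example, an LLM processor built elsewhere). It is optional here.
    """
    # Imported lazily so that merely importing this module does not pull in Ray.
    import ray

    # Fresh temporary directory instead of deleting and recreating a fixed path.
    base = Path(tempfile.mkdtemp(prefix="llm_checkpoint_demo_"))
    input_path = str(base / "input")
    output_path = str(base / "output")

    # Create the input data first so read_parquet has something to read.
    rows = [{"id": i, "prompt": f"prompt {i}"} for i in range(8)]
    ray.data.from_items(rows).write_parquet(input_path)

    ds = ray.data.read_parquet(input_path)
    if processor is not None:
        ds = processor(ds)
    ds.write_parquet(output_path)
    return output_path
```

A docs test or an `if __name__ == "__main__":` block would then call `run_checkpoint_demo()` explicitly, which keeps the filesystem writes out of import time.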

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@ray-gardener ray-gardener bot added the docs, data, and llm labels Jan 30, 2026
@jeffreywang-anyscale jeffreywang-anyscale changed the title from "[data][llm][doc] Add in resiliency section and adjust doc code to align with master's behavior" to "[data][llm][doc] Add in resiliency section and refine doc code" Jan 30, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Contributor

@kouroshHakha kouroshHakha left a comment

The docs have not finished building yet, so I can't check the rendered version. For now, leaving this nit:


.. literalinclude:: doc_code/working-with-llms/vlm_image_example.py
   :language: python
   :start-after: def load_vision_dataset():
Contributor

nit: use dedicated tags, rather than function names
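As a rough illustration of the suggestion, the sketch below shows comment tags around the function and a simplified stand-in for how the `:start-after:` and `:end-before:` options pick out lines. The tag names `__vlm_image_dataset_start__` and `__vlm_image_dataset_end__` are hypothetical (modeled on the `__checkpoint_usage_example_end__` marker seen earlier in this PR), and `select_between` is an illustrative helper, not Sphinx's implementation.

```python
# The matching directive would look roughly like this (tag names are assumptions):
#
# .. literalinclude:: doc_code/working-with-llms/vlm_image_example.py
#    :language: python
#    :start-after: __vlm_image_dataset_start__
#    :end-before: __vlm_image_dataset_end__

# Contents of a hypothetical doc_code file, kept in a string so nothing runs.
EXAMPLE_SOURCE = '''\
import ray

# __vlm_image_dataset_start__
def load_vision_dataset():
    ...
# __vlm_image_dataset_end__

def unrelated_helper():
    ...
'''


def select_between(source: str, start_after: str, end_before: str) -> str:
    """Return the lines strictly between the two marker lines (simplified)."""
    selected = []
    inside = False
    for line in source.splitlines():
        if not inside:
            # Skip everything up to and including the start marker line.
            if start_after in line:
                inside = True
            continue
        if end_before in line:
            break
        selected.append(line)
    return "\n".join(selected)


snippet = select_between(
    EXAMPLE_SOURCE, "__vlm_image_dataset_start__", "__vlm_image_dataset_end__"
)
print(snippet)
```

With tags, the `def` line itself is part of the rendered snippet and renaming the function does not break the include, whereas `:start-after: def load_vision_dataset():` skips the matching line and depends on the function name staying the same.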

Contributor Author

@jeffreywang-anyscale jeffreywang-anyscale Jan 31, 2026

Addressed in the latest revision and confirmed that the doc rendered properly.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@kouroshHakha kouroshHakha enabled auto-merge (squash) January 31, 2026 00:33
Member

@bveeramani bveeramani left a comment

Stamp

@kouroshHakha kouroshHakha merged commit 0dec286 into master Feb 3, 2026
7 checks passed
@kouroshHakha kouroshHakha deleted the fault-tolerance-doc branch February 3, 2026 00:41
rayhhome pushed a commit to rayhhome/ray that referenced this pull request Feb 4, 2026
…roject#60594)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…roject#60594)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…roject#60594)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>