[train][doc] Document checkpoint_upload_fn backend and support cuda:nccl backend #60541

Merged
matthewdeng merged 2 commits into ray-project:master from TimothySeah:tseah/update-checkpoint-upload-fn-docs
Jan 29, 2026
@TimothySeah
Contributor

Summary

This PR makes the following changes:

  1. Updates the checkpoint_upload_fn documentation to mention that backend="cpu:gloo,cuda:nccl" is required. Thanks @marwan116 for observing this.
  2. Updates the Ray Train code that checks for backend="nccl" to call the _is_backend_nccl helper instead, so that comma-separated backends such as "cpu:gloo,cuda:nccl" are also recognized.

Testing

Unit tests

…ccl backend

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah requested a review from a team as a code owner January 27, 2026 19:03
@gemini-code-assist bot left a comment

Code Review

This pull request enhances the TorchTrainer by adding support for comma-separated backend configurations, specifically for using cuda:nccl alongside other backends. This is achieved by introducing a new _is_backend_nccl helper function and updating the relevant check. The documentation is also updated with an example demonstrating the new usage with checkpoint_upload_fn. The changes are logical and well-tested. I have one suggestion to improve the implementation of the new helper function for better performance and style.
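
For readers who want a concrete picture of the check described above, here is a minimal sketch. It is not the actual `_is_backend_nccl` from Ray Train, whose body is not shown on this page; the name `is_backend_nccl` and the parsing rules are assumptions made for illustration. It treats the plain string "nccl" and any comma-separated entry of the form "cuda:nccl" as an NCCL backend.

```python
def is_backend_nccl(backend: str) -> bool:
    """Return True if the backend string selects NCCL for CUDA tensors.

    Hypothetical sketch only; not Ray Train's real `_is_backend_nccl`.
    """
    normalized = backend.strip().lower()

    # Legacy single-backend form.
    if normalized == "nccl":
        return True

    # Comma-separated form: "<device_type>:<backend_name>,..."
    for entry in normalized.split(","):
        device, sep, name = entry.strip().partition(":")
        if sep and device == "cuda" and name == "nccl":
            return True
    return False


if __name__ == "__main__":
    for candidate in ("nccl", "cpu:gloo,cuda:nccl", "gloo", "cpu:gloo"):
        print(candidate, is_backend_nccl(candidate))
```

With a check like this, code that previously compared against the literal "nccl" would keep working when a user passes the combined "cpu:gloo,cuda:nccl" string.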

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@ray-gardener bot added the "train" (Ray Train Related Issue) label Jan 27, 2026
train_loop_config={"num_epochs": 3},
scaling_config=train.ScalingConfig(num_workers=2, use_gpu=True),
# we need a cpu backend for async_save and a gpu backend for training
torch_config=train.torch.TorchConfig(backend="cpu:gloo,cuda:nccl"),

Contributor

Should we just set this by default?

Contributor Author

I'm a bit hesitant since multiple backends is an experimental feature (https://docs.pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) in torch 2.10.

Hmm, actually it looks like multiple backends is the default in torch 2.10:

Support for multiple backends is experimental. Currently when no backend is specified, both gloo and nccl backends will be created. The gloo backend will be used for collectives with CPU tensors and the nccl backend will be used for collectives with CUDA tensors. A custom backend can be specified by passing in a string with format “<device_type>:<backend_name>,<device_type>:<backend_name>”, e.g. “cpu:gloo,cuda:custom_backend”.

but I'm not sure what our torch/ray train compatibility matrix is.
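
To make the quoted format concrete, here is a small sketch that turns a "<device_type>:<backend_name>,..." string into a device-to-backend mapping. It is illustrative only and is not PyTorch's or Ray's parser; the names `parse_backend_string` and `DEFAULT_BACKENDS` are hypothetical. The default mapping reflects the quoted docs (gloo for CPU tensors, nccl for CUDA tensors when no backend is specified). Bare names such as "nccl" are deliberately rejected, because the quote does not say how they map to devices.

```python
from typing import Dict, Optional

# Assumption taken from the quoted docs: with no backend specified,
# gloo handles CPU tensors and nccl handles CUDA tensors.
DEFAULT_BACKENDS: Dict[str, str] = {"cpu": "gloo", "cuda": "nccl"}


def parse_backend_string(backend: Optional[str]) -> Dict[str, str]:
    """Map device types to backend names (hypothetical helper)."""
    if backend is None:
        return dict(DEFAULT_BACKENDS)

    mapping: Dict[str, str] = {}
    for entry in backend.split(","):
        device, sep, name = entry.strip().partition(":")
        if not sep or not device or not name:
            raise ValueError(
                f"expected '<device_type>:<backend_name>', got {entry!r}"
            )
        mapping[device] = name
    return mapping


if __name__ == "__main__":
    print(parse_backend_string("cpu:gloo,cuda:nccl"))
    print(parse_backend_string(None))
```

Under this reading, passing backend="cpu:gloo,cuda:nccl" explicitly yields the same mapping as the documented default, which is why the question of setting it by default comes down to which torch versions Ray Train supports.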

@matthewdeng enabled auto-merge (squash) January 29, 2026 19:52
@github-actions bot added the "go" (add ONLY when ready to merge, run all tests) label Jan 29, 2026
@matthewdeng merged commit 1dff47a into ray-project:master Jan 29, 2026
8 checks passed