[train][doc] Document checkpoint_upload_fn backend and support cuda:nccl backend #60541
…ccl backend Signed-off-by: Timothy Seah <tseah@anyscale.com>
Code Review
This pull request enhances the TorchTrainer by adding support for comma-separated backend configurations, specifically for using cuda:nccl alongside other backends. This is achieved by introducing a new _is_backend_nccl helper function and updating the relevant check. The documentation is also updated with an example demonstrating the new usage with checkpoint_upload_fn. The changes are logical and well-tested. I have one suggestion to improve the implementation of the new helper function for better performance and style.
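As a rough illustration of the kind of check described above, here is a minimal sketch of a helper that recognizes NCCL both as a bare backend name and inside a comma-separated `<device_type>:<backend_name>` list. The name `is_backend_nccl` and its exact behavior are assumptions made for illustration; this is not the `_is_backend_nccl` implementation from the PR.

```python
from typing import Optional


def is_backend_nccl(backend: Optional[str]) -> bool:
    """Return True if `backend` selects NCCL, alone or in a device-mapped list.

    Hypothetical sketch; not the actual Ray Train `_is_backend_nccl`.
    """
    if not backend:
        return False
    # Split "cpu:gloo,cuda:nccl" into entries, then keep the backend name
    # that follows the optional "<device_type>:" prefix.
    for entry in backend.split(","):
        name = entry.strip().rpartition(":")[2]
        if name.lower() == "nccl":
            return True
    return False
```

Under these assumptions, both `"nccl"` and `"cpu:gloo,cuda:nccl"` count as NCCL configurations, while `"gloo"`, `"cpu:gloo"`, and `None` do not.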
Signed-off-by: Timothy Seah <tseah@anyscale.com>
    train_loop_config={"num_epochs": 3},
    scaling_config=train.ScalingConfig(num_workers=2, use_gpu=True),
    # we need a cpu backend for async_save and a gpu backend for training
    torch_config=train.torch.TorchConfig(backend="cpu:gloo,cuda:nccl"),
Should we just set this by default?
I'm a bit hesitant since multiple backends is an experimental feature (https://docs.pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group) in torch 2.10.
Hmm, actually it looks like multiple backends is the default in torch 2.10:
Support for multiple backends is experimental. Currently when no backend is specified, both gloo and nccl backends will be created. The gloo backend will be used for collectives with CPU tensors and the nccl backend will be used for collectives with CUDA tensors. A custom backend can be specified by passing in a string with format “<device_type>:<backend_name>,<device_type>:<backend_name>”, e.g. “cpu:gloo,cuda:custom_backend”.
but I'm not sure what our torch/ray train compatibility matrix is.
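To make the backend string format from the quoted PyTorch note concrete, here is a small sketch that turns such a string into a device-to-backend mapping. The function name `parse_backend_config`, the `"*"` key used for a bare backend name, and the error handling are all assumptions for illustration; the default mapping only mirrors the quoted note (gloo for CPU tensors, nccl for CUDA tensors) and is not taken from PyTorch or Ray source code.

```python
from typing import Dict, Optional

# Assumed default, based only on the quoted note: with no backend specified,
# gloo handles CPU tensors and nccl handles CUDA tensors.
DEFAULT_BACKENDS = {"cpu": "gloo", "cuda": "nccl"}


def parse_backend_config(backend: Optional[str]) -> Dict[str, str]:
    """Map device types to backend names for a backend string (hypothetical)."""
    if backend is None:
        return dict(DEFAULT_BACKENDS)
    if ":" not in backend:
        # A bare name such as "nccl" carries no device type; this sketch
        # records it under "*" to mean "whatever devices it supports".
        return {"*": backend.strip().lower()}
    mapping: Dict[str, str] = {}
    for entry in backend.split(","):
        device_type, sep, name = entry.strip().partition(":")
        if not sep or not device_type or not name:
            raise ValueError(f"Expected '<device_type>:<backend_name>', got {entry!r}")
        mapping[device_type.lower()] = name.lower()
    return mapping
```

With this sketch, `"cpu:gloo,cuda:nccl"` and an unspecified backend resolve to the same mapping, which is the situation the thread is weighing when asking whether to set the combined backend by default.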
…ccl backend (ray-project#60541) # Summary This PR makes the following changes: 1) Updates `checkpoint_upload_fn` documentation to mention that `backend="cpu:gloo,cuda:nccl"` is required. Thanks @marwan116 for observing this. 2) Update ray train code that checks for `backend="nccl"` to check `_is_backend_nccl` instead. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ccl backend (ray-project#60541) # Summary This PR makes the following changes: 1) Updates `checkpoint_upload_fn` documentation to mention that `backend="cpu:gloo,cuda:nccl"` is required. Thanks @marwan116 for observing this. 2) Update ray train code that checks for `backend="nccl"` to check `_is_backend_nccl` instead. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: 400Ping <jiekaichang@apache.org>
…ccl backend (ray-project#60541) # Summary This PR makes the following changes: 1) Updates `checkpoint_upload_fn` documentation to mention that `backend="cpu:gloo,cuda:nccl"` is required. Thanks @marwan116 for observing this. 2) Update ray train code that checks for `backend="nccl"` to check `_is_backend_nccl` instead. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
…ccl backend (ray-project#60541) # Summary This PR makes the following changes: 1) Updates `checkpoint_upload_fn` documentation to mention that `backend="cpu:gloo,cuda:nccl"` is required. Thanks @marwan116 for observing this. 2) Update ray train code that checks for `backend="nccl"` to check `_is_backend_nccl` instead. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>
…ccl backend (ray-project#60541) # Summary This PR makes the following changes: 1) Updates `checkpoint_upload_fn` documentation to mention that `backend="cpu:gloo,cuda:nccl"` is required. Thanks @marwan116 for observing this. 2) Update ray train code that checks for `backend="nccl"` to check `_is_backend_nccl` instead. # Testing Unit tests --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Summary
This PR makes the following changes:
1) Updates `checkpoint_upload_fn` documentation to mention that `backend="cpu:gloo,cuda:nccl"` is required. Thanks @marwan116 for observing this.
2) Update ray train code that checks for `backend="nccl"` to check `_is_backend_nccl` instead.
Testing
Unit tests