[rdt] Add CUDA IPC transport#59838
Conversation
Signed-off-by: Avi Basnet <avigyabb@stanford.edu>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-3-214.us-east-2.compute.internal>
```python
        return False

    def actor_has_tensor_transport(self, actor: "ray.actor.ActorHandle") -> bool:
        return torch.cuda.is_available()
```
This will check whether CUDA is available on the driver, not the actor. Either way, long term I'm not sure actor_has_tensor_transport should exist in its current form, because we can't do a ray.get on a .remote call.
Ah good call, thanks. I guess I'll remove this for now and leave a TODO.
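The pitfall the reviewer describes can be sketched as below. The class and method names are hypothetical (not the actual RDT code), and the imports are guarded since `ray`/`torch` may not be installed:

```python
# Hypothetical sketch: the CUDA check must run in the actor's process,
# not the driver's. Guard the imports so the sketch loads anywhere.
try:
    import ray
    import torch
    HAVE_RAY = True
except ImportError:
    HAVE_RAY = False

if HAVE_RAY:
    @ray.remote(num_gpus=1)
    class GpuActor:
        def has_cuda(self) -> bool:
            # Runs inside the actor, so it reflects the actor's
            # CUDA_VISIBLE_DEVICES, not the driver's.
            return torch.cuda.is_available()

    def actor_has_tensor_transport(actor) -> "ray.ObjectRef":
        # Return the ObjectRef instead of blocking on ray.get here,
        # since a blocking ray.get on a .remote call inside library
        # code is exactly what the comment above warns against.
        return actor.has_cuda.remote()
```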
```python
        torch.cuda.current_stream().record_event(event)

        device = gpu_object[0].device
        ray_gpu_idx = ray.get_gpu_ids()[device.index]
```
Is there a guarantee that the torch device index will be the right index into the ray.get_gpu_ids() list?
Ray will set the CUDA_VISIBLE_DEVICES to the assigned GPU IDs, so it should be the right index. I think the only case where it wouldn't be is if the user sets CUDA_VISIBLE_DEVICES themselves after the actor has been created. I'll add a note to the exception.
```python
            raise RuntimeError(
                f"Expected CUDA IPC tensor reconstruction list_args[6] to be device ID, but got {list_args[6]}. Please file an issue at https://github.com/ray-project/ray/issues/new/choose."
            )
        list_args[6] = device.index
```
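The index mapping discussed above can be illustrated with a small stand-in. Here `assigned_gpu_ids` plays the role of `ray.get_gpu_ids()`, and the function name is hypothetical:

```python
# Sketch of the mapping: Ray sets CUDA_VISIBLE_DEVICES to the actor's
# assigned GPUs, so torch's local device index i corresponds to the
# i-th entry of ray.get_gpu_ids().
def local_to_global_gpu(device_index: int, assigned_gpu_ids: list) -> int:
    # torch numbers devices 0..N-1 within CUDA_VISIBLE_DEVICES, in
    # order, so indexing the assigned list recovers the node-global id.
    if device_index >= len(assigned_gpu_ids):
        raise RuntimeError(
            "Device index out of range; this can happen if the user "
            "overrode CUDA_VISIBLE_DEVICES after the actor started."
        )
    return assigned_gpu_ids[device_index]

# Example: Ray assigned physical GPUs 2 and 5; torch device cuda:1
# inside the actor maps to global GPU 5.
print(local_to_global_gpu(1, [2, 5]))  # → 5
```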
```python
    def double(self, data):
        data.mul_(2)
        torch.cuda.synchronize()
```
Why this synchronize? Everything should still work without it.
Hmm yeah can probably remove it.
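The reason the device-wide synchronize is likely redundant: the transport already records a CUDA event on the producing stream (as in the record_event snippet earlier), and waiting on that event is enough to order the kernel ahead of any read through the IPC handle. A hedged, self-contained sketch (guarded since torch/CUDA may be absent):

```python
# Sketch: event-based ordering instead of torch.cuda.synchronize().
try:
    import torch
    TORCH_CUDA = torch.cuda.is_available()
except ImportError:
    TORCH_CUDA = False

if TORCH_CUDA:
    data = torch.ones(4, device="cuda")
    data.mul_(2)  # enqueued asynchronously on the current stream

    # Record an event after the kernel; a consumer stream (or, with
    # interprocess=True, another process) can wait on it instead of
    # forcing a full device synchronize.
    event = torch.cuda.Event()
    torch.cuda.current_stream().record_event(event)
    torch.cuda.current_stream().wait_event(event)

    assert torch.equal(data.cpu(), torch.full((4,), 2.0))
```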
Hi, I'm curious whether Ray can reuse NIXL (UCX) to enable CUDA IPC?
I think UCX will not support this behavior, because here the memory is actually shared, with no copies.
Hmm, actually UCX does seem to support CUDA IPC, but I'm not sure how it's exposed exactly, since it goes through shared memory.
#60076 |
dayshah left a comment:
Needs a couple of updates from the latest merges.
```python
        )

    @staticmethod
    def get_tensor_transport_metadata(
```
this function isn't needed anymore
```python
        tensor_transport_metadata: CudaIpcTransportMetadata,
        communicator_metadata: CudaIpcCommunicatorMetadata,
```
should be the parent type when overriding
```python
    @staticmethod
    def recv_multiple_tensors(
        tensors,
```
tensors isn't a param anymore; you have to return the tensors now.
Resolved review threads on python/ray/experimental/gpu_object_manager/cuda_ipc_transport.py (outdated).
```python
            tensors.append(tensor)
        return tensors

    @staticmethod
```
I think it's inconsistent for these to be static while the parent-class methods they override are not static. Same for all the other statics here that don't match the parent.
```python
    @property
    def tensor_transport_backend(self) -> str:
        return "CUDA_IPC"
```
Property decorator inconsistent with parent method interface
Low Severity
The tensor_transport_backend method uses a @property decorator, but the parent class TensorTransportManager declares it as a regular abstract method. All other implementations (NixlTensorTransport, CollectiveTensorTransport) define it as a regular method. This inconsistency means any polymorphic code calling transport.tensor_transport_backend() as a method would fail for CudaIpcTransport with a TypeError, since the property access returns a string, not a callable.
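The failure mode described above is easy to reproduce with stand-in class names (these are illustrative, not the real Ray classes):

```python
from abc import ABC, abstractmethod

class TransportBase(ABC):
    @abstractmethod
    def tensor_transport_backend(self) -> str:  # declared as a method
        ...

class MethodImpl(TransportBase):
    def tensor_transport_backend(self) -> str:
        return "NIXL"

class PropertyImpl(TransportBase):
    @property
    def tensor_transport_backend(self) -> str:  # property override
        return "CUDA_IPC"

print(MethodImpl().tensor_transport_backend())  # → NIXL
print(PropertyImpl().tensor_transport_backend)  # → CUDA_IPC
try:
    # Polymorphic call syntax: the property access already returned the
    # string, so the trailing () tries to call a str and fails.
    PropertyImpl().tensor_transport_backend()
except TypeError as e:
    print("TypeError:", e)
```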
Adds a CUDA IPC transport to RDT. This relies on an internal torch function to serialize and deserialize a CUDA tensor across different processes. It may break if there are changes to [torch.multiprocessing.reductions](https://github.com/pytorch/pytorch/blob/1495b35d29512f303ab37780760c5e692158514b/torch/multiprocessing/reductions.py), but this seems to be the best stopgap solution.

---------

Signed-off-by: Avi Basnet <avigyabb@stanford.edu>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-3-214.us-east-2.compute.internal>
Signed-off-by: Qiaolin-Yu <liin1211@outlook.com>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Avi Basnet <avigyabb@stanford.edu>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-3-214.us-east-2.compute.internal>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Signed-off-by: Limark Dcunha <limarkdcunha@gmail.com>
One minor issue is that right now the receiver's buffers are allocated outside of the tensor transport manager. Ideally the tensor transport itself should allocate the receiver's buffers, since in this case we don't need to allocate any new buffers on the receiver. Will address this in a followup that updates the tensor transport manager interface for recv_multiple_tensors.
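The internal torch mechanism the PR description refers to can be sketched as below. `reduce_tensor` is torch's internal pickling hook; the function name `export_tensor` is illustrative, the API is internal and may change between versions, and the demo is guarded since torch/CUDA may be absent:

```python
# Sketch of the stopgap: torch's multiprocessing reductions turn a CUDA
# tensor into a rebuild function plus args that embed a cudaIpcMemHandle,
# so a peer process on the same node can map the memory without a copy.
try:
    import torch
    from torch.multiprocessing.reductions import reduce_tensor
    TORCH_CUDA = torch.cuda.is_available()
except ImportError:
    TORCH_CUDA = False

def export_tensor(t):
    # Standard reduce protocol: (rebuild_fn, args). For a CUDA tensor
    # the args carry an IPC handle rather than the tensor's bytes.
    return reduce_tensor(t)

if TORCH_CUDA:
    t = torch.ones(4, device="cuda")
    rebuild_fn, args = export_tensor(t)
    # The receiving process would call rebuild_fn(*args) to open the
    # IPC handle and reconstruct a zero-copy view of the same memory.
    print(rebuild_fn.__name__)
```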