
[rdt] Add CUDA IPC transport #59838

Merged
stephanie-wang merged 42 commits into ray-project:master from stephanie-wang:ipc
Jan 16, 2026

Conversation

@stephanie-wang
Contributor

Adds a CUDA IPC transport to RDT. This relies on an internal torch function to serialize and deserialize a CUDA tensor across different processes. It may break if there are changes to torch.multiprocessing.reductions, but this seems to be the best stopgap solution.

One minor issue: right now the receiver's buffers are allocated outside of the tensor transport manager. Ideally the tensor transport itself should allocate the receiver's buffers, since in this case we don't need to allocate any new buffers on the receiver. Will address this in a followup that updates the tensor transport manager interface for recv_multiple_tensors.
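The reduce/rebuild mechanism the description leans on can be sketched in a few lines. This is a hedged illustration using a CPU tensor so it runs without a GPU; CUDA tensors go through a different branch of the same function that exports a CUDA IPC handle instead, but the serialize-to-(function, args) shape is the same.

```python
# Sketch of the torch.multiprocessing.reductions reduce/rebuild pattern.
# Shown with a CPU tensor; for CUDA tensors the same function returns a
# rebuild function plus a CUDA IPC handle rather than the storage itself.
import torch
from torch.multiprocessing.reductions import reduce_tensor

t = torch.arange(4, dtype=torch.float32)

# "Serialize": a picklable (rebuild_fn, args) pair that another process
# (or, here, the same process) can use to reattach to the same storage.
rebuild_fn, args = reduce_tensor(t)
clone = rebuild_fn(*args)

# The memory is shared, not copied: writes through one view are visible
# through the other.
clone.mul_(2)
assert torch.equal(t, torch.tensor([0.0, 2.0, 4.0, 6.0]))
```

This is also why the transport is fragile: the exact contents of `args` are an internal detail of torch.multiprocessing.reductions and can change between releases.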

avigyabb and others added 30 commits August 18, 2025 23:44
Signed-off-by: Avi Basnet <avigyabb@stanford.edu>
Signed-off-by: Ubuntu <ubuntu@ip-172-31-3-214.us-east-2.compute.internal>
Signed-off-by: Qiaolin-Yu <liin1211@outlook.com>
ray-gardener bot added the community-contribution (Contributed by the community) label Jan 5, 2026
return False

def actor_has_tensor_transport(self, actor: "ray.actor.ActorHandle") -> bool:
return torch.cuda.is_available()
Contributor

This will check if CUDA is available on the driver, not the actor. Either way, long term I'm not sure actor_has_tensor_transport should exist in its current form, because we can't do a ray.get on a .remote call.

Contributor Author

Ah good call, thanks. I guess I'll remove this for now and leave a TODO.

torch.cuda.current_stream().record_event(event)

device = gpu_object[0].device
ray_gpu_idx = ray.get_gpu_ids()[device.index]
Contributor

Is there a guarantee that the torch device index will be the right index into the ray GPU IDs list?

Contributor Author

Ray will set the CUDA_VISIBLE_DEVICES to the assigned GPU IDs, so it should be the right index. I think the only case where it wouldn't be is if the user sets CUDA_VISIBLE_DEVICES themselves after the actor has been created. I'll add a note to the exception.
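The mapping described here can be illustrated without a Ray cluster. The GPU IDs and device index below are hypothetical; the real mechanisms are ray.get_gpu_ids() and Ray's setting of CUDA_VISIBLE_DEVICES for the actor.

```python
# Hedged illustration of why tensor.device.index indexes correctly into
# ray.get_gpu_ids(): Ray sets CUDA_VISIBLE_DEVICES to the actor's assigned
# GPUs, so torch renumbers exactly those devices 0..n-1 in the same order.
assigned_gpu_ids = ["3", "5"]   # hypothetical ray.get_gpu_ids() result
# Ray exports CUDA_VISIBLE_DEVICES="3,5" for this actor, so torch sees:
#   cuda:0 -> physical GPU 3, cuda:1 -> physical GPU 5
torch_device_index = 1          # e.g. tensor.device.index for a cuda:1 tensor
physical_gpu = assigned_gpu_ids[torch_device_index]
assert physical_gpu == "5"      # only breaks if the user re-sets
                                # CUDA_VISIBLE_DEVICES after actor creation
```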

raise RuntimeError(
f"Expected CUDA IPC tensor reconstruction list_args[6] to be device ID, but got {list_args[6]}. Please file an issue at https://github.com/ray-project/ray/issues/new/choose."
)
list_args[6] = device.index
Contributor

😭

Contributor Author

Yeah :(


def double(self, data):
data.mul_(2)
torch.cuda.synchronize()
Contributor

Why this synchronize? Everything should still work without it.

Contributor Author

Hmm yeah can probably remove it.

@tianyi-ge
Contributor

Hi, I'm curious if ray can reuse nixl (ucx) to enable cuda_ipc? Is torch.multiprocessing.reductions better on performance?

@stephanie-wang
Contributor Author

Hi, I'm curious if ray can reuse nixl (ucx) to enable cuda_ipc? Is torch.multiprocessing.reductions better on performance?

I think UCX will not support this behavior because the memory is actually shared, no copies.

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
@stephanie-wang
Contributor Author

Hi, I'm curious if ray can reuse nixl (ucx) to enable cuda_ipc? Is torch.multiprocessing.reductions better on performance?

I think UCX will not support this behavior because the memory is actually shared, no copies.

Hmm actually it does seem to support CUDA IPC but I'm not sure how it's exposed exactly since it is through shared memory.

@dayshah
Contributor

dayshah commented Jan 13, 2026

#60076: allowing recv to create the tensors

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
@stephanie-wang added the go (add ONLY when ready to merge, run all tests) label Jan 14, 2026
@dayshah (Contributor) left a comment

needs a couple updates from the latest merges

)

@staticmethod
def get_tensor_transport_metadata(
Contributor

this function isn't needed anymore

Comment on lines +148 to +149
tensor_transport_metadata: CudaIpcTransportMetadata,
communicator_metadata: CudaIpcCommunicatorMetadata,
Contributor

should be the parent type when overriding


@staticmethod
def recv_multiple_tensors(
tensors,
Contributor

tensors isn't a param anymore, have to return the tensors now

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
tensors.append(tensor)
return tensors

@staticmethod
Contributor

I think it's inconsistent for these to be static while the parent class methods they override are not. Same for all the other statics here that don't match the parent.

@dayshah (Contributor) left a comment

🚢

Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>

@property
def tensor_transport_backend(self) -> str:
return "CUDA_IPC"

Property decorator inconsistent with parent method interface

Low Severity

The tensor_transport_backend method uses a @property decorator, but the parent class TensorTransportManager declares it as a regular abstract method. All other implementations (NixlTensorTransport, CollectiveTensorTransport) define it as a regular method. This inconsistency means any polymorphic code calling transport.tensor_transport_backend() as a method would fail for CudaIpcTransport with a TypeError, since accessing the property already returns a string, not a callable.
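A minimal sketch of the mismatch this note describes. The class names below are stand-ins, not the actual Ray classes:

```python
# Stand-in classes illustrating the property-vs-method mismatch; the real
# classes are TensorTransportManager and its transport implementations.
class Parent:
    def tensor_transport_backend(self) -> str:  # declared as a plain method
        raise NotImplementedError

class AsMethod(Parent):                         # the NixlTensorTransport style
    def tensor_transport_backend(self) -> str:
        return "NIXL"

class AsProperty(Parent):                       # the CudaIpcTransport style
    @property
    def tensor_transport_backend(self) -> str:
        return "CUDA_IPC"

assert AsMethod().tensor_transport_backend() == "NIXL"
try:
    AsProperty().tensor_transport_backend()     # the property already gave a str
except TypeError:                               # 'str' object is not callable
    pass
```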


@stephanie-wang stephanie-wang merged commit 5ee47bd into ray-project:master Jan 16, 2026
5 of 6 checks passed
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 18, 2026
jeffery4011 pushed a commit to jeffery4011/ray that referenced this pull request Jan 20, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

community-contribution (Contributed by the community), core (Issues that should be addressed in Ray Core), data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), gpu (GPU related issues), train (Ray Train Related Issue)


7 participants