[core][rdt] Register your own transport at runtime for RDT #59255
dayshah merged 42 commits into ray-project:master
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request introduces a new public API, register_tensor_transport, to allow registering custom tensor transports for RDT at runtime. This is a valuable feature for extensibility. The implementation refactors TensorTransportEnum to use strings for transport types and migrates existing transports (NIXL, NCCL, GLOO) to the new registration API. The documentation is also updated accordingly.
However, I've identified a couple of issues. The most significant is that ray.put and ray.get still contain hardcoded checks that only permit "NIXL" and "OBJECT_STORE", which undermines the goal of supporting custom transports. Additionally, there's some duplicated code for validating transport names across different files. My review provides specific feedback to address these points.
Well, the recv is on the system concurrency group and the extract will be on the main thread after the last PR, so it's hard to guarantee arrival before both. And if it's out of order...
My thought is that the user will define it somewhere on their driver process after importing Ray, like the test does. We can't get the test to work either unless we pickle by value.
How about blocking on the register task finishing before submitting any other tasks? If it's only for actors with custom transports enabled, that seems like an OK tradeoff right now.
Ok, so I updated to do this ^. I outlined the limitations in the doc: you can't borrow the actor ref and submit from another worker, ordering guarantees are off, and actor restarts don't work. Also made
class TransportManagerInfo(NamedTuple):
    transport_manager_class: type[TensorTransportManager]
    # list of supported device types for the transport
    devices: List[str]
I wonder if you want to fold this into the TensorTransportManager class instead?
I want to try to keep the TensorTransportManager class/file easy to read for users since they'll be looking at it. They don't need to worry about this part.
...
register_tensor_transport("CUSTOM", ["cuda", "cpu"], CustomTransport)
Would be great to add another codeblock under this showing how you would then use the tensor transport in an actor class (and ideally break it up with text explaining that).
You could do it in a followup PR but it'd be great to walk through an actual example of a tensor transport manager implementation.
ya will do this in a follow-up and split it into its own page
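In the meantime, a sketch of the kind of example being requested, i.e. how an actor class would then consume the registered transport. The `method` decorator below is a stand-in for `@ray.method(tensor_transport=...)` so the sketch runs without a Ray cluster, and the "CUSTOM" transport name is hypothetical:

```python
# Illustrative only: stand-in decorator mimicking how a tensor transport
# would be attached to an actor method; not Ray's actual API surface.

def method(tensor_transport=None):
    """Stand-in for ray.method: tags a method with its tensor transport."""
    def wrap(fn):
        fn._tensor_transport = tensor_transport
        return fn
    return wrap


class GPUActor:
    """With real Ray this class would carry the remote/actor decorator."""

    @method(tensor_transport="CUSTOM")
    def produce(self):
        # The return value would travel over the "CUSTOM" transport
        # rather than the object store.
        return "tensor-payload"

    def consume(self, tensor):
        # Receives the tensor shipped by the custom transport.
        return len(tensor)
```

The follow-up doc page would replace the stand-ins with the real decorators and a working transport implementation.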
To implement a new tensor transport, implement the abstract interface :class:`ray.experimental.TensorTransportManager <ray.experimental.TensorTransportManager>`
defined in `tensor_transport_manager.py <https://github.com/ray-project/ray/blob/master/python/ray/experimental/gpu_object_manager/tensor_transport_manager.py>`__.
Then call `register_tensor_transport <ray.experimental.register_tensor_transport>` with the transport name, supported devices for the transport,
and the class that implements `TensorTransportManager`. Note that you have to register from the same process in which you create the actor you want
does this mean you have to register the custom transport on both src/dst actor involved in the transfer?
ya, we register it on both
assert isinstance(
    communicator_metadata, CollectiveCommunicatorMetadata
), "metadata must be a CollectiveCommunicatorMetadata object for non-NIXL transport"
assert isinstance(tensor_transport_metadata, CollectiveTransportMetadata)
Why were the error logs removed?
The error log doesn't add anything (it just says it expects another type, which Python will say anyway), it was outdated (we support more than NIXL), and it's just typing enforcement that should be impossible to hit.
obj_id: str,
tensor_transport_metadata: CollectiveTransportMetadata,
communicator_metadata: CollectiveCommunicatorMetadata,
tensor_transport_metadata: TensorTransportMetadata,
Why did the types change to the base class here? Don't these all still have to be CollectiveTransportMetadata?
Python type checkers complain about it otherwise; when overriding, the method signature is supposed to stay the same, including parameter types.
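For reference, this is the parameter-contravariance rule type checkers apply to overrides. The classes below are simplified stand-ins for the real metadata types, not the actual Ray definitions:

```python
# Simplified stand-ins showing why the override keeps the base annotation:
# an override annotated with a *narrower* parameter type is flagged by
# mypy/pyright as an incompatible override.

class TensorTransportMetadata:
    """Base metadata type used in the abstract interface."""


class CollectiveTransportMetadata(TensorTransportMetadata):
    """Transport-specific subclass."""


class TensorTransportManager:
    def recv(self, meta: TensorTransportMetadata) -> str:
        raise NotImplementedError


class CollectiveManager(TensorTransportManager):
    # Annotating `meta: CollectiveTransportMetadata` here would make type
    # checkers flag the override. Keep the base annotation and narrow at
    # runtime with isinstance, as the diff above does.
    def recv(self, meta: TensorTransportMetadata) -> str:
        assert isinstance(meta, CollectiveTransportMetadata)
        return type(meta).__name__
```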
ref = register_custom_tensor_transports_on_actor(actor)
# ref is None if there are no custom transports registered.
self.actor_id_to_transports_registered[actor._actor_id] = (
    True if ref is None else ref
)
Doesn't the comment above say the value is True if the actor has registered the custom transports? But it's also True if we have no custom transports registered?
If we have no custom transports, it means all custom transports are registered
Will split into its own more fleshed-out doc page in a follow-up: #59255 (comment)
Description
Adding a new public API for registering a tensor transport for RDT at runtime. You just need to call `register_tensor_transport` with a name, a list of supported devices, and a class that implements the `TensorTransportManager` interface.
Note that in the test we have to explicitly pickle by value. There's some weird behavior where if you define a class in a test it will pickle by reference, but if you do it normally in a driver script, or even import a module into the driver script, it will pickle by value.
One of the major problems with registering a custom transport, is getting it to the actor:
Problem + Solution
You need a way to register the transport on the actors that are involved.
For the source actor, `enable_tensor_transport` is guaranteed to be True, so we launch a task at actor creation time to register any custom transports. When launching a task with an RDT output, or RDT args (dst actor), we do a ray.get on the registration task if it's not done yet on this actor.
There are some drawbacks to this, though.
Pointed these out in the docs.
A long-term solution without the drawbacks would probably be to hook into actor construction and ask the user to specify, at actor creation time, the tensor transports that will be used with that actor, so we can register them during construction. We don't really have an easy way to hook into actor construction, though. We would have to send the pickled class through an extra field with the actor creation task and keep it around in the task spec so it works across restarts.
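A toy sketch of that long-term idea: carry the serialized transport classes in the actor creation spec so registration survives restarts. Purely illustrative; `ActorCreationSpec` and its fields are hypothetical, and plain pickle (by reference) stands in for the pickle-by-value the real design would need.

```python
import pickle


class ActorCreationSpec:
    """Hypothetical stand-in for the actor creation task spec."""

    def __init__(self, actor_cls, transports):
        self.actor_cls = actor_cls
        # Kept in the spec so it is still available on actor restart.
        self.serialized_transports = {
            name: pickle.dumps(cls) for name, cls in transports.items()
        }

    def restore_transports(self):
        # Runs on the actor at (re)start, so restarts re-register too.
        return {
            name: pickle.loads(blob)
            for name, blob in self.serialized_transports.items()
        }


# Builtin classes used purely so the sketch runs anywhere.
spec = ActorCreationSpec(object, {"CUSTOM": dict})
```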
Adding documentation for all this too.
Testing