
[core][rdt] Register your own transport at runtime for RDT #59255

Merged
dayshah merged 42 commits into ray-project:master from dayshah:bring-your-transport
Jan 12, 2026

Conversation

@dayshah
Contributor

@dayshah dayshah commented Dec 8, 2025

Description

Adding a new public API for registering a tensor transport for RDT at runtime. You just call register_tensor_transport with a name, a list of supported devices, and a class that implements the TensorTransportManager interface.

Note that in the test we have to explicitly pickle by value. There's some odd behavior where a class defined inside a test is pickled by reference, but a class defined normally in a driver script, or imported into the driver script from a module, is pickled by value.

One of the major problems with registering a custom transport is getting it to the actor:

Problem + Solution

You need a way to register the transport on the actors that are involved.
For the source actor, enable_tensor_transport is guaranteed to be True, so we launch a task at actor creation time to register any custom transports. When launching a task with an RDT output, or with RDT args (on the dst actor), we ray.get the registration task if it hasn't finished yet on that actor.
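The gating described here (block on the one-time registration task before submitting RDT work to an actor) can be modeled with a future standing in for the registration ObjectRef. The dict shape and ensure_registered helper are illustrative, not Ray's code:

```python
# Toy model of the gating logic: before submitting an RDT task to an actor,
# block on that actor's registration "task" if it hasn't finished yet.
# A Future stands in for the registration ObjectRef; names are illustrative.
from concurrent.futures import ThreadPoolExecutor

registered = {}  # actor_id -> in-flight registration future, or True once done


def ensure_registered(actor_id: str) -> None:
    state = registered[actor_id]
    if state is not True:
        state.result()  # analogous to ray.get(registration_ref)
        registered[actor_id] = True


with ThreadPoolExecutor() as pool:
    fut = pool.submit(lambda: None)  # stands in for the registration task
    registered["actor-1"] = fut
    ensure_registered("actor-1")

print(registered["actor-1"])  # → True
```

Calling ensure_registered again is a no-op once the value has been collapsed to True, so only the first RDT task submission to an actor pays the blocking cost.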

There are some drawbacks to this, though.

  1. Actor restarts don't work.
  2. You can't borrow the actor ref and submit from another worker; all ordering bets are off and actor restarts don't work.

Pointed these out in the docs.

A long-term solution without these drawbacks would probably be to hook into actor construction and ask the user to specify, at actor creation time, the tensor transports that will be used with that actor, so we can register the transport at construction time. We don't really have an easy way to hook into actor construction, though. We would have to send the pickled class through an extra field with the actor creation task and keep it around in the task spec so it works across restarts.

Adding documentation for all this too.

Testing

  • Just registering NIXL, NCCL, and GLOO at runtime with this API, and having all the existing tests for the transports pass
  • Adding a new test file that registers and uses a custom shared-memory transport that doesn't ship with Ray

Signed-off-by: dayshah <dhyey2019@gmail.com>
@dayshah dayshah added the go add ONLY when ready to merge, run all tests label Dec 8, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new public API, register_tensor_transport, to allow registering custom tensor transports for RDT at runtime. This is a valuable feature for extensibility. The implementation refactors TensorTransportEnum to use strings for transport types and migrates existing transports (NIXL, NCCL, GLOO) to the new registration API. The documentation is also updated accordingly.

However, I've identified a couple of issues. The most significant is that ray.put and ray.get still contain hardcoded checks that only permit "NIXL" and "OBJECT_STORE", which undermines the goal of supporting custom transports. Additionally, there's some duplicated code for validating transport names across different files. My review provides specific feedback to address these points.

@dayshah dayshah force-pushed the bring-your-transport branch from e3a266a to 55d0da9 Compare December 8, 2025 23:40
@dayshah dayshah force-pushed the bring-your-transport branch from 55d0da9 to 20918be Compare December 9, 2025 01:34
@dayshah dayshah marked this pull request as ready for review December 9, 2025 06:40
@dayshah dayshah requested a review from a team as a code owner December 9, 2025 06:40
@dayshah
Contributor Author

dayshah commented Jan 6, 2026

Will it work if we submit the tensor registration on the main thread? That seems like a good intermediate point to 2.

Well the recv is on the system concurrency group and the extract will be on the main thread after the last PR. So it's hard to guarantee arrival before both. And if it's out of order...

Hmm I don't think it's fundamentally different from any other library that the user needs to import? IIUC, the main requirement is that the driver and actors need to have the same PYTHONPATH? That seems OK to me since we recommend that anyway but lmk if I'm misunderstanding something.

My thought is that the user will define it somewhere on their driver process after importing ray, like the test does. We can't get the test to work either unless we pickle by value

@stephanie-wang
Contributor

Will it work if we submit the tensor registration on the main thread? That seems like a good intermediate point to 2.

Well the recv is on the system concurrency group and the extract will be on the main thread after the last pr. So it's hard to guarantee arrival before both. And if it's out of order...

How about blocking on the register task to finish before submitting any other tasks? If it's only for actors with custom transports enabled, that seems like an OK tradeoff right now.

@dayshah
Contributor Author

dayshah commented Jan 8, 2026

How about blocking on the register task to finish before submitting any other tasks? If it's only for actors with custom transports enabled, that seems like an OK tradeoff right now.

Ok so I updated to do this ^. I outlined the limitations in the doc: you can't borrow the actor ref and submit from another worker, all ordering bets are off, and actor restarts don't work.

Also made pickle_class_by_value a parameter that defaults to False and mentioned in the docs when it's useful.
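The by-reference failure mode motivating pickle_class_by_value can be shown with stdlib pickle alone: a class living in a module the worker can't import dumps fine but fails to load. The module name here is synthetic; by-value pickling (as cloudpickle's register_pickle_by_value does) avoids this by embedding the class definition in the payload:

```python
# Demonstrates why by-reference pickling breaks when the receiving process
# can't import the defining module. Module name is synthetic for illustration.
import pickle
import sys
import types

mod = types.ModuleType("my_transport_mod")
exec("class CustomTransport: pass", mod.__dict__)
sys.modules["my_transport_mod"] = mod

# stdlib pickle stores only "my_transport_mod.CustomTransport", not the code
blob = pickle.dumps(mod.CustomTransport)

del sys.modules["my_transport_mod"]  # simulate a worker without the module
try:
    pickle.loads(blob)
except ModuleNotFoundError as e:
    print("by-reference unpickle failed:", type(e).__name__)
```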

@dayshah dayshah requested a review from stephanie-wang January 8, 2026 10:52
class TransportManagerInfo(NamedTuple):
    transport_manager_class: type[TensorTransportManager]
    # list of supported device types for the transport
    devices: List[str]
Contributor


I wonder if you want to fold this into the TensorTransportManager class instead?

Contributor Author


I want to try to keep the TensorTransportManager class/file easy to read for users since they'll be looking at it. They don't need to worry about this part.

...


register_tensor_transport("CUSTOM", ["cuda", "cpu"], CustomTransport)
Contributor


Would be great to add another codeblock under this showing how you would then use the tensor transport in an actor class (and ideally break it up with text explaining that).

Contributor


You could do it in a followup PR but it'd be great to walk through an actual example of a tensor transport manager implementation.

Contributor


+1

Contributor Author


ya will do this in a follow-up and split it into its own page
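Pending that follow-up, a toy stand-in gives the flavor of what an implementation walkthrough might cover. The method names (send/recv) and the process-local store are illustrative assumptions and do not reflect Ray's actual TensorTransportManager interface:

```python
# Toy "transport manager": stages tensors in a process-local store keyed by
# object id, mimicking the send/extract split discussed in this PR.
# Purely illustrative; not Ray's TensorTransportManager API.
from typing import Dict, List


class ToyTransportManager:
    def __init__(self) -> None:
        self._store: Dict[str, List[bytes]] = {}  # obj_id -> staged tensors

    def send(self, obj_id: str, tensors: List[bytes]) -> None:
        # Source side: stage the tensors under the object's id.
        self._store[obj_id] = list(tensors)

    def recv(self, obj_id: str) -> List[bytes]:
        # Destination side: take ownership of the staged tensors.
        return self._store.pop(obj_id)


mgr = ToyTransportManager()
mgr.send("obj-1", [b"tensor-bytes"])
print(mgr.recv("obj-1"))  # → [b'tensor-bytes']
```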

To implement a new tensor transport, implement the abstract interface :class:`ray.experimental.TensorTransportManager <ray.experimental.TensorTransportManager>`
defined in `tensor_transport_manager.py <https://github.com/ray-project/ray/blob/master/python/ray/experimental/gpu_object_manager/tensor_transport_manager.py>`__.
Then call :func:`register_tensor_transport <ray.experimental.register_tensor_transport>` with the transport name, the supported devices for the transport,
and the class that implements ``TensorTransportManager``. Note that you have to register from the same process in which you create the actor you want
Contributor


does this mean you have to register the custom transport on both src/dst actor involved in the transfer?

Contributor Author


ya, we register it on both

assert isinstance(
    communicator_metadata, CollectiveCommunicatorMetadata
), "metadata must be a CollectiveCommunicatorMetadata object for non-NIXL transport"
assert isinstance(tensor_transport_metadata, CollectiveTransportMetadata)
Contributor


Why were the error logs removed?

Contributor Author


The error log doesn't add anything (it just says it expects another type, which Python will report anyway), it was outdated (we support more than NIXL), and it's just typing enforcement that should be impossible to hit.

obj_id: str,
tensor_transport_metadata: CollectiveTransportMetadata,
communicator_metadata: CollectiveCommunicatorMetadata,
tensor_transport_metadata: TensorTransportMetadata,
Contributor


Why did the types change to the base class here? Don't these all still have to be CollectiveTransportMetadata?

Contributor Author


Python type checkers complain about it otherwise; when inheriting, the method signature is supposed to stay the same, including parameter types.

ref = register_custom_tensor_transports_on_actor(actor)
# ref is None if there are no custom transports registered.
self.actor_id_to_transports_registered[actor._actor_id] = (
    True if ref is None else ref
)
Contributor


Doesn't the above state the value is True if the actor has registered the custom transport? It's also True if we have no custom transports registered?

Contributor Author


If we have no custom transports, it means all custom transports are registered

@dayshah dayshah enabled auto-merge (squash) January 12, 2026 06:24
@dayshah
Contributor Author

dayshah commented Jan 12, 2026

Will split into its own more fleshed-out doc page in a follow-up #59255 (comment)

@dayshah dayshah merged commit e7b5f5a into ray-project:master Jan 12, 2026
7 checks passed
@dayshah dayshah deleted the bring-your-transport branch January 12, 2026 18:49
AYou0207 pushed a commit to AYou0207/ray that referenced this pull request Jan 13, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026

Labels

  • core: Issues that should be addressed in Ray Core
  • docs: An issue or change related to documentation
  • go: add ONLY when ready to merge, run all tests
