[core][rdt] Register your own transport at runtime for RDT #59255
dayshah merged 42 commits into ray-project:master
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Code Review
This pull request introduces a new public API, register_tensor_transport, to allow registering custom tensor transports for RDT at runtime. This is a valuable feature for extensibility. The implementation refactors TensorTransportEnum to use strings for transport types and migrates existing transports (NIXL, NCCL, GLOO) to the new registration API. The documentation is also updated accordingly.
However, I've identified a couple of issues. The most significant is that ray.put and ray.get still contain hardcoded checks that only permit "NIXL" and "OBJECT_STORE", which undermines the goal of supporting custom transports. Additionally, there's some duplicated code for validating transport names across different files. My review provides specific feedback to address these points.
Well, the recv is on the system concurrency group and the extract will be on the main thread after the last PR, so it's hard to guarantee arrival before both. And if it's out of order...
My thought is that the user will define it somewhere on their driver process after importing Ray, like the test does. We can't get the test to work either unless we pickle by value.
How about blocking on the register task finishing before submitting any other tasks? If it's only for actors with custom transports enabled, that seems like an OK tradeoff right now.
Ok, so I updated to do this ^. I outlined the limitations in the doc: you can't borrow the actor ref and submit from another worker, ordering guarantees are off, and actor restarts don't work. Also made
class TransportManagerInfo(NamedTuple):
    transport_manager_class: type[TensorTransportManager]
    # list of supported device types for the transport
    devices: List[str]
I wonder if you want to fold this into the TensorTransportManager class instead?
I want to try to keep the TensorTransportManager class/file easy to read for users since they'll be looking at it. They don't need to worry about this part.
...
register_tensor_transport("CUSTOM", ["cuda", "cpu"], CustomTransport)
Would be great to add another codeblock under this showing how you would then use the tensor transport in an actor class (and ideally break it up with text explaining that).
You could do it in a followup PR but it'd be great to walk through an actual example of a tensor transport manager implementation.
ya will do this in a follow-up and split it into its own page
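In the meantime, a sketch of the kind of example being requested, i.e. how an actor class would then consume the registered transport. The `method` decorator below is a stand-in for `@ray.method(tensor_transport=...)` so the sketch runs without a Ray cluster, and the "CUSTOM" transport name is hypothetical:

```python
# Illustrative only: stand-in decorator mimicking how a tensor transport
# would be attached to an actor method; not Ray's actual API surface.

def method(tensor_transport=None):
    """Stand-in for ray.method: tags a method with its tensor transport."""
    def wrap(fn):
        fn._tensor_transport = tensor_transport
        return fn
    return wrap


class GPUActor:
    """With real Ray this class would carry the remote/actor decorator."""

    @method(tensor_transport="CUSTOM")
    def produce(self):
        # The return value would travel over the "CUSTOM" transport
        # rather than the object store.
        return "tensor-payload"

    def consume(self, tensor):
        # Receives the tensor shipped by the custom transport.
        return len(tensor)
```

The follow-up doc page would replace the stand-ins with the real decorators and a working transport implementation.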
To implement a new tensor transport, implement the abstract interface :class:`ray.experimental.TensorTransportManager <ray.experimental.TensorTransportManager>`
defined in `tensor_transport_manager.py <https://github.com/ray-project/ray/blob/master/python/ray/experimental/gpu_object_manager/tensor_transport_manager.py>`__.
Then call `register_tensor_transport <ray.experimental.register_tensor_transport>` with the transport name, supported devices for the transport,
and the class that implements `TensorTransportManager`. Note that you have to register from the same process in which you create the actor you want
does this mean you have to register the custom transport on both src/dst actor involved in the transfer?
ya, we register it on both
assert isinstance(
    communicator_metadata, CollectiveCommunicatorMetadata
), "metadata must be a CollectiveCommunicatorMetadata object for non-NIXL transport"
assert isinstance(tensor_transport_metadata, CollectiveTransportMetadata)
Why were the error logs removed?
The error log doesn't add anything (it just says it expects another type, which Python will say anyway), it was outdated (we support more than NIXL), and it's just typing enforcement that should be impossible to hit.
obj_id: str,
tensor_transport_metadata: CollectiveTransportMetadata,
communicator_metadata: CollectiveCommunicatorMetadata,
tensor_transport_metadata: TensorTransportMetadata,
Why did the types change to the base class here? Don't these all still have to be CollectiveTransportMetadata?
Python type checkers complain about it otherwise; when overriding, the method signature is supposed to stay the same, including parameter types.
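For reference, this is the parameter-contravariance rule type checkers apply to overrides. The classes below are simplified stand-ins for the real metadata types, not the actual Ray definitions:

```python
# Simplified stand-ins showing why the override keeps the base annotation:
# an override annotated with a *narrower* parameter type is flagged by
# mypy/pyright as an incompatible override.

class TensorTransportMetadata:
    """Base metadata type used in the abstract interface."""


class CollectiveTransportMetadata(TensorTransportMetadata):
    """Transport-specific subclass."""


class TensorTransportManager:
    def recv(self, meta: TensorTransportMetadata) -> str:
        raise NotImplementedError


class CollectiveManager(TensorTransportManager):
    # Annotating `meta: CollectiveTransportMetadata` here would make type
    # checkers flag the override. Keep the base annotation and narrow at
    # runtime with isinstance, as the diff above does.
    def recv(self, meta: TensorTransportMetadata) -> str:
        assert isinstance(meta, CollectiveTransportMetadata)
        return type(meta).__name__
```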
ref = register_custom_tensor_transports_on_actor(actor)
# ref is None if there are no custom transports registered.
self.actor_id_to_transports_registered[actor._actor_id] = (
    True if ref is None else ref
)
Doesn't the comment above say the value is True if the actor has registered the custom transports? But it's also True if we have no custom transports registered?
If we have no custom transports, it means all custom transports are registered
Will split into its own more fleshed-out doc page in a follow-up: #59255 (comment)
Description
Adding a new public API for registering a tensor transport for RDT at runtime. You just need to call `register_tensor_transport` with a name, a list of supported devices, and a class that implements the `TensorTransportManager` interface.
Note that in the test we have to explicitly pickle by value. There's some weird behavior where if you define a class in a test it will pickle by reference, but if you do it normally in a driver script, or even import a module into the driver script, it will pickle by value.
One of the major problems with registering a custom transport, is getting it to the actor:
Problem + Solution
You need a way to register the transport on the actors that are involved.
For the source actor, `enable_tensor_transport` is guaranteed to be True, so we launch a task at actor creation time to register any custom transports. When launching a task with an RDT output, or RDT args (dst actor), we do a ray.get on the registration task if it's not done yet on this actor.
There are some drawbacks to this, though.
Pointed these out in the docs.
A long-term solution without the drawbacks would probably be to hook into actor construction and ask the user to specify, at actor creation time, the tensor transports that will be used with that actor, so we can register them during construction. We don't really have an easy way to hook into actor construction, though. We would have to send the pickled class through an extra field with the actor creation task and keep it around in the task spec so it works across restarts.
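A toy sketch of that long-term idea: carry the serialized transport classes in the actor creation spec so registration survives restarts. Purely illustrative; `ActorCreationSpec` and its fields are hypothetical, and plain pickle (by reference) stands in for the pickle-by-value the real design would need.

```python
import pickle


class ActorCreationSpec:
    """Hypothetical stand-in for the actor creation task spec."""

    def __init__(self, actor_cls, transports):
        self.actor_cls = actor_cls
        # Kept in the spec so it is still available on actor restart.
        self.serialized_transports = {
            name: pickle.dumps(cls) for name, cls in transports.items()
        }

    def restore_transports(self):
        # Runs on the actor at (re)start, so restarts re-register too.
        return {
            name: pickle.loads(blob)
            for name, blob in self.serialized_transports.items()
        }


# Builtin classes used purely so the sketch runs anywhere.
spec = ActorCreationSpec(object, {"CUSTOM": dict})
```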
Adding documentation for all this too.
Testing