
feat: support multi-node TP/EP/DP/PP from training side, and TP/EP/DP from rollout side, with rdma, for models of deepseek arch#14

Merged
JensenFire merged 5 commits into JD-ETH:jd/rdma-integration from JensenFire:jsf/patching_for_dsk
Jan 13, 2026

Conversation

JensenFire (Collaborator) commented Jan 11, 2026

While the last PR #11 was for NCCL, this PR is for the RDMA setting. It now supports the multi-node setting for models of the DeepSeek architecture.

Modifications:

  • Multi-node setting: as the counterpart of the sglang PR (feat: add support for endpoint get_remote_instance_transfer_engine_info in multi-node scenarios, and DeepSeek support, sglang#3), this PR adds support for launching rollout engines in multi-node settings, including:

    • How we identify a rollout_engine in the multi-node scenario. Previously, if one rollout server had tp_size=16, slime treated each single node (assuming each node has 8 GPUs) as a separate rollout_engine. That concept complicates RDMA transfers, so in the RDMA transfer_plan, one rollout engine now spans all nodes involved in the related server.

    For example, with training-gpus = 8 and rollout-gpus = 16, the transfer_plan could now be

    ```
    # source_rank  -> target(engine_idx, engine_rank)
    source_rank=0 -> target (0, 0), target (0, 8)
    source_rank=1 -> target (0, 1), target (0, 9)
    ...
    source_rank=7 -> target (0, 7), target (0, 15)
    ```

  • Initialize node_hosts in the Ray setting.
  • DeepSeek initialization settings in MockSglangDistributedContext.
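The plan shown above can be sketched as follows. This is a minimal illustration only: `build_transfer_plan` and its signature are hypothetical, not slime's actual API; it just reproduces the 8-training-GPU to one-16-GPU-engine case from the description, where source rank r feeds engine ranks r and r + 8 of engine 0.

```python
# Hypothetical sketch of the transfer_plan mapping described above.
# Not slime's real code: only the resulting (engine_idx, engine_rank)
# pairs come from the PR description.

def build_transfer_plan(num_train_gpus, num_rollout_gpus, gpus_per_engine):
    """Map each training source rank to (engine_idx, engine_rank) targets."""
    num_engines = num_rollout_gpus // gpus_per_engine
    plan = {r: [] for r in range(num_train_gpus)}
    for engine_idx in range(num_engines):
        for engine_rank in range(gpus_per_engine):
            # Each engine rank pulls from source rank (engine_rank mod
            # num_train_gpus), so each source serves multiple engine ranks.
            source = engine_rank % num_train_gpus
            plan[source].append((engine_idx, engine_rank))
    return plan

plan = build_transfer_plan(num_train_gpus=8, num_rollout_gpus=16, gpus_per_engine=16)
print(plan[0])  # [(0, 0), (0, 8)]
print(plan[7])  # [(0, 7), (0, 15)]
```

With one 16-GPU engine this yields exactly the mapping in the description: each of the 8 source ranks serves two engine ranks of engine 0.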

Tests:

tests/test_weight_transfer_multinode_h100_80g.sh
It now supports multiple settings with different parallelism.

@JensenFire JensenFire requested review from JD-ETH and Risc-lt January 11, 2026 10:16
@JensenFire JensenFire changed the title feat: support multi-node + TP with rdma, for models of deepseek arch [1/2]feat: support multi-node + TP with rdma, for models of deepseek arch Jan 11, 2026
@JensenFire JensenFire changed the title [1/2]feat: support multi-node + TP with rdma, for models of deepseek arch feat: support multi-node TP/EP/DP/PP from training side, and TP/EP/DP from rollout side, with rdma, for models of deepseek arch Jan 11, 2026
JensenFire (Collaborator Author) commented:

@JD-ETH @Risc-lt Feel free to merge this PR or make any changes based on it, if it's urgent for you to test features in multi-node scenarios.

self.args.hf_checkpoint,
pp_shard=target.source_shard,
target_rank=target.engine_rank,
target_rank=target.engine_rank, # NOTE: here we assume that sglang_tp == world_size
Owner:
I don't understand this comment.

Collaborator Author:

Since we cannot support sglang's PP (meaning sglang_pp_size = 1), and usually tp * pp = world_size, we assume here that sglang_tp == world_size.
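The assumption in that comment can be made concrete with a small sketch (illustrative only; `infer_sglang_tp` is a hypothetical helper, not code from this PR):

```python
# Illustration of the assumption discussed above: with sglang_pp_size
# fixed at 1 and tp * pp == world_size, tp must equal world_size.

def infer_sglang_tp(world_size, pp_size=1):
    """Derive tp size from world size, given tp * pp == world_size."""
    assert world_size % pp_size == 0, "tp * pp must equal world_size"
    return world_size // pp_size

print(infer_sglang_tp(16))  # 16: with pp=1, sglang_tp == world_size
```

If sglang PP were ever supported (pp_size > 1), the `target_rank == engine_rank` identification would no longer hold and the tp rank would have to be derived from the (tp, pp) grid instead.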

"--max-tokens-per-gpu 2048 "
)
if args.decoder_last_pipeline_num_layers is not None:
perf_args += f"--decoder-last-pipeline-num-layers {args.decoder_last_pipeline_num_layers} "
Owner:

what is this for?

Collaborator Author:

For training PP, to handle the case num_layers % pp_size > 0.
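A worked example of why that flag matters (an illustrative sketch, not Megatron's actual partitioning code; the 61-layer figure is DeepSeek-V3's decoder depth, used here only as an example):

```python
# Sketch of an uneven pipeline split: when num_layers % pp_size > 0,
# an explicit last-stage layer count (as set by a flag like
# --decoder-last-pipeline-num-layers) lets the earlier stages stay even.

def split_layers(num_layers, pp_size, last_stage_layers=None):
    """Return the per-stage layer counts for a pipeline of pp_size stages."""
    if last_stage_layers is None:
        assert num_layers % pp_size == 0, "uneven split needs an explicit last-stage size"
        return [num_layers // pp_size] * pp_size
    # Earlier stages share the remaining layers evenly; the last stage
    # takes the explicitly configured count.
    remaining = num_layers - last_stage_layers
    assert remaining % (pp_size - 1) == 0, "remaining layers must split evenly"
    return [remaining // (pp_size - 1)] * (pp_size - 1) + [last_stage_layers]

print(split_layers(61, 4, last_stage_layers=13))  # [16, 16, 16, 13]
```

Without the override, 61 layers on pp_size=4 cannot be split at all under an even-split assumption, which is exactly the `num_layers % pp_size > 0` case the comment mentions.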

# Patch at import locations in model files - these are critical!
patch("sglang.srt.models.qwen3.get_attention_tp_rank", return_value=self.attn_tp_rank),
patch("sglang.srt.models.qwen3.get_attention_tp_size", return_value=self.attn_tp_size),
patch("sglang.srt.models.qwen3.get_pp_group", return_value=mock_pp_group),
Owner:

Many of these seem redundant. I will modify the mock context part to enable this for most models; we shouldn't be changing the imports at the model level.

Collaborator Author:

agree. it's annoying...

pp_shard: int,
target_rank: int,
target_tp: int,
dp_rank: int,
Owner:

dp rank and size are irrelevant here

Collaborator Author:

Considering we pass attn_tp_rank/attn_tp_size, it's possible that we could delete them.

One potential concerning point is this part

sglang_dp_attention._ATTN_DP_RANK = 0
sglang_dp_attention._ATTN_DP_SIZE = 1

should we change it to:

sglang_dp_attention._ATTN_DP_RANK = self.dp_rank
sglang_dp_attention._ATTN_DP_SIZE = self.dp_size

Or we could delete all of them once sglang's mocking context is enabled.
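The two options being discussed can be sketched side by side (a standalone illustration; `sglang_dp_attention` below is a stand-in namespace, and only the attribute names `_ATTN_DP_RANK`/`_ATTN_DP_SIZE` come from the snippet quoted in this thread):

```python
# Standalone sketch of the mock-patching question above: hard-code the
# attention-DP globals to rank 0 / size 1, or propagate the context's
# real dp_rank / dp_size. Not sglang code; just a stand-in namespace.
from types import SimpleNamespace

# Stand-in for the sglang_dp_attention module-level globals.
sglang_dp_attention = SimpleNamespace(_ATTN_DP_RANK=None, _ATTN_DP_SIZE=None)

def apply_dp_mock(module, dp_rank=0, dp_size=1):
    # Defaults reproduce the current hard-coded behavior; passing the
    # context's dp_rank / dp_size is the proposed alternative.
    module._ATTN_DP_RANK = dp_rank
    module._ATTN_DP_SIZE = dp_size

apply_dp_mock(sglang_dp_attention)        # current: always 0 / 1
apply_dp_mock(sglang_dp_attention, 2, 4)  # proposed: real dp_rank / dp_size
print(sglang_dp_attention._ATTN_DP_RANK)  # 2
```

Which option is right depends on whether any DeepSeek weight-loading path in the mock context actually reads these globals; if not, deleting them once sglang's own mocking context lands (as suggested above) is the simpler route.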

@@ -398,6 +399,7 @@ def init_rollout_engines(args, pg, all_rollout_engines):


def _allocate_rollout_engine_addr_and_ports_external(args, rollout_engines):
Owner:

this all feels redundant due to the current sglang side design.

Collaborator Author:

Yeah, node_hosts will be removed in the future once we make NCCL inter-node communication available on the sglang side.

@JensenFire JensenFire merged commit 5c1bb4f into JD-ETH:jd/rdma-integration Jan 13, 2026
1 check passed
Risc-lt pushed a commit that referenced this pull request Jan 28, 2026
… from rollout side, with rdma, for models of deepseek arch (#14)
