
feat: support multi-node TP/EP/DP/PP from training side, and TP/EP/DP from rollout side, with rdma, for models of deepseek arch#14

Merged
JensenFire merged 5 commits into JD-ETH:jd/rdma-integration from JensenFire:jsf/patching_for_dsk
Jan 13, 2026

Conversation

JensenFire (Collaborator) commented Jan 11, 2026

While the last PR #11 was for NCCL, this PR is for the RDMA setting. It now supports the multi-node setting for models of the DeepSeek architecture.

Modifications:

  • Multi-node setting: as the counterpart of the sglang PR (feat: add support for endpoint get_remote_instance_transfer_engine_info in multi-node scenarios, and DeepSeek support, sglang#3), this PR adds support for launching rollout engines in multi-node settings, including:

    • How we identify a rollout_engine in the multi-node scenario. Previously, if one rollout server had tp_size=16, slime treated each single node (assuming each node has 8 GPUs) as a separate rollout_engine. That concept complicates RDMA transfers, so in the RDMA transfer_plan, one rollout engine now spans all nodes involved in the related server.

    For example, with training-gpus = 8 and rollout-gpus = 16, the transfer_plan could now be

    ```
    # source_rank  -> target(engine_idx, engine_rank)
    source_rank=0 -> target (0, 0), target (0, 8)
    source_rank=1 -> target (0, 1), target (0, 9)
    ...
    source_rank=7 -> target (0, 7), target (0, 15)
    ```

  • Initialize node_hosts in the Ray setting.
  • DeepSeek initialization settings in MockSglangDistributedContext.
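The plan shown above can be sketched as follows. This is a minimal illustration only: `build_transfer_plan` and its signature are hypothetical, not slime's actual API; it just reproduces the 8-training-GPU to one-16-GPU-engine case from the description, where source rank r feeds engine ranks r and r + 8 of engine 0.

```python
# Hypothetical sketch of the transfer_plan mapping described above.
# Not slime's real code: only the resulting (engine_idx, engine_rank)
# pairs come from the PR description.

def build_transfer_plan(num_train_gpus, num_rollout_gpus, gpus_per_engine):
    """Map each training source rank to (engine_idx, engine_rank) targets."""
    num_engines = num_rollout_gpus // gpus_per_engine
    plan = {r: [] for r in range(num_train_gpus)}
    for engine_idx in range(num_engines):
        for engine_rank in range(gpus_per_engine):
            # Each engine rank pulls from source rank (engine_rank mod
            # num_train_gpus), so each source serves multiple engine ranks.
            source = engine_rank % num_train_gpus
            plan[source].append((engine_idx, engine_rank))
    return plan

plan = build_transfer_plan(num_train_gpus=8, num_rollout_gpus=16, gpus_per_engine=16)
print(plan[0])  # [(0, 0), (0, 8)]
print(plan[7])  # [(0, 7), (0, 15)]
```

With one 16-GPU engine this yields exactly the mapping in the description: each of the 8 source ranks serves two engine ranks of engine 0.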

Tests:

tests/test_weight_transfer_multinode_h100_80g.sh
It now supports multiple settings with different parallelism.

@JensenFire JensenFire requested review from JD-ETH and Risc-lt January 11, 2026 10:16
@JensenFire JensenFire changed the title feat: support multi-node + TP with rdma, for models of deepseek arch [1/2]feat: support multi-node + TP with rdma, for models of deepseek arch Jan 11, 2026
@JensenFire JensenFire changed the title [1/2]feat: support multi-node + TP with rdma, for models of deepseek arch feat: support multi-node TP/EP/DP/PP from training side, and TP/EP/DP from rollout side, with rdma, for models of deepseek arch Jan 11, 2026
JensenFire (Collaborator Author) commented:

@JD-ETH @Risc-lt Feel free to merge this PR or make any changes based on it, if it's urgent for you to test features in multi-node scenarios.

self.args.hf_checkpoint,
pp_shard=target.source_shard,
target_rank=target.engine_rank,
target_rank=target.engine_rank, # NOTE: here we assume that sglang_tp == world_size
Owner:
I don't understand this comment.

Collaborator Author:

Since we cannot support sglang's PP (meaning sglang_pp_size = 1), and usually tp * pp = world_size, we assume here that sglang_tp == world_size.
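The assumption in that comment can be made concrete with a small sketch (illustrative only; `infer_sglang_tp` is a hypothetical helper, not code from this PR):

```python
# Illustration of the assumption discussed above: with sglang_pp_size
# fixed at 1 and tp * pp == world_size, tp must equal world_size.

def infer_sglang_tp(world_size, pp_size=1):
    """Derive tp size from world size, given tp * pp == world_size."""
    assert world_size % pp_size == 0, "tp * pp must equal world_size"
    return world_size // pp_size

print(infer_sglang_tp(16))  # 16: with pp=1, sglang_tp == world_size
```

If sglang PP were ever supported (pp_size > 1), the `target_rank == engine_rank` identification would no longer hold and the tp rank would have to be derived from the (tp, pp) grid instead.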

"--max-tokens-per-gpu 2048 "
)
if args.decoder_last_pipeline_num_layers is not None:
perf_args += f"--decoder-last-pipeline-num-layers {args.decoder_last_pipeline_num_layers} "
Owner:

what is this for?

Collaborator Author:

For training PP, to handle the case num_layers % pp_size > 0.
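A worked example of why that flag matters (an illustrative sketch, not Megatron's actual partitioning code; the 61-layer figure is DeepSeek-V3's decoder depth, used here only as an example):

```python
# Sketch of an uneven pipeline split: when num_layers % pp_size > 0,
# an explicit last-stage layer count (as set by a flag like
# --decoder-last-pipeline-num-layers) lets the earlier stages stay even.

def split_layers(num_layers, pp_size, last_stage_layers=None):
    """Return the per-stage layer counts for a pipeline of pp_size stages."""
    if last_stage_layers is None:
        assert num_layers % pp_size == 0, "uneven split needs an explicit last-stage size"
        return [num_layers // pp_size] * pp_size
    # Earlier stages share the remaining layers evenly; the last stage
    # takes the explicitly configured count.
    remaining = num_layers - last_stage_layers
    assert remaining % (pp_size - 1) == 0, "remaining layers must split evenly"
    return [remaining // (pp_size - 1)] * (pp_size - 1) + [last_stage_layers]

print(split_layers(61, 4, last_stage_layers=13))  # [16, 16, 16, 13]
```

Without the override, 61 layers on pp_size=4 cannot be split at all under an even-split assumption, which is exactly the `num_layers % pp_size > 0` case the comment mentions.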

# Patch at import locations in model files - these are critical!
patch("sglang.srt.models.qwen3.get_attention_tp_rank", return_value=self.attn_tp_rank),
patch("sglang.srt.models.qwen3.get_attention_tp_size", return_value=self.attn_tp_size),
patch("sglang.srt.models.qwen3.get_pp_group", return_value=mock_pp_group),
Owner:

Many of these seem redundant. I will modify the mock context part to enable this for most models; we shouldn't be changing the imports at the model level.

Collaborator Author:

agree. it's annoying...

pp_shard: int,
target_rank: int,
target_tp: int,
dp_rank: int,
Owner:

dp rank and size are irrelevant here

Collaborator Author:

Considering we pass attn_tp_rank/attn_tp_size, it's possible that we could delete them.

One potential concerning point is this part

sglang_dp_attention._ATTN_DP_RANK = 0
sglang_dp_attention._ATTN_DP_SIZE = 1

should we change it to:

sglang_dp_attention._ATTN_DP_RANK = self.dp_rank
sglang_dp_attention._ATTN_DP_SIZE = self.dp_size

Or we could delete all of them once sglang's mocking context is enabled.
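The two options being discussed can be sketched side by side (a standalone illustration; `sglang_dp_attention` below is a stand-in namespace, and only the attribute names `_ATTN_DP_RANK`/`_ATTN_DP_SIZE` come from the snippet quoted in this thread):

```python
# Standalone sketch of the mock-patching question above: hard-code the
# attention-DP globals to rank 0 / size 1, or propagate the context's
# real dp_rank / dp_size. Not sglang code; just a stand-in namespace.
from types import SimpleNamespace

# Stand-in for the sglang_dp_attention module-level globals.
sglang_dp_attention = SimpleNamespace(_ATTN_DP_RANK=None, _ATTN_DP_SIZE=None)

def apply_dp_mock(module, dp_rank=0, dp_size=1):
    # Defaults reproduce the current hard-coded behavior; passing the
    # context's dp_rank / dp_size is the proposed alternative.
    module._ATTN_DP_RANK = dp_rank
    module._ATTN_DP_SIZE = dp_size

apply_dp_mock(sglang_dp_attention)        # current: always 0 / 1
apply_dp_mock(sglang_dp_attention, 2, 4)  # proposed: real dp_rank / dp_size
print(sglang_dp_attention._ATTN_DP_RANK)  # 2
```

Which option is right depends on whether any DeepSeek weight-loading path in the mock context actually reads these globals; if not, deleting them once sglang's own mocking context lands (as suggested above) is the simpler route.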

@@ -398,6 +399,7 @@ def init_rollout_engines(args, pg, all_rollout_engines):


def _allocate_rollout_engine_addr_and_ports_external(args, rollout_engines):
Owner:

this all feels redundant due to the current sglang side design.

Collaborator Author:

Yeah, node_hosts will be removed in the future once we make NCCL inter-node communication available on the sglang side.

@JensenFire JensenFire merged commit 5c1bb4f into JD-ETH:jd/rdma-integration Jan 13, 2026
1 check passed
Risc-lt pushed a commit that referenced this pull request Jan 28, 2026
… from rollout side, with rdma, for models of deepseek arch (#14)
