feat: support multi-node TP/EP/DP/PP from training side, and TP/EP/DP from rollout side, with rdma, for models of deepseek arch (#14)
Conversation
* EP/DP/PP for multi-node; CP may be OK too
slime/backends/megatron_utils/update_weight/remote_transfer_plan.py
```python
self.args.hf_checkpoint,
pp_shard=target.source_shard,
target_rank=target.engine_rank,  # NOTE: here we assume that sglang_tp == world_size
```
Since we cannot support sglang's PP (meaning `sglang_pp_size = 1`), and usually `tp * pp = world_size`, we assume here that `sglang_tp == world_size`.
```python
    "--max-tokens-per-gpu 2048 "
)
if args.decoder_last_pipeline_num_layers is not None:
    perf_args += f"--decoder-last-pipeline-num-layers {args.decoder_last_pipeline_num_layers} "
```
For training PP, in case `num_layers % pp_size > 0`.
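To illustrate the uneven split this flag enables, here is a minimal sketch (a hypothetical helper, not slime's or Megatron's actual code) of assigning decoder layers to pipeline stages when the layer count does not divide evenly, with the last stage taking an explicit override:

```python
from typing import Optional

def split_layers(num_layers: int, pp_size: int,
                 last_stage_layers: Optional[int] = None) -> list:
    """Return the number of decoder layers assigned to each PP stage.

    If `last_stage_layers` is given (mirroring the intent of
    --decoder-last-pipeline-num-layers), the last stage takes that many
    layers and the remaining stages split the rest evenly.
    """
    if last_stage_layers is None:
        assert num_layers % pp_size == 0, "uneven split needs an explicit last-stage override"
        return [num_layers // pp_size] * pp_size
    remaining = num_layers - last_stage_layers
    assert pp_size > 1 and remaining % (pp_size - 1) == 0, "other stages must split evenly"
    return [remaining // (pp_size - 1)] * (pp_size - 1) + [last_stage_layers]

# 61 layers on pp_size=4: 16 layers on each of the first three stages, 13 on the last.
print(split_layers(61, 4, last_stage_layers=13))  # [16, 16, 16, 13]
```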
```python
# Patch at import locations in model files - these are critical!
patch("sglang.srt.models.qwen3.get_attention_tp_rank", return_value=self.attn_tp_rank),
patch("sglang.srt.models.qwen3.get_attention_tp_size", return_value=self.attn_tp_size),
patch("sglang.srt.models.qwen3.get_pp_group", return_value=mock_pp_group),
```
Many of these seem redundant. I will modify the mock context part to enable this for most models --- we shouldn't be patching imports at the model level.
Agree. It's annoying...
```python
pp_shard: int,
target_rank: int,
target_tp: int,
dp_rank: int,
```
`dp_rank` and `dp_size` are irrelevant here.
Considering we pass `attn_tp_rank`/`attn_tp_size`, it's possible that we could delete them.
One potentially concerning point is this part:
```python
sglang_dp_attention._ATTN_DP_RANK = 0
sglang_dp_attention._ATTN_DP_SIZE = 1
```
Shall we change it to:
```python
sglang_dp_attention._ATTN_DP_RANK = self.dp_rank
sglang_dp_attention._ATTN_DP_SIZE = self.dp_size
```
Or we could delete all of them once sglang's mocking context is enabled.
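Rather than overwriting the module globals permanently, one option is to set and restore them inside a context manager, so the mock cannot leak into later code. A minimal sketch, where `fake_dp_attention` is a stand-in for sglang's actual `dp_attention` module and the helper name is hypothetical:

```python
from contextlib import contextmanager
import types

# Stand-in for sglang's dp_attention module and its default globals.
fake_dp_attention = types.SimpleNamespace(_ATTN_DP_RANK=0, _ATTN_DP_SIZE=1)

@contextmanager
def mock_dp_attention(module, dp_rank: int, dp_size: int):
    """Temporarily set the module-level DP globals, restoring them on exit."""
    old = (module._ATTN_DP_RANK, module._ATTN_DP_SIZE)
    module._ATTN_DP_RANK, module._ATTN_DP_SIZE = dp_rank, dp_size
    try:
        yield module
    finally:
        module._ATTN_DP_RANK, module._ATTN_DP_SIZE = old

with mock_dp_attention(fake_dp_attention, dp_rank=2, dp_size=4) as m:
    print(m._ATTN_DP_RANK, m._ATTN_DP_SIZE)  # 2 4
print(fake_dp_attention._ATTN_DP_RANK)       # 0 (restored)
```

`unittest.mock.patch.object(module, "_ATTN_DP_RANK", dp_rank)` would achieve the same restore-on-exit behavior with less code.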
```python
@@ -398,6 +399,7 @@ def init_rollout_engines(args, pg, all_rollout_engines):

def _allocate_rollout_engine_addr_and_ports_external(args, rollout_engines):
```
This all feels redundant due to the current sglang-side design.
Yeah, the `node_hosts` will be removed in the future, once we make NCCL inter-node communication available on the sglang side.
While the last PR #11 was for NCCL, this PR covers the RDMA setting. It now supports multi-node settings for models of the deepseek arch.

Modifications:

- Multi-node setting: as the counterpart of the sglang PR JD-ETH/sglang#3 (feat: add support for endpoint `get_remote_instance_transfer_engine_info` of multi-node scenarios, and deepseek support), this PR adds support for launching rollout engines in multi-node settings, including:
  - How we identify a `rollout_engine` in a multi-node scenario. Previously, if one rollout server had `tp_size=16`, slime would treat each single node (assuming each node has 8 GPUs) as a separate `rollout_engine`. Keeping that concept would complicate RDMA transfers, so in the RDMA transfer_plan, **each rollout engine is now the one spanning all nodes involved in the related server**. For example, with `training-gpus = 8` and `rollout-gpus = 16`, the transfer_plan now becomes:

    ```
    # source_rank -> target(engine_idx, engine_rank)
    source_rank=0 -> target (0,0), target (0,8)
    source_rank=1 -> target (0,1), target (0,9)
    ...
    source_rank=7 -> target (0,7), target (0,15)
    ```
  - Initialize `node_hosts` in the ray setting.
  - Deepseek initialization setting in `MockSglangDistributedContext`.

Tests: `tests/test_weight_transfer_multinode_h100_80g.sh`. It now supports multiple settings with different parallelism.
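The fan-out pattern above can be sketched as a small round-robin plan builder: each training (source) rank targets every rollout engine rank congruent to it modulo the number of source ranks. The function name and return shape are illustrative, not slime's actual transfer_plan API:

```python
def build_transfer_plan(num_source_ranks: int, engine_sizes: list) -> dict:
    """Map source_rank -> list of (engine_idx, engine_rank) targets.

    Each rollout engine spans all nodes of its server, so engine_rank
    runs over the engine's full world size, not a single node.
    """
    plan = {s: [] for s in range(num_source_ranks)}
    for engine_idx, size in enumerate(engine_sizes):
        for engine_rank in range(size):
            # Round-robin: engine_rank is served by source_rank = engine_rank % num_source_ranks.
            plan[engine_rank % num_source_ranks].append((engine_idx, engine_rank))
    return plan

# training-gpus = 8, one rollout engine spanning 16 GPUs (two 8-GPU nodes):
plan = build_transfer_plan(8, [16])
print(plan[0])  # [(0, 0), (0, 8)]
print(plan[7])  # [(0, 7), (0, 15)]
```

This reproduces the example in the description: source rank 0 pushes to engine ranks 0 and 8, ..., source rank 7 to engine ranks 7 and 15.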