[DeepEP] Eliminate unnecessary DP cudagraph padding #5557
yuleil wants to merge 3 commits into sgl-project:main
Conversation
Even if DeepEP is enabled, this padding is necessary for cuda graph because DeepSeek-V3 contains several dense FFNs. Could you please add a barrier before the communication operator to confirm whether the communication volume is redundant under DeepEP? This is weird to me because low-latency dispatch cannot handle input over
I will continue to investigate the situation with the dense FFN.
Okay, I’ll give it a try.
The low-latency dispatch returns a tensor of shape
The slow combine is occurring under CUDA Graph.
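For context, a minimal sketch of the barrier-based timing check suggested earlier (eager mode only, since barriers and synchronizes cannot run inside CUDA graph capture; `dispatch_fn` is a placeholder for the communication op being measured, not the actual DeepEP API):

```python
import time

import torch
import torch.distributed as dist


def timed_dispatch(dispatch_fn, *args, **kwargs):
    """Time a dispatch/combine call in isolation (requires an initialized
    process group; eager mode only, not usable during graph capture).

    `dispatch_fn` is a hypothetical placeholder for the communication op
    being measured. The barrier aligns all ranks first so straggler ranks
    are not attributed to the communication itself.
    """
    dist.barrier()
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = dispatch_fn(*args, **kwargs)
    torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f"[rank {dist.get_rank()}] dispatch took {elapsed_ms:.2f} ms")
    return out
```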
Hi @yuleil, I've been testing this PR together with #5435 and encountered a CUDA memory error during batch decoding. Here are the details:

Reproduction Steps:
python3 -m sglang.bench_serving \
--port 8000 \
--backend sglang \
--dataset-name random \
--num-prompt 128 \
--random-input 4096 \
--random-output 1500 \
--random-range-ratio 1 \
--dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \
--max-concurrency 128

Error:
[2025-04-22 02:22:22 DP2 TP4] Scheduler hit an exception: Traceback (most recent call last):
File "/home/nas/code/sglang/python/sglang/srt/managers/scheduler.py", line 2015, in run_scheduler_process
scheduler.event_loop_normal_disagg_decode()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/nas/code/sglang/python/sglang/srt/disaggregation/decode.py", line 456, in event_loop_normal_disagg_decode
self.process_batch_result(batch, result)
File "/home/nas/code/sglang/python/sglang/srt/managers/scheduler.py", line 1408, in process_batch_result
self.process_batch_result_decode(batch, result)
File "/home/nas/code/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 194, in process_batch_result_decode
next_token_ids = next_token_ids.tolist()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Environment:
# python3 -m sglang.check_env
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.127.08
PyTorch: 2.5.1+cu124
sglang: 0.4.5.post1
sgl_kernel: 0.0.9.post2
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.51.1
torchao: 0.9.0
numpy: 2.2.4
aiohttp: 3.11.14
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.30.2
interegular: 0.3.3
modelscope: 1.24.1
orjson: 3.10.16
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.11.1
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.17
openai: 1.69.0
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.65.0
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA AffinityGPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE SYS SYS 0-47,96-143 0 /A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE SYS SYS 0-47,96-143 0 /A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE SYS SYS 0-47,96-143 0 /A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE PIX SYS SYS 0-47,96-143 0 /A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS PIX NODE 48-95,144-191 1 /A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS NODE NODE 48-95,144-191 1 /A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS NODE PIX 48-95,144-191 1 /A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS NODE NODE 48-95,144-191 1 /A
NIC0 NODE PIX NODE NODE SYS SYS SYS SYS X NODE SYS SYS
NIC1 NODE NODE NODE PIX SYS SYS SYS SYS NODE X SYS SYS
NIC2 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS X NODE
NIC3 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
ulimit soft: 1048576
According to @ch-wan's response, the issue involves the first three dense layers and the lm_head. I have verified that after making both the dense layers and lm_head fully DP, this fix can run correctly.
@TianyuZhang1214 Please check these two PRs: #5558 and #5657
Thanks for your reply! Your PRs have consistently resolved our issues with remarkable efficiency! I'm currently at Ant Group and have just sent a detailed email to your Gmail account regarding potential collaboration opportunities. Could you kindly review it at your earliest convenience?
@ch-wan Thank you for sharing PRs #5558 and #5657. I've tested them by deploying SGLang across 4 H20 nodes (8×96G), configured as shown in the launch scripts below.
Then I encountered the following error message on the prefill node:
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/nas/code/sglang/python/sglang/srt/models/deepseek_v2.py", line 1501, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nas/code/sglang/python/sglang/srt/models/deepseek_v2.py", line 1425, in forward
hidden_states, residual = layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/home/nas/code/sglang/python/sglang/srt/models/deepseek_v2.py", line 1214, in forward
return self.forward_ffn_with_full_input(
File "/home/nas/code/sglang/python/sglang/srt/models/deepseek_v2.py", line 1260, in forward_ffn_with_full_input
dp_gather_partial(hidden_states, local_hidden_states, forward_batch)
File "/home/nas/code/sglang/python/sglang/srt/layers/dp_attention.py", line 270, in dp_gather_partial
_dp_gather(global_tokens, local_tokens, forward_batch, is_partial=True)
File "/home/nas/code/sglang/python/sglang/srt/layers/dp_attention.py", line 237, in _dp_gather
local_start_pos, local_num_tokens = get_dp_local_info(forward_batch)
File "/home/nas/code/sglang/python/sglang/srt/layers/dp_attention.py", line 177, in get_dp_local_info
cumtokens = torch.cumsum(forward_batch.global_num_tokens_gpu, dim=0)
TypeError: cumsum() received an invalid combination of arguments - got (NoneType, dim=int), but expected one of:
* (Tensor input, int dim, *, torch.dtype dtype = None, Tensor out = None)
* (Tensor input, name dim, *, torch.dtype dtype = None, Tensor out = None)

Launch script:
# Prefill Node 0 (1 is the same)
MOONCAKE_CONFIG_PATH=./prefill_node_0.json SUPPORT_CUTLASS_BLOCK_FP8=1 python3 -m sglang.launch_server \
--model-path /home/moyun.zty/models/deepseek-ai__DeepSeek-R1 \
--disaggregation-mode prefill \
--host 10.13.3.156 \
--port 30001 \
--trust-remote-code \
--dist-init-addr 10.13.3.156:50000 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 16 \
--enable-dp-attention \
--enable-deepep-moe \
--deepep-mode normal \
--mem-fraction-static 0.9 \
--quantization fp8 \
--log-level debug \
--chunked-prefill-size 8196 \
--disable-radix-cache \
--context-length 65535 \
--max-running-requests 128 \
--stream-output \
--log-requests \
--attention-backend flashinfer \
--enable-mixed-chunk \
--flashinfer-mla-disable-ragged \
> sglang-prefill.log 2>&1 &

# Decode Node 0 (1 is the same)
MOONCAKE_CONFIG_PATH=./prefill_node_0.json SUPPORT_CUTLASS_BLOCK_FP8=1 python3 -m sglang.launch_server \
--model-path /home/moyun.zty/models/deepseek-ai__DeepSeek-R1 \
--disaggregation-mode decode \
--host 10.13.3.169 \
--port 30001 \
--trust-remote-code \
--dist-init-addr 10.13.3.169:50001 \
--nnodes 2 \
--node-rank 0 \
--tp-size 16 \
--dp-size 16 \
--enable-dp-attention \
--enable-deepep-moe \
--deepep-mode low_latency \
--moe-dense-tp-size 1 \
--mem-fraction-static 0.85 \
--quantization fp8 \
--log-level debug \
--disable-radix-cache \
--context-length 65535 \
--max-running-requests 128 \
--stream-output \
--log-requests \
--attention-backend flashinfer \
--enable-mixed-chunk \
--flashinfer-mla-disable-ragged \
> sglang-decode.log 2>&1 &
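For reference, the `TypeError` above just means `forward_batch.global_num_tokens_gpu` was never populated before `dp_gather_partial` ran, so `torch.cumsum` received `None`. A minimal sketch of that failure mode with a guard that surfaces it earlier (hypothetical helper loosely modeled on `get_dp_local_info`; the actual fix is in the PRs referenced above):

```python
import torch


def get_dp_local_info_sketch(global_num_tokens_gpu, dp_rank):
    """Hypothetical guard, loosely modeled on dp_attention.get_dp_local_info.

    Only illustrates why cumsum() raised: the per-rank token counts were
    never populated on this rank before the DP gather was attempted.
    """
    if global_num_tokens_gpu is None:
        raise RuntimeError(
            "forward_batch.global_num_tokens_gpu is unset; DP metadata was "
            "not prepared before dp_gather_partial was called"
        )
    cumtokens = torch.cumsum(global_num_tokens_gpu, dim=0)
    local_start_pos = 0 if dp_rank == 0 else int(cumtokens[dp_rank - 1])
    local_num_tokens = int(global_num_tokens_gpu[dp_rank])
    return local_start_pos, local_num_tokens
```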
@TianyuZhang1214 Hi, could you please provide the MOONCAKE_CONFIG for me?

Motivation
In DP attention, the batch size of the CUDA graph is expanded to the sum of the batch sizes across all DP ranks so that the global tokens can be `all_gather`ed before the MLP forward. Since decoding is memory-bound, this padding by itself does not introduce significant performance overhead. Under DeepEP, however, the padded tokens from every DP rank also participate in dispatch/combine operations, resulting in a DP-times increase in communication cost. DeepEP's MoE computation uses a fixed shape with masking instead of gathering tokens from all DP ranks, so padding the batch size up to the global token count is unnecessary.
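As a rough illustration of the communication blow-up described above (assumed numbers, not taken from a real run):

```python
# Assumed numbers for illustration only:
dp_size = 16   # DP/EP ranks, matching the EP16 benchmark below
local_bs = 8   # decode tokens actually present on one rank

# Before this fix: the CUDA graph batch is padded to the sum across all
# DP ranks, and every rank pushes that padded batch through dispatch/combine.
padded_bs = dp_size * local_bs   # 128 tokens per rank
# After this fix: each rank only dispatches its own local tokens.
unpadded_bs = local_bs           # 8 tokens per rank

print(padded_bs / unpadded_bs)   # 16.0 -> a DP-times blow-up in communication
```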
Before this fix:

After this fix:

On H20 with EP16 and 120-batch decoding under CUDA graph, combine time has been reduced by 10x, and TPOT has decreased from 300 ms to ~100 ms.
Modifications
Checklist