Support overlapping two batches #4068
Conversation
Hi, is this a minimally runnable version of two-batch overlap? That is, could we directly run/test it on two H800 nodes?
@agiping Hi, this PR is currently still in the "Draft PR" state, i.e. I am working on it. When it is done, I will convert it to non-draft. Indeed, I continued programming today; I had been waiting for the DeepGEMM and DeepEP integrations for several weeks, which are prerequisites of this PR.
[2025-04-10 06:54:25 TP4] MLA optimization is turned on. Use flashmla decode.
[2025-04-10 06:54:25 TP4] DeepEP is turned on. DeepEP mode: None
File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 1227, in __init__
self.mlp = DeepseekV2MoE(
File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 220, in __init__
dict(deepep_mode=DeepEPMode[global_server_args_dict["deepep_mode"]])
File "/usr/lib/python3.10/enum.py", line 440, in __getitem__
return cls._member_map_[name]
KeyError: None

looks buggy here
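The `KeyError: None` comes from `DeepEPMode[...]` receiving a value that is not a member name: `Enum.__getitem__` looks the name up in `_member_map_` and raises `KeyError` otherwise. A minimal sketch of the failure mode and a guarded lookup (the member names below are assumptions for illustration, not sglang's actual enum):

```python
from enum import Enum


class DeepEPMode(Enum):
    # Hypothetical stand-in for sglang's DeepEPMode; member names are assumed.
    normal = "normal"
    low_latency = "low_latency"
    auto = "auto"


def resolve_deepep_mode(name):
    """Guarded lookup: Enum.__getitem__ raises KeyError for anything that is
    not a member name (including None), which is exactly the crash in the log."""
    try:
        return DeepEPMode[name]
    except KeyError:
        # Fall back to a default instead of crashing during model init.
        return DeepEPMode.auto


print(resolve_deepep_mode("normal"))  # DeepEPMode.normal
print(resolve_deepep_mode(None))      # falls back to DeepEPMode.auto
```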
missing
btw, is there a chance to decouple this feature's dependency on deepep-moe? For non-NVIDIA chips, there is no easy replacement for IBGDA/NVSHMEM yet. Thanks.
After this series of PRs is merged, you can have a look; there are also some tools in it that may be useful for other kinds of two-batch overlap.
I tested using the latest branch from your repository and found that it ran into an error. The command I used is as follows. I tested the following cases:
The environment I used is a single machine with 8 H800 cards, and the model was reduced in layers (down to 20 hidden layers) to ensure that there is no OOM issue.
I will get back to two batch overlap after EPLB |
# Conflicts:
#   python/sglang/srt/operations_strategy.py
Hello! Your work is great! May I ask whether you have considered splitting the input into multiple chunks before the GEMM and hiding the communication through multiple streams? I experimented with this and found that, although it is a coarse-grained approach, there are some throughput gains.
How do you split when the input batch size is 1, e.g., during warm-up or for a single request?
There doesn't seem to be a need to split in this case; I've made a simple example: #6923
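The point above can be sketched in a few lines (a hypothetical helper, not the PR's actual code): splitting produces two micro-batches only when there are at least two requests to divide, so a batch of size 1 is simply passed through unsplit.

```python
def split_into_micro_batches(requests):
    """Split a batch roughly in half for two-batch overlap.

    With fewer than two requests there is nothing to overlap,
    so the batch is returned as a single micro-batch.
    """
    if len(requests) < 2:
        return [requests]  # no split: one micro-batch, no overlap possible
    mid = len(requests) // 2
    return [requests[:mid], requests[mid:]]


print(split_into_micro_batches(["r0"]))              # [['r0']]
print(split_into_micro_batches(["r0", "r1", "r2"]))  # [['r0'], ['r1', 'r2']]
```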
Hello, I'd like to ask a question: where can I find the code for the scheduling of the two-micro-batch decode stage? I want to learn about its implementation. Thanks! @fzyzcjy
Just check the code diff.
Hello, I have a question from checking the code: I can't find where the dispatch operator is put onto another stream in two-batch overlap, and I can't find how the attention operator is made to overlap with the dispatch operator (I think it uses an event, but I can't find the code). Could you help me find more information? Thanks! @fzyzcjy
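For context on the stream/event question: the usual CUDA pattern is to launch the dispatch (communication) on a side stream, record an event when it finishes, and have the main stream wait on that event before consuming the dispatched data. The ordering constraint can be simulated in pure Python with `threading.Event` standing in for a CUDA event (this is an illustrative sketch with made-up stage names, not sglang's actual implementation):

```python
import threading

log, lock = [], threading.Lock()


def record(msg):
    with lock:
        log.append(msg)


comm_done = threading.Event()  # plays the role of a recorded CUDA event


def comm_stream():
    # Side "stream": dispatch batch B's tokens, then record the event.
    record("dispatch(batch B) start")
    record("dispatch(batch B) done")
    comm_done.set()  # analogous to cudaEventRecord on the comm stream


def compute_stream():
    # Main "stream": batch A's compute can overlap with B's dispatch.
    record("attention(batch A)")
    comm_done.wait()  # analogous to cudaStreamWaitEvent before using B's data
    record("moe(batch B)")


t = threading.Thread(target=comm_stream)
t.start()
compute_stream()
t.join()
```

The only hard ordering guarantee is that `moe(batch B)` runs after `dispatch(batch B) done`; `attention(batch A)` is free to overlap with the dispatch, which is the whole point of the technique.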
Hello @fzyzcjy, is TBO okay to run on L20, not Hopper? If yes, could you please share the command to run it on L20?
So DeepEP and DeepGEMM are still necessary? Do I have to use Hopper or a newer architecture? (I tried A10 and A100 with TBO, but could not see real timeline overlap.)
@fzyzcjy Hi! I have tested TBO, but cannot see comm/compute overlap. Could anyone kindly tell me why? Thanks!

Update
If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before they are merged into master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch
2025.03.26
Just now I ran some benchmarks on 8xH200 and there seem to be performance improvements. Note that I have not done careful tuning, because I am still waiting for the kernels and features (e.g., DeepGEMM for grouped GEMM, DeepEP low-latency). Also, other orthogonal techniques, such as reducing imbalance between GPUs, may help further.
Experiment setup
Command
For the baseline and this PR, change `{{extra_args}}` to an empty string and `--enable-two-batch-overlap`, respectively.
`random-output` is set to 1 deliberately to disable the decode phase, because decode relies on low-latency kernels and CUDA Graph support, which are not there yet.
The bench-serving script is repeated 5 times, and the 1st run is thrown away (because it contains JIT compilation etc.).
Experiment result
Throughput
On average, it improves throughput by 6.4%. Again, since the dependent PRs are not in yet, this is a very preliminary number without the real kernels and careful optimization.
2025.03.20
Current status
Since both the DeepGEMM and DeepEP integrations are finally ready (they are prerequisites of this PR), today I updated the code. It now works with the new DeepEP and uses vanilla non-generator-based code (because yield/generator support for torch.compile will not be available until the next PyTorch release).
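The generator-based style mentioned above can be sketched in pure Python: each batch is a generator that yields at stage boundaries, and a scheduler advances the two generators alternately so one batch's compute stage lines up with the other's communication stage. The stage names here are hypothetical, not the PR's actual operations:

```python
def batch_stages(name, trace):
    # Each yield is a stage boundary where the scheduler may switch batches.
    trace.append(f"{name}: attention")
    yield
    trace.append(f"{name}: dispatch (comm)")
    yield
    trace.append(f"{name}: moe + combine")


def run_two_batch_overlap(trace):
    a = batch_stages("A", trace)
    b = batch_stages("B", trace)
    # Interleave: advance A and B alternately, so while one batch sits at its
    # communication stage, the other batch's compute stage is being driven.
    for gen in (a, b, a, b, a, b):
        next(gen, None)  # default swallows StopIteration on the final stage


trace = []
run_two_batch_overlap(trace)
print(trace)
```

On a real GPU the "overlap" comes from the dispatch being asynchronous communication, so driving batch B's compute while batch A's dispatch is in flight hides the communication latency; the generator form merely makes the interleaving points explicit, which is what torch.compile could not yet trace.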
What to do next
- More correctness tests (awaiting H100 GPU to be free) ---> 2025.03.21 morning: H100 is free now, MMLU passes
- Check profile results to see there does exist overlap (awaiting H100 GPU to be free) ---> 2025.03.21 morning: Yes
- Code cleanup and make PR ready (awaiting correctness tests) ---> 2025.03.21 morning: done
- (awaiting correctness tests above, awaiting kernels)

2025.03.04 (Outdated)
Details
Currently, this is just a draft, hacky implementation, because I need to wait for the integration of DeepEP/DeepGEMM/etc. before doing careful performance tuning.
The generation output looks roughly reasonable:
The profile timeline shows the two batches interleaving, with one batch's communication overlapping the other batch's computation. (CUDA Graph is not enabled yet, since I hacked the part that will be replaced by DeepEP etc., and it seems not CUDA-graph compatible.)
The code is quite hacky and will be refactored later.
Motivation
Modifications
Checklist