Support overlapping two batches #4068
Conversation
Hi, is this a minimally runnable version of two-batch overlap? That is, could we directly run/test it on two H800 nodes?
@agiping Hi, this PR is currently still in the "Draft PR" state, i.e. I am working on it. When it is done, I will convert it to non-draft. Indeed, I continued programming today; I had been waiting for the DeepGEMM and DeepEP integrations for several weeks, which are prerequisites of this PR.
[2025-04-10 06:54:25 TP4] MLA optimization is turned on. Use flashmla decode.
[2025-04-10 06:54:25 TP4] DeepEP is turned on. DeepEP mode: None
File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 1227, in __init__
self.mlp = DeepseekV2MoE(
File "/workspace/github/sglang/python/sglang/srt/models/deepseek_v2.py", line 220, in __init__
dict(deepep_mode=DeepEPMode[global_server_args_dict["deepep_mode"]])
File "/usr/lib/python3.10/enum.py", line 440, in __getitem__
return cls._member_map_[name]
KeyError: None

looks buggy here
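The `KeyError: None` comes from `DeepEPMode[...]` receiving a value that is not a member name: `Enum.__getitem__` looks the name up in `_member_map_` and raises `KeyError` otherwise. A minimal sketch of the failure mode and a guarded lookup (the member names below are assumptions for illustration, not sglang's actual enum):

```python
from enum import Enum


class DeepEPMode(Enum):
    # Hypothetical stand-in for sglang's DeepEPMode; member names are assumed.
    normal = "normal"
    low_latency = "low_latency"
    auto = "auto"


def resolve_deepep_mode(name):
    """Guarded lookup: Enum.__getitem__ raises KeyError for anything that is
    not a member name (including None), which is exactly the crash in the log."""
    try:
        return DeepEPMode[name]
    except KeyError:
        # Fall back to a default instead of crashing during model init.
        return DeepEPMode.auto


print(resolve_deepep_mode("normal"))  # DeepEPMode.normal
print(resolve_deepep_mode(None))      # falls back to DeepEPMode.auto
```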
missing
btw, is there a chance to decouple this feature's dependency on deepep-moe? For non-NVIDIA chips, there is no easy replacement for IBGDA/NVSHMEM yet. Thanks.
After this series of PRs is merged, you can have a look; there are also some tools in it that may be useful for other kinds of two-batch overlap.
I tested using the latest branch from your repository and found that it ran into an error. The command I used is as follows. I tested the following cases:
The environment I used is a single machine with 8 H800 cards, and the model was reduced in layers (down to 20 hidden layers) to ensure that there is no OOM issue.
I will get back to two batch overlap after EPLB |
# Conflicts:
#   python/sglang/srt/operations_strategy.py
Hello! Your work is great! May I ask whether you have considered splitting the input into multiple chunks before the GEMM and hiding the communication through multiple streams? I experimented with this and found that, although it is a coarse-grained approach, there are some throughput gains.
How do you split when the input batch size is 1, e.g., during warm-up or for a single request?
There doesn't seem to be a need to split in this case; I've made a simple example: #6923
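The point above can be sketched in a few lines (a hypothetical helper, not the PR's actual code): splitting produces two micro-batches only when there are at least two requests to divide, so a batch of size 1 is simply passed through unsplit.

```python
def split_into_micro_batches(requests):
    """Split a batch roughly in half for two-batch overlap.

    With fewer than two requests there is nothing to overlap,
    so the batch is returned as a single micro-batch.
    """
    if len(requests) < 2:
        return [requests]  # no split: one micro-batch, no overlap possible
    mid = len(requests) // 2
    return [requests[:mid], requests[mid:]]


print(split_into_micro_batches(["r0"]))              # [['r0']]
print(split_into_micro_batches(["r0", "r1", "r2"]))  # [['r0'], ['r1', 'r2']]
```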
Hello, I'd like to ask a question: where can I find the code for the scheduling of the two-micro-batch decode stage? I want to learn about its implementation. Thanks! @fzyzcjy
Just check the code diff.
Hello, I have a question from checking the code: I can't find where the dispatch operator is put onto another stream in two-batch overlap, and I can't find how the attention operator is made to overlap with the dispatch operator (I think it uses an event, but I can't find the code). Could you help me find more information? Thanks! @fzyzcjy
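For context on the stream/event question: the usual CUDA pattern is to launch the dispatch (communication) on a side stream, record an event when it finishes, and have the main stream wait on that event before consuming the dispatched data. The ordering constraint can be simulated in pure Python with `threading.Event` standing in for a CUDA event (this is an illustrative sketch with made-up stage names, not sglang's actual implementation):

```python
import threading

log, lock = [], threading.Lock()


def record(msg):
    with lock:
        log.append(msg)


comm_done = threading.Event()  # plays the role of a recorded CUDA event


def comm_stream():
    # Side "stream": dispatch batch B's tokens, then record the event.
    record("dispatch(batch B) start")
    record("dispatch(batch B) done")
    comm_done.set()  # analogous to cudaEventRecord on the comm stream


def compute_stream():
    # Main "stream": batch A's compute can overlap with B's dispatch.
    record("attention(batch A)")
    comm_done.wait()  # analogous to cudaStreamWaitEvent before using B's data
    record("moe(batch B)")


t = threading.Thread(target=comm_stream)
t.start()
compute_stream()
t.join()
```

The only hard ordering guarantee is that `moe(batch B)` runs after `dispatch(batch B) done`; `attention(batch A)` is free to overlap with the dispatch, which is the whole point of the technique.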
Hello @fzyzcjy, is TBO okay to run on L20, not Hopper? If yes, could you please share the command to run it on L20?
So DeepEP and DeepGEMM are still necessary? Do I have to use Hopper or a newer architecture? (I tried A10 and A100 with TBO, but could not see real timeline overlap.)
@fzyzcjy Hi! I have tested TBO, but cannot see comm/compute overlap. Could anyone kindly tell me why? Thanks!

Update
If you want to try PD + EPLB + two-batch-overlap + ..., here is the branch that merges everything before they are merged into master: https://github.com/fzyzcjy/sglang/tree/feat/dev_branch
2025.03.26
Just now I ran some benchmarks on 8xH200 and there seem to be performance improvements. Note that I have not done careful tuning, because I am still waiting for the kernels and features (e.g., DeepGEMM for grouped GEMM, DeepEP low-latency). Also, other orthogonal techniques, such as reducing imbalance between GPUs, may help further.
Experiment setup
Command
For the baseline and this PR, change `{{extra_args}}` to an empty string and `--enable-two-batch-overlap`, respectively.
`random-output` is set to 1 deliberately to disable the decode phase, because decode relies on low-latency kernels and CUDA Graph support, which are not there yet.
The bench-serving script is repeated 5 times, and the 1st run is thrown away (because it contains JIT compilation etc.).
Experiment result
Throughput
On average, it improves throughput by 6.4%. Again, since the dependent PRs are not in yet, this is a very preliminary number without the real kernels and careful optimization.
2025.03.20
Current status
Since both the DeepGEMM and DeepEP integrations are finally ready (they are prerequisites of this PR), today I updated the code. It now works with the new DeepEP and uses vanilla non-generator-based code (because yield/generator support for torch.compile will not be available until the next PyTorch release).
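The generator-based style mentioned above can be sketched in pure Python: each batch is a generator that yields at stage boundaries, and a scheduler advances the two generators alternately so one batch's compute stage lines up with the other's communication stage. The stage names here are hypothetical, not the PR's actual operations:

```python
def batch_stages(name, trace):
    # Each yield is a stage boundary where the scheduler may switch batches.
    trace.append(f"{name}: attention")
    yield
    trace.append(f"{name}: dispatch (comm)")
    yield
    trace.append(f"{name}: moe + combine")


def run_two_batch_overlap(trace):
    a = batch_stages("A", trace)
    b = batch_stages("B", trace)
    # Interleave: advance A and B alternately, so while one batch sits at its
    # communication stage, the other batch's compute stage is being driven.
    for gen in (a, b, a, b, a, b):
        next(gen, None)  # default swallows StopIteration on the final stage


trace = []
run_two_batch_overlap(trace)
print(trace)
```

On a real GPU the "overlap" comes from the dispatch being asynchronous communication, so driving batch B's compute while batch A's dispatch is in flight hides the communication latency; the generator form merely makes the interleaving points explicit, which is what torch.compile could not yet trace.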
What to do next
- More correctness tests (awaiting H100 GPU to be free) ---> 2025.03.21 morning: H100 is free now, MMLU passes
- Check profile results to see there does exist overlap (awaiting H100 GPU to be free) ---> 2025.03.21 morning: Yes
- Code cleanup and make PR ready (awaiting correctness tests) ---> 2025.03.21 morning: done
- (awaiting correctness tests above, awaiting kernels)

2025.03.04 (Outdated)
Details
Currently, this is just a draft, hacky implementation, because I need to wait for the integration of DeepEP/DeepGEMM/etc. before doing careful performance tuning.
The generation output looks roughly reasonable:
The profile timeline shows the two batches interleaving, with one batch's communication overlapping the other batch's computation. (CUDA Graph is not enabled yet, since I hacked the part that will be replaced by DeepEP etc., and it seems not CUDA-graph compatible.)
The code is quite hacky and will be refactored later.
Motivation
Modifications
Checklist