
Custom All Reduce for Piecewise Cuda Graph #15356

Merged
ispobock merged 16 commits into sgl-project:main from Oasis-Git:car
Dec 25, 2025
Conversation


@Oasis-Git Oasis-Git commented Dec 18, 2025

Motivation

Enable Custom All Reduce in Piecewise CUDA Graph. Equal contribution with @ByronHsu.

Modifications

  1. Split the compile phase from the warmup-capture phase to make sure replay works during PCG (piecewise CUDA graph) init.
  2. Enable graph_capture during capture.
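As a rough illustration of change (1), here is a toy sketch of the two-phase flow. The class and method names (`PiecewiseCudaGraphRunner`, `compile_model`, `warmup_and_capture`) are illustrative only, not SGLang's actual API:

```python
# Hypothetical sketch of splitting compilation from warmup/capture.
# The real runner lives in piecewise_cuda_graph_runner.py and is far
# more involved; this only models the ordering constraint.
class PiecewiseCudaGraphRunner:
    def __init__(self):
        self.compiled = False
        self.captured_shapes = []

    def compile_model(self, shapes):
        # Phase 1: torch.compile only -- no graph capture, no replay.
        self.compiled = True

    def warmup_and_capture(self, shapes):
        # Phase 2: warmup + CUDA graph capture, strictly after compile
        # has finished, so replay during PCG init sees allocated buffers.
        assert self.compiled, "must compile before capturing"
        self.captured_shapes = list(shapes)
```

Keeping the two phases strictly ordered is what guarantees the custom all-reduce buffers exist before any graph is replayed.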

Accuracy Tests

# Server
python3 -m sglang.launch_server --model-path Qwen/Qwen3-8B --enable-piecewise-cuda-graph \
     --tp 2 \
     --piecewise-cuda-graph-max-tokens 2048

# Client
python3 sglang/benchmark/gsm8k/bench_sglang.py --num-questions 1319 --parallel 1319 --port 30000
100%|██████████| 1319/1319 [00:10<00:00, 127.84it/s]
Accuracy: 0.908
Invalid: 0.000
Latency: 10.396 s
Output throughput: 15629.471 token/s

For VL Model:

# Server
python -m sglang.launch_server --model Qwen/Qwen2.5-VL-7B-Instruct --tp 4 \
    --enable-piecewise-cuda-graph \
    --disable-radix-cache
    
# Client
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --dataset-name image \
  --num-prompts 256 \
  --apply-chat-template \
  --random-input-len 128 \
  --random-output-len 32 \
  --image-resolution 560x560 \
  --image-format jpeg \
  --image-count 1 \
  --image-content random \
  --random-range-ratio 0.1 \
  --port 30000 \
  --max-concurrency 32
  
Created 256 random jpeg images with average 316335 bytes per request
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|██████████| 256/256 [00:14<00:00, 18.11it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  14.14     
Total input tokens:                      126306    
Total input text tokens:                 23394     
Total input vision tokens:               102912    
Total generated tokens:                  4541      
Total generated tokens (retokenized):    4513      
Request throughput (req/s):              18.10     
Input token throughput (tok/s):          8932.69   
Output token throughput (tok/s):         321.15    
Peak output token throughput (tok/s):    522.00    
Peak concurrent requests:                59        
Total token throughput (tok/s):          9253.84   
Concurrency:                             31.68     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1750.05   
Median E2E Latency (ms):                 1660.67   
---------------Time to First Token----------------
Mean TTFT (ms):                          870.50    
Median TTFT (ms):                        618.02    
P99 TTFT (ms):                           2365.29   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          50.13     
Median TPOT (ms):                        49.86     
P99 TPOT (ms):                           135.72    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           52.74     
Median ITL (ms):                         4.36      
P95 ITL (ms):                            435.08    
P99 ITL (ms):                            623.28    
Max ITL (ms):                            1451.89   
==================================================

Checklist

Signed-off-by: Oasis-Git <ayw.sirius19@gmail.com>
@ByronHsu
Collaborator

ref: #14193

class CustomAllreduce:
    _SUPPORTED_WORLD_SIZES = [2, 4, 6, 8]
-   _MAX_CAR_SIZE = 8192 * 1024
+   _MAX_CAR_SIZE = 8192 * 1024 * 4
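For context, the new cap happens to equal exactly one fp16 hidden-state tensor at 2048 tokens for an 8192-dim model. The sketch below shows that arithmetic; the formula is my assumption about how such a maximum could be derived, not code from the PR:

```python
def car_max_size_bytes(max_tokens: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    # Hypothetical derivation: the custom all-reduce buffer must hold one
    # hidden-state tensor of shape (max_tokens, hidden_size) in fp16/bf16.
    return max_tokens * hidden_size * dtype_bytes

# 2048 tokens x 8192 hidden dims x 2 bytes == 8192 * 1024 * 4 bytes (32 MiB)
assert car_max_size_bytes(2048, 8192) == 8192 * 1024 * 4
```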
Collaborator Author

There should be a fix here: we need to calculate the max_size for the all-reduce buffer.

Collaborator

Can we add a comment here explaining how it's calculated?

Collaborator Author

@Oasis-Git Oasis-Git left a comment

Fixed it here:

# if not entry.use_cudagraph or skip_cuda_graphs:
#     return entry.runnable(*args)
if is_in_torch_compile():
    return entry.runnable(*args)
Collaborator

noob question: what is this for?

Collaborator Author

Basically, we have to prevent replay from happening during capture, since the custom all-reduce buffer has not been allocated yet. Now that the capture function does only warmup and capture, we should also avoid counting any compile operations into the piecewise CUDA graph backend. So, in the compile stage, we skip all later processing and directly return the eager run.
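A hedged sketch of how such a compile-phase guard could be implemented with a context variable; SGLang's actual `is_in_torch_compile` may well be implemented differently:

```python
# Illustrative guard: a contextvar flag that is set while torch.compile
# is tracing, so dispatch code can fall back to the eager runnable.
import contextvars
from contextlib import contextmanager

_in_compile = contextvars.ContextVar("in_torch_compile", default=False)

def is_in_torch_compile() -> bool:
    # True only inside a torch_compile_region() block.
    return _in_compile.get()

@contextmanager
def torch_compile_region():
    # Set the flag for the duration of compilation, then restore it,
    # even if compilation raises.
    token = _in_compile.set(True)
    try:
        yield
    finally:
        _in_compile.reset(token)
```

With this, the dispatch path above can cheaply check the flag and return the eager runnable during compilation, without ever touching the not-yet-allocated all-reduce buffers.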

Collaborator Author

Without this protection, the warmup could happen inside the warmup_and_torch_compile function. For example, if we capture shapes from 4 to 4096, then after the first run with 4096 in warmup_and_torch_compile, every other shape from 4 to 3840 would be warmed up as well, since no recompile occurs for dense models.
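A toy model of that dispatch behavior, heavily simplified (the `dispatch` function and the `captured` list are illustrative, not the real piecewise backend):

```python
# During torch.compile, every shape falls through to eager and is NOT
# recorded for capture; only the dedicated capture phase records shapes.
captured = []

def dispatch(num_tokens: int, in_compile: bool) -> str:
    if in_compile:
        # Mirrors the is_in_torch_compile() early return above.
        return "eager"
    captured.append(num_tokens)
    return "capture"

# First run with 4096 inside warmup_and_torch_compile: eager, not captured.
assert dispatch(4096, in_compile=True) == "eager"
assert captured == []

# Later shapes during the real capture phase are recorded as intended.
for n in (4, 8, 16):
    dispatch(n, in_compile=False)
assert captured == [4, 8, 16]
```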

@ByronHsu
Collaborator

could you test qwen vl as well? thanks

@hebiao064 hebiao064 self-assigned this Dec 18, 2025
@Oasis-Git
Collaborator Author

could you test qwen vl as well? thanks

The results for the VL model are updated above. Please check!

@Oasis-Git
Collaborator Author

/tag-run-ci-label

@ispobock
Collaborator

ispobock commented Dec 19, 2025

/tag-and-rerun-ci

@ByronHsu
Collaborator

btw, we can remove use_original_ca_comm and disable_ca_comm from piecewise_cuda_graph_runner.py?

@Oasis-Git
Collaborator Author

btw, we can remove use_original_ca_comm and disable_ca_comm from piecewise_cuda_graph_runner.py?

True

@ByronHsu
Collaborator

can you resolve the conflict?

@Oasis-Git
Collaborator Author

can you resolve the conflict?

Solved.

@Oasis-Git
Collaborator Author

/rerun-failed-ci

@Oasis-Git
Collaborator Author

/rerun-failed-ci

@ispobock
Collaborator

@ispobock ispobock merged commit 5c243ba into sgl-project:main Dec 25, 2025
427 of 448 checks passed
@Oasis-Git Oasis-Git deleted the car branch December 25, 2025 23:14
# mind-exploding: carefully manage the reference and memory.
with torch.cuda.graph(cudagraph, pool=self.graph_pool):
    stream = get_pcg_capture_stream()
    assert stream is not None, "PCG capture stream is not set"
Contributor

Why can't we use "stream is None" here?

YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
5 participants