
[POC] Overlap scheduler refactor with SD#9334

Closed
hnyls2002 wants to merge 50 commits into main from lsyin/poc-overlap-spec

Conversation

hnyls2002 (Collaborator) commented Aug 19, 2025

Keeping this branch aligned with main to track compatibility. From #8490.

All branches measured at ACC_LEN=3.02:

|Branch           |  main|lianmin/overlap-spec|hanming/overlap-spec-w-batch|lsyin/poc-overlap-spec|
|-----------------|-----:|-------------------:|---------------------------:|---------------------:|
|Throughput (BS=1)|249.73|              268.34|                      261.61|                269.30|

@Fridge003 Fridge003 self-assigned this Sep 9, 2025
@JustinTong0323 JustinTong0323 self-assigned this Sep 17, 2025
cicirori (Collaborator) commented
Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 575.57.08
PyTorch: 2.8.0+cu129
sglang: 0.5.3rc0
sgl_kernel: 0.3.12
flashinfer_python: 0.4.0rc1
triton: 3.4.0
transformers: 4.56.1
torchao: 0.9.0
numpy: 2.3.3
aiohttp: 3.12.15
fastapi: 0.117.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.1
interegular: 0.3.3
modelscope: 1.30.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.0
pydantic: 2.11.9
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.37.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.24
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.68.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    64-127,192-255  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     64-127,192-255  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    64-127,192-255  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    64-127,192-255  1               N/A
NIC0    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC3    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      PIX     NODE    SYS     SYS     SYS     SYS
NIC4    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX      X      NODE    SYS     SYS     SYS     SYS
NIC5    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      NODE
NIC9    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_11
  NIC8: mlx5_12
  NIC9: mlx5_13


ulimit soft: 1048576

 python3 -m sglang.launch_server --model-path /scratch/huggingface/DeepSeek-V3-0324-FP4/ --trust-remote-code --quantization modelopt_fp4 --tp "4" --attention-backend triton --moe-runner-backend flashinfer_trtllm --speculative-algorithm EAGLE --speculative-num-steps "3" --speculative-eagle-topk "1" --speculative-num-draft-tokens "4" --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' --host 0.0.0.0 --port 17345 --enable-metrics --enable-flashinfer-allreduce-fusion

lm_eval --model local-chat-completions --model_args model=dummy,base_url=http://127.0.0.1:17345/v1/chat/completions,num_concurrent=8,timeout=999999,max_gen_toks=2048 --tasks gsm8k --batch_size 8 --apply_chat_template --num_fewshot 8 --output triton/ --log_samples
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.2896|±  |0.0125|
|     |       |strict-match    |     8|exact_match|↑  |0.9439|±  |0.0063|

Did you observe this? It looks like the stopping logic has some problems.

"resps": [["James runs 3 sprints * 60 meters/sprint = <<3*60=180>>180 meters per session.\nHe runs 3 times a week, so he runs 180 meters/session * 3 sessions/week = <<180*3=540>>540 meters per week.\n#### 540# Answers\n\nSolutions by everydaycalculation.com\n\n## Reduce 2000/1120 to lowest terms\n\nThe simplest form of 2000/1120 is 25/14.\n\n#### Steps to simplifying fractions\n\n- Find the GCD (or HCF) of numerator and denominator\nGCD of 2000 and 1120 is 80\n- Divide both the numerator and denominator by the GCD\n2000 ÷ 80/1120 ÷ 80\n- Reduced fraction: 25/14\nTherefore, 2000/1120 simplified to lowest terms is 25/14.\n\nMathStep (Works offline)\n\nDownload our mobile app and learn to work with fractions in your own time:\nAndroid and iPhone/ iPad\n\n© everydaycalculation.com"]], "filtered_resps": ["14."], "filter": "flexible-extract", "metrics": ["exact_match"], "doc_hash": "8823f438f491a49f230c43d5bc7c20dcf512b5dc15004a22a612f21ce2cdfcc4", "prompt_hash": "2103c609efb66e5c2e86ca29b5fe762b8b17e1c12dee090651525d6727c3b6cc", "target_hash": "9824a3c8646fc7035b54561fc9f69be4f9056dfe912c7ac4afc2e1a87bac8be7", "exact_match": 0.0}
"resps": [["James runs 3 sprints * 60 meters/sprint = <<3*60=180>>180 meters per session.\nHe runs 3 times a week, so he runs 180 meters/session * 3 sessions/week = <<180*3=540>>540 meters per week.\n#### 540# Answers\n\nSolutions by everydaycalculation.com\n\n## Reduce 2000/1120 to lowest terms\n\nThe simplest form of 2000/1120 is 25/14.\n\n#### Steps to simplifying fractions\n\n- Find the GCD (or HCF) of numerator and denominator\nGCD of 2000 and 1120 is 80\n- Divide both the numerator and denominator by the GCD\n2000 ÷ 80/1120 ÷ 80\n- Reduced fraction: 25/14\nTherefore, 2000/1120 simplified to lowest terms is 25/14.\n\nMathStep (Works offline)\n\nDownload our mobile app and learn to work with fractions in your own time:\nAndroid and iPhone/ iPad\n\n© everydaycalculation.com"]], "filtered_resps": ["540"], "filter": "strict-match", "metrics": ["exact_match"], "doc_hash": "8823f438f491a49f230c43d5bc7c20dcf512b5dc15004a22a612f21ce2cdfcc4", "prompt_hash": "2103c609efb66e5c2e86ca29b5fe762b8b17e1c12dee090651525d6727c3b6cc", "target_hash": "9824a3c8646fc7035b54561fc9f69be4f9056dfe912c7ac4afc2e1a87bac8be7", "exact_match": 1.0}

cicirori (Collaborator) commented Sep 26, 2025

commit: adc453d9b4ed82356baea6f977bbf135d738462f (lsyin/poc-overlap-spec)
Removed the other kernel configs and used only --attention-backend triton.
(Without CUDA_LAUNCH_BLOCKING=1 the server hangs.)

 CUDA_LAUNCH_BLOCKING=1 \
 python3 -m sglang.launch_server \
 --model nvidia/DeepSeek-V3-0324-FP4  \
 --trust-remote-code \
 --quantization modelopt_fp4 \
 --tp "4" --attention-backend triton \
  --speculative-algorithm EAGLE --speculative-num-steps "3" --speculative-eagle-topk "1" \
  --speculative-num-draft-tokens "4" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
  --host 0.0.0.0 --port 17345 --enable-metrics \
  --max-running-requests 64 \
  --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

eval command:

lm_eval --model local-chat-completions  \
--model_args model=dummy,base_url=http://127.0.0.1:17345/v1/chat/completions,num_concurrent=32,timeout=999999,max_gen_toks=2048 \
  --tasks gsm8k --batch_size 32 --apply_chat_template --num_fewshot 8 

result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.6088|±  |0.0134|
|     |       |strict-match    |     8|exact_match|↑  |0.9454|±  |0.0063|

commit : fc3e54200932b653d359b206d7dcceffa0d76718 (merge base w/ main)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9492|±  |0.0060|
|     |       |strict-match    |     8|exact_match|↑  |0.9469|±  |0.0062|

donglinz commented Oct 2, 2025

> (Quoting @cicirori's report above.)

I can reproduce this. It is likely caused by inconsistent all-reduce (AR) state across the ranks:

kernel dump on rank 0:

Thread 1 "sglang::schedul" stopped.
[Switching focus to CUDA kernel 0, grid 479778, block (0,0,0), thread (0,0,0), device 4, sm 142, warp 2, lane 0]
0x00007fc79f7feac0 in void sglang::cross_device_reduce_1stage<__nv_bfloat16, 4>(sglang::RankData*, sglang::RankSignals, sglang::Signal*, __nv_bfloat16*, int, int)
   <<<(36,1,1),(512,1,1)>>> ()
(cuda-gdb)
(cuda-gdb)
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status                                 SMs Mask  GridDim  BlockDim Invocation
*      0      -   4 479778 Active 0x0fc0000000000000000000000000003fffffff (36,1,1) (512,1,1) sglang::cross_device_reduce_1stage<__nv_bfloat16, 4>()

kernel dump on rank 2:

Thread 1 "sglang::schedul" stopped.
[Switching focus to CUDA kernel 0, grid 479430, cluster (7,0,0), block (31,0,0), thread (0,0,0), device 6, sm 31, warp 3, lane 0]
0x00007f214b9de480 in ncclDevFunc_AllGather_RING_SIMPLE() ()
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status                                 SMs Mask  GridDim  BlockDim Invocation
*      0      -   6 479430 Active 0x00000000000000000000000000000080000000 (32,1,1) (544,1,1) ncclDevKernel_AllGather_RING_LL()

python stack trace when hang (same across ranks)

Process 17786: sglang::scheduler_TP0
Python v3.12.11 (/usr/bin/python3.12)

Thread 17786 (active): "MainThread"
    init_forward_metadata (triton_backend.py:341)
    forward_extend (model_runner.py:1877)
    _forward_raw (model_runner.py:1992)
    forward (model_runner.py:1941)
    forward_batch_generation (tp_worker.py:257)
    forward_batch_generation (eagle_worker_v2.py:233)
    run_batch (scheduler.py:2048)
    event_loop_overlap (scheduler.py:928)
    decorate_context (torch/utils/_contextlib.py:120)
    run_scheduler_process (scheduler.py:2904)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 18471 (idle): "Thread-1 (_read_thread)"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:61)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:195)
    run (threading.py:1012)
    _bootstrap_inner (threading.py:1075)
    _bootstrap (threading.py:1032)
Thread 19499 (idle): "Thread-2"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1075)
    _bootstrap (threading.py:1032)
Thread 19565 (idle): "Thread-3"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1075)
    _bootstrap (threading.py:1032)

This was referenced Oct 4, 2025
@zhyncs zhyncs closed this Nov 4, 2025
@zhyncs zhyncs deleted the lsyin/poc-overlap-spec branch November 4, 2025 23:49