
[POC] Overlap scheduler refactor with SD#9334

Closed
hnyls2002 wants to merge 50 commits into main from lsyin/poc-overlap-spec

Conversation

hnyls2002 (Collaborator) commented Aug 19, 2025

Keeping this branch aligned with main to track compatibility. From #8490.

All branches measured at ACC_LEN=3.02:

|Branch           |  main|lianmin/overlap-spec|hanming/overlap-spec-w-batch|lsyin/poc-overlap-spec|
|-----------------|-----:|-------------------:|---------------------------:|---------------------:|
|Throughput (BS=1)|249.73|              268.34|                      261.61|                269.30|

@Fridge003 Fridge003 self-assigned this Sep 9, 2025
@JustinTong0323 JustinTong0323 self-assigned this Sep 17, 2025
cicirori (Collaborator) commented
Python: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 575.57.08
PyTorch: 2.8.0+cu129
sglang: 0.5.3rc0
sgl_kernel: 0.3.12
flashinfer_python: 0.4.0rc1
triton: 3.4.0
transformers: 4.56.1
torchao: 0.9.0
numpy: 2.3.3
aiohttp: 3.12.15
fastapi: 0.117.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.1
interegular: 0.3.3
modelscope: 1.30.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.0
pydantic: 2.11.9
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.37.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.24
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.68.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    PIX     NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     0-63,128-191    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    64-127,192-255  1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     64-127,192-255  1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    64-127,192-255  1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    64-127,192-255  1               N/A
NIC0    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC2    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC3    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      PIX     NODE    SYS     SYS     SYS     SYS
NIC4    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX      X      NODE    SYS     SYS     SYS     SYS
NIC5    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     SYS     SYS      X      NODE    NODE    NODE
NIC7    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE     X      NODE    NODE
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      NODE
NIC9    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_11
  NIC8: mlx5_12
  NIC9: mlx5_13


ulimit soft: 1048576

 python3 -m sglang.launch_server --model-path /scratch/huggingface/DeepSeek-V3-0324-FP4/ --trust-remote-code --quantization modelopt_fp4 --tp "4" --attention-backend triton --moe-runner-backend flashinfer_trtllm --speculative-algorithm EAGLE --speculative-num-steps "3" --speculative-eagle-topk "1" --speculative-num-draft-tokens "4" --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' --host 0.0.0.0 --port 17345 --enable-metrics --enable-flashinfer-allreduce-fusion

lm_eval --model local-chat-completions --model_args model=dummy,base_url=http://127.0.0.1:17345/v1/chat/completions,num_concurrent=8,timeout=999999,max_gen_toks=2048 --tasks gsm8k --batch_size 8 --apply_chat_template --num_fewshot 8 --output triton/ --log_samples
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.2896|±  |0.0125|
|     |       |strict-match    |     8|exact_match|↑  |0.9439|±  |0.0063|

Did you observe this? It looks like the stopping logic has some problems.

"resps": [["James runs 3 sprints * 60 meters/sprint = <<3*60=180>>180 meters per session.\nHe runs 3 times a week, so he runs 180 meters/session * 3 sessions/week = <<180*3=540>>540 meters per week.\n#### 540# Answers\n\nSolutions by everydaycalculation.com\n\n## Reduce 2000/1120 to lowest terms\n\nThe simplest form of 2000/1120 is 25/14.\n\n#### Steps to simplifying fractions\n\n- Find the GCD (or HCF) of numerator and denominator\nGCD of 2000 and 1120 is 80\n- Divide both the numerator and denominator by the GCD\n2000 ÷ 80/1120 ÷ 80\n- Reduced fraction: 25/14\nTherefore, 2000/1120 simplified to lowest terms is 25/14.\n\nMathStep (Works offline)\n\nDownload our mobile app and learn to work with fractions in your own time:\nAndroid and iPhone/ iPad\n\n© everydaycalculation.com"]], "filtered_resps": ["14."], "filter": "flexible-extract", "metrics": ["exact_match"], "doc_hash": "8823f438f491a49f230c43d5bc7c20dcf512b5dc15004a22a612f21ce2cdfcc4", "prompt_hash": "2103c609efb66e5c2e86ca29b5fe762b8b17e1c12dee090651525d6727c3b6cc", "target_hash": "9824a3c8646fc7035b54561fc9f69be4f9056dfe912c7ac4afc2e1a87bac8be7", "exact_match": 0.0}
"resps": [["James runs 3 sprints * 60 meters/sprint = <<3*60=180>>180 meters per session.\nHe runs 3 times a week, so he runs 180 meters/session * 3 sessions/week = <<180*3=540>>540 meters per week.\n#### 540# Answers\n\nSolutions by everydaycalculation.com\n\n## Reduce 2000/1120 to lowest terms\n\nThe simplest form of 2000/1120 is 25/14.\n\n#### Steps to simplifying fractions\n\n- Find the GCD (or HCF) of numerator and denominator\nGCD of 2000 and 1120 is 80\n- Divide both the numerator and denominator by the GCD\n2000 ÷ 80/1120 ÷ 80\n- Reduced fraction: 25/14\nTherefore, 2000/1120 simplified to lowest terms is 25/14.\n\nMathStep (Works offline)\n\nDownload our mobile app and learn to work with fractions in your own time:\nAndroid and iPhone/ iPad\n\n© everydaycalculation.com"]], "filtered_resps": ["540"], "filter": "strict-match", "metrics": ["exact_match"], "doc_hash": "8823f438f491a49f230c43d5bc7c20dcf512b5dc15004a22a612f21ce2cdfcc4", "prompt_hash": "2103c609efb66e5c2e86ca29b5fe762b8b17e1c12dee090651525d6727c3b6cc", "target_hash": "9824a3c8646fc7035b54561fc9f69be4f9056dfe912c7ac4afc2e1a87bac8be7", "exact_match": 1.0}

cicirori (Collaborator) commented Sep 26, 2025

commit: adc453d9b4ed82356baea6f977bbf135d738462f (lsyin/poc-overlap-spec)
Removed the other kernel configs and used only --attention-backend triton.
(Without CUDA_LAUNCH_BLOCKING=1 the server hangs.)

 CUDA_LAUNCH_BLOCKING=1 \
 python3 -m sglang.launch_server \
 --model nvidia/DeepSeek-V3-0324-FP4  \
 --trust-remote-code \
 --quantization modelopt_fp4 \
 --tp "4" --attention-backend triton \
  --speculative-algorithm EAGLE --speculative-num-steps "3" --speculative-eagle-topk "1" \
  --speculative-num-draft-tokens "4" \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
  --host 0.0.0.0 --port 17345 --enable-metrics \
  --max-running-requests 64 \
  --cuda-graph-bs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64

eval command:

lm_eval --model local-chat-completions  \
--model_args model=dummy,base_url=http://127.0.0.1:17345/v1/chat/completions,num_concurrent=32,timeout=999999,max_gen_toks=2048 \
  --tasks gsm8k --batch_size 32 --apply_chat_template --num_fewshot 8 

result:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.6088|±  |0.0134|
|     |       |strict-match    |     8|exact_match|↑  |0.9454|±  |0.0063|

commit : fc3e54200932b653d359b206d7dcceffa0d76718 (merge base w/ main)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     8|exact_match|↑  |0.9492|±  |0.0060|
|     |       |strict-match    |     8|exact_match|↑  |0.9469|±  |0.0062|

donglinz commented Oct 2, 2025

> (Quoting @cicirori's report above.)

I can reproduce this. It is likely caused by inconsistent all-reduce (AR) state across the ranks:

kernel dump on rank 0:

Thread 1 "sglang::schedul" stopped.
[Switching focus to CUDA kernel 0, grid 479778, block (0,0,0), thread (0,0,0), device 4, sm 142, warp 2, lane 0]
0x00007fc79f7feac0 in void sglang::cross_device_reduce_1stage<__nv_bfloat16, 4>(sglang::RankData*, sglang::RankSignals, sglang::Signal*, __nv_bfloat16*, int, int)
   <<<(36,1,1),(512,1,1)>>> ()
(cuda-gdb)
(cuda-gdb)
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status                                 SMs Mask  GridDim  BlockDim Invocation
*      0      -   4 479778 Active 0x0fc0000000000000000000000000003fffffff (36,1,1) (512,1,1) sglang::cross_device_reduce_1stage<__nv_bfloat16, 4>()

kernel dump on rank 2:

Thread 1 "sglang::schedul" stopped.
[Switching focus to CUDA kernel 0, grid 479430, cluster (7,0,0), block (31,0,0), thread (0,0,0), device 6, sm 31, warp 3, lane 0]
0x00007f214b9de480 in ncclDevFunc_AllGather_RING_SIMPLE() ()
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status                                 SMs Mask  GridDim  BlockDim Invocation
*      0      -   6 479430 Active 0x00000000000000000000000000000080000000 (32,1,1) (544,1,1) ncclDevKernel_AllGather_RING_LL()

python stack trace when hang (same across ranks)

Process 17786: sglang::scheduler_TP0
Python v3.12.11 (/usr/bin/python3.12)

Thread 17786 (active): "MainThread"
    init_forward_metadata (triton_backend.py:341)
    forward_extend (model_runner.py:1877)
    _forward_raw (model_runner.py:1992)
    forward (model_runner.py:1941)
    forward_batch_generation (tp_worker.py:257)
    forward_batch_generation (eagle_worker_v2.py:233)
    run_batch (scheduler.py:2048)
    event_loop_overlap (scheduler.py:928)
    decorate_context (torch/utils/_contextlib.py:120)
    run_scheduler_process (scheduler.py:2904)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:135)
    spawn_main (multiprocessing/spawn.py:122)
    <module> (<string>:1)
Thread 18471 (idle): "Thread-1 (_read_thread)"
    _recv_msg (torch/_inductor/compile_worker/subproc_pool.py:61)
    _read_thread (torch/_inductor/compile_worker/subproc_pool.py:195)
    run (threading.py:1012)
    _bootstrap_inner (threading.py:1075)
    _bootstrap (threading.py:1032)
Thread 19499 (idle): "Thread-2"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1075)
    _bootstrap (threading.py:1032)
Thread 19565 (idle): "Thread-3"
    wait (threading.py:359)
    wait (threading.py:655)
    run (tqdm/_monitor.py:60)
    _bootstrap_inner (threading.py:1075)
    _bootstrap (threading.py:1032)

This was referenced Oct 4, 2025
@zhyncs zhyncs closed this Nov 4, 2025
@zhyncs zhyncs deleted the lsyin/poc-overlap-spec branch November 4, 2025 23:49