
[Bug] DeepSeek V3.2 prefill workers crash with PD disaggregation #18799

@dongyibo

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When I run DeepSeek V3.2 (ds32) with sglang 0.5.8.post1 and PD disaggregation, the prefill workers crash during DeepGEMM warmup.
It works well when I use tp=16 / dp=16 / --enable-dp-attention, but when I use only tp=16 (without enabling dp-attention), it crashes.
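For reference, the two configurations can be sketched roughly as below. This is a hypothetical reproduction sketch, not my exact command: the model path, host, and port values are placeholders, and the flag set is reduced to the ones relevant to this report.

```shell
# Hypothetical sketch; model path and network settings are placeholders.

# Working: tp=16 with DP attention enabled
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2 \
  --disaggregation-mode prefill \
  --tp 16 --dp 16 --enable-dp-attention \
  --trust-remote-code

# Crashing: tp=16 only (no --enable-dp-attention); prefill workers hang in an
# ALLREDUCE during DeepGEMM warmup and the NCCL watchdog fires after 600 s
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2 \
  --disaggregation-mode prefill \
  --tp 16 \
  --trust-remote-code
```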

DeepGEMM warmup: 0%| | 0/16384 [00:00<?, ?it/s]
DeepGEMM warmup: 5%|▌ | 897/16384 [00:00<00:01, 8739.92it/s]
DeepGEMM warmup: 12%|█▏ | 1921/16384 [00:00<00:01, 9523.12it/s]
DeepGEMM warmup: 18%|█▊ | 2945/16384 [00:00<00:01, 9808.65it/s]
DeepGEMM warmup: 25%|██▌ | 4097/16384 [00:00<00:01, 10376.90it/s]
DeepGEMM warmup: 33%|███▎ | 5377/16384 [00:00<00:00, 11196.62it/s]
DeepGEMM warmup: 40%|████ | 6584/16384 [00:00<00:00, 11489.85it/s]
DeepGEMM warmup: 47%|████▋ | 7734/16384 [00:00<00:00, 11403.44it/s]
DeepGEMM warmup: 55%|█████▍ | 9009/16384 [00:00<00:00, 11826.79it/s]
DeepGEMM warmup: 65%|██████▍ | 10584/16384 [00:00<00:00, 13043.34it/s]
DeepGEMM warmup: 73%|███████▎ | 12033/16384 [00:01<00:00, 13368.33it/s]
DeepGEMM warmup: 82%|████████▏ | 13387/16384 [00:01<00:00, 13418.97it/s]
DeepGEMM warmup: 90%|████████▉ | 14730/16384 [00:01<00:00, 13300.73it/s]
DeepGEMM warmup: 98%|█████████▊| 16072/16384 [00:01<00:00, 13334.43it/s]
DeepGEMM warmup: 100%|██████████| 16384/16384 [00:01<00:00, 12227.84it/s]
[2026-02-13 12:55:55 TP8] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run sglang.compile_deep_gemm. It is recommended to run sglang.compile_deep_gemm with same args as sglang.launch_server for pre-compilation to reduce the overhead if you have not run it before. For example: python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
[2026-02-13 12:55:55 TP8] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=512, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run python3 -m sglang.compile_deep_gemm.
[2026-02-13 12:55:55 TP8] Required memory for warmup: 0.0712890625GB, Available memory: 9.89849853515625GB

DeepGEMM warmup: 0%| | 0/16384 [00:00<?, ?it/s]Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank10]:[E213 12:55:55.949080515 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 10] Observed flight recorder dump signal from another rank via TCPStore.
[rank9]:[E213 12:55:55.949075463 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 9] Observed flight recorder dump signal from another rank via TCPStore.
[rank10]:[E213 12:55:55.949265986 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 10] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank9]:[E213 12:55:55.949336115 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 9] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank10]:[E213 12:55:55.949393505 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 10] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank9]:[E213 12:55:55.949472887 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 9] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank13]:[E213 12:55:55.969788730 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 13] Observed flight recorder dump signal from another rank via TCPStore.
[rank13]:[E213 12:55:55.969954558 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 13] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank13]:[E213 12:55:55.970052937 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 13] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank14]:[E213 12:55:55.970891987 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 14] Observed flight recorder dump signal from another rank via TCPStore.
[rank14]:[E213 12:55:55.971089572 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 14] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank8]:[E213 12:55:55.971132781 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 8] Observed flight recorder dump signal from another rank via TCPStore.
[rank14]:[E213 12:55:55.971191660 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 14] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank8]:[E213 12:55:55.971329476 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 8] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank15]:[E213 12:55:55.971420677 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 15] Observed flight recorder dump signal from another rank via TCPStore.
[rank8]:[E213 12:55:55.971472230 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 8] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank15]:[E213 12:55:55.971662622 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 15] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank15]:[E213 12:55:55.971757336 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 15] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank12]:[E213 12:55:55.972336654 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 12] Observed flight recorder dump signal from another rank via TCPStore.
[rank12]:[E213 12:55:55.972507426 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 12] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank12]:[E213 12:55:55.972626061 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 12] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank11]:[E213 12:55:55.974137560 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 11] Observed flight recorder dump signal from another rank via TCPStore.
[rank11]:[E213 12:55:55.974310921 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 11] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank11]:[E213 12:55:55.974410359 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 11] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank12]:[E213 12:55:55.456047970 ProcessGroupNCCL.cpp:683] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank12]:[E213 12:55:55.457849911 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 12] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank12]:[E213 12:55:55.457857636 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank9]:[E213 12:55:55.556041629 ProcessGroupNCCL.cpp:683] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank9]:[E213 12:55:55.556245574 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 9] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank9]:[E213 12:55:55.556251353 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank15]:[E213 12:55:56.756652808 ProcessGroupNCCL.cpp:683] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank15]:[E213 12:55:56.756888373 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 15] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank15]:[E213 12:55:56.756895188 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank11]:[E213 12:55:56.869579210 ProcessGroupNCCL.cpp:683] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
[rank11]:[E213 12:55:56.869797302 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 11] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank11]:[E213 12:55:56.869803770 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank13]:[E213 12:55:56.056594822 ProcessGroupNCCL.cpp:683] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
[rank13]:[E213 12:55:56.056802607 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 13] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank13]:[E213 12:55:56.056808585 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank14]:[E213 12:55:56.257567214 ProcessGroupNCCL.cpp:683] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600091 milliseconds before timing out.
[rank14]:[E213 12:55:56.257794926 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 14] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank14]:[E213 12:55:56.257800865 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank10]:[E213 12:55:56.259282747 ProcessGroupNCCL.cpp:683] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank10]:[E213 12:55:56.259479474 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 10] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank10]:[E213 12:55:56.259485138 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
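The watchdog messages above note that no stack trace was captured because FlightRecorder was disabled. A minimal environment sketch for re-running with the NCCL flight recorder enabled (the buffer size value is an arbitrary example; any non-zero value enables recording):

```shell
# Enable the NCCL flight recorder so the failed ALLREDUCE gets a stack trace
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
# Optional: also dump debug info when a collective times out
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
```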

DeepGEMM warmup: 0%| | 17/16384 [00:02<41:51, 6.52it/s]
DeepGEMM warmup: 0%| | 33/16384 [00:05<43:11, 6.31it/s]
DeepGEMM warmup: 0%| | 65/16384 [00:07<30:22, 8.96it/s]
DeepGEMM warmup: 1%| | 129/16384 [00:10<18:03, 15.01it/s]2026-02-13 12:56:06,097 - WARNING - Health check failed: HTTPConnectionPool(host='10.92.216.214', port=28056): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f53fa449cc0>: Failed to establish a new connection: [Errno 111] Connection refused')), current system time: 2026-02-13 12:56:06
2026-02-13 12:56:06,097 - INFO - Service has not recovered yet, continuing to check...

DeepGEMM warmup: 1%| | 193/16384 [00:13<14:44, 18.31it/s]
DeepGEMM warmup: 2%|▏ | 257/16384 [00:15<13:19, 20.17it/s]
DeepGEMM warmup: 2%|▏ | 321/16384 [00:19<14:22, 18.63it/s]
DeepGEMM warmup: 2%|▏ | 385/16384 [00:22<13:14, 20.15it/s]
DeepGEMM warmup: 3%|▎ | 449/16384 [00:25<12:46, 20.79it/s]
DeepGEMM warmup: 3%|▎ | 512/16384 [00:28<12:30, 21.16it/s]
DeepGEMM warmup: 3%|▎ | 515/16384 [00:30<17:19, 15.26it/s]2026-02-13 12:56:36,129 - WARNING - Health check failed: HTTPConnectionPool(host='10.92.216.214', port=28056): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f55a8a353c0>: Failed to establish a new connection: [Errno 111] Connection refused')), current system time: 2026-02-13 12:56:36
2026-02-13 12:56:36,129 - INFO - Service has not recovered yet, continuing to check...

DeepGEMM warmup: 4%|▎ | 577/16384 [00:44<32:19, 8.15it/s]
DeepGEMM warmup: 4%|▍ | 641/16384 [00:48<26:22, 9.95it/s]
DeepGEMM warmup: 4%|▍ | 705/16384 [00:57<30:28, 8.58it/s]
DeepGEMM warmup: 5%|▍ | 769/16384 [01:00<24:10, 10.77it/s][rank12]:[E213 12:56:55.457986761 ProcessGroupNCCL.cpp:744] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank12]:[E213 12:56:55.458022124 ProcessGroupNCCL.cpp:758] [Rank 12] To avoid data inconsistency, we are taking the entire process down.
[rank12]:[E213 12:56:55.459128632 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f54d793fb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f54d881b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f54d88201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f54d882140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f552f6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f553878dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f553881f850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f54d793fb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f54d881b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f54d88201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f54d882140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f552f6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f553878dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f553881f850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f54d793fb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe34731 (0x7f54d87f7731 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x9504a1 (0x7f54d83134a1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7f552f6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7f553878dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7f553881f850 in /lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f3767ff8640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f376bff9640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f376fffa640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3773ffb640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3777ffc640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f377bffd640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 799 in recv_multipart
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 927 in bootstrap_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3ecbfff640 (most recent call first):
File "/local-ssd/pv0/python/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once
File "/local-ssd/pv0/python/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3ee3fff640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f55386f8480 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1255 in __call__
File "/usr/local/lib/python3.10/dist-packages/deep_gemm/__init__.py", line 50 in _fn
File "/local-ssd/pv0/python/sglang/srt/layers/deep_gemm_wrapper/entrypoint.py", line 98 in gemm_nt_f8f8bf16
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8_kernel.py", line 110 in deep_gemm_fp8_fp8_bf16_nt
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1255 in __call__
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8_kernel.py", line 1081 in w8a8_block_fp8_matmul_deepgemm
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8_utils.py", line 480 in deepgemm_w8a8_block_fp8_linear_with_fallback
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8.py", line 638 in apply
File "/local-ssd/pv0/python/sglang/srt/layers/linear.py", line 451 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 278 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 2410 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 2719 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 2908 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120 in decorate_context
File "/local-ssd/pv0/python/sglang/srt/model_executor/model_runner.py", line 2283 in forward_extend
File "/local-ssd/pv0/python/sglang/srt/model_executor/model_runner.py", line 2449 in _forward_raw
File "/local-ssd/pv0/python/sglang/srt/model_executor/model_runner.py", line 2346 in forward
File "/local-ssd/pv0/python/sglang/srt/managers/tp_worker.py", line 451 in forward_batch_generation
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 2291 in run_batch
File "/local-ssd/pv0/python/sglang/srt/disaggregation/prefill.py", line 407 in event_loop_overlap_disagg_prefill
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120 in decorate_context
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 3095 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>

Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pybase64._pybase64, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, psutil._psutil_linux, zmq.backend.cython._zmq, PIL._imaging, sentencepiece._sentencepiece, yaml._yaml, regex._regex, markupsafe._speedups, cuda_utils, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, _cffi_backend, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, 
scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, sklearn.__check_build._check_build, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, 
scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cyutility, sklearn._cyutility, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, 
sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, setproctitle._setproctitle, _cbor2, cuda.bindings._lib.utils, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings._lib.cyruntime.utils, cuda.bindings._lib.cyruntime.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, msgspec._core, cuda.cudart, cuda.nvrtc, __triton_launcher (total: 257)
[rank9]:[E213 12:56:55.556365223 ProcessGroupNCCL.cpp:744] [Rank 9] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank9]:[E213 12:56:55.556388528 ProcessGroupNCCL.cpp:758] [Rank 9] To avoid data inconsistency, we are taking the entire process down.
[rank9]:[E213 12:56:55.557490243 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f385b57cb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f37fd21b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f37fd2201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f37fd22140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f38540b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f385d333ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f385d3c5850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f385b57cb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f37fd21b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f37fd2201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f37fd22140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f38540b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f385d333ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f385d3c5850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f385b57cb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe34731 (0x7f37fd1f7731 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x9504a1 (0x7f37fcd134a1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7f38540b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7f385d333ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7f385d3c5850 in /lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f1a93ff8640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f1a97ff9640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f1a9bffa640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f1a9fffb640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

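Side note on the thread dump above: the `transfer_worker` threads are parked in a blocking `Queue.get()` (the `threading.py` `wait` frames), which is normal idle behavior; the process is taken down by the NCCL watchdog, not by these waiters. A minimal standalone sketch of that wait pattern (names here are illustrative, not sglang's actual code):

```python
import queue
import threading

q = queue.Queue()
results = []

def transfer_worker():
    # Mirrors the transfer_worker frames above: blocked in Queue.get(),
    # which internally waits on a threading.Condition (the wait frames).
    try:
        results.append(q.get(timeout=1.0))
    except queue.Empty:
        results.append("idle")

t = threading.Thread(target=transfer_worker)
t.start()
q.put("kv-chunk")  # wake the waiter, as a real KV transfer request would
t.join()
print(results[0])  # -> kv-chunk
```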
Reproduction

4 machines / 32 H100 cards / PD disaggregation + TP16
ulimit -u unlimited
ulimit -l unlimited

prefill workers on 2 machines with 16 cards ($RANK ∈ {0,1}):

python3 -m sglang.launch_server \
    --tp-size 16 \
    --disaggregation-mode prefill \
    --port "$PREFILL_PORT" \
    --mem-fraction-static 0.80 \
    --model-path /local-ssd/pv2/DeepSeek-V3.2 \
    --dist-init-addr "$PREFILL_MASTER_IP_ADDRESS:20000" \
    --nnodes 2 \
    --node-rank "$RANK" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --schedule-policy fcfs \
    --decode-log-interval 1 \
    --context-length 128000 \
    $ARGS

decode workers on 2 machines with 16 cards ($RANK ∈ {0,1}):

python3 -m sglang.launch_server \
    --tp-size 16 \
    --disaggregation-mode decode \
    --port "$DECODE_PORT" \
    --mem-fraction-static 0.80 \
    --model-path /local-ssd/pv2/DeepSeek-V3.2 \
    --dist-init-addr "$DECODE_MASTER_IP_ADDRESS:20000" \
    --nnodes 2 \
    --node-rank "$RANK" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --schedule-policy fcfs \
    --decode-log-interval 1 \
    --context-length 128000 \
    $ARGS

ARGS="--dist-timeout 3600 --watchdog-timeout 3600 --disaggregation-transfer-backend mooncake --kv-cache-dtype fp8_e4m3 --cuda-graph-bs 1 2 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 --max-running-requests 5120 --chunked-prefill-size 1024 --schedule-conservativeness 3.333 --tokenizer-worker-num 1 --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 --disable-custom-all-reduce"
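Since the hang is reproducible, one diagnostic step (my suggestion, not something the logs confirm) is to rerun the prefill workers with NCCL debug logging and the c10d flight recorder enabled, so the timed-out ALLREDUCE (SeqNum=2) can be traced across ranks. These are standard NCCL/PyTorch environment variables, not sglang-specific flags:

```shell
# Hypothetical diagnostic environment for the prefill launch above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
export TORCH_NCCL_DUMP_ON_TIMEOUT=1        # dump flight-recorder traces on watchdog timeout
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000  # number of collective records to retain
export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_
```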

Environment

/local-ssd/pv0/python/sglang/srt/environ.py:516: UserWarning: Environment variable SGL_DG_CACHE_DIR is deprecated, please use SGLANG_DG_CACHE_DIR
warnings.warn(
/local-ssd/pv0/python/sglang/srt/environ.py:516: UserWarning: Environment variable SGL_VERSION is deprecated, please use SGLANG_VERSION
warnings.warn(
Python: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.261.03
PyTorch: 2.9.1+cu128
sglang: 0.5.8.post1
sgl_kernel: 0.3.21
flashinfer_python: 0.6.2
flashinfer_cubin: 0.6.2
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.13.2
fastapi: 0.121.2
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.34.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.4
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.29.0
uvloop: 0.22.1
vllm: 0.11.0
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.72.1
litellm: Module Not Found
decord2: 3.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PHB PIX PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS PHB PHB PIX PHB SYS SYS SYS SYS 0-89 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS PHB PHB PHB PIX SYS SYS SYS SYS 0-89 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PHB PHB PHB 90-179 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PHB PIX PHB PHB 90-179 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS PHB PHB PIX PHB 90-179 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS PHB PHB PHB PIX 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

Hypervisor vendor: KVM
ulimit soft: 1048576
