
[Bug] DeepSeek V3.2 prefill workers crash with PD disaggregation #18799

@dongyibo

Description

Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

When I run DeepSeek V3.2 (ds32) with sglang 0.5.8.post1 and PD disaggregation, the prefill workers crash during DeepGEMM warmup.
It works well when I use tp=16 / dp=16 / --enable-dp-attention, but when I use only tp=16 (without enabling dp-attention), it crashes.
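For reference, the two configurations can be sketched roughly as below. This is a hypothetical reproduction sketch, not my exact command: the model path, host, and port values are placeholders, and the flag set is reduced to the ones relevant to this report.

```shell
# Hypothetical sketch; model path and network settings are placeholders.

# Working: tp=16 with DP attention enabled
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2 \
  --disaggregation-mode prefill \
  --tp 16 --dp 16 --enable-dp-attention \
  --trust-remote-code

# Crashing: tp=16 only (no --enable-dp-attention); prefill workers hang in an
# ALLREDUCE during DeepGEMM warmup and the NCCL watchdog fires after 600 s
python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-V3.2 \
  --disaggregation-mode prefill \
  --tp 16 \
  --trust-remote-code
```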

DeepGEMM warmup: 0%| | 0/16384 [00:00<?, ?it/s]
DeepGEMM warmup: 5%|▌ | 897/16384 [00:00<00:01, 8739.92it/s]
DeepGEMM warmup: 12%|█▏ | 1921/16384 [00:00<00:01, 9523.12it/s]
DeepGEMM warmup: 18%|█▊ | 2945/16384 [00:00<00:01, 9808.65it/s]
DeepGEMM warmup: 25%|██▌ | 4097/16384 [00:00<00:01, 10376.90it/s]
DeepGEMM warmup: 33%|███▎ | 5377/16384 [00:00<00:00, 11196.62it/s]
DeepGEMM warmup: 40%|████ | 6584/16384 [00:00<00:00, 11489.85it/s]
DeepGEMM warmup: 47%|████▋ | 7734/16384 [00:00<00:00, 11403.44it/s]
DeepGEMM warmup: 55%|█████▍ | 9009/16384 [00:00<00:00, 11826.79it/s]
DeepGEMM warmup: 65%|██████▍ | 10584/16384 [00:00<00:00, 13043.34it/s]
DeepGEMM warmup: 73%|███████▎ | 12033/16384 [00:01<00:00, 13368.33it/s]
DeepGEMM warmup: 82%|████████▏ | 13387/16384 [00:01<00:00, 13418.97it/s]
DeepGEMM warmup: 90%|████████▉ | 14730/16384 [00:01<00:00, 13300.73it/s]
DeepGEMM warmup: 98%|█████████▊| 16072/16384 [00:01<00:00, 13334.43it/s]
DeepGEMM warmup: 100%|██████████| 16384/16384 [00:01<00:00, 12227.84it/s]
[2026-02-13 12:55:55 TP8] Entering DeepGEMM JIT Pre-Compile session. It may take a long time (typically 10-20 mins) if you have not run sglang.compile_deep_gemm. It is recommended to run sglang.compile_deep_gemm with same args as sglang.launch_server for pre-compilation to reduce the overhead if you have not run it before. For example: python3 -m sglang.compile_deep_gemm --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
[2026-02-13 12:55:55 TP8] Try DeepGEMM JIT Compiling for <GEMM_NT_F8F8BF16> N=2048, K=512, num_groups=1 with all Ms. It only takes a little time (typically 1 sec) if you have run python3 -m sglang.compile_deep_gemm.
[2026-02-13 12:55:55 TP8] Required memory for warmup: 0.0712890625GB, Available memory: 9.89849853515625GB

DeepGEMM warmup: 0%| | 0/16384 [00:00<?, ?it/s]Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank10]:[E213 12:55:55.949080515 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 10] Observed flight recorder dump signal from another rank via TCPStore.
[rank9]:[E213 12:55:55.949075463 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 9] Observed flight recorder dump signal from another rank via TCPStore.
[rank10]:[E213 12:55:55.949265986 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 10] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank9]:[E213 12:55:55.949336115 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 9] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank10]:[E213 12:55:55.949393505 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 10] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank9]:[E213 12:55:55.949472887 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 9] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank13]:[E213 12:55:55.969788730 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 13] Observed flight recorder dump signal from another rank via TCPStore.
[rank13]:[E213 12:55:55.969954558 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 13] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank13]:[E213 12:55:55.970052937 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 13] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank14]:[E213 12:55:55.970891987 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 14] Observed flight recorder dump signal from another rank via TCPStore.
[rank14]:[E213 12:55:55.971089572 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 14] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank8]:[E213 12:55:55.971132781 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 8] Observed flight recorder dump signal from another rank via TCPStore.
[rank14]:[E213 12:55:55.971191660 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 14] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank8]:[E213 12:55:55.971329476 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 8] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank15]:[E213 12:55:55.971420677 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 15] Observed flight recorder dump signal from another rank via TCPStore.
[rank8]:[E213 12:55:55.971472230 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 8] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank15]:[E213 12:55:55.971662622 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 15] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank15]:[E213 12:55:55.971757336 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 15] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank12]:[E213 12:55:55.972336654 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 12] Observed flight recorder dump signal from another rank via TCPStore.
[rank12]:[E213 12:55:55.972507426 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 12] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank12]:[E213 12:55:55.972626061 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 12] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank11]:[E213 12:55:55.974137560 ProcessGroupNCCL.cpp:1794] [PG ID 0 PG GUID 0 Rank 11] Observed flight recorder dump signal from another rank via TCPStore.
[rank11]:[E213 12:55:55.974310921 ProcessGroupNCCL.cpp:1858] [PG ID 0 PG GUID 0 Rank 11] Received a dump signal due to a collective timeout from rank 5 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank11]:[E213 12:55:55.974410359 ProcessGroupNCCL.cpp:1575] [PG ID 0 PG GUID 0 Rank 11] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank12]:[E213 12:55:55.456047970 ProcessGroupNCCL.cpp:683] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
[rank12]:[E213 12:55:55.457849911 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 12] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank12]:[E213 12:55:55.457857636 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank9]:[E213 12:55:55.556041629 ProcessGroupNCCL.cpp:683] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
[rank9]:[E213 12:55:55.556245574 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 9] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank9]:[E213 12:55:55.556251353 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank15]:[E213 12:55:56.756652808 ProcessGroupNCCL.cpp:683] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600084 milliseconds before timing out.
[rank15]:[E213 12:55:56.756888373 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 15] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank15]:[E213 12:55:56.756895188 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank11]:[E213 12:55:56.869579210 ProcessGroupNCCL.cpp:683] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
[rank11]:[E213 12:55:56.869797302 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 11] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank11]:[E213 12:55:56.869803770 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank13]:[E213 12:55:56.056594822 ProcessGroupNCCL.cpp:683] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600030 milliseconds before timing out.
[rank13]:[E213 12:55:56.056802607 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 13] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank13]:[E213 12:55:56.056808585 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank14]:[E213 12:55:56.257567214 ProcessGroupNCCL.cpp:683] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600091 milliseconds before timing out.
[rank14]:[E213 12:55:56.257794926 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 14] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank14]:[E213 12:55:56.257800865 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank10]:[E213 12:55:56.259282747 ProcessGroupNCCL.cpp:683] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600000 milliseconds before timing out.
[rank10]:[E213 12:55:56.259479474 ProcessGroupNCCL.cpp:2241] [PG ID 2 PG GUID 3 Rank 10] failure detected by watchdog at work sequence id: 2 PG status: last enqueued work: 2, last completed work: 1
[rank10]:[E213 12:55:56.259485138 ProcessGroupNCCL.cpp:730] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
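The watchdog messages above note that no stack trace was captured because FlightRecorder was disabled. A minimal environment sketch for re-running with the NCCL flight recorder enabled (the buffer size value is an arbitrary example; any non-zero value enables recording):

```shell
# Enable the NCCL flight recorder so the failed ALLREDUCE gets a stack trace
export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
# Optional: also dump debug info when a collective times out
export TORCH_NCCL_DUMP_ON_TIMEOUT=1
```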

DeepGEMM warmup: 0%| | 17/16384 [00:02<41:51, 6.52it/s]
DeepGEMM warmup: 0%| | 33/16384 [00:05<43:11, 6.31it/s]
DeepGEMM warmup: 0%| | 65/16384 [00:07<30:22, 8.96it/s]
DeepGEMM warmup: 1%| | 129/16384 [00:10<18:03, 15.01it/s]2026-02-13 12:56:06,097 - WARNING - Health check failed: HTTPConnectionPool(host='10.92.216.214', port=28056): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f53fa449cc0>: Failed to establish a new connection: [Errno 111] Connection refused')), current system time: 2026-02-13 12:56:06
2026-02-13 12:56:06,097 - INFO - Service has not recovered yet, continuing to check...

DeepGEMM warmup: 1%| | 193/16384 [00:13<14:44, 18.31it/s]
DeepGEMM warmup: 2%|▏ | 257/16384 [00:15<13:19, 20.17it/s]
DeepGEMM warmup: 2%|▏ | 321/16384 [00:19<14:22, 18.63it/s]
DeepGEMM warmup: 2%|▏ | 385/16384 [00:22<13:14, 20.15it/s]
DeepGEMM warmup: 3%|▎ | 449/16384 [00:25<12:46, 20.79it/s]
DeepGEMM warmup: 3%|▎ | 512/16384 [00:28<12:30, 21.16it/s]
DeepGEMM warmup: 3%|▎ | 515/16384 [00:30<17:19, 15.26it/s]2026-02-13 12:56:36,129 - WARNING - Health check failed: HTTPConnectionPool(host='10.92.216.214', port=28056): Max retries exceeded with url: /health (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f55a8a353c0>: Failed to establish a new connection: [Errno 111] Connection refused')), current system time: 2026-02-13 12:56:36
2026-02-13 12:56:36,129 - INFO - Service has not recovered yet, continuing to check...

DeepGEMM warmup: 4%|▎ | 577/16384 [00:44<32:19, 8.15it/s]
DeepGEMM warmup: 4%|▍ | 641/16384 [00:48<26:22, 9.95it/s]
DeepGEMM warmup: 4%|▍ | 705/16384 [00:57<30:28, 8.58it/s]
DeepGEMM warmup: 5%|▍ | 769/16384 [01:00<24:10, 10.77it/s][rank12]:[E213 12:56:55.457986761 ProcessGroupNCCL.cpp:744] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank12]:[E213 12:56:55.458022124 ProcessGroupNCCL.cpp:758] [Rank 12] To avoid data inconsistency, we are taking the entire process down.
[rank12]:[E213 12:56:55.459128632 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f54d793fb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f54d881b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f54d88201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f54d882140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f552f6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f553878dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f553881f850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600085 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f54d793fb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f54d881b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f54d88201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f54d882140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f552f6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f553878dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f553881f850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f54d793fb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe34731 (0x7f54d87f7731 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x9504a1 (0x7f54d83134a1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7f552f6b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7f553878dac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7f553881f850 in /lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f3767ff8640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f376bff9640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f376fffa640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3773ffb640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3777ffc640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f377bffd640 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/zmq/sugar/socket.py", line 799 in recv_multipart
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 927 in bootstrap_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3ecbfff640 (most recent call first):
File "/local-ssd/pv0/python/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once
File "/local-ssd/pv0/python/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f3ee3fff640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f55386f8480 (most recent call first):
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1255 in __call__
File "/usr/local/lib/python3.10/dist-packages/deep_gemm/__init__.py", line 50 in _fn
File "/local-ssd/pv0/python/sglang/srt/layers/deep_gemm_wrapper/entrypoint.py", line 98 in gemm_nt_f8f8bf16
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8_kernel.py", line 110 in deep_gemm_fp8_fp8_bf16_nt
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1255 in __call__
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8_kernel.py", line 1081 in w8a8_block_fp8_matmul_deepgemm
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8_utils.py", line 480 in deepgemm_w8a8_block_fp8_linear_with_fallback
File "/local-ssd/pv0/python/sglang/srt/layers/quantization/fp8.py", line 638 in apply
File "/local-ssd/pv0/python/sglang/srt/layers/linear.py", line 451 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 278 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 2410 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 2719 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1786 in _call_impl
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1775 in _wrapped_call_impl
File "/local-ssd/pv0/python/sglang/srt/models/deepseek_v2.py", line 2908 in forward
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120 in decorate_context
File "/local-ssd/pv0/python/sglang/srt/model_executor/model_runner.py", line 2283 in forward_extend
File "/local-ssd/pv0/python/sglang/srt/model_executor/model_runner.py", line 2449 in _forward_raw
File "/local-ssd/pv0/python/sglang/srt/model_executor/model_runner.py", line 2346 in forward
File "/local-ssd/pv0/python/sglang/srt/managers/tp_worker.py", line 451 in forward_batch_generation
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 2291 in run_batch
File "/local-ssd/pv0/python/sglang/srt/disaggregation/prefill.py", line 407 in event_loop_overlap_disagg_prefill
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120 in decorate_context
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 3095 in run_scheduler_process
File "/usr/lib/python3.10/multiprocessing/process.py", line 108 in run
File "/usr/lib/python3.10/multiprocessing/process.py", line 314 in _bootstrap
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 129 in _main
File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116 in spawn_main
File "<string>", line 1 in <module>

Extension modules: charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, numpy._core._multiarray_umath, numpy.linalg._umath_linalg, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pybase64._pybase64, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, psutil._psutil_linux, zmq.backend.cython._zmq, PIL._imaging, sentencepiece._sentencepiece, yaml._yaml, regex._regex, markupsafe._speedups, cuda_utils, PIL._imagingft, av._core, av.logging, av.bytesource, av.buffer, av.audio.format, av.error, av.dictionary, av.container.pyio, av.option, av.descriptor, av.format, av.utils, av.stream, av.container.streams, av.sidedata.motionvectors, av.sidedata.sidedata, av.opaque, av.packet, av.container.input, av.container.output, av.container.core, av.codec.context, av.video.format, av.video.reformatter, av.plane, av.video.plane, av.video.frame, av.video.stream, av.codec.hwaccel, av.codec.codec, av.frame, av.audio.layout, av.audio.plane, av.audio.frame, av.audio.stream, av.filter.link, av.filter.context, av.filter.graph, av.filter.filter, av.filter.loudnorm, av.audio.resampler, av.audio.codeccontext, av.audio.fifo, av.bitstream, av.video.codeccontext, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, _cffi_backend, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg._matfuncs_expm, 
scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._cobyla, scipy.optimize._slsqp, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy.optimize._cython_nnls, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.spatial.transform._rotation, scipy.optimize._direct, sklearn.__check_build._check_build, scipy.integrate._odepack, scipy.integrate._quadpack, scipy.integrate._vode, scipy.integrate._dop, scipy.integrate._lsoda, scipy.interpolate._fitpack, scipy.interpolate._dfitpack, scipy.interpolate._dierckx, scipy.interpolate._ppoly, scipy.interpolate._interpnd, scipy.interpolate._rbfinterp_pythran, scipy.interpolate._rgi_cython, scipy.interpolate._bspl, scipy.special.cython_special, scipy.stats._stats, scipy.stats._sobol, scipy.stats._qmc_cy, scipy.stats._biasedurn, scipy.stats._stats_pythran, scipy.stats._levy_stable.levyst, scipy.stats._ansari_swilk_statistics, scipy.stats._mvn, 
scipy.stats._rcont.rcont, scipy.ndimage._nd_image, scipy.ndimage._rank_filter_1d, _ni_label, scipy.ndimage._ni_label, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, _cyutility, sklearn._cyutility, sklearn.utils._isfinite, sklearn.utils.sparsefuncs_fast, sklearn.utils.murmurhash, sklearn.utils._openmp_helpers, sklearn.metrics.cluster._expected_mutual_info_fast, sklearn.preprocessing._csr_polynomial_expansion, sklearn.preprocessing._target_encoder_fast, sklearn.metrics._dist_metrics, sklearn.metrics._pairwise_distances_reduction._datasets_pair, sklearn.utils._cython_blas, sklearn.metrics._pairwise_distances_reduction._base, sklearn.metrics._pairwise_distances_reduction._middle_term_computer, sklearn.utils._heap, sklearn.utils._sorting, sklearn.metrics._pairwise_distances_reduction._argkmin, sklearn.metrics._pairwise_distances_reduction._argkmin_classmode, sklearn.utils._vector_sentinel, sklearn.metrics._pairwise_distances_reduction._radius_neighbors, 
sklearn.metrics._pairwise_distances_reduction._radius_neighbors_classmode, sklearn.metrics._pairwise_fast, setproctitle._setproctitle, _cbor2, cuda.bindings._lib.utils, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings._lib.cyruntime.utils, cuda.bindings._lib.cyruntime.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, msgspec._core, cuda.cudart, cuda.nvrtc, __triton_launcher (total: 257)
[rank9]:[E213 12:56:55.556365223 ProcessGroupNCCL.cpp:744] [Rank 9] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank9]:[E213 12:56:55.556388528 ProcessGroupNCCL.cpp:758] [Rank 9] To avoid data inconsistency, we are taking the entire process down.
[rank9]:[E213 12:56:55.557490243 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f385b57cb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f37fd21b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f37fd2201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f37fd22140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f38540b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f385d333ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f385d3c5850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 9] Process group watchdog thread terminated with exception: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=28672, NumelOut=28672, Timeout(ms)=600000) ran for 600014 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:686 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f385b57cb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x247 (0x7f37fd21b5b7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x1691 (0x7f37fd2201c1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f37fd22140f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7f38540b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f385d333ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f385d3c5850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f385b57cb80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe34731 (0x7f37fd1f7731 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x9504a1 (0x7f37fcd134a1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7f38540b0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7f385d333ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x126850 (0x7f385d3c5850 in /lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007f1a93ff8640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 324 in wait
File "/usr/lib/python3.10/threading.py", line 607 in wait
File "/usr/local/lib/python3.10/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f1a97ff9640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f1a9bffa640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007f1a9fffb640 (most recent call first):
File "/usr/lib/python3.10/threading.py", line 320 in wait
File "/local-ssd/pv0/python/sglang/srt/disaggregation/common/utils.py", line 24 in get
File "/local-ssd/pv0/python/sglang/srt/disaggregation/mooncake/conn.py", line 784 in transfer_worker
File "/usr/lib/python3.10/threading.py", line 953 in run
File "/usr/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/usr/lib/python3.10/threading.py", line 973 in _bootstrap

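Side note on the thread dump above: the `transfer_worker` threads are parked in a blocking `Queue.get()` (the `threading.py` `wait` frames), which is normal idle behavior; the process is taken down by the NCCL watchdog, not by these waiters. A minimal standalone sketch of that wait pattern (names here are illustrative, not sglang's actual code):

```python
import queue
import threading

q = queue.Queue()
results = []

def transfer_worker():
    # Mirrors the transfer_worker frames above: blocked in Queue.get(),
    # which internally waits on a threading.Condition (the wait frames).
    try:
        results.append(q.get(timeout=1.0))
    except queue.Empty:
        results.append("idle")

t = threading.Thread(target=transfer_worker)
t.start()
q.put("kv-chunk")  # wake the waiter, as a real KV transfer request would
t.join()
print(results[0])  # -> kv-chunk
```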
Reproduction

4 machines / 32 H100 cards / PD disaggregation + TP16
ulimit -u unlimited
ulimit -l unlimited

prefill workers on 2 machines with 16 cards ($RANK ∈ {0,1}):

python3 -m sglang.launch_server \
    --tp-size 16 \
    --disaggregation-mode prefill \
    --port "$PREFILL_PORT" \
    --mem-fraction-static 0.80 \
    --model-path /local-ssd/pv2/DeepSeek-V3.2 \
    --dist-init-addr "$PREFILL_MASTER_IP_ADDRESS:20000" \
    --nnodes 2 \
    --node-rank "$RANK" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --schedule-policy fcfs \
    --decode-log-interval 1 \
    --context-length 128000 \
    $ARGS

decode workers on 2 machines with 16 cards ($RANK ∈ {0,1}):

python3 -m sglang.launch_server \
    --tp-size 16 \
    --disaggregation-mode decode \
    --port "$DECODE_PORT" \
    --mem-fraction-static 0.80 \
    --model-path /local-ssd/pv2/DeepSeek-V3.2 \
    --dist-init-addr "$DECODE_MASTER_IP_ADDRESS:20000" \
    --nnodes 2 \
    --node-rank "$RANK" \
    --trust-remote-code \
    --host 0.0.0.0 \
    --schedule-policy fcfs \
    --decode-log-interval 1 \
    --context-length 128000 \
    $ARGS

ARGS="--dist-timeout 3600 --watchdog-timeout 3600 --disaggregation-transfer-backend mooncake --kv-cache-dtype fp8_e4m3 --cuda-graph-bs 1 2 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 --max-running-requests 5120 --chunked-prefill-size 1024 --schedule-conservativeness 3.333 --tokenizer-worker-num 1 --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 --disable-custom-all-reduce"
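Since the hang is reproducible, one diagnostic step (my suggestion, not something the logs confirm) is to rerun the prefill workers with NCCL debug logging and the c10d flight recorder enabled, so the timed-out ALLREDUCE (SeqNum=2) can be traced across ranks. These are standard NCCL/PyTorch environment variables, not sglang-specific flags:

```shell
# Hypothetical diagnostic environment for the prefill launch above.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
export TORCH_NCCL_DUMP_ON_TIMEOUT=1        # dump flight-recorder traces on watchdog timeout
export TORCH_NCCL_TRACE_BUFFER_SIZE=20000  # number of collective records to retain
export TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/tmp/nccl_trace_rank_
```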

Environment

/local-ssd/pv0/python/sglang/srt/environ.py:516: UserWarning: Environment variable SGL_DG_CACHE_DIR is deprecated, please use SGLANG_DG_CACHE_DIR
warnings.warn(
/local-ssd/pv0/python/sglang/srt/environ.py:516: UserWarning: Environment variable SGL_VERSION is deprecated, please use SGLANG_VERSION
warnings.warn(
Python: 3.10.12 (main, Aug 15 2025, 14:32:43) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda-12.4
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.261.03
PyTorch: 2.9.1+cu128
sglang: 0.5.8.post1
sgl_kernel: 0.3.21
flashinfer_python: 0.6.2
flashinfer_cubin: 0.6.2
flashinfer_jit_cache: Module Not Found
triton: 3.5.1
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.13.2
fastapi: 0.121.2
hf_transfer: 0.1.9
huggingface_hub: 0.36.0
interegular: 0.3.3
modelscope: 1.34.0
orjson: 3.11.4
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.3
pydantic: 2.12.4
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.29.0
uvloop: 0.22.1
vllm: 0.11.0
xgrammar: 0.1.27
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.72.1
litellm: Module Not Found
decord2: 3.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX PHB PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PHB PIX PHB PHB SYS SYS SYS SYS 0-89 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS PHB PHB PIX PHB SYS SYS SYS SYS 0-89 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS PHB PHB PHB PIX SYS SYS SYS SYS 0-89 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PHB PHB PHB 90-179 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PHB PIX PHB PHB 90-179 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS PHB PHB PIX PHB 90-179 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS PHB PHB PHB PIX 90-179 1 N/A
NIC0 SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS
NIC1 PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB SYS SYS SYS SYS
NIC2 PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB SYS SYS SYS SYS
NIC3 PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB SYS SYS SYS SYS
NIC4 PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PIX PHB PHB PHB SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 SYS SYS SYS SYS PHB PIX PHB PHB SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 SYS SYS SYS SYS PHB PHB PIX PHB SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 SYS SYS SYS SYS PHB PHB PHB PIX SYS SYS SYS SYS SYS PHB PHB PHB X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

Hypervisor vendor: KVM
ulimit soft: 1048576
