Description
When running an sglang server with --enable-flashinfer-allreduce-fusion on a multi-node GB200 NVL72 system (two nodes in this case), the server fails with RuntimeError: CUDART error: invalid resource handle. (See the complete sglang server command below.) The error does not occur if the allreduce fusion is disabled or if only a single node is used. The error with more context:
File "/sgl-workspace/sglang/python/sglang/srt/layers/flashinfer_comm_fusion.py", line 163, in flashinfer_allreduce_residual_rmsnorm
if not ensure_workspace_initialized(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/layers/flashinfer_comm_fusion.py", line 115, in ensure_workspace_initialized
_workspace_manager.initialize(
File "/sgl-workspace/sglang/python/sglang/srt/layers/flashinfer_comm_fusion.py", line 61, in initialize
comm.trtllm_create_ipc_workspace_for_all_reduce_fusion(
File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/trtllm_ar.py", line 554, in trtllm_create_ipc_workspace_for_all_reduce_fusion
ipc_handles.append(create_shared_buffer(aligned_size, group))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/cuda_ipc.py", line 220, in create_shared_buffer
pointers.append(cudart.cudaIpcOpenMemHandle(h).value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/cuda_ipc.py", line 186, in cudaIpcOpenMemHandle
self.CUDART_CHECK(
File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/cuda_ipc.py", line 144, in CUDART_CHECK
raise RuntimeError(f"CUDART error: {error_str}")
RuntimeError: CUDART error: invalid resource handle
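For context, here is a minimal sketch of the legacy CUDA IPC exchange that the failing call performs. This is an approximation for illustration, not flashinfer's actual create_shared_buffer code; the handle exchange between ranks (e.g. over the process group) is omitted. Handles produced this way are only openable by processes on the same node, which matches the cross-node failure above.

```cpp
// Sketch of the legacy CUDA IPC pattern (illustration only).
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

static void check(cudaError_t err, const char* what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    std::exit(1);
  }
}

// Runs in the exporting process: allocate a buffer and produce an IPC handle
// whose raw bytes are then broadcast to the other ranks.
cudaIpcMemHandle_t export_buffer(void** buf, size_t size) {
  check(cudaMalloc(buf, size), "cudaMalloc");
  cudaIpcMemHandle_t handle;
  check(cudaIpcGetMemHandle(&handle, *buf), "cudaIpcGetMemHandle");
  return handle;
}

// Runs in each importing process after receiving the handle bytes.
// When the exporting process lives on a *different* node, this is the call
// that fails with "invalid resource handle".
void* import_buffer(cudaIpcMemHandle_t handle) {
  void* peer_ptr = nullptr;
  check(cudaIpcOpenMemHandle(&peer_ptr, handle, cudaIpcMemLazyEnablePeerAccess),
        "cudaIpcOpenMemHandle");
  return peer_ptr;
}
```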
The root cause is that the IPC memory is created with the legacy CUDA IPC API (cudaMalloc + cudaIpcGetMemHandle/cudaIpcOpenMemHandle); these calls are not supported across MNNVL, so a handle exported on one node cannot be opened on the other. Fixing this likely requires flashinfer to create the IPC memory with the cuMemCreate API instead.
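A minimal sketch of that direction is below, assuming the cuMemCreate allocation is exported as a fabric handle (CU_MEM_HANDLE_TYPE_FABRIC, which needs CUDA 12.3+ and IMEX/MNNVL support on the GB200 nodes). This is only an illustration of the VMM allocation path, not flashinfer's implementation; handle exchange between ranks and cleanup are omitted, and a CUDA context must already be current on each rank.

```cpp
// Sketch of allocating a cross-node shareable buffer via cuMemCreate + fabric
// handles instead of cudaMalloc + cudaIpcGetMemHandle (illustration only).
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

static void check(CUresult res, const char* what) {
  if (res != CUDA_SUCCESS) {
    const char* msg = nullptr;
    cuGetErrorString(res, &msg);
    std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown");
    std::exit(1);
  }
}

// Exporting rank: create a fabric-shareable allocation and return the handle
// bytes that would be broadcast to the other ranks (same node or remote node).
CUmemFabricHandle export_fabric_buffer(int device, size_t requested,
                                       CUmemGenericAllocationHandle* out,
                                       size_t* aligned) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;

  // Round the requested size up to the allocation granularity.
  size_t gran = 0;
  check(cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM),
        "cuMemGetAllocationGranularity");
  *aligned = ((requested + gran - 1) / gran) * gran;

  check(cuMemCreate(out, *aligned, &prop, 0), "cuMemCreate");

  CUmemFabricHandle fh = {};
  check(cuMemExportToShareableHandle(&fh, *out, CU_MEM_HANDLE_TYPE_FABRIC, 0),
        "cuMemExportToShareableHandle");
  return fh;
}

// Importing rank (possibly on another node): import the fabric handle, then
// reserve a VA range, map the allocation into it, and enable access.
// The exporting rank maps its own allocation the same way to use it locally.
void* import_fabric_buffer(int device, CUmemFabricHandle fh, size_t aligned) {
  CUmemGenericAllocationHandle handle;
  check(cuMemImportFromShareableHandle(&handle, &fh, CU_MEM_HANDLE_TYPE_FABRIC),
        "cuMemImportFromShareableHandle");

  CUdeviceptr ptr = 0;
  check(cuMemAddressReserve(&ptr, aligned, 0, 0, 0), "cuMemAddressReserve");
  check(cuMemMap(ptr, aligned, 0, handle, 0), "cuMemMap");

  CUmemAccessDesc access = {};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  check(cuMemSetAccess(ptr, aligned, &access, 1), "cuMemSetAccess");

  return reinterpret_cast<void*>(ptr);
}
```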
I used two GB200 nodes with four GPUs each to reproduce the error:
SGL_ENABLE_JIT_DEEPGEMM=0 SGLANG_ENABLE_FLASHINFER_GEMM=1 python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-0528 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 8 \
--ep-size 8 \
--data-parallel-size=1 \
--cuda-graph-max-bs 128 \
--max-running-requests 128 \
--mem-fraction-static 0.90 \
--kv-cache-dtype fp8_e4m3 \
--chunked-prefill-size 32768 \
--max-prefill-tokens 32768 \
--enable-flashinfer-allreduce-fusion \
--enable-symm-mem \
--scheduler-recv-interval 10 \
--disable-radix-cache \
--attention-backend trtllm_mla \
--stream-interval 30 \
--moe-runner-backend flashinfer_trtllm \
--quantization fp8 \
--dist-init-addr "<hostname or ip of node 0>:5000" \
--nnodes 2 \
--node-rank <current node either 0 or 1>
SGLang Environment via python3 -m sglang.check_env:
Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GB200
GPU 0,1,2,3 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.82.07
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post1
sgl_kernel: 0.3.16.post4
flashinfer_python: 0.4.1
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.4
aiohttp: 3.13.1
fastapi: 0.119.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.1
pydantic: 2.12.3
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.71.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NODE NODE NODE NODE SYS SYS SYS SYS 0-71 0 N/A
GPU1 NV18 X NV18 NV18 NODE NODE NODE NODE SYS SYS SYS SYS 0-71 0 N/A
GPU2 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE NODE NODE 72-143 1 N/A
GPU3 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE NODE 72-143 1 N/A
NIC0 NODE NODE SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE NODE SYS SYS NODE X NODE NODE SYS SYS SYS SYS
NIC2 NODE NODE SYS SYS NODE NODE X PIX SYS SYS SYS SYS
NIC3 NODE NODE SYS SYS NODE NODE PIX X SYS SYS SYS SYS
NIC4 SYS SYS NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS NODE NODE SYS SYS SYS SYS NODE X NODE NODE
NIC6 SYS SYS NODE NODE SYS SYS SYS SYS NODE NODE X PIX
NIC7 SYS SYS NODE NODE SYS SYS SYS SYS NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
ulimit soft: 1048576