
[Bug] Using flashinfer.comm.trtllm_allreduce_fusion results in RuntimeError: CUDART error: invalid resource handle on multi-node systems #2006

@leejnau

Description


When running an sglang server with --enable-flashinfer-allreduce-fusion, the error RuntimeError: CUDART error: invalid resource handle is raised on GB200 NVL72 systems (in this case two nodes; see the complete sglang server command below). If the allreduce fusion is not enabled, or the server runs on a single node, the error does not occur. The error in more detail and with more context:

File "/sgl-workspace/sglang/python/sglang/srt/layers/flashinfer_comm_fusion.py", line 163, in flashinfer_allreduce_residual_rmsnorm
    if not ensure_workspace_initialized(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/layers/flashinfer_comm_fusion.py", line 115, in ensure_workspace_initialized
    _workspace_manager.initialize(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/flashinfer_comm_fusion.py", line 61, in initialize
    comm.trtllm_create_ipc_workspace_for_all_reduce_fusion(
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/trtllm_ar.py", line 554, in trtllm_create_ipc_workspace_for_all_reduce_fusion
    ipc_handles.append(create_shared_buffer(aligned_size, group))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/cuda_ipc.py", line 220, in create_shared_buffer
    pointers.append(cudart.cudaIpcOpenMemHandle(h).value)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/cuda_ipc.py", line 186, in cudaIpcOpenMemHandle
    self.CUDART_CHECK(
  File "/usr/local/lib/python3.12/dist-packages/flashinfer/comm/cuda_ipc.py", line 144, in CUDART_CHECK
    raise RuntimeError(f"CUDART error: {error_str}")
RuntimeError: CUDART error: invalid resource handle

The root cause is the use of the legacy CUDA IPC API to create the shared workspace: cudaMalloc plus cudaIpcGetMemHandle/cudaIpcOpenMemHandle. These calls are not supported across a multi-node NVLink (MNNVL) domain, so opening a handle exported by a remote node fails. Fixing this likely requires flashinfer to allocate the IPC memory with the cuMemCreate (virtual memory management) API instead.
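For reference, below is a minimal sketch of what the cuMemCreate-based path could look like. This is not flashinfer's actual code: the function names, the error handling, and the transport used to exchange handles between ranks are placeholders, and it assumes cuInit has been called and a context is current on the target device. The idea is that the allocating rank requests a fabric handle (CU_MEM_HANDLE_TYPE_FABRIC) and exports it, and a peer rank, possibly on another node of the same NVLink domain, imports and maps it; the legacy cudaIpc* handles can only be opened by processes on the same node.

#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort on CUDA driver API errors (placeholder error handling).
static void cu_check(CUresult st, const char* what) {
  if (st != CUDA_SUCCESS) {
    const char* msg = nullptr;
    cuGetErrorString(st, &msg);
    std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
    std::exit(1);
  }
}

// Allocating rank: create `size` bytes on `device`, export a fabric handle
// into *out_handle (to be sent to peer ranks out of band), and map the
// buffer locally. Returns the local device pointer.
CUdeviceptr alloc_and_export(int device, size_t size, size_t* aligned_size,
                             CUmemFabricHandle* out_handle) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  // key difference vs. cudaMalloc + cudaIpc*

  size_t gran = 0;
  cu_check(cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM),
           "cuMemGetAllocationGranularity");
  *aligned_size = ((size + gran - 1) / gran) * gran;

  CUmemGenericAllocationHandle handle;
  cu_check(cuMemCreate(&handle, *aligned_size, &prop, 0), "cuMemCreate");
  cu_check(cuMemExportToShareableHandle(out_handle, handle, CU_MEM_HANDLE_TYPE_FABRIC, 0),
           "cuMemExportToShareableHandle");

  // Reserve a VA range, map the allocation, and enable read/write access locally.
  CUdeviceptr ptr = 0;
  cu_check(cuMemAddressReserve(&ptr, *aligned_size, gran, 0, 0), "cuMemAddressReserve");
  cu_check(cuMemMap(ptr, *aligned_size, 0, handle, 0), "cuMemMap");
  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cu_check(cuMemSetAccess(ptr, *aligned_size, &access, 1), "cuMemSetAccess");
  return ptr;
}

// Peer rank (same node or another node in the NVLink fabric): import the
// received fabric handle and map it into the local address space.
CUdeviceptr import_and_map(int device, size_t aligned_size, CUmemFabricHandle* peer_handle) {
  CUmemGenericAllocationHandle handle;
  cu_check(cuMemImportFromShareableHandle(&handle, peer_handle, CU_MEM_HANDLE_TYPE_FABRIC),
           "cuMemImportFromShareableHandle");

  CUdeviceptr ptr = 0;
  cu_check(cuMemAddressReserve(&ptr, aligned_size, 0, 0, 0), "cuMemAddressReserve");
  cu_check(cuMemMap(ptr, aligned_size, 0, handle, 0), "cuMemMap");
  CUmemAccessDesc access = {};
  access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  access.location.id = device;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cu_check(cuMemSetAccess(ptr, aligned_size, &access, 1), "cuMemSetAccess");
  return ptr;
}

As far as I understand, fabric handles also require a sufficiently new driver/toolkit and the IMEX service to be configured on the GB200 nodes, so the fix would probably want a capability check with a fallback to the existing single-node IPC path.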

I used two GB200 nodes with four GPUs each to reproduce the error:

SGL_ENABLE_JIT_DEEPGEMM=0 SGLANG_ENABLE_FLASHINFER_GEMM=1 python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-0528 \
--host 0.0.0.0 --port 8000 \
--tensor-parallel-size 8 \
--ep-size 8 \
--data-parallel-size=1 \
--cuda-graph-max-bs 128 \
--max-running-requests 128 \
--mem-fraction-static 0.90 \
--kv-cache-dtype fp8_e4m3 \
--chunked-prefill-size 32768 \
--max-prefill-tokens 32768 \
--enable-flashinfer-allreduce-fusion \
--enable-symm-mem \
--scheduler-recv-interval 10 \
--disable-radix-cache \
--attention-backend trtllm_mla \
--stream-interval 30 \
--moe-runner-backend flashinfer_trtllm \
--quantization fp8 \
--dist-init-addr "<hostname or ip of node 0>:5000" \
--nnodes 2 \
--node-rank <current node either 0 or 1>

SGLang Environment via python3 -m sglang.check_env:

Python: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3: NVIDIA GB200
GPU 0,1,2,3 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 580.82.07
PyTorch: 2.8.0+cu129
sglang: 0.5.4.post1
sgl_kernel: 0.3.16.post4
flashinfer_python: 0.4.1
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.4
aiohttp: 3.13.1
fastapi: 0.119.1
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.1
pydantic: 2.12.3
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.38.0
uvloop: 0.22.1
vllm: Module Not Found
xgrammar: 0.1.25
openai: 2.6.1
tiktoken: 0.12.0
anthropic: 0.71.0
litellm: Module Not Found
decord2: 2.0.0
NVIDIA Topology: 
	GPU0	GPU1	GPU2	GPU3	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV18	NV18	NV18	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-71	0		N/A
GPU1	NV18	 X 	NV18	NV18	NODE	NODE	NODE	NODE	SYS	SYS	SYS	SYS	0-71	0		N/A
GPU2	NV18	NV18	 X 	NV18	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	72-143	1		N/A
GPU3	NV18	NV18	NV18	 X 	SYS	SYS	SYS	SYS	NODE	NODE	NODE	NODE	72-143	1		N/A
NIC0	NODE	NODE	SYS	SYS	 X 	NODE	NODE	NODE	SYS	SYS	SYS	SYS				
NIC1	NODE	NODE	SYS	SYS	NODE	 X 	NODE	NODE	SYS	SYS	SYS	SYS				
NIC2	NODE	NODE	SYS	SYS	NODE	NODE	 X 	PIX	SYS	SYS	SYS	SYS				
NIC3	NODE	NODE	SYS	SYS	NODE	NODE	PIX	 X 	SYS	SYS	SYS	SYS				
NIC4	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	 X 	NODE	NODE	NODE				
NIC5	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	NODE	 X 	NODE	NODE				
NIC6	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	 X 	PIX				
NIC7	SYS	SYS	NODE	NODE	SYS	SYS	SYS	SYS	NODE	NODE	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7


ulimit soft: 1048576
