Skip to content

[Bug] Running DeepSeek R1 error when enable flashinfer-mla and torch-compile #4964

@lambert0312

Description

@lambert0312

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I use the latest main (1c63e797560405b9ec5149d93f02d6147383c627), start the DeepSeek R1 model, and when using the --flashinfer-mla --enable-torch-compile parameter, an error is report.

Reproduction

node 1:

python3 -m sglang.launch_server --model /path/to/DeepSeek-R1-Channel-INT8 --trust-remote --host 0.0.0.0 --port 30000 --enable-flashinfer-mla --speculative-algo EAGLE --speculative-draft /path/to/DeepSeek-R1-NextN-Channel-INT8 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-torch-compile --torch-compile-max-bs 4 --quantization w8a8_int8 --tp 16 --dist-init-addr ip:20000 --nnodes 2 --node-rank 0

node 2:

python3 -m sglang.launch_server --model /path/to/DeepSeek-R1-Channel-INT8 --trust-remote --host 0.0.0.0 --port 30000 --enable-flashinfer-mla --speculative-algo EAGLE --speculative-draft /path/to/DeepSeek-R1-NextN-Channel-INT8 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-torch-compile --torch-compile-max-bs 4 --quantization w8a8_int8 --tp 16 --dist-init-addr ip:20000 --nnodes 2 --node-rank 1

The exception information is as follows:

[2025-04-01 04:49:35 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 267, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 351, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 443, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 436, in run_once
    logits_output = forward(input_ids, forward_batch.positions, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py", line 40, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1323, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1277, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1084, in forward
    return self.forward_normal(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1110, in forward_normal
    hidden_states = self.self_attn(
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1140, in torch_dynamo_resume_in_forward_normal_at_1110
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 256, in forward
    return self.forward_normal(hidden_states)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 262, in forward_normal
    shared_output = self.shared_experts(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 131, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 1269, in __call__
    return self._torchdynamo_orig_callable(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 1064, in __call__
    result = self._inner_convert(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 526, in __call__
    return _compile(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 952, in _compile
    raise InternalTorchDynamoError(
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 924, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 666, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 699, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/convert_frame.py", line 231, in _fn
    torch.cuda.set_rng_state(cuda_rng_state)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/random.py", line 77, in set_rng_state
    _lazy_call(cb)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 249, in _lazy_call
    callable()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/random.py", line 75, in cb
    default_generator.set_state(new_state_copy)
torch._dynamo.exc.InternalTorchDynamoError: RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.



You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1999, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 249, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 74, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 169, in __init__
    self.initialize(min_per_gpu_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 207, in initialize
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 919, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 269, in __init__
    raise Exception(
Exception: Capture cuda graph failed: RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.



You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
4. set --cuda-graph-max-bs to a smaller value (e.g., 32)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose

Environment

Python: 3.10.12 (main, Jan 17 2025, 14:35:34) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.54.14
PyTorch: 2.5.1+cu124
sglang: 0.4.4.post3+build1
sgl_kernel: 0.0.6
flashinfer: Module Not Found
triton: 3.1.0
transformers: 4.50.0
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.13
fastapi: 0.115.11
hf_transfer: 0.1.9
huggingface_hub: 0.29.3
interegular: 0.3.3
modelscope: 1.23.2
orjson: 3.10.15
outlines: 0.1.11
packaging: 24.2
psutil: 7.0.0
pydantic: 2.10.6
multipart: Module Not Found
zmq: Module Not Found
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
xgrammar: 0.1.17
openai: 1.66.3
tiktoken: 0.9.0
anthropic: 0.49.0
litellm: 1.63.7
decord: 0.6.0
NVIDIA Topology:
	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	NIC6	NIC7	NIC8	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV8	NV8	NV8	NV8	NV8	NV8	NV8	PXB	PXB	SYS	SYS	SYSSYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU1	NV8	 X 	NV8	NV8	NV8	NV8	NV8	NV8	PXB	PXB	SYS	SYS	SYSSYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU2	NV8	NV8	 X 	NV8	NV8	NV8	NV8	NV8	SYS	SYS	PXB	PXB	SYSSYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU3	NV8	NV8	NV8	 X 	NV8	NV8	NV8	NV8	SYS	SYS	PXB	PXB	SYSSYS	SYS	SYS	SYS	0-31,64-95	0		N/A
GPU4	NV8	NV8	NV8	NV8	 X 	NV8	NV8	NV8	SYS	SYS	SYS	SYS	PXBPXB	SYS	SYS	SYS	32-63,96-127	1		N/A
GPU5	NV8	NV8	NV8	NV8	NV8	 X 	NV8	NV8	SYS	SYS	SYS	SYS	PXBPXB	SYS	SYS	SYS	32-63,96-127	1		N/A
GPU6	NV8	NV8	NV8	NV8	NV8	NV8	 X 	NV8	SYS	SYS	SYS	SYS	SYSSYS	PXB	PXB	SYS	32-63,96-127	1		N/A
GPU7	NV8	NV8	NV8	NV8	NV8	NV8	NV8	 X 	SYS	SYS	SYS	SYS	SYSSYS	PXB	PXB	SYS	32-63,96-127	1		N/A
NIC0	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYSSYS	SYS	SYS	SYS
NIC1	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYSSYS	SYS	SYS	SYS
NIC2	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYSSYS	SYS	SYS	SYS
NIC3	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYSSYS	SYS	SYS	SYS
NIC4	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	 X PIX	SYS	SYS	SYS
NIC5	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	PIX X 	SYS	SYS	SYS
NIC6	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYSSYS	 X 	PIX	SYS
NIC7	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYSSYS	PIX	 X 	SYS
NIC8	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_2
  NIC1: mlx5_3
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_7
  NIC6: mlx5_8
  NIC7: mlx5_9
  NIC8: mlx5_bond_0


ulimit soft: 1048576

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions