Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Hi all, I'm currently running EAGLE3 on Llama 3.3 70B using the official draft model from HF: lmsys/sglang-EAGLE3-LLaMA3.3-Instruct-70B. Before the PR that added extend CUDA graph support (#6606), the server started successfully and sglang was able to load the draft CUDA graphs. With the addition of the extend CUDA graphs, the draft model is no longer compatible and errors out with this:
```
[2025-06-09 18:16:26 TP1] Capture draft cuda graph end. Time elapsed: 13.48 s. avail mem=31.31 GB. mem usage=1.10 GB.
[2025-06-09 18:16:26 TP1] Capture draft extend cuda graph begin. This can take up to several minutes. avail mem=31.31 GB
Capturing batches (avail_mem=30.87 GB):   0%| | 0/20 [00:00<?, ?it/s]
[2025-06-09 18:16:26 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 89, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 113, in capture
    CudaGraphRunner.capture(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 389, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 187, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 173, in run_once
    ret = self.eagle_worker.draft_model_runner.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 457, in forward
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama_eagle3.py", line 131, in forward
    hidden_states = self.fc(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x18432 and 24576x6144)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2490, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 295, in __init__
    self.draft_worker = EAGLEWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 152, in __init__
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 270, in init_cuda_graphs
    self.cuda_graph_runner_for_draft_extend = EAGLEDraftExtendCudaGraphRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 91, in __init__
    raise Exception(
Exception: Capture CUDA graph failed: mat1 and mat2 shapes cannot be multiplied (128x18432 and 24576x6144)
Possible solutions:
- set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
- set --cuda-graph-max-bs to a smaller value (e.g., 16)
- disable torch compile by not using --enable-torch-compile
- disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2025-06-09 18:16:26 TP0] Scheduler hit an exception: (identical traceback and exception on TP0, omitted for brevity)
[2025-06-09 18:16:26] Received sigquit from a child process. It usually means the child failed.
```
Is the new extend CUDA graph feature not intended for EAGLE3? Thanks!
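For context on the numbers in the error: `F.linear(x, W)` computes `x @ W.T`, so the input's last dimension must equal the layer's `in_features` (`W.shape[1]`). The traceback reports mat1 = 128x18432 (the activations reaching the draft model's `fc` layer) and mat2 = 24576x6144 (the transposed weight, i.e. the layer expects 24576 input features). A minimal sketch of that compatibility check, using the shapes verbatim from the error (how 18432 vs. 24576 arises inside sglang is my reading, not verified against the code):

```python
# Sketch only: shapes are copied from the error message above.
# torch.nn.functional.linear(x, W) computes x @ W.T, which requires
# x.shape[-1] == W.shape[1] (the layer's in_features).
mat1 = (128, 18432)    # activations entering the draft model's fc layer
mat2 = (24576, 6144)   # W.T as reported, so the fc layer expects in_features=24576

# A matmul (m, k) @ (k2, n) is defined only when k == k2.
def can_multiply(a, b):
    return a[1] == b[0]

print(can_multiply(mat1, mat2))  # False: 18432 != 24576, hence the RuntimeError
```

So the extend CUDA graph path is feeding the draft's `fc` layer a tensor with 18432 features where the layer expects 24576; the batch dimension (128) is fine, only the feature dimension disagrees.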
Reproduction
```
python3 -m sglang.launch_server --model /Llama-3.3-70B-Instruct-FP8-Dynamic --speculative-algorithm EAGLE3 --speculative-draft-model-path /sglang-EAGLE3-LLaMA3.3-Instruct-70B --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 --mem-fraction 0.6 --dtype float16 --tp-size 2 --max-running-requests 32
```
Environment
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1: NVIDIA H100 NVL
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post5
sgl_kernel: 0.1.5
flashinfer_python: 0.2.5+cu124torch2.6
triton: 3.2.0
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.9
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.84.0
tiktoken: 0.9.0
anthropic: 0.52.2
litellm: 1.72.1
decord: 0.6.0
NVIDIA Topology:
```
        GPU0   GPU1   NIC0   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     SYS    NODE   0-39           0               N/A
GPU1    SYS     X     SYS    40-79          1               N/A
NIC0    NODE   SYS     X
```
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_an0