Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
Hi all, I'm currently running EAGLE3 on Llama 3.3 70B using the official draft model from HF: lmsys/sglang-EAGLE3-LLaMA3.3-Instruct-70B. Before the PR that added extend CUDA graph support (#6606), the server started successfully and sglang was able to load the draft CUDA graphs. With the addition of the extend CUDA graphs, the draft model is no longer compatible and errors out with this:
```
[2025-06-09 18:16:26 TP1] Capture draft cuda graph end. Time elapsed: 13.48 s. avail mem=31.31 GB. mem usage=1.10 GB.
[2025-06-09 18:16:26 TP1] Capture draft extend cuda graph begin. This can take up to several minutes. avail mem=31.31 GB
Capturing batches (avail_mem=30.87 GB):   0%| | 0/20 [00:00<?, ?it/s]
[2025-06-09 18:16:26 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 89, in __init__
    self.capture()
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 113, in capture
    CudaGraphRunner.capture(self)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 389, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 187, in capture_one_batch_size
    run_once()
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 173, in run_once
    ret = self.eagle_worker.draft_model_runner.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 457, in forward
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama_eagle3.py", line 131, in forward
    hidden_states = self.fc(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 125, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x18432 and 24576x6144)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2490, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 295, in __init__
    self.draft_worker = EAGLEWorker(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 152, in __init__
    self.init_cuda_graphs()
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 270, in init_cuda_graphs
    self.cuda_graph_runner_for_draft_extend = EAGLEDraftExtendCudaGraphRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py", line 91, in __init__
    raise Exception(
Exception: Capture CUDA graph failed: mat1 and mat2 shapes cannot be multiplied (128x18432 and 24576x6144)
Possible solutions:
- set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
- set --cuda-graph-max-bs to a smaller value (e.g., 16)
- disable torch compile by not using --enable-torch-compile
- disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
[2025-06-09 18:16:26 TP0] Scheduler hit an exception: (identical traceback and exception on TP0, omitted for brevity)
[2025-06-09 18:16:26] Received sigquit from a child process. It usually means the child failed.
```
Is the new extend CUDA graph feature not intended for EAGLE3? Thanks!
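For context on the numbers in the error: `F.linear(x, W)` computes `x @ W.T`, so the input's last dimension must equal the layer's `in_features` (`W.shape[1]`). The traceback reports mat1 = 128x18432 (the activations reaching the draft model's `fc` layer) and mat2 = 24576x6144 (the transposed weight, i.e. the layer expects 24576 input features). A minimal sketch of that compatibility check, using the shapes verbatim from the error (how 18432 vs. 24576 arises inside sglang is my reading, not verified against the code):

```python
# Sketch only: shapes are copied from the error message above.
# torch.nn.functional.linear(x, W) computes x @ W.T, which requires
# x.shape[-1] == W.shape[1] (the layer's in_features).
mat1 = (128, 18432)    # activations entering the draft model's fc layer
mat2 = (24576, 6144)   # W.T as reported, so the fc layer expects in_features=24576

# A matmul (m, k) @ (k2, n) is defined only when k == k2.
def can_multiply(a, b):
    return a[1] == b[0]

print(can_multiply(mat1, mat2))  # False: 18432 != 24576, hence the RuntimeError
```

So the extend CUDA graph path is feeding the draft's `fc` layer a tensor with 18432 features where the layer expects 24576; the batch dimension (128) is fine, only the feature dimension disagrees.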
Reproduction
```
python3 -m sglang.launch_server --model /Llama-3.3-70B-Instruct-FP8-Dynamic --speculative-algorithm EAGLE3 --speculative-draft-model-path /sglang-EAGLE3-LLaMA3.3-Instruct-70B --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 --mem-fraction 0.6 --dtype float16 --tp-size 2 --max-running-requests 32
```
Environment
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1: NVIDIA H100 NVL
GPU 0,1 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 535.161.08
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post5
sgl_kernel: 0.1.5
flashinfer_python: 0.2.5+cu124torch2.6
triton: 3.2.0
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.9
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.84.0
tiktoken: 0.9.0
anthropic: 0.52.2
litellm: 1.72.1
decord: 0.6.0
NVIDIA Topology:
```
        GPU0   GPU1   NIC0   CPU Affinity   NUMA Affinity   GPU NUMA ID
GPU0     X     SYS    NODE   0-39           0               N/A
GPU1    SYS     X     SYS    40-79          1               N/A
NIC0    NODE   SYS     X
```
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_an0