Closed
Description
Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English, otherwise it will be closed.
Describe the bug
I start the sglang server as below, intending to use two-batch-overlap and EPLB:
export model_path=/media/ssd1/ds-r1
export node_ip="127.0.0.1"
export node_port="30000"
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=INFO
unset https_proxy HTTPS_PROXY HTTP_PROXY http_proxy
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 nohup python3 -m sglang.launch_server --model-path ${model_path} --host ${node_ip} --port ${node_port} --trust-remote-code --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 65536 --max-running-requests 2048 --max-total-tokens 131076 --context-length 8192 --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --enable-eplb --eplb-rebalance-num-iterations 1000 --expert-distribution-recorder-mode stat 2>&1 > ds_$(date +'%Y%m%d_%H%M%S').log &
Then I benchmark to test the performance of EPLB:
nohup python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 2048 --random-output-len 128 --request-rate 100 --max-concurrency 20 --num-prompts 500 --host "127.0.0.1" --port "30000" 2>&1 > bench_$(date +'%Y%m%d_%H%M%S').log &
It hits an illegal memory access error when max-concurrency == 20, while max-concurrency == 1 works fine. Here's the stack trace:
[2025-05-26 18:03:12 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 137, in with_forward_pass
yield
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1185, in forward
output = self._forward_raw(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1212, in _forward_raw
ret = self.forward_decode(forward_batch, pp_proxy_tensors=pp_proxy_tensors)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1133, in forward_decode
return self.model.forward(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1702, in forward
return self.logits_processor(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 333, in forward
logits = self._get_logits(pruned_states, lm_head, logits_metadata)
File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 491, in _get_logits
dp_scatter(logits, global_logits, logits_metadata)
File "/sgl-workspace/sglang/python/sglang/srt/layers/dp_attention.py", line 290, in dp_scatter
memcpy_triton(
File "/sgl-workspace/sglang/python/sglang/srt/layers/dp_attention.py", line 221, in memcpy_triton
memcpy_triton_kernel[grid](dst, src, offset, sz, offset_src, chunk_size, BLOCK_SIZE)
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 330, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 653, in run
kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 444, in __call__
self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 118, in forward_thread_func
self.forward_thread_func_()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 151, in forward_thread_func_
self.worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 202, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1181, in forward
with get_global_expert_distribution_recorder().with_forward_pass(
File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 139, in with_forward_pass
self._on_forward_pass_end(forward_pass_id)
File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 153, in _on_forward_pass_end
self._accumulator.append(forward_pass_id, gatherer_key, single_pass_data)
File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 572, in append
self._global_physical_count_of_buffered_step.append(
File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 639, in append
self._buffer[self._curr_index] = value
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 117, in forward_thread_func
with torch.get_device_module(self.device).stream(self.forward_stream):
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
torch.cuda.set_stream(self.src_prev_stream) # type: ignore[arg-type]
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
_set_stream_by_id(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
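For context on the failing frame: `dp_scatter` copies each DP rank's slice of the logits into a shared buffer via a chunked device memcpy (`memcpy_triton`). As a rough CPU-side analogy (plain Python, not sglang's actual kernel — the function and variable names here are illustrative only), an offset/size pair that overruns the destination buffer is exactly the class of bug that surfaces on the GPU as the asynchronous illegal memory access above:

```python
def chunked_memcpy(dst, src, dst_offset, size, src_offset):
    """Copy `size` elements from src[src_offset:] into dst[dst_offset:].

    A simplified CPU-side analogy of a chunked device memcpy; sglang's
    memcpy_triton launches a Triton kernel with no bounds checking.
    """
    if dst_offset + size > len(dst) or src_offset + size > len(src):
        # On the GPU there is no such check: the out-of-range access is
        # reported asynchronously as "an illegal memory access".
        raise IndexError("copy exceeds buffer bounds")
    dst[dst_offset:dst_offset + size] = src[src_offset:src_offset + size]
    return dst

# Consistent offsets: the rank's slice fits in the shared buffer.
buf = [0] * 8
chunked_memcpy(buf, [1, 2, 3, 4], dst_offset=4, size=4, src_offset=0)

# Inconsistent offsets (e.g. metadata disagreeing on per-rank token
# counts under high concurrency) overrun the buffer.
try:
    chunked_memcpy(buf, [1, 2, 3, 4], dst_offset=6, size=4, src_offset=0)
except IndexError:
    pass
```

This only illustrates the failure class; the actual out-of-bounds source in `dp_scatter` would need to be confirmed with `CUDA_LAUNCH_BLOCKING=1` (already set above) or compute-sanitizer.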
Reproduction
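Condensing the commands above into one script (same flags as in the description; `model_path` must point to a local DeepSeek-R1 checkpoint, and the crash reproduces with `--max-concurrency 20` but not `1`):

```shell
#!/bin/bash
export model_path=/media/ssd1/ds-r1
export node_ip="127.0.0.1"
export node_port="30000"
export CUDA_LAUNCH_BLOCKING=1

# Launch the server with TBO + EPLB enabled.
MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server --model-path ${model_path} \
  --host ${node_ip} --port ${node_port} --trust-remote-code \
  --tp-size 8 --dp-size 8 --enable-dp-attention \
  --enable-deepep-moe --deepep-mode normal \
  --mem-fraction-static 0.85 --chunked-prefill-size 65536 \
  --max-running-requests 2048 --max-total-tokens 131076 \
  --context-length 8192 --ep-num-redundant-experts 32 \
  --enable-two-batch-overlap --moe-dense-tp-size 1 \
  --disable-radix-cache --enable-eplb \
  --eplb-rebalance-num-iterations 1000 \
  --expert-distribution-recorder-mode stat &

# Once the server is up, run the benchmark; 20 crashes, 1 works.
python3 -m sglang.bench_serving --backend sglang --dataset-name random \
  --random-input-len 2048 --random-output-len 128 --request-rate 100 \
  --max-concurrency 20 --num-prompts 500 \
  --host ${node_ip} --port ${node_port}
```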
Environment
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 560.35.03
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post5
sgl_kernel: 0.1.4
flashinfer_python: 0.2.5+cu124torch2.6
triton: 3.2.0
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.0
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.82.0
tiktoken: 0.9.0
anthropic: 0.52.0
litellm: 1.70.4
decord: 0.6.0