[Bug] invalid memory access when using EPLB in multi batch scenario. #6644

@UnlceYang

Description

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I start the sglang server as below, with two-batch-overlap and EPLB enabled:

export model_path=/media/ssd1/ds-r1
export node_ip="127.0.0.1"
export node_port="30000"
export CUDA_LAUNCH_BLOCKING=1
export NCCL_DEBUG=INFO
unset https_proxy HTTPS_PROXY HTTP_PROXY http_proxy

MC_TE_METRIC=true SGLANG_HACK_DEEPEP_NEW_MODE=0 SGL_ENABLE_JIT_DEEPGEMM=1 nohup python3 -m sglang.launch_server --model-path ${model_path} --host ${node_ip} --port ${node_port} --trust-remote-code --tp-size 8 --dp-size 8 --enable-dp-attention --enable-deepep-moe --deepep-mode normal --mem-fraction-static 0.85 --chunked-prefill-size 65536 --max-running-requests 2048 --max-total-tokens 131076 --context-length 8192 --ep-num-redundant-experts 32 --enable-two-batch-overlap --moe-dense-tp-size 1 --disable-radix-cache --enable-eplb --eplb-rebalance-num-iterations 1000 --expert-distribution-recorder-mode stat 2>&1 > ds_$(date +'%Y%m%d_%H%M%S').log &

Then I run a benchmark to test the performance of EPLB:

nohup python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 2048 --random-output-len 128 --request-rate 100 --max-concurrency 20 --num-prompts 500 --host "127.0.0.1" --port "30000" 2>&1 > bench_$(date +'%Y%m%d_%H%M%S').log &

It encounters an invalid memory access error when max-concurrency is 20, while max-concurrency of 1 works fine. Here's the stack trace:

[2025-05-26 18:03:12 DP1 TP1] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 137, in with_forward_pass
    yield
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1185, in forward
    output = self._forward_raw(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1212, in _forward_raw
    ret = self.forward_decode(forward_batch, pp_proxy_tensors=pp_proxy_tensors)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1133, in forward_decode
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/deepseek_v2.py", line 1702, in forward
    return self.logits_processor(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 333, in forward
    logits = self._get_logits(pruned_states, lm_head, logits_metadata)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/logits_processor.py", line 491, in _get_logits
    dp_scatter(logits, global_logits, logits_metadata)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/dp_attention.py", line 290, in dp_scatter
    memcpy_triton(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/dp_attention.py", line 221, in memcpy_triton
    memcpy_triton_kernel[grid](dst, src, offset, sz, offset_src, chunk_size, BLOCK_SIZE)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 653, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 444, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 118, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 151, in forward_thread_func_
    self.worker.forward_batch_generation(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 202, in forward_batch_generation
    logits_output, can_run_cuda_graph = self.model_runner.forward(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1181, in forward
    with get_global_expert_distribution_recorder().with_forward_pass(
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 139, in with_forward_pass
    self._on_forward_pass_end(forward_pass_id)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 153, in _on_forward_pass_end
    self._accumulator.append(forward_pass_id, gatherer_key, single_pass_data)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 572, in append
    self._global_physical_count_of_buffered_step.append(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/expert_distribution.py", line 639, in append
    self._buffer[self._curr_index] = value
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
      
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 117, in forward_thread_func
    with torch.get_device_module(self.device).stream(self.forward_stream):
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 595, in __exit__
    torch.cuda.set_stream(self.src_prev_stream)  # type: ignore[arg-type]
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 636, in set_stream
    _set_stream_by_id(
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 618, in _set_stream_by_id
    torch._C._cuda_setStream(
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Reproduction
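
See the server launch and benchmark commands above. As a minimal sketch (hypothetical helper names; it assumes the same `sglang.bench_serving` flags and a server already running on 127.0.0.1:30000), the failing concurrency threshold can be narrowed down by sweeping `--max-concurrency` between the known-good value 1 and the known-bad value 20:

```python
import subprocess
import sys


def build_bench_cmd(max_concurrency: int,
                    host: str = "127.0.0.1",
                    port: str = "30000") -> list[str]:
    """Build the bench_serving command line used above for a given concurrency."""
    return [
        sys.executable, "-m", "sglang.bench_serving",
        "--backend", "sglang",
        "--dataset-name", "random",
        "--random-input-len", "2048",
        "--random-output-len", "128",
        "--request-rate", "100",
        "--max-concurrency", str(max_concurrency),
        "--num-prompts", "500",
        "--host", host,
        "--port", port,
    ]


def sweep(levels=(1, 2, 4, 8, 16, 20)):
    """Run the benchmark at increasing concurrency; return the first level
    whose run exits non-zero (the server crash shows up as a failed run),
    or None if every level passes. Requires a running sglang server."""
    for conc in levels:
        result = subprocess.run(build_bench_cmd(conc))
        print(f"max-concurrency={conc}: exit code {result.returncode}")
        if result.returncode != 0:
            return conc
    return None
```

Calling `sweep()` against a live server should show whether the illegal memory access appears at some fixed threshold between 1 and 20 or only at 20 itself.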

Environment

Python: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H800
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 560.35.03
PyTorch: 2.6.0+cu124
sglang: 0.4.6.post5
sgl_kernel: 0.1.4
flashinfer_python: 0.2.5+cu124torch2.6
triton: 3.2.0
transformers: 4.52.3
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.11.18
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.0
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.2
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.19
openai: 1.82.0
tiktoken: 0.9.0
anthropic: 0.52.0
litellm: 1.70.4
decord: 0.6.0