
[Feature] Support DeepEP normal & Redundant Experts on NPU#9881

Merged
zhyncs merged 20 commits into sgl-project:main from iforgetmyname:feature/deepep_normal
Sep 11, 2025

Conversation

@iforgetmyname (Collaborator) commented Sep 1, 2025:

Motivation

This PR adds support on Ascend NPU for:

  1. DeepEP normal mode
  2. Redundant Experts

Along with the previously merged #8355, both prefill and decode can now run with expert parallelism on Atlas 800I A3. This also means running large-scale MoE models without PD disaggregation is possible if HBM capacity allows.

Check out our DeepEP-Ascend roadmap here.
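
As a rough illustration of the "if HBM capacity allows" caveat, here is a back-of-envelope sketch. The 671B parameter count (DeepSeek-R1 scale) and the one-byte-per-weight assumption for w8a8_int8 are my own assumptions, not figures from this PR:

```python
# Back-of-envelope HBM check for weight storage only (activations, KV cache,
# and communication buffers come on top). All numbers are assumptions.
NUM_PARAMS = 671e9      # assumed DeepSeek-R1-scale model
BYTES_PER_PARAM = 1     # int8 weights under w8a8_int8 quantization
NUM_RANKS = 16          # matches --tp-size 16 in the launch commands below

weight_bytes_per_rank = NUM_PARAMS * BYTES_PER_PARAM / NUM_RANKS
print(f"~{weight_bytes_per_rank / 2**30:.1f} GiB of weights per rank")
```

This is only a sanity check on weight memory per device; the real budget must also cover the KV cache, which is why `--mem-fraction-static` is tuned in the commands below.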

Modifications

  • Fix a bug where the FIA kernel rejects input_seq_len smaller than tp_size
  • Remove AscendDeepEPLLOutput, since DeepEP-Ascend now aligns its output variables with DeepSeek's DeepEP
  • Support intranode dispatch/combine (DeepEP normal mode)
  • Support the expert distribution recorder & redundant experts
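
As a sketch of the redundant-experts idea (the function name and the round-robin replication policy are illustrative assumptions, not SGLang's actual implementation): every logical expert gets one physical slot, the extra redundant slots replicate some experts, and routing may then pick any replica of the target logical expert to balance load.

```python
# Illustrative static redundant-expert layout: physical slot -> logical expert id.
# Hypothetical sketch; SGLang's real mapping is produced by its EPLB machinery.
def build_physical_to_logical(num_logical: int, num_redundant: int) -> list[int]:
    # One slot per logical expert...
    mapping = list(range(num_logical))
    # ...then redundant slots replicate logical experts round-robin.
    mapping += [i % num_logical for i in range(num_redundant)]
    return mapping

# 256 logical experts + 16 redundant slots, as in --ep-num-redundant-experts 16.
mapping = build_physical_to_logical(num_logical=256, num_redundant=16)
assert len(mapping) == 272                  # total physical expert slots
assert mapping[256:] == list(range(16))     # experts 0..15 each gain one replica
```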

Accuracy Tests

[accuracy-test results screenshot]

Benchmarking and Profiling

# Prefill
# NOTE: the number of P instances should be increased, as a single P cannot keep D fully utilized
export HCCL_BUFFSIZE=1536
python3 -m sglang.launch_server \
    --model-path <deepseek-model-path> \
    --trust-remote-code \
    --attention-backend ascend \
    --mem-fraction-static 0.85 \
    --quantization w8a8_int8 \
    --disable-radix-cache \
    --chunked-prefill-size 32768 \
    --tp-size 16 \
    --dp-size 1 \
    --ep-size 16 \
    --moe-a2a-backend deepep \
    --deepep-mode normal \
    --nnodes 1 \
    --node-rank 0 \
    --disaggregation-mode prefill \
    --disaggregation-transfer-backend ascend \
    --ep-num-redundant-experts 16 \
    --ep-dispatch-algorithm static \
    --init-expert-location <location-file>

# Decode
export HCCL_BUFFSIZE=500
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=32
python3 -m sglang.launch_server \
    --model-path <deepseek-model-path> \
    --max-running-requests 512 \
    --trust-remote-code \
    --attention-backend ascend \
    --mem-fraction-static 0.9 \
    --quantization w8a8_int8 \
    --disable-radix-cache \
    --chunked-prefill-size 32768 \
    --cuda-graph-bs 8 16 24 32 \
    --tp-size 16 \
    --dp-size 2 \
    --enable-dp-attention \
    --ep-size 16 \
    --moe-a2a-backend deepep \
    --deepep-mode low_latency \
    --nnodes 1 \
    --node-rank 0 \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend ascend \
    --ep-num-redundant-experts 16 \
    --ep-dispatch-algorithm static \
    --init-expert-location <location-file>
[benchmark results screenshot]
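
Both launch commands pass `--ep-dispatch-algorithm static` together with `--init-expert-location <location-file>`. A hedged sketch of generating such a file is below; the `physical_to_logical_map` key and JSON layout are assumptions about the expected schema, so check SGLang's expert-location loader before relying on this shape:

```python
import json
import tempfile

# Hypothetical generator for a static expert-location file; the schema
# (key name, JSON container) is an assumption, not taken from this PR.
num_logical, num_redundant = 256, 16
physical_to_logical = list(range(num_logical)) + [
    i % num_logical for i in range(num_redundant)  # replicate experts 0..15
]

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"physical_to_logical_map": physical_to_logical}, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)
assert len(loaded["physical_to_logical_map"]) == num_logical + num_redundant
```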

Checklist

@iforgetmyname iforgetmyname marked this pull request as ready for review September 2, 2025 07:36
@iforgetmyname iforgetmyname marked this pull request as draft September 2, 2025 09:43
@iforgetmyname iforgetmyname marked this pull request as ready for review September 2, 2025 13:53
@iforgetmyname iforgetmyname changed the title [Feature] Support DeepEP normal & EPLB [Feature] Support DeepEP normal & Redundant Experts on NPU Sep 2, 2025
@iforgetmyname (Collaborator, Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request introduces support for DeepEP normal mode and redundant experts on Ascend NPUs. The changes are well-organized and align with the goal of extending hardware support. Key modifications include making parts of the codebase device-agnostic by replacing CUDA-specific calls with generic device-aware functions, and refactoring the NPU forward pass to handle both normal and low-latency DeepEP modes. The removal of AscendDeepEPLLOutput in favor of the standard DeepEPLLOutput is a good step towards unifying the implementation. Additionally, new tests for Ascend DeepEP have been added, which is great for ensuring correctness.

I have a couple of suggestions to improve maintainability and robustness:

  1. Refactoring the forward_npu method in python/sglang/srt/layers/moe/ep_moe/layer.py to reduce code duplication.
  2. Improving the robustness of the CI script scripts/ci/npu_ci_install_dependency.sh for locating the site-packages directory.

### Install sgl-kernel-npu
SGL_KERNEL_NPU_TAG="20250901"
git clone --depth 1 https://github.com/sgl-project/sgl-kernel-npu.git --branch ${SGL_KERNEL_NPU_TAG}
(cd sgl-kernel-npu && bash ./build.sh -a deepep && pip install output/deep_ep*.whl && cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so)
gemini-code-assist bot (Contributor) commented:

Severity: medium

The command to find the site-packages directory using pip show | grep | awk is a bit fragile and might break if the output format of pip show changes in the future. A more robust approach would be to use Python's site module to get the site-packages path directly. This avoids parsing command-line tool output.

Suggested change
(cd sgl-kernel-npu && bash ./build.sh -a deepep && pip install output/deep_ep*.whl && cd "$(pip show deep-ep | grep -E '^Location:' | awk '{print $2}')" && ln -s deep_ep/deep_ep_cpp*.so)
(cd sgl-kernel-npu && bash ./build.sh -a deepep && pip install output/deep_ep*.whl && cd "$(python3 -c 'import site; print(site.getsitepackages()[0])')" && ln -s deep_ep/deep_ep_cpp*.so)
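
For reference, a minimal standard-library check of the suggested approach: `site.getsitepackages()` returns the interpreter's site-packages directories directly, with no dependence on `pip show`'s output format.

```python
import site

# Resolve site-packages without parsing `pip show` output, so the result
# is unaffected by future changes to pip's text formatting.
paths = site.getsitepackages()
print(paths[0])  # first site-packages directory of this interpreter
```

Note that inside some virtual environments the first entry may differ from where pip installed the wheel, so the suggestion is more robust but still worth verifying in the CI image.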

push: ${{ github.repository == 'sgl-project/sglang' && github.event_name != 'pull_request' }}
provenance: false
build-args: |
SGLANG_KERNEL_NPU_TAG=20250901
A Contributor commented:

Why is this hard-coded here? Is it because this NPU version is not released?

A Collaborator replied:

No, you can see this link.

iforgetmyname (Author) replied:

This link instead; it is a release tag for the sgl-kernel-npu repo.

)

TEST_MODEL_MATRIX = {
"/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-R1-0528-W8A8": {
A Contributor commented:

This path is hard-coded.

A Collaborator replied:

Because we want to accelerate the CI tests by using a cached file.

iforgetmyname (Author) replied:

It's still a bug with ModelScope: we still can't launch models downloaded with ModelScope using the namespace/model format.

@zhyncs (Collaborator) commented Sep 8, 2025:

@fzyzcjy @hnyls2002 please help review this pr. thanks.

@zhyncs zhyncs merged commit 5b64f00 into sgl-project:main Sep 11, 2025
191 of 212 checks passed
@iforgetmyname iforgetmyname deleted the feature/deepep_normal branch September 13, 2025 07:29
@ping1jing2 ping1jing2 self-assigned this Dec 16, 2025