[ci] chore: add sglang ci for NPU #6015
Code Review
This pull request introduces Dockerfiles for Ascend NPU support and refactors the run_ppo_trainer_megatron.sh script to conditionally support both CUDA and NPU devices. Review feedback identifies a critical version mismatch between the installed PyTorch version and the sgl-kernel-npu package, which could lead to binary incompatibility. Additionally, improvements are suggested for the Dockerfiles to reduce image size by cleaning up downloaded artifacts and to enhance security by verifying certificates during downloads.
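The conditional CUDA/NPU handling mentioned above can be pictured with a small shell sketch. This is not the logic of run_ppo_trainer_megatron.sh, which is not shown here: the function name `detect_device`, the `DEVICE` override variable, and the probing order are all hypothetical. It only assumes that `npu-smi` and `nvidia-smi` are the vendor CLIs present on Ascend and NVIDIA hosts respectively.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a device label for later branching.
# An explicit DEVICE value wins; otherwise probe for the vendor CLIs.
detect_device() {
    if [ -n "${DEVICE:-}" ]; then
        printf '%s\n' "$DEVICE"
    elif command -v npu-smi >/dev/null 2>&1; then
        echo npu
    elif command -v nvidia-smi >/dev/null 2>&1; then
        echo cuda
    else
        echo cpu
    fi
}

# Branch on the label; the messages stand in for device-specific overrides.
case "$(detect_device)" in
    npu)  echo "would apply NPU-specific overrides" ;;
    cuda) echo "would apply CUDA-specific overrides" ;;
    *)    echo "no accelerator detected" ;;
esac
```

Keeping the detection in one function means the rest of a script only branches on a single label rather than repeating the probes.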
    pip install torch==2.7.1 torchvision==0.22.1 && \
    pip install -e python[all_npu] && \
    # Install torch_npu
    ARCH=$(uname -m) && wget ${PTA_URL}/${PTA_BASE_VERSION}_${ARCH}.whl && pip install ${PTA_BASE_VERSION}_${ARCH}.whl && \
    echo "[LOG INFO] Torch_npu version is: ${PTA_BASE_VERSION}_${ARCH}.whl" && \
    cd ..

# Install sgl-kernel-npu
RUN wget --no-check-certificate https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.02.01/sgl-kernel-npu-2026.02.01-torch2.8.0-py311-cann8.5.0-910b-aarch64.zip && \
There is a significant version mismatch between the installed PyTorch version and the sgl-kernel-npu package. You are installing torch==2.7.1 at line 35, but downloading a kernel built for torch2.8.0 at line 43. This will likely result in binary incompatibility and runtime errors when the NPU kernels are loaded. Please ensure that the sgl-kernel-npu version matches the installed PyTorch version.
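One way to catch this kind of mismatch before the build reaches runtime is to compare the pinned torch version with the version tag embedded in the kernel asset name. The sketch below is a minimal illustration, not part of the PR: the function names `kernel_torch_version` and `check_torch_match` are hypothetical, and it assumes the asset name keeps the `-torchX.Y.Z-` pattern seen in the URL above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version from a "-torchX.Y.Z-" tag in an
# asset name (prints nothing when the tag is absent).
kernel_torch_version() {
    printf '%s\n' "$1" | sed -n 's/.*-torch\([0-9][0-9.]*\)-.*/\1/p'
}

# Hypothetical helper: succeed only when the pinned torch version equals
# the version the kernel asset was built for.
check_torch_match() {
    local pinned="$1" asset="$2" built_for
    built_for="$(kernel_torch_version "$asset")"
    if [ "$built_for" != "$pinned" ]; then
        echo "mismatch: torch==$pinned vs kernel built for torch$built_for" >&2
        return 1
    fi
}

# Usage with the values from the diff under review.
ASSET="sgl-kernel-npu-2026.02.01-torch2.8.0-py311-cann8.5.0-910b-aarch64.zip"
check_torch_match 2.7.1 "$ASSET" || echo "align the torch pin and the kernel asset before building"
```

With the values from the diff, the check fails because the pin is 2.7.1 while the asset name carries torch2.8.0.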
RUN wget --no-check-certificate https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.02.01/sgl-kernel-npu-2026.02.01-torch2.8.0-py311-cann8.5.0-910b-aarch64.zip && \
    unzip sgl-kernel-npu*.zip && \
    pip install torch_memory_saver*.whl && \
    pip install sgl_kernel_npu*.whl && \
    pip install deep_ep*.whl
The Dockerfile downloads several large .zip and .whl files into the root directory but does not remove them after installation. Additionally, using --no-check-certificate with wget is a security risk. It is recommended to remove the downloaded artifacts in the same RUN layer to keep the image size small and avoid security warnings.
RUN wget https://github.com/sgl-project/sgl-kernel-npu/releases/download/2026.02.01/sgl-kernel-npu-2026.02.01-torch2.8.0-py311-cann8.5.0-910b-aarch64.zip && \
    unzip sgl-kernel-npu*.zip && \
    pip install torch_memory_saver*.whl && \
    pip install sgl_kernel_npu*.whl && \
    pip install deep_ep*.whl && \
    rm -rf sgl-kernel-npu*.zip *.whl
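A variation on the cleanup suggested above is to do the download and install inside a throwaway directory that is always deleted, so no glob has to enumerate every leftover file. This is a sketch under assumptions, not the reviewer's proposal: the helper name `run_in_scratch` is hypothetical, and it relies only on `mktemp -d` and `rm -rf`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command inside a temporary directory and
# remove that directory afterwards, whether or not the command succeeded.
run_in_scratch() {
    local scratch status=0
    scratch="$(mktemp -d)" || return 1
    # Subshell so the directory change does not leak into the caller.
    ( cd "$scratch" && "$@" ) || status=$?
    rm -rf "$scratch"
    return "$status"
}

# Usage: anything created by the command disappears with the directory.
run_in_scratch sh -c 'touch example.whl && ls'
```

For the removal to actually shrink the image, the whole call would need to live in the same RUN instruction as the download, since files deleted in a later layer still occupy space in the earlier one.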
Updated installation steps to include pytest and modified pip install options.
What does this PR do?
Add an sglang CI YAML workflow for Ascend NPU and update the related scripts.
All CI test cases have been tested locally and executed successfully.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off; list multiple modules like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
For a breaking change, add [BREAKING] to the beginning of the title, e.g. [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Run the pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If the PR touches the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.