Merged
142 commits
3edfd88
Fix BadRequestError wrong arguments and remove openai dependency (#4882)
fzyzcjy Mar 29, 2025
e34ccea
Improve stack trace of retry errors (#4845)
fzyzcjy Mar 29, 2025
409d470
Tiny fix doc error (#4795)
fzyzcjy Mar 29, 2025
167d22c
[Docs] Update DeepGEMM at README.md (#4886)
FlamingoPg Mar 29, 2025
7d8a876
Update CODEOWNERS (#4889)
zhyncs Mar 29, 2025
4f9d2ba
Delete test_deep_gemm.py (#4891)
FlamingoPg Mar 29, 2025
5634a3a
Add deepseek style fused moe group gate selection kernel (#4530)
qingquansong Mar 29, 2025
e6d3ba6
quick fix: add default for new kernel (#4898)
FlamingoPg Mar 29, 2025
8cca62b
remove setup for sgl-kernel (#4899)
zhyncs Mar 29, 2025
4515581
[Misc] Clean m.def and add Development Tips (#4890)
FlamingoPg Mar 30, 2025
116c00c
fix allreduce test (#4909)
yizhang2077 Mar 30, 2025
ac3d99a
Support page size > 1 + eagle (#4908)
merrymercy Mar 30, 2025
410d637
Fix retract for page size > 1 (#4914)
merrymercy Mar 30, 2025
6024dff
[Feature] use pytest for sgl-kernel (#4896)
adarshxs Mar 30, 2025
53245c2
fix bmm fp8 (#4926)
zhyncs Mar 30, 2025
a67c15e
Fix the timeout for unit-test-2-gpu in pr-test.yml (#4927)
merrymercy Mar 30, 2025
e0bba86
Fix 2-gpu CI test and suppress some warnings (#4930)
merrymercy Mar 30, 2025
2c09126
[feat] add fa3 in sgl-kernel (#4902)
FlamingoPg Mar 30, 2025
a1630d6
Fix sglang frontend's incorrect dependency on torch (#4931)
seplos Mar 30, 2025
1b4768a
[Fix] avoid stream sync and torch compile in prefill for fa3 backend …
Fridge003 Mar 30, 2025
5ff9059
cleanup sgl-kernel (#4933)
zhyncs Mar 30, 2025
a639a05
[Fix] Improve Lora tests and reduce CI runtime (#4925)
Fridge003 Mar 31, 2025
8339e22
Fix DeepSeek bug causing 2.2% MMLU drop when TP!=DP (#4883)
fzyzcjy Mar 31, 2025
acc9ae6
[Fix] Add torch compile for torch.clamp back (#4936)
Fridge003 Mar 31, 2025
0259f4e
Fix oom error for large page size (#4913)
xiezhq-hermann Mar 31, 2025
5d61e95
[feat] interface for platforms abstraction (#4928)
Alcanderian Mar 31, 2025
c34051c
[Fix] revert clean m.def for cudagraph (#4944)
FlamingoPg Mar 31, 2025
ba2b8e4
refactor: multimodal data (#4754)
mickqian Mar 31, 2025
151b8da
bump sgl-kernel v0.0.6 (#4950)
zhyncs Mar 31, 2025
e999662
[Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (#4953)
guoyuhong Mar 31, 2025
d5c6416
use fa3 in sgl-kernel (#4954)
zhyncs Mar 31, 2025
bb0be58
Revert PR 4764 & 4813 related to R1 RoPE (#4959)
guoyuhong Apr 1, 2025
d782244
[Feature] Support DeepEP Low Latency (#4767)
liz-badada Apr 1, 2025
f4aa041
update bench_serving (#4958)
zhyncs Apr 1, 2025
353aba4
Prevent memory leak of retract_decode when page_size > 1 (#4977)
xiezhq-hermann Apr 1, 2025
bdcf3b5
[VLM RLHF] Take Image input for verl vlm rollout (#4915)
JustinTong0323 Apr 2, 2025
28a21ed
Large page size aligned hierarchical caching (#4581)
xiezhq-hermann Apr 2, 2025
f024098
bug fix for hicache host eviction (#4989)
xiezhq-hermann Apr 2, 2025
9c4abc1
sgl scaled_fp8_quant support output padding (#4861)
BBuf Apr 2, 2025
b8790ec
Add Eagle Speculative Decoding to FA3 Backend (#4951)
qingquansong Apr 2, 2025
6ee3bd1
Update tokenizer_manager.py (#5008)
yangky11 Apr 2, 2025
f08e4a5
[sgl-kernel] per token group quant support COLUMN MAJOR (#4817)
BBuf Apr 3, 2025
b77beb6
update cutlass tag (#5011)
xiezhq-hermann Apr 3, 2025
cb5d8fa
Feature/revise docs ci (#5009)
renxinx Apr 3, 2025
2f0bc46
fix: fix illegal cuda memory access at fused_moe_kernel (#4727)
saltyfish66 Apr 3, 2025
0d09d42
[Build] Support build sgl-kernel with ccache (#5020)
guoyuhong Apr 3, 2025
2f5ad61
fix deepgemm as well (#5030)
xiezhq-hermann Apr 3, 2025
2039ae4
try to fix ci oserror (#5024)
BBuf Apr 3, 2025
30cddef
Replace enable_flashinfer_mla argument with attention_backend (#5005)
Fridge003 Apr 3, 2025
1a42720
Small refactor DeepEPMode to clean up code a bit (#4992)
fzyzcjy Apr 3, 2025
69b91de
[Fix] fix fa3 build at cu118 (#5036)
FlamingoPg Apr 3, 2025
ecc640f
Revert "Replace enable_flashinfer_mla argument with attention_backend…
merrymercy Apr 3, 2025
8430f3c
bump sgl-kernel v0.0.7 (#5046)
zhyncs Apr 3, 2025
91355c8
update eagle-3 docs (#4796)
simveit Apr 3, 2025
326b58d
Add LlavaLlamaForCausaLM in MultiModal Processors (#5039)
ravi03071991 Apr 3, 2025
a67ea4a
Update the retry count (#5051)
zhyncs Apr 4, 2025
a4f6dd4
upgrade sgl-kernel v0.0.7 (#5049)
zhyncs Apr 4, 2025
6aaea85
[2/3] fix dsv3 awq issue (#4625)
AniZpZ Apr 4, 2025
210e831
Feature/revise docs ci (#5056)
renxinx Apr 4, 2025
d0753fb
Add H20 fused MoE kernel tuning configs for DeepSeek V3/R1 (#5057)
M0gician Apr 4, 2025
fd4b549
[fix] remove `cuda_device_count_stateless` (#5060)
Alcanderian Apr 4, 2025
e0dc54f
Small refactor DeepEPDispatcher into subclasses (#4994)
fzyzcjy Apr 4, 2025
0b91be8
Support async DeepEP by splitting into two stages (#4995)
fzyzcjy Apr 4, 2025
546ebc8
Cleanup unused resources after DeepEP operation (#4996)
fzyzcjy Apr 4, 2025
0cfe991
Add DeepSeek V3/R1 shared experts fusion (#4918)
BBuf Apr 4, 2025
2a03b83
[deepep] fix: shared experts are not initialized when shared experts …
ch-wan Apr 4, 2025
5cf628d
fix dummy-load deepseekv2 (#4535)
inkcherry Apr 4, 2025
4f79ccd
support sgl-kernel on blackwell (#5074)
zhyncs Apr 4, 2025
43478d0
FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5…
hebiao064 Apr 5, 2025
c1997f7
[Revision] Replace enable_flashinfer_mla argument with attention_back…
Fridge003 Apr 5, 2025
e13eb1a
upgrade transformers 4.51.0 (#5088)
zhyncs Apr 5, 2025
2709d38
sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (…
yizhang2077 Apr 5, 2025
8de5848
bump sgl-kernel 0.0.8 (#5089)
zhyncs Apr 5, 2025
080e30f
python transfer custom allreduce from trt kernel to vllm kernel (#5080)
yizhang2077 Apr 5, 2025
6b4373b
bump v0.4.4.post4 (#5091)
zhyncs Apr 5, 2025
c84460d
Fix: Reduce the number of document ci attempts to avoid long ci runni…
minleminzui Apr 6, 2025
005bcf6
Add Llama4 support (#5092)
CatherineSue Apr 7, 2025
f1f9dd5
Fix refactor error - fp8.py (#5106)
HaiShaw Apr 7, 2025
e9f428e
bump v0.4.5 (#5117)
zhyncs Apr 7, 2025
caeb3b8
[ci] fix llama4 ci error (#5126)
BBuf Apr 7, 2025
804e840
Refactor and Optimize FA3 Code (#5090)
hebiao064 Apr 7, 2025
a306a1e
Add Llama4 user guide (#5133)
ispobock Apr 8, 2025
cae9410
[Misc] Use pytest.mark.skipif in sgl-kernel test (#5137)
FlamingoPg Apr 8, 2025
9609eba
feat: disable grammar restrictions within reasoning sections (#4984)
minleminzui Apr 8, 2025
523ebd1
[modelopt] automatically inspect if model is ModelOpt quantized and s…
yundai424 Apr 8, 2025
63b3e26
[AMD] Fix missing per_token_group_quant_fp8 for ROCm (#5140)
hubertlu-tw Apr 8, 2025
fabe6d4
fix multimodal hash feature (#5083)
huangtingwei9988 Apr 8, 2025
e7328db
Fix run time error in ROCm platform (#5147)
kkHuang-amd Apr 8, 2025
5cfce40
[FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103)
zcnrex Apr 8, 2025
026ac6e
Add unit test on page_size > 1 and mla and integration test for Flas…
yubofredwang Apr 8, 2025
e0056a9
Use public model for FA3 speculative decode testing (#5152)
yubofredwang Apr 8, 2025
e18ab11
Add dummy grok test to amd CI. (#5115)
saienduri Apr 8, 2025
376e926
fix empty_cache error in pt_weights_iterator (#5151)
dangkai4u Apr 8, 2025
91a6868
Fix torch compile errors (#5158)
kkHuang-amd Apr 8, 2025
6806322
Fix loading KV quantization scale; Enable modelopt kv cache (#4686)
yundai424 Apr 8, 2025
c062a43
[PD] Fix unclosed prefill connection warning of mini_lb (#5155)
ShangmingCai Apr 8, 2025
08a3e58
Add optimized native kernels in sgl-kernel (#5150)
mingfeima Apr 8, 2025
21ff770
[PD] Simplify mini LB (#4911)
ByronHsu Apr 8, 2025
0a66d28
Small improvement of native api docs (#5139)
simveit Apr 8, 2025
7e23867
[feat&refactor] Enhance multimodal input support with refactor io_str…
JustinTong0323 Apr 8, 2025
e66bb14
Support 2x8xH100 for Llama 4 (#5159)
fzyzcjy Apr 8, 2025
b099b4d
FP4 weight loading and inference (2/2) (#3972)
trevor-m Apr 9, 2025
e373feb
Fix multimodal hashing error (#5174)
fzyzcjy Apr 9, 2025
f9efb42
Tiny disable model that does not work (#5175)
fzyzcjy Apr 9, 2025
b9ff9fe
[Bugfix] Fix index out of bounds in local attention with large sequen…
CatherineSue Apr 9, 2025
aa007f5
[Fix] DeepEP Compatibility with Low Latency (#5068)
liz-badada Apr 9, 2025
bdda960
docs: remove the use of Downward API for LWS_WORKER_INDEX (#5110)
yankay Apr 9, 2025
fe0d022
feat: add DeepGEMM build warning (#5176)
zhyncs Apr 9, 2025
c5df351
fix: use DeepEPDispatcher on CUDA (#5180)
zhyncs Apr 9, 2025
720ef3a
[DeepEP] fix: import buffer error (#5179)
ch-wan Apr 9, 2025
84626fc
Let `bench_one_batch` support `enable_dp_attention` (#4058)
fzyzcjy Apr 9, 2025
3535023
[Misc] clean up vllm in sgl-kernel test (#5189)
FlamingoPg Apr 9, 2025
5e1e3ca
Fix ci test "test_eval_fp8_accuracy" failed (#5185)
kkHuang-amd Apr 9, 2025
72e66fd
Optimize topk operation in llama4 (#5128)
fzyzcjy Apr 9, 2025
bbb5f05
Support Llama4 fp8 inference (#5194)
HandH1998 Apr 9, 2025
facf837
[ci] fix ci test fused_moe op (#5102)
BBuf Apr 9, 2025
97e80a4
model: support mllama4 (#5144)
mickqian Apr 9, 2025
aae6996
update grok test (#5171)
saienduri Apr 9, 2025
77e9549
sgl-kernel use cutlass latest version for fp8 blockwise gemm (#5207)
yizhang2077 Apr 9, 2025
3617f1e
Add H20 dtype fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V…
Muuuchen Apr 9, 2025
f241390
fix: log warning when disable cuda graph (#5209)
zhyncs Apr 9, 2025
414a840
[metrics] Add in queue metrics (#4444)
hebiao064 Apr 10, 2025
6fbe8d1
Fix DeepSeek error when using DeepEP mode (#5190)
fzyzcjy Apr 10, 2025
720818a
reduce moe_align_block_size_kernel small batch mode overhead (#5086)
BBuf Apr 10, 2025
796073f
[PD] Support KV transfer with mooncake (#4880)
stmatengss Apr 10, 2025
a85f762
[PD] Add get_contiguous_buf_infos interface for MLATokenToKVPool (#5204)
stmatengss Apr 10, 2025
41e1bee
Update deps for mllama4 (#5215)
ispobock Apr 10, 2025
c097f41
Fix deepseek-v3 with torch.compile in PyTorch 2.6. (#5213)
zou3519 Apr 10, 2025
973c449
ROCm sgl-kernel: compatible to later torch (#5167)
HaiShaw Apr 10, 2025
1dc3b18
[Misc] Clean sgl-kernel test (#5216)
FlamingoPg Apr 10, 2025
9684112
Update Makefile / build script to avoid installing incompatible torch…
elfiegg Apr 10, 2025
9e8a69f
Fix torch.compile cacheing (#5259)
zou3519 Apr 11, 2025
02f9a5e
ROCm/AITER CK_MoE: update 2-stage kernels & support both Activations …
HaiShaw Apr 11, 2025
74eb12c
Optimize attention in llama4 (#5127)
fzyzcjy Apr 11, 2025
d5df05a
Optimize GPU memory usage in FlashAttentionBackend's strided indexing…
CatherineSue Apr 11, 2025
1d65a62
Support `--enable-llama4-multimodal` (#5254)
ch-wan Apr 11, 2025
c9180cc
[fix] fix mrope positions not picked up (#5265)
mickqian Apr 11, 2025
610da05
doc: nested loop code for offline engine (#5244)
minleminzui Apr 11, 2025
278d4d2
fix: examples for token_in_token_out_vlm (#5193)
JustinTong0323 Apr 11, 2025
cfcc692
Fix a 404 link in send_request.ipynb (#5280)
windsonsea Apr 11, 2025
9371d0c
fix: enable fp4 compilation on cu128 (#5286)
zhyncs Apr 11, 2025
3e3276e
update
thyecust Apr 11, 2025
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -16,4 +16,4 @@
/test/lang @merrymercy @Ying1123 @ByronHsu
/test/srt @merrymercy @Ying1123 @zhyncs
/sgl-router @ByronHsu @Ying1123
-/sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy
+/sgl-kernel @zhyncs @ispobock @HandH1998 @BBuf @yizhang2077 @merrymercy @yinfan98
57 changes: 52 additions & 5 deletions .github/workflows/pr-test-amd.yml
@@ -7,12 +7,14 @@ on:
- "python/sglang/**"
- "test/**"
- "sgl-kernel/**"
+- ".github/workflows/pr-test-amd.yml"
pull_request:
branches: [ main ]
paths:
- "python/sglang/**"
- "test/**"
- "sgl-kernel/**"
+- ".github/workflows/pr-test-amd.yml"
workflow_dispatch:

concurrency:
@@ -36,12 +38,12 @@ jobs:
else
DEVICE_FLAG="--device /dev/dri"
fi
-docker pull lmsysorg/sglang:v0.4.3.post4-rocm630
+docker pull lmsysorg/sglang:v0.4.5-rocm630
docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
-v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
--cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
-w /sglang-checkout --name ci_sglang \
-lmsysorg/sglang:v0.4.3.post4-rocm630
+lmsysorg/sglang:v0.4.5-rocm630

- name: Install dependencies
run: |
@@ -53,6 +55,10 @@
docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
docker exec -w /human-eval ci_sglang pip install -e .

+docker exec -w / ci_sglang mkdir -p /dummy-grok
+mkdir -p dummy-grok && wget https://sharkpublic.blob.core.windows.net/sharkpublic/sglang/dummy_grok.json -O dummy-grok/config.json
+docker cp ./dummy-grok ci_sglang:/

- name: Evaluate Accuracy
timeout-minutes: 20
run: |
@@ -76,20 +82,19 @@
else
DEVICE_FLAG="--device /dev/dri"
fi
-docker pull lmsysorg/sglang:v0.4.3.post4-rocm630
+docker pull lmsysorg/sglang:v0.4.5-rocm630
docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
-v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
--cap-add=SYS_PTRACE -e HF_TOKEN=${{ secrets.AMD_HF_TOKEN }} --security-opt seccomp=unconfined \
-w /sglang-checkout --name ci_sglang \
-lmsysorg/sglang:v0.4.3.post4-rocm630
+lmsysorg/sglang:v0.4.5-rocm630

- name: Install dependencies
run: |
docker exec ci_sglang pip install --upgrade pip
docker exec ci_sglang pip uninstall sgl-kernel -y || true
docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
docker exec ci_sglang pip install -e "python[dev_hip]"
-docker exec ci_sglang pip install py-spy || true

docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
docker exec -w /human-eval ci_sglang pip install -e .
@@ -99,6 +104,48 @@ jobs:
run: |
docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 test_mla.py

+bench-test-2-gpu-amd:
+if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
+github.event.pull_request.draft == false
+runs-on: linux-mi300-gpu-2
+steps:
+- name: Checkout code
+uses: actions/checkout@v4
+
+- name: Setup docker
+run: |
+# Ensure GPU isolation if pod is part of kubernetes setup with DEVICE_FLAG.
+if [ -f "/etc/podinfo/gha-render-devices" ]; then
+DEVICE_FLAG=$(cat /etc/podinfo/gha-render-devices)
+else
+DEVICE_FLAG="--device /dev/dri"
+fi
+docker pull lmsysorg/sglang:v0.4.5-rocm630
+docker run -dt --user root --device=/dev/kfd $DEVICE_FLAG \
+-v ${{ github.workspace }}:/sglang-checkout --ipc=host --group-add video \
+--cap-add=SYS_PTRACE -e HF_TOKEN=${HF_TOKEN} --security-opt seccomp=unconfined \
+-w /sglang-checkout --name ci_sglang \
+lmsysorg/sglang:v0.4.5-rocm630
+
+- name: Install dependencies
+run: |
+docker exec ci_sglang pip install --upgrade pip
+docker exec ci_sglang pip uninstall sgl-kernel -y || true
+docker exec -w /sglang-checkout/sgl-kernel ci_sglang bash -c "rm -f pyproject.toml && mv pyproject_rocm.toml pyproject.toml && python3 setup_rocm.py install"
+docker exec ci_sglang pip install -e "python[dev_hip]"
+
+docker exec -w / ci_sglang git clone https://github.com/merrymercy/human-eval.git
+docker exec -w /human-eval ci_sglang pip install -e .
+
+docker exec -w / ci_sglang mkdir -p /dummy-grok
+mkdir -p dummy-grok && wget https://sharkpublic.blob.core.windows.net/sharkpublic/sglang/dummy_grok.json -O dummy-grok/config.json
+docker cp ./dummy-grok ci_sglang:/
+
+- name: Evaluate Benchmark
+timeout-minutes: 20
+run: |
+docker exec -w /sglang-checkout/test/srt -e SGLANG_IS_IN_CI=1 ci_sglang python3 models/test_dummy_grok_models.py
+
finish:
if: always()
needs: [
5 changes: 3 additions & 2 deletions .github/workflows/pr-test-sgl-kernel.yml
@@ -80,7 +80,8 @@ jobs:

- name: Install
run: |
-pip3 install torch==2.5.1 && pip3 install pytest && pip3 install vllm==0.6.4.post1
+bash scripts/ci_install_dependency.sh
+pip3 install torch==2.5.1 && pip3 install pytest && pip3 install vllm==0.7.2
pip3 uninstall sgl-kernel -y || true
pip3 install sgl-kernel/dist/*whl --force-reinstall --no-deps
pip3 list | grep sgl-kernel
@@ -89,7 +90,7 @@
timeout-minutes: 30
run: |
cd sgl-kernel
-find tests -name "test_*.py" | xargs -n 1 python3
+pytest tests/

- name: Uninstall dependencies
run: |
50 changes: 4 additions & 46 deletions .github/workflows/pr-test.yml
@@ -68,7 +68,7 @@ jobs:
bash scripts/ci_install_dependency.sh

- name: Run test
-timeout-minutes: 30
+timeout-minutes: 40
run: |
cd test/srt
python3 run_suite.py --suite per-commit --auto-partition-id ${{ matrix.part }} --auto-partition-size 7
@@ -87,53 +87,11 @@
run: |
bash scripts/ci_install_dependency.sh

-- name: Test data parallelism (DP=2)
-timeout-minutes: 10
-run: |
-cd test/srt
-python3 test_data_parallelism.py
-
-- name: Test data parallelism attention (DP=2)
-timeout-minutes: 10
-run: |
-cd test/srt
-python3 test_dp_attention.py
-
-- name: Test update weights from distributed
-timeout-minutes: 10
-run: |
-cd test/srt
-python3 test_update_weights_from_distributed.py
-
-- name: Test VerlEngine
-timeout-minutes: 10
-run: |
-cd test/srt
-python3 test_verl_engine.py
-
-- name: Test Patch Torch
-timeout-minutes: 10
-run: |
-cd test/srt
-python3 test_patch_torch.py
-
-- name: Test expert parallelism (EP=2)
-timeout-minutes: 10
-run: |
-cd test/srt
-python3 test_moe_ep.py
-
-- name: Test torch compile (TP=2)
-timeout-minutes: 10
+- name: Run test
+timeout-minutes: 25
run: |
cd test/srt
-python3 test_mla_tp.py

-- name: Test lora tensor parallelism (TP=2)
-timeout-minutes: 10
-run: |
-cd test/srt/models/lora
-python3 test_lora_tp.py
+python3 run_suite.py --suite per-commit-2-gpu

performance-test-1-gpu-part-1:
if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
1 change: 1 addition & 0 deletions .github/workflows/release-docs.yml
@@ -49,6 +49,7 @@ jobs:
make compile

make html
+python3 wrap_run_llm.py
cd _build/html

git clone https://$GITHUB_TOKEN@github.com/sgl-project/sgl-project.github.io.git ../sgl-project.github.io --depth 1
12 changes: 2 additions & 10 deletions .gitmodules
@@ -1,12 +1,4 @@
-[submodule "sgl-kernel/3rdparty/cutlass"]
-path = sgl-kernel/3rdparty/cutlass
-url = https://github.com/NVIDIA/cutlass.git
-[submodule "sgl-kernel/3rdparty/cccl"]
-path = sgl-kernel/3rdparty/cccl
-url = https://github.com/NVIDIA/cccl.git
[submodule "sgl-kernel/3rdparty/flashinfer"]
path = sgl-kernel/3rdparty/flashinfer
-url = https://github.com/flashinfer-ai/flashinfer.git
-[submodule "sgl-kernel/3rdparty/deepgemm"]
-path = sgl-kernel/3rdparty/deepgemm
-url = https://github.com/deepseek-ai/DeepGEMM
+url = https://github.com/sgl-project/flashinfer.git
+branch = sgl-kernel
5 changes: 3 additions & 2 deletions benchmark/deepseek_v3/README.md
@@ -178,10 +178,11 @@ python3 -m sglang.bench_one_batch_server --model None --base-url http://10.0.0.1

### Example: Serving with 8 A100/A800 with AWQ Quantization

-AWQ does not support BF16, so add the `--dtype half` flag if AWQ is used for quantization. One example is as follows:
+Add `--quantization moe_wna16` flag to enable moe wna16 kernel for better performance.
+One example is as follows:

```bash
-python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --dtype half
+python3 -m sglang.launch_server --model cognitivecomputations/DeepSeek-R1-AWQ --tp 8 --trust-remote-code --quantization moe_wna16
```


13 changes: 12 additions & 1 deletion benchmark/kernels/fused_moe_triton/tuning_fused_moe_triton.py
@@ -399,7 +399,12 @@ def main(args: argparse.Namespace):
intermediate_size = config.moe_intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size
elif config.architectures[0] in ["DeepseekV2ForCausalLM", "DeepseekV3ForCausalLM"]:
-E = config.n_routed_experts
+n_share_fusion_experts = args.n_share_experts_fusion
+E = (
+config.n_routed_experts + n_share_fusion_experts
+if config.architectures[0] in ["DeepseekV3ForCausalLM"]
+else config.n_routed_experts
+)
topk = config.num_experts_per_tok
intermediate_size = config.moe_intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size
@@ -559,6 +564,12 @@ def _distribute(method: str, inputs: List[Any]) -> List[Any]:
parser.add_argument("--seed", type=int, default=0)
parser.add_argument("--batch-size", type=int, required=False)
parser.add_argument("--tune", action="store_true")
+parser.add_argument(
+"--n-share-experts-fusion",
+type=int,
+default=0,
+help="The number of shared_experts need to be replica to fuse with normal experts in deepseek v3/r1",
+)
args = parser.parse_args()

main(args)
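The tuning-script hunk above changes how the expert count `E` is derived: for DeepSeek V3, the replicated shared experts are added on top of the routed experts, while the V2 path is unchanged. A minimal standalone sketch of that logic (the config objects and values below are hypothetical stand-ins for the HF model config, not real checkpoints):

```python
from types import SimpleNamespace

def expert_count(config, n_share_experts_fusion: int) -> int:
    # Mirrors the new branch: only the DeepSeek V3 architecture fuses
    # replicated shared experts into the routed-expert count.
    if config.architectures[0] in ["DeepseekV3ForCausalLM"]:
        return config.n_routed_experts + n_share_experts_fusion
    return config.n_routed_experts

# Hypothetical configs; real values come from the model's config.json.
v3 = SimpleNamespace(architectures=["DeepseekV3ForCausalLM"], n_routed_experts=256)
v2 = SimpleNamespace(architectures=["DeepseekV2ForCausalLM"], n_routed_experts=160)

print(expert_count(v3, 8))  # 264: 256 routed + 8 replicated shared experts
print(expert_count(v2, 8))  # 160: the V2 path ignores --n-share-experts-fusion
```

This matches the intent of the new `--n-share-experts-fusion` flag: the tuned kernel configs must cover the extra fused experts, so `E` grows by the replica count on V3 only.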
44 changes: 33 additions & 11 deletions benchmark/mmmu/bench_hf.py
@@ -1,5 +1,6 @@
import argparse

+import PIL
import torch
from data_utils import save_json
from eval_utils import (
@@ -72,17 +73,38 @@ def eval_mmmu(args):
if suffix:
contents += [{"type": "text", "text": suffix}]
messages = [{"role": "user", "content": contents}]
-model_inputs = processor.apply_chat_template(
-messages,
-tokenize=True,
-return_dict=True,
-add_generation_prompt=True,
-return_tensors="pt",
-).to(model.device)
-input_len = model_inputs["input_ids"].shape[-1]
-generation = model.generate(**model_inputs, generation_config=generation_config)
-generation = generation[0][input_len:]
-response = processor.decode(generation, skip_special_tokens=True)
+try:
+model_inputs = processor.tokenizer.apply_chat_template(
+messages,
+tokenize=True,
+return_dict=True,
+add_generation_prompt=True,
+return_tensors="pt",
+).to(model.device)
+input_len = model_inputs["input_ids"].shape[-1]
+generation = model.generate(
+**model_inputs, generation_config=generation_config
+)
+generation = generation[0][input_len:]
+response = processor.decode(generation, skip_special_tokens=True)
+except:
+contents = []
+if prefix:
+contents += [prefix]
+image = PIL.Image.open(sample["image_path"])
+contents += [image]
+if suffix:
+contents += [suffix]
+messages = [{"role": "user", "content": contents}]
+response = model.chat(
+msgs=messages,
+tokenizer=processor.tokenizer,
+sampling=False,
+max_new_tokens=sampling_params["max_new_tokens"],
+use_tts_template=False,
+generate_audio=False,
+temperature=0.0,
+)
print(f"response: {response}")
process_result(response, sample, answer_dict, out_samples)

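The `bench_hf.py` hunk wraps the generic HF chat-template path in a `try`/`except` and falls back to a model-specific `model.chat` API when the processor cannot apply a chat template. A stripped-down sketch of that fallback shape, with plain callables standing in for the two real generation paths:

```python
def generate_with_fallback(primary, fallback):
    # Try the generic chat-template path first; on any failure,
    # fall back to the model-specific chat API, as the benchmark does.
    try:
        return primary()
    except Exception:
        return fallback()

def chat_template_path():
    # Stand-in for processor.tokenizer.apply_chat_template + model.generate
    # on a model that lacks chat-template support.
    raise AttributeError("processor has no chat template")

response = generate_with_fallback(chat_template_path, lambda: "fallback response")
print(response)  # fallback response
```

Note the PR itself uses a bare `except:`, which also swallows `KeyboardInterrupt`; `except Exception` in the sketch is the slightly narrower idiom.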
2 changes: 1 addition & 1 deletion benchmark/mmmu/bench_sglang.py
@@ -86,8 +86,8 @@ def eval_mmmu(args):

if __name__ == "__main__":
parser = argparse.ArgumentParser()
-args = add_common_sglang_args_and_parse(parser)
EvalArgs.add_cli_args(parser)
+args = add_common_sglang_args_and_parse(parser)
args = parser.parse_args()

eval_mmmu(args)
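The one-line reorder in `bench_sglang.py` matters because `add_common_sglang_args_and_parse`, as its name suggests, parses the command line as part of its work, so any flags registered after it is called never make it into the parsed namespace. A small argparse sketch of the fixed ordering (the helper and flag names here are illustrative, not the real SGLang API):

```python
import argparse

def add_common_args_and_parse(parser):
    # Illustrative stand-in: registers shared flags, then parses
    # immediately, like the benchmark helper's name implies.
    parser.add_argument("--port", type=int, default=30000)
    return parser.parse_args([])  # empty argv for the sketch

parser = argparse.ArgumentParser()
# Fixed order: eval-specific flags are registered BEFORE the helper parses.
parser.add_argument("--num-questions", type=int, default=10)
args = add_common_args_and_parse(parser)

print(args.num_questions, args.port)  # 10 30000
```

With the old order, the eval flags would be added to a parser whose arguments had already been consumed, so they would either be rejected as unrecognized or silently absent.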
2 changes: 2 additions & 0 deletions benchmark/mmmu/eval_utils.py
@@ -442,6 +442,8 @@ def calculate_ins_level_acc(results: Dict):


def process_result(response, sample, answer_dict, out_samples):
+if response is None:
+return
if sample["question_type"] == "multiple-choice":
pred_ans = parse_multi_choice_response(
response, sample["all_choices"], sample["index2ans"]
2 changes: 2 additions & 0 deletions docker/Dockerfile.dev
@@ -21,6 +21,7 @@ RUN apt-get update && apt-get install -y \
pkg-config \
libssl-dev \
bear \
+ccache \
&& apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
@@ -44,6 +45,7 @@ RUN python3 -m pip install --no-cache-dir \
black \
isort \
icdiff \
+uv \
pre-commit

# Install diff-so-fancy
2 changes: 1 addition & 1 deletion docker/Dockerfile.rocm
@@ -1,5 +1,5 @@
# Usage (to build SGLang ROCm docker image):
-# docker build --build-arg SGL_BRANCH=v0.4.4.post3 -t v0.4.4.post3-rocm630 -f Dockerfile.rocm .
+# docker build --build-arg SGL_BRANCH=v0.4.5 -t v0.4.5-rocm630 -f Dockerfile.rocm .

# default base image
ARG BASE_IMAGE="rocm/sgl-dev:vllm20250114"
2 changes: 1 addition & 1 deletion docs/Makefile
@@ -23,7 +23,7 @@ compile:
parallel -0 -j3 --halt soon,fail=1 ' \
NB_NAME=$$(basename {}); \
START_TIME=$$(date +%s); \
-retry --delay=0 --times=3 -- \
+retry --delay=0 --times=2 -- \
jupyter nbconvert --to notebook --execute --inplace "{}" \
--ExecutePreprocessor.timeout=600 \
--ExecutePreprocessor.kernel_name=python3; \