
[CI] update verlengine ci to 4-gpu test#6007

Merged
zhyncs merged 18 commits into sgl-project:main from ocss884:add-4-gpu-ci
May 27, 2025

Conversation

@ocss884
Collaborator

@ocss884 ocss884 commented May 4, 2025

Motivation

Relevant to #5997. Update the VerlEngine test to a more comprehensive one, which uses 8 GPUs (dp=4, tp=2). @merrymercy @zhaochenyang20

Modifications

Checklist

@ocss884 ocss884 changed the title add 4-gpu test & update verlengine ci [CI] add 4-gpu test & update verlengine ci May 4, 2025
@ocss884 ocss884 changed the title [CI] add 4-gpu test & update verlengine ci [CI] update verlengine ci to 4-gpu test May 5, 2025
Collaborator

@zhaochenyang20 zhaochenyang20 left a comment


I think only a small part of test_verl_engine.py should use 8 GPUs. We should split it to save 8-GPU CI time.
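The split suggested above can be sketched as a GPU-count gate. This is a hypothetical illustration, not the actual test file: the `available_gpus` helper and `TestVerlEngine8GPU` class are assumptions for demonstration.

```python
import unittest

def available_gpus():
    # Report the number of visible CUDA devices; 0 when torch is absent.
    try:
        import torch
        return torch.cuda.device_count()
    except ImportError:
        return 0

# Hypothetical split: tests that genuinely need 8 GPUs live in their own
# class and are skipped on smaller runners, so the common path stays cheap.
@unittest.skipUnless(available_gpus() >= 8, "requires 8 GPUs")
class TestVerlEngine8GPU(unittest.TestCase):
    def test_placeholder(self):
        pass
```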

@ocss884 ocss884 changed the title [CI] update verlengine ci to 4-gpu test [CI] update verlengine ci to 8-gpu test May 5, 2025
@ocss884
Collaborator Author

ocss884 commented May 5, 2025

I split the tests by GPU requirement: <=2 GPUs or >2 GPUs. Random model choice is added to the 8-GPU suite (although there is only 1 model for the 8-GPU CI now). The 2-GPU suite will not have any model for CI testing, only for local usage. @zhaochenyang20

Collaborator


We can remove gpt2; nobody cares about it.

Collaborator


I mean, just delete it

@zhaochenyang20
Collaborator

Just rebased; after the CI passes, we can merge it.

@zhaochenyang20
Collaborator

Now there is already a 4-gpu CI, which can be used directly. Note that you need to modify the current test_verl_engine_8_gpu.py: one change is the name, and the other is to check the content. There should be no unit tests that actually use 8 GPUs.

Comment on lines 92 to 109
```yaml
  unit-test-backend-4-gpu:
    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
      github.event.pull_request.draft == false
    runs-on: 4-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Run test
        timeout-minutes: 25
        run: |
          cd test/srt
          python3 run_suite.py --suite per-commit-4-gpu
```

Collaborator


Hey, what's the difference between unit-test-backend-4-gpu (line 92) and unittest-test-backend-4-gpu (line 110)? I think we should only keep the second one, which has more dedicated rules to reduce CI waste.

Collaborator


I mean, just delete it

Comment on lines 103 to 107
```python
        TestFile("test_verl_engine_2_gpu.py", 64),
    ],
    "per-commit-4-gpu": [
        TestFile("test_verl_engine.py", 64),
    ],
```
Collaborator


```python
    "per-commit-4-gpu": [
        TestFile("test_verl_engine.py", 64),
    ],
    "per-commit-2-gpu-amd": [
        TestFile("test_mla_tp.py", 170),
    ],
    "per-commit-4-gpu": [
        TestFile("test_local_attn.py", 250),
        TestFile("test_pp_single_node.py", 150),
        TestFile("test_verl_engine_4_gpu.py", 64),
    ],
```

Redundant: "per-commit-4-gpu" appears twice.
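The duplication matters because in a Python dict literal a repeated key silently overwrites the earlier entry. This is an illustrative sketch, not the actual run_suite.py; the test-file strings stand in for the `TestFile` entries above.

```python
# A repeated key in a dict literal is not an error in Python;
# the later value simply replaces the earlier one.
suites = {
    "per-commit-4-gpu": ["test_verl_engine.py"],
    "per-commit-2-gpu-amd": ["test_mla_tp.py"],
    "per-commit-4-gpu": [
        "test_local_attn.py",
        "test_pp_single_node.py",
        "test_verl_engine_4_gpu.py",
    ],
}
print(len(suites))  # 2 -- only the last "per-commit-4-gpu" entry survives
```

So the first "per-commit-4-gpu" list would never run, which is why the duplicate entry should be removed.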

Comment on lines 41 to 56
```python
CI_MODELS = [
    dict(model_path="meta-llama/Llama-3.1-8B-Instruct"),
    dict(
        model_path="Qwen/Qwen2.5-0.5B",
        dp_size=2,
        tp_size=2,  # default to 2
    ),
    # Fail to run gemma-2-2b after transformers==4.48.3 -> 4.50.0
    # dict(model_path="google/gemma-2-2b"),
]
ALL_OTHER_MODELS = [
    dict(model_path="meta-llama/Llama-3.2-1B-Instruct"),
    dict(model_path="Qwen/Qwen2-1.5B"),
    dict(
        model_path="Qwen/Qwen2.5-14B-Instruct",
        mem_fraction_static=0.4,
        tp_size=8,
        mem_fraction_static=0.7,
        dp_size=2,
        tp_size=2,
        tight_memory=True,
```
Collaborator


We do not need to save CI time for the 4-gpu test now. So we can make a bigger ALL_MODELS list, randomly choose one to test on CI, and test all of them locally.

Say, how about a Qwen 2.5 model and a Llama 3.1 model?
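The proposal above can be sketched as follows. This is a hypothetical illustration, not the actual test file: the `models_to_test` helper is an assumption, and the model entries mirror those discussed in this PR.

```python
import random

# Hypothetical sketch: keep one larger ALL_MODELS list, pick a single
# entry at random for each CI run, and run the full list locally.
ALL_MODELS = [
    dict(model_path="meta-llama/Llama-3.1-8B-Instruct"),
    dict(model_path="Qwen/Qwen2.5-0.5B", dp_size=2, tp_size=2),
]

def models_to_test(is_ci: bool):
    # One random model on CI; the full list for local runs.
    return [random.choice(ALL_MODELS)] if is_ci else ALL_MODELS
```

Randomizing the CI pick keeps per-run cost constant while still exercising every model over time.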

@zhaochenyang20
Collaborator

If we can have an EP test later, it would be optimal. But not for now 😂

@zhaochenyang20 zhaochenyang20 changed the title [CI] update verlengine ci to 8-gpu test [CI] update verlengine ci to 4-gpu test May 23, 2025
@zhyncs zhyncs self-assigned this May 26, 2025
@zhyncs zhyncs merged commit 2103b80 into sgl-project:main May 27, 2025
125 of 150 checks passed
@ocss884 ocss884 deleted the add-4-gpu-ci branch May 28, 2025 17:26
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025