
[CI] update verlengine ci to 4-gpu test#6007

Merged
zhyncs merged 18 commits into sgl-project:main from ocss884:add-4-gpu-ci
May 27, 2025

Conversation

@ocss884
Collaborator

@ocss884 ocss884 commented May 4, 2025

Motivation

Relevant to #5997. Update the VerlEngine test to a more comprehensive one, which uses 8 GPUs (dp=4, tp=2). @merrymercy @zhaochenyang20

Modifications

Checklist

@ocss884 ocss884 changed the title add 4-gpu test & update verlengine ci [CI] add 4-gpu test & update verlengine ci May 4, 2025
@ocss884 ocss884 changed the title [CI] add 4-gpu test & update verlengine ci [CI] update verlengine ci to 4-gpu test May 5, 2025
Collaborator

@zhaochenyang20 zhaochenyang20 left a comment


I think only a small part of test_verl_engine.py should use 8 GPUs. We should split it to save 8-GPU CI time.
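The split suggested above can be sketched as a GPU-count gate. This is a hypothetical illustration, not the actual test file: the `available_gpus` helper and `TestVerlEngine8GPU` class are assumptions for demonstration.

```python
import unittest

def available_gpus():
    # Report the number of visible CUDA devices; 0 when torch is absent.
    try:
        import torch
        return torch.cuda.device_count()
    except ImportError:
        return 0

# Hypothetical split: tests that genuinely need 8 GPUs live in their own
# class and are skipped on smaller runners, so the common path stays cheap.
@unittest.skipUnless(available_gpus() >= 8, "requires 8 GPUs")
class TestVerlEngine8GPU(unittest.TestCase):
    def test_placeholder(self):
        pass
```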

@ocss884 ocss884 changed the title [CI] update verlengine ci to 4-gpu test [CI] update verlengine ci to 8-gpu test May 5, 2025
@ocss884
Collaborator Author

ocss884 commented May 5, 2025

I split the tests by GPU requirement: <=2 GPUs or >2 GPUs. Random model choice is added to the 8-GPU suite (although there is only 1 model for the 8-GPU CI now). The 2-GPU suite will not have any model for CI testing, only for local usage. @zhaochenyang20

Collaborator


We can remove gpt2; nobody cares about it.

Collaborator


I mean, just delete it

@zhaochenyang20
Collaborator

Just rebased; after the CI passes, we can merge it.

@zhaochenyang20
Collaborator

Now there is already a 4-gpu CI, which can be used directly. Note that you need to modify the current test_verl_engine_8_gpu.py: one change is the name, and the other is to check the content. There should be no unit tests that actually use 8 GPUs.

Comment on lines 92 to 109
```yaml
  unit-test-backend-4-gpu:
    if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
      github.event.pull_request.draft == false
    runs-on: 4-gpu-runner
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          bash scripts/ci_install_dependency.sh

      - name: Run test
        timeout-minutes: 25
        run: |
          cd test/srt
          python3 run_suite.py --suite per-commit-4-gpu
```

Collaborator


Hey, what's the difference between unit-test-backend-4-gpu (line 92) and unittest-test-backend-4-gpu (line 110)? I think we should only keep the second one, which has more dedicated rules to reduce CI waste.

Collaborator


I mean, just delete it

Comment on lines 103 to 107
```python
        TestFile("test_verl_engine_2_gpu.py", 64),
    ],
    "per-commit-4-gpu": [
        TestFile("test_verl_engine.py", 64),
    ],
```
Collaborator


```python
    "per-commit-4-gpu": [
        TestFile("test_verl_engine.py", 64),
    ],
    "per-commit-2-gpu-amd": [
        TestFile("test_mla_tp.py", 170),
    ],
    "per-commit-4-gpu": [
        TestFile("test_local_attn.py", 250),
        TestFile("test_pp_single_node.py", 150),
        TestFile("test_verl_engine_4_gpu.py", 64),
    ],
```

Redundant: "per-commit-4-gpu" appears twice.
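The duplication matters because in a Python dict literal a repeated key silently overwrites the earlier entry. This is an illustrative sketch, not the actual run_suite.py; the test-file strings stand in for the `TestFile` entries above.

```python
# A repeated key in a dict literal is not an error in Python;
# the later value simply replaces the earlier one.
suites = {
    "per-commit-4-gpu": ["test_verl_engine.py"],
    "per-commit-2-gpu-amd": ["test_mla_tp.py"],
    "per-commit-4-gpu": [
        "test_local_attn.py",
        "test_pp_single_node.py",
        "test_verl_engine_4_gpu.py",
    ],
}
print(len(suites))  # 2 -- only the last "per-commit-4-gpu" entry survives
```

So the first "per-commit-4-gpu" list would never run, which is why the duplicate entry should be removed.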

Comment on lines 41 to 56
```python
CI_MODELS = [
    dict(model_path="meta-llama/Llama-3.1-8B-Instruct"),
    dict(
        model_path="Qwen/Qwen2.5-0.5B",
        dp_size=2,
        tp_size=2,  # default to 2
    ),
    # Fail to run gemma-2-2b after transformers==4.48.3 -> 4.50.0
    # dict(model_path="google/gemma-2-2b"),
]
ALL_OTHER_MODELS = [
    dict(model_path="meta-llama/Llama-3.2-1B-Instruct"),
    dict(model_path="Qwen/Qwen2-1.5B"),
    dict(
        model_path="Qwen/Qwen2.5-14B-Instruct",
        mem_fraction_static=0.4,
        tp_size=8,
        mem_fraction_static=0.7,
        dp_size=2,
        tp_size=2,
        tight_memory=True,
```
Collaborator


We do not need to save CI time for the 4-gpu test now. So we can make a bigger ALL_MODELS list, randomly choose one to test on CI, and test all of them locally.

Say, how about a Qwen 2.5 model and a Llama 3.1 model?
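The proposal above can be sketched as follows. This is a hypothetical illustration, not the actual test file: the `models_to_test` helper is an assumption, and the model entries mirror those discussed in this PR.

```python
import random

# Hypothetical sketch: keep one larger ALL_MODELS list, pick a single
# entry at random for each CI run, and run the full list locally.
ALL_MODELS = [
    dict(model_path="meta-llama/Llama-3.1-8B-Instruct"),
    dict(model_path="Qwen/Qwen2.5-0.5B", dp_size=2, tp_size=2),
]

def models_to_test(is_ci: bool):
    # One random model on CI; the full list for local runs.
    return [random.choice(ALL_MODELS)] if is_ci else ALL_MODELS
```

Randomizing the CI pick keeps per-run cost constant while still exercising every model over time.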

@zhaochenyang20
Collaborator

If we can have an EP test later, it would be optimal. But not for now 😂

@zhaochenyang20 zhaochenyang20 changed the title [CI] update verlengine ci to 8-gpu test [CI] update verlengine ci to 4-gpu test May 23, 2025
@zhyncs zhyncs self-assigned this May 26, 2025
@zhyncs zhyncs merged commit 2103b80 into sgl-project:main May 27, 2025
125 of 150 checks passed
@ocss884 ocss884 deleted the add-4-gpu-ci branch May 28, 2025 17:26
Layssy pushed a commit to Layssy/sglang-iaas that referenced this pull request Jun 9, 2025
xwu-intel pushed a commit to xwu-intel/sglang that referenced this pull request Jun 17, 2025