[CI] update verlengine ci to 4-gpu test #6007
Conversation
zhaochenyang20
left a comment
I think only a small part of test_verl_engine.py should use 8 GPUs. We should split it to save 8-GPU time.

I split the tests by their GPU requirements.
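Splitting by GPU requirement usually means each test file maps to the suite whose runner has enough GPUs. A minimal sketch of that routing; the helper `suite_for` and the `REQUIRED_GPUS` table are hypothetical names for illustration, not code from the repo:

```python
# Hypothetical mapping from test file to its GPU requirement.
REQUIRED_GPUS = {
    "test_verl_engine_2_gpu.py": 2,
    "test_verl_engine_4_gpu.py": 4,
}


def suite_for(test_file: str) -> str:
    """Route a test file to the CI suite matching its GPU need."""
    n = REQUIRED_GPUS.get(test_file, 1)
    # Multi-GPU tests go to a dedicated per-commit-N-gpu suite;
    # everything else runs in the default per-commit suite.
    return f"per-commit-{n}-gpu" if n > 1 else "per-commit"
```

With a table like this, adding a new multi-GPU test is a one-line change, and single-GPU tests never occupy the scarce multi-GPU runners.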
test/srt/test_verl_engine_2_gpu.py
Outdated
We can remove gpt2; nobody cares about it.
I mean, just delete it
Just rebased; after the CI passes, we can merge it.

Now there is already a 4-gpu CI, which can be used directly. Note that you need to modify the current test_verl_engine_8_gpu.py: one thing is to change the name, and the other is to check the content. There should be no unit tests that actually use 8 GPUs.
.github/workflows/pr-test.yml
Outdated
unit-test-backend-4-gpu:
  if: (github.repository == 'sgl-project/sglang' || github.event_name == 'pull_request') &&
    github.event.pull_request.draft == false
  runs-on: 4-gpu-runner
  steps:
    - name: Checkout code
      uses: actions/checkout@v4

    - name: Install dependencies
      run: |
        bash scripts/ci_install_dependency.sh

    - name: Run test
      timeout-minutes: 25
      run: |
        cd test/srt
        python3 run_suite.py --suite per-commit-4-gpu
Hey, what's the difference between unit-test-backend-4-gpu (line 92) and unittest-test-backend-4-gpu (line 110)? I think we should only keep the second one, which has more dedicated rules to reduce CI waste.
test/srt/test_verl_engine_2_gpu.py
Outdated
I mean, just delete it
test/srt/run_suite.py
Outdated
    TestFile("test_verl_engine_2_gpu.py", 64),
],
"per-commit-4-gpu": [
    TestFile("test_verl_engine.py", 64),
],
"per-commit-4-gpu": [
    TestFile("test_verl_engine.py", 64),
],
"per-commit-2-gpu-amd": [
    TestFile("test_mla_tp.py", 170),
],
"per-commit-4-gpu": [
    TestFile("test_local_attn.py", 250),
    TestFile("test_pp_single_node.py", 150),
    TestFile("test_verl_engine_4_gpu.py", 64),
],

The first "per-commit-4-gpu" block is redundant.
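The "redundant" note likely stems from how Python handles duplicate keys in a dict literal: the later `"per-commit-4-gpu"` entry silently overwrites the earlier one, so the first block is dead weight. A minimal demonstration (test file names taken from the quoted diff; the simplified value lists are illustrative):

```python
# Duplicate keys in a dict literal do not raise an error in Python;
# the last occurrence silently wins.
suites = {
    "per-commit-4-gpu": ["test_verl_engine.py"],
    "per-commit-2-gpu-amd": ["test_mla_tp.py"],
    "per-commit-4-gpu": [
        "test_local_attn.py",
        "test_pp_single_node.py",
        "test_verl_engine_4_gpu.py",
    ],
}

# Only the second "per-commit-4-gpu" list survives, and the dict
# ends up with two keys, not three.
print(suites["per-commit-4-gpu"])
```

This is why keeping both blocks would be misleading: the first one never takes effect, so deleting it changes nothing about which tests run.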
test/srt/test_verl_engine_4_gpu.py
Outdated
CI_MODELS = [
    dict(model_path="meta-llama/Llama-3.1-8B-Instruct"),
    dict(
        model_path="Qwen/Qwen2.5-0.5B",
        dp_size=2,
        tp_size=2,  # default to 2
    ),
    # Fail to run gemma-2-2b after transformers==4.48.3 -> 4.50.0
    # dict(model_path="google/gemma-2-2b"),
]
ALL_OTHER_MODELS = [
    dict(model_path="meta-llama/Llama-3.2-1B-Instruct"),
    dict(model_path="Qwen/Qwen2-1.5B"),
    dict(
        model_path="Qwen/Qwen2.5-14B-Instruct",
        mem_fraction_static=0.4,
        tp_size=8,
        mem_fraction_static=0.7,
        dp_size=2,
        tp_size=2,
        tight_memory=True,
We do not need to save CI time for the 4-gpu test now, so we can make a bigger ALL_MODELS list, randomly choose one model to test on CI, and test all of them locally.
Say, how about the Qwen 2.5 and Llama 3.1 models?
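The "randomly choose one on CI, test all locally" idea can be sketched in a few lines. The `models_to_test` helper and the use of the `CI` environment variable (which GitHub Actions sets to `"true"`) are assumptions for illustration, not code from the PR:

```python
import os
import random

# One combined list; entries mirror the dict style used in the test file.
ALL_MODELS = [
    dict(model_path="meta-llama/Llama-3.1-8B-Instruct"),
    dict(model_path="Qwen/Qwen2.5-0.5B", dp_size=2, tp_size=2),
]


def models_to_test():
    """Pick one random model on CI; run the full sweep locally."""
    if os.environ.get("CI") == "true":
        # GitHub Actions exports CI=true, so each CI run covers one
        # randomly chosen model while local runs cover everything.
        return [random.choice(ALL_MODELS)]
    return ALL_MODELS
```

Over many CI runs this still exercises every model probabilistically, while keeping each individual run short.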
If we can have an EP test later, that would be optimal. But not for now 😂
Motivation
Relevant to #5997. Update the VerlEngine test to a more comprehensive one, which uses 8 GPUs (dp=4, tp=2). @merrymercy @zhaochenyang20

Modifications
Checklist