@alisonshao
Collaborator

Summary

Fixes the unit-test-backend-4-gpu timeout issue introduced by #14222.

PR #14222 moved test_piecewise_cuda_graph.py (estimated 1200s) to the per-commit-4-gpu suite. With only 2 partitions, the LPT (longest processing time first) partition algorithm then produced an unbalanced distribution:

  • Partition 0: 1264s (2 tests) ✓
  • Partition 1: 1483s (4 tests) ✗ exceeds 20min timeout

This PR increases the number of partitions from 2 to 3:

  • Partition 0: ~1200s (1 test) - test_piecewise_cuda_graph.py
  • Partition 1: ~772s (2 tests) - test_pp_single_node.py, test_qwen3_next_models.py
  • Partition 2: ~775s (3 tests) - test_local_attn.py, test_gpt_oss_4gpu.py, test_multi_instance_release_memory_occupation.py

All partitions now fit within the 20-minute timeout.
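
To illustrate why two partitions overflow and three do not, here is a minimal sketch of a generic LPT (longest processing time first) greedy partitioner. It is not the suite's actual implementation. The function name `lpt_partition` is hypothetical. Only the 1200s estimate comes from this PR, and the 64s value is inferred from the 1264s total of the two-test partition. The other four per-test durations are hypothetical, picked only so that the totals match the figures reported above.

```python
import heapq

TIMEOUT_S = 20 * 60  # 20-minute job timeout mentioned in this PR

# Per-test estimates in seconds.
# 1200 is stated in the PR; 64 is inferred (1264 - 1200).
# The remaining four values are HYPOTHETICAL and only reproduce the reported totals.
ESTIMATED_S = {
    "test_piecewise_cuda_graph.py": 1200,
    "test_pp_single_node.py": 472,         # hypothetical
    "test_local_attn.py": 400,             # hypothetical
    "test_gpt_oss_4gpu.py": 311,           # hypothetical
    "test_qwen3_next_models.py": 300,      # hypothetical
    "test_multi_instance_release_memory_occupation.py": 64,  # inferred
}


def lpt_partition(durations, num_partitions):
    """Greedy LPT: take tests longest-first, always filling the lightest partition."""
    # Heap of (current_load, partition_index); ties go to the lowest index.
    heap = [(0, i) for i in range(num_partitions)]
    buckets = [[] for _ in range(num_partitions)]
    ordered = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, secs in ordered:
        load, idx = heapq.heappop(heap)
        buckets[idx].append(name)
        heapq.heappush(heap, (load + secs, idx))
    totals = [sum(durations[n] for n in bucket) for bucket in buckets]
    return list(zip(totals, buckets))


for k in (2, 3):
    print(f"{k} partitions:")
    for i, (total, tests) in enumerate(lpt_partition(ESTIMATED_S, k)):
        status = "ok" if total <= TIMEOUT_S else "exceeds timeout"
        print(f"  Partition {i}: {total}s ({len(tests)} tests) {status}")
```

With two partitions, the single 1200s test occupies one partition almost entirely, so the greedy step keeps placing the mid-sized tests on the other partition until it passes the timeout. A third partition gives those tests a second place to go.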

Example failure

https://github.com/sgl-project/sglang/actions/runs/19845982270/job/56878342316?pr=14253

args=Namespace(timeout_per_file=1200, suite='per-commit-4-gpu', auto_partition_id=1, auto_partition_size=2, continue_on_error=False)
The running tests are  ['test_pp_single_node.py', 'test_local_attn.py', 'test_gpt_oss_4gpu.py', 'models/test_qwen3_next_models.py']
...
Error: The action 'Run test' has timed out after 20 minutes.

Test plan

  • CI passes with the new 3-partition configuration

After PR #14222 added test_piecewise_cuda_graph.py (1200s) to the
per-commit-4-gpu suite, the LPT partition algorithm created an
unbalanced distribution:

- Partition 0: 1264s (2 tests)
- Partition 1: 1483s (4 tests) - exceeds 20min timeout

This change increases the number of partitions from 2 to 3, resulting
in a more balanced distribution:

- Partition 0: ~1200s (1 test)
- Partition 1: ~772s (2 tests)
- Partition 2: ~775s (3 tests)

All partitions now fit within the 20-minute timeout.
@alisonshao
Collaborator Author

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Dec 2, 2025
@sgl-project sgl-project deleted a comment from gemini-code-assist bot Dec 2, 2025
@ispobock ispobock merged commit e0ec42c into main Dec 2, 2025
52 of 76 checks passed
@ispobock ispobock deleted the fix/4gpu-ci-partition-timeout branch December 2, 2025 08:47
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
3 participants