Description
Existing Issues
- Many CI tests under `test/srt` are never executed by current workflows.
- Some contributors mistakenly added CI tests under `sglang.test`; tests in this module are not connected to any CI workflow and therefore never run.
- CI tests are tightly coupled to hardware backends (CUDA, AMD, XPU, NPU, etc.), reducing clarity and making it difficult to reuse unit tests across different backends.
- Switching a CI test between the nightly and per-commit pipelines is not easy; it currently requires physically moving the test file across directories.
- CI workflow pipelines and dependencies are messy and lack a proper fast-fail mechanism, leading to unnecessary CI resource usage.
- Timeout settings are not fine-grained and fail to catch most real timeout scenarios, causing severe waste of CI resources.
- Some performance tests are not managed by any suites and must be manually triggered inside CI workflow files, which limits scalability and reduces consistency.
- CI monitoring is tightly coupled to standalone summary steps and is not compatible with the `run_suite`-based triggering approach.
- There is no tracking or management of flaky tests.
- CPU-only tests are mixed with other platform tests, making it difficult to isolate test behavior and backend-specific failures.
Refactor Steps
Workflow Pipeline Prototype
CI suites structure
- `test/manual`: unofficially maintained CI tests, kept mainly as code references for agents (cursor, codex), not guaranteed to run in CI.
- `test/registered` (previously `per-commit` + `srt` + `nightly`): officially maintained by the SGLang community and guaranteed to be triggered in at least one CI workflow (or temporarily disabled with clear justification).
- All test files (registered or manual) should be organized by features rather than behaviors.
CI registry
Example for a test registered with AMD and CUDA (nightly only) workflows.
register_cuda_ci(est_time=80, suite="stage-a-test-1", nightly=True)
register_amd_ci(est_time=120, suite="stage-a-test-1")
Example for a test that is temporarily disabled.
register_tmp_disabled(reason="flaky...")
(or add a "tmp disabled" flag to the existing registry)
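One way such a registry could work is for the runner to read the `register_*` calls from each test file statically instead of importing it. The sketch below is only an illustration under stated assumptions, not SGLang's actual implementation: `Entry`, `collect`, `select`, and the `BACKENDS` mapping are hypothetical names, and it assumes that `nightly=True` means nightly-only and that a `register_tmp_disabled` call excludes the file from every workflow.

```python
import ast
from dataclasses import dataclass, replace
from typing import List, Optional

# Hypothetical mapping from registration function name to backend label.
BACKENDS = {"register_cuda_ci": "cuda", "register_amd_ci": "amd"}


@dataclass(frozen=True)
class Entry:
    filename: str
    hw: str
    suite: str
    est_time: int
    nightly: bool = False
    disabled_reason: Optional[str] = None


def collect(source: str, filename: str) -> List[Entry]:
    """Read top-level register_* calls from a test file without importing it."""
    entries: List[Entry] = []
    disabled: Optional[str] = None
    for node in ast.parse(source).body:
        if not (isinstance(node, ast.Expr) and isinstance(node.value, ast.Call)):
            continue
        call = node.value
        name = getattr(call.func, "id", None)
        if name != "register_tmp_disabled" and name not in BACKENDS:
            continue
        # Only literal keyword arguments are supported in this sketch.
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords if kw.arg}
        if name == "register_tmp_disabled":
            disabled = kwargs.get("reason", "unspecified")
        else:
            entries.append(
                Entry(
                    filename=filename,
                    hw=BACKENDS[name],
                    suite=kwargs["suite"],
                    est_time=kwargs["est_time"],
                    nightly=kwargs.get("nightly", False),
                )
            )
    # Assumption: a tmp-disabled marker applies to every registration in the file.
    return [replace(e, disabled_reason=disabled) for e in entries]


def select(entries: List[Entry], hw: str, suite: str, nightly: bool = False) -> List[Entry]:
    """Pick enabled entries for one backend, suite, and pipeline type."""
    return [
        e
        for e in entries
        if e.hw == hw
        and e.suite == suite
        and e.nightly == nightly
        and e.disabled_reason is None
    ]
```

With this shape, moving a test between nightly and per-commit is a one-argument change in the test file rather than a file move, and a runner can filter by backend, suite, and pipeline type from the same data.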
Example of triggering a CI through run_suite API:
run_suite.py --nightly --hw cuda --suite "stage-a-test-1"
Refactoring Steps
- Move tests into `test/registered` and `test/manual`, organizing them by feature.
- Deprecate most CI tests under `sglang.test` or move them into `test/manual`.
- Introduce a CI registry to manage backend selection, nightly/per-commit inclusion, and disable flags.
- Reorganize the CI workflow pipeline to follow the new unified structure.
- Introduce fine-grained timeout settings:
  - Per-file timeout
  - Per-unit-test timeout
  - Server boot timeout
- Make all performance tests compatible with `run_suite`.
- Refactor CI monitor summary/reporting to work with the `run_suite` API.
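As a rough illustration of the fine-grained timeouts and fast-fail behavior listed above, a runner could enforce a per-file timeout with `subprocess.run(timeout=...)` and a separate server boot timeout with a polling deadline. This is a minimal sketch under assumptions: `python_cmd`, `run_file`, `wait_until_ready`, and `run_files_fast_fail` are hypothetical helpers, the readiness check is left to the caller, and per-unit-test timeouts are not covered here.

```python
import subprocess
import sys
import time
from typing import Callable, Iterable, Sequence


def python_cmd(path: str) -> Sequence[str]:
    """Build the command that runs one test file with the current interpreter."""
    return [sys.executable, path]


def run_file(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run one test file as a subprocess; exceeding timeout_s counts as a failure."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False


def wait_until_ready(
    is_ready: Callable[[], bool], timeout_s: float, poll_s: float = 0.1
) -> bool:
    """Poll a caller-supplied readiness check until it passes or the deadline hits."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    return False


def run_files_fast_fail(cmds: Iterable[Sequence[str]], timeout_s: float) -> bool:
    """Run files in order and stop at the first failure or timeout."""
    for cmd in cmds:
        if not run_file(cmd, timeout_s):
            return False
    return True
```

Keeping the boot timeout separate from the per-file timeout lets a server that never comes up fail quickly instead of consuming the whole file budget.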