[Roadmap] CI suites organization #13808

@hnyls2002

Existing Issues

  • Many CI tests under test/srt are never executed by current workflows.
  • Some contributors mistakenly added CI tests under sglang.test; tests in this module are not connected to any CI workflow and therefore never run.
  • CI tests are tightly coupled to hardware backends (CUDA, AMD, XPU, NPU, etc.), reducing clarity and making it difficult to reuse unit tests across different backends.
  • A CI test cannot easily be switched between the nightly and per-commit pipelines; doing so currently requires moving the test file to a different directory.
  • CI workflow pipelines and dependencies are messy and lack a proper fast-fail mechanism, leading to unnecessary CI resource usage.
  • Timeout settings are not fine-grained and fail to catch most real timeout scenarios, causing severe waste of CI resources.
  • Some performance tests are not managed by any suites and must be manually triggered inside CI workflow files, which limits scalability and reduces consistency.
  • CI monitoring is tightly coupled to standalone summary steps and is not compatible with the run_suite-based triggering approach.
  • There is no tracking or management of flaky tests.
  • CPU-only tests are mixed with other platform tests, making it difficult to isolate test behavior and backend-specific failures.

Refactor Steps

Workflow Pipeline Prototype

[Image: workflow pipeline prototype]

CI suites structure

  • test/manual: unofficially maintained CI tests, kept mainly as code references for coding agents (Cursor, Codex); they are not guaranteed to run in CI.
  • test/registered (previously per-commit + srt + nightly): officially maintained by the SGLang community and guaranteed to be triggered in at least one CI workflow (or temporarily disabled with clear justification).
  • All test files (registered or manual) should be organized by features rather than behaviors.

CI registry

Example of a test registered with both the CUDA (nightly-only) and AMD workflows:

register_cuda_ci(est_time=80, suite="stage-a-test-1", nightly=True)
register_amd_ci(est_time=120, suite="stage-a-test-1")

Example of a test that is temporarily disabled:

register_tmp_disabled(reason="flaky...")

(Alternatively, add a "tmp disabled" flag to the existing registry entries.)

Example of triggering a CI run through the run_suite API:

run_suite.py --nightly --hw cuda --suite "stage-a-test-1"
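The selection behind a command like this could look roughly as follows. This is a sketch under stated assumptions, not the real `run_suite.py`: the `REGISTRY` contents, the `parse_args` and `select_files` helpers, and the dictionary layout are hypothetical. It also assumes that `--nightly` selects only nightly entries and that omitting it selects only per-commit entries, which the issue does not specify.

```python
import argparse
from typing import Dict, List, Optional, Sequence

# Hypothetical registry contents; file names are made up.
REGISTRY: List[Dict] = [
    {"file": "test_a.py", "hw": "cuda", "suite": "stage-a-test-1",
     "nightly": True, "disabled": None},
    {"file": "test_a.py", "hw": "amd", "suite": "stage-a-test-1",
     "nightly": False, "disabled": None},
    {"file": "test_b.py", "hw": "cuda", "suite": "stage-a-test-1",
     "nightly": True, "disabled": "placeholder reason"},
]


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    # Flags mirror the command line shown above.
    parser = argparse.ArgumentParser(description="Select CI tests to run.")
    parser.add_argument("--hw", required=True, help="backend, e.g. cuda")
    parser.add_argument("--suite", required=True, help="suite name")
    parser.add_argument("--nightly", action="store_true",
                        help="select nightly entries instead of per-commit")
    return parser.parse_args(argv)


def select_files(registry: List[Dict], hw: str, suite: str,
                 nightly: bool) -> List[str]:
    # Keep entries that match the backend, suite, and pipeline,
    # and skip anything that is temporarily disabled.
    return [
        entry["file"]
        for entry in registry
        if entry["hw"] == hw
        and entry["suite"] == suite
        and entry["nightly"] == nightly
        and entry["disabled"] is None
    ]


args = parse_args(["--nightly", "--hw", "cuda", "--suite", "stage-a-test-1"])
selected = select_files(REGISTRY, args.hw, args.suite, args.nightly)
print(selected)
```

With this layout, moving a test between nightly and per-commit is a one-flag change in its registration rather than a file move.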

Refactoring Steps

  • Move tests into test/registered and test/manual, organizing them by feature.
  • Deprecate most CI tests under sglang.test or move them into test/manual.
  • Introduce a CI registry to manage backend selection, nightly/per-commit inclusion, and disable flags.
  • Reorganize the CI workflow pipeline to follow the new unified structure.
  • Introduce fine-grained timeout settings:
    • Per-file timeout
    • Per-unit-test timeout
    • Server boot timeout
  • Make all performance tests compatible with run_suite.
  • Refactor CI monitor summary/reporting to work with the run_suite API.
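As an illustration of the fine-grained timeouts listed above, the sketch below shows one way to enforce a per-file timeout and a server boot timeout using only the Python standard library. The function names `run_test_file` and `wait_for_server`, and the readiness callback, are hypothetical; per-unit-test timeouts are not covered here and would need support inside the test runner itself.

```python
import subprocess
import time
from typing import Callable, List


def run_test_file(cmd: List[str], timeout_s: float) -> str:
    """Run one test file as a subprocess; return 'passed', 'failed', or 'timeout'."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it does not finish within timeout_s seconds.
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "passed" if proc.returncode == 0 else "failed"


def wait_for_server(is_ready: Callable[[], bool], timeout_s: float,
                    poll_s: float = 0.1) -> bool:
    """Poll a readiness check until it succeeds or the boot timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    return False
```

Separating the boot timeout from the per-file timeout lets a stuck server launch fail quickly without consuming the whole budget of the test file.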
