Refs/heads/add nightly test multi gpu configs #12870

alisonshao · 2025-11-08T06:25:13Z

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.

- Add nightly-4-gpu, nightly-8-gpu-h200, and nightly-8-gpu-h20 suites to run_suite.py
- Update workflow to use nightly-* suite names instead of per-commit-*
- per-commit suites are for PR testing; nightly suites are for scheduled nightly tests

- Add --continue-on-error flag to run_suite.py
- Update test_utils.py to use flag instead of assert/break
- Apply flag to all nightly test jobs in workflow
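A minimal sketch of how the new suite names and the flag might be wired into a runner script. Only the suite names and the `--continue-on-error` flag come from this PR; the `suites` dictionary layout, the empty placeholder lists, the `--suite` option, and the `build_parser` helper are assumptions made for illustration and may not match run_suite.py.

```python
import argparse

# Hypothetical suite registry: the names come from the PR description,
# the empty lists are placeholders for test files added later.
suites = {
    "nightly-4-gpu": [],
    "nightly-8-gpu-h200": [],
    "nightly-8-gpu-h20": [],
}


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with a suite selector and the continue-on-error switch."""
    parser = argparse.ArgumentParser(description="Run a named test suite")
    parser.add_argument("--suite", choices=sorted(suites), required=True)
    parser.add_argument(
        "--continue-on-error",
        action="store_true",  # defaults to False, so PR runs still stop early
        help="Keep running the remaining tests after a failure (nightly runs)",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--suite", "nightly-4-gpu", "--continue-on-error"])
```

Because the flag uses `store_true`, omitting it keeps the stop-at-first-failure behavior, which fits the per-commit suites described above.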

gemini-code-assist · 2025-11-08T06:25:28Z

Summary of Changes

Hello @alisonshao, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant improvements to the test execution framework, primarily by enabling more robust and informative nightly test runs. It allows tests to continue running despite individual failures, which is crucial for gathering complete test coverage data. Additionally, it enhances test reporting with detailed summaries and lays the groundwork for expanding multi-GPU testing capabilities.

Highlights

- Enhanced Test Execution: The run_unittest_files function now supports a continue_on_error option, allowing test suites (particularly nightly runs) to proceed with subsequent tests even if a test fails or times out, rather than stopping immediately.
- Detailed Test Summaries: The test runner now provides a comprehensive summary at the end of execution, clearly listing all passed and failed tests along with the specific reasons for any failures (e.g., exit code or timeout).
- New Nightly Test Configurations: Placeholder configurations have been added for new multi-GPU nightly test suites, including nightly-4-gpu, nightly-8-gpu-h200, and nightly-8-gpu-h20, preparing the infrastructure for future test additions.
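The end-of-run summary described in the highlights could look roughly like the sketch below. The `format_summary` helper and its exact output layout are hypothetical; the PR only states that passed and failed tests are listed together with a failure reason such as an exit code or a timeout.

```python
from typing import List, Tuple


def format_summary(passed: List[str], failed: List[Tuple[str, str]]) -> str:
    """Return a readable summary of passed and failed test files."""
    lines = [f"Test summary: {len(passed)} passed, {len(failed)} failed"]
    for name in passed:
        lines.append(f"  ✓ PASSED: {name}")
    # Each failure is a (filename, reason) pair, mirroring failed_tests in the diff.
    for name, reason in failed:
        lines.append(f"  ✗ FAILED: {name} ({reason})")
    return "\n".join(lines)


summary = format_summary(
    ["test_a.py"],
    [("test_b.py", "exit code 1"), ("test_c.py", "timeout after 1200s")],
)
print(summary)
```

Returning a string rather than printing directly keeps the formatting easy to check in a unit test.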

Ignored Files

Ignored by pattern: .github/workflows/** (1)
- .github/workflows/nightly-test.yml

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request enhances the test runner by introducing a continue_on_error flag, which is particularly useful for nightly tests to gather all failures instead of stopping at the first one. The implementation is solid, and the addition of a detailed test summary at the end significantly improves the clarity of test results. I have one suggestion to refactor the failure handling logic to reduce code duplication, which should improve maintainability.

gemini-code-assist · 2025-11-08T06:27:04Z

python/sglang/test/test_utils.py

            ret_code = run_with_timeout(
                run_one_file, args=(filename,), timeout=timeout_per_file
            )
-            assert (
-                ret_code == 0
-            ), f"expected return code 0, but {filename} returned {ret_code}"
+            if ret_code != 0:
+                print(
+                    f"\n✗ FAILED: {filename} returned exit code {ret_code}\n",
+                    flush=True,
+                )
+                success = False
+                failed_tests.append((filename, f"exit code {ret_code}"))
+                if not continue_on_error:
+                    # Stop at first failure for PR tests
+                    break
+                # Otherwise continue to next test for nightly tests
+            else:
+                passed_tests.append(filename)
        except TimeoutError:
            kill_process_tree(process.pid)
            time.sleep(5)
            print(
-                f"\nTimeout after {timeout_per_file} seconds when running {filename}\n",
+                f"\n✗ TIMEOUT: {filename} after {timeout_per_file} seconds\n",
                flush=True,
            )
            success = False
-            break
+            failed_tests.append((filename, f"timeout after {timeout_per_file}s"))
+            if not continue_on_error:
+                # Stop at first timeout for PR tests
+                break
+            # Otherwise continue to next test for nightly tests


The failure handling logic for non-zero exit codes and timeouts is very similar and contains duplicated code. You can refactor this to reduce duplication and improve readability by handling the failure condition after the try...except block.

        failure_reason = None
        try:
            ret_code = run_with_timeout(
                run_one_file, args=(filename,), timeout=timeout_per_file
            )
            if ret_code == 0:
                passed_tests.append(filename)
            else:
                print(
                    f"\n✗ FAILED: {filename} returned exit code {ret_code}\n",
                    flush=True,
                )
                failure_reason = f"exit code {ret_code}"
        except TimeoutError:
            kill_process_tree(process.pid)
            time.sleep(5)
            print(
                f"\n✗ TIMEOUT: {filename} after {timeout_per_file} seconds\n",
                flush=True,
            )
            failure_reason = f"timeout after {timeout_per_file}s"

        if failure_reason:
            success = False
            failed_tests.append((filename, failure_reason))
            if not continue_on_error:
                break
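The suggested single failure path can be tried in isolation with a stand-in runner. This is a sketch under stated assumptions, not the code in test_utils.py: `run_files` and `fake_runner` are hypothetical names, and process handling (`run_with_timeout`, `kill_process_tree`) is replaced by a plain callable that returns an exit code or raises `TimeoutError`.

```python
from typing import Callable, List, Tuple


def run_files(
    files: List[str],
    run_one: Callable[[str], int],
    continue_on_error: bool = False,
) -> Tuple[bool, List[str], List[Tuple[str, str]]]:
    """Run each file, recording every failure through one shared code path."""
    success = True
    passed: List[str] = []
    failed: List[Tuple[str, str]] = []
    for filename in files:
        failure_reason = None
        try:
            ret_code = run_one(filename)
            if ret_code != 0:
                failure_reason = f"exit code {ret_code}"
        except TimeoutError:
            failure_reason = "timeout"
        if failure_reason is None:
            passed.append(filename)
            continue
        # Shared handling for both non-zero exit codes and timeouts.
        success = False
        failed.append((filename, failure_reason))
        if not continue_on_error:
            break  # PR-style run: stop at the first failure
    return success, passed, failed


def fake_runner(filename: str) -> int:
    """Hypothetical stand-in: 'slow.py' times out, 'bad.py' exits with 1."""
    if filename == "slow.py":
        raise TimeoutError
    return 1 if filename == "bad.py" else 0
```

With `continue_on_error=True` the loop visits every file and collects both failures; with the default it stops at `bad.py`, which matches the PR-versus-nightly distinction described in the PR.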

Fridge003 · 2025-11-08T06:44:30Z

Please fix lint with `pre-commit run --all-files`

.github/workflows/nightly-test.yml

alisonshao and others added 12 commits November 3, 2025 17:14

add multi GPU config to nightly-test.yml

16d1d47

add kernel build step to nightly tests

29afa8b

revert kernel build step - not needed for nightly tests

ce9713d

add new nightly test jobs to CI monitor

0222a21

increase timeout for nightly-test-1-gpu to 60 minutes

8fca818

Add continue-on-error to nightly test jobs

b875b8f

Merge branch 'main' into add-nightly-test-multi-gpu-configs

52bbb27

Fix test suite names to match available suites in run_suite.py

f287d92

Implement continue-on-error for nightly tests

d94ee9b

- Add --continue-on-error flag to run_suite.py
- Update test_utils.py to use flag instead of assert/break
- Apply flag to all nightly test jobs in workflow

Add test summary at end showing passed/failed tests

2cb570a

Add retry logic for VLM eval test to handle flaky CUDA graph capture

4cd2949

alisonshao requested review from Fridge003, ispobock and merrymercy as code owners November 8, 2025 06:25

sglang-bot added the run-ci label Nov 8, 2025

gemini-code-assist bot reviewed Nov 8, 2025

Merge branch 'main' into add-nightly-test-multi-gpu-configs

a8e75b7

Format run_unittest_files function signature

8445d13

Fridge003 reviewed Nov 8, 2025

.github/workflows/nightly-test.yml (outdated, resolved)

Update .github/workflows/nightly-test.yml

bdd2df1

Fridge003 approved these changes Nov 8, 2025

Fridge003 merged commit d3a03ae into main Nov 8, 2025
112 of 121 checks passed

Fridge003 deleted the add-nightly-test-multi-gpu-configs branch November 8, 2025 23:14

4 participants
