-
Notifications
You must be signed in to change notification settings - Fork 3.6k
[router] Fix flaky router e2e tests #13306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[router] Fix flaky router e2e tests #13306
Conversation
Summary of ChangesHello @XinyueZhang369, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the stability of router end-to-end tests by making them more resilient to transient conditions during worker initialization. The changes prevent tests from failing due to temporary service unavailability (HTTP 503) and ensure that test failures accurately reflect underlying issues, improving the reliability of the test suite. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses flaky e2e tests by introducing a retry mechanism to wait for workers to become fully available. The changes in test_circuit_breaker.py and test_retries.py replace a single request with a loop that tolerates initial 503 Service Unavailable responses before the expected 500 Internal Server Error is returned.
My review focuses on improving the new test logic for better maintainability and debuggability. I've suggested adding descriptive messages to assertions to make test failures more informative and pointed out code duplication that could be refactored into a shared helper function. These changes will make the test suite more robust and easier to maintain.
| r.status_code == 503 | ||
| ), "Should only see 503 when waiting for worker to start" | ||
|
|
||
| assert saw_500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The final assertion assert saw_500 lacks a descriptive message, making test failures harder to debug. Please add a message, similar to the assertion on line 46.
Additionally, this retry logic is duplicated across multiple tests. Consider extracting it into a shared helper function to improve maintainability. The magic number 8 could also be defined as a constant.
| assert saw_500 | |
| assert saw_500, "Worker did not return status 500 after retries" |
| r.status_code == 503 | ||
| ), "Should only see 503 when waiting for worker to start" | ||
|
|
||
| assert saw_500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assertion would be more informative with a failure message to aid in debugging. As noted in the review for test_circuit_breaker.py, this retry logic is also a candidate for refactoring into a shared helper function to reduce code duplication.
| assert saw_500 | |
| assert saw_500, "Worker did not return status 500 after retries" |
Motivation
Fix flaky tests: test_circuit_breaker_disable_flag and test_disable_retries_surfaces_failure
Modifications
Change tests to tolerate the 503 when worker is not fully started
Run tests multiple times without failure


Accuracy Tests
Benchmarking and Profiling
Checklist