[router] configure workflow retries and timeout based on routerConfig#12252
[router] configure workflow retries and timeout based on routerConfig#12252
Conversation
Summary of ChangesHello @slin1237, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the router's worker registration workflow by integrating Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request makes the worker registration workflow's timeout and retry behavior configurable based on the routerConfig. This is a good improvement for flexibility. The changes in server.rs and test files to pass the configuration are correct. However, I've found a logical issue in the new calculation for max_attempts in worker_registration.rs. The current formula can result in a very high number of attempts that is not consistent with the configured timeout, potentially causing the workflow step to time out prematurely. I've left a detailed comment with an example and a suggestion for revision. I also pointed out a minor inaccuracy in a comment to improve clarity.
e3e09da to
0dd01cd
Compare
0dd01cd to
b863aed
Compare
|
Hello. I need the feature implemented in this commit of max_retries being configured via parameter. How long does it take to your changes be available in pip https://pypi.org/project/sglang-router/? Currently, my model takes more than 10 minutes to load, and thus, while the router is trying to discover it via Kubernetes, it reaches the max_retries = 100 and returns an error. Note: Sorry if here is not the right channel to ask; I wasn't able to find the pip frequency. I did look at the tags, but the one from pip is so behind that I am afraid I am looking in the right place. |
|
@slin1237 I'm confused by the logic of this PR — I believe it's basically capping the timeout for worker registration to I hit this issue because the router started running and registering workers before a decode worker was ready and only waited ~30s for the decode worker to respond before giving up on it. The only way I can see to resolve that is to increase health_check_timeout to a very large value (which then means normal health checks also have a really long request-level timeout), or manually check worker-readiness on my end before starting the router. |
Thanks for reporting this. |
|
Checklist