[Docs] Add RL documentation #17663
Conversation
Summary of Changes

Hello @zijiexia, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the project's documentation by introducing a dedicated guide for integrating SGLang into Reinforcement Learning (RL) and post-training workflows. The new content aims to provide infrastructure teams with practical insights and API references to optimize iteration latency, ensure correctness, and align rollout and training behaviors in production environments, thereby solidifying SGLang's position as a robust backend for advanced AI systems.
Code Review
This pull request adds comprehensive documentation on using SGLang for Reinforcement Learning systems. The new document is well-structured and covers key features like memory management, weight updates, and deterministic inference, which will be very helpful for users. I've made a couple of minor suggestions to fix a formatting issue and a grammatical error in the new documentation file to improve its clarity.
> After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. Here's the detail of how to integrate SGLang weight update in verl's co-located strategy: [RL System Deep Thinking: Weight Update Mechanisms
> ](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md)
There's a newline character within the link's display text, which can cause rendering issues. Please remove it to ensure the link is displayed correctly on a single line.
Suggested change:

> After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. Here's the detail of how to integrate SGLang weight update in verl's co-located strategy: [RL System Deep Thinking: Weight Update Mechanisms](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md)
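The refit strategies above map to update-weights endpoints on the rollout server. As an illustrative sketch only (the endpoint name `update_weights_from_disk`, its `model_path` field, and the server address are assumptions to verify against your SGLang version's API reference), a co-located disk-based refit could look like:

```python
import json
import urllib.request

# Assumed server address and endpoint name -- verify against your
# SGLang version's HTTP API reference before relying on this sketch.
SGLANG_URL = "http://localhost:30000"

def build_refit_payload(model_path: str) -> dict:
    """Request body for a disk-based weight refit.

    Assumes the endpoint accepts a `model_path` field pointing at the
    checkpoint the trainer just saved.
    """
    return {"model_path": model_path}

def refit_from_disk(model_path: str) -> None:
    """POST the new checkpoint path so rollout engines reload weights."""
    data = json.dumps(build_refit_payload(model_path)).encode()
    req = urllib.request.Request(
        f"{SGLANG_URL}/update_weights_from_disk",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # blocks until the server finishes the refit
```

In a co-located setup the trainer would typically pause generation, call `refit_from_disk` after each training step, then resume rollouts.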
> Key benefits for RL infrastructure:
>
> - **Async Non-blocking Efficiency**: SGLang's native async server/router architecture (HTTPS/gRPC) manages concurrency automatically. This guarantees maximum GPU saturation and effective continuous batching without requiring complex, manual implementation by engineers.
> - **Elasticity and fault tolerance**: By encapsulating the Reward Model and Rollout as independent servers, SGLang decoupling them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption.
There's a minor grammatical error here: `decoupling` should be `decouples` to match the subject `SGLang`.
Suggested change:

> - **Elasticity and fault tolerance**: By encapsulating the Reward Model and Rollout as independent servers, SGLang decouples them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption.
@@ -0,0 +1,243 @@

# SGLang for RL Systems

This document is a practical guide for infrastructure teams integrating SGLang into RL and post-training systems. It focuses on the operational pain points in the loop (rollout, evaluation, training, weight sync) and maps them to concrete SGLang APIs, flags, and integration patterns. The emphasis is on minimizing iteration latency, preserving correctness, and keeping rollout and training behavior aligned in production environments.

### Pause Generation

**Endpoint:** `POST /pause_generation`

**Request body:**

| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `mode` | Pause mode. | `abort` | `abort`, `retract`, `in_place` |

**Modes:**

- `abort`: Abort and return all running requests.
- `retract`: Pause inference; move running requests back to the waiting queue. KV cache can be flushed and recomputed later.
- `in_place`: Pause inference; keep requests in the event loop with existing KV cache. KV flush will fail if a running batch exists.

### Continue Generation

**Endpoint:** `POST /continue_generation`
There is a subtle but important difference between this endpoint and the abort endpoint; you can find details in #10071.
Please add a more detailed explanation of these modes.
@@ -0,0 +1,247 @@

# SGLang for RL Systems

This document is a practical guide for infrastructure teams integrating SGLang into RL and post-training systems. It focuses on the operational pain points in the loop (rollout, evaluation, training, weight sync) and maps them to concrete SGLang APIs, flags, and integration patterns. The emphasis is on minimizing iteration latency, preserving correctness, and keeping rollout and training behavior aligned in production environments.
> minimizing iteration latency, preserving correctness, and keeping rollout and training behavior aligned in production environments.

This reads awkwardly to me. I think we should instead stress "rollout latency and throughput, accuracy and stability, and rollout-serving behavior aligned in production environments."
## Easy To Postpone Generation

Multi-turn RL rollouts often suffer from long-tail requests that block the entire batch. A small number of slow interactions can stall all GPUs, and the long-tail behavior makes profiling and monitoring difficult.

SGLang exposes explicit pause/resume APIs so you can pause slow requests and continue them later. This pattern matches systems like [APRIL](https://arxiv.org/abs/2509.18521), which over-provision rollouts, terminate once enough responses are collected, and recycle incomplete responses in the next step. The result is higher GPU utilization without discarding partial work.
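The over-provision-and-recycle pattern can be sketched as a pure selection step. This is only an illustration of the idea, not APRIL's or SGLang's actual code; the function name and the rollout dict shape are assumptions:

```python
def partition_rollouts(rollouts: list[dict], target: int):
    """APRIL-style selection: once `target` finished responses exist,
    take them and recycle the incomplete ones for the next step.

    Each rollout dict is assumed to carry a boolean "done" flag.
    Returns None while not enough responses are finished yet.
    """
    finished = [r for r in rollouts if r["done"]]
    if len(finished) < target:
        return None  # keep generating; don't pause yet
    recycled = [r for r in rollouts if not r["done"]]
    return finished[:target], recycled
```

When this returns a non-None result, the driver would pause generation, hand the finished responses to training, and re-enqueue the recycled ones in the following step.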
### Pause Generation

**Endpoint:** `POST /pause_generation`

**Request body:**

| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `mode` | Pause mode. | `abort` | `abort`, `retract`, `in_place` |
**Modes:**

- `abort`: Hard stop; repeatedly aborts all requests until no in-flight work remains. The scheduler is cleared.
- `retract`: Pause inference; move running requests back to the waiting queue. KV cache can be flushed and recomputed later.
- `in_place`: Pause inference; keep requests in the event loop with existing KV cache. Note: in `in_place` mode, `flush_cache` will fail if there are any requests in the running batch.
### Continue Generation

**Endpoint:** `POST /continue_generation`
Co-authored-by: JD <jaedon.guo@gmail.com>
Motivation
Modifications
Add a document about how to use SGLang as a rollout backend for RL systems.