[Docs] Add RL documentation#17663

Merged
zhaochenyang20 merged 7 commits into sgl-project:main from zijiexia:rl_doc
Jan 26, 2026

Conversation

@zijiexia
Contributor

Motivation

Modifications

Add a document about how to use SGLang as a rollout backend for RL systems

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 24, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @zijiexia, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's documentation by introducing a dedicated guide for integrating SGLang into Reinforcement Learning (RL) and post-training workflows. The new content aims to provide infrastructure teams with practical insights and API references to optimize iteration latency, ensure correctness, and align rollout and training behaviors in production environments, thereby solidifying SGLang's position as a robust backend for advanced AI systems.

Highlights

  • New RL Documentation: A comprehensive guide has been added detailing how to integrate SGLang as a rollout backend for Reinforcement Learning (RL) and post-training systems. This document covers operational pain points and maps them to concrete SGLang APIs and integration patterns.
  • Enhanced Feature Visibility: The main README and documentation index (index.rst) have been updated to prominently feature SGLang's role as an 'RL & Post-Training Backbone', highlighting its native RL integrations and adoption by various post-training frameworks.
  • Detailed API Usage for RL: The new documentation provides in-depth explanations and API specifications for fine-grained engine sleep/wake-up, three distinct weight refit strategies (from disk, tensor, and distributed group), mechanisms to postpone generation, deterministic inference, and the SGLang Model Gateway for load balancing.
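
To make the sleep/wake-up highlight concrete, here is a minimal sketch of one co-located RL step. The endpoint paths `/release_memory_occupation` and `/resume_memory_occupation` and the local server URL are assumptions to verify against the SGLang server API reference:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: co-located SGLang server


def post(path, body=None):
    """POST a JSON body to the engine and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body or {}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def train_step_with_sleep(run_training_step, post_fn=None):
    """One co-located RL step: sleep the engine, train, wake the engine."""
    post_fn = post_fn or post
    post_fn("/release_memory_occupation")  # engine sleeps: frees GPU memory
    run_training_step()                    # trainer has the GPU to itself
    post_fn("/resume_memory_occupation")   # engine wakes for the next rollout
```

The `post_fn` parameter is only there so the call sequence can be exercised without a live server.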


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds comprehensive documentation on using SGLang for Reinforcement Learning systems. The new document is well-structured and covers key features like memory management, weight updates, and deterministic inference, which will be very helpful for users. I've made a couple of minor suggestions to fix a formatting issue and a grammatical error in the new documentation file to improve its clarity.

Comment on lines +60 to +61
After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. Here's the detail of how to integrate SGLang weight update in verl's co-located strategy: [RL System Deep Thinking: Weight Update Mechanisms
](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md)
Contributor


medium

There's a newline character within the link's display text, which can cause rendering issues. Please remove it to ensure the link is displayed correctly on a single line.

Suggested change
After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. Here's the detail of how to integrate SGLang weight update in verl's co-located strategy: [RL System Deep Thinking: Weight Update Mechanisms
](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md)
After training completes each step, rollout engines must be refit with new weights. SGLang supports three refit strategies so you can match your infrastructure style (co-located vs disaggregated) and scaling needs. Each strategy maps to a concrete API with clear request schemas. Here's the detail of how to integrate SGLang weight update in verl's co-located strategy: [RL System Deep Thinking: Weight Update Mechanisms](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/sys-design/readme-1-EN.md)
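
As an illustration of the from-disk refit strategy described above, a minimal client call might look like the following. The `/update_weights_from_disk` endpoint name and `model_path` field are assumptions to check against the SGLang server API reference:

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: local rollout engine


def build_refit_request(model_path):
    """Body for POST /update_weights_from_disk (from-disk refit strategy)."""
    return {"model_path": model_path}


def refit_from_disk(model_path):
    """Point a running rollout engine at a freshly saved checkpoint."""
    req = urllib.request.Request(
        f"{BASE_URL}/update_weights_from_disk",
        data=json.dumps(build_refit_request(model_path)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```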

Key benefits for RL infrastructure:

- **Async Non-blocking Efficiency**: SGLang’s native async server/router architecture (HTTPS/gRPC) manages concurrency automatically. This guarantees maximum GPU saturation and effective continuous batching without requiring complex, manual implementation by engineers.
- **Elasticity and fault tolerance**: By encapsulating the Reward Model and Rollout as independent servers, SGLang decoupling them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption.
Contributor


medium

There's a minor grammatical error here. decoupling should be decouples to match the subject SGLang.

Suggested change
- **Elasticity and fault tolerance**: By encapsulating the Reward Model and Rollout as independent servers, SGLang decoupling them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption.
- **Elasticity and fault tolerance**: By encapsulating the Reward Model and Rollout as independent servers, SGLang decouples them logically and physically. This architecture provides robust disaster recovery for large-scale distributed training; if a server fails, the router automatically redirects traffic to healthy nodes, ensuring the training process continues without interruption.

@@ -0,0 +1,243 @@
# SGLang for RL Systems

This document is a practical guide for infrastructure teams integrating SGLang into RL and post-training systems. It focuses on the operational pain points in the loop (rollout, evaluation, training, weight sync) and maps them to concrete SGLang APIs, flags, and integration patterns. The emphasis is on minimizing iteration latency, preserving correctness, and keeping rollout and training behavior aligned in production environments.
Collaborator


Good catch!

Comment on lines +198 to +216
### Pause Generation

**Endpoint:** `POST /pause_generation`

**Request body:**

| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `mode` | Pause mode. | `abort` | `abort`, `retract`, `in_place` |

**Modes:**

- `abort`: Abort and return all running requests.
- `retract`: Pause inference; move running requests back to waiting queue. KV cache can be flushed and recomputed later.
- `in_place`: Pause inference; keep requests in event loop with existing KV cache. KV flush will fail if a running batch exists.

### Continue Generation

**Endpoint:** `POST /continue_generation`
Collaborator


There is actually a fairly complex difference from the abort endpoint. You can find details in #10071.

Contributor Author


Added a more detailed explanation of these modes.

@@ -0,0 +1,247 @@
# SGLang for RL Systems

This document is a practical guide for infrastructure teams integrating SGLang into RL and post-training systems. It focuses on the operational pain points in the loop (rollout, evaluation, training, weight sync) and maps them to concrete SGLang APIs, flags, and integration patterns. The emphasis is on minimizing iteration latency, preserving correctness, and keeping rollout and training behavior aligned in production environments.
Collaborator


minimizing iteration latency, preserving correctness, and keeping rollout and training behavior aligned in production environments.

Looks weird to me. I think we can stress "rollout latency and throughput, accuracy and stability, and rollout-serving behavior aligned in production environments."

Comment on lines +196 to +220
## Easy To Postpone Generation

Multi-turn RL rollouts often suffer from long-tail requests that block the entire batch. A small number of slow interactions can stall all GPUs, and the long-tail behavior makes profiling and monitoring difficult.

SGLang exposes explicit pause/resume APIs so you can pause slow requests and continue them later. This pattern matches systems like [APRIL](https://arxiv.org/abs/2509.18521), which over-provision rollouts, terminate once enough responses are collected, and recycle incomplete responses in the next step. The result is higher GPU utilization without discarding partial work.

### Pause Generation

**Endpoint:** `POST /pause_generation`

**Request body:**

| Field | Description | Defaults | Options |
| --- | --- | --- | --- |
| `mode` | Pause mode. | `abort` | `abort`, `retract`, `in_place` |

**Modes:**

- `abort`: Hard stop; repeatedly aborts all requests until no in-flight work remains. Scheduler is cleared.
- `retract`: Pause inference; move running requests back to waiting queue. KV cache can be flushed and recomputed later.
- `in_place`: Pause inference; keep requests in event loop with existing KV cache. Note: In `in_place` mode, `flush_cache` will fail if there are any requests in the running batch.

### Continue Generation

**Endpoint:** `POST /continue_generation`
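
A usage sketch for the two endpoints above, assuming a local SGLang server (the URL is an assumption; endpoint paths and the `mode` field follow the tables in the doc):

```python
import json
import urllib.request

BASE_URL = "http://localhost:30000"  # assumption: local SGLang server

VALID_MODES = ("abort", "retract", "in_place")


def build_pause_request(mode="retract"):
    """Body for POST /pause_generation; mode semantics as in the table above."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    return {"mode": mode}


def post(path, body=None):
    """POST a JSON body to the server and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(body or {}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Typical pattern: retract long-tail requests before a weight refit,
# then resume them afterwards:
#   post("/pause_generation", build_pause_request("retract"))
#   ... refit weights ...
#   post("/continue_generation")
```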
Collaborator


@JD-ETH could you help to check this?

Contributor

@JD-ETH JD-ETH left a comment


personal remarks

Contributor

@JD-ETH JD-ETH left a comment


nice!

@zhaochenyang20 zhaochenyang20 merged commit dd97e1f into sgl-project:main Jan 26, 2026
46 checks passed
Chen-0210 pushed a commit to Chen-0210/sglang that referenced this pull request Jan 30, 2026
Co-authored-by: JD <jaedon.guo@gmail.com>
Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Co-authored-by: JD <jaedon.guo@gmail.com>