[trainer] feat: add new trainer with TransferQueue #5401

wuxibin89 merged 53 commits into verl-project:main
Conversation
Code Review
This pull request introduces a new synchronous PPO trainer using TransferQueue, which is a significant architectural change. The new implementation includes a replay buffer and updated data handling utilities for nested tensors. However, the review identified several critical issues. Key methods like _save_checkpoint and _compute_reward_colocate are not implemented, which will lead to runtime crashes. There is also a critical bug in the response_to_nested utility function that will cause errors during tensor creation. Additionally, there are high-severity concerns regarding unsafe process termination and data dispatching logic that could lead to silent failures. These issues must be addressed before this PR can be merged.
```python
def _compute_reward_colocate(self, batch: KVBatchMeta, metrics: dict) -> KVBatchMeta:
    """Compute the reward with colocate reward model."""
    # TODO: add reward model
    raise NotImplementedError
```
#5375 (comment): please take a look at this when you have some spare time. Thanks.
### What does this PR do?

In verl 0.8, we will refactor the integration approach for TransferQueue. For reference: #5401. This PR will first clean up the legacy code to facilitate the subsequent TQ integration. CC @zw0610 @wuxibin89

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
A side note for a future merge: there is a bug in the metrics log, where response/aborted_ratio and prompt_length/clip_ratio are near 100%. The root cause is the removal of padding in TQWorker. In metric_utils.py, we should find a clean way to modify compute_data_metrics(), which is currently designed for general usage.
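A length-based computation of the kind hinted at above might look as follows. This is an illustrative sketch, not verl's actual compute_data_metrics(); `response_lengths` and `max_response_len` are assumed names. The point is that once TQWorker strips padding, clip/abort ratios must be derived from per-sample lengths rather than from the (now absent) padded tensor width.

```python
def clip_ratio(response_lengths: list[int], max_response_len: int) -> float:
    """Fraction of responses that hit the generation length limit.

    Computed from actual per-sample lengths, so it stays correct
    whether or not the batch tensors carry padding.
    """
    if not response_lengths:
        return 0.0
    clipped = sum(1 for n in response_lengths if n >= max_response_len)
    return clipped / len(response_lengths)
```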
/gemini review
Code Review
This pull request introduces a synchronous PPO trainer (main_ppo_sync.py) leveraging TransferQueue and ReplayBuffer for zero-copy data transfer and flexible sampling. It includes a new tqbridge decorator to facilitate communication between the trainer and distributed workers using BatchMeta objects. Additionally, the PR adds support for multi-trajectory advantage estimation (GRPO) and optimizes metric computation by utilizing pre-calculated sequence lengths. Review feedback identifies a critical TypeError in the NonTensorStack initialization within list_of_dict_to_tensordict and a logging syntax error in the ReplayBuffer class.
```python
else (
    torch.nested.as_nested_tensor(val_list, layout=torch.jagged)
    if val_list and all(isinstance(item, torch.Tensor) for item in val_list)
    else NonTensorStack(*val_list)
```
The NonTensorStack constructor expects TensorDictBase objects (such as NonTensorData). Passing a list of raw objects (e.g., strings or integers) via *val_list will result in a TypeError. Use NonTensorStack.from_list with explicit NonTensorData wrapping to correctly handle non-tensor data, consistent with the implementation in get_tensordict at line 405.
```diff
- else NonTensorStack(*val_list)
+ else NonTensorStack.from_list([NonTensorData(item) for item in val_list])
```
```python
    tags.append(tag)
else:
    logger.debug(f"Unknown status {tag['status']} for key {key}")
logger.info("partitions", self.partitions)
```
The logging call is missing a format specifier. In Python's logging module, logger.info(msg, *args) expects msg to be a format string if additional arguments are provided. As written, self.partitions will be ignored and only the string "partitions" will be logged. Use a format string like "partitions: %s" to ensure the data is captured.
```diff
- logger.info("partitions", self.partitions)
+ logger.info("partitions: %s", self.partitions)
```
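The difference matters at runtime. A minimal self-contained demonstration of the corrected call (the logger name and `partitions` value here are made up for illustration):

```python
import io
import logging

# Capture log output so the difference is visible programmatically.
buf = io.StringIO()
logger = logging.getLogger("replay_buffer_demo")
logger.addHandler(logging.StreamHandler(buf))
logger.setLevel(logging.INFO)
logger.propagate = False

partitions = {"train": [0, 1], "val": [2]}

# Wrong: logger.info("partitions", partitions) passes the dict as a %-format
# argument; with no %s in the message, record formatting fails and the data
# never reaches the log output.

# Right: a %s placeholder with logging's lazy %-style formatting, which only
# builds the string if the record is actually emitted.
logger.info("partitions: %s", partitions)
```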
When an agent loop has multiple outputs for a given session, rewards are currently computed on the last sequence only and then propagated to the other outputs.
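The behavior described above can be sketched as follows. All names (`session_id`, `reward`, the dict layout) are hypothetical stand-ins for illustration, not verl's actual agent-loop data structures:

```python
def propagate_session_reward(outputs: list[dict]) -> list[dict]:
    """Copy each session's last computed reward to its other outputs.

    `outputs` holds per-sequence dicts sharing a 'session_id'; only the
    final sequence of each session carries a computed 'reward'.
    """
    last_reward: dict[str, float] = {}
    for out in outputs:  # assumes outputs are in generation order
        if "reward" in out:
            last_reward[out["session_id"]] = out["reward"]
    for out in outputs:
        # fill in the missing rewards without overwriting computed ones
        out.setdefault("reward", last_reward[out["session_id"]])
    return outputs
```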
```python
final_output = tu.get_tensordict(tensor_dict=model_output, non_tensor_dict={"metrics": final_metrics})
return final_output
```

```python
@tqbridge(dispatch_mode=make_nd_compute_dataproto_dispatch_fn(mesh_name="train"))
```
@0oshowero0 We'd better hide tqbridge inside @register.
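One way to read this suggestion: fold the tqbridge wrapper into the existing register decorator so worker methods only need a single annotation. The sketch below uses simplified stand-ins for `register` and `tqbridge`, not verl's real signatures:

```python
import functools

def tqbridge(dispatch_mode=None):
    """Stand-in bridge decorator: in verl this would resolve BatchMeta
    handles into actual tensors pulled from TransferQueue."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper._dispatch_mode = dispatch_mode
        return wrapper
    return deco

def register(dispatch_mode=None, use_tq=True):
    """Stand-in register decorator that applies tqbridge internally,
    so call sites no longer stack two decorators."""
    def deco(fn):
        if use_tq:
            fn = tqbridge(dispatch_mode=dispatch_mode)(fn)
        fn._registered = True
        return fn
    return deco

@register(dispatch_mode="nd_compute", use_tq=True)
def update_actor(batch):
    return {"loss": 0.0}
```

With this shape, the dispatch mode is declared once on `@register` and the bridging becomes an implementation detail of registration.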
### What does this PR do?

#### Background

Inside the current tq_trainer, AgentLoopWorkerTQ already supports the multi-trajectory feature. However, during actual training, sample (trajectory)-level padding is still required for each batch so that the number of samples is divisible by both `dp_size` and `mini_batch_size`; otherwise, an error will be thrown. This PR fixes the bug and addresses the following considerations:

#### Upsampling

1. **LCM alignment**: `% dp_size == 0` and `% mini_batch_size == 0` (and `% critic_mini_batch_size == 0` if training the critic).
2. **Notice**: We should **not** use pad_dataproto_to_divisor() for padding, as it takes the first `pad_size` existing samples and would pollute the grpo_adv, gradients, metrics, etc., causing inconsistency.

#### Padding

1. Padded samples use **independent UIDs** to avoid interfering with GRPO advantage computation.
2. Padded samples are constructed with the **shortest possible sequence** (one prompt token + one response token) to minimize redundant computation.
3. An `is_padding` flag is added to the tags of padded samples to avoid impacting the **accuracy metrics** such as score, reward, and response length (while performance metrics still include padded samples).
4. Maintains the router_replay shape and position_ids shape, and adds a multi-modal inputs placeholder in the VLM case when necessary.

#### Related PRs

1. #5636
2. #5401

#### Verification Experiments with Multi-Trajectory Agent

The three smallest primes (2, 3, and 5) are chosen to form the relevant hyperparameters: [dp=2, batch_size=45, mini_batch_size=15, rollout.n=8].

<img width="2160" height="1224" alt="20260410_225225_critic_score_mean" src="https://github.com/user-attachments/assets/64de98fa-4356-45bd-abf7-6d5cb2fa9640" />
### What does this PR do?

This PR introduces `main_ppo_sync.py`, a new synchronous PPO trainer that decouples control flow from data flow by integrating TransferQueue as the data plane.

Key innovation: the trainer (single-controller) now orchestrates only metadata (`KVBatchMeta`), while large tensors are transferred directly between workers via TransferQueue using zero-copy serialization. This eliminates the controller bottleneck in large-scale RL training.

See RFC #5400 for the detailed design discussion.
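A hedged sketch of the metadata-only control flow described above. Only the name `KVBatchMeta` comes from the PR; the fields and helper below are assumptions for illustration, not verl's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class KVBatchMeta:
    """Lightweight handle the controller passes around instead of tensors."""
    batch_id: str
    keys: list[str] = field(default_factory=list)       # tensor names in TransferQueue
    sample_ids: list[int] = field(default_factory=list) # which rows this batch covers
    tags: dict = field(default_factory=dict)            # e.g. status / is_padding flags

def controller_step(meta: KVBatchMeta) -> KVBatchMeta:
    # The single-controller only rewrites metadata; workers later pull the
    # actual tensors from TransferQueue by (batch_id, keys, sample_ids).
    meta.keys = meta.keys + ["old_log_probs"]
    return meta
```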
### Performance Results
Comprehensive benchmarks on both GPU (H100/H20) and NPU (A3) platforms demonstrate 7%–49.2% E2E performance gains:
Details for reproducing the reported performance can be found in this blog post.
DAPO-Math-17k + Router Replay R3, Qwen3-30B-A3B, 2x8 H100, 14.1% E2E Improvement
Geo3k, 4x8 H100, Qwen3-VL-30B-A3B, 12.5% E2E Improvement
Geo3k, 2x16 A3, Qwen3-VL-30B-A3B, 7% E2E Improvement
OneThinker, 2x8 H100, Qwen3-VL-30B-A3B, 22.4% E2E Improvement
OneThinker, 2x16 A3, Qwen3-VL-30B-A3B, 20% E2E Improvement
OneThinker, 16x8 H100, Qwen3-VL-30B-A3B, 49.2% E2E Improvement
### Accuracy Test
DAPO17k + AIME24+ Qwen3-8B-Base, 1x8 H200
DAPO17k + AIME24 + Qwen3-8B-Base, 1x8 A3
Geo3k, Qwen3-VL-30B-A3B, 4x8 H100
### TransferQueue Stress Test
We provide a stress test report that demonstrates more than 8192 concurrent clients writing 2 TB of data into TransferQueue across 4 nodes. The system remains stable without any crashes or data loss.
During early development, we successfully deployed TransferQueue x verl on 64x16 A3 nodes for Qwen3-235B, with seq_len=64k, GBS=256, N_Sample=16.
### Scripts