
[fully_async, rollout] feat: enable online policy distillation in fully async training#6056

Draft
xiefan46 wants to merge 8 commits into verl-project:main from xiefan46:async-opd

Conversation


@xiefan46 xiefan46 commented Apr 18, 2026

What does this PR do?

Enables Online Policy Distillation (OPD) in fully async training mode. Previously, distillation was only supported in sync mode because TeacherModelManager required a shared resource pool. This PR allows the
teacher model to run in standalone mode with its own GPU, removing the dependency on sync resource pool allocation.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, vllm_omni, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Tested on 4 H100 GPUs.
Full wandb link:
https://wandb.ai/models-xx/verl-test-fully-async-opd?nw=nwuserfxie46

Metric curves (screenshots in the wandb run above): critic/score/mean, actor/loss, actor/grad_norm, actor/distillation/loss

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request integrates Online Policy Distillation (OPD) into the fully async policy training pipeline. It enables distillation in the agent loop, implements standalone teacher model management using thread executors to avoid event loop conflicts, and includes a new E2E test script. Review feedback highlights a potential TypeError when initializing workers with unsupported distillation arguments and suggests adding validation to ensure the teacher model replicas are correctly initialized when no resource pool is provided.
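The "thread executors to avoid event loop conflicts" point above can be illustrated with a minimal sketch. This is not verl's actual code; `blocking_teacher_call` and `distill_step` are hypothetical stand-ins showing how a synchronous teacher-model call can be offloaded so it does not block the agent loop's asyncio event loop:

```python
# Hedged sketch, not verl's implementation: offload a blocking teacher-model
# call to a thread executor so the agent-loop event loop stays responsive.
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def blocking_teacher_call(x):
    # stand-in for a synchronous call into a teacher model manager
    return x * 2

async def distill_step(x):
    loop = asyncio.get_running_loop()
    # run the blocking call in the executor instead of on the event loop
    return await loop.run_in_executor(_executor, blocking_teacher_call, x)

result = asyncio.run(distill_step(21))
```

Running blocking calls directly inside a coroutine would stall every other task sharing the loop; `run_in_executor` is the standard asyncio pattern for avoiding that.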

Comment thread verl/experimental/fully_async_policy/fully_async_trainer.py
Comment on lines +61 to +65
if self.resource_pool:
world_size = self.resource_pool.world_size
else:
world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size

Severity: high

If resource_pool is None and the configuration for n_gpus_per_node or nnodes is missing or set to zero, world_size will be zero. This results in num_replicas being zero, leading to an empty rollout_replicas list. Consequently, the GlobalRequestLoadBalancer will be initialized with an empty list of server addresses, which will cause runtime errors when distillation requests are dispatched. You should add a validation check to ensure num_replicas > 0 when distillation is enabled.
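The suggested fix can be sketched as a self-contained helper. This mirrors the quoted snippet's logic and adds the validation the review asks for; the function name and flat-argument signature are illustrative, not the PR's actual interface:

```python
# Hedged sketch of the replica-count logic with the suggested guard added.
def compute_num_replicas(resource_pool_world_size, n_gpus_per_node, nnodes,
                         teacher_world_size):
    # prefer the resource pool's world size when a pool is provided
    if resource_pool_world_size is not None:
        world_size = resource_pool_world_size
    else:
        world_size = n_gpus_per_node * nnodes
    num_replicas = world_size // teacher_world_size
    # fail fast instead of handing an empty replica list to the load balancer
    if num_replicas <= 0:
        raise ValueError(
            f"num_replicas must be > 0 when distillation is enabled "
            f"(world_size={world_size}, teacher_world_size={teacher_world_size})"
        )
    return num_replicas
```

With the guard, a misconfigured `n_gpus_per_node=0` raises immediately at initialization rather than surfacing later as a runtime error in `GlobalRequestLoadBalancer`.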

@xiefan46 xiefan46 changed the title Async opd [fully_async, rollout] feat: enable online policy distillation in fully async training Apr 20, 2026
@xiefan46 xiefan46 force-pushed the async-opd branch 6 times, most recently from f700e33 to d927953 Compare April 21, 2026 06:57
…ly async training

Enable OPD in fully async mode by allowing TeacherModelManager and
MultiTeacherModelManager to run in standalone mode (resource_pool=None).
Each teacher replica allocates its own GPU via init_standalone(), removing
the dependency on centralized resource pool allocation.

Changes:
- teacher_model.py: make resource_pool optional in both TeacherModelManager
  and MultiTeacherModelManager. When None, replicas use init_standalone().
- fully_async_rollouter.py: create MultiTeacherModelManager(resource_pool=None)
  and pass to AgentLoopManager
- fully_async_trainer.py: pass distillation_config to DetachActorWorker;
  add self.distillation_config for _update_actor
- engine_workers.py: accept **kwargs in DetachActorWorker to forward
  distillation_config to parent
- agent_loop.py: use MultiTeacherModelManager type (matching base class)

Tested on 4x H100 (1 rollout + 2 training + 1 teacher), GSM8K, GRPO.
Student: Qwen2.5-0.5B, Teacher: Qwen2.5-3B-Instruct.
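The optional-resource-pool pattern described in the commit message can be sketched as follows. All class and method names here are simplified stand-ins for illustration (the real `TeacherModelManager` and `init_standalone()` live in verl's `teacher_model.py`):

```python
# Hedged sketch of "resource_pool is optional": when no pool is given,
# each teacher replica initializes in standalone mode on its own GPU.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourcePool:
    world_size: int

class TeacherReplica:
    def __init__(self):
        self.initialized_mode = None

    def init_standalone(self):
        # in the real code this allocates the replica's own GPU
        self.initialized_mode = "standalone"

    def init_from_pool(self, pool: ResourcePool):
        # sync mode: placement comes from the shared resource pool
        self.initialized_mode = "shared"

class TeacherModelManager:
    def __init__(self, num_replicas: int,
                 resource_pool: Optional[ResourcePool] = None):
        self.resource_pool = resource_pool
        self.replicas = [TeacherReplica() for _ in range(num_replicas)]
        for replica in self.replicas:
            if self.resource_pool is None:
                replica.init_standalone()  # fully async mode
            else:
                replica.init_from_pool(self.resource_pool)
```

Making `resource_pool` default to `None` keeps the sync-mode call sites unchanged while letting the fully async rollouter construct the manager without a pool.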
