
[fully_async, rollout] feat: enable online policy distillation in fully async training#6056

Draft
xiefan46 wants to merge 8 commits into verl-project:main from xiefan46:async-opd

Conversation


@xiefan46 xiefan46 commented Apr 18, 2026

What does this PR do?

Enables Online Policy Distillation (OPD) in fully async training mode. Previously, distillation was only supported in sync mode because TeacherModelManager required a shared resource pool. This PR allows the
teacher model to run in standalone mode with its own GPU, removing the dependency on sync resource pool allocation.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, vllm_omni, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Tested on 4 H100 GPUs.
Full wandb link:
https://wandb.ai/models-xx/verl-test-fully-async-opd?nw=nwuserfxie46

Metric curves (screenshots in the wandb run above): critic/score/mean, actor/loss, actor/grad_norm, actor/distillation/loss

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request integrates Online Policy Distillation (OPD) into the fully async policy training pipeline. It enables distillation in the agent loop, implements standalone teacher model management using thread executors to avoid event loop conflicts, and includes a new E2E test script. Review feedback highlights a potential TypeError when initializing workers with unsupported distillation arguments and suggests adding validation to ensure the teacher model replicas are correctly initialized when no resource pool is provided.
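The "thread executors to avoid event loop conflicts" point above can be illustrated with a minimal sketch. This is not verl's actual code; `blocking_teacher_call` and `distill_step` are hypothetical stand-ins showing how a synchronous teacher-model call can be offloaded so it does not block the agent loop's asyncio event loop:

```python
# Hedged sketch, not verl's implementation: offload a blocking teacher-model
# call to a thread executor so the agent-loop event loop stays responsive.
import asyncio
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def blocking_teacher_call(x):
    # stand-in for a synchronous call into a teacher model manager
    return x * 2

async def distill_step(x):
    loop = asyncio.get_running_loop()
    # run the blocking call in the executor instead of on the event loop
    return await loop.run_in_executor(_executor, blocking_teacher_call, x)

result = asyncio.run(distill_step(21))
```

Running blocking calls directly inside a coroutine would stall every other task sharing the loop; `run_in_executor` is the standard asyncio pattern for avoiding that.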

Comment thread verl/experimental/fully_async_policy/fully_async_trainer.py
Comment on lines +61 to +65
if self.resource_pool:
world_size = self.resource_pool.world_size
else:
world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size

Severity: high

If resource_pool is None and the configuration for n_gpus_per_node or nnodes is missing or set to zero, world_size will be zero. This results in num_replicas being zero, leading to an empty rollout_replicas list. Consequently, the GlobalRequestLoadBalancer will be initialized with an empty list of server addresses, which will cause runtime errors when distillation requests are dispatched. You should add a validation check to ensure num_replicas > 0 when distillation is enabled.
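The suggested fix can be sketched as a self-contained helper. This mirrors the quoted snippet's logic and adds the validation the review asks for; the function name and flat-argument signature are illustrative, not the PR's actual interface:

```python
# Hedged sketch of the replica-count logic with the suggested guard added.
def compute_num_replicas(resource_pool_world_size, n_gpus_per_node, nnodes,
                         teacher_world_size):
    # prefer the resource pool's world size when a pool is provided
    if resource_pool_world_size is not None:
        world_size = resource_pool_world_size
    else:
        world_size = n_gpus_per_node * nnodes
    num_replicas = world_size // teacher_world_size
    # fail fast instead of handing an empty replica list to the load balancer
    if num_replicas <= 0:
        raise ValueError(
            f"num_replicas must be > 0 when distillation is enabled "
            f"(world_size={world_size}, teacher_world_size={teacher_world_size})"
        )
    return num_replicas
```

With the guard, a misconfigured `n_gpus_per_node=0` raises immediately at initialization rather than surfacing later as a runtime error in `GlobalRequestLoadBalancer`.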

@xiefan46 xiefan46 changed the title Async opd [fully_async, rollout] feat: enable online policy distillation in fully async training Apr 20, 2026
@xiefan46 xiefan46 force-pushed the async-opd branch 6 times, most recently from f700e33 to d927953 Compare April 21, 2026 06:57
…ly async training

Enable OPD in fully async mode by allowing TeacherModelManager and
MultiTeacherModelManager to run in standalone mode (resource_pool=None).
Each teacher replica allocates its own GPU via init_standalone(), removing
the dependency on centralized resource pool allocation.

Changes:
- teacher_model.py: make resource_pool optional in both TeacherModelManager
  and MultiTeacherModelManager. When None, replicas use init_standalone().
- fully_async_rollouter.py: create MultiTeacherModelManager(resource_pool=None)
  and pass to AgentLoopManager
- fully_async_trainer.py: pass distillation_config to DetachActorWorker;
  add self.distillation_config for _update_actor
- engine_workers.py: accept **kwargs in DetachActorWorker to forward
  distillation_config to parent
- agent_loop.py: use MultiTeacherModelManager type (matching base class)

Tested on 4x H100 (1 rollout + 2 training + 1 teacher), GSM8K, GRPO.
Student: Qwen2.5-0.5B, Teacher: Qwen2.5-3B-Instruct.
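The optional-resource-pool pattern described in the commit message can be sketched as follows. All class and method names here are simplified stand-ins for illustration (the real `TeacherModelManager` and `init_standalone()` live in verl's `teacher_model.py`):

```python
# Hedged sketch of "resource_pool is optional": when no pool is given,
# each teacher replica initializes in standalone mode on its own GPU.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResourcePool:
    world_size: int

class TeacherReplica:
    def __init__(self):
        self.initialized_mode = None

    def init_standalone(self):
        # in the real code this allocates the replica's own GPU
        self.initialized_mode = "standalone"

    def init_from_pool(self, pool: ResourcePool):
        # sync mode: placement comes from the shared resource pool
        self.initialized_mode = "shared"

class TeacherModelManager:
    def __init__(self, num_replicas: int,
                 resource_pool: Optional[ResourcePool] = None):
        self.resource_pool = resource_pool
        self.replicas = [TeacherReplica() for _ in range(num_replicas)]
        for replica in self.replicas:
            if self.resource_pool is None:
                replica.init_standalone()  # fully async mode
            else:
                replica.init_from_pool(self.resource_pool)
```

Making `resource_pool` default to `None` keeps the sync-mode call sites unchanged while letting the fully async rollouter construct the manager without a pool.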
