[fully_async, rollout] feat: enable online policy distillation in fully async training #6056
xiefan46 wants to merge 8 commits into verl-project:main
Conversation
Code Review
This pull request integrates Online Policy Distillation (OPD) into the fully async policy training pipeline. It enables distillation in the agent loop, implements standalone teacher model management using thread executors to avoid event loop conflicts, and includes a new E2E test script. Review feedback highlights a potential TypeError when initializing workers with unsupported distillation arguments and suggests adding validation to ensure the teacher model replicas are correctly initialized when no resource pool is provided.
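The "thread executors" approach mentioned above can be illustrated with a minimal sketch (all names below are illustrative, not the PR's actual identifiers): blocking teacher-model calls are dispatched to a dedicated worker thread so they never run on, or re-enter, the event loop that drives the agent loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated executor for teacher-model calls, kept off the agent loop's
# event loop so a slow forward pass cannot stall async rollout tasks.
_teacher_executor = ThreadPoolExecutor(max_workers=1)

def teacher_logits_blocking(prompt):
    # Stand-in for a synchronous teacher forward pass.
    return f"logits({prompt})"

async def distill_step(prompt):
    loop = asyncio.get_running_loop()
    # run_in_executor hands the blocking call to the thread pool and
    # yields control back to the event loop until the result is ready.
    return await loop.run_in_executor(_teacher_executor, teacher_logits_blocking, prompt)

result = asyncio.run(distill_step("hello"))
```

This is the standard asyncio pattern for wrapping synchronous work; the PR applies the same idea to standalone teacher replicas.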
```python
if self.resource_pool:
    world_size = self.resource_pool.world_size
else:
    world_size = teacher_model_config.n_gpus_per_node * teacher_model_config.nnodes
num_replicas = world_size // teacher_world_size
```
If resource_pool is None and the configuration for n_gpus_per_node or nnodes is missing or set to zero, world_size will be zero. This results in num_replicas being zero, leading to an empty rollout_replicas list. Consequently, the GlobalRequestLoadBalancer will be initialized with an empty list of server addresses, which will cause runtime errors when distillation requests are dispatched. You should add a validation check to ensure num_replicas > 0 when distillation is enabled.
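The suggested guard can be sketched as follows. This is a minimal standalone sketch: `compute_num_replicas` is a hypothetical helper mirroring the logic quoted above, not the actual function in `teacher_model.py`.

```python
def compute_num_replicas(resource_pool, n_gpus_per_node, nnodes, teacher_world_size):
    """Hypothetical helper mirroring the replica computation in teacher_model.py."""
    if resource_pool is not None:
        world_size = resource_pool.world_size
    else:
        # Standalone mode: derive world size from the teacher model config.
        world_size = n_gpus_per_node * nnodes
    num_replicas = world_size // teacher_world_size
    if num_replicas <= 0:
        raise ValueError(
            f"num_replicas must be > 0 when distillation is enabled, got "
            f"world_size={world_size}, teacher_world_size={teacher_world_size}"
        )
    return num_replicas
```

With a guard like this, a misconfigured `n_gpus_per_node=0` fails fast at initialization instead of producing an empty replica list for the `GlobalRequestLoadBalancer` and failing later at dispatch time.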
Force-pushed f700e33 to d927953
…ly async training

Enable OPD in fully async mode by allowing TeacherModelManager and MultiTeacherModelManager to run in standalone mode (resource_pool=None). Each teacher replica allocates its own GPU via init_standalone(), removing the dependency on centralized resource pool allocation.

Changes:
- teacher_model.py: make resource_pool optional in both TeacherModelManager and MultiTeacherModelManager. When None, replicas use init_standalone().
- fully_async_rollouter.py: create MultiTeacherModelManager(resource_pool=None) and pass it to AgentLoopManager.
- fully_async_trainer.py: pass distillation_config to DetachActorWorker; add self.distillation_config for _update_actor.
- engine_workers.py: accept **kwargs in DetachActorWorker to forward distillation_config to the parent.
- agent_loop.py: use the MultiTeacherModelManager type (matching the base class).

Tested on 4x H100 (1 rollout + 2 training + 1 teacher), GSM8K, GRPO. Student: Qwen2.5-0.5B, Teacher: Qwen2.5-3B-Instruct.
What does this PR do?
Enables Online Policy Distillation (OPD) in fully async training mode. Previously, distillation was only supported in sync mode because TeacherModelManager required a shared resource pool. This PR allows the
teacher model to run in standalone mode with its own GPU, removing the dependency on sync resource pool allocation.
Checklist Before Starting
- Title follows the format `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `vllm_omni`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, `fully_async`, `one_step_off`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If the PR is breaking, prepend `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`.
Test
Tested on 4 H100 GPUs.
full wandb link:
https://wandb.ai/models-xx/verl-test-fully-async-opd?nw=nwuserfxie46
[wandb charts: critic/score/mean, actor/loss, actor/grad_norm, actor/distillation/loss]
API and Usage Example
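A minimal sketch of how the standalone teacher wiring described in this PR might be used. The class definitions below are simplified stand-ins for illustration; the real `MultiTeacherModelManager` and `AgentLoopManager` live in verl and have richer constructors.

```python
class MultiTeacherModelManager:
    """Simplified stand-in: resource_pool=None selects standalone mode."""
    def __init__(self, resource_pool=None):
        self.resource_pool = resource_pool
        # In standalone mode each teacher replica would call init_standalone()
        # to allocate its own GPU instead of drawing from a shared pool.
        self.standalone = resource_pool is None

class AgentLoopManager:
    """Simplified stand-in that receives the teacher manager."""
    def __init__(self, teacher_manager=None):
        self.teacher_manager = teacher_manager

# fully_async_rollouter.py-style setup: teachers run standalone, no shared pool.
teacher_manager = MultiTeacherModelManager(resource_pool=None)
agent_loop = AgentLoopManager(teacher_manager=teacher_manager)
```

Passing `resource_pool=None` is the new code path this PR adds; the previous sync-mode path, where a shared resource pool is provided, is unchanged.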
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`.
- Request CI via the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group.)
- If the PR touches the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.