Pull request overview
This PR adds a diffusion reward loop pipeline to support image-based reward computation in reinforcement learning workflows. It introduces a specialized reward loop manager (DiffusionRewardLoopManager) and reward manager (DiffusionRewardManager) designed to handle image outputs from diffusion models, along with tests and reward computation functions for OCR-based evaluation.
- Adds `DiffusionRewardLoopManager` and `DiffusionRewardLoopWorker` for distributed reward computation on image data
- Implements `DiffusionRewardManager` with support for async reward scoring of image outputs
- Provides OCR reward computation function with generative reward model (GRM) support using image-to-text extraction
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `verl/experimental/reward_loop/reward_manager/diffusion.py` | Implements `DiffusionRewardManager` that extends `RewardManagerBase` for image-based reward computation |
| `verl/experimental/reward_loop/diffusion_reward_loop.py` | Core implementation of `DiffusionRewardLoopWorker` and `DiffusionRewardLoopManager` for distributed image reward processing |
| `verl/experimental/reward_loop/reward_manager/__init__.py` | Registers `DiffusionRewardManager` in module exports |
| `verl/experimental/reward_loop/__init__.py` | Exports `DiffusionRewardLoopManager` for external use |
| `tests/experimental/reward_loop/test_diffusion_reward_model_genrm.py` | Adds unit test for `DiffusionRewardLoopManager` with OCR-based image evaluation |
| `tests/experimental/reward_loop/reward_fn.py` | Adds `compute_score_ocr` function for OCR reward computation using GRM with Levenshtein distance scoring |
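The last row mentions Levenshtein distance scoring. As a rough illustration of that kind of score (not the PR's `compute_score_ocr`), the sketch below assumes the reward is one minus the edit distance normalized by the longer string. The names `levenshtein` and `ocr_similarity_reward`, the lowercasing, and the normalization are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def ocr_similarity_reward(extracted: str, target: str) -> float:
    """Hypothetical reward in [0, 1]: 1 - edit distance / longer length."""
    extracted, target = extracted.strip().lower(), target.strip().lower()
    longest = max(len(extracted), len(target))
    if longest == 0:
        return 1.0  # two empty strings count as a perfect match
    return 1.0 - levenshtein(extracted, target) / longest
```

Dividing by the longer length keeps the score bounded, since the edit distance can never exceed it.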
    def prepare_query(self, chat, prompt, image_base64: str) -> list:
        query = [
            {
                "type": "image_url",
                "image_url": {"url": image_base64},
            },
        ]
        return query
The parameters chat and prompt are not used in the prepare_query method body. Consider removing them if they are not needed, or include them in the query if they were intended to be used.
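The two options in this comment could look roughly like the sketch below. Both function names are hypothetical, and the `{"type": "text", ...}` part assumes the reward model accepts OpenAI-style multimodal message content, which the PR does not confirm.

```python
from typing import Any


def prepare_query_image_only(image_base64: str) -> list[dict[str, Any]]:
    """Option 1: drop the unused parameters and build an image-only query."""
    return [{"type": "image_url", "image_url": {"url": image_base64}}]


def prepare_query_with_prompt(prompt: str, image_base64: str) -> list[dict[str, Any]]:
    """Option 2: use the prompt by adding a text part next to the image part."""
    return [
        {"type": "image_url", "image_url": {"url": image_base64}},
        {"type": "text", "text": prompt},  # assumed OpenAI-style text part
    ]
```

Which option fits depends on whether the reward model is meant to see the original prompt or only the generated image.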
Let me know if this is OK to merge.
    @@ -0,0 +1,110 @@
    # Copyright 2024 Bytedance Ltd. and/or its affiliates
    # Copyright 2026 Huawei Technologies Co., Ltd
    # Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
The header needs to read `Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.` because CI checks the whole sentence.
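The CI script itself is not shown here, so the following is only a guess at what "checks the whole sentence" means: a plain substring test against the first few lines of a file. The function name `has_required_header` and the five-line window are assumptions.

```python
REQUIRED_HEADER = "Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved."


def has_required_header(source: str, max_lines: int = 5) -> bool:
    """Return True if the full required sentence appears in the first lines."""
    head = source.splitlines()[:max_lines]
    return any(REQUIRED_HEADER in line for line in head)
```

Under this reading, the header in the diff above (`# Copyright 2026 Huawei Technologies Co., Ltd`) would fail because both the year and the trailing `All Rights Reserved.` differ.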
* init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright
* add entroypoint (#1) * add training engine (#2) * add training engine * fix init * fix typs * move folders & make for two-forward pass in training loop (#4) * Add diffusion reward loop (#3) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * [fix] update customized reward func in UT (#5) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * update customized reward_fn * init dataset for Qwen-Image * pass UT * update return, update UT * pass UT * align with rl_dataset * pass UT * update filter long prompts * debug * clean code --------- Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add entroypoint (#1) * add training engine (#2) * add training engine * fix init * fix typs * move folders & make for two-forward pass in training loop (#4) * Add diffusion reward loop (#3) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * [fix] update customized reward func in UT (#5) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * update customized reward_fn * Update 20260109 (#8) * Update 20260109 * update * fix CI * [data] feat: Add dataset for Qwen-Image (#6) * add entroypoint (#1) * add training engine (#2) * add training engine * fix init * fix typs * move folders & make for two-forward pass in training loop (#4) * Add diffusion reward loop (#3) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * [fix] update customized reward func in UT (#5) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * update customized reward_fn * init dataset for Qwen-Image * pass UT * update return, update UT * pass UT * align with rl_dataset * pass UT * update filter long prompts * debug * clean code --------- Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com> * add new config; debug actor * debug; add reward config; add adv, policy loss * debug reward loop * init diffusers engine UT * debug * debug * deubg actor forward * debug * merge * add UT for adv and loss * pass adv&loss UTs; pass engine backward UT * clean debug code --------- Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add entroypoint (#1) * add training engine (#2) * add training engine * fix init * fix typs * move folders & make for two-forward pass in training loop (#4) * Add diffusion reward loop (#3) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * [fix] update customized reward func in UT (#5) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * update customized reward_fn * Update 20260109 (#8) * Update 20260109 * update * fix CI * [data] feat: Add dataset for Qwen-Image (#6) * add entroypoint (#1) * add training engine (#2) * add training engine * fix init * fix typs * move folders & make for two-forward pass in training loop (#4) * Add diffusion reward loop (#3) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * [fix] update customized reward func in UT (#5) * init reward; add ocr reward * update disrm input * add unit test * pass ut * fix typos/bugs * update copyright * update customized reward_fn * init dataset for Qwen-Image * pass UT * update return, update UT * pass UT * align with rl_dataset * pass UT * update filter long prompts * debug * clean code --------- Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com> * update to align verl data format * debug --------- Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
What does this PR do?
Checklist Before Starting
- PR title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`; multiple modules are written like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - For breaking changes, add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
API and Usage Example

    # Add code snippet or script demonstrating how to use this

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run the pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR relates to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
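The title convention under "Checklist Before Starting" can be checked mechanically. The sketch below is one possible validator, not the project's actual CI check: the regular expression, the name `is_valid_title`, and the handling of whitespace are assumptions, while the module and type lists are copied from the template.

```python
import re

ALLOWED_MODULES = {
    "fsdp", "megatron", "sglang", "vllm", "rollout", "trainer", "ci",
    "training_utils", "recipe", "hardware", "deployment", "ray", "worker",
    "single_controller", "misc", "perf", "model", "algo", "env", "tool",
    "ckpt", "doc", "data", "cfg", "reward",
}
ALLOWED_TYPES = {"feat", "fix", "refactor", "chore", "test"}

# Optional [BREAKING] prefix, then [modules], then "type: description".
TITLE_PATTERN = re.compile(r"^(\[BREAKING\])?\[([^\]]+)\]\s+(\w+):\s+(.+)$")


def is_valid_title(title: str) -> bool:
    """Check a PR title against the `[{modules}] {type}: {description}` format."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return False
    modules = [m.strip() for m in match.group(2).split(",")]
    return all(m in ALLOWED_MODULES for m in modules) and match.group(3) in ALLOWED_TYPES
```

For example, `is_valid_title("[BREAKING][fsdp, megatron] feat: dynamic batching")` passes, while a title without a bracketed module list is rejected.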