[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage #5297

zhtmike wants to merge 34 commits into verl-project:main
Conversation
* add training engine; fix init; fix typos
* init reward; add OCR reward; update disrm input; add unit test; pass UT; fix typos/bugs; update copyright
* update customized reward_fn
* Update 20260109; fix CI
* add entrypoint (#1); add training engine (#2); move folders & make two-forward pass in training loop (#4); Add diffusion reward loop (#3); [fix] update customized reward func in UT (#5); init dataset for Qwen-Image; align with rl_dataset; filter long prompts; clean code (Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>)
* add new config; add reward config; add advantage & policy loss; debug reward loop; init diffusers engine UT; debug actor forward; add UTs for advantage and loss; pass engine backward UT; clean debug code (Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>)
* [data] feat: Add dataset for Qwen-Image (#6); update to align with verl data format (Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>)
* add agent loop; add server manager; add single-turn loop; add test case; add replica; add vllm-omni patch; implement SDE; add custom_pipeline option in verl; fix OOM; add intermediate outputs; support inputs without mask; fix chat template (temporary fix); pass CI (single card); add sleep mode back; support passing num_inference_steps; add input_ids & attention_mask back, drop hard-coded chat template; support varlen prompt input
* update scripts; use image compressibility reward temporarily; clean unnecessary changes; add autogen; fix CI; fix weight update; add hijack; fix checkpoint loading; disable free cache engine temporarily
* support wandb val visual log; support async genrm/rule reward_loop in val; add comments
* enable reward loop; add timeout check for replica sleep; fix train script; consistent naming & fix mask; fix UT for multi-card; fix seq_len; drop sleep due to bug fix on the vllm-omni side
* fix timesteps; fix LoRA; consistent scripts; fix image size; fix pipeline parsing; add max_model_len to Qwen-Image; fix misc. bugs
* fix advantage calculation
* support sync reward for val; wake up rollout after reward in val
* fix sleep mode & non-LoRA weight update; fix from review
* update UT; fix config; fix LoRA weight exporting; revert noise; revert size; format
* fix training; update script
* merge main; fix merge; update CI
Code Review
This pull request introduces significant new functionality to support FlowGRPO training, including a new trainer, core algorithm implementations, and integrations with vLLM-Omni and diffusers. The changes are extensive and well-structured. My review primarily focuses on the new test files, where I have identified several instances of hardcoded paths. These paths make the tests non-portable and likely to fail in CI environments or for other developers. Addressing them will be crucial for the long-term maintainability and reliability of the new features.
```python
with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
    config = compose(config_name="ppo_diffusion_trainer")
...
model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")
```
The test configuration hardcodes a model path using `os.path.expanduser`. This makes the test non-portable and dependent on a specific local file structure, which will cause it to fail in CI environments or for other developers. Tests should be self-contained and not rely on external, user-specific files.
```diff
- model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")
+ # It's recommended to use a mock model or a small, downloadable test model.
+ # For example, you could use a fixture to create a temporary model directory.
+ model_path = "path/to/test/model"  # Replace with a portable path solution
```
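One portable pattern, sketched below with stdlib-only code, is to resolve the model path from an environment variable and skip the test when the model is absent. The helper name and environment-variable name here are hypothetical, not part of verl; with pytest, `pytest.skip` plays the same role as `unittest.SkipTest`.

```python
import os
import unittest

# Hypothetical helper (not a verl API): resolve a test model path from an
# environment variable, falling back to a default, and skip the test when
# the model directory is not present locally.
def resolve_model_path(env_var: str, default: str) -> str:
    path = os.path.expanduser(os.environ.get(env_var, default))
    if not os.path.isdir(path):
        raise unittest.SkipTest(f"model not found at {path}; set {env_var} to run this test")
    return path
```

This keeps a convenient default for local development while letting CI point the test at a small downloadable checkpoint, or skip cleanly when no model is available.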
```python
images_pil = (result.batch["responses"].permute(0, 2, 3, 1).numpy() * 255.0).astype("uint8")
for i, image in enumerate(images_pil):
    image_path = os.path.join(f"{i}.jpg")
    Image.fromarray(image).save(image_path)
```
This test produces side effects by saving generated images to the filesystem. Tests should be hermetic and not write files to the working directory, as this can interfere with the test environment and other tests. The TODO comment indicates this is likely temporary debugging code that should be removed.
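A hermetic version can be sketched with the stdlib alone; under pytest the `tmp_path` fixture would supply the directory, and `save_outputs` is a hypothetical helper standing in for the image-saving loop above.

```python
import tempfile
from pathlib import Path

# Hypothetical helper: write generated images into a caller-supplied
# directory instead of the current working directory.
def save_outputs(images: "list[bytes]", out_dir: Path) -> "list[Path]":
    paths = []
    for i, data in enumerate(images):
        path = out_dir / f"{i}.jpg"
        path.write_bytes(data)
        paths.append(path)
    return paths

# The temporary directory (pytest's tmp_path plays the same role) is
# removed when the context exits, so the test leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_outputs([b"fake-jpeg-0", b"fake-jpeg-1"], Path(tmp))
    assert [p.name for p in saved] == ["0.jpg", "1.jpg"]
```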
```python
]
...
sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
```
```python
rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
```
The test hardcodes model paths using os.path.expanduser, which makes it non-portable and likely to fail in CI or for other developers. Tests should be self-contained.
```diff
- rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
- reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
+ # Consider using a fixture to provide a path to a small, downloadable test model
+ # or mocking the model loading process entirely.
+ rollout_model_name = "path/to/test/rollout_model"
+ reward_model_name = "path/to/test/reward_model"
```
```python
outputs = reward_loop_manager.compute_rm_score(data)
...
for idx, output in enumerate(outputs):
    print(f"GRM Response {idx}:\n{output.non_tensor_batch['genrm_response']}\n")
    print(f"Score:\n{output.non_tensor_batch['score']}\n")
    print("=" * 50 + "\n")
```
This test function computes results but lacks assertions to verify their correctness. It only prints the output. A test without assertions does not validate the behavior of the code and can only confirm that it runs without crashing. Please add assertions to check the properties of the outputs, such as shape, type, or value ranges.
```python
# Example assertion:
# assert len(outputs) > 0
# assert "score" in outputs[0].non_tensor_batch
# assert isinstance(outputs[0].non_tensor_batch["score"], float)
```

```python
model_type="diffusion_model",
strategy=strategy,
device_count=device_count,
model="~/models/Qwen/Qwen-Image",
```
The test configuration hardcodes a model path using ~/models/Qwen/Qwen-Image. This makes the test non-portable and dependent on a specific user's local file setup. Tests should be hermetic and use mock objects or small, self-contained test artifacts.
```python
model="path/to/test/model",  # Replace with a portable path solution
```

```python
def get_ocr_data():
    # prepare test dataset
    local_folder = os.path.expanduser("~/data/ocr/")
```
```python
def test_qwen_dataset():
    tokenizer = hf_tokenizer(os.path.expanduser("~/models/Qwen/Qwen-Image/tokenizer"), trust_remote_code=True)
```
```diff
      )
      return GenerationConfig.from_model_config(config)
- except OSError:  # Not found
+ except (OSError, ValueError):  # Not found
```
What does this PR do?
Follow-up Work for #4639
- `vLLM-Omni` has been added to the rollout engine.
- `Diffusers` has been integrated as the training engine for the diffusion model.

This is currently a draft PR and contains repeated or redundant code/configurations. A pruned version will be available once it is ready for review.
Checklist Before Starting

- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
- `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`, like `[megatron, fsdp, doc]`
- `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
- If the PR is breaking, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`

Test
We use Levenshtein distance for OCR reward calculation.
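A minimal sketch of such a reward is shown below. This is an assumption-laden illustration of the metric, not verl's actual implementation, which may normalize or clip differently.

```python
# Sketch: OCR reward as normalized Levenshtein (edit) distance between the
# text rendered in the generated image and the target text.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            # deletion, insertion, substitution (free if characters match)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_reward(predicted: str, target: str) -> float:
    # Reward in [0, 1]: 1.0 for an exact OCR match, decreasing with edits.
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, target) / len(target))
```

Normalizing by the target length keeps rewards comparable across prompts with different amounts of text.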
`Qwen2.5-VL-3B` is employed as the reward model. The following figure shows the scores for the testing dataset.

API and Usage Example
```python
# Add code snippet or script demonstrating how to use this
```

Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Once your PR is ready for CI, send a message in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If this PR changes the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.