[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage #5297

Draft
zhtmike wants to merge 34 commits into verl-project:main from zhtmike:verl-omni-pr

Conversation


zhtmike commented Feb 12, 2026

What does this PR do?

Follow-up work for #4639.

  • A training script for the FlowGRPO algorithm on Qwen-Image is provided.
  • Support for vLLM-Omni has been added to the rollout engine.
  • Diffusers has been integrated as the training engine for the diffusion model.

This is currently a draft PR and contains repeated or redundant code/configurations. A pruned version will be available once it is ready for review.

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

We use the Levenshtein distance for the OCR reward calculation, with Qwen2.5-VL-3B as the reward model. The following figure shows the scores on the test dataset.

(Screenshot 2026-02-12 3:01 PM: reward scores on the test dataset)
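For intuition, a minimal sketch of a normalized Levenshtein OCR reward (a hypothetical helper for illustration, not this PR's actual reward code):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_reward(predicted: str, target: str) -> float:
    # Map edit distance into [0, 1]; 1.0 means the OCR output matches exactly.
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, target) / len(target))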

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
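As a starting point, a minimal sketch of composing the new trainer config (the config directory and config name are taken from the test snippet quoted in the review below; everything around it is still in flux in this draft):

import os

from hydra import compose, initialize_config_dir

# Compose the diffusion PPO trainer config added by this PR.
with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
    config = compose(config_name="ppo_diffusion_trainer")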

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

zhtmike and others added 30 commits January 26, 2026 09:46
* add training engine

* fix init

* fix typos
* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright
* update customized reward_fn
* Update 20260109

* update

* fix CI
* add entrypoint (#1)

* add training engine (#2)

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* [fix] update customized reward func in UT (#5)

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* Update 20260109 (#8)

* [data] feat: Add dataset for Qwen-Image (#6)

* add new config; debug actor

* debug; add reward config; add adv, policy loss

* debug reward loop

* init diffusers engine UT

* debug

* debug

* debug actor forward

* debug

* merge

* add UT for adv and loss

* pass adv&loss UTs; pass engine backward UT

* clean debug code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* update to align verl data format

* debug

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add agent loop

* add server manager

* Add single turn loop

* add test case

* add replica

* clean dummy input

* fix bugs

* fix bugs 2

* fix bugs 3

* fix bugs 4 and add vllm-omni patch

* implement sde

* add custom_pipeline option in verl

* fix some bugs in custom pipeline

* fix OOM

* add intermediate outputs

* support inputs without mask

* clean & bug fix

* rebase master

* fix some bugs

* fix chat template (temporary fix)

* fix several bugs & add custom pipeline

* fix several bugs

* fix reward loop

* pass CI (single card)

* minor fix

* fix import

* fix bugs

* fix import

* merge master

* add sleep mode back

* merge main

* support passing num_inference steps

* update according to suggestion

* align with master

* add input_id & attention_mask back, drop hard code of chat template

* support varlen prompt input
* update scripts

* fix engine name & use image compressibility temporarily

* fix some bugs

* clean unnecessary change

* fix some bugs

* fix bugs & clean configs

* add autogen

* fix CI

* clean args

* fix typo

* update script

* fix update weight

* add hijack

* fix checkpoint loading

* disable free cache engine temporarily
* support wandb val visual log; support async genrm/rule reward_loop in val

* update script

* add comment
* enable reward loop

* add timeout check for replica sleep

* fix train script

* consistent naming & fix mask

* fix UT for multi-card

* fix seq_len & clean files

* drop sleep due to bug fix in vllm-omni side
* fix bugs

* fix timesteps

* fix lora

* consistent script

* fix image size

* fix pipeline parse

* add max model len to qwen-image

* bypass bug

* fix misc. bugs
* fix bugs

* fix bugs

* fix advantage cal
* support sync reward for val

* wake up rollout after reward in val

* debug
* fix sleep mode & non-lora weight update

* fix from review
* fix bugs

* update UT

* fix config

* update config

* fix lora weight exporting

* revert noise

* revert size

* format
* fix training

* update script
CLAassistant commented Feb 12, 2026

CLA assistant check
All committers have signed the CLA.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces significant new functionality to support FlowGRPO training, including a new trainer, core algorithm implementations, and integrations with vLLM-Omni and diffusers. The changes are extensive and well-structured. My review primarily focuses on the new test files, where I've identified several instances of hardcoded paths. These paths make the tests non-portable and likely to fail in CI environments or for other developers. Addressing these will be crucial for ensuring the long-term maintainability and reliability of the new features.

with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
    config = compose(config_name="ppo_diffusion_trainer")

model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")

high

The test configuration hardcodes a model path using os.path.expanduser. This makes the test non-portable and dependent on a specific local file structure, which will cause it to fail in CI environments or for other developers. Tests should be self-contained and not rely on external, user-specific files.

Suggested change
model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")
# It's recommended to use a mock model or a small, downloadable test model.
# For example, you could use a fixture to create a temporary model directory.
model_path = "path/to/test/model" # Replace with a portable path solution
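One portable pattern is to resolve the path from the environment and skip when it is absent (a sketch; VERL_TEST_MODEL_DIR is a hypothetical variable, not an existing convention in this repo):

import os

import pytest

# Hypothetical env var; skipping keeps the test hermetic on machines without the weights.
model_path = os.environ.get("VERL_TEST_MODEL_DIR")
if not model_path or not os.path.isdir(model_path):
    pytest.skip("set VERL_TEST_MODEL_DIR to a local Qwen-Image checkpoint to run this test")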

Comment on lines +125 to +128
images_pil = (result.batch["responses"].permute(0, 2, 3, 1).numpy() * 255.0).astype("uint8")
for i, image in enumerate(images_pil):
    image_path = os.path.join(f"{i}.jpg")
    Image.fromarray(image).save(image_path)

high

This test produces side effects by saving generated images to the filesystem. Tests should be hermetic and not write files to the working directory, as this can interfere with the test environment and other tests. The TODO comment indicates this is likely temporary debugging code that should be removed.
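A hermetic alternative is pytest's built-in tmp_path fixture (a self-contained sketch; the random array stands in for the rollout outputs):

import numpy as np
from PIL import Image

def test_save_images_hermetically(tmp_path):
    # tmp_path is a per-test temporary directory; nothing leaks into the working directory.
    images = (np.random.rand(2, 8, 8, 3) * 255.0).astype("uint8")
    for i, image in enumerate(images):
        Image.fromarray(image).save(tmp_path / f"{i}.jpg")
    assert len(list(tmp_path.glob("*.jpg"))) == 2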

]

sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")

high

The function compute_score_ocr contains a hardcoded model path using os.path.expanduser. This makes the function difficult to reuse and test in different environments. The model path should be provided as a configuration parameter rather than being hardcoded.

Suggested change
model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
model_name = model_name

Comment on lines +76 to +77
rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")

high

The test hardcodes model paths using os.path.expanduser, which makes it non-portable and likely to fail in CI or for other developers. Tests should be self-contained.

Suggested change
rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
# Consider using a fixture to provide a path to a small, downloadable test model
# or mocking the model loading process entirely.
rollout_model_name = "path/to/test/rollout_model"
reward_model_name = "path/to/test/reward_model"

Comment on lines +105 to +110
outputs = reward_loop_manager.compute_rm_score(data)

for idx, output in enumerate(outputs):
    print(f"GRM Response {idx}:\n{output.non_tensor_batch['genrm_response']}\n")
    print(f"Score:\n{output.non_tensor_batch['score']}\n")
    print("=" * 50 + "\n")

high

This test function computes results but lacks assertions to verify their correctness. It only prints the output. A test without assertions does not validate the behavior of the code and can only confirm that it runs without crashing. Please add assertions to check the properties of the outputs, such as shape, type, or value ranges.

    # Example assertion:
    # assert len(outputs) > 0
    # assert "score" in outputs[0].non_tensor_batch
    # assert isinstance(outputs[0].non_tensor_batch["score"], float)

model_type="diffusion_model",
strategy=strategy,
device_count=device_count,
model="~/models/Qwen/Qwen-Image",

high

The test configuration hardcodes a model path using ~/models/Qwen/Qwen-Image. This makes the test non-portable and dependent on a specific user's local file setup. Tests should be hermetic and use mock objects or small, self-contained test artifacts.

        model="path/to/test/model", # Replace with a portable path solution


def get_ocr_data():
    # prepare test dataset
    local_folder = os.path.expanduser("~/data/ocr/")

high

The test hardcodes a data path using os.path.expanduser. This makes the test non-portable and reliant on a specific local directory structure. Test data should be created programmatically within the test or included as a small test artifact.
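For example, the sample could be synthesized on the fly (a sketch; the fixture name and record layout are illustrative, not the dataset's actual schema):

from PIL import Image, ImageDraw

def make_ocr_fixture(tmp_path):
    # Render a tiny image with known text instead of reading ~/data/ocr/.
    img = Image.new("RGB", (64, 32), "white")
    ImageDraw.Draw(img).text((2, 10), "hi", fill="black")
    path = tmp_path / "sample.png"
    img.save(path)
    return [{"image": str(path), "text": "hi"}]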



def test_qwen_dataset():
    tokenizer = hf_tokenizer(os.path.expanduser("~/models/Qwen/Qwen-Image/tokenizer"), trust_remote_code=True)

high

This test hardcodes the path to the tokenizer using os.path.expanduser. This makes the test non-portable and will cause it to fail for other users or in CI environments. Please use a mock tokenizer or a small, self-contained test artifact.

    )
    return GenerationConfig.from_model_config(config)
-    except OSError:  # Not found
+    except (OSError, ValueError):  # Not found

high

Catching ValueError here is a good addition, as AutoConfig.from_pretrained can raise it for certain model types like diffusers that don't have a standard config.json. This makes the function more robust.
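The resulting pattern looks like this (a sketch of the surrounding function under the assumption that it wraps AutoConfig; the helper name is illustrative):

from transformers import AutoConfig, GenerationConfig

def try_get_generation_config(model_path: str):
    # Diffusers checkpoints have no standard LM config.json, so config
    # loading can raise ValueError as well as OSError.
    try:
        config = AutoConfig.from_pretrained(model_path)
        return GenerationConfig.from_model_config(config)
    except (OSError, ValueError):  # Not found
        return None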

zhtmike changed the title from "[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO-algo training" to "[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage" on Feb 12, 2026