[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage #5297

Draft
zhtmike wants to merge 34 commits into verl-project:main from zhtmike:verl-omni-pr

Conversation


zhtmike commented Feb 12, 2026

What does this PR do?

Follow-up work for #4639.

  • A training script for the FlowGRPO algorithm on Qwen-Image is provided.
  • Support for vLLM-Omni has been added to the rollout engine.
  • Diffusers has been integrated as the training engine for the diffusion model.

This is currently a draft PR and contains repeated or redundant code/configurations. A pruned version will be available once it is ready for review.

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

We use the Levenshtein distance for the OCR reward calculation, with Qwen2.5-VL-3B as the reward model. The following figure shows the scores on the test dataset.

(Screenshot 2026-02-12 3:01 PM: reward scores on the test dataset)
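For intuition, a minimal sketch of a normalized Levenshtein OCR reward (a hypothetical helper for illustration, not this PR's actual reward code):

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ocr_reward(predicted: str, target: str) -> float:
    # Map edit distance into [0, 1]; 1.0 means the OCR output matches exactly.
    if not target:
        return 1.0 if not predicted else 0.0
    return max(0.0, 1.0 - levenshtein(predicted, target) / len(target))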

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
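As a starting point, a minimal sketch of composing the new trainer config (the config directory and config name are taken from the test snippet quoted in the review below; everything around it is still in flux in this draft):

import os

from hydra import compose, initialize_config_dir

# Compose the diffusion PPO trainer config added by this PR.
with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
    config = compose(config_name="ppo_diffusion_trainer")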

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

zhtmike and others added 30 commits January 26, 2026 09:46
* add training engine

* fix init

* fix typos
* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright
* update customized reward_fn
* Update 20260109

* update

* fix CI
* add entrypoint (#1)

* add training engine (#2)

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* [fix] update customized reward func in UT (#5)

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* Update 20260109 (#8)

* [data] feat: Add dataset for Qwen-Image (#6)

* add new config; debug actor

* debug; add reward config; add adv, policy loss

* debug reward loop

* init diffusers engine UT

* debug

* debug

* debug actor forward

* debug

* merge

* add UT for adv and loss

* pass adv&loss UTs; pass engine backward UT

* clean debug code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* update to align verl data format

* debug

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
* add agent loop

* add server manager

* Add single turn loop

* add test case

* add replica

* clean dummy input

* fix bugs

* fix bugs 2

* fix bugs 3

* fix bugs 4 and add vllm-omni patch

* implement sde

* add custom_pipeline option in verl

* fix some bugs in custom pipeline

* fix OOM

* add intermediate outputs

* support inputs without mask

* clean & bug fix

* rebase master

* fix some bugs

* fix chat template (temporary fix)

* fix several bugs & add custom pipeline

* fix several bugs

* fix reward loop

* pass CI (single card)

* minor fix

* fix import

* fix bugs

* fix import

* merge master

* add sleep mode back

* merge main

* support passing num_inference steps

* update according to suggestion

* align with master

* add input_id & attention_mask back, drop hard code of chat template

* support varlen prompt input
* update scripts

* fix engine name & use image compressibility temporarily

* fix some bugs

* clean unnecessary change

* fix some bugs

* fix bugs & clean configs

* add autogen

* fix CI

* clean args

* fix typo

* update script

* fix update weight

* add hijack

* fix checkpoint loading

* disable free cache engine temporarily
* support wandb val visual log; support async genrm/rule reward_loop in val

* update script

* add comment
* enable reward loop

* add timeout check for replica sleep

* fix train script

* consistent naming & fix mask

* fix UT for multi-card

* fix seq_len & clean files

* drop sleep due to bug fix in vllm-omni side
* fix bugs

* fix timesteps

* fix lora

* consistent script

* fix image size

* fix pipeline parse

* add max model len to qwen-image

* bypass bug

* fix misc. bugs
* fix bugs

* fix bugs

* fix advantage cal
* support sync reward for val

* wake up rollout after reward in val

* debug
* fix sleep mode & non-lora weight update

* fix from review
* fix bugs

* update UT

* fix config

* update config

* fix lora weight exporting

* revert noise

* revert size

* format
* fix training

* update script
CLAassistant commented Feb 12, 2026

CLA assistant check
All committers have signed the CLA.

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces significant new functionality to support FlowGRPO training, including a new trainer, core algorithm implementations, and integrations with vLLM-Omni and diffusers. The changes are extensive and well-structured. My review primarily focuses on the new test files, where I've identified several instances of hardcoded paths. These paths make the tests non-portable and likely to fail in CI environments or for other developers. Addressing these will be crucial for ensuring the long-term maintainability and reliability of the new features.

with initialize_config_dir(config_dir=os.path.abspath("verl/trainer/config")):
    config = compose(config_name="ppo_diffusion_trainer")

model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")

high

The test configuration hardcodes a model path using os.path.expanduser. This makes the test non-portable and dependent on a specific local file structure, which will cause it to fail in CI environments or for other developers. Tests should be self-contained and not rely on external, user-specific files.

Suggested change
model_path = os.path.expanduser("~/models/Qwen/Qwen-Image")
# It's recommended to use a mock model or a small, downloadable test model.
# For example, you could use a fixture to create a temporary model directory.
model_path = "path/to/test/model" # Replace with a portable path solution
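One portable pattern is to resolve the path from the environment and skip when it is absent (a sketch; VERL_TEST_MODEL_DIR is a hypothetical variable, not an existing convention in this repo):

import os

import pytest

# Hypothetical env var; skipping keeps the test hermetic on machines without the weights.
model_path = os.environ.get("VERL_TEST_MODEL_DIR")
if not model_path or not os.path.isdir(model_path):
    pytest.skip("set VERL_TEST_MODEL_DIR to a local Qwen-Image checkpoint to run this test")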

Comment on lines +125 to +128
images_pil = (result.batch["responses"].permute(0, 2, 3, 1).numpy() * 255.0).astype("uint8")
for i, image in enumerate(images_pil):
    image_path = os.path.join(f"{i}.jpg")
    Image.fromarray(image).save(image_path)

high

This test produces side effects by saving generated images to the filesystem. Tests should be hermetic and not write files to the working directory, as this can interfere with the test environment and other tests. The TODO comment indicates this is likely temporary debugging code that should be removed.
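A hermetic alternative is pytest's built-in tmp_path fixture (a self-contained sketch; the random array stands in for the rollout outputs):

import numpy as np
from PIL import Image

def test_save_images_hermetically(tmp_path):
    # tmp_path is a per-test temporary directory; nothing leaks into the working directory.
    images = (np.random.rand(2, 8, 8, 3) * 255.0).astype("uint8")
    for i, image in enumerate(images):
        Image.fromarray(image).save(tmp_path / f"{i}.jpg")
    assert len(list(tmp_path.glob("*.jpg"))) == 2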

]

sampling_params = {"temperature": 0.7, "top_p": 0.8, "max_tokens": 4096}
model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")

high

The function compute_score_ocr contains a hardcoded model path using os.path.expanduser. This makes the function difficult to reuse and test in different environments. The model path should be provided as a configuration parameter rather than being hardcoded.

Suggested change
model_name = model_name or os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
model_name = model_name

Comment on lines +76 to +77
rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")

high

The test hardcodes model paths using os.path.expanduser, which makes it non-portable and likely to fail in CI or for other developers. Tests should be self-contained.

Suggested change
rollout_model_name = os.path.expanduser("~/models/Qwen/Qwen-Image")
reward_model_name = os.path.expanduser("~/models/Qwen/Qwen2.5-VL-3B-Instruct")
# Consider using a fixture to provide a path to a small, downloadable test model
# or mocking the model loading process entirely.
rollout_model_name = "path/to/test/rollout_model"
reward_model_name = "path/to/test/reward_model"

Comment on lines +105 to +110
outputs = reward_loop_manager.compute_rm_score(data)

for idx, output in enumerate(outputs):
    print(f"GRM Response {idx}:\n{output.non_tensor_batch['genrm_response']}\n")
    print(f"Score:\n{output.non_tensor_batch['score']}\n")
    print("=" * 50 + "\n")

high

This test function computes results but lacks assertions to verify their correctness. It only prints the output. A test without assertions does not validate the behavior of the code and can only confirm that it runs without crashing. Please add assertions to check the properties of the outputs, such as shape, type, or value ranges.

    # Example assertion:
    # assert len(outputs) > 0
    # assert "score" in outputs[0].non_tensor_batch
    # assert isinstance(outputs[0].non_tensor_batch["score"], float)

model_type="diffusion_model",
strategy=strategy,
device_count=device_count,
model="~/models/Qwen/Qwen-Image",

high

The test configuration hardcodes a model path using ~/models/Qwen/Qwen-Image. This makes the test non-portable and dependent on a specific user's local file setup. Tests should be hermetic and use mock objects or small, self-contained test artifacts.

        model="path/to/test/model", # Replace with a portable path solution


def get_ocr_data():
    # prepare test dataset
    local_folder = os.path.expanduser("~/data/ocr/")

high

The test hardcodes a data path using os.path.expanduser. This makes the test non-portable and reliant on a specific local directory structure. Test data should be created programmatically within the test or included as a small test artifact.
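For example, the sample could be synthesized on the fly (a sketch; the fixture name and record layout are illustrative, not the dataset's actual schema):

from PIL import Image, ImageDraw

def make_ocr_fixture(tmp_path):
    # Render a tiny image with known text instead of reading ~/data/ocr/.
    img = Image.new("RGB", (64, 32), "white")
    ImageDraw.Draw(img).text((2, 10), "hi", fill="black")
    path = tmp_path / "sample.png"
    img.save(path)
    return [{"image": str(path), "text": "hi"}]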



def test_qwen_dataset():
    tokenizer = hf_tokenizer(os.path.expanduser("~/models/Qwen/Qwen-Image/tokenizer"), trust_remote_code=True)

high

This test hardcodes the path to the tokenizer using os.path.expanduser. This makes the test non-portable and will cause it to fail for other users or in CI environments. Please use a mock tokenizer or a small, self-contained test artifact.

    )
    return GenerationConfig.from_model_config(config)
-    except OSError:  # Not found
+    except (OSError, ValueError):  # Not found

high

Catching ValueError here is a good addition, as AutoConfig.from_pretrained can raise it for certain model types like diffusers that don't have a standard config.json. This makes the function more robust.
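The resulting pattern looks like this (a sketch of the surrounding function under the assumption that it wraps AutoConfig; the helper name is illustrative):

from transformers import AutoConfig, GenerationConfig

def try_get_generation_config(model_path: str):
    # Diffusers checkpoints have no standard LM config.json, so config
    # loading can raise ValueError as well as OSError.
    try:
        config = AutoConfig.from_pretrained(model_path)
        return GenerationConfig.from_model_config(config)
    except (OSError, ValueError):  # Not found
        return None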

zhtmike changed the title from "[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO-algo training" to "[fsdp,trainer,vllm_omni,algo] feat: support FlowGRPO training for QwenImage" on Feb 12, 2026