
[data] feat: Add dataset for Qwen-Image #6

Merged

zhtmike merged 15 commits into zhtmike:verl-omni from chenyingshu:verl-omni-data on Jan 9, 2026

Conversation

@chenyingshu

What does this PR do?

Add a QwenDataset class for Qwen-Image.
Add unit tests for the dataset, covering the dataloader and the dataset configs.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
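
A minimal usage sketch, assuming the constructor documented in the new docstring (data_files, tokenizer, config). The tokenizer checkpoint, the config values, and the one-prompt-per-line file format shown here are illustrative assumptions, not part of this PR:

from omegaconf import OmegaConf
from transformers import AutoTokenizer

from verl.utils.dataset import QwenDataset  # exported via verl/utils/dataset/__init__.py

# Illustrative tokenizer checkpoint; any PreTrainedTokenizer should work here.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Config fields mirror the attributes referenced in this PR
# (filter_overlong_prompts, max_prompt_length, max_samples); the values are assumptions.
config = OmegaConf.create(
    {
        "max_prompt_length": 1024,
        "filter_overlong_prompts": True,
        "max_samples": -1,  # non-positive keeps all prompts
    }
)

# data_files is assumed to be a plain-text file with one prompt per line.
dataset = QwenDataset(data_files="prompts.txt", tokenizer=tokenizer, config=config)
sample = dataset[0]
print(sample["input_ids"])  # "input_ids" is the key the updated unit test checks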

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all of the following items before requesting a review; otherwise, the reviewer may deprioritize this PR.

zhtmike and others added 7 commits January 6, 2026 16:04
* add training engine

* fix init

* fix typs
* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright
* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn
@chenyingshu changed the title from Add dataset for Qwen-Image to [data] Add dataset for Qwen-Image on Jan 9, 2026
@zhtmike requested a review from Copilot on January 9, 2026 at 03:06
@chenyingshu changed the title from [data] Add dataset for Qwen-Image to [data] feat: Add dataset for Qwen-Image on Jan 9, 2026
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
@zhtmike (Owner) Jan 9, 2026

no need # ============================================================================


Copilot AI left a comment

Pull request overview

This PR adds a new QwenDataset class for handling text prompts in Qwen-Image models, particularly for text-guided vision generation tasks. The dataset supports loading prompts from text files, tokenization with configurable templates, and extraction of ground truth data for OCR tasks.

Key changes:

  • New dataset implementation with prompt filtering, truncation, and tokenization support
  • Integration with existing reward loop for diffusion models
  • Unit tests for dataset functionality and dataloader compatibility

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 13 comments.

Show a summary per file

  • verl/utils/dataset/qwen_dataset.py: new dataset class for Qwen-Image with prompt loading, tokenization, and ground-truth extraction
  • verl/utils/dataset/__init__.py: export the new QwenDataset class
  • verl/experimental/reward_loop/reward_manager/diffusion.py: remove unused commented-out code
  • tests/utils/dataset/test_qwen_dataset_on_cpu.py: unit tests for QwenDataset covering basic functionality and the max_samples parameter
  • tests/experimental/reward_loop/test_diffusion_reward_model_genrm.py: update the test to use the "input_ids" key instead of "prompts"


Args:
    data_files (str): Path to the text file containing prompts.
    tokenizer (PreTrainedTokenizer): Tokenizer to tokenize the prompts.
    config (OmegaConf): the data config.
Copilot AI Jan 9, 2026

The docstring says the parameter type is OmegaConf but the actual type hint is DictConfig. Consider updating the docstring to match the type hint for consistency.

Suggested change
config (OmegaConf): the data config.
config (DictConfig): the data config.

@chenyingshu (Author)

fixed

Comment on lines 61 to 65
if self.filter_overlong_prompts:
    self.prompts = [x for x in self.prompts if len(x) <= self.max_prompt_length]

if self.max_samples > 0 and self.max_samples < len(self.prompts):
    self.prompts = self.prompts[: self.max_samples]
Copilot AI Jan 9, 2026

The order of operations could be optimized. Currently, the code filters overlong prompts (line 62) before applying max_samples (lines 64-65). If filter_overlong_prompts significantly reduces the dataset size, the final dataset might have fewer samples than max_samples. Consider applying max_samples before filtering to ensure you get the requested number of samples, or document this behavior clearly.
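
For reference, a sketch of the reordering described above, using the attribute names from the quoted snippet. Note that in either order the final dataset can end up smaller than max_samples, so whichever behavior is kept should be documented:

# Alternative ordering: cap the number of samples first, then drop overlong prompts.
# Filtering afterwards can still shrink the result below max_samples, which is the
# trade-off the review comment asks to document.
if self.max_samples > 0:
    self.prompts = self.prompts[: self.max_samples]

if self.filter_overlong_prompts:
    self.prompts = [x for x in self.prompts if len(x) <= self.max_prompt_length]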

@zhtmike (Owner) commented Jan 9, 2026

If possible, let's keep the original configurations and return values from https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L70 and show only the minimal changes needed to adapt to QwenDataset.

@chenyingshu (Author)

If possible, let's keep the original configurations and return values from https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L70 and show only the minimal changes needed to adapt to QwenDataset.

updated returns
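
For context, a rough sketch of an rl_dataset-style sample, written as a standalone class so it runs on its own. Only the "input_ids" key is confirmed by this PR (via the updated test); the other keys, the class name, and the tokenizer call are assumptions for illustration:

from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer


class SketchPromptDataset(Dataset):
    """Illustrative only: returns per-sample dicts keyed the way the updated test expects."""

    def __init__(self, prompts: list[str], tokenizer: PreTrainedTokenizer, max_prompt_length: int = 1024):
        self.prompts = prompts
        self.tokenizer = tokenizer
        self.max_prompt_length = max_prompt_length

    def __len__(self) -> int:
        return len(self.prompts)

    def __getitem__(self, index: int) -> dict:
        prompt = self.prompts[index]
        encoded = self.tokenizer(
            prompt,
            max_length=self.max_prompt_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {
            "input_ids": encoded["input_ids"][0],            # key used by the updated unit test
            "attention_mask": encoded["attention_mask"][0],  # assumed companion field
            "raw_prompt": prompt,                            # assumed, e.g. for OCR reward computation
        }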

def maybe_filter_out_long_prompts(self, prompts: list):
    # filter out too long prompts
    if self.filter_overlong_prompts:
        prompts = [x for x in prompts if len(x) <= self.max_prompt_length]
@zhtmike (Owner) Jan 9, 2026

I think the official filter_out_long_prompts is based on the length of the tokenized prompt (the token IDs), not the length of the raw string.

@chenyingshu (Author)

done
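
A minimal sketch of the token-length-based filtering agreed on above, written as a standalone helper using the standard Hugging Face tokenizer call; the function name and signature here are hypothetical, the dataset itself keeps its own method:

from transformers import PreTrainedTokenizer


def filter_out_long_prompts(
    prompts: list[str],
    tokenizer: PreTrainedTokenizer,
    max_prompt_length: int,
) -> list[str]:
    """Keep only prompts whose tokenized length fits within max_prompt_length."""
    return [
        p
        for p in prompts
        if len(tokenizer(p, add_special_tokens=False)["input_ids"]) <= max_prompt_length
    ]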

@zhtmike merged commit ef70047 into zhtmike:verl-omni on Jan 9, 2026
4 checks passed
zhtmike added a commit that referenced this pull request Jan 26, 2026
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
zhtmike added a commit that referenced this pull request Jan 27, 2026
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* Update 20260109 (#8)

* Update 20260109

* update

* fix CI

* [data] feat: Add dataset for Qwen-Image (#6)

* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>

* add new config; debug actor

* debug; add reward config; add adv, policy loss

* debug reward loop

* init diffusers engine UT

* debug

* debug

* deubg actor forward

* debug

* merge

* add UT for adv and loss

* pass adv&loss UTs; pass engine backward UT

* clean debug code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>
@chenyingshu deleted the verl-omni-data branch on January 29, 2026 at 07:26
zhtmike added a commit that referenced this pull request Jan 29, 2026
* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* Update 20260109 (#8)

* Update 20260109

* update

* fix CI

* [data] feat: Add dataset for Qwen-Image (#6)

* add entroypoint (#1)

* add training engine (#2)

* add training engine

* fix init

* fix typs

* move folders & make for two-forward pass in training loop (#4)

* Add diffusion reward loop (#3)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* [fix] update customized reward func in UT (#5)

* init reward; add ocr reward

* update disrm input

* add unit test

* pass ut

* fix typos/bugs

* update copyright

* update customized reward_fn

* init dataset for Qwen-Image

* pass UT

* update return, update UT

* pass UT

* align with rl_dataset

* pass UT

* update filter long prompts

* debug

* clean code

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>

* update to align verl data format

* debug

---------

Co-authored-by: Cheung Ka Wai <zhtmike@gmail.com>