feat: add VLMEvalKit-compatible Qwen task variants for MMMU and MMStar #1021

Merged
Luodian merged 1 commit into main from feat/qwen-vlmevalkit-prompts on Feb 8, 2026

Conversation

@Luodian (Contributor) commented on Jan 22, 2026

Summary

Adds VLMEvalKit-compatible task variants for Qwen models to address the 10-20% score gaps reported between lmms-eval and VLMEvalKit.

Problem

Users reported significant score differences when evaluating Qwen models:

  • MMMU: 10-15% lower scores in lmms-eval vs VLMEvalKit
  • MMStar: Similar gaps observed

Root cause: different prompt formatting between the two frameworks (see the sketch after this list):

  • lmms-eval default: "Options: A: option1\nB: option2"
  • VLMEvalKit: "Question: ...\nOptions:\nA. option1\nB. option2\nAnswer with the option letter only."
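
For concreteness, a minimal Python sketch of the two string shapes described above (the helper names are hypothetical; the actual formatting lives in each task's doc_to_text function):

def format_lmms_eval_default(question, options):
    # lmms-eval default: options joined inline after "Options: ",
    # labeled "A:", "B:", ...
    labeled = "\n".join(f"{chr(65 + i)}: {opt}" for i, opt in enumerate(options))
    return f"{question}\nOptions: {labeled}"

def format_vlmevalkit_style(question, options):
    # VLMEvalKit style: explicit "Question:" prefix, "A."-style labels,
    # and a closing instruction to answer with the letter only
    labeled = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (
        f"Question: {question}\nOptions:\n{labeled}\n"
        "Answer with the option letter only."
    )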

Changes

Added new task variants (a config sketch follows below):

mmmu_val_qwen (lmms_eval/tasks/mmmu/mmmu_val_qwen.yaml)

  • Uses pre_prompt: "Question: "
  • Uses post_prompt: "Answer with the option letter only."
  • Format: qwen3_vl for proper prompt construction

mmstar_qwen (lmms_eval/tasks/mmstar/mmstar_qwen.yaml)

  • Same VLMEvalKit-style prompts
  • Maintains compatibility with existing metrics
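
As a rough sketch only (not the merged files verbatim), the variants would likely follow the existing task YAML conventions in this repo; the include and lmms_eval_specific_kwargs keys below are assumptions based on other lmms-eval task configs:

# Hypothetical sketch of lmms_eval/tasks/mmmu/mmmu_val_qwen.yaml;
# exact keys may differ in the merged file.
include: mmmu_val.yaml
task: mmmu_val_qwen
lmms_eval_specific_kwargs:
  default:
    pre_prompt: "Question: "
    post_prompt: "Answer with the option letter only."

mmstar_qwen would mirror this, including the base mmstar.yaml and overriding the same two prompt keys.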

Usage

# Use VLMEvalKit-compatible prompts for Qwen models
python -m lmms_eval --model qwen2_5_vl \
  --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct \
  --tasks mmmu_val_qwen,mmstar_qwen

Related Issues

Addresses #935, #932, #881, #901

Testing

  • Verified YAML syntax validity
  • Verified task registration works (see the snippet after this list)
  • Prompt format matches VLMEvalKit implementation
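
To reproduce the registration check, listing tasks and filtering for the new names should suffice (assuming lmms-eval supports the same --tasks list convention as lm-evaluation-harness):

# Confirm the new variants are registered
python -m lmms_eval --tasks list | grep -E "mmmu_val_qwen|mmstar_qwen"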

Commit message:

Add new task variants that use VLMEvalKit-style prompt formatting:
- mmmu_val_qwen: Uses 'Question: {q}' prefix and 'Answer with the option letter only.' suffix
- mmstar_qwen: Uses same VLMEvalKit-compatible prompt structure

These variants help users reproduce benchmark scores closer to official Qwen
results reported in VLMEvalKit evaluations.

Usage:
  python -m lmms_eval --model qwen2_5_vl --tasks mmmu_val_qwen,mmstar_qwen ...

Addresses score reproduction gaps reported in Issues #935, #932, #881, #901
@kcz358 (Collaborator) commented on Jan 23, 2026

This change seems to be a duplicate of #907.

@kcz358 (Collaborator) commented on Jan 23, 2026

There is a small mismatch with the open-ended prompt, but the motivation is the same as #929. This version seems to make more sense to me.

@Lewis-Lu commented

Hi there,

Using the lmms-eval main branch (commit 714f4fed7), the results for Qwen3-VL-4B-Instruct are as follows:

[screenshot of evaluation results]

The MMStar score of 0.6216 still shows a roughly 7-point gap compared to the officially reported 0.698.

Is there any suggestion, or a further merge that could resolve this issue?

Best,
Lewis

Luodian merged commit c39b6c4 into main on Feb 8, 2026
6 checks passed
Luodian deleted the feat/qwen-vlmevalkit-prompts branch on February 8, 2026 at 03:57
Luodian added a commit that referenced this pull request on Feb 28, 2026: feat: add VLMEvalKit-compatible Qwen task variants for MMMU and MMStar (#1021)