
fix(tasks): use submission metric for reasoning test splits without ground truth #1213

Closed

kcz358 wants to merge 1724 commits into dev-v0d7 from fix/submission

Conversation

@kcz358 (Collaborator) commented Feb 28, 2026

Summary

  • Fixed MMBench reasoning test splits to use submission metric instead of acc_score (test splits have no ground truth answers)
  • Added InfoVQA reasoning test split with submission metric
  • Added DocVQA reasoning test split with submission metric
  • All test splits now extract answers from <answer> tags for submission files
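The `<answer>`-tag extraction described above can be sketched as a small helper. This is an illustrative version (the PR's actual parsing code may differ): take the text inside the first `<answer>...</answer>` pair, falling back to the raw response when the model omitted the tags.

```python
import re

def extract_answer(response: str) -> str:
    """Pull the final answer out of an <answer>...</answer> tag.

    Falls back to the stripped raw response when no tag is present,
    so malformed outputs still produce a submission entry.
    (Hypothetical helper, not the PR's exact implementation.)
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()
```

The `re.DOTALL` flag lets the answer span multiple lines, which matters for reasoning traces that wrap their final answer in a tagged block.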

Changes

MMBench Reasoning

  • Updated mmbench_cn_test_reasoning.yaml and mmbench_en_test_reasoning.yaml to use the submission metric
  • Added mmbench_process_results_test() to extract answers from <answer> tags
  • Added mmbench_aggregate_test_results_cn() and mmbench_aggregate_test_results_en() to save Excel submissions
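An aggregate function for an Excel submission typically collects the per-sample prediction dicts and writes them out with pandas. The sketch below assumes pandas and an Excel writer backend are installed; the column names are assumptions, not MMBench's exact submission schema:

```python
import pandas as pd

def build_submission_df(results: list) -> pd.DataFrame:
    """Arrange per-sample prediction dicts into a fixed column order.
    (Column names are illustrative, not the server's exact schema.)"""
    return pd.DataFrame(results, columns=["index", "question", "prediction"])

def aggregate_test_results(results: list, path: str) -> None:
    """Save the Excel submission file (requires an engine such as openpyxl)."""
    build_submission_df(results).to_excel(path, index=False)
```

Separating the DataFrame construction from the file write keeps the schema logic testable without touching the filesystem.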

InfoVQA Reasoning

  • Created infovqa_test_reasoning.yaml with submission metric
  • Created infovqa_reasoning.yaml group with val + test
  • Added infovqa_reasoning_test_process_results() and infovqa_reasoning_test_aggregate_results()

DocVQA Reasoning

  • Created docvqa/reasoning/ directory with full reasoning implementation
  • Created docvqa_val_reasoning.yaml and docvqa_test_reasoning.yaml
  • Created docvqa_reasoning.yaml group with val + test
  • Added docvqa_reasoning_test_process_results() and docvqa_reasoning_test_aggregate_results()

All reasoning test splits now correctly use submission format for benchmarks without ground truth answers.

kcz358 and others added 30 commits October 29, 2025 14:33
* feat(mindcube): Add YAML configurations and utility functions for MindCube tasks

* refactor(mindcube): Enhance docstrings and improve code readability in utils.py

* feat(mindcube): Introduce _default_template_yaml and refactor task YAML files to include shared configurations
Change the doc_to_visual() function for the Karpathy test to coco_doc_to_visual_karpathy()
The previous code removed the image key from the sample dict in an unsafe way: because the same image is reused multiple times during evaluation, the script failed, specifically during distributed evaluation across several GPUs. The new function removes the key from a copy rather than from the original dict.
* add qwen3vl huggingface (no sglang/vllm)

* fix handle batch_size > 1 with padding_side='left'

---------

Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* add UEval

* chore: apply pre-commit fixes

---------

Co-authored-by: LB <libo81501@gmail.com>
* add MME-SCI

* update config and path

* update utils.py
* Add SciVideoBench benchmark to lmms-eval

* style: apply black/isort formatting fixes

* fix: update scivideobench HF integration
Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* Add try catch for longvila

* Fix gqa doc id key error for more robustness

* Revise PR template

* Lint
* add MME-SCI

* update config and path

* update utils.py

* update utils.py

* add MME-SCI
add an OpenCompass version of MMStar with the official Qwen3 prompt template
* Fix simple_parse_args_strings Function in utils.py

* Apply black

* Separate _smart_comma_split function

* Apply black
…icial results (#912)

* Update VideoMME with Qwen3VL prompt

* update Qwen3VL to better handle qwen-vl-utils params
Handle empty response lists by returning an empty string.
* Add reasoning utils

* Add charxiv reasoning version

* Add reasoning tasks for images

* Add text reasoning task

* Lint
* update mmmu with qwen3vl prompts

* update mmmu and mmmu pro with qwen3vl prompts

---------

Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* Fix tqdm bar for qwen25 vl when batch responding

* Fix qwen3 vl batch tqdm processing issue

* Lint
* Add bagel lmms-engine version for better api to transformers

* Allow bagel to load from chat messages
Luodian and others added 27 commits February 25, 2026 02:54
- Rename read_video_pyav to read_video in load_video.py with backward-compat alias
- Delete _resize_image and read_video_pyav_base64 dead functions
- Update all 12 caller files to use read_video directly
- Inline base64 encoding logic in qwen2_5_omni.py (was read_video_pyav_base64)
- Fix missing import in vila.py (latent bug)
- Remove use_custom_video_loader dead code from 5 models that declared but never checked it (qwen2_5_vl, qwen3_vl, qwen3_omni, llava_onevision1_5, huggingface)
Add docs/external_usage.md covering CLI subcommands (tasks, models,
eval wizard, ui, serve, power, version) and Python library usage
(TaskManager, datasets, evaluator, metrics). Update docs index link.
Polish v0.7 release notes for consistency.
Move benchmark task YAMLs to lmms-lab-eval datasets and centralize media path lookup so local and cached media resolve consistently across environments.
Add LRU caches for candidate root and extension-variant generation while keeping per-call filesystem existence checks. Document the optimization in the v0.7 changelog.
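The split described above — cache the pure path-expansion step, but keep existence checks per-call — can be sketched with `functools.lru_cache`. Function names and the candidate subdirectories here are illustrative, not the repo's actual lookup logic:

```python
from functools import lru_cache
from pathlib import Path
from typing import Optional

@lru_cache(maxsize=None)
def candidate_roots(base: str) -> tuple:
    """Cache the deterministic expansion of candidate media roots.
    Safe to memoize because it never touches the filesystem."""
    return (base, f"{base}/media", f"{base}/images")

def resolve(base: str, name: str) -> Optional[str]:
    """Per-call existence checks stay uncached, so files added or
    removed after startup are still observed correctly."""
    for root in candidate_roots(base):
        p = Path(root) / name
        if p.exists():  # deliberately not cached
            return str(p)
    return None
```

Memoizing only the pure step keeps repeated lookups cheap without risking stale hits when media files appear or disappear mid-run.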
Add a chat-style evaluation model for NanoVLM (SigLIP2 + MLP projector +
Qwen3-0.6B) trained with lmms-engine.

Key features:
- Async multi-GPU inference: loads model replicas on N GPUs, dispatches
  work via job queue with independent worker threads (no sync overhead)
- NanoVLM-specific image token expansion (<|image_pad|> -> 256 tokens)
- Single GPU fallback when only one device is available
- Configurable via worker_gpus/worker_count model args

Co-authored-by: Brian Li <drluodian@gmail.com>
- Add pytest configuration (pyproject.toml markers, conftest.py fixtures, __init__.py files)
- Add test_protocol.py (32 tests for ChatMessages protocol)
- Add test_construct_requests.py (23 tests for req.args tuple shapes)
- Add test_evaluator.py (15 tests for agentic flow)
- Add test_cli_dispatch_parametrized.py (17 parametrized CLI tests)
- Add test_determinism_parametrized.py (14 parametrized determinism tests)
- Add prompt_stability/ (22 tests + 11 golden snapshots for 8 benchmarks)
- Remove dead code: qwen2_5_vl/, test_reasoning_tag_stripping/, utils.py, run_cicd.py, task_input_specs/
- Mark test_usage_metrics.py with @pytest.mark.api and @pytest.mark.slow
- Replace unittest.TestCase with pure pytest functions
- Convert self.assert* to plain assert statements
- Use @pytest.mark.parametrize for role validation tests
- Apply AAA pattern (Arrange, Act, Assert) with section comments
- Rename tests to follow test_<unit>_<scenario>_<expected> pattern
- Preserve all 32 test cases and exact same coverage
- Keep module docstring and section structure
…style

- Convert all TestCase classes to plain functions
- Replace self.assert* with plain assert statements
- Use @pytest.mark.parametrize for cross-type consistency checks
- Follow test_<unit>_<scenario>_<expected> naming pattern
- Implement AAA pattern (Arrange, Act, Assert) with blank line separation
- Preserve exact same test coverage (23 tests)
- Keep module docstring and section comments
- All tests pass, pre-commit checks pass
…test style

- Remove 7 unittest.TestCase classes, convert to 15 pure pytest functions
- All test functions follow test_<unit>_<scenario>_<expected> naming pattern
- Replace self.assert* with plain assert statements
- Preserve exact same test coverage (15 tests)
- Keep helper functions and fake classes as module-level
- Maintain module docstring and section comments
- All tests pass, pre-commit checks pass
- Replace unittest.TestCase classes with pure pytest functions
- Convert self.assert* calls to plain assert statements
- Replace _get_tm() global function with @pytest.fixture(scope='module')
- Use @pytest.mark.parametrize for test variants instead of self.subTest loops
- Rename tests to follow test_<unit>_<scenario>_<expected> pattern
- Preserve all 79 tests with identical coverage
- Keep module docstring, section comments, and helper functions unchanged
- All tests pass with no functional changes
…date README

- Convert test_token_counts.py and test_efficiency_metrics.py to pure pytest style
- Remove TestDeterminismDetection from test_response_cache.py (canonical version in test_determinism_parametrized.py)
- Delete test_cli_dispatch.py (replaced by parametrized version)
- Update README: 292 tests, remove deleted file refs, fix fixture descriptions
Each test layer section now opens with a representative code snippet
showing what the tests actually verify, making the README scannable
for developers unfamiliar with the suite.
…d categorized TOC

- Add six-stage pipeline ASCII diagram mapping each stage to its doc page
- Add concrete code examples for model registration, task YAML, Python API, and caching
- Organize TOC into logical sections: Getting Started, Extending, Library Usage, Performance, Task Catalog
- Add release notes table covering v0.3 through v0.7 (v0.4 and v0.5 were previously missing)
- Improve prose: active voice, present tense, complete sentences per document-writer guidelines
…VideoBench

- Replace all Qwen2.5-VL local GPU examples with openai_compatible + gpt-4.1-mini
- Feature three recommended benchmarks: mmmu_val, video_mmmu, longvideobench_val_v
- Add task catalog table with modality and description for each benchmark
- Show multi-task evaluation command with --log_samples and --output_path
* refactor(models/chat): improve async_openai code structure and readability (#1102)

* refactor(models/chat): extract prepare_messages method

* refactor(models/chat): refactor async concurrency control and add docstrings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring

* style: auto-fix lint (black + isort)

* refactor: replace async_openai_qwen3_vl class with message_format parameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry

* refactor(models/chat): add message_format parameter to support qwen3_vl

- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags

* fix tool response tag format

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>

* docs: add MMMU eval discrepancy report and TLDR FP definitions

* fix(ci): make lint workflow fork-PR safe

* feat(tasks): add MMStar reasoning task

* refactor(tasks): merge cn and en reasoning into unified structure

- Combine cn_reasoning and en_reasoning into single reasoning directory
- Share common template yaml across both cn and en reasoning tasks
- Unified utils.py handles cn/en via DATASET_NAME environment variable
- Keep separate group files for mmbench_cn_reasoning and mmbench_en_reasoning

* refactor(tasks): unify cn and en reasoning with single group

- Remove environment variable dependency
- Add separate doc_to_text/doc_to_messages for cn and en in utils.py
- Template yaml shared, specific functions defined in task yaml
- Single mmbench_reasoning group containing both cn and en dev tasks
- Unified process_results without data_source distinction

* fix(tasks): add dataset_name to reasoning task configs

* feat(tasks): add test split for mmbench reasoning tasks

- Add mmbench_cn_test_reasoning and mmbench_en_test_reasoning
- Add test_split to dev reasoning configs
- Update mmbench_reasoning group to include all four tasks

* feat(tasks): add MME-RealWorld reasoning tasks

- Add mme_realworld_reasoning (en) and mme_realworld_cn_reasoning (cn)
- Include doc_to_messages for both languages with reasoning prompts
- Support accuracy and format scoring metrics

* feat(tasks): add SEED-Bench reasoning tasks

- Add seedbench_reasoning with doc_to_messages for reasoning format
- Add seedbench_2_plus_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics for both benchmarks

* feat(tasks): add CV-Bench reasoning tasks

- Add cv_bench_reasoning, cv_bench_2d_reasoning, cv_bench_3d_reasoning
- Include doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* fix(reasoning): improve MCQ matching with normalized comparison

- Apply parse_mcq to ground_truth for consistency
- Use case-insensitive comparison for MCQ answers
- Strip whitespace for more robust matching
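The three normalization steps listed in this commit can be sketched as a pair of small functions. This is an illustrative parser, not the PR's exact `parse_mcq` implementation:

```python
def parse_mcq(text: str) -> str:
    """Normalize an MCQ answer to a bare uppercase option letter:
    strip whitespace, drop trailing ')' or '.', and uppercase.
    (Illustrative; the real parser may handle more variants.)"""
    return text.strip().rstrip(".)").strip().upper()

def mcq_match(prediction: str, ground_truth: str) -> bool:
    """Apply the same normalization to BOTH sides, then compare,
    as the commit describes (parse_mcq on ground truth too)."""
    return parse_mcq(prediction) == parse_mcq(ground_truth)
```

Normalizing the ground truth with the same function as the prediction is the key consistency fix: otherwise "B)" vs "b" would count as a miss.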

* feat(tasks): add OCR-Bench reasoning task

- Add ocrbench_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add ChartQA reasoning task

- Add chartqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add InfoVQA reasoning task

- Add infovqa_val_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA reasoning task

- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA benchmark

- Add countbenchqa task config and utils
- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add VStar-Bench reasoning tasks

- Add vstar_bench_reasoning with doc_to_messages for reasoning format
- Add vstar_bench_direct_attributes_reasoning
- Add vstar_bench_relative_position_reasoning
- Support accuracy and format scoring metrics

* feat(tasks): add PixMo-Count benchmark

- Add pixmo_count task config and utils
- Add pixmo_count_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(models): add system_prompt_file support to AsyncOpenAIChat

- Allow loading system prompt from file via system_prompt_file parameter
- Add _apply_system_prompt method to inject system prompt into messages
- Apply system prompt before generation in generate_until

* style: auto-fix lint (black + isort)

* refactor(reasoning): extract acc_score computation to separate function

Extracted accuracy reward logic from compute_score into acc_reward function
for better separation of concerns.

* Fix async oai rebase error

* Lint

* refactor(reasoning): add model-side system_prompt support and deduplicate reasoning task utils

- Add _resolve_system_prompt() and _apply_system_prompt() to base lmms class
  for model-side system prompt injection (supports file paths and literal strings)
- Add factory functions make_reasoning_doc_to_messages() and
  make_reasoning_process_results() to reasoning_utils.py, eliminating ~400 lines
  of copy-paste across 12 reasoning task modules
- Update AsyncOpenAIChat: replace system_prompt_file with system_prompt using
  base class utilities, remove duplicate _apply_system_prompt method
- Wire up HuggingFace chat model to inject system_prompt into messages during
  generation (opt-in only, default None to avoid overwriting task-level prompts)
- Fix infovqa reasoning: anls(ground_truth, results) -> anls(ground_truth, [extracted])
- Fix mmbench reasoning: cache YAML parsing with @lru_cache instead of per-sample I/O
- Fix format_reward() to also match <analysis>...</analysis> tag pattern
- Expand --reasoning_tags default to include <analysis> tags
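The factory pattern this commit introduces — each task supplies only its answer accessor, and the shared machinery handles tag extraction and scoring — can be sketched as follows. Signatures and metric keys here are assumptions based on the commit description, not the exact `reasoning_utils.py` API:

```python
import re
from typing import Callable

def make_reasoning_process_results(answer_fn: Callable) -> Callable:
    """Return a process_results function closed over a task-specific
    answer accessor, replacing ~400 lines of copy-paste per the commit.
    (Sketch: the real factory likely takes more configuration.)"""
    def process_results(doc: dict, results: list) -> dict:
        response = results[0]
        m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        extracted = m.group(1).strip() if m else response.strip()
        return {
            "acc_score": float(extracted.lower() == answer_fn(doc).lower()),
            "format_score": float(m is not None),
        }
    return process_results

# Each task module then needs only one line:
chartqa_process_results = make_reasoning_process_results(lambda d: d["answer"])
```

Because the closure captures only the accessor, every task shares one tested extraction/scoring path instead of twelve divergent copies.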

* fix(ci): restore task_input_specs/redundancy_refactor.yaml deleted by bff123c

* fix: remove duplicate --reasoning_tags CLI argument

* docs: restore docs/README.md from dev-v0d7

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>
Replace hand-rolled doc_to_messages and process_results in 10
reasoning task utils with make_reasoning_doc_to_messages and
make_reasoning_process_results from _task_utils/reasoning_utils.py.

Files: ai2d, chartqa, charxiv, logicvista, mathvision,
olympiadbench_mimo, phyx, realworldqa, seedbench, seedbench_2_plus.

All YAML-referenced function names, metric keys, and scoring logic
preserved exactly.
* feat(models): add Phi4 multimodal backend

Added Phi4 multimodal model support with chat template interface.

* style: auto-fix lint (black + isort)

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Test splits lack ground truth answers. Changed to submission metric
to save predictions, extracting answers from <answer> tags.
- Added infovqa_test_reasoning.yaml with submission metric
- Test split extracts answers from <answer> tags for submission
- Created infovqa_reasoning.yaml group with val + test
- Added docvqa_test_reasoning.yaml with submission metric
- Test split extracts answers from <answer> tags for submission
- Created docvqa_reasoning.yaml group with val + test
@Luodian (Contributor) commented Feb 28, 2026

Cherry-picked the 3 core commits directly onto dev-v0d7:

  • 1cb06e3d fix(tasks): use submission metric for mmbench reasoning test splits
  • ef707048 feat(tasks): add infovqa reasoning test split and group
  • c2790f61 feat(tasks): add docvqa reasoning test split and group

Closing this PR since the branch had diverged too far (100 commits behind) to rebase cleanly. Changes are now live on dev-v0d7.

@Luodian closed this Feb 28, 2026
