Conversation
Luodian pushed a commit that referenced this pull request · Feb 28, 2026
* Add try catch for longvila
* Fix gqa doc id key error for more robustness
* Revise PR template
* Lint
Luodian added a commit that referenced this pull request · Feb 28, 2026
…s (human/agents) (#1210)

* [TASK] Added MindCube Task (#876)
  - feat(mindcube): add YAML configurations and utility functions for MindCube tasks
  - refactor(mindcube): enhance docstrings and improve code readability in utils.py
  - feat(mindcube): introduce _default_template_yaml and refactor task YAML files to include shared configurations
* Fix data loading for coco_karpathy_test (#884)
  Change the doc_to_visual() function for the Karpathy test to coco_doc_to_visual_karpathy().
* fix hallusionbench processing for distributed eval (#885)
  The previous code used an unsafe way to remove the image key from the sample set. Since the same image is used multiple times during eval, the script failed, specifically when running a distributed eval across several GPUs. The new function removes the key from a copy rather than from the original dict.
* [feat] Add llava ov 1.5 chat (#887)
* Add Qwen3-VL models (w/o vllm or sglang) (#883)
  - add qwen3vl huggingface (no sglang/vllm)
  - fix: handle batch_size > 1 with padding_side='left'
  Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* Update README.md (#891)
* [Task] add UEval benchmark to lmms-eval (#890)
  - add UEval
  - chore: apply pre-commit fixes
  Co-authored-by: LB <libo81501@gmail.com>
* [TASK] MME-SCI Benchmark (#878)
  - add MME-SCI; update config and path; update utils.py
* [benchmark] add SciVideoBench benchmark to lmms-eval (#875)
  - add SciVideoBench benchmark to lmms-eval
  - style: apply black/isort formatting fixes
  - fix: update scivideobench HF integration
* fix qwen to handle bsz > 1 (#889)
  Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* [Fix] LongVila prepare request and gqa doc to visual (#906)
  - add try catch for longvila
  - fix gqa doc id key error for more robustness
  - revise PR template; lint
* Add OmniSpatial task (#896)
* Edit current_tasks to include mathvision, which is already implemented (#904)
* [Docs] Add MME-SCI to current_tasks.md (#909)
  - add MME-SCI; update config and path; update utils.py
* refactor mmstar with default template (#907)
  Add an OpenCompass version of MMStar with the official Qwen3 prompt template.
* Update available models in __init__.py (#914)
* [bugfix] Fix nested dictionary input for vllm mm_processor_kwargs (#915)
  - fix the simple_parse_args_strings function in utils.py
  - separate out a _smart_comma_split function
  - apply black
* Update Qwen3VL video generation logic to reproduce the official VideoMME results (#912)
  - update VideoMME with the Qwen3VL prompt
  - update Qwen3VL to better handle qwen-vl-utils params
* Update apply method to handle empty responses (#917)
  Handle empty response lists by returning an empty string.
* update qwen3vl (simple model) processor (#922)
* [bugfix] Filter unsupported model_kwargs for LLaVA-OneVision-1.5 (#924)
* [feat] Add reasoning version of image and text dataset (#926)
  - add reasoning utils
  - add charxiv reasoning version
  - add reasoning tasks for images
  - add text reasoning task
  - lint
* [Task] Spatial benchmarks: Blink, CV_Bench, Embspatial, ERQA (#927)
  - spatial benchmarks added; small fixes
* Add "VLMs are biased" benchmark (#928)
* Update MMMU and MMMU-Pro with Qwen3VL prompt (#929)
  - update mmmu with qwen3vl prompts
  - update mmmu and mmmu pro with qwen3vl prompts
  Co-authored-by: ardalan.mehrani <ardalan.mehrani@bytedance.com>
* add snsbench (#930)
* Implement "Vision Language Models are Blind" benchmark (#931)
* [fix] Fix the tqdm bar for the qwen vl series for batch infer (#936)
  - fix tqdm bar for qwen25 vl when batch responding
  - fix qwen3 vl batch tqdm processing issue
  - lint
* [feat] Bagel lmms-engine eval inference pipeline (#938)
  - add bagel lmms-engine version for a better API to transformers
  - allow bagel to load from chat messages
* [Dataset] Add Gedit Bench for Bagel in lmms-eval (#939)
  - checkout patch for gedit
  - remove redundant mllm tools
  - revise and add gedit bench
  - fix bagel lmms-engine on multi rank
  Co-authored-by: KemingWu <wukemingcqu@gmail.com>
* add jmmmu_pro (#937)
  - add jmmmu_pro; fix lint errors
* [Task] Pointing benchmarks: RefSpatial, Where2Place (#940)
  - add pointing benchmarks; refactoring
* [NEW TASK] Add FALCON-Bench to tasks (#942)
  - FALCON-Bench is part of the paper https://cplou99.github.io/FALCONEye/, accepted at WACV 2026.
  - Introduced new YAML configuration files for FALCONBench tasks (FALCONBench_mcq.yaml, FALCONBench_mcq_temploc.yaml, FALCONBench_oq.yaml, FALCONBench_oq_temploc.yaml) and utils.py with the eval metrics of the paper.
  - Downloading the benchmark requires soccernet==0.1.62 (via pip); it is not included in case you do not want to add it to the requirements.
  - Add readme
  - New default template yaml file with common args
* feat: add LongVT evaluation tasks for long video understanding with tool calling (#944)
* docs: update current_tasks.md with comprehensive model and task listings
  Reorganize documentation to include:
  - summary statistics table for quick reference (190+ tasks, 70+ models)
  - complete task listings organized by modality (image, video, audio, etc.)
  - full model registry with Chat Template and Simple/Legacy categories
  - new tasks: FALCON-Bench, LongVT, PhyX, VideoMathQA, JMMMU-Pro, etc.
  - modality support annotations for each model
* Add claude GitHub actions 1767101104580 (#958)
  - "Claude PR Assistant workflow"
  - "Claude Code Review workflow"
* [bugfix] mmsi bugfix (#945)
  Merging bugfix for the MMSI benchmark.
* sub_task metrics added (#954)
  Merging OmniSpatial refactor with sub-task metrics.
* [Task] Video Streaming Benchmark: OVOBench (#957)
  Merging OVOBench integration, a comprehensive video streaming benchmark.
* feat: add automated PR code review skill
  Add a comprehensive code review skill with parallel agents:
  - 8-step systematic review process with todo tracking
  - 5 parallel Sonnet agents for multi-dimensional analysis: CLAUDE.md compliance checking; shallow bug scanning (focus on changed lines); git history analysis for pattern violations; previous PR pattern matching; code comment compliance verification
  - 0-5 scoring system with confidence-based filtering (>= 4)
  - auto-publishes formatted review comments on GitHub
  - intelligent false-positive filtering
  - includes detailed documentation and a quick reference
  The skill enables consistent, thorough code reviews across the team while reducing manual review time from 15-20 min to 2-3 min per PR.
  Usage: "review PR #123", "review all open PRs".
  Performance: parallel agent execution is ~60% faster than sequential; average review time is 2-3 minutes per PR; filters to high-confidence issues only (score >= 4).
* [feat]: add generate_until_multi_round for qwen_2_5_vl and qwen_2_vl models inference (#960)
  Co-authored-by: mdrepin <mdrepin@sberdevices.ru>
* Add Qwen 3 Omni and Video Salmonn 2 (#955)
  - feat: add Qwen3-Omni, Uni-MoE-2.0-Omni, and video-SALMONN-2 models; support for three new omnimodal models in lmms-eval:
    - qwen3_omni: Qwen3-Omni-30B-A3B-Instruct from Alibaba's Qwen team; supports text, image, audio, and video inputs; uses transformers' Qwen3OmniMoeForConditionalGeneration; requires the qwen-omni-utils package
    - uni_moe: Uni-MoE-2.0-Omni from HIT-TMG; dynamic-capacity MoE architecture with 33B parameters; supports multimodal inputs through specialized expert routing; requires custom installation from HITsz-TMG/Uni-MoE
    - video_salmonn: video-SALMONN-2_plus_7B from Tsinghua/ByteDance; built on Qwen2.5-VL-7B with LoRA adapters; specialized for audio-visual understanding; uses PEFT for efficient model loading
  - fix: add AudioDecoder support and stereo-to-mono conversion for qwen3_omni
    - add a `_decode_audio` helper to handle AudioDecoder objects from the datasets library
    - add stereo-to-mono conversion in `resample_audio` for audio-only tasks
    - include an evaluation results summary in EVALUATION_RESULTS.md
    - tested on mme, mmau, and omni_bench tasks with a 10-sample limit: qwen3_omni mme=0, mmau=20%, omni_bench=0%; video_salmonn mme=170, mmau=30%, omni_bench=40%
  - fix: resolve empty responses for mixed-modality inputs in qwen3_omni
    - fix flatten logic to preserve [audio, image] groupings from omni_bench
    - fix the all() check to properly validate homogeneous audio lists
    - handle tuple output from the model (it returns (text_ids, audio) even when return_audio=False)
    - add a warning for video_salmonn's unsupported standalone audio inputs
    - update EVALUATION_RESULTS.md with correct scores (mmau: 60%, omni_bench: 70%)
    These fixes resolve the issue where qwen3_omni returned empty responses on mixed audio+image tasks like omni_bench.
  - chore: remove unused code in qwen3_omni (a no-op loop and an unused content-list variable)
  - fix: improve uni_moe compatibility with processor and video inputs
    - add deepspeed_moe_inference_utils import for single-GPU MoE layers
    - fix processor/model key mismatch: second_grid_ts -> second_per_grid_ts
    - add video token replacement for video inputs
    - convert pixel_values to bfloat16 to match the model dtype
    - improve generation parameter handling
    Note: uni_moe still has fundamental limitations; it requires >80GB VRAM for video inputs on a single GPU, and multi-GPU runs hit device-mismatch issues in the MoE aux_loss computation.
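The stereo-to-mono step mentioned in the qwen3_omni audio fixes above can be sketched roughly as follows. This is a minimal illustration only: the function name and the use of plain Python sequences are assumptions, not the repository's actual `resample_audio` implementation (which works on array data from the datasets library).

```python
def stereo_to_mono(samples):
    """Collapse multi-channel audio frames to mono by averaging channels.

    `samples` is either a flat list of mono samples, or a list of
    per-frame tuples/lists (one value per channel). Mono input passes
    through unchanged. Hypothetical sketch of the stereo-to-mono
    conversion described in the commit message above.
    """
    if not samples:
        return []
    first = samples[0]
    if not isinstance(first, (list, tuple)):
        # Already mono: nothing to do.
        return list(samples)
    n_channels = len(first)
    return [sum(frame) / n_channels for frame in samples]
```

For example, `stereo_to_mono([(0.5, -0.5), (1.0, 1.0)])` averages each stereo frame into a single value.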
  - refactor: remove uni_moe and optimize qwen3_omni for faster inference
    - remove the uni_moe model (fundamental issues with multi-GPU inference)
    - optimize qwen3_omni for evaluation benchmarks: add disable_talker() to skip audio generation (~10GB memory saved); reduce default max_num_frames from 768 to 128; hardcode return_audio=False for text-only evaluation
  - chore: clean up video_salmonn and qwen3_omni code
    - remove unused imports (os, decord, base64, BytesIO)
    - remove verbose logging and unnecessary comments
    - simplify frame-sampling logic in video_salmonn
    - refactor audio decoding into helper methods in qwen3_omni
    - apply consistent code style matching framework standards
  - lint; minor changes
  - refactor: rename video_salmonn to video_salmonn_2 to match the actual checkpoint (video-SALMONN-2_plus_7B)
* [Task] Add imgedit bench (#941)
  - checkout patch for imgedit
  - add and refactor img edit benchmark
  - add readme for task
  - remove pdb lines; better output formatting when no generation is needed
  Co-authored-by: KemingWu <wukemingcqu@gmail.com>
* Add STARE task (#893)
* bugfix: missing fields in doc when using --log_samples (#731)
  Bug: fields with "image" in their keys, or of type dict, were not saved in the sample log file. Fix: save dicts and all fields to the JSONL file.
  - lint
  Co-authored-by: Bo Li <drluodian@gmail.com>
* The evaluation strategy is changed from LLM-judge to rule-judge + LLM-judge (#953)
  - black format
  - fix: add type hints to new functions in mmvu utils: get_llm_judge_server() -> Any; normalize_math_notation(text: str) -> str; evaluate_with_rule_based(doc: Dict, prediction: str) -> bool; evaluate_with_llm_judge(doc: Dict, prediction: str) -> tuple[bool, str]. CLAUDE.md requires type hints for all code.
  Co-authored-by: Brian Li <drluodian@gmail.com>
* feat: add task groundingme (#949)
  - fix: address code review issues in GroundingME: replace bare except with specific exception types; replace print() with eval_logger.info(); add type hints to public functions; add typing imports
  Co-authored-by: Brian Li <drluodian@gmail.com>
* [Task] add AV-SpeakerBench (#943)
  - fix loading audio
  - fix: add type hints to av_speakerbench/utils.py; comprehensive annotations for all functions: _parse_choices (Union input, detailed return tuple type); doc_to_* functions (Dict[str, Any] input, appropriate return types); parse_multi_choice_response (Optional[str] input, str return); process_results/aggregate_results (proper dict and list types)
  Co-authored-by: Le Thien Phuc Nguyen <leos@Les-MacBook-Air.local>
  Co-authored-by: Le Thien Phuc Nguyen <leos@Mac.lan>
  Co-authored-by: Brian Li <drluodian@gmail.com>
* add-task-seephys (#903)
* WhisperTT evals (#899)
  - register the new WhisperTT model
  - update pyproject.toml to pin the transformers and numpy versions
  - update registration with the new directory structure
  - change the sample-audio path to use TT_METAL_HOME; import pathlib.Path
  - remove the deprecated call to enable_async(); update the tt-metal installation location; fix an f-string typo
  - update warmup_model to accept a model_repo parameter in the WhisperTT class (later: remove the default model_repo parameter)
  - add Chinese and English text normalization utilities: add openslr_librispeech_other.yaml and openslr_librispeech.yaml task configurations; implement utility functions in utils.py for processing audio and text documents; create basic.py, english.py, and english.json for English text normalization (spelling variations, number normalization); enhance the whisper normalizer for both Chinese and English text
  - update the openslr_librispeech_other.yaml configuration: change dataset_path to 'parquet' and update dataset_kwargs to include a specific data-file URL; change test_split from 'test' to 'train' and set dataset_name to 'null'
  - refactor import_function to support both relative and absolute imports: attempt a relative file import first, falling back to an absolute module import if the relative path does not exist; re-raise import errors with context for better debugging; remove the unused openslr_librispeech/_default_yaml_template and related whisper normalizer files
  - enhance librispeech_process_result to support multiple ground-truth field names ("gt" and "transcript"), raising a KeyError if neither is found; safely retrieve "source" (defaulting to "unknown") and infer "task" from context (defaulting to "asr_en" for LibriSpeech datasets)
  - enhance librispeech_doc_to_audio to check multiple audio field names ("audio", "file", "path", "audio_path"), raising a KeyError if none is found
  - refactor librispeech utility functions for clarity: simplify librispeech_doc_to_audio to directly return the "audio" field; streamline librispeech_process_result to directly access "gt", "source", and "task"; add librispeech_doc_to_target to return the ground truth from the document
  - refactor Open-ASR utility functions: openasr_doc_to_audio handles multiple audio field names; a new openasr_doc_to_target normalizes retrieval of ground-truth fields ("text", "transcript", "gt")
  - enhance warmup_model to create a mesh device instead of a single device (with logging) for the mesh-enabled Whisper model; later simplify mesh-device creation by removing unnecessary parameters
  - implement an HTTP API client for the Whisper model: refactor the WhisperTT class to call the tt-media-server for audio transcription, allowing evaluations to run outside of Docker; add methods for encoding audio to base64 and transcribing audio via the API; add initialization parameters for base URL, timeout, and retries
  - enhance WhisperTT initialization: add a num_concurrent parameter for concurrency handling; log a warning for unexpected kwargs instead of raising an assertion error
  - ensure the audio array is float32 for 32-bit WAV file creation, preventing "Unsupported bit depth: 64" errors
  - update the default API key in the WhisperTT class for testing purposes
  - run requests in parallel
  - fix: address code review issues in WhisperTT: use self.api_key instead of a hardcoded placeholder; remove commented-out code blocks; fix an inconsistent return type (return "" instead of a tuple on error); revert dependency downgrades (numpy 1.26.4, transformers >=4.39.2); fix a log message ("Audio transcription" instead of "Image generation")
  Co-authored-by: bgoelTT <bgoel@tenstorrent.com>
  Co-authored-by: stisi <stisi@tenstorrent.com>
  Co-authored-by: Brian Li <drluodian@gmail.com>
* Add SpatialViz task (#894)
* easier code for multiple images (#879)
  - add coco captioning chair; add chair recall; bootstrapping
  - add amber_g; amber works; bootstrap for amber
  - add an easy flag to control image ordering (incomplete code); we need to do interleaved at the end of the day...
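The "encode audio to base64" step of the WhisperTT HTTP client described above can be sketched as follows. This is a hypothetical, standard-library-only version: the function name and payload shape are assumptions, and it writes 16-bit PCM because the stdlib `wave` module only supports integer PCM, whereas the actual client reportedly sends float32 WAV to tt-media-server.

```python
import base64
import io
import struct
import wave

def encode_wav_base64(samples, sample_rate=16000):
    """Pack float samples in [-1, 1] into a mono 16-bit PCM WAV file
    and return it base64-encoded, ready to embed in a JSON request body.

    Hypothetical sketch of the client-side encoding step; not the
    repository's actual WhisperTT implementation.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)          # 16-bit PCM
        wav.setframerate(sample_rate)
        # Clamp and scale floats to the int16 range before packing.
        ints = (max(-32768, min(32767, int(s * 32767))) for s in samples)
        wav.writeframes(b"".join(struct.pack("<h", i) for i in ints))
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

The returned string would then be placed in the request body sent to the transcription endpoint; the float32 requirement in the commits above suggests the real server expects a different bit depth than this sketch produces.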
  - fix typo; update evaluator to prevent customized chair; mmbench two; clean code
  - file upload bug fix; enable double-img mmmu; hallusion_bench
  - fix: resolve conflicts and fix code review issues: fix args.output -> args.output_path in file_utils.py; replace a hardcoded user path with a reasonable default in amber_g; remove commented-out debug print statements; remove dead code blocks and informal comments
  Co-authored-by: Patrick Wu <tsunghan_wu@berkeley.edu>
  Co-authored-by: Brian Li <drluodian@gmail.com>
* fix: filter multimodal content from log samples while preserving metadata (#962)
  - fix: improve spatialviz utils quality
    - fix FileExistsError -> FileNotFoundError (the correct exception type)
    - replace print() with eval_logger for consistent logging
    - add type hints to all functions
    - fix a missing-comma bug in the final_answer_patterns list
    - remove the redundant image_path = image_path assignment
    - initialize the op variable to prevent a potential UnboundLocalError
    - break a long prompt string for readability (88-char line limit)
  - style: apply black formatting
  - fix: filter multimodal content from log samples while preserving metadata
    When using --log_samples, the previous implementation either saved all fields (causing serialization issues with images/audio) or filtered based on key names (missing useful metadata like image_id and image_path). This fix introduces is_multimodal_content(), which detects actual multimodal data types (PIL.Image, numpy arrays, torch tensors, HuggingFace audio/image dicts) while preserving all scalar metadata fields for dataset traceability.
    Github-Issue:#943
* fix: improve spatialviz utils quality (#961)
  - fix: improve spatialviz utils quality: fix FileExistsError -> FileNotFoundError; replace print() with eval_logger; add type hints to all functions; fix the missing-comma bug in final_answer_patterns; remove the redundant image_path assignment; initialize op to prevent UnboundLocalError; break the long prompt string (88-char limit)
  - style: apply black formatting
  - style: apply black and isort formatting to all files
* [Fix] Fix imgedit eval logic for calling the OpenAI client (#966)
* Add intern vl3 and internvl3_5 (#963)
  - add InternVL3 model support for InternVL3-8B (OpenGVLab/InternVL3-8B) and InternVL3.5-30B-A3B (OpenGVLab/InternVL3_5-30B-A3B); the implementation supports both single-GPU and multi-GPU inference with automatic device mapping
  - separate the InternVL3 and InternVL3.5 models: add internvl3_5.py as a thin wrapper around InternVL3 with a different default pretrained model; both share the same logic since they have identical interfaces
  - minor changes; lint; address review
* Update current_tasks.md (#965)
  Add AV-SpeakerBench to current_tasks.md in docs.
* [Fix] Qwen2VL batchsize>1 visual alignment (#971)
  Co-authored-by: TerryUV <1207335715@qq.com>
* use deps lower bounds (#969)
* chore: remove automatic Claude code review workflow (#973)
  The automatic review on every PR open/synchronize was too frequent. Users can still trigger Claude reviews with @claude in PR comments, which is handled by claude.yml.
* [Task] Added VSIBench debiased & pruned (#975)
  - added vsibench_debiased and vsibench_pruned subsets
  - fixed yaml names and files
* docs: add i18n README translations for 18 languages (#979)
  Add internationalized documentation to improve accessibility for global users. Includes translations in:
  - Asian: Chinese (Simplified/Traditional), Japanese, Korean, Vietnamese, Indonesian, Hindi
  - European: Spanish, French, German, Portuguese, Russian, Italian, Dutch, Polish
  - Other: Turkish, Arabic
  Each translation maintains the core documentation structure with localized content for installation, usage, and contribution guides.
* docs: improve Chinese translation quality (#980)
  Fix machine-translation artifacts in the zh-CN and zh-TW READMEs: rewrite awkward phrases for natural flow; add missing context about benchmark-discovery challenges; simplify verbose expressions; improve terminology consistency (LLM/LMM).
* [TASK] Add Mantis-Eval Task (#978)
  Add the Mantis-Eval task for evaluating multi-image reasoning capabilities. Includes low-level, high-level, and full evaluation variants with quality/fidelity/coherence metrics.
* [Model] Added Cambrian-S model (#977)
  Add Cambrian-S model support for multimodal evaluation. Supports both image and video inputs with decord-based video processing.
* [feat] Init an http eval server and entrypoints for lmms_eval (#972)
  Add an HTTP evaluation server for remote model evaluation. Provides REST API endpoints for running evaluations and querying results, enabling distributed evaluation workflows.
* docs: clarify that batch_size=auto is not implemented (#981)
  The --batch_size auto option is documented but not actually implemented for most models: models cast batch_size to int, causing a ValueError when 'auto' is passed. Updated the docs to reflect current behavior and guide users to explicit integer values instead.
  Github-Issue: #967
  - style: fix black formatting
* fix: add missing 'all' extra to pyproject.toml (#982)
  Fixes #976
* [Task] Added ViewSpatial task (#983)
  Add the ViewSpatial benchmark task for spatial-reasoning evaluation.
* [Task] Added SiteBench task (#984)
  Add the SiteBench benchmark task for spatial-understanding evaluation.
* [fix] Align the image-text order in evaluation with the original evaluation (#986)
  Fix the image-text order in imgedit evaluation to match the original benchmark.
* fix: properly handle qwen2.5 video frames edge case (#987)
  Fix a qwen2.5 video-frames edge case without changing the default sampling behavior.
* [feat] Add decontamination probing settings for video benchmarks (#990)
  - feat(videomme): add 7 decontamination task settings for the VideoMME benchmark:
    - no_visual: evaluate without video content
    - random_choice: shuffle options to test position bias
    - gt_none_option: replace the correct answer with 'None'
    - number_option: change A/B/C/D to 1/2/3/4
    - revert_oe_mcq: convert MCQ to open-ended
    - convert_mcq_oe: open-ended with LLM matching
    - video_only_abcd: only video, no question
  - feat(videommmu): add 4 decontamination task settings for the VideoMMMU benchmark (no_visual, random_choice, gt_none_option, number_option); each setting includes perception, comprehension, and adaptation splits
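The random_choice decontamination setting described above (shuffle options to test position bias) can be illustrated with a small sketch. The function name, seeding scheme, and letter convention are assumptions for illustration, not the repository's actual task-config code; the key point is that the answer letter must be remapped after the shuffle.

```python
import random

def shuffle_choices(options, answer_letter, seed=0):
    """Shuffle a multiple-choice option list and return
    (shuffled_options, new_answer_letter).

    Sketch of the 'random_choice' decontamination idea: permuting the
    options exposes models that rely on answer-position priors rather
    than the visual content. Deterministic given the seed.
    """
    letters = [chr(ord("A") + i) for i in range(len(options))]
    correct_text = options[letters.index(answer_letter)]
    shuffled = options[:]
    random.Random(seed).shuffle(shuffled)
    new_letter = letters[shuffled.index(correct_text)]
    return shuffled, new_letter
```

A fixed seed keeps the permutation reproducible across evaluation runs, so scores remain comparable while the position prior is still broken.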
  - feat(longvideobench): add 2 decontamination task settings for LongVideoBench (no_visual, random_choice)
  - feat(lvbench): add 2 decontamination task settings for LVBench (no_visual, random_choice)
  - feat(longvt): add a no_visual decontamination setting for LongVT to evaluate model performance without video content
  - style: format qwen2_5_vl.py with black
* [Task] Add vsibench multi-image variant (#993)
* [feat] Add CLT and clustered standard error estimation for statistical rigor (#989)
  - feat(api): add cluster_key config and a clustered_stderr function
    - add a cluster_key field to TaskConfig for specifying the clustering field name
    - add clustered_stderr() for calculating the SE when questions share context
    - fix: add the missing Any type import in metrics.py
  - feat(evaluator): add CLT and clustered stderr calculation
    - add calculate_clt_aggregate_metric() to TaskOutput
    - read cluster_key from the task config to determine the clustering field
    - convention: dict items with a 'score' key for 0/1 correctness
    - output stderr_clt and stderr_clustered in the results JSON
  - feat(output): add Stderr_CLT and Stderr_Clustered columns to the results table
    - extend make_table() to display all three stderr types as columns
    - backward compatible: shows N/A for tasks without the new stderr fields
  - feat(videomme): add a score field and cluster_key config for stderr calculation
    - add a 'score' key (0/1 correctness) to the process_results output
    - add 'videoID' to the output dict for clustered SE calculation
    - add cluster_key: videoID to the yaml config
    - questions from the same video are now properly treated as correlated
  - fix(metrics): correct the clustered_stderr formula per Eq. 4 of arXiv:2411.00640: SE_clustered = sqrt(SE_CLT^2 + cross_term), where the cross term accounts for within-cluster correlations. Remove the redundant 'import numpy as np' inside the function and add 'import collections' at module level for defaultdict. The formula now correctly handles all cases: no clustering effect (SE_clustered = SE_CLT), positive correlation (SE_clustered > SE_CLT), and negative correlation (SE_clustered < SE_CLT). Addresses reviewer feedback from kcz358.
  - feat(config): add a score_key option for a configurable score field: add score_key to TaskConfig (default: "score") and update calculate_clt_aggregate_metric to use score_key from the config, allowing tasks to customize which dict field contains the 0/1 scores. Addresses reviewer feedback from kcz358.
  - fix(utils): handle numpy empty-array comparison in make_table. Fix a ValueError when comparing numpy empty arrays with Python lists: a numpy comparison like `arr == []` raises an ambiguous-truth-value error, so use `hasattr` and `len()` for a safe empty check instead.
  - [feat] ignore opencode files
* fix: qwen2.5vl nframes bug (#992)
  - fix: nframes bug; precommit
  - fix: remove debug logging and make the decord import optional
  - style: apply black formatting to fix CI
  Co-authored-by: Bo Li <drluodian@gmail.com>
* Add CaptionQA benchmark task (#991)
  CaptionQA evaluates how well image captions preserve information for downstream QA tasks. The benchmark uses:
  - a vision-language model to generate captions
  - Qwen2.5-72B-Instruct as a judge to answer questions based on the captions
  - scoring based on correctness, with partial credit for 'cannot answer'
  Usage: python -m lmms_eval --model qwen2_5_vl --model_args pretrained=Qwen/Qwen2.5-VL-3B-Instruct --tasks captionqa --batch_size 1 --output_path ./logs/captionqa_results
  Requirements: at least 2 GPUs with ~80GB VRAM each (for the 72B judge model); vLLM installed for efficient judge inference.
  Paper: https://arxiv.org/abs/2511.21025
  Dataset: https://huggingface.co/datasets/Borise/CaptionQA
  - fix: remove unused imports and fix a bare except
  - style: apply black formatting to fix CI
  Co-authored-by: Bo Li <drluodian@gmail.com>
* Revert "Add CaptionQA benchmark task (#991)" (#1002)
  This reverts commit d9e3753b845da212a0a3026e26db273a55b36075.
* [Bug] internvl3 duplicate <image> token issue (#999)
  - fix: internvl3 duplicate <image> token bug; precommit
  - fix: remove unused import and debug logging from internvl3
  - style: apply black formatting to fix CI
  Co-authored-by: Bo Li <drluodian@gmail.com>
* [Task] Added SiteBench multi-image variant and bug fix (#996)
  - add the SiteBench multi-image variant
  - fix: fixed the issue where post and pre prompts were not added properly
  - fix lmms_eval_specific_kwargs bug
  - fix: convert lmmseval_specific_kwargs input to a dict to align with the lmms-eval convention
  - feat: added an option to use interleave_visual to align with vlmevalkit's format
  - precommit
  - fix: remove the unused torch import and fix not-in syntax
  - style: apply black formatting to fix CI
  Co-authored-by: Bo Li <drluodian@gmail.com>
* [TASK] Add SpatialTreeBench task (#994)
  - add SpatialTreeBench task; clean output format
  - style: auto-format code with black and isort
  - fix: remove unused imports, fix bare excepts, and format code
  - style: apply black formatting to fix CI
  Co-authored-by: Bo Li <drluodian@gmail.com>
* Add CaptionQA benchmark task (#1004)
  CaptionQA evaluates image-captioning models by testing how well their generated captions enable a downstream QA model to answer questions. Features:
  - supports 4 splits: natural, document, ecommerce, embodiedai
  - uses an LLM judge (Qwen2.5-72B-Instruct via SGLang) for QA evaluation
  - includes partial-credit scoring for 'cannot answer' responses
  - deterministic shuffle permutations for reproducibility
  Important: requires transformers<=4.56.0 due to a Qwen2.5-VL image-processing regression.
* Add BabyVision benchmark task (#1008)
  BabyVision is a benchmark for evaluating visual-reasoning capabilities on tasks that even 3-year-old children can solve but that remain challenging for AI models.
  - add a task YAML configuration pointing to the UnipatAI/BabyVision dataset (388 items)
  - implement utils with doc_to_visual/text/target functions
  - add LLM judge integration for blank-question evaluation
  - include type-wise and subtype-wise accuracy aggregation
  - add an example evaluation script
  Categories covered: Fine-grained Discrimination, Visual Tracking, Spatial Perception, Visual Pattern Recognition.
  Reference: https://github.com/UniPat-AI/BabyVision
* Revert "Add BabyVision benchmark task (#1008)" (#1009)
  This reverts commit aa82ee164dd8dfdb69775d4b3ce34b1e4a5b242d.
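The clustered standard-error formula described in the #989 commits above, SE_clustered = sqrt(SE_CLT^2 + cross_term), can be sketched in a few lines. The signature below is an assumption for illustration (the repository's function lives in metrics.py and reads the cluster field, e.g. videoID, from the task config); the math is the standard cluster-robust SE, where the cross term sums residual products within each cluster.

```python
import math
from collections import defaultdict

def clustered_stderr(scores, cluster_ids):
    """Cluster-robust standard error of the mean of 0/1 scores.

    Squared SE = CLT term + cross term from within-cluster residual
    covariance, so positively correlated clusters widen the error bar
    and independent items reduce to the plain CLT estimate.
    Hypothetical sketch of the behavior described above.
    """
    n = len(scores)
    mean = sum(scores) / n
    resid = [s - mean for s in scores]
    # CLT (i.i.d.) variance of the mean.
    var_clt = sum(r * r for r in resid) / (n * n)
    # Cross term: products of residual pairs sharing a cluster.
    by_cluster = defaultdict(list)
    for r, c in zip(resid, cluster_ids):
        by_cluster[c].append(r)
    cross = sum(
        sum(rs) ** 2 - sum(r * r for r in rs) for rs in by_cluster.values()
    ) / (n * n)
    return math.sqrt(max(var_clt + cross, 0.0))
```

When every item is its own cluster the cross term vanishes and the result equals the CLT stderr, matching the "no clustering effect" case the commits describe.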
* Update README and PR Code Review documentation * update README.md * update README.md * update i18n README.md * [Task] Added SPAR-bench (#1011) * [Task] Add sparbench and sparbench-tiny * [styles] pre-commit * scalable choice selection added (#1005) * add structeditbench task (#1016) * [Model] Bagel UMM (#1012) * [Model] Added bagel_umm which supports both inference and generation * fix: allows using huggingface path * fix: fixed image output folder when using multiple gpu * chore: merge bagel.py and bagel_umm.py * feat(tasks): add BabyVision Und task with LLM-based evaluation (#1015) * feat(task): add BabyVision task configuration Add task configuration for BabyVision MLLM evaluation benchmark. - babyvision.yaml: Task config with LLM-as-Judge evaluation - Dataset: UnipatAI/BabyVision (388 samples) - Extended max_new_tokens (8192) for reasoning models - Configurable API via BABYVISION_* environment variables - __init__.py: Package initialization Paper: https://arxiv.org/abs/2601.06521 * feat(task): add BabyVision evaluation logic Add core evaluation logic for BabyVision benchmark. - utils.py: Main evaluation functions - babyvision_doc_to_visual/text/target: Data processing - babyvision_process_results: LLM-as-Judge evaluation - babyvision_aggregate_results: Per-subtype accuracy breakdown - extract_boxed_answer: Answer extraction with \boxed{} and <think> support - Dynamic environment variable reading for flexibility - prompt.py: LLM Judge prompt template - Aligned with original BabyVision evaluation code - Semantic matching for answer correctness * docs: add BabyVision task documentation Add README with usage instructions and benchmark overview. 
- Task categories: Fine-grained Discrimination, Visual Tracking, Spatial Perception, Visual Pattern Recognition - Environment variables: BABYVISION_API_KEY, BABYVISION_BASE_URL, BABYVISION_MODEL_NAME - Metric: babyvision_overall_accuracy - Answer format: \boxed{} with reasoning model support * feat(tasks): add BabyVision Gen task with LLM-based evaluation (#1010) * feat(tasks): add BabyVision Gen task with LLM-based evaluation Add new task for evaluating visual reasoning image generation using LLM. Features: - Configurable OpenAI-compatible API (via BABYVISION_* env vars) - Type-specific evaluation criteria for all task types - Overall accuracy with per-type/subtype breakdown - Clean code structure following imgedit/gedit_bench patterns * Lint * docs: update development guidelines - Replace pytest/ruff sections with CICD testing reference - Simplify code formatting to pre-commit hooks only - Remove redundant type checking sections - Remove trailing newline at end of file * [feat] Add baseline comparison with paired t-test (#1006) * feat(metrics): add paired_ttest function for baseline comparison Implement paired t-test statistical analysis: - Calculate mean difference and standard error - Compute 95% confidence interval - Return t-statistic and p-value - Fallback to normal approximation when scipy unavailable * feat(baselines): add registry and loader module New module for baseline management: - registry.py: Model × task preset registry structure - loader.py: Load baselines from local/HF/registry - Support hf://user/repo/file.jsonl URL format * feat(evaluator_utils): add compute_baseline_comparison helper Add helper function to compute paired t-test comparison: - Wrap paired_ttest with baseline metadata - Calculate baseline and current mean scores * feat(evaluator): integrate baseline comparison into evaluation Add baseline comparison logic to simple_evaluate(): - Load baseline data from registry/local/HF - Match samples by doc_id and extract scores - Compute 
paired t-test and store results with paired_ prefix - Add get_baseline_display_name() for short display names * feat(utils): add baseline comparison columns to output table Update make_table() for baseline comparison display: - Add Baseline/Diff/CI/P_Value columns - Auto-hide columns when all values are N/A - Dynamically compute Diff from current score and baseline - Format p-value with * for significance (p < 0.05) * feat(cli): add --baseline argument for model comparison Add CLI parameter to specify baseline for paired t-test: - Support preset name (e.g., qwen25vl) - Support local JSONL path - Support HuggingFace URL (hf://user/repo/file.jsonl) * style: apply isort and black formatting * refactor: move imports to top of file in evaluator.py Move baseline-related imports from inside function to module level, following Python best practices for import organization. * refactor: move get_baseline_display_name to baselines module Extract inline function to baselines/__init__.py for better code organization. The function is now exported and can be imported from lmms_eval.baselines. * refactor: use score_key for baseline comparison score extraction - Get score_key from task config instead of hardcoded "score" lookup - Simplify score extraction logic by using score_key directly - Skip baseline comparison gracefully when no valid scores found - Add debug logging when skipping tasks due to missing scores * style: apply isort formatting to evaluator.py imports * fix: add fallback for *_score fields in baseline comparison The score extraction now falls back to searching for fields ending with "_score" (e.g., videomme_perception_score) when the exact score_key is not found. This handles task-specific score field naming patterns. 
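The paired t-test these commits describe (mean difference, standard error, 95% CI, t-statistic, p-value, with a normal-approximation fallback when scipy is unavailable) can be sketched in a few lines. This is a hedged, scipy-free version of that fallback path; the function name and return shape are illustrative, not the exact `lmms_eval` API.

```python
import math


def paired_ttest(baseline, current):
    """Paired comparison of two models on the same samples.

    Uses the normal approximation for the p-value (the scipy-free
    fallback mentioned in the commits). Illustrative sketch only.
    """
    assert len(baseline) == len(current) and len(baseline) > 1
    diffs = [c - b for b, c in zip(baseline, current)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)
    t_stat = mean / se if se > 0 else 0.0

    def _phi(x):  # standard normal CDF via erf
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    p_value = 2 * (1 - _phi(abs(t_stat)))  # two-sided
    ci95 = (mean - 1.96 * se, mean + 1.96 * se)
    return {"mean_diff": mean, "se": se, "ci95": ci95, "t": t_stat, "p": p_value}
```

A CI that excludes zero (equivalently, p < 0.05) is what the output table marks with `*` for significance.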
* chore: add .worktrees to gitignore * [feat] Add Power Analysis for Pre-Evaluation Planning (#1007) * feat(metrics): add power_analysis function for sample size calculation Add statistical function to calculate minimum sample size needed to detect a given effect size using paired t-test power analysis. * feat(cli): add --power-analysis mode for pre-evaluation planning Add CLI arguments and handler for power analysis: - --power-analysis: enable power analysis mode - --effect-size: minimum effect to detect (default 0.03) - --alpha: significance level (default 0.05) - --power: desired power (default 0.80) - --correlation: expected correlation (default 0.5) * docs: clarify std should be estimated from previous eval data Add note in docstring that std parameter should ideally be estimated from previous evaluation results rather than using default value. Add reference to Miller 2024 paper (arXiv:2411.00640). * fix: use separate std_a/std_b params for general variance formula - Replace single 'std' param with 'std_a' and 'std_b' for general case - Fix formula: var_diff = std_a^2 + std_b^2 - 2*rho*std_a*std_b - Add --std-a and --std-b CLI arguments - Backward compatible: defaults to 0.5 if neither provided * [Release] v0.6 Development Branch - TUI, CLT/Clustered SE, Paired T-Test, Power Analysis, Stability Metrics, Decontamination, Import Refactor (#1001) * feat: add TUI interface for interactive evaluation configuration - Add lmms_eval/tui/ module with Textual-based TUI app - Support model selection, task selection, and settings configuration - Add --tui flag to launch interactive mode - Add lmms-eval-tui entry point - Add textual as optional dependency [tui] * [feat] added openrouter api model evaluation example script. 
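The power-analysis commits above spell out the variance formula for paired comparisons, var_diff = std_a^2 + std_b^2 - 2*rho*std_a*std_b, which plugs into the standard sample-size calculation for a paired test. A hedged sketch, with defaults mirroring the CLI flags listed above (`--effect-size 0.03`, `--alpha 0.05`, `--power 0.80`, `--correlation 0.5`, std defaults 0.5); the actual internals are assumptions:

```python
import math
from statistics import NormalDist


def power_analysis(effect_size=0.03, alpha=0.05, power=0.80,
                   correlation=0.5, std_a=0.5, std_b=0.5):
    """Minimum paired-sample size to detect `effect_size`.

    Illustrative sketch of the described calculation; std_a/std_b should
    ideally come from previous evaluation results, not the defaults.
    """
    # Variance of per-sample score differences between two correlated models.
    var_diff = std_a**2 + std_b**2 - 2 * correlation * std_a * std_b
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)          # desired statistical power
    n = (z_alpha + z_power) ** 2 * var_diff / effect_size**2
    return math.ceil(n)
```

Note how the required n shrinks quadratically as the detectable effect grows: halving `effect_size` quadruples the sample size, which is why detecting small model deltas on short benchmarks is hard.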
* fix: handle --tui flag before heavy imports for instant startup * fix: show quick help on no args, only TUI on explicit --tui flag * fix: make decord optional to fix macOS Python 3.12+ installation - Move decord/eva-decord to [video] optional dependency - eva-decord only for macOS Python < 3.12 (no wheels for 3.12+) - Base package now installable on all platforms * [feat] added real-time performance metrics and redesigned tui layout. * [chore] refactored tui logo rendering using textual-image. * [feat] refined tui terminology, styling, and added command highlighting. * [feat] improved tui logo rendering with image support detection. * [feat] improved tui configuration ui and handling of dependencies * [feat] added core bm25 search engine for ui/ux style guides. * [feat] added bash syntax highlighting and line numbers to tui command preview. * refactor(tui): replace terminal UI with web UI - Remove OpenTUI terminal-based UI (had rendering issues) - Add React + Vite + Tailwind CSS web UI - FastAPI backend serves both API and static files - CLI starts server and opens browser automatically Features: - Model selection dropdown - Task list with search/filter and checkboxes - Real-time command preview - Live output streaming via SSE - Start/Stop evaluation controls - Settings: batch size, limit, device, verbosity, output path * fix(tui): suppress server output to prevent pipe blocking * [refactor] lazy import tui functions and overhaul web ui. * [feat] display git and system information in the tui web interface. * [feat] tui now supports environment variables. 
* feat(api): add model stability metrics functions (EA, CA, IV, CR) Add four new metrics for measuring model stability in k-samples mode: - expected_accuracy: mean accuracy across all k samples - consensus_accuracy: accuracy after majority voting - internal_variance: average variance within each question (lower is better) - consistency_rate: fraction of questions with consistent answers Reference: HackMD v0.6 roadmap section 2.5 Model Stability Measurement * feat(evaluator): add stability metrics calculation and result output - Add calculate_stability_metrics() method to TaskOutput class Groups scores by question and computes EA, CA, IV, CR when repeats > 1 - Update consolidate_results() to include stability metrics in output The metrics are only computed when num_samples > 1 (k-samples mode). * feat(evaluator): enable k-samples mode with task repeats override - Override task repeats with num_samples when n > 1 for stability measurement - Call calculate_stability_metrics() after aggregate metric calculation When --num_samples is set > 1, the evaluator runs each question k times to measure model consistency and stability. * feat(cli): add --num_samples/-n parameter for model stability measurement Add CLI argument to enable k-samples mode: -n, --num_samples: Number of samples per question (default: 1) When n > 1, enables k-samples mode and computes stability metrics (EA, CA, IV, CR) to measure model consistency. 
Usage: lmms-eval --model xxx --tasks xxx -n 5 * feat(output): display stability metrics (EA, CA, IV, CR) in results table - Add EA, CA, IV, CR columns to make_table() output - Skip stability metric variants in main metric loop (shown as columns) Example output:
|Task|Metric|Value|Stderr|Stderr_CLT|Stderr_Clustered|EA |CA |IV |CR |
|----|------|-----|------|----------|----------------|----|----|----|----|
|mme |score |85.0 |N/A |0.0435 |0.0512 |0.80|0.82|0.05|0.75|
* fix(lint): use specific exception type instead of bare except * feat(tui): improve UI with syntax highlighting and collapsible sections - Add shell syntax highlighting for command preview and env vars editor - Add ANSI color code parsing for log output - Make Tasks and Environment Variables sections collapsible - Unify typography with monospace font and consistent sizing - Add search functionality to Select dropdown component - Remove broken INVERT button - Add group collapse/expand controls in task list - Make log output maximizable * fix: apply black and isort formatting * docs: add TUI screenshots * docs: add log streaming screenshot * fix: make Web UI responsive to larger viewports * refactor: centralize optional import handling with imports module Add lmms_eval/imports.py with unified utilities for optional dependencies: - `optional_import()` - returns (value, is_available) tuple for graceful fallback - `require_package()` - raises MissingOptionalDependencyError with install instructions - `is_package_available()` - cached package availability check - `make_lazy_getattr()` - factory for lazy module-level imports Updates all model files to use the new utilities instead of scattered try/except ImportError blocks. This provides consistent error messages with install instructions and reduces boilerplate. Also applies code formatting fixes across affected files.
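The four stability metrics defined in the k-samples commits above (EA: mean accuracy over all k samples; CA: accuracy of the majority-voted answer; IV: average within-question variance; CR: fraction of questions with identical answers) can be sketched as a pure function over per-question samples. A hedged sketch: the real implementation lives in `TaskOutput.calculate_stability_metrics`, and the `runs` shape here is assumed for illustration.

```python
from collections import Counter
from statistics import mean, pvariance


def stability_metrics(runs):
    """Compute EA/CA/IV/CR from k repeated runs per question.

    `runs` maps question id -> list of (answer, score) pairs, one per
    sample. Shapes and names are illustrative, not lmms-eval internals.
    """
    ea = mean(s for pairs in runs.values() for _, s in pairs)
    ca_hits, consistent, variances = 0, 0, []
    for pairs in runs.values():
        answers = [a for a, _ in pairs]
        majority = Counter(answers).most_common(1)[0][0]
        # Score of the majority-voted answer (last occurrence's score).
        ca_hits += dict(pairs)[majority]
        consistent += all(a == answers[0] for a in answers)
        variances.append(pvariance([s for _, s in pairs]))
    n = len(runs)
    return {"EA": ea, "CA": ca_hits / n, "IV": mean(variances), "CR": consistent / n}
```

CA above EA indicates majority voting recovers accuracy lost to sampling noise, while a low CR flags questions the model answers inconsistently.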
* Revert "refactor: centralize optional import handling with imports module" This reverts commit 516ca914b1f39b69ef27ef36070e70a34c70e7c2. * fix: suppress unused variable warning in protocol.py * fix: remove unrelated files and fix TUI import path - Remove accidentally included core.py and search.py (BM25 search engine files) - Fix import from lmms_eval.tui.app to lmms_eval.tui.cli in __main__.py * fix: correct stability metrics calculation for k-samples mode - Add per_sample_metrics field to TaskOutput for storing per-sample scores - Compute individual sample scores by calling process_results for each sample - Update calculate_stability_metrics to use per_sample_metrics - Add proper multi-GPU gathering for per_sample_metrics - Add warning logs for unknown score types during extraction * fix: convert unconditional imports to optional imports - vllm_generate.py: use optional_import for qwen_vl_utils - sglang.py: add try/except for qwen_vl_utils with fallback - huggingface.py: add try/except for decord, set process_vision_info=None on failure * fix: resolve lint issues - Fix undefined 'logging' -> 'eval_logger' in __main__.py - Remove unused imports across codebase (auto-fixed by ruff) - Fix f-strings without placeholders * style: fix formatting with black and isort * chore: remove
accidentally committed .opencode/ files and add to gitignore * refactor(tools): clean up obsolete scripts and add CLI interface (#1013) * refactor(tools): clean up obsolete scripts and add CLI interface Remove one-off dataset creation scripts that contained hardcoded paths and were no longer useful: - get_video_avg_time.py (internal analysis) - make_vatex.py (one-time HF upload) - make_video_hf_dataset_from_json.py (contained pdb.set_trace()) - make_audio_hf_dataset.ipynb (minimal template) - make_video_hf_dataset.ipynb (hardcoded paths) - makecvrr.ipynb (missing dependencies) Refactor get_split_zip.py to be a proper CLI tool with argparse, supporting configurable max file size with human-readable units. Add tools/README.md documenting the remaining utilities. * refactor(tools): remove lite/ and live_bench/ modules These modules are archived and can be found in the main branch if needed. * style: fix black formatting * fix(evaluator): resolve model import issue for inference Fixed import path to use direct models module reference instead of lmms_eval.models prefix, resolving inference failures in CI/CD tests. * docs: update v0.6 roadmap to reflect implementation status Remove Lance storage details (not implemented), add lmms-engine#127 reference for HTTP server, mark frontier evaluation section as TODO. * fix(tui): fix Ctrl+C handling to properly stop server Simplified signal handling by removing complex cleanup logic and process groups. Direct subprocess.wait() now allows Ctrl+C to properly terminate the uvicorn server. * fix(webui): enable scrolling in log output panel Added min-h-0 to the log output panel container to fix flexbox overflow issue. Without this, the flex-1 container defaults to min-height: auto which prevents overflow-auto from working properly, making the scrollbar non-functional. 
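The get_split_zip.py refactor above adds a "configurable max file size with human-readable units" via argparse. A hedged sketch of what such a flag could look like; the actual flag names, defaults, and units table in tools/get_split_zip.py may differ.

```python
import argparse
import re

# Binary units; whether the real tool uses 1000- or 1024-based units is assumed.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def parse_size(text):
    """Parse a human-readable size like '500MB' or '1.5 KB' into bytes."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([KMG]?B)?", text.strip().upper())
    if not m:
        raise argparse.ArgumentTypeError(f"invalid size: {text!r}")
    value, unit = float(m.group(1)), m.group(2) or "B"
    return int(value * _UNITS[unit])


parser = argparse.ArgumentParser(description="Split a zip into size-capped parts")
parser.add_argument("--max-size", type=parse_size, default="2GB",
                    help="maximum part size, e.g. 500MB or 2GB")
```

Note that argparse applies the `type` callable to string defaults too, so `default="2GB"` arrives in `args.max_size` already converted to bytes.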
* chore(webui): update build artifacts for scrolling fix * Fix import error for lmms_eval.models --------- Co-authored-by: mwxely <zuhao@ualberta.ca> Co-authored-by: kcz358 <kaichenzhang358@outlook.com> * feat: add reasoning task versions for multiple benchmarks (#1038) * feat(ai2d): add reasoning task version with thinking and answer tags * feat(realworldqa): add reasoning task version with thinking and answer tags * feat(mmbench): add en reasoning task version with thinking and answer tags * feat(phyx): add reasoning task versions with thinking and answer tags * feat(ocrbench_v2): add reasoning task version with thinking and answer tags * fix: resolve bugs in ocrbench_v2 and phyx reasoning tasks * feat(olympiadbench): add reasoning task versions with thinking and answer tags * fix: correct dataset_name indentation in olympiadbench reasoning yaml files * Update BLINK benchmark link in current_tasks.md (#1036) The GitHub page for Blink has been changed. * fix: support for `partial` used in vsibench metric calculation. (#1041) * add kris_bench task (#1017) * add kris_bench task * style: format prepare_dataset.py with black * kris update * kris_update * update kris_bench * feat: add WenetSpeech test_net split for evaluation (#1027) Add wenet_speech_test_net task to evaluate on the test_net split of the WenetSpeech dataset. The dataset has three splits available: - dev (already supported) - test_meeting (already supported) - test_net (now added) This completes the coverage of all WenetSpeech evaluation splits. * feat(tasks): add MMVP task with ground truth corrections (#1028) * feat(tasks): add MMVP task with ground truth corrections Add MMVP (Multimodal Visual Patterns) benchmark task that tests VLMs on CLIP-blind pairs - images perceived as similar by CLIP but with clear visual differences. 
Key features: - Loads dataset from MMVP/MMVP on HuggingFace - Reports both individual accuracy and pair accuracy metrics - Applies verified ground truth corrections for indices 99 and 279 as documented in issue #1018 The pair accuracy metric requires models to correctly answer BOTH questions in each CLIP-blind pair, providing a stricter evaluation of genuine visual understanding. Github-Issue: #1018 * style: apply black formatting to mmvp utils * feat(tasks): add RealUnify benchmark (#1033) * feat(tasks): add RealUnify benchmark for unified multimodal evaluation Add RealUnify benchmark implementation for evaluating bidirectional capability synergy in unified multimodal models (GEU tasks). Tasks implemented: - realunify_mental_tracking: Visual transformation reasoning - realunify_mental_reconstruction: Shuffled image reconstruction - realunify_attentional_focusing: Region attention tasks The benchmark requires parquet data files prepared from the official RealUnify dataset. See README for data preparation instructions. Reference: https://arxiv.org/abs/2509.24897 * style: format realunify files with black * fix(realunify): use lmms-lab-eval/RealUnify HuggingFace dataset Replaced local parquet file requirement with HuggingFace dataset. Dataset uploaded to lmms-lab-eval/RealUnify with separate configs for each task type (mental_tracking, mental_reconstruction, attentional_focusing, cognitive_navigation). * feat(tasks): add Spatial457 benchmark for 6D spatial reasoning (#1031) * feat(tasks): add Spatial457 benchmark for 6D spatial reasoning Add Spatial457 benchmark from CVPR 2025 Highlight paper. 
This diagnostic benchmark evaluates 6D spatial reasoning in large multimodal models with: - 7 question types across 5 difficulty levels (L1-L5) - Multi-object recognition, 2D/3D location, 3D orientation evaluation - JSON-based answer extraction following official implementation - Per-category accuracy reporting Dataset: RyanWW/Spatial457 on HuggingFace Paper: https://arxiv.org/abs/2502.08636 * style: format spatial457 files with black * fix(spatial457): use category from task config instead of question_index inference The original implementation inferred category from question_index thresholds, but HuggingFace datasets don't follow the expected index patterns, causing all subtasks to map to L1_single. Now: - Each subtask YAML passes category via lmms_eval_specific_kwargs - doc_to_text and process_results use category from task config if available - Added docstrings and line length fixes for code quality Also adds missing lmms_eval_specific_kwargs to L5 subtask YAMLs. * style: format spatial457 utils.py with black * fix(spatial457): add trust_remote_code for dataset loading The Spatial457 dataset uses a custom loading script that requires trust_remote_code=True to function properly. * fix(spatial457): add datasets>=4.0 compatibility workaround The Spatial457 HuggingFace dataset uses a custom loading script that is no longer supported in datasets>=4.0. This commit: 1. Adds helper functions to create datasets from JSON files directly 2. Updates README with data preparation instructions 3. Adds load_spatial457_from_local for pre-converted data * fix(spatial457): use lmms-lab-eval/Spatial457 dataset Converted original RyanWW/Spatial457 to standard HuggingFace format and uploaded to lmms-lab-eval org. This fixes datasets>=4.0 compatibility since custom loading scripts are no longer supported. 
Changes: - Update dataset_path to lmms-lab-eval/Spatial457 - Change split from validation to test - Remove trust_remote_code (no longer needed) - Simplify utils.py to use embedded PIL images * style: format spatial457 utils.py with black * feat(tasks): add AuxSolidMath benchmark (#1034) * feat(tasks): add AuxSolidMath benchmark for solid geometry reasoning Add evaluation task for AuxSolidMath benchmark which tests solid geometry reasoning with auxiliary line construction. - 3,018 real-exam solid geometry problems from HuggingFace dataset - Two difficulty splits: test_easy (150) and test_hard (152) - String matching with numerical tolerance for answer evaluation - No external API dependencies for evaluation * style: format auxsolidmath files with black * feat(tasks): add IllusionBench (#1035) * feat(tasks): add IllusionBench for visual illusion understanding Add IllusionBench task to evaluate visual illusion understanding in VLMs. Key features: - Supports 1,041 images with 5,577 QA pairs - True/False and Multiple Choice question types - Three categories: Classic Cognitive, Real Scene, Trap illusions - Accuracy metric aggregated by category and question type Dataset: lmms-lab/IllusionBench (to be uploaded) Paper: https://arxiv.org/abs/2501.00848 Original: MingZhangSJTU/IllusionBench * style: format illusionbench files with black * fix(illusionbench): use lmms-lab-eval/IllusionBench dataset Dataset uploaded to lmms-lab-eval/IllusionBench with 5,357 QA pairs. Updated utils.py to handle missing question_id field. * style(illusionbench): format utils.py with black * feat(tasks): add Uni-MMMU benchmark (#1029) * feat(tasks): add Uni-MMMU benchmark for unified multimodal evaluation Add Uni-MMMU benchmark that evaluates bidirectional synergy between generation and understanding capabilities. 
Includes four subtasks: - jigsaw: 2x2 image puzzle completion - maze: path finding through maze - sliding: sliding puzzle solving - geometry: geometry problem solving with diagrams Dataset requires manual download from Vchitect/Uni-MMMU-Eval. Reference: https://arxiv.org/abs/2510.13759 * style: format uni-mmmu files with black/isort * fix(uni_mmmu): improve geometry scoring and add parsing fallbacks - Replace substring match with normalized exact match for geometry scoring to prevent false positives (e.g., "12" incorrectly matching "120") - Add ast.literal_eval fallback for JSON parsing in maze/sliding tasks - Add tagless JSON list fallback parser for maze/sliding responses - Fix Optional[Dict] type hints for lmms_eval_specific_kwargs parameters * style: format uni_mmmu utils.py with black/isort * fix(uni_mmmu): use lmms-lab-eval/UniMMMU HuggingFace dataset Replace local parquet files with proper HF dataset at lmms-lab-eval/UniMMMU. Dataset includes 4 configs: jigsaw (150), maze (150), sliding (84), geometry (140). No longer requires manual data download and UNI_MMMU_DATA_DIR env variable. * feat(tasks): add Geometry3K benchmark for geometry problem solving (#1030) Add evaluation task for the Geometry3K dataset, a benchmark with 3,002 high school geometry multi-choice problems combining text descriptions and diagrams. - 4-choice multiple choice format (A/B/C/D) - Uses Yang130/geometry3k_4choices_mixed on HuggingFace - Includes answer extraction with multiple fallback patterns - Based on Inter-GPS paper (ACL 2021) * fix: replace hardcoded .cuda() with .to(self._device) for multi-GPU support (#1024) * fix: replace hardcoded .cuda() with .to(self._device) for multi-GPU support Replace hardcoded .cuda() calls with .to(self._device) to support proper device placement when using multi-GPU setups or non-default CUDA devices. 
Models fixed: - llava_onevision - llava_vid - vila - longva - internvl - internvl2 - internvideo2 - internvideo2_5 - auroracap This enables running evaluations on specific GPUs (e.g., cuda:1) instead of always defaulting to cuda:0. Github-Issue: #952 Github-Issue: #845 * style: format model files with black * fix: add resource cleanup in video loaders to prevent memory leaks (#1026) * fix: add resource cleanup in video loaders to prevent memory leaks - Add 'del vr' after using decord VideoReader in load_video_decord() - Add try/finally with container.close() in read_video_pyav() - Replace bare 'except:' with 'except Exception:' for better practices These changes prevent file handle and memory leaks when processing many videos during evaluation. * style: format load_video.py with black * [Model]: add InternVL-HF model support (#1039) Add InternVL-HF model support for multi-modal evaluation, specifically for InternVL3-HF and InternVL3.5-HF. The adaptation leverages the implementation of the InternVL model using the HuggingFace transformers library. Supports both image and video inputs. The implementation is designed to be forward-compatible with future InternVL1.0-2.5 hf-format weights if provided officially. 
- Integrates with HF transformers library for InternVL model inference - Maintains compatibility with existing image and video processing pipelines - Uses class name `InternVLHf` for version-agnostic design - Ready for potential official release of InternVL1.0-2.5 hf-format weights * Refine StructEditBench utils logic to be simpler (#1044) * refine struct edit bench utils logic * refine struct edit bench utils logic * add dependency for reasoning tasks (#1048) * [Task] add PAIBench-U (#1050) * [Task] add PAIBench-U * fix: address reviewer comments * fix: format code style (lint) * [Task] Add MMSI-Video-Bench (#1053) * feat(tasks): Add MMSI-Video-Bench * precommit * style: precommit fix * docs: restructure README with HTTP eval server and custom integration guides (#1052) * docs: restructure README with HTTP eval server and custom integration guides - Simplify examples section to 5 essential models (vLLM, SGLang, OpenAI-compatible, Qwen2.5-VL, Qwen3-VL) - Add comprehensive HTTP Evaluation Server section with client usage and API endpoints - Add Custom Model Integration guide explaining chat vs simple model types - Add Custom Dataset Integration guide with doc_to_messages and cluster_key - Improve documentation structure and code examples * lint * Add MMSearch-Plus (#1054) * [Model] Add Audio Flamingo 3 and Kimi Audio (#1055) * [Model] add Kimi-Audio-7B model support Add support for Kimi-Audio-7B-Instruct audio understanding model from MoonshotAI. Features: - Audio-to-text understanding for tasks like ASR, audio QA - Based on kimia_infer.api.kimia for model loading and inference - Supports audio input via temporary WAV file conversion - Compatible with MMAU benchmark and other audio evaluation tasks Usage:
python -m lmms_eval --model kimi_audio \
  --model_args pretrained=moonshotai/Kimi-Audio-7B-Instruct \
  --tasks mmau --device cuda:0
* [Model] add Audio-Flamingo-3 model support Add support for NVIDIA's Audio-Flamingo-3 model for audio understanding tasks.
The model uses HuggingFace's AudioFlamingo3ForConditionalGeneration and supports the standard lmms-eval interface for evaluation on audio benchmarks like MMAU. * fix Audio-Flamingo-3 tensor dimension mismatch - Process conversations individually instead of batching - Put text before audio in conversation format (matching official example) - Fix causes tensor error with variable-length audio inputs * fix Kimi-Audio message order Put text before audio in conversation format (matching official example). This, combined with downgrading transformers to 4.57.1, fixes generation errors. * Lint * lint * [Task] Add mmar benchmark (#1057) * add mmar benchmark Add MMAR (Massive Multi-disciplinary Audio Reasoning) benchmark with: - 1,000 audio-question-answer triplets - 4 reasoning layers: Signal, Perception, Semantic, Cultural - 7 audio modalities including mixed scenarios - MCQ format with accuracy metric - Category and modality-based result aggregation * lint * fix mmar mcq response parsing with mmmu-style approach Use robust parsing logic adapted from MMMU evaluation: - Check multiple patterns: (A), A , A. - Fall back to content matching for long responses - Take last occurrence when multiple candidates found - Add get_multi_choice_info for index2ans mapping * lint * Update README.md * [Model] Add Uni-MoE-2.0-Omni and Baichuan-Omni-1d5 (#1059) * [Model]: Add Baichuan-Omni-1.5 and Uni-MoE-2.0-Omni model support Add two omni-modal models to the lmms-eval framework: - Baichuan-Omni-1.5 (baichuan-inc/Baichuan-Omni-1d5): End-to-end trained omni-modal large model supporting text, image, video, and audio inputs with text and audio outputs. - Uni-MoE-2.0-Omni (HIT-TMG/Uni-MoE-2.0-Omni): Multimodal mixture-of-experts model supporting image, video, and audio understanding. Both models can be tested with omni_bench and similar multimodal benchmarks. 
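The MMAR MCQ parsing commit above spells out its heuristics: look for explicit letter patterns like `(A)`, `A.`, or a standalone `A`; for long responses with no explicit letter, fall back to matching the choice text; and take the last occurrence when several candidates match. A hedged sketch of that logic; the actual MMMU-derived parser in the repo is more involved.

```python
import re


def parse_mcq_response(response, choices):
    """Extract a multiple-choice letter from a free-form model response.

    Implements the heuristics described above in simplified form;
    illustrative, not the exact lmms-eval parsing code.
    """
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    candidates = []
    for letter in letters:
        for pattern in (rf"\({letter}\)", rf"\b{letter}\.", rf"\b{letter}\b"):
            for m in re.finditer(pattern, response):
                candidates.append((m.start(), letter))
    if not candidates and len(response.split()) > 5:
        # Long answer with no explicit letter: match choice contents instead.
        for letter, choice in zip(letters, choices):
            idx = response.lower().rfind(choice.lower())
            if idx != -1:
                candidates.append((idx, letter))
    if not candidates:
        return None
    return max(candidates)[1]  # last occurrence wins
```

Taking the last occurrence matters for chain-of-thought outputs, where a model may mention several options before committing to its final answer.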
* Fix Uni-MoE-2.0-Omni single-GPU inference by importing DeepSpeed MoE patch The uni_moe package uses DeepSpeed MoE which requires expert parallelism groups (ep_group) to be initialized. When running on a single GPU without distributed training, this causes errors in _AllToAll operations. The uni_moe package provides a deepspeed_moe_inference_utils module that patches DeepSpeed to disable AllToAll communication, allowing single-GPU inference. This import must happen before the model classes are imported. * lint * Fix Baichuan omni * [Fix] Add dynamic max_num calculation to InternVL3 to align with VLMEvalKit (#1069) * feat: add dynamic max num to align with vlmevalkit's implementation * style: apply pre-commit formatting for lint --------- Co-authored-by: Bo Li <drluodian@gmail.com> * [Task] Added OSI-bench (#1068) * [task] Add OSI-Bench * add visual_first flag * precommit * style: fix lint formatting on osi-bench branch * style: format mmsearch_plus files for lint --------- Co-authored-by: Bo Li <drluodian@gmail.com> * [Model] Add GLM4V and LLaMA 4 (#1056) * add GLM-4.6V model * add Llama-4-Scout model * fix padding for batch_size > 1 in GLM4V and Llama4Scout Add padding=True to processor.apply_chat_template() in all model files to fix tensor shape mismatch when using batch_size > 1. 
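The "dynamic max_num" fix above adjusts InternVL3's per-image tile cap depending on the request, as VLMEvalKit does. One plausible shape for that idea is dividing a fixed total tile budget across the images in a request; this is an assumption-laden sketch, and the constants and exact behaviour are not VLMEvalKit's actual values.

```python
def dynamic_max_num(num_images, total_budget=64, upper=12, lower=1):
    """Per-image tile cap when several images share one context window.

    Hypothetical helper: splits `total_budget` tiles evenly across images,
    clamped to [lower, upper]. Budget and bounds are illustrative.
    """
    per_image = total_budget // max(num_images, 1)
    return max(lower, min(upper, per_image))
```

The effect is that a single image may be tiled finely (up to `upper` tiles), while a many-image request degrades gracefully to one tile per image instead of overflowing the context.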
* Commit all staged changes: modified and deleted files --------- Co-authored-by: Bo Li <drluodian@gmail.com> * [Task] Add PRISMM-Bench (#1063) * Added default task * Added edit & pair match task * Updated message format to match original benchmark setting * Update task and split names * Added whole doc and whole page tasks; Added prismmbench extras in pyproject * Added entry to global task list in docs * Ran pre-commit * Refactored into _default_template_yaml --------- Co-authored-by: Bo Li <drluodian@gmail.com> * Add --offset option (#1042) * Add --offset option * style: format main entrypoint for lint --------- Co-authored-by: Bo Li <drluodian@gmail.com> * [Model] Add OmniVinci and MiniCPM-o-2_6 (#1060) * [Model]: Add OmniVinci and MiniCPM-o-2.6 omni model support Add support for two new omni-modal models: 1. OmniVinci (nvidia/omnivinci) - NVIDIA's omni-modal LLM for vision, audio, and language understanding - Built on VILA codebase - Requires specific NVILA environment setup 2. MiniCPM-o-2.6 (openbmb/MiniCPM-o-2_6) - GPT-4o level MLLM for vision, speech, and multimodal streaming - Requires transformers==4.44.2 - Uses model.chat() interface for inference Both models support: - Image, video, and audio inputs - Omni-modal benchmarks (omnibench) - Distributed inference via accelerate Note: …
stisiTT
pushed a commit
to bgoelTT/lmms-eval
that referenced
this pull request
Mar 6, 2026
…b#906) * Add try catch for longvila * Fix gqa doc id key error for more robustness * Revise PR template * Lint
Before you open a pull request, please check whether a similar issue already exists or has been closed before.
When you open a pull request, please be sure to include the following:
If you run into lint warnings, you can use the following scripts to reformat the code.
Thank you for your contributions!