
refactor(models/chat): improve async_openai code structure and readability#1102

Merged
Luodian merged 6 commits into main from rfc/async_oai on Feb 19, 2026

Conversation

@kcz358 (Collaborator) commented Feb 19, 2026

Summary

  • Extract prepare_messages() method to separate message formatting logic, allowing subclasses to override
  • Refactor complex generate_until() concurrency control from 130 lines into focused helper methods
  • Add comprehensive docstrings to all new methods and classes
  • Create new async_openai_qwen3_vl.py model that inherits from AsyncOpenAIChat and overrides only prepare_messages()
  • Register async_openai_qwen3_vl in model registry

Changes

Code Structure Improvements

  • _AdaptiveConcurrencyTracker: Dataclass for tracking adaptive concurrency statistics
  • _get_initial_concurrency(): Calculate initial concurrency level
  • _compute_dispatch_order(): Compute request dispatch order with prefix-aware sorting
  • _process_with_retry(): Handle single request execution with retry logic
  • _should_update_concurrency(): Determine when to update concurrency
  • _update_concurrency(): Update concurrency based on tracked statistics
  • _run_scheduling_loop(): Main async scheduling loop

Simplified generate_until()

The run() inner function is now only 8 lines, making the overall flow much clearer:

async def run():
    pbar = tqdm(...)
    current_concurrency = self._get_initial_concurrency()
    dispatch_order = self._compute_dispatch_order(requests)
    res = await self._run_scheduling_loop(requests, dispatch_order, pbar, current_concurrency)
    pbar.close()
    return res
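For illustration, a retry helper in the spirit of `_process_with_retry()` could look like this. The function name, signature, and backoff policy are assumptions for the sketch; the PR's actual internals are not shown in this page:

```python
import asyncio
import random


async def process_with_retry(coro_factory, max_retries: int = 3, base_delay: float = 1.0):
    """Run one request coroutine, retrying with exponential backoff plus jitter.

    Hypothetical sketch of a retry wrapper; not the PR's actual code.
    """
    for attempt in range(max_retries + 1):
        try:
            # coro_factory creates a fresh coroutine per attempt (a coroutine
            # object cannot be awaited twice).
            return await coro_factory()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the last error
            # Exponential backoff: 1s, 2s, 4s, ... plus random jitter.
            delay = base_delay * (2**attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```

Keeping retry logic in one awaitable lets `_run_scheduling_loop()` stay focused on dispatch order and concurrency limits.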

New Model: async_openai_qwen3_vl

  • Inherits from AsyncOpenAIChat with minimal code duplication
  • Overrides only prepare_messages() to use to_qwen3_vl_openai_messages()
  • Maintains all async concurrency control and retry logic from parent class

Files Changed

  • lmms_eval/models/chat/async_openai.py: Refactored with new helper methods (+244, -127)
  • lmms_eval/models/chat/async_openai_qwen3_vl.py: New file (+35 lines)
  • lmms_eval/models/__init__.py: Registered new model (+1 line)
  • .gitignore: Ignore scripts directory

Testing

Verified with --limit 5 test runs on both async_openai and async_openai_qwen3_vl models.

xinjiezhang and others added 4 commits February 19, 2026 17:31
…strings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring
…ameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry
@Luodian (Contributor) commented Feb 19, 2026

Review & Changes Pushed

The refactoring of async_openai.py (extracting helper methods, _AdaptiveConcurrencyTracker, etc.) is solid — nice work cleaning up generate_until().

I've pushed a commit on top of your branch that replaces the separate async_openai_qwen3_vl model class with a message_format parameter on AsyncOpenAIChat. Here's the reasoning:

Why message_format parameter instead of a separate class

The AsyncOpenAIQwen3VLChat class only overrides prepare_messages() to swap to_openai_messages() for to_qwen3_vl_openai_messages(). That's a one-line behavioral difference. Creating a new file + class + registry entry for this has a scalability problem: when Qwen4-VL or another model needs its own message format, we'd need async_openai_qwen4_vl.py, async_openai_gemini.py, etc. — class explosion for what is fundamentally a configuration choice.

The new approach:

# Before (separate model):
python -m lmms_eval --model async_openai_qwen3_vl --model_args pretrained=...

# After (same model, parameterized):
python -m lmms_eval --model async_openai --model_args pretrained=...,message_format=qwen3_vl

Adding a new format in the future is just adding an elif in prepare_messages() — no new files, no registry changes.
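A minimal sketch of that dispatch, with stand-in converter functions (the converter names come from this PR's description, but their bodies and the class shape here are placeholders, not the real `AsyncOpenAIChat`):

```python
def to_openai_messages(messages):
    # Stand-in for the real converter; returns messages unchanged.
    return messages


def to_qwen3_vl_openai_messages(messages):
    # Stand-in for the real converter; tags each message for demonstration.
    return [{"format": "qwen3_vl", **m} for m in messages]


class AsyncOpenAIChatSketch:
    """Toy model class showing message_format dispatch in prepare_messages()."""

    def __init__(self, message_format: str = "openai"):
        if message_format not in ("openai", "qwen3_vl"):
            raise ValueError(f"Unknown message_format: {message_format}")
        self.message_format = message_format

    def prepare_messages(self, messages):
        # Adding a new format later is just another branch here.
        if self.message_format == "qwen3_vl":
            return to_qwen3_vl_openai_messages(messages)
        return to_openai_messages(messages)
```

The configuration knob lives in `--model_args`, so every future format reuses the same registry entry and concurrency machinery.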

Other fixes in this commit

  1. f-string bug: all_response += "</{call.function.name}>" was missing the f prefix — it concatenated the literal string </{call.function.name}> instead of the variable value. Fixed to f"</{call.function.name}>".

  2. DRY violation: Both the parent and child class duplicated video_kwargs construction. Extracted _build_video_kwargs() so prepare_messages() stays focused on format dispatch.

  3. .gitignore: Removed duplicate scripts entry (scripts/ already covers it).
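The f-string bug in item 1 is easy to reproduce in isolation. `_Fn` and `_Call` below are stand-ins for the real tool-call objects:

```python
class _Fn:
    name = "get_weather"


class _Call:
    function = _Fn()


call = _Call()

# Without the f prefix, the braces are literal characters:
plain = "</{call.function.name}>"
# With the f prefix, the expression inside the braces is evaluated:
fixed = f"</{call.function.name}>"

assert plain == "</{call.function.name}>"
assert fixed == "</get_weather>"
```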

xinjiezhang added 2 commits February 19, 2026 20:53
- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags
@Luodian Luodian merged commit 17124b0 into main Feb 19, 2026
2 checks passed
kcz358 added a commit that referenced this pull request Feb 23, 2026
…ility (#1102)

* refactor(models/chat): extract prepare_messages method

* refactor(models/chat): refactor async concurrency control and add docstrings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring

* style: auto-fix lint (black + isort)

* refactor: replace async_openai_qwen3_vl class with message_format parameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry

* refactor(models/chat): add message_format parameter to support qwen3_vl

- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags

* fix tool response tag format

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>
@Luodian Luodian deleted the rfc/async_oai branch February 23, 2026 08:26
kcz358 added a commit that referenced this pull request Feb 27, 2026
Luodian added a commit that referenced this pull request Feb 27, 2026
* refactor(models/chat): improve async_openai code structure and readability (#1102)

* refactor(models/chat): extract prepare_messages method

* refactor(models/chat): refactor async concurrency control and add docstrings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring

* style: auto-fix lint (black + isort)

* refactor: replace async_openai_qwen3_vl class with message_format parameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry

* refactor(models/chat): add message_format parameter to support qwen3_vl

- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags

* fix tool response tag format

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>

* docs: add MMMU eval discrepancy report and TLDR FP definitions

* fix(ci): make lint workflow fork-PR safe

* feat(tasks): add MMStar reasoning task

* refactor(tasks): merge cn and en reasoning into unified structure

- Combine cn_reasoning and en_reasoning into single reasoning directory
- Share common template yaml across both cn and en reasoning tasks
- Unified utils.py handles cn/en via DATASET_NAME environment variable
- Keep separate group files for mmbench_cn_reasoning and mmbench_en_reasoning

* refactor(tasks): unify cn and en reasoning with single group

- Remove environment variable dependency
- Add separate doc_to_text/doc_to_messages for cn and en in utils.py
- Template yaml shared, specific functions defined in task yaml
- Single mmbench_reasoning group containing both cn and en dev tasks
- Unified process_results without data_source distinction

* fix(tasks): add dataset_name to reasoning task configs

* feat(tasks): add test split for mmbench reasoning tasks

- Add mmbench_cn_test_reasoning and mmbench_en_test_reasoning
- Add test_split to dev reasoning configs
- Update mmbench_reasoning group to include all four tasks

* feat(tasks): add MME-RealWorld reasoning tasks

- Add mme_realworld_reasoning (en) and mme_realworld_cn_reasoning (cn)
- Include doc_to_messages for both languages with reasoning prompts
- Support accuracy and format scoring metrics

* feat(tasks): add SEED-Bench reasoning tasks

- Add seedbench_reasoning with doc_to_messages for reasoning format
- Add seedbench_2_plus_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics for both benchmarks

* feat(tasks): add CV-Bench reasoning tasks

- Add cv_bench_reasoning, cv_bench_2d_reasoning, cv_bench_3d_reasoning
- Include doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* fix(reasoning): improve mcq matching with normalize comparison

- Apply parse_mcq to ground_truth for consistency
- Use case-insensitive comparison for MCQ answers
- Strip whitespace for more robust matching

* feat(tasks): add OCR-Bench reasoning task

- Add ocrbench_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add ChartQA reasoning task

- Add chartqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add InfoVQA reasoning task

- Add infovqa_val_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA reasoning task

- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA benchmark

- Add countbenchqa task config and utils
- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add VStar-Bench reasoning tasks

- Add vstar_bench_reasoning with doc_to_messages for reasoning format
- Add vstar_bench_direct_attributes_reasoning
- Add vstar_bench_relative_position_reasoning
- Support accuracy and format scoring metrics

* feat(tasks): add PixMo-Count benchmark

- Add pixmo_count task config and utils
- Add pixmo_count_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(models): add system_prompt_file support to AsyncOpenAIChat

- Allow loading system prompt from file via system_prompt_file parameter
- Add _apply_system_prompt method to inject system prompt into messages
- Apply system prompt before generation in generate_until

* style: auto-fix lint (black + isort)

* refactor(reasoning): extract acc_score computation to separate function

Extracted accuracy reward logic from compute_score into acc_reward function
for better separation of concerns.

* Fix async oai rebase error

* Lint

* refactor(reasoning): add model-side system_prompt support and deduplicate reasoning task utils

- Add _resolve_system_prompt() and _apply_system_prompt() to base lmms class
  for model-side system prompt injection (supports file paths and literal strings)
- Add factory functions make_reasoning_doc_to_messages() and
  make_reasoning_process_results() to reasoning_utils.py, eliminating ~400 lines
  of copy-paste across 12 reasoning task modules
- Update AsyncOpenAIChat: replace system_prompt_file with system_prompt using
  base class utilities, remove duplicate _apply_system_prompt method
- Wire up HuggingFace chat model to inject system_prompt into messages during
  generation (opt-in only, default None to avoid overwriting task-level prompts)
- Fix infovqa reasoning: anls(ground_truth, results) -> anls(ground_truth, [extracted])
- Fix mmbench reasoning: cache YAML parsing with @lru_cache instead of per-sample I/O
- Fix format_reward() to also match <analysis>...</analysis> tag pattern
- Expand --reasoning_tags default to include <analysis> tags

* fix(ci): restore task_input_specs/redundancy_refactor.yaml deleted by bff123c

* fix: remove duplicate --reasoning_tags CLI argument

* docs: restore docs/README.md from dev-v0d7

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>
Luodian added a commit that referenced this pull request Feb 28, 2026
Luodian added a commit that referenced this pull request Feb 28, 2026