
refactor(models/chat): improve async_openai code structure and readability#1102

Merged
Luodian merged 6 commits into main from rfc/async_oai on Feb 19, 2026

Conversation

@kcz358 (Collaborator) commented Feb 19, 2026

Summary

  • Extract prepare_messages() method to separate message formatting logic, allowing subclasses to override
  • Refactor complex generate_until() concurrency control from 130 lines into focused helper methods
  • Add comprehensive docstrings to all new methods and classes
  • Create new async_openai_qwen3_vl.py model that inherits from AsyncOpenAIChat and overrides only prepare_messages()
  • Register async_openai_qwen3_vl in model registry

Changes

Code Structure Improvements

  • _AdaptiveConcurrencyTracker: Dataclass for tracking adaptive concurrency statistics
  • _get_initial_concurrency(): Calculate initial concurrency level
  • _compute_dispatch_order(): Compute request dispatch order with prefix-aware sorting
  • _process_with_retry(): Handle single request execution with retry logic
  • _should_update_concurrency(): Determine when to update concurrency
  • _update_concurrency(): Update concurrency based on tracked statistics
  • _run_scheduling_loop(): Main async scheduling loop

Simplified generate_until()

The run() inner function is now only 8 lines, making the overall flow much clearer:

async def run():
    pbar = tqdm(...)
    current_concurrency = self._get_initial_concurrency()
    dispatch_order = self._compute_dispatch_order(requests)
    res = await self._run_scheduling_loop(requests, dispatch_order, pbar, current_concurrency)
    pbar.close()
    return res
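For illustration, a retry helper in the spirit of `_process_with_retry()` could look like this. The function name, signature, and backoff policy are assumptions for the sketch; the PR's actual internals are not shown in this page:

```python
import asyncio
import random


async def process_with_retry(coro_factory, max_retries: int = 3, base_delay: float = 1.0):
    """Run one request coroutine, retrying with exponential backoff plus jitter.

    Hypothetical sketch of a retry wrapper; not the PR's actual code.
    """
    for attempt in range(max_retries + 1):
        try:
            # coro_factory creates a fresh coroutine per attempt (a coroutine
            # object cannot be awaited twice).
            return await coro_factory()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: surface the last error
            # Exponential backoff: 1s, 2s, 4s, ... plus random jitter.
            delay = base_delay * (2**attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```

Keeping retry logic in one awaitable lets `_run_scheduling_loop()` stay focused on dispatch order and concurrency limits.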

New Model: async_openai_qwen3_vl

  • Inherits from AsyncOpenAIChat with minimal code duplication
  • Overrides only prepare_messages() to use to_qwen3_vl_openai_messages()
  • Maintains all async concurrency control and retry logic from parent class

Files Changed

  • lmms_eval/models/chat/async_openai.py: Refactored with new helper methods (+244, -127)
  • lmms_eval/models/chat/async_openai_qwen3_vl.py: New file (+35 lines)
  • lmms_eval/models/__init__.py: Registered new model (+1 line)
  • .gitignore: Ignore scripts directory

Testing

Verified with --limit 5 test runs on both async_openai and async_openai_qwen3_vl models.

xinjiezhang and others added 4 commits February 19, 2026 17:31
…strings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring
…ameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry
@Luodian (Contributor) commented Feb 19, 2026

Review & Changes Pushed

The refactoring of async_openai.py (extracting helper methods, _AdaptiveConcurrencyTracker, etc.) is solid — nice work cleaning up generate_until().

I've pushed a commit on top of your branch that replaces the separate async_openai_qwen3_vl model class with a message_format parameter on AsyncOpenAIChat. Here's the reasoning:

Why message_format parameter instead of a separate class

The AsyncOpenAIQwen3VLChat class only overrides prepare_messages() to swap to_openai_messages() for to_qwen3_vl_openai_messages(). That's a one-line behavioral difference. Creating a new file + class + registry entry for this has a scalability problem: when Qwen4-VL or another model needs its own message format, we'd need async_openai_qwen4_vl.py, async_openai_gemini.py, etc. — class explosion for what is fundamentally a configuration choice.

The new approach:

# Before (separate model):
python -m lmms_eval --model async_openai_qwen3_vl --model_args pretrained=...

# After (same model, parameterized):
python -m lmms_eval --model async_openai --model_args pretrained=...,message_format=qwen3_vl

Adding a new format in the future is just adding an elif in prepare_messages() — no new files, no registry changes.
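A minimal sketch of that dispatch, with stand-in converter functions (the converter names come from this PR's description, but their bodies and the class shape here are placeholders, not the real `AsyncOpenAIChat`):

```python
def to_openai_messages(messages):
    # Stand-in for the real converter; returns messages unchanged.
    return messages


def to_qwen3_vl_openai_messages(messages):
    # Stand-in for the real converter; tags each message for demonstration.
    return [{"format": "qwen3_vl", **m} for m in messages]


class AsyncOpenAIChatSketch:
    """Toy model class showing message_format dispatch in prepare_messages()."""

    def __init__(self, message_format: str = "openai"):
        if message_format not in ("openai", "qwen3_vl"):
            raise ValueError(f"Unknown message_format: {message_format}")
        self.message_format = message_format

    def prepare_messages(self, messages):
        # Adding a new format later is just another branch here.
        if self.message_format == "qwen3_vl":
            return to_qwen3_vl_openai_messages(messages)
        return to_openai_messages(messages)
```

The configuration knob lives in `--model_args`, so every future format reuses the same registry entry and concurrency machinery.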

Other fixes in this commit

  1. f-string bug: all_response += "</{call.function.name}>" was missing the f prefix — it concatenated the literal string </{call.function.name}> instead of the variable value. Fixed to f"</{call.function.name}>".

  2. DRY violation: Both the parent and child class duplicated video_kwargs construction. Extracted _build_video_kwargs() so prepare_messages() stays focused on format dispatch.

  3. .gitignore: Removed duplicate scripts entry (scripts/ already covers it).
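The f-string bug in item 1 is easy to reproduce in isolation. `_Fn` and `_Call` below are stand-ins for the real tool-call objects:

```python
class _Fn:
    name = "get_weather"


class _Call:
    function = _Fn()


call = _Call()

# Without the f prefix, the braces are literal characters:
plain = "</{call.function.name}>"
# With the f prefix, the expression inside the braces is evaluated:
fixed = f"</{call.function.name}>"

assert plain == "</{call.function.name}>"
assert fixed == "</get_weather>"
```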

xinjiezhang added 2 commits February 19, 2026 20:53
- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags
@Luodian Luodian merged commit 17124b0 into main Feb 19, 2026
2 checks passed
kcz358 added a commit that referenced this pull request Feb 23, 2026
…ility (#1102)

* refactor(models/chat): extract prepare_messages method

* refactor(models/chat): refactor async concurrency control and add docstrings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring

* style: auto-fix lint (black + isort)

* refactor: replace async_openai_qwen3_vl class with message_format parameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry

* refactor(models/chat): add message_format parameter to support qwen3_vl

- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags

* fix tool response tag format

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>
@Luodian Luodian deleted the rfc/async_oai branch February 23, 2026 08:26
kcz358 added a commit that referenced this pull request Feb 27, 2026
Luodian added a commit that referenced this pull request Feb 27, 2026
* refactor(models/chat): improve async_openai code structure and readability (#1102)

* refactor(models/chat): extract prepare_messages method

* refactor(models/chat): refactor async concurrency control and add docstrings

- Extract _AdaptiveConcurrencyTracker for cleaner state management
- Split generate_until's run() into focused helper methods
- Add comprehensive docstrings to all new methods
- Simplify run() from 130 lines to 8 lines
- Update async_openai_qwen3_vl.py with class docstring

* style: auto-fix lint (black + isort)

* refactor: replace async_openai_qwen3_vl class with message_format parameter

- Add message_format param to AsyncOpenAIChat (default='openai', supports 'qwen3_vl')
- Extract _build_video_kwargs() to eliminate DRY violation
- Remove separate async_openai_qwen3_vl.py and its registry entry
- Fix missing f-string prefix in tool response formatting
- Fix duplicate .gitignore entry

* refactor(models/chat): add message_format parameter to support qwen3_vl

- Add message_format parameter to AsyncOpenAIChat
- Support both 'default' and 'qwen3_vl' message formats
- Remove async_openai_qwen3_vl.py (no longer needed)
- Unregister async_openai_qwen3_vl from model registry
- Fix string formatting for tool call tags

* fix tool response tag format

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>

* docs: add MMMU eval discrepancy report and TLDR FP definitions

* fix(ci): make lint workflow fork-PR safe

* feat(tasks): add MMStar reasoning task

* refactor(tasks): merge cn and en reasoning into unified structure

- Combine cn_reasoning and en_reasoning into single reasoning directory
- Share common template yaml across both cn and en reasoning tasks
- Unified utils.py handles cn/en via DATASET_NAME environment variable
- Keep separate group files for mmbench_cn_reasoning and mmbench_en_reasoning

* refactor(tasks): unify cn and en reasoning with single group

- Remove environment variable dependency
- Add separate doc_to_text/doc_to_messages for cn and en in utils.py
- Template yaml shared, specific functions defined in task yaml
- Single mmbench_reasoning group containing both cn and en dev tasks
- Unified process_results without data_source distinction

* fix(tasks): add dataset_name to reasoning task configs

* feat(tasks): add test split for mmbench reasoning tasks

- Add mmbench_cn_test_reasoning and mmbench_en_test_reasoning
- Add test_split to dev reasoning configs
- Update mmbench_reasoning group to include all four tasks

* feat(tasks): add MME-RealWorld reasoning tasks

- Add mme_realworld_reasoning (en) and mme_realworld_cn_reasoning (cn)
- Include doc_to_messages for both languages with reasoning prompts
- Support accuracy and format scoring metrics

* feat(tasks): add SEED-Bench reasoning tasks

- Add seedbench_reasoning with doc_to_messages for reasoning format
- Add seedbench_2_plus_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics for both benchmarks

* feat(tasks): add CV-Bench reasoning tasks

- Add cv_bench_reasoning, cv_bench_2d_reasoning, cv_bench_3d_reasoning
- Include doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* fix(reasoning): improve mcq matching with normalize comparison

- Apply parse_mcq to ground_truth for consistency
- Use case-insensitive comparison for MCQ answers
- Strip whitespace for more robust matching

* feat(tasks): add OCR-Bench reasoning task

- Add ocrbench_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add ChartQA reasoning task

- Add chartqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add InfoVQA reasoning task

- Add infovqa_val_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA reasoning task

- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add CountBenchQA benchmark

- Add countbenchqa task config and utils
- Add countbenchqa_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(tasks): add VStar-Bench reasoning tasks

- Add vstar_bench_reasoning with doc_to_messages for reasoning format
- Add vstar_bench_direct_attributes_reasoning
- Add vstar_bench_relative_position_reasoning
- Support accuracy and format scoring metrics

* feat(tasks): add PixMo-Count benchmark

- Add pixmo_count task config and utils
- Add pixmo_count_reasoning with doc_to_messages for reasoning format
- Support accuracy and format scoring metrics

* feat(models): add system_prompt_file support to AsyncOpenAIChat

- Allow loading system prompt from file via system_prompt_file parameter
- Add _apply_system_prompt method to inject system prompt into messages
- Apply system prompt before generation in generate_until

* style: auto-fix lint (black + isort)

* refactor(reasoning): extract acc_score computation to separate function

Extracted accuracy reward logic from compute_score into acc_reward function
for better separation of concerns.

* Fix async oai rebase error

* Lint

* refactor(reasoning): add model-side system_prompt support and deduplicate reasoning task utils

- Add _resolve_system_prompt() and _apply_system_prompt() to base lmms class
  for model-side system prompt injection (supports file paths and literal strings)
- Add factory functions make_reasoning_doc_to_messages() and
  make_reasoning_process_results() to reasoning_utils.py, eliminating ~400 lines
  of copy-paste across 12 reasoning task modules
- Update AsyncOpenAIChat: replace system_prompt_file with system_prompt using
  base class utilities, remove duplicate _apply_system_prompt method
- Wire up HuggingFace chat model to inject system_prompt into messages during
  generation (opt-in only, default None to avoid overwriting task-level prompts)
- Fix infovqa reasoning: anls(ground_truth, results) -> anls(ground_truth, [extracted])
- Fix mmbench reasoning: cache YAML parsing with @lru_cache instead of per-sample I/O
- Fix format_reward() to also match <analysis>...</analysis> tag pattern
- Expand --reasoning_tags default to include <analysis> tags

* fix(ci): restore task_input_specs/redundancy_refactor.yaml deleted by bff123c

* fix: remove duplicate --reasoning_tags CLI argument

* docs: restore docs/README.md from dev-v0d7

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Bo Li <drluodian@gmail.com>
Luodian added a commit that referenced this pull request Feb 28, 2026
Luodian added a commit that referenced this pull request Feb 28, 2026