
fix: fix MMMU loading issue#11759

Merged
merrymercy merged 22 commits into sgl-project:main from ZailiWang:mmmu-load
Oct 24, 2025

Conversation

@ZailiWang
Contributor

Motivation

#9812 did amazing work. However, it did not port the adaptation for Phi4MM benchmarking contributed in #9900. Another issue is that in DatasetRow construction, the original input parameter text_prompt is used instead of the processed prompt_str. This causes all VLM benchmarks to fail at the prompt-decompose phase on the server side.

Modifications

Fix the mistakes above and make VLM benchmarks usable.
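For illustration, a minimal sketch of the DatasetRow part of the fix. This is not the exact bench_serving code; the class below is a simplified stand-in and its field names are assumptions.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DatasetRow:
    # Simplified stand-in for the real DatasetRow; fields are illustrative.
    prompt: str
    output_len: int
    image_data: Optional[List[str]] = None

def create_mm_data_row(text_prompt, prompt_str, images, output_len):
    # Before the fix, the row was built from the raw text_prompt, so the server
    # could not decompose the prompt (chat template / image tokens were missing).
    # The fix is to use the processed prompt_str instead.
    return DatasetRow(prompt=prompt_str, output_len=output_len, image_data=images)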

Accuracy Tests

N/A

Benchmarking and Profiling

N/A

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @ZailiWang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical issues that were preventing Vision-Language Model (VLM) benchmarks from functioning correctly. It integrates a previously overlooked adaptation for Phi4MM benchmarking and corrects a bug in how prompts are constructed for DatasetRow objects, ensuring that processed prompts are utilized. These changes collectively make VLM benchmarks usable and reliable.

Highlights

  • Phi4MM Benchmarking Adaptation: Implemented the necessary adaptation for Phi4MM benchmarking, which was previously missing, ensuring proper handling of image tokens for Phi-4-Multimodal models.
  • DatasetRow Prompt Correction: Fixed an issue in DatasetRow construction where the raw text_prompt was used instead of the processed prompt_str, resolving VLM benchmark failures during the prompt decomposition phase.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces two important fixes for VLM benchmarking. First, it corrects a bug where the raw text prompt was used instead of the fully processed prompt string when constructing DatasetRow, which resolves a server-side failure. Second, it adds special handling for the Phi4MMProcessor to correctly format prompts for MMMU benchmarking, adapting a change from a previous pull request. While the logic is sound, I've identified a potential issue in the implementation of the processor-specific handling and suggested a more robust approach along with a correction for a likely typo in a special token.
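For context, here is a rough, hypothetical sketch of the kind of Phi4MM-specific prompt formatting being discussed. It assumes Phi-4-Multimodal's numbered image placeholders (<|image_1|>, <|image_2|>, ...) and chat markers; it is not the actual code in this PR.

def format_phi4mm_prompt(question: str, num_images: int) -> str:
    # One numbered placeholder per image, in order, followed by the question.
    image_tokens = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>{image_tokens}{question}<|end|><|assistant|>"

# Example: a single-image MMMU question.
print(format_phi4mm_prompt("Which painting is shown?", num_images=1))
# -> <|user|><|image_1|>Which painting is shown?<|end|><|assistant|>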

ZailiWang and others added 2 commits October 17, 2025 18:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ZailiWang
Contributor Author

Hi @AlienKevin, would you help review the PR? It looks like a typo in #9812, but you said you had verified it.
Anyway, I cannot pass the bench_serving test without my change in this PR.


@AlienKevin (Contributor) left a comment


Hi, thanks for bringing up this PR. Not sure what happened after my PR was merged, but at the time I used it to benchmark Qwen 2.5 VL extensively and it worked.

@vincentzed
Contributor

I think it's here, at the exception? Since it was excepted before. Not sure if this is exactly the error you mean.
https://github.com/sgl-project/sglang/pull/11385/files#diff-00e43baca391ee5f5d8c587be8480acc86944e56adc9b28bf5a24f1dd4314ae3R1294

@ZailiWang
Contributor Author

I think it's here, at the exception? Since it was excepted before. Not sure if this is exactly the error you mean. https://github.com/sgl-project/sglang/pull/11385/files#diff-00e43baca391ee5f5d8c587be8480acc86944e56adc9b28bf5a24f1dd4314ae3R1294

Thanks for the discussion. No, it has nothing to do with this snippet. The error I met is the same as the one described in #9900: the prompt format is not recognized by the server because, in my test, the prompt requires wrapping.

@ZailiWang
Contributor Author

ZailiWang commented Oct 18, 2025

Hi @AlienKevin, I think the current issue is that bench_serving supports both the generate and the OpenAI chat/completions APIs. I was always using generate and was not aware that this script also uses the chat/completions interface.

Looks like an if-else based on the request interface should decide whether the text_prompt is used here. Let me study the code and update the PR.

@AlienKevin
Contributor

@ZailiWang I remember gating the MMMU and image benchmarks to only allow chat completions. For multimodal use cases, I think chat makes more sense because the chat template is already required to pass images into the prompt.
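As an illustration of that point, a multimodal request through the chat/completions API carries the image inside the chat message itself, roughly as in the sketch below. The server URL, port, and model name are placeholders, not values from this PR.

import requests

payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}
# Placeholder endpoint; adjust to your deployment.
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])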

@ZailiWang
Contributor Author

ZailiWang commented Oct 20, 2025

Hi @AlienKevin, I get your point, but I still have the following questions/concerns.

  • I did not encounter the "gating" functionality you mentioned. I was calling generate instead of v1/chat/completions, and no error occurred before #9812 was merged.
  • I understand your point that applying the chat template makes more sense. However, the default backend setting in bench_serving.py is sglang, which calls the generate API. This means that a user who benchmarks a VLM with SGLang's default settings may hit an error that is hard to understand.

I updated the PR to handle both cases, where the prompt wrapping should or should not be applied, in create_mm_data_row(). Would you help verify that this works with your benchmarking approach? Thanks.
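As a rough sketch of that branching (not the exact PR diff; the backend names and tokenizer usage here are assumptions), the client-side decision could look like this:

def build_prompt(text_prompt: str, tokenizer, backend: str) -> str:
    # Decide on the client side whether the chat template is applied.
    if backend in ("sglang-oai", "sglang-oai-chat"):
        # OpenAI-style backends: the chat endpoint applies the template
        # server-side, so keep the raw prompt text here.
        return text_prompt
    # generate-style backends (e.g. the default sglang): wrap the prompt
    # on the client side so the server receives a recognizable format.
    messages = [{"role": "user", "content": text_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )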

@AlienKevin
Contributor

I had assert checks for apply_chat_template, which a maintainer later changed into a default setting in commit f7dff0d without my being aware of it. The idea of the check was to ensure users are aware that chat is required for MM benchmarks.

I think you made a good point in that the default backend may be confusing. In that case, I think we can also override the default backend in the same manner as the commit above so chat is always used.

For testing, you can follow the bench_serving.md guide. Currently we require explicitly passing a chat backend like sglang-oai-chat.
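A small sketch of the "override instead of assert" idea mentioned here; the argument names only loosely follow bench_serving's CLI and are assumptions, not the actual commit.

from argparse import Namespace

# Hypothetical parsed args; real names in bench_serving may differ.
args = Namespace(dataset_name="mmmu", backend="sglang", apply_chat_template=False)

if args.dataset_name == "mmmu":  # multimodal benchmark detected
    if args.backend == "sglang":
        print("Multimodal dataset selected: switching backend to sglang-oai-chat.")
        args.backend = "sglang-oai-chat"
    args.apply_chat_template = True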

@mingfeima
Collaborator

@ZailiWang fix the lint issue.

@ZailiWang
Contributor Author

@ZailiWang fix the lint issue.

It's the lint error on main, which was already fixed later yesterday.

@ZailiWang
Contributor Author

I just verified that the current PR changes pass the bench_serving tests when requesting either the sglang or the sglang-oai-chat backend.

Let me summarize the candidate approaches:
a. Keep all backends usable for multimodal tests, and add adaptation logic per backend: for the OpenAI completion APIs, keep the raw prompt text on the client side; otherwise, apply the chat template when constructing the requests.
b. Restrict multimodal tests so that only the OpenAI completion API may be used.

The current PR implements option (a), which is my preference. My points are:

  • Calling the OpenAI completion API is not a prerequisite for applying the chat template. Wrapping the raw prompts with the chat template on the client side is also a valid way to apply it.
  • Users who only care about backend performance may prefer the sglang backend calling the generate interface directly, which measures backend inference more precisely. Per my test, calling generate gives a lower TTFT than calling v1/chat/completions; I believe the difference is that the extra prompt pre-processing is also counted. Hence I don't think using the sglang backend and calling the generate API is a misuse.

@ZailiWang
Contributor Author

ZailiWang commented Oct 22, 2025

@mickqian @JustinTong0323 would you help decide which solution is preferred? Thanks. Please also review the PR if Solution a) is preferred, or I can update the PR if Solution b) is preferred.

@mickqian changed the title from "Fix MMMU loading issue" to "fix: fix MMMU loading issue" on Oct 22, 2025
@mickqian
Collaborator

Hi, can you run test_nightly_vlms_mmmu_eval.py locally to see if it is broken? cheers

@ZailiWang
Contributor Author

Hi @mickqian, I cannot complete all the cases in the test file, since I only have Xeon CPU servers at hand, which do not support many of the models in the test cases. Some models also require specific authentication, so downloading them is not possible.

I extracted one successful case, deepseek-ai/Janus-Pro-7B, and its results passed the acc criterion of 0.285:

{
  "Overall-Art and Design": {
    "num": 10,
    "acc": 0.4
  },
  "Art": {
    "num": 10,
    "acc": 0.4
  },
  "Overall-Business": {
    "num": 30,
    "acc": 0.367
  },
  "Accounting": {
    "num": 30,
    "acc": 0.367
  },
  "Overall-Tech and Engineering": {
    "num": 60,
    "acc": 0.267
  },
  "Agriculture": {
    "num": 30,
    "acc": 0.233
  },
  "Architecture_and_Engineering": {
    "num": 30,
    "acc": 0.3
  },
  "Overall": {
    "num": 100,
    "acc": 0.31
  },
  "score": 0.31
}

And from a source-code perspective, I confirmed that the requests in this test are sent via the OpenAI API. The changes in this PR ensure that the logic remains unchanged for OpenAI API requests (the sglang-oai or sglang-oai-chat backends). My PR only makes non-OpenAI-API requests work as well, and adds a specific format change required by Phi4MM.

Do the results and explanation look good to you?

@sglang-bot sglang-bot enabled auto-merge (squash) October 24, 2025 03:19
@merrymercy merrymercy disabled auto-merge October 24, 2025 03:20
@merrymercy merrymercy merged commit 92009bd into sgl-project:main Oct 24, 2025
71 checks passed
@mickqian
Collaborator

mickqian commented Oct 28, 2025

Hi @ZailiWang, can you help fix this?

@ZailiWang
Contributor Author

Hi @ZailiWang, can you help fix this?

Created #12256 to fix it.

@ZailiWang ZailiWang deleted the mmmu-load branch October 28, 2025 08:13
