
fix: fix MMMU loading issue#11759

Merged
merrymercy merged 22 commits into sgl-project:main from ZailiWang:mmmu-load
Oct 24, 2025

Conversation

@ZailiWang
Contributor

Motivation

#9812 did amazing work. However, it did not port the adaptation for Phi4MM benchmarking contributed in #9900. Another issue is that in DatasetRow construction, the original input parameter text_prompt is used instead of the processed prompt_str. This causes all VLM benchmarks to fail at the prompt-decompose phase on the server side.

Modifications

Fix the mistakes above and make VLM benchmarks usable.
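For illustration, a minimal sketch of the DatasetRow part of the fix. This is not the exact bench_serving code; the class below is a simplified stand-in and its field names are assumptions.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DatasetRow:
    # Simplified stand-in for the real DatasetRow; fields are illustrative.
    prompt: str
    output_len: int
    image_data: Optional[List[str]] = None

def create_mm_data_row(text_prompt, prompt_str, images, output_len):
    # Before the fix, the row was built from the raw text_prompt, so the server
    # could not decompose the prompt (chat template / image tokens were missing).
    # The fix is to use the processed prompt_str instead.
    return DatasetRow(prompt=prompt_str, output_len=output_len, image_data=images)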

Accuracy Tests

N/A

Benchmarking and Profiling

N/A

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @ZailiWang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical issues that were preventing Vision-Language Model (VLM) benchmarks from functioning correctly. It integrates a previously overlooked adaptation for Phi4MM benchmarking and corrects a bug in how prompts are constructed for DatasetRow objects, ensuring that processed prompts are utilized. These changes collectively make VLM benchmarks usable and reliable.

Highlights

  • Phi4MM Benchmarking Adaptation: Implemented the necessary adaptation for Phi4MM benchmarking, which was previously missing, ensuring proper handling of image tokens for Phi-4-Multimodal models.
  • DatasetRow Prompt Correction: Fixed an issue in DatasetRow construction where the raw text_prompt was used instead of the processed prompt_str, resolving VLM benchmark failures during the prompt decomposition phase.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces two important fixes for VLM benchmarking. First, it corrects a bug where the raw text prompt was used instead of the fully processed prompt string when constructing DatasetRow, which resolves a server-side failure. Second, it adds special handling for the Phi4MMProcessor to correctly format prompts for MMMU benchmarking, adapting a change from a previous pull request. While the logic is sound, I've identified a potential issue in the implementation of the processor-specific handling and suggested a more robust approach along with a correction for a likely typo in a special token.
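For context, here is a rough, hypothetical sketch of the kind of Phi4MM-specific prompt formatting being discussed. It assumes Phi-4-Multimodal's numbered image placeholders (<|image_1|>, <|image_2|>, ...) and chat markers; it is not the actual code in this PR.

def format_phi4mm_prompt(question: str, num_images: int) -> str:
    # One numbered placeholder per image, in order, followed by the question.
    image_tokens = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    return f"<|user|>{image_tokens}{question}<|end|><|assistant|>"

# Example: a single-image MMMU question.
print(format_phi4mm_prompt("Which painting is shown?", num_images=1))
# -> <|user|><|image_1|>Which painting is shown?<|end|><|assistant|>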

ZailiWang and others added 2 commits October 17, 2025 18:17
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ZailiWang
Contributor Author

Hi @AlienKevin, would you help review the PR? It looks like a typo in #9812, but you said you had verified it.
Anyway, I cannot pass the bench_serving test without my change in this PR.


@AlienKevin (Contributor) left a comment


Hi, thanks for bringing up this PR. Not sure what happened after my PR was merged, but at the time I used it to benchmark Qwen 2.5 VL extensively and it worked.

@vincentzed
Contributor

I think it's here, at the exception? Since it was excepted before. Not sure if this is exactly the error you mean.
https://github.com/sgl-project/sglang/pull/11385/files#diff-00e43baca391ee5f5d8c587be8480acc86944e56adc9b28bf5a24f1dd4314ae3R1294

@ZailiWang
Contributor Author

I think it's here, at the exception? Since it was excepted before. Not sure if this is exactly the error you mean. https://github.com/sgl-project/sglang/pull/11385/files#diff-00e43baca391ee5f5d8c587be8480acc86944e56adc9b28bf5a24f1dd4314ae3R1294

Thanks for the discussion. No, it has nothing to do with this snippet. The error I met is the same as the one described in #9900: the prompt format is not recognized by the server because, in my test, the prompt requires wrapping.

@ZailiWang
Contributor Author

ZailiWang commented Oct 18, 2025

Hi @AlienKevin, I think the current issue is that bench_serving supports both the generate and the OpenAI chat/completions APIs. I was always using generate and was not aware that this script also uses the chat/completions interface.

Looks like an if-else based on the request interface should decide whether the text_prompt is used here. Let me study the code and update the PR.

@AlienKevin
Contributor

@ZailiWang I remember gating the MMMU and image benchmarks to only allow chat completions. For multimodal use cases, I think chat makes more sense because the chat template is already required to pass images into the prompt.
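As an illustration of that point, a multimodal request through the chat/completions API carries the image inside the chat message itself, roughly as in the sketch below. The server URL, port, and model name are placeholders, not values from this PR.

import requests

payload = {
    "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # placeholder model name
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }
    ],
    "max_tokens": 128,
}
# Placeholder endpoint; adjust to your deployment.
resp = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])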

@ZailiWang
Contributor Author

ZailiWang commented Oct 20, 2025

Hi @AlienKevin, I get your point, but I still have the following questions/concerns.

  • I did not encounter the "gating" functionality you mentioned. I was calling generate instead of v1/chat/completions, and no error occurred before #9812 was merged.
  • I understand your point that applying the chat template makes more sense. However, the default backend setting in bench_serving.py is sglang, which calls the generate API. This means that a user who benchmarks a VLM with SGLang's default settings may hit an error that is hard to understand.

I updated the PR to handle both cases, where the prompt wrapping should or should not be applied, in create_mm_data_row(). Would you help verify that this works with your benchmarking approach? Thanks.
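As a rough sketch of that branching (not the exact PR diff; the backend names and tokenizer usage here are assumptions), the client-side decision could look like this:

def build_prompt(text_prompt: str, tokenizer, backend: str) -> str:
    # Decide on the client side whether the chat template is applied.
    if backend in ("sglang-oai", "sglang-oai-chat"):
        # OpenAI-style backends: the chat endpoint applies the template
        # server-side, so keep the raw prompt text here.
        return text_prompt
    # generate-style backends (e.g. the default sglang): wrap the prompt
    # on the client side so the server receives a recognizable format.
    messages = [{"role": "user", "content": text_prompt}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )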

@AlienKevin
Contributor

I had assert checks for apply_chat_template, which a maintainer later changed into a default setting in commit f7dff0d without my being aware of it. The idea of the check was to ensure users are aware that chat is required for MM benchmarks.

I think you made a good point in that the default backend may be confusing. In that case, I think we can also override the default backend in the same manner as the commit above so chat is always used.

For testing, you can follow the bench_serving.md guide. Currently we require explicitly passing a chat backend like sglang-oai-chat.
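A small sketch of the "override instead of assert" idea mentioned here; the argument names only loosely follow bench_serving's CLI and are assumptions, not the actual commit.

from argparse import Namespace

# Hypothetical parsed args; real names in bench_serving may differ.
args = Namespace(dataset_name="mmmu", backend="sglang", apply_chat_template=False)

if args.dataset_name == "mmmu":  # multimodal benchmark detected
    if args.backend == "sglang":
        print("Multimodal dataset selected: switching backend to sglang-oai-chat.")
        args.backend = "sglang-oai-chat"
    args.apply_chat_template = True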

@mingfeima
Collaborator

@ZailiWang fix the lint issue.

@ZailiWang
Contributor Author

@ZailiWang fix the lint issue.

It's the lint error on main, which was already fixed later yesterday.

@ZailiWang
Contributor Author

I just verified that the current PR changes pass the bench_serving tests when requesting either the sglang or the sglang-oai-chat backend.

Let me summarize the candidate approaches:
a. Keep all backends usable for multimodal tests, and add adaptation logic per backend: for the OpenAI completion APIs, keep the raw prompt text on the client side; otherwise, apply the chat template when constructing the requests.
b. Restrict multimodal tests so that only the OpenAI completion API may be used.

The current PR implements option (a), which is my preference. My points are:

  • Calling the OpenAI completion API is not a prerequisite for applying the chat template. Wrapping the raw prompts with the chat template on the client side is also a valid way to apply it.
  • Users who only care about backend performance may prefer the sglang backend calling the generate interface directly, which measures backend inference more precisely. Per my test, calling generate gives a lower TTFT than calling v1/chat/completions; I believe the difference is that the extra prompt pre-processing is also counted. Hence I don't think using the sglang backend and calling the generate API is a misuse.

@ZailiWang
Contributor Author

ZailiWang commented Oct 22, 2025

@mickqian @JustinTong0323 would you help decide which solution is preferred? Thanks. Please also review the PR if Solution a) is preferred, or I can update the PR if Solution b) is preferred.

@mickqian changed the title from "Fix MMMU loading issue" to "fix: fix MMMU loading issue" on Oct 22, 2025
@mickqian
Collaborator

Hi, can you run test_nightly_vlms_mmmu_eval.py locally to see if it is broken? cheers

@ZailiWang
Contributor Author

Hi @mickqian, I cannot complete all the cases in the test file, since I only have Xeon CPU servers at hand, which do not support many of the models in the test cases. Some models also require specific authentication, so downloading them is not possible.

I extracted one successful case, deepseek-ai/Janus-Pro-7B, and its results passed the acc criterion of 0.285:

{
  "Overall-Art and Design": {
    "num": 10,
    "acc": 0.4
  },
  "Art": {
    "num": 10,
    "acc": 0.4
  },
  "Overall-Business": {
    "num": 30,
    "acc": 0.367
  },
  "Accounting": {
    "num": 30,
    "acc": 0.367
  },
  "Overall-Tech and Engineering": {
    "num": 60,
    "acc": 0.267
  },
  "Agriculture": {
    "num": 30,
    "acc": 0.233
  },
  "Architecture_and_Engineering": {
    "num": 30,
    "acc": 0.3
  },
  "Overall": {
    "num": 100,
    "acc": 0.31
  },
  "score": 0.31
}

And from a source-code perspective, I confirmed that the requests in this test are sent via the OpenAI API. The changes in this PR ensure that the logic remains unchanged for OpenAI API requests (the sglang-oai or sglang-oai-chat backends). My PR only makes non-OpenAI-API requests work as well, and adds a specific format change required by Phi4MM.

Do the results and explanation look good to you?

@sglang-bot sglang-bot enabled auto-merge (squash) October 24, 2025 03:19
@merrymercy merrymercy disabled auto-merge October 24, 2025 03:20
@merrymercy merrymercy merged commit 92009bd into sgl-project:main Oct 24, 2025
71 checks passed
@mickqian
Collaborator

mickqian commented Oct 28, 2025

Hi @ZailiWang, can you help fix this?

@ZailiWang
Contributor Author

Hi @ZailiWang, can you help fix this?

Created #12256 to fix it.

@ZailiWang ZailiWang deleted the mmmu-load branch October 28, 2025 08:13
