fix: fix MMMU loading issue #11759
Conversation
Summary of Changes: Hello @ZailiWang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses critical issues that were preventing Vision-Language Model (VLM) benchmarks from functioning correctly. It integrates a previously overlooked adaptation for Phi4MM benchmarking and corrects a bug in how prompts are constructed.
Code Review
This pull request introduces two important fixes for VLM benchmarking. First, it corrects a bug where the raw text prompt was used instead of the fully processed prompt string when constructing DatasetRow, which resolves a server-side failure. Second, it adds special handling for the Phi4MMProcessor to correctly format prompts for MMMU benchmarking, adapting a change from a previous pull request. While the logic is sound, I've identified a potential issue in the implementation of the processor-specific handling and suggested a more robust approach along with a correction for a likely typo in a special token.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
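For readers landing on this thread, here is a minimal sketch of what processor-specific handling along these lines could look like. The helper name, the class-name check, and the fallback path are assumptions rather than the PR's actual diff; only the Phi-4-multimodal token format follows the model's documented chat template.

```python
# Hedged sketch: special-casing the Phi4MM processor when building an MMMU prompt.
# Only the <|user|>/<|end|>/<|assistant|> and <|image_N|> tokens follow
# Phi-4-multimodal's published chat format; everything else here is illustrative.
def format_mmmu_prompt(processor, question_text: str, num_images: int = 1) -> str:
    if type(processor).__name__ == "Phi4MMProcessor":
        # Phi-4-multimodal expects numbered image placeholders inside the user turn.
        image_tags = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
        return f"<|user|>{image_tags}{question_text}<|end|><|assistant|>"
    # Other processors: fall back to their own chat template.
    messages = [{"role": "user", "content": question_text}]
    return processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
```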
Hi @AlienKevin, would you help review the PR? It looks like a typo in #9812, but you said you had verified it.
AlienKevin left a comment
Hi, thanks for bringing up this PR. I'm not sure what happened after my PR was merged, but at the time I used it to benchmark Qwen 2.5 VL extensively and it worked.
I think it's here at the exception?
Thanks for the discussion. No, it has nothing to do with this snippet. The error I hit is the same as the one described in #9900: the prompt format is not recognized by the server, because in my test the prompt requires wrapping.
Hi @AlienKevin, I think the current issue is due to ... Looks like an ...
@ZailiWang I remember gating the MMMU and image benchmarks to only allow chat completions. For multimodal use cases, I think chat makes more sense because the chat template is already required to pass images into the prompt.
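To make this point concrete, below is roughly what a multimodal request looks like over the chat completions API, where the image rides inside a chat message and the server applies the chat template. The endpoint, model name, and image URL are placeholders.

```python
from openai import OpenAI

# Placeholder endpoint and model; any OpenAI-compatible VLM server looks similar.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                # The image can only be attached through the chat message schema,
                # which is why MMMU/image benchmarks are gated to chat completions.
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Answer the question about this image."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```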
Hi @AlienKevin, I get your point, but I still have the following questions/concerns.
I updated the PR to handle both cases, where the prompt wrapping should or should not be applied.
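A rough sketch of that "both cases" decision, under the assumption that the benchmark knows which backend it is targeting; the backend labels below are placeholders, not the script's real names.

```python
# Hedged sketch: whether the client wraps the prompt depends on the backend.
def needs_client_side_wrapping(backend: str) -> bool:
    # Chat-style backends apply the chat template on the server, so the raw
    # question text must be sent unwrapped inside a chat message.
    # Completions-style backends expect the fully rendered prompt string, so the
    # wrapping (chat template plus image placeholders) happens on the client.
    return backend not in {"sglang-oai-chat", "openai-chat"}
```

The caller would then apply the processor's chat template (or Phi4MM-specific formatting like the sketch earlier in this thread) only when this returns True.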
I had assert checks for this. I think you made a good point in that the default backend may be confusing. In that case, I think we can also override the default backend in the same manner as the commit above so chat is always used. For testing, you can follow the test file referenced below.
@ZailiWang fix the lint issue.
It's the lint error on main, which was fixed later yesterday.
I just verified that the current PR changes can pass the test. Let me summarize the candidate change approaches: the current PR implements option (a), which is my preference. My points are:
@mickqian @JustinTong0323 would you help decide which solution is preferred? Thanks. Please also review the PR if Solution a) is preferred, or I can update the PR if Solution b) is preferred.
Hi, can you run the test file?
Hi @mickqian, I cannot complete all cases in the test file, since I only have Xeon CPU servers at hand, which do not support many of the models in the test cases. Some models require specific authentication, so downloading them is forbidden. I extracted one succeeding case. From a source-code perspective, I confirmed that the requests in this test are sent via the OpenAI API, and the changes in this PR ensure that the logic remains unchanged for OpenAI API requests. Do the results and explanation look good to you?
Hi @ZailiWang, can you help fix this?
Created #12256 to fix it.
Motivation
#9812 did amazing work. However, it did not port the adaptation for Phi4MM benchmarking contributed in #9900. Another issue is that in DatasetRow construction, the original input param text_prompt is used instead of the processed prompt_str. This causes all VLM benchmarks to fail at the prompt-decomposition phase on the server side.

Modifications
Fix the mistakes above and make VLM benchmarks usable.
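A minimal sketch of the DatasetRow fix described above, with simplified field names; the real DatasetRow in the benchmark script carries more fields, so treat this as illustration rather than the exact diff.

```python
from dataclasses import dataclass


@dataclass
class DatasetRow:
    # Simplified stand-in; the real DatasetRow in the benchmark script has more fields.
    prompt: str
    prompt_len: int
    output_len: int


def build_row(text_prompt: str, prompt_str: str, prompt_len: int, output_len: int) -> DatasetRow:
    # Before the fix: DatasetRow(prompt=text_prompt, ...) sent the raw question
    # text, which the server could not decompose into text and image parts.
    # After the fix: the processed prompt string (chat template already applied)
    # is what gets sent.
    return DatasetRow(prompt=prompt_str, prompt_len=prompt_len, output_len=output_len)
```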
Accuracy Tests
N/A
Benchmarking and Profiling
N/A
Checklist