Fix LoRA support for multimodal models (VLMs) by implementing a consistent pattern for skipping vision components #11261
Conversation
Code Review
This pull request aims to fix LoRA support for multimodal models, specifically Gemma3. The changes involve patching the model configuration in gemma3_mm.py to be compatible with the LoRA manager's expectations and generalizing the logic in lora_manager.py to skip vision components.
My review identifies a potential bug in the updated skipping logic in lora_manager.py and suggests a more robust implementation. Additionally, while the config patching in gemma3_mm.py provides a functional fix, I've noted it as a short-term solution and recommend a more scalable, architectural improvement in the LoRAManager for better long-term maintainability.
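For illustration, one shape such a LoRAManager-side improvement could take is resolving attributes with a fallback to the nested `text_config` instead of patching each multimodal model's config. This is only a sketch; the helper name `resolve_text_attr` is hypothetical and not existing sglang API.

```python
# Rough sketch (not sglang's actual implementation): look up a text-model
# attribute on the config, falling back to the nested text_config used by
# multimodal models such as Gemma3.
def resolve_text_attr(config, name: str):
    """Return `name` from the config, preferring the top level and falling
    back to config.text_config when the attribute only exists there."""
    if hasattr(config, name):
        return getattr(config, name)
    text_config = getattr(config, "text_config", None)
    if text_config is not None and hasattr(text_config, name):
        return getattr(text_config, name)
    raise AttributeError(f"config exposes neither `{name}` nor `text_config.{name}`")

# Usage inside the memory pool would then look roughly like:
#   num_layers = resolve_text_attr(base_hf_config, "num_hidden_layers")
#   hidden_size = resolve_text_attr(base_hf_config, "hidden_size")
```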
Thank you @ConnorLi96 for the contribution!
```python
if not hasattr(config, "num_hidden_layers"):
    config.num_hidden_layers = config.text_config.num_hidden_layers
if not hasattr(config, "hidden_size"):
    config.hidden_size = config.text_config.hidden_size
```
Why only Gemma3 needs config attribute exposure:

- `Gemma3ForConditionalGeneration` uses a config structure where the language-model attributes (`num_hidden_layers`, `hidden_size`) live exclusively in `config.text_config`, with no top-level copies.
- Other VLMs such as `Phi4ForConditionalGeneration` and `Llama4ForConditionalGeneration` either already have these attributes at the top level, or have a different config inheritance that makes them accessible.
- LoRA's `LoRAMemoryPool.__init__` directly accesses `base_hf_config.num_hidden_layers` (line 59 in mem_pool.py), which fails for Gemma3's nested-only structure (illustrated in the snippet below).
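A minimal illustration of that nested-only layout, assuming a transformers version that ships Gemma3 (>= 4.50); the snippet is not part of this PR and only restates the access pattern described above.

```python
# Gemma3's config keeps language-model attributes on the nested text_config.
from transformers import Gemma3Config

config = Gemma3Config()  # holds text_config / vision_config sub-configs

print(config.text_config.num_hidden_layers)  # present on the nested language-model config
# Per the comment above, the top-level lookup is what breaks for Gemma3:
#   base_hf_config.num_hidden_layers  -> missing before this patch
```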
Motivation
- The LoRA manager expects `config.num_hidden_layers` and `config.hidden_size` at the top level of the model config.
- For `Gemma3ForConditionalGeneration`, these attributes are nested under `config.text_config` (not at the top level).
- The mismatch surfaces as `AssertionError: LoRA buffer shape torch.Size([4096, 32]) does not match weight shape torch.Size([1728, 32])`.
Modifications

- `gemma3_mm.py` - Add LoRA support with config attribute exposure
- `mllama4.py` - Add LoRA support (follows the same pattern)
- `lora_manager.py` - Remove the generic vision-skipping logic (a sketch of that kind of filter follows below)
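For context, here is a rough sketch of the name-based vision-component filtering the PR title refers to. The prefixes and the `should_apply_lora` helper are illustrative assumptions, not sglang's actual code.

```python
# Illustrative only: LoRA is applied to language-model submodules, while
# vision-tower / projector modules are skipped by name prefix.
VISION_PREFIXES = ("vision_tower", "vision_model", "multi_modal_projector")

def should_apply_lora(module_name: str) -> bool:
    """Return True if a module should receive LoRA weights (i.e. it is not a vision component)."""
    return not module_name.startswith(VISION_PREFIXES)

# should_apply_lora("vision_tower.encoder.layers.0.self_attn.q_proj")  -> False
# should_apply_lora("language_model.model.layers.0.self_attn.q_proj")  -> True
```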
Accuracy Tests

N/A
Benchmarking and Profiling
N/A
Checklist