Add Support for Qwen2-VL Multi-modal Embedding Models #3694
zhaochenyang20 merged 5 commits into sgl-project:main from
Conversation
@Titan-p Please add a test for it. And might @mickqian and @yizhang2077 take a look? Thanks a lot!
Unit test added.
python/sglang/srt/conversation.py (Outdated)
It should be okay for now, but we will need to refactor this later. cc @yizhang2077, do you agree with this? @mickqian
@simveit Could you continue to help on this? Thanks so much. If you feel okay with it, I can merge this.

I will go over the code one more time and also test it tonight. @zhaochenyang20
The test failed. Before merging it we should increase

I think it might be related to differences in the image processing. Could you please provide the test images?

I ran it with the defaults on an A100.

Also, fix the conflicts.
cc @zhaochenyang20 @simveit. I think this PR is ready to be merged.

I tested it locally and LGTM.

@zhaochenyang20 I will integrate a corresponding example for this over the weekend.

@simveit Thanks!
Add Support for Qwen2-VL Multi-modal Embedding Models
Motivation
This PR introduces multi-modal embedding capabilities to support the Alibaba-NLP/gme-Qwen2-VL-2B-Instruct model, enabling unified processing of both text and image inputs.
Modifications
Model Integration
API modification
```python
payload = json.dumps({
    "input": [
        {"text": "text string"},
        {"image": "/home/panlyu/images/006.jpg"}
    ]
})
```

TODO
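For illustration, a minimal sketch of how a client might build and inspect this mixed text/image payload. The image path is the one from the example above and is illustrative only; the surrounding request code (target URL, headers) is not specified in this PR description and is omitted here.

```python
import json

# Build an embedding request whose "input" list mixes modalities:
# each entry carries exactly one key, either "text" or "image".
payload = json.dumps({
    "input": [
        {"text": "text string"},
        {"image": "/home/panlyu/images/006.jpg"},  # illustrative local path
    ]
})

# The payload round-trips cleanly: one text item, one image item.
items = json.loads(payload)["input"]
print(len(items))             # 2
print(list(items[0].keys()))  # ['text']
print(list(items[1].keys()))  # ['image']
```

This mirrors the unified text/image input format the PR adds, where the server dispatches each entry to the appropriate (text or vision) processing path.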
Checklist