feat: basic support for server-level multimodal cache#10775

Merged
hnyls2002 merged 4 commits into sgl-project:main from mickqian:server-level-mm-cache
Nov 9, 2025
Conversation

@mickqian
Collaborator

Motivation

Previously, the MultimodalCache was used only for chunked_prefill, and the embedding's lifetime ended with prefill.

Modifications

This PR provides basic support for a server-level multimodal cache, with LRU as the eviction policy.
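A minimal sketch of the idea described above: embeddings keyed by a hash of the raw multimodal input, evicted in least-recently-used order once a size budget is exceeded. The class and method names here are hypothetical, not the PR's actual implementation.

```python
import hashlib
from collections import OrderedDict


class ServerLevelMMCache:
    """Hypothetical sketch of a server-level multimodal embedding cache
    with LRU eviction; not the PR's actual class."""

    def __init__(self, max_items: int = 128):
        self.max_items = max_items
        self._store: OrderedDict[str, object] = OrderedDict()

    @staticmethod
    def key_for(data: bytes) -> str:
        # Hash the raw multimodal payload (e.g. image bytes) to a stable key,
        # so identical inputs across requests hit the same entry.
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str):
        if key not in self._store:
            return None
        # Mark the entry as most recently used.
        self._store.move_to_end(key)
        return self._store[key]

    def put(self, key: str, embedding) -> None:
        self._store[key] = embedding
        self._store.move_to_end(key)
        # Evict least-recently-used entries once over budget.
        while len(self._store) > self.max_items:
            self._store.popitem(last=False)
```

Because the cache outlives a single prefill pass, a repeated image in a later request can reuse its embedding instead of re-encoding it.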

TODOs

  1. Make the cache size and eviction policy configurable.

Accuracy Tests

Tested with Qwen/Qwen2.5-VL-7B-Instruct.

Before

Benchmark time: 119.34484012098983
Overall accuracy: 0.514

After

Benchmark time: 119.47181225300301
Overall accuracy: 0.514


@merrymercy
Contributor

@mickqian please resolve the conflicts.

@mickqian force-pushed the server-level-mm-cache branch from 4471af1 to f5c32c5 on October 24, 2025 01:15
@mickqian requested a review from zhyncs as a code owner on October 24, 2025 02:14
@hnyls2002 merged commit f5b3ccd into sgl-project:main on Nov 9, 2025
99 of 107 checks passed
@zhangml

zhangml commented Nov 13, 2025

Hello, could you please provide a command for using server-level multimodal cache?

