Encoder Global Cache Manager #16137
Pull request overview
This PR introduces an Encoder Global Cache Manager feature that caches encoder embeddings in the Mooncake distributed storage backend. The implementation aims to reduce redundant GPU encoding by reusing computed embeddings across multiple requests and nodes.
Changes:
- Added `--enable-mm-global-cache` CLI argument to enable the global cache feature
- Implemented `MooncakeEmbeddingStore` for distributed storage of embeddings
- Created `EmbeddingCacheController` to manage local memory allocation and coordinate with Mooncake backend
- Integrated cache checking and prefetching into the encode server workflow
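The hit/miss flow described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `TwoTierEmbeddingCache`, `key_for`, `get_or_encode`, and `fake_encode` are hypothetical names, a plain dictionary stands in for the Mooncake backend, and hashing the raw image bytes is an assumed keying scheme.

```python
import hashlib
from typing import Callable, Dict, List

Embedding = List[float]


class TwoTierEmbeddingCache:
    """Hypothetical stand-in: a per-node dict in front of a shared store."""

    def __init__(self, shared_store: Dict[str, Embedding]) -> None:
        self.local: Dict[str, Embedding] = {}
        # Assumption: a plain dict plays the role of the distributed backend.
        self.shared_store = shared_store
        self.encoder_calls = 0

    @staticmethod
    def key_for(image_bytes: bytes) -> str:
        # Assumed keying scheme: content hash of the raw input.
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(
        self, image_bytes: bytes, encode: Callable[[bytes], Embedding]
    ) -> Embedding:
        key = self.key_for(image_bytes)
        if key in self.local:  # local hit
            return self.local[key]
        if key in self.shared_store:  # shared hit: copy into the local tier
            self.local[key] = self.shared_store[key]
            return self.local[key]
        # Miss on both tiers: run the encoder once and publish the result.
        self.encoder_calls += 1
        embedding = encode(image_bytes)
        self.local[key] = embedding
        self.shared_store[key] = embedding
        return embedding


def fake_encode(image_bytes: bytes) -> Embedding:
    # Placeholder for the real GPU encoder.
    return [float(len(image_bytes))]


shared: Dict[str, Embedding] = {}
node_a = TwoTierEmbeddingCache(shared)
node_b = TwoTierEmbeddingCache(shared)
first = node_a.get_or_encode(b"image-bytes", fake_encode)
second = node_b.get_or_encode(b"image-bytes", fake_encode)
```

In this sketch the second node is served from the shared store, so its encoder is never invoked for an input that another node has already processed.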
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 29 comments.
| File | Description |
|---|---|
| python/sglang/srt/server_args.py | Adds new CLI flag to enable multimodal global cache |
| python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_embedding_store.py | New file implementing Mooncake-based distributed embedding storage |
| python/sglang/srt/managers/embedding_cache_controller.py | New file implementing cache controller with memory management and async I/O |
| python/sglang/srt/managers/scheduler.py | Integrates cache controller into scheduler initialization |
| python/sglang/srt/managers/schedule_batch.py | Adds debug print statement (should be removed) |
| python/sglang/srt/disaggregation/encode_server.py | Implements cache-aware encoding workflow with hit/miss handling |
    import logging
    import threading
    from typing import List, Optional

    import torch

    logger = logging.getLogger(__name__)

This import of module threading is redundant, as it was previously imported on line 2.

Suggested change:

    import logging
    import threading
    from typing import List, Optional
    import torch
    logger = logging.getLogger(__name__)
    from typing import Optional

        op.mark_done(all(results))
        self.prefetch_queue.task_done()
        processed_any = True
    except Empty:

The 'except' clause does nothing but pass, and there is no explanatory comment.

        )
        self.insert_queue.task_done()
        processed_any = True
    except Empty:

The 'except' clause does nothing but pass, and there is no explanatory comment.
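
One way to address both comments is to state in place why an empty queue is not an error. The sketch below is illustrative only and is not the PR's code: `drain` is a hypothetical helper that polls a standard-library `queue.Queue` without blocking and documents the `Empty` case.

```python
from queue import Empty, Queue
from typing import List


def drain(q: Queue) -> List[int]:
    """Pull every item currently in the queue without blocking."""
    items: List[int] = []
    while True:
        try:
            item = q.get_nowait()
        except Empty:
            # An empty queue is the normal idle case, not a failure:
            # nothing is left to process this round, so stop polling.
            break
        items.append(item)
        q.task_done()
    return items


work: Queue = Queue()
work.put(1)
work.put(2)
first_pass = drain(work)
second_pass = drain(work)
```

The comment inside the handler is the point here: it tells a reader that swallowing `Empty` is deliberate, which is what the review comments ask for.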
Co-authored-by: Zheng Wengang <zwg0606@gmail.com> Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
Motivation
This PR introduces a multi-level multimodal embedding cache backed by the Mooncake distributed store. It shares Vision Transformer (ViT) embeddings across instances, avoiding redundant GPU computation for previously processed images.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist