
[RFC] Remote KV Connector for SGLang Global Cache Reuse and PD  #7746

Author: @hzh0425
Co-authors: @yizhang2077

[RFC] Remote KV Connector for SGLang Global Cache Reuse

1. Abstract

This RFC proposes a Remote KVCache Connector System to enable global KV cache reuse across SGLang nodes, eliminating redundant computation in multi-turn conversation scenarios and trading recomputation for storage at a global scale. The system introduces a connector abstraction layer that allows nodes to store and retrieve KV cache data from external storage, enabling prefix-based cache matching and reuse across distributed inference workers.

Key benefits include:

Global KV Cache Reuse: Reduce redundant computation through cross-node KV cache sharing backed by global prefix index management, achieving ~50% TTFT reduction in Qwen-32B 4TP multi-turn dialogues

Flexible Storage Backend: Support diverse storage backends via a universal KVConnector interface, including direct RDMA-based access to HBM KV cache for high-throughput data transfer

Seamless Integration:

  • Native integration with SGLang Scheduler through asynchronous external KV Cache read/write
  • Native integration with PD Disaggregation architecture

2. Motivation

Problem Statement

SGLang currently lacks global PrefixCache reuse functionality. In multi-turn conversation scenarios, completed inference KV caches cannot be reused across nodes. When similar prefixes are processed repeatedly across different worker nodes, this leads to significant redundant computation and resource waste.

Goals

  • Goal 1: Define a universal Remote Connector Interface for SGLang that provides basic KV cache read/write operations, and implement a 3FS Connector

  • Goal 2: Enable the Normal/Overlap Scheduler to asynchronously PreFetch and Offload KVCache through the Connector without blocking scheduler operations

  • Goal 3: Achieve global cache reuse with measurable performance improvements in multi-turn scenarios

  • Goal 4: Integrate with the PD disaggregation mode

3. Technical Design

Architecture Overview

The overall workflow involves:

  1. First Time Scheduling (A):

    • A2: Query the remote prefix-matched cache through the Connector; the match result is 0 (no remote hit)

    • A3: Execute model inference, write KVCache to VRAM

    • A4: Write KVCache and Prefix Key to remote store through Connector

  2. Second Time Scheduling (B):

    • B2: Query the remote prefix-matched cache through the Connector; assume the match result is 1024 (the first 1024 tokens were computed on another node)

    • B3: Execute PreFetch through the Connector, pulling the first 1024 tokens' KVCache into the corresponding GPU memory

    • B5: Execute model inference

    • B6: Write inference results back to remote cache
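
To make step B3 concrete, the sketch below shows one plausible way a matched prefix could be expressed as the page_to_tokens_mapping argument used by the Connector interface in Section 3. The PAGE_SIZE constant and the build_page_mapping helper are illustrative assumptions, not part of the proposal.

PAGE_SIZE = 16  # assumption: illustrative KV cache page size

def build_page_mapping(page_indices: list[int], num_matched_tokens: int) -> list[tuple[int, slice]]:
    """Map the first num_matched_tokens tokens onto the pages allocated for them."""
    mapping = []
    for i, page_index in enumerate(page_indices):
        start = i * PAGE_SIZE
        if start >= num_matched_tokens:
            break
        end = min(start + PAGE_SIZE, num_matched_tokens)
        mapping.append((page_index, slice(start, end)))
    return mapping

# In step B3, a 1024-token remote hit with 16-token pages spans 64 pages:
# build_page_mapping(allocated_pages, 1024) -> [(p0, slice(0, 16)), (p1, slice(16, 32)), ...]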

Key Components

GlobalRemoteStorage

The remote storage engine; it can be 3FS, Mooncake, LMCache, etc.

GlobalKVManager

The remote global prefix index, responsible for maintaining prefix indices for KVCache stored in remote storage. GlobalKVManager is an optional plugin: if the RemoteStorage itself has prefix query capabilities, GlobalKVManager is not needed.

RemoteKVConnector

The Connector component abstracts the API the Scheduler uses to read/write external KVCache.

SchedulerRemoteKVQueue

The Scheduler asynchronously schedules reads/writes of external KVCache through two queues (sketched below):

  • AsyncKVPrefetchQueue: Requests in this queue attempt to match remote prefix tokens through the Connector, pre-allocate token KVCache slots, and asynchronously pull the remote KVCache through the Connector

  • AsyncKVStoreQueue: Requests in this queue wait for all KVCache chunks to finish transferring before releasing the resources they occupy
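
A minimal sketch of what entries in these two queues might hold; the field names (matched_len, allocated_pages, chunk_futures) are placeholder assumptions, and the actual structures will be defined during Phase 2.

from concurrent.futures import Future
from dataclasses import dataclass, field

@dataclass
class PrefetchEntry:
    """One request waiting for its remote prefix KVCache to arrive."""
    req_id: str
    matched_len: int                  # tokens matched via num_external_matched_tokens
    allocated_pages: list[int]        # KVCache page slots pre-allocated for the remote prefix
    kv_future: Future | None = None   # future returned by retrieve_kv_async

@dataclass
class StoreEntry:
    """One finished request whose KVCache chunks are still being written back."""
    req_id: str
    chunk_futures: list[Future] = field(default_factory=list)  # one future per store_kv_async call

    def all_done(self) -> bool:
        # Resources are released only after every chunk transfer has completed.
        return all(f.done() for f in self.chunk_futures)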

Scheduler Overlap Optimization

The Scheduler reads/writes external KVCache data through the Connector. To preserve overall performance after introducing the Connector and to avoid blocking the Scheduler, KV cache transfer is overlapped with computation across independent requests using a multi-layered queue design.

This architecture uses a layered queue mechanism to decouple scheduling and transfer:

  • Async Pipeline Processing: PreFetch Queue (async prefetch), Scheduling Queue (scheduling decisions), PostTransfer Queue (async transfer) form a non-blocking task processing pipeline

  • Design Goal: Ensure Scheduler focuses on resource allocation and task distribution, isolating I/O latency through queue buffering and async operations to maximize system throughput

  • Example: When requests 0,1,2,3 initiate PreFetch, requests 4,5 can compute in parallel without waiting; when 4,5 initiate async writeback, 0,1 can also compute in parallel
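
A rough, non-blocking event-loop sketch of this layered-queue pipeline follows. The method names (poll_prefetch_queue, schedule_ready_requests, run_forward, poll_store_queue) are placeholders for illustration, not the final Scheduler APIs.

def scheduler_event_loop(scheduler):
    """Illustrative only: one iteration of a scheduler loop with layered queues.
    Every poll is non-blocking, so I/O latency never stalls scheduling or compute."""
    while True:
        # 1. PreFetch Queue: promote requests whose remote KVCache has finished loading;
        #    leave in-flight prefetches untouched.
        scheduler.poll_prefetch_queue()

        # 2. Scheduling Queue: batch requests that are ready (local-only requests plus
        #    requests whose prefetch completed) and run the forward pass.
        batch = scheduler.schedule_ready_requests()
        if batch is not None:
            scheduler.run_forward(batch)   # compute overlaps with the transfers above/below

        # 3. PostTransfer Queue: release resources of finished requests once all of their
        #    chunked write-backs to remote storage have completed.
        scheduler.poll_store_queue()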


Workflow Phases

  1. PreFetch Phase:

    • Scheduler adds Req to AsyncKVPrefetch Queue

    • Queue uses KVConnector.num_external_matched_tokens to determine remote prefix cache length

    • When the matched length exceeds retrieve_remote_cache_threshold, initiate PreFetch

    • Allocate Token KVCache Page Slots and read remote cache via KVConnector.retrieve_kv_async

  2. Forward Phase:

    • Execute model inference workflow

    • For each produced KVCache chunk, asynchronously send it via KVConnector.store_kv_async

  3. Transfer Phase:

    • After inference completion, place Req in AsyncKVStore Queue

    • Wait for all Chunk transfers to complete
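
Putting the three phases together, the sketch below traces the Connector calls made for one request. The handle_request function, its parameters, and the forward_chunks iterator are illustrative placeholders, and error handling is omitted; in the real scheduler the futures are polled from the queues rather than blocked on.

def handle_request(connector, req_id, tokens, num_local_computed, threshold, alloc_pages, forward_chunks):
    """Illustrative per-request flow through the PreFetch / Forward / Transfer phases."""
    # PreFetch phase: ask the connector how many extra tokens the remote cache can provide.
    matched = connector.num_external_matched_tokens(req_id, tokens, num_local_computed)
    if matched > threshold:
        pages = alloc_pages(matched)                    # pre-allocate KVCache page slots
        mapping = build_page_mapping(pages, matched)    # see the earlier page-mapping sketch
        prefetch_future = connector.retrieve_kv_async(req_id, mapping, tokens)
        prefetch_future.result()                        # the scheduler polls this instead of blocking

    # Forward phase: run inference; write each produced chunk back asynchronously.
    store_futures = []
    for chunk_mapping, chunk_tokens in forward_chunks():   # yields (page mapping, tokens) per chunk
        fut, _stored_end = connector.store_kv_async(req_id, chunk_mapping, chunk_tokens)
        store_futures.append(fut)

    # Transfer phase: wait for all chunk write-backs before releasing resources.
    for fut in store_futures:
        fut.result()
    connector.finish_req(req_id)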


Connector Interface Definition

class RemoteKVConnector(ABC):
    """
    Base class for all connectors.
    A connector is responsible for storing and retrieving KV caches from the external storage.
    """

    @abstractmethod
    def __init__(self, metadata: ConnectorMetadata):
        """
        Initialize the connector with the metadata.
        Args:
            metadata: The specific metadata for the connector.
        """
        pass

    @abstractmethod
    def register_kv_buffers(self, kv_args: KVArgs):
        """
        Initialize with the KV caches. Useful for pre-registering the
        KV Buffers (e.g. KV ptrs) into the KVConnector (e.g. for NIXL, Mooncake).
        """
        pass

    @abstractmethod
    def store_kv_async(
        self,
        req_id: str,
        page_to_tokens_mapping: list[tuple[int, slice]],
        tokens_list: list[int],
        **kwargs
    ) -> tuple[Future, int]:
        """
        Async store the KV cache of the given token range to the connector.
        
        This method stores KV cache data for tokens organized by page indices.
        Each page contains a list of tokens that need to be stored together.
        
        Args:
            req_id: Unique request identifier for tracking this store operation.
            page_to_tokens_mapping: A list of tuples, where each tuple contains:
                - page_index: The index of the KV cache page
                - token_index_range: The slice of token indices covered by this page
            tokens_list: The complete token list.
            **kwargs: Additional arguments for the store operation.
        
        Returns:
            tuple[Future, int]: A tuple containing:
                - Future: The future object that indicates success or failure of the store operation.
                - int: The end index of the stored tokens, indicating the last token position
                       that was successfully stored in this operation.
        """
        pass


    @abstractmethod
    def retrieve_kv_async(
        self,
        req_id: str,
        page_to_tokens_mapping: list[tuple[int, slice]],
        tokens_list: list[int],
        **kwargs,
    ) -> Future:
        """
        Async retrieve the KV cache from the connector to the given kv indices.
        
        This method retrieves KV cache data for tokens organized by page indices.
        Each page contains a list of tokens that need to be retrieved together.
        
        Args:
            req_id: Unique request identifier for tracking this retrieve operation.
            page_to_tokens_mapping: A list of tuples, where each tuple contains:
                - page_index: The index of the KV cache page
                - token_index_range: The slice of token indices covered by this page
            tokens_list: The complete token list.
            **kwargs: Additional arguments for the retrieve operation.
        
        Returns:
            Future: The future object that indicates success or failure of the retrieve operation.
        """
        pass

    @abstractmethod
    def num_external_matched_tokens(
        self, 
        req_id: str,
        tokens: list[int],
        num_local_computed_tokens: int,
        **kwargs
    ) -> int:
        """
        Get number of new tokens that can be loaded from the
        external KV cache beyond the num_local_computed_tokens for a batch of inputs.

        Args:
            req_id: Unique request identifier for tracking this operation.
            tokens: The tokens to match against the external KV cache.
            num_local_computed_tokens: The number of locally computed tokens.
            **kwargs: Additional arguments for the operation.

        Returns:
            int: The number of new tokens that can be loaded from the
                 external KV cache beyond the num_local_computed_tokens.
        """
        pass    

    @abstractmethod
    def finish_req(
        self,
        req_id: str
    ):
        """
        Finish the request, release the related resources.
        """
        pass

    @abstractmethod
    def close(self):
        pass
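
As a concrete illustration of the interface, here is a deliberately simplified in-memory connector sketch (one plausible shape for the "localfile connector" mentioned in Phase 1). The _store dictionary, the thread-pool executor, and the fake payloads are placeholder assumptions; a real 3FS or Mooncake connector would move actual KV pages over the network, using the registered KV buffer pointers for RDMA.

from concurrent.futures import ThreadPoolExecutor

class InMemoryKVConnector(RemoteKVConnector):
    """Toy connector: keys are token prefixes, values are opaque per-page payloads."""

    def __init__(self, metadata):
        self._store = {}                     # token-prefix tuple -> stored payload
        self._executor = ThreadPoolExecutor(max_workers=4)

    def register_kv_buffers(self, kv_args):
        # A real connector would register GPU KV buffer pointers here (e.g. for RDMA).
        self._kv_args = kv_args

    def store_kv_async(self, req_id, page_to_tokens_mapping, tokens_list, **kwargs):
        end = max(rng.stop for _, rng in page_to_tokens_mapping)
        def _store():
            for page_index, rng in page_to_tokens_mapping:
                # Key each page by the token prefix it completes; the payload stands in
                # for the real KV tensors that would be copied out of HBM.
                self._store[tuple(tokens_list[:rng.stop])] = ("fake-kv", page_index)
        return self._executor.submit(_store), end

    def retrieve_kv_async(self, req_id, page_to_tokens_mapping, tokens_list, **kwargs):
        def _retrieve():
            for page_index, rng in page_to_tokens_mapping:
                payload = self._store[tuple(tokens_list[:rng.stop])]
                # A real connector would copy `payload` into this page's KV slot on GPU.
        return self._executor.submit(_retrieve)

    def num_external_matched_tokens(self, req_id, tokens, num_local_computed_tokens, **kwargs):
        best = 0
        for prefix in self._store:
            if len(prefix) > best and tuple(tokens[:len(prefix)]) == prefix:
                best = len(prefix)
        return max(0, best - num_local_computed_tokens)

    def finish_req(self, req_id):
        pass                                 # nothing per-request to release in this toy version

    def close(self):
        self._executor.shutdown()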

4. Implementation Plan

Phase 1: Core Infrastructure (Weeks 1-2)

Timeline: 2 weeks   Deliverables:

  • ✅ Task01: RemoteKVConnector Interface Definition

    • Define abstract base class with all required methods
  • ✅ Task02: RemoteKVManager Interface

    • Design prefix index management interface
  • ✅ Task03: 3FS KVConnector Implementation

    • Implement 3FS connector  

    • Implement localfile connector

    • Add chunking and compression support

    • Include error handling and retry mechanisms

  • ✅ Task04: Mini RemoteKVManager (see the sketch after this list)

    • Basic in-memory prefix index implementation

    • Thread-safe operations

    • Implement a RESTful API
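
To make Task04 concrete, a minimal, thread-safe in-memory prefix index might look like the sketch below. Keying by chained block hashes, the BLOCK_SIZE constant, and the method names are illustrative assumptions rather than the final RemoteKVManager API; the RESTful API would then expose insert and matched_prefix_len so any node can query the global index.

import hashlib
import threading

BLOCK_SIZE = 16  # assumption: tokens are indexed in fixed-size blocks

class MiniPrefixIndex:
    """Maps hashes of token-prefix blocks to the storage locations of their KVCache."""

    def __init__(self):
        self._lock = threading.Lock()
        self._index: dict[str, str] = {}   # block-prefix hash -> remote storage key

    @staticmethod
    def _block_hashes(tokens: list[int]) -> list[str]:
        hashes, h = [], hashlib.sha256()
        for start in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            h.update(str(tokens[start:start + BLOCK_SIZE]).encode())
            hashes.append(h.hexdigest())   # hash of the whole prefix up to this block
        return hashes

    def insert(self, tokens: list[int], storage_key: str) -> None:
        with self._lock:
            for block_hash in self._block_hashes(tokens):
                self._index[block_hash] = storage_key

    def matched_prefix_len(self, tokens: list[int]) -> int:
        """Return how many leading tokens have KVCache registered in remote storage."""
        matched = 0
        with self._lock:
            for i, block_hash in enumerate(self._block_hashes(tokens)):
                if block_hash not in self._index:
                    break
                matched = (i + 1) * BLOCK_SIZE
        return matched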

Phase 2: Queue Infrastructure (Week 3)

Timeline: 1 week   Deliverables:

  • ✅  Task05: AsyncKVPrefetchQueue Implementation

    • Resource allocation management and token pool allocation for external cache

    • Async prefetch of KVCache

    • Completion polling with TP consistency guarantees

  • ✅  Task06: AsyncKVStoreQueue Implementation

    • Chunked transfer management with configurable thresholds

    • Completion polling with TP consistency guarantees

    • Resource cleanup and error recovery

Phase 3: Scheduler Integration (Weeks 4-6)

Timeline: 3 weeks   Deliverables:

  • ✅  Task07: Core Scheduler Logic Integration

    • Integrate AsyncPrefetch/StoreQueue into event loops

    • Add remote KV cache configuration options

    • Implement polling mechanisms in normal/overlap modes

Phase 4: Testing and Optimization (Weeks 7-8)

Timeline: 2 weeks   Deliverables:

  •  Task08: Performance Evaluation

    • Multi-turn dialogue scenario testing with Qwen/DeepSeek models

    • TTFT reduction measurement (target: ~50% improvement)

    • Throughput comparison against baseline

  •  Task09: Stability Testing

    • Long-running conversation scenarios

    • Error injection and recovery testing

    • Memory leak detection and prevention

  •  Task10: Documentation and Examples

    • Configuration guides for different storage backends

    • Performance tuning recommendations

    • Troubleshooting documentation

5. Long-Term Vision With PD

Upon completion of this RFC, the Connector mode can be integrated with SGLang's PD (Prefill-Decode) separation architecture. 

Options include:

• Reusing Remote Cache via Connector during Prefill

• Direct KVCache transfer between PD nodes through Connector


6. Differences from HiRadixCache

HiRadixCache is another highly efficient multi-tier storage implementation from the SGLang community. Extending HiRadixCache appropriately to schedule the Remote Cache is also a viable design option.

This solution also draws inspiration from the concepts of HiRadixCache. Its main differences from HiRadixCache are:

  1. Current Solution: It forms a completely independent scheduling pipeline. It interacts directly with the GPU without intermediate buffering in DRAM, enabling direct transfer of KVCache via RDMA.

  2. Current Solution: It employs a dedicated Global KVCacheManager component to uniformly manage global prefix indexing. Crucially, the underlying storage does not require native support for prefix query capabilities.

7. References

  1. SGLang PD Disaggregation Design

  2. SGLang Scheduler Implementation

    • python/sglang/srt/managers/scheduler.py
  3. Memory Pool and Cache Systems

    • python/sglang/srt/mem_cache/memory_pool.py

    • python/sglang/srt/mem_cache/radix_cache.py

    • python/sglang/srt/mem_cache/hiradix_cache.py
