
[Bug]: Potential QP leak on transfer failure for PD disaggregation scenario #1845

@thincal

Environment

Components: Mooncake v0.3.7-post2, SGLang v0.5.7
Use Case: SGLang Prefill-Decode (PD) disaggregation for LLM inference

Problem Description

After running for some time in a PD disaggregation scenario, the Prefill side starts failing with the error: Failed to create QP: Cannot allocate memory

Analysis

  1. When Prefill transfers KV cache to Decode and the transfer engine returns an error:
     • SGLang only marks the session as failed in the Python layer
     • By that point the transfer engine has presumably already allocated the underlying resources (endpoints, QPs) and established a connection with the Decode side
  2. When a transfer fails, we want to confirm whether Mooncake:
     • Automatically cleans up QP resources
     • Automatically cleans up endpoint resources
     • Or leaves these resources allocated (potential leak)
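
To make the suspected failure mode concrete, here is a minimal simulation of it. This is not Mooncake code: FakeNic and runFailedSessions are hypothetical names, and the fixed QP limit stands in for whatever resource the real device runs out of. It only shows that if failed sessions never release their QP, creation eventually fails, which would match the error we see.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a device with a hard limit on queue pairs.
// It only models "QP creation fails once the limit is reached".
class FakeNic {
public:
    explicit FakeNic(std::size_t max_qps) : max_qps_(max_qps) {}

    // Returns false when no QP can be allocated (the ENOMEM-like case).
    bool createQp() {
        if (in_use_ >= max_qps_) return false;
        ++in_use_;
        return true;
    }

    void destroyQp() {
        assert(in_use_ > 0);
        --in_use_;
    }

    std::size_t inUse() const { return in_use_; }

private:
    std::size_t max_qps_;
    std::size_t in_use_ = 0;
};

// Simulates `sessions` transfers that all fail after a QP was created.
// With release_on_failure == false, every failed session keeps its QP.
// Returns how many sessions could not create a QP at all.
std::size_t runFailedSessions(FakeNic& nic, std::size_t sessions,
                              bool release_on_failure) {
    std::size_t create_failures = 0;
    for (std::size_t i = 0; i < sessions; ++i) {
        if (!nic.createQp()) {
            ++create_failures;  // "Failed to create QP"
            continue;
        }
        // ... the transfer fails here ...
        if (release_on_failure) nic.destroyQp();
    }
    return create_failures;
}
```

With a limit of 4 QPs and 10 failing sessions, the non-releasing variant fails to create a QP for the last 6 sessions, while the releasing variant never hits the limit.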

Current Understanding

From code analysis of mooncake-integration/transfer_engine/transfer_engine_py.cpp:

int TransferEnginePy::batchTransferSync(...) {
    // Get or create segment handle
    Transport::SegmentHandle handle;
    {
        std::lock_guard<std::mutex> guard(mutex_);
        if (handle_map_.count(target_hostname)) {
            handle = handle_map_[target_hostname];
        } else {
            handle = engine_->openSegment(target_hostname);
            if (handle == (Transport::SegmentHandle)-1) return -1;
            handle_map_[target_hostname] = handle;  // ← Cached permanently
        }
    }

    // ... submit transfer and poll its status (preceding branches elided) ...

    // On transfer failure
    else if (status.s == TransferStatusEnum::FAILED) {
        engine_->freeBatchID(batch_id);  // Only frees batch ID
        already_freed = true;
        completed = true;
        // Question: Are QP/endpoint resources cleaned up here?
    }

    return -1;
}

We notice that:

  • freeBatchID(batch_id) is called on failure, but nothing else is released
  • The handle_map_[target_hostname] entry is never removed
  • closeSegment() appears to be a no-op (it just returns 0)
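
For discussion, a rough sketch of what evicting the cached handle on failure could look like. Everything here is an assumption for illustration: MockEngine and HandleCache are hypothetical, and this closeSegment() actually releases something, unlike the one described above. Whether closing a segment would also tear down the QPs and endpoints in Mooncake is exactly the open question of this issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

using SegmentHandle = std::uint64_t;
constexpr SegmentHandle kInvalidHandle = static_cast<SegmentHandle>(-1);

// Hypothetical engine that counts open segments so the effect is observable.
class MockEngine {
public:
    SegmentHandle openSegment(const std::string& /*host*/) {
        ++open_count_;
        return next_handle_++;
    }
    int closeSegment(SegmentHandle /*handle*/) {
        assert(open_count_ > 0);
        --open_count_;
        return 0;
    }
    int openCount() const { return open_count_; }

private:
    SegmentHandle next_handle_ = 0;
    int open_count_ = 0;
};

// Hypothetical wrapper around the hostname -> handle cache.
class HandleCache {
public:
    explicit HandleCache(MockEngine& engine) : engine_(engine) {}

    // Same get-or-create logic as in the excerpt above.
    SegmentHandle getOrOpen(const std::string& host) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handle_map_.find(host);
        if (it != handle_map_.end()) return it->second;
        SegmentHandle handle = engine_.openSegment(host);
        if (handle == kInvalidHandle) return kInvalidHandle;
        handle_map_.emplace(host, handle);
        return handle;
    }

    // Intended to be called when a transfer to `host` fails.
    // Closes the segment and drops the cache entry; returns false if absent.
    bool evict(const std::string& host) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handle_map_.find(host);
        if (it == handle_map_.end()) return false;
        engine_.closeSegment(it->second);
        handle_map_.erase(it);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return handle_map_.size();
    }

private:
    MockEngine& engine_;
    mutable std::mutex mutex_;
    std::unordered_map<std::string, SegmentHandle> handle_map_;
};
```

In the FAILED branch this would amount to calling evict(target_hostname) after freeBatchID(batch_id), so the next transfer to that host reopens the segment instead of reusing a possibly broken one. Evicting on every failure may be too aggressive if failures are transient, so this is a starting point for discussion, not a proposed patch.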
