[Bug]: Potential QP leak on transfer failure for PD disaggregation scenario

# Environment
Components: Mooncake v0.3.7-post2, SGLang v0.5.7
Use Case: SGLang Prefill-Decode (PD) disaggregation for LLM inference

# Problem Description
After running for some time in a PD disaggregation scenario, the Prefill side starts failing with the error: `Failed to create QP: Cannot allocate memory`

# Analysis
1. When Prefill transfers KV cache to Decode and the transfer engine returns an error:
   - SGLang only marks the session as failed in the Python layer.
   - By that point the transfer engine has presumably already allocated the underlying resources (endpoints, QPs) and established a connection with the Decode side.

2. When a transfer fails, we want to confirm whether Mooncake:
   - automatically cleans up QP resources,
   - automatically cleans up endpoint resources,
   - or leaves these resources allocated (a potential leak).
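One way to answer these questions empirically is to balance creations against destructions around a failed transfer. The following is a minimal sketch, not Mooncake code: `LiveCounter` and `TrackedResource` are hypothetical names, and in a real investigation the two hooks would sit next to the actual QP and endpoint create/destroy calls.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper (not part of Mooncake): counts how many resources of
// one kind (for example QPs) are currently alive.
class LiveCounter {
public:
    void onCreate() { created_.fetch_add(1, std::memory_order_relaxed); }
    void onDestroy() { destroyed_.fetch_add(1, std::memory_order_relaxed); }

    // Number of resources created but not yet destroyed.
    std::int64_t live() const {
        return created_.load(std::memory_order_relaxed) -
               destroyed_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::int64_t> created_{0};
    std::atomic<std::int64_t> destroyed_{0};
};

// RAII stand-in for a QP or endpoint: bumps the counter on construction and
// destruction, so a leak shows up as a live() value that never returns to
// its baseline after a failed transfer.
class TrackedResource {
public:
    explicit TrackedResource(LiveCounter& counter) : counter_(counter) {
        counter_.onCreate();
    }
    ~TrackedResource() { counter_.onDestroy(); }

    TrackedResource(const TrackedResource&) = delete;
    TrackedResource& operator=(const TrackedResource&) = delete;

private:
    LiveCounter& counter_;
};
```

If `live()` keeps growing across failed sessions while the number of active Decode peers stays constant, that would support the leak hypothesis.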

# Current Understanding
From code analysis of `mooncake-integration/transfer_engine/transfer_engine_py.cpp`:
```
int TransferEnginePy::batchTransferSync(...) {
    // Get or create segment handle
    Transport::SegmentHandle handle;
    {
        std::lock_guard<std::mutex> guard(mutex_);
        if (handle_map_.count(target_hostname)) {
            handle = handle_map_[target_hostname];
        } else {
            handle = engine_->openSegment(target_hostname);
            if (handle == (Transport::SegmentHandle)-1) return -1;
            handle_map_[target_hostname] = handle;  // ← Cached permanently
        }
    }

    // ... submit transfer ...

    // On transfer failure
    else if (status.s == TransferStatusEnum::FAILED) {
        engine_->freeBatchID(batch_id);  // Only frees batch ID
        already_freed = true;
        completed = true;
        // Question: Are QP/endpoint resources cleaned up here?
    }

    return -1;
}
```

We notice that:
- `freeBatchID(batch_id)` is called on failure, which releases only the batch ID.
- The `handle_map_[target_hostname]` entry is never removed.
- `closeSegment()` appears to be a no-op (it just returns 0).
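To make the concern concrete, here is a sketch of what evicting a cached handle on failure could look like. It is a self-contained model under stated assumptions, not a patch: `FakeEngine` and `HandleCache` are hypothetical stand-ins, and `FakeEngine::closeSegment` is assumed to actually release resources, which the real `closeSegment()` does not appear to do.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the transfer engine; NOT the Mooncake API.
// It only counts segments that are currently open.
struct FakeEngine {
    using SegmentHandle = std::uint64_t;
    SegmentHandle next_handle = 1;
    int open_segments = 0;

    SegmentHandle openSegment(const std::string& /*target*/) {
        ++open_segments;
        return next_handle++;
    }
    // Assumption: closing really releases the segment's resources.
    int closeSegment(SegmentHandle /*handle*/) {
        --open_segments;
        return 0;
    }
};

// Per-target handle cache with an explicit eviction path for failures.
class HandleCache {
public:
    explicit HandleCache(FakeEngine& engine) : engine_(engine) {}

    // Return the cached handle for `target`, opening a segment if needed.
    FakeEngine::SegmentHandle getOrOpen(const std::string& target) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handle_map_.find(target);
        if (it != handle_map_.end()) return it->second;
        auto handle = engine_.openSegment(target);
        handle_map_[target] = handle;
        return handle;
    }

    // Call when a transfer to `target` fails: close the segment and drop the
    // cached handle so the next transfer reopens it from scratch.
    bool evict(const std::string& target) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handle_map_.find(target);
        if (it == handle_map_.end()) return false;
        engine_.closeSegment(it->second);
        handle_map_.erase(it);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return handle_map_.size();
    }

private:
    FakeEngine& engine_;
    mutable std::mutex mutex_;
    std::unordered_map<std::string, FakeEngine::SegmentHandle> handle_map_;
};
```

In this model, calling `evict(target)` from the failure branch keeps the cache bounded by the number of healthy peers. In the real code, eviction alone would presumably not free QPs or endpoints unless `closeSegment()` (or some other teardown path) actually releases them, which is the open question of this issue.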

[Bug]: Potential QP leak on transfer failure for PD disaggregation scenario #1845
