Environment
Components: Mooncake v0.3.7-post2, SGLang v0.5.7
Use Case: SGLang Prefill-Decode (PD) disaggregation for LLM inference
Problem Description
After running for some time in a PD disaggregation deployment, the Prefill side fails with the following error:
Failed to create QP: Cannot allocate memory
Analysis
- When Prefill transfers KV cache to Decode and the transfer engine returns an error:
  - SGLang only marks the session as failed in the Python layer.
  - By that point the transfer engine has presumably already allocated the underlying resources (endpoints, QPs) and established a connection with the Decode side.
- When a transfer fails, we want to confirm whether Mooncake:
  - automatically cleans up QP resources,
  - automatically cleans up endpoint resources,
  - or leaves these resources allocated (a potential leak).
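To make the suspected failure mode concrete, here is a minimal sketch of how a per-failure leak would eventually exhaust a finite QP budget. It is a toy model, not Mooncake or libibverbs code: `QpBudget`, `runFailingSessions`, and the limit of 4 are hypothetical names and values chosen for illustration, and the assumption that each failed session leaves exactly one QP allocated is the hypothesis under question, not a confirmed behavior.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a device with a finite QP budget.
// This models the hypothesis only; it is not Mooncake or libibverbs code.
class QpBudget {
public:
    explicit QpBudget(std::size_t limit) : limit_(limit) {}

    // Returns false once the budget is exhausted, analogous to the
    // "Failed to create QP: Cannot allocate memory" error.
    bool create() {
        if (live_ == limit_) return false;
        ++live_;
        return true;
    }

    // Releases one previously created QP.
    void destroy() {
        assert(live_ > 0);
        --live_;
    }

    std::size_t live() const { return live_; }

private:
    std::size_t limit_;
    std::size_t live_ = 0;
};

// Runs up to `sessions` transfers that all fail. Returns how many sessions
// managed to create a QP before the budget ran out.
std::size_t runFailingSessions(QpBudget& budget, std::size_t sessions,
                               bool cleanup_on_failure) {
    std::size_t created = 0;
    for (std::size_t i = 0; i < sessions; ++i) {
        if (!budget.create()) break;  // the allocation error surfaces here
        ++created;
        // The transfer fails; only the cleanup policy differs.
        if (cleanup_on_failure) budget.destroy();
    }
    return created;
}
```

With a budget of 4 and no cleanup, the fifth failing session can no longer create a QP. With cleanup on failure, the same budget survives any number of failing sessions. If the real system behaves like the first case, the error would only appear after enough failures have accumulated, which matches the "after running for some time" observation.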
Current Understanding
From code analysis of mooncake-integration/transfer_engine/transfer_engine_py.cpp:
int TransferEnginePy::batchTransferSync(...) {
    // Get or create segment handle
    Transport::SegmentHandle handle;
    {
        std::lock_guard<std::mutex> guard(mutex_);
        if (handle_map_.count(target_hostname)) {
            handle = handle_map_[target_hostname];
        } else {
            handle = engine_->openSegment(target_hostname);
            if (handle == (Transport::SegmentHandle)-1) return -1;
            handle_map_[target_hostname] = handle;  // ← Cached permanently
        }
    }
    // ... submit transfer ...
    // On transfer failure
    else if (status.s == TransferStatusEnum::FAILED) {
        engine_->freeBatchID(batch_id);  // Only frees batch ID
        already_freed = true;
        completed = true;
        // Question: Are QP/endpoint resources cleaned up here?
    }
    return -1;
}
We notice that:
- freeBatchID(batch_id) is the only cleanup call we can see on the failure path
- The handle_map_[target_hostname] entry is never removed, so the cached handle outlives the failed transfer
- closeSegment() appears to be a no-op (return 0)
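For discussion, the caching pattern above could be paired with an eviction step on failure. The following is a self-contained sketch under stated assumptions: `FakeEngine` and `HandleCache` are hypothetical stand-ins written for this issue, and `FakeEngine::closeSegment` is assumed to really release the segment, which is exactly what we are unsure about in Mooncake.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

using SegmentHandle = std::uint64_t;
constexpr SegmentHandle kInvalidHandle = static_cast<SegmentHandle>(-1);

// Hypothetical engine stand-in that only counts open segments.
// Unlike the closeSegment() we observed, this one releases the segment.
class FakeEngine {
public:
    SegmentHandle openSegment(const std::string& /*host*/) {
        ++open_;
        return next_++;
    }
    int closeSegment(SegmentHandle /*handle*/) {
        assert(open_ > 0);
        --open_;
        return 0;
    }
    int openCount() const { return open_; }

private:
    SegmentHandle next_ = 0;
    int open_ = 0;
};

// Hypothetical cache mirroring handle_map_, plus an explicit eviction path.
class HandleCache {
public:
    explicit HandleCache(FakeEngine& engine) : engine_(engine) {}

    // Returns the cached handle for `host`, opening a segment on first use.
    SegmentHandle getOrOpen(const std::string& host) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = map_.find(host);
        if (it != map_.end()) return it->second;
        SegmentHandle handle = engine_.openSegment(host);
        if (handle != kInvalidHandle) map_.emplace(host, handle);
        return handle;
    }

    // Intended for the failure path: close the segment and drop the entry
    // so that the next transfer to `host` starts from a fresh handle.
    void evict(const std::string& host) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = map_.find(host);
        if (it == map_.end()) return;
        engine_.closeSegment(it->second);
        map_.erase(it);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return map_.size();
    }

private:
    FakeEngine& engine_;
    mutable std::mutex mutex_;
    std::unordered_map<std::string, SegmentHandle> map_;
};
```

In this model, calling `evict("decode-0")` after a failed transfer drops the cached handle and brings the engine's open count back to zero. In the real code, eviction alone would not help if closeSegment() is indeed a no-op, because the QPs and endpoints behind the handle would stay allocated.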