[RFC]: Hard Pin and Upsert for Mooncake Store #1645

@00fish0

Description

1. Summary

This RFC proposes two independent, composable features to enhance Mooncake Store for model weight storage workloads:

  1. Hard Pin — A mechanism that guarantees an object will never be evicted, until explicitly unpinned or removed. Hard pin is set at creation time via ReplicateConfig.
  2. Upsert — An update interface that replaces an existing object's data in-place (when size is unchanged) or via delete-and-reallocate (when size differs). Concurrent in-progress writes on the same key are preempted.

Both features are designed as orthogonal primitives. They can be used independently or together, preserving flexibility for reinforcement learning, model management, and hybrid KVCache + model weight workloads.

The rest of this document first explains the design of the two features, then summarizes implementation impact, client APIs, configuration, compatibility, and testing.

2. Design

2.1 Design Principles

  1. Minimal data structure changes — Reuse existing mechanisms (replicas_, ReplicaStatus, processing_keys, DiscardedReplicas) wherever possible.
  2. Predictable reader behavior — During in-place upsert (Case B), the key temporarily has no readable replicas (all replicas are PROCESSING). Readers receive REPLICA_IS_NOT_READY, consistent with a key mid-PutStart. This is acceptable for model weight workloads where updates are coordinated and inference is paused during checkpoint loading. See §2.3.8 for a detailed discussion of this trade-off.
  3. Memory efficiency — Avoid 2x memory overhead by reusing existing buffers (in-place) or freeing before reallocating, instead of COW.
  4. Failure safety — Crashed or failed operations must be automatically cleaned up by existing timeout mechanisms.

2.2 Hard Pin

2.2.1 Data Model

Add a hard_pinned boolean to ObjectMetadata:

// master_service.h — ObjectMetadata (alongside existing lease_timeout and soft_pin_timeout)
mutable bool hard_pinned GUARDED_BY(lock){false};

Why a boolean instead of a timeout?

  • Model weights need indefinite protection. Using a timeout to approximate "forever" is the exact anti-pattern we're replacing.
  • A boolean is cheaper to check (no clock comparison) and semantically unambiguous.

2.2.2 Setting Hard Pin

Hard pin is set at creation time via ReplicateConfig:

// replica.h
struct ReplicateConfig {
    size_t replica_num{1};
    bool with_soft_pin{false};
    bool with_hard_pin{false};  // NEW
    // ...
};

When with_hard_pin=true, the object is hard-pinned from creation. This is the common path for model weight storage. Hard pin state is preserved across Upsert operations.

To unpin a hard-pinned object, the user must Remove it and re-create it without with_hard_pin.
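
The create-then-recreate lifecycle can be sketched with a runnable toy model (hypothetical names — TinyStore is not the real Mooncake API; it only models the rule that hard pin is captured once, at creation time):

```python
from dataclasses import dataclass

@dataclass
class ReplicateConfig:
    replica_num: int = 1
    with_soft_pin: bool = False
    with_hard_pin: bool = False  # NEW field from this RFC

class TinyStore:
    def __init__(self):
        self._meta = {}  # key -> {"hard_pinned": bool, "data": bytes}

    def put(self, key, data, config=ReplicateConfig()):
        # Hard pin is captured once, at creation time.
        self._meta[key] = {"hard_pinned": config.with_hard_pin, "data": data}

    def remove(self, key):
        self._meta.pop(key, None)

    def is_hard_pinned(self, key):
        return self._meta[key]["hard_pinned"]

store = TinyStore()
store.put("model/weights", b"v1", ReplicateConfig(with_hard_pin=True))
assert store.is_hard_pinned("model/weights")

# To unpin: Remove, then re-create without with_hard_pin.
store.remove("model/weights")
store.put("model/weights", b"v1")
assert not store.is_hard_pinned("model/weights")
```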

2.2.3 Eviction Behavior

The eviction loop in BatchEvict() currently has two passes:

| Pass | What it evicts |
| --- | --- |
| First | Non-soft-pinned objects with expired leases |
| Second | Soft-pinned objects (if allow_evict_soft_pinned_objects_ is true) |

With hard pin, the change is minimal — add a single check at the top of the iteration:

// master_service.cpp — BatchEvict(), in the per-object loop (3 locations)
if (it->second.IsHardPinned()) {
    ++it;
    continue;  // Skip — hard-pinned objects are NEVER evicted
}
// ... existing soft pin / lease checks unchanged

The eviction priority becomes:

Eviction Priority (first evicted → last evicted):
1. Non-pinned, lease-expired objects     ← evicted first
2. Soft-pinned objects (if allowed)      ← evicted under pressure
3. Hard-pinned objects                   ← NEVER evicted
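
The priority order above can be modeled as a runnable sketch (hypothetical and heavily simplified — no sharding, clocks, or batching; the point is only that the hard-pin check short-circuits before any lease or soft-pin logic):

```python
def evict_order(objects, allow_evict_soft_pinned=True):
    """objects: dict key -> dict(hard_pinned, soft_pinned, lease_expired).
    Returns keys in the order they become eviction candidates."""
    candidates = []
    # First pass: non-soft-pinned objects with expired leases.
    for k, o in objects.items():
        if o["hard_pinned"]:
            continue  # NEVER evicted
        if not o["soft_pinned"] and o["lease_expired"]:
            candidates.append(k)
    # Second pass: soft-pinned objects, only under memory pressure.
    if allow_evict_soft_pinned:
        for k, o in objects.items():
            if o["hard_pinned"]:
                continue  # still skipped
            if o["soft_pinned"] and o["lease_expired"]:
                candidates.append(k)
    return candidates

objs = {
    "kv/1":      dict(hard_pinned=False, soft_pinned=False, lease_expired=True),
    "hot/1":     dict(hard_pinned=False, soft_pinned=True,  lease_expired=True),
    "weights/1": dict(hard_pinned=True,  soft_pinned=False, lease_expired=True),
}
assert evict_order(objs) == ["kv/1", "hot/1"]             # weights never appear
assert evict_order(objs, allow_evict_soft_pinned=False) == ["kv/1"]
```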

2.2.4 HA Mode Serialization

The MetadataSerializer must persist the hard_pinned field so that hard pin state survives master restarts. This is a single boolean field appended to the existing msgpack serialization.

2.3 Upsert (In-Place Update / Delete-and-Reallocate)

2.3.1 Core Insight

Instead of a COW approach that requires 2x memory (old + new data coexist), Upsert takes a simpler, more memory-efficient approach:

  • When size is unchanged: reuse the existing memory buffers — mark replicas as PROCESSING, let the client overwrite the data in-place, then mark them COMPLETE again.
  • When size differs: free the old replicas first, then allocate new ones at the required size.
  • When the key has an in-progress write (Put or Upsert): preempt the in-progress operation — move its PROCESSING replicas to DiscardedReplicas for delayed release, then proceed with the new Upsert.

This design reuses existing mechanisms with minimal changes:

| Existing Mechanism | Role in Upsert |
| --- | --- |
| ReplicaStatus state machine | COMPLETE → PROCESSING → COMPLETE lifecycle for in-place updates |
| processing_keys set | Tracks keys with active Upsert/Put writes |
| DiscardedReplicas + TTL | Safe delayed release of preempted PROCESSING replicas |
| DiscardExpiredProcessingReplicas | Auto-cleanup of a failed upsert's replicas on timeout |

Required metadata changes:

  1. ObjectMetadata::put_start_time must become non-const — refreshed to now on in-place UpsertStart, so DiscardExpiredProcessingReplicas times out the current write cycle, not the original creation time.
  2. ObjectMetadata::client_id must become non-const — updated to the new client on in-place UpsertStart.
  3. ObjectMetadata::size remains const — when size changes, the metadata entry is deleted and recreated.
  4. Replica needs new methods:
    • mark_processing() — transitions COMPLETE → PROCESSING for in-place updates.
    • is_processing() / fn_is_processing() — predicate for filtering PROCESSING replicas during preemption.
    • is_busy() / fn_is_busy() — checks refcnt > 0, used as a safety gate before allowing Case B/C.

New error code:

  • OBJECT_REPLICA_BUSY (-714) — returned when UpsertStart finds replicas with non-zero refcnt (active RDMA reads in flight). The client should retry after readers finish.

No changes are needed to eviction logic or any other existing mechanism.
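
The new Replica methods listed above can be sketched as a runnable toy class (hypothetical, simplified to a bare status string and refcnt — the real Replica also tracks buffers, type, and descriptors):

```python
class Replica:
    def __init__(self):
        self.status = "PROCESSING"  # as after PutStart
        self.refcnt = 0             # active RDMA readers

    def mark_complete(self):
        self.status = "COMPLETE"

    def mark_processing(self):
        # NEW: COMPLETE -> PROCESSING, used by in-place upsert (Case B)
        assert self.status == "COMPLETE"
        self.status = "PROCESSING"

    def is_processing(self):
        return self.status == "PROCESSING"

    def is_busy(self):
        # Safety gate before Case B/C: readers still in flight
        return self.refcnt > 0

r = Replica()
r.mark_complete()            # PutEnd
assert not r.is_processing()
r.mark_processing()          # in-place UpsertStart
assert r.is_processing() and not r.is_busy()
r.refcnt += 1                # a reader pins the replica
assert r.is_busy()           # UpsertStart would return OBJECT_REPLICA_BUSY
```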

2.3.2 API Design

Master Service API

Following the existing PutStart/PutEnd/PutRevoke pattern:

// Single-key operations
auto UpsertStart(const UUID& client_id, const std::string& key,
                 const uint64_t slice_length, const ReplicateConfig& config)
    -> tl::expected<std::vector<Replica::Descriptor>, ErrorCode>;

auto UpsertEnd(const UUID& client_id, const std::string& key,
               ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;

auto UpsertRevoke(const UUID& client_id, const std::string& key,
                  ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;

// Batch variants
std::vector<tl::expected<std::vector<Replica::Descriptor>, ErrorCode>>
BatchUpsertStart(const UUID& client_id,
                 const std::vector<std::string>& keys,
                 const std::vector<uint64_t>& slice_lengths,
                 const ReplicateConfig& config);

std::vector<tl::expected<void, ErrorCode>>
BatchUpsertEnd(const UUID& client_id, const std::vector<std::string>& keys);

std::vector<tl::expected<void, ErrorCode>>
BatchUpsertRevoke(const UUID& client_id, const std::vector<std::string>& keys);

UpsertEnd and UpsertRevoke delegate to PutEnd and PutRevoke — no special logic is needed. The batch variants follow the same delegation pattern (BatchUpsertEnd → BatchPutEnd, etc.).

Client Service API (C++)

// Single-key: orchestrates UpsertStart → TransferWrite → UpsertEnd/Revoke
tl::expected<void, ErrorCode> Upsert(const ObjectKey& key,
                                     std::vector<Slice>& slices,
                                     const ReplicateConfig& config);

// Batch: pipeline — BatchUpsertStart → parallel RDMA → BatchUpsertEnd/Revoke
std::vector<tl::expected<void, ErrorCode>> BatchUpsert(
    const std::vector<ObjectKey>& keys,
    std::vector<std::vector<Slice>>& batched_slices,
    const ReplicateConfig& config);

PyClient / RealClient / DummyClient API

The upsert API surface mirrors the existing put family. Each put variant has an upsert counterpart:

| Put variant | Upsert counterpart | Description |
| --- | --- | --- |
| put(key, span) | upsert(key, span) | Single key, copy semantics |
| put_from(key, buffer, size) | upsert_from(key, buffer, size) | Single key, zero-copy from registered buffer |
| put_parts(key, spans) | upsert_parts(key, spans) | Single key, multi-part concatenation |
| put_batch(keys, spans) | upsert_batch(keys, spans) | Batch, copy semantics |
| batch_put_from(keys, buffers, sizes) | batch_upsert_from(keys, buffers, sizes) | Batch, zero-copy |

All variants accept an optional ReplicateConfig parameter.

Python Binding API

Raw bytes — mirrors put / put_from / put_batch / batch_put_from / put_parts:

store.upsert(key, value, config=None)
store.upsert_from(key, buffer_ptr, size, config=None)
store.upsert_parts(key, *parts, config=None)
store.upsert_batch(keys, values, config=None)
store.batch_upsert_from(keys, buffer_ptrs, sizes, config=None)

Tensor — mirrors put_tensor / put_tensor_from / batch_put_tensor / batch_put_tensor_from:

store.upsert_tensor(key, tensor)
store.upsert_tensor_from(key, buffer_ptr, size)
store.batch_upsert_tensor(keys, tensors_list)
store.batch_upsert_tensor_from(keys, buffer_ptrs, sizes)

Tensor with configurable replication (upsert-only, no put counterpart):

store.upsert_pub_tensor(key, tensor, config)
store.batch_upsert_pub_tensor(keys, tensors_list, config)

Note: Tensor Parallelism (TP) variants (upsert_tensor_with_tp, etc.) are not included in V1. They can be added later following the same pattern as put_tensor_with_tp.

2.3.3 Client Call Path

The upsert call path mirrors Put, with 6 layers from Python down to the master service:

Python Binding (store_py.cpp)
  │  upsert_tensor() / upsert_from() / batch_upsert_from() / ...
  │  Tensor methods: extract metadata + data → call upsert_parts / upsert_from
  │  Raw bytes methods: extract buffer info → call PyClient interface
  ▼
PyClient interface (pyclient.h) ── pure virtual base class
  ├─ RealClient (real_client.cpp) ── production path
  │    upsert_internal() / upsert_from_internal() / ...
  │    Allocates RDMA-registered buffer, copies data, splits into Slices
  │    Calls Client::Upsert(key, slices, config)
  │
  └─ DummyClient (dummy_client.cpp) ── test/single-node path
       IPC proxy → calls RealClient::upsert_dummy_helper() on server side
  ▼
Client Service (client_service.cpp)
  │  Client::Upsert() ── orchestrates the three-phase protocol:
  │    1. MasterClient::UpsertStart()
  │    2. TransferWrite() ── RDMA write to allocated buffers
  │    3. MasterClient::UpsertEnd() or UpsertRevoke() on failure
  │
  │  Client::BatchUpsert() ── pipeline (see §2.3.4):
  │    1. StartBatchUpsert()  ── one BatchUpsertStart RPC
  │    2. SubmitTransfers()   ── async RDMA submit for all keys
  │    3. WaitForTransfers()  ── wait for all RDMA to complete (parallel I/O)
  │    4. FinalizeBatchUpsert() ── BatchUpsertEnd / BatchUpsertRevoke RPCs
  ▼
Master Client (master_client.cpp) ── RPC client stub
  │  Serializes parameters, sends RPC to master service
  │  UpsertStart: sums slice_lengths into a single total_length before sending
  ▼
RPC Service (rpc_service.cpp) ── RPC server handler
  │  Deserializes, delegates to MasterService, records metrics
  ▼
Master Service (master_service.cpp) ── core metadata logic
     UpsertStart: three-way dispatch (Case A / B / C)
     UpsertEnd: delegates to PutEnd
     UpsertRevoke: delegates to PutRevoke

The middle layers (Client Service transfer phase, Master Client, RPC Service) are structurally identical to the Put path. The only new logic is in MasterService::UpsertStart (§2.3.5) and the BatchUpsert pipeline (§2.3.4).

2.3.4 BatchUpsert Pipeline

BatchUpsert uses the same pipeline architecture as BatchPut, reusing its transfer infrastructure. The pipeline replaces a serial loop (3N RPCs + N serial RDMA writes) with batched RPCs and parallel RDMA transfers:

Serial (old):                          Pipeline (new):
  for each key:                          1. BatchUpsertStart  ── 1 RPC
    UpsertStart   ── 1 RPC               2. SubmitTransfers   ── N async RDMA
    TransferWrite ── 1 RDMA (blocking)   3. WaitForTransfers  ── parallel wait
    UpsertEnd     ── 1 RPC               4. FinalizeBatchUpsert ── 1-2 RPCs
  Total: 3N RPCs, N serial RDMA          Total: 2-3 RPCs, N parallel RDMA

The implementation adds two thin wrapper methods (StartBatchUpsert, FinalizeBatchUpsert) that call the upsert-specific RPCs. The data transfer phase (SubmitTransfers, WaitForTransfers) and operation state tracking (PutOperation class) are reused from BatchPut without modification — they are operation-agnostic.

Note: BatchUpsert does not support prefer_alloc_in_same_node (returns INVALID_PARAMS), so the BatchPutWhenPreferSameNode code path is not needed.

2.3.5 Detailed Flow

flowchart TD
    Start["UpsertStart(client_id, key, slice_length, config)"]

    Start --> Validate{"Validate params:<br/>key non-empty, slice_length > 0,<br/>replica_num > 0, cachelib slice limit"}
    Validate -- "invalid" --> ErrParams["INVALID_PARAMS"]
    Validate -- "ok" --> Lock

    Lock["Lock: snapshot_mutex_(shared) + shard(exclusive)<br/>it = shard->metadata.find(key)"]
    Lock --> Found{"key found?"}

    Found -- "no" --> CaseA
    Found -- "yes" --> Stale{"CleanupStaleHandles:<br/>all handles stale?"}

    Stale -- "yes" --> EraseStale["erase metadata & processing_keys"]
    EraseStale --> CaseA
    Stale -- "no" --> ChkRepl{"replication_tasks<br/>contains key?"}

    ChkRepl -- "yes" --> ErrRepl["OBJECT_HAS_REPLICATION_TASK"]
    ChkRepl -- "no" --> ChkOffload{"offloading_tasks<br/>contains key?"}
    ChkOffload -- "yes" --> ErrRepl
    ChkOffload -- "no" --> ChkProc{"key in<br/>processing_keys?"}

    ChkProc -- "no" --> ChkRefcnt
    ChkProc -- "yes" --> Preempt["Preempt ongoing Put/Upsert:<br/>PopReplicas(PROCESSING) → discarded<br/>(delayed free after timeout)<br/>erase from processing_keys"]
    Preempt --> HasComplete{"COMPLETE replicas<br/>remain?"}
    HasComplete -- "no" --> EraseMeta["erase metadata"]
    EraseMeta --> CaseA
    HasComplete -- "yes" --> ChkRefcnt

    ChkRefcnt{"any replica<br/>refcnt > 0?"}
    ChkRefcnt -- "yes" --> ErrBusy["OBJECT_REPLICA_BUSY"]
    ChkRefcnt -- "no" --> SizeEq{"metadata.size<br/>== total_length?"}

    SizeEq -- "equal" --> CaseB
    SizeEq -- "different" --> CaseC

    CaseA["Case A: key does not exist<br/>AllocateAndInsertMetadata()<br/>— allocate new buffers, create metadata"]
    CaseB["Case B: same size — in-place update<br/>1. update client_id & put_start_time<br/>2. reconcile soft_pin<br/>3. mark_processing() all COMPLETE replicas<br/>4. return existing descriptors (same addrs)"]
    CaseC["Case C: different size — reallocate<br/>1. PopReplicas() → discarded (delayed free)<br/>2. erase old metadata<br/>3. AllocateAndInsertMetadata() — new buffers"]

    CaseA --> OK
    CaseB --> OK
    CaseC --> OK
    OK["return vector of Replica::Descriptor<br/>→ client performs RDMA write"]

    style ErrParams fill:#ddd,stroke:#888
    style ErrRepl fill:#ddd,stroke:#888
    style ErrBusy fill:#ddd,stroke:#888

UpsertStart(client_id, key, slice_length, config):

Step 0: Cleanup stale handles
    IF key exists AND all handles are stale (CleanupStaleHandles returns true):
        → Erase processing_keys entry and metadata
        → Treat as non-existent (fall through to Case A)

Step 1: Safety checks and preemption (only if key exists)
    1a. IF key has active replication_tasks (Copy/Move in progress):
        → Return OBJECT_HAS_REPLICATION_TASK
    1b. IF key has active offloading_tasks:
        → Return OBJECT_HAS_REPLICATION_TASK
    1c. IF key is in processing_keys (another Put/Upsert in progress):
        → Pop all PROCESSING replicas → DiscardedReplicas (TTL for delayed release)
        → Remove key from processing_keys
        → If no COMPLETE replicas remain, erase metadata entirely
           and fall through to Case A

Step 2: Dispatch by key existence and size

    Case A — Key does not exist (or erased by Step 0/1c):
        → Equivalent to PutStart (allocate replicas, create ObjectMetadata)
        → Return new replica descriptors

    (Safety gate before Case B/C)
    IF any replica has non-zero refcnt (is_busy):
        → Return OBJECT_REPLICA_BUSY

    Case B — Key exists, size unchanged (in-place update):
        → Mark all COMPLETE replicas → PROCESSING (via mark_processing())
        → Refresh put_start_time = now
        → Update client_id = new client_id
        → Insert key into processing_keys
        → Return existing replica descriptors (same memory addresses)

    Case C — Key exists, size changed (delete-and-reallocate):
        → Pop all replicas → DiscardedReplicas (TTL for delayed release,
          same as MoveEnd — needed because GetReplicaList doesn't use refcnt,
          so readers may still hold descriptors without incrementing refcnt)
        → Erase metadata entry
        → Allocate new replicas + create new ObjectMetadata (same as PutStart)
        → Return new replica descriptors

Client transfers data (identical to Put — TransferWrite via RDMA/TCP)
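
The UpsertStart dispatch above can be condensed into a runnable toy model (hypothetical and heavily simplified: one shard, one replica per key, no locking, leases, timeouts, or stale-handle cleanup — only the Case A/B/C dispatch, the busy gate, and preemption):

```python
OBJECT_REPLICA_BUSY = -714

class Master:
    def __init__(self):
        self.meta = {}            # key -> dict(size, client_id, status, refcnt)
        self.processing_keys = set()
        self.discarded = []       # (key, metadata) pairs awaiting TTL release

    def _allocate(self, client_id, key, size):
        # Case A body: AllocateAndInsertMetadata
        self.meta[key] = dict(size=size, client_id=client_id,
                              status="PROCESSING", refcnt=0)
        self.processing_keys.add(key)

    def upsert_start(self, client_id, key, size):
        m = self.meta.get(key)
        # Step 1c: preempt an in-progress Put/Upsert on the same key.
        if m is not None and key in self.processing_keys:
            self.discarded.append((key, m))   # delayed free after TTL
            self.processing_keys.discard(key)
            del self.meta[key]                # no COMPLETE replica in this toy
            m = None
        if m is None:                         # Case A: key does not exist
            self._allocate(client_id, key, size)
            return "CASE_A"
        if m["refcnt"] > 0:                   # busy gate before Case B/C
            return OBJECT_REPLICA_BUSY
        if m["size"] == size:                 # Case B: in-place update
            m["status"] = "PROCESSING"        # mark_processing()
            m["client_id"] = client_id        # put_start_time refresh omitted
            self.processing_keys.add(key)
            return "CASE_B"
        self.discarded.append((key, m))       # Case C: delete-and-reallocate
        del self.meta[key]
        self._allocate(client_id, key, size)
        return "CASE_C"

    def upsert_end(self, client_id, key):
        m = self.meta.get(key)
        if m is None:
            return "OBJECT_NOT_FOUND"
        if m["client_id"] != client_id:
            return "ILLEGAL_CLIENT"           # this writer was preempted
        m["status"] = "COMPLETE"
        self.processing_keys.discard(key)
        return "OK"

ms = Master()
assert ms.upsert_start("c1", "w", 100) == "CASE_A"    # new key
assert ms.upsert_end("c1", "w") == "OK"
assert ms.upsert_start("c2", "w", 100) == "CASE_B"    # same size: in-place
assert ms.upsert_end("c2", "w") == "OK"
assert ms.upsert_start("c3", "w", 200) == "CASE_C"    # size changed: realloc
assert ms.upsert_end("c3", "w") == "OK"
ms.meta["w"]["refcnt"] = 1
assert ms.upsert_start("c4", "w", 200) == OBJECT_REPLICA_BUSY
ms.meta["w"]["refcnt"] = 0
# Preemption: c6 takes over key "x" while c5's write is still in flight.
assert ms.upsert_start("c5", "x", 50) == "CASE_A"
assert ms.upsert_start("c6", "x", 50) == "CASE_A"     # c5 preempted
assert ms.upsert_end("c5", "x") == "ILLEGAL_CLIENT"
assert ms.upsert_end("c6", "x") == "OK"
```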

UpsertEnd(client_id, key, replica_type):

  • Equivalent to PutEnd — mark PROCESSING → COMPLETE, grant lease, remove from processing_keys.

UpsertRevoke(client_id, key, replica_type):

  • Delegates to PutRevoke — erases all PROCESSING replicas of the given type, cleans up metadata if empty.
  • Known limitation for Case B (in-place update): In Case B, UpsertStart reuses existing buffers by marking COMPLETE replicas as PROCESSING. If the client's transfer fails and it calls UpsertRevoke, PutRevoke erases these replicas — the old data is lost. This is a deliberate trade-off: in-place update avoids 2x memory overhead but sacrifices rollback capability. For model weight workloads, this is acceptable because the source data (GPU/CPU memory) is still available and the client can retry the upsert.

2.3.6 Swimlane Diagrams

In-Place Update Path (key exists, size unchanged)
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server

    C->>M: UpsertStart(client_id, key, same_size, config)

    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: Safety checks (replication/offloading tasks, refcnt)
    M->>M: size unchanged → in-place path
    M->>M: mark_processing() on all COMPLETE replicas
    M->>M: Refresh put_start_time, update client_id
    M->>M: Insert key into processing_keys
    Note over M: Release shard write lock
    M-->>C: Return existing replica descriptors (same addresses)

    C->>TE: TransferWrite(descriptors, new_data)
    TE->>R: RDMA Write to same buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success

    C->>M: UpsertEnd(client_id, key, MEMORY)

    Note over M: Same as PutEnd
    M->>M: mark_complete() on PROCESSING replicas
    M->>M: Remove from processing_keys
    M->>M: GrantLease (preserve hard_pin/soft_pin)
    M-->>C: Success
Delete-and-Reallocate Path (key exists, size changed)
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server

    C->>M: UpsertStart(client_id, key, new_size, config)

    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: size differs → delete-and-reallocate path
    M->>M: Check refcnt → reject if busy (OBJECT_REPLICA_BUSY)
    M->>M: Pop all replicas → DiscardedReplicas (TTL delayed release)
    M->>M: Erase old metadata
    M->>M: Allocate new replicas (same as PutStart)
    M->>M: Create new ObjectMetadata
    Note over M: Release shard write lock
    M-->>C: Return new replica descriptors

    C->>TE: TransferWrite(descriptors, data)
    TE->>R: RDMA Write to new buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success

    C->>M: UpsertEnd(client_id, key, MEMORY)
    M->>M: Same as PutEnd
    M-->>C: Success
Preemption Path (key has in-progress write)
sequenceDiagram
    participant A as Client A
    participant B as Client B
    participant M as Master Service

    A->>M: PutStart(key) or UpsertStart(key)
    M-->>A: Return descriptors
    Note over A: Transfer in progress...

    B->>M: UpsertStart(key, ...)

    Note over M: key is in processing_keys
    M->>M: Pop A's PROCESSING replicas → DiscardedReplicas (with TTL)
    M->>M: Remove from processing_keys
    M->>M: Continue with normal Upsert flow (Case A/B/C)
    M-->>B: Return descriptors

    Note over M: A's replicas released after TTL<br/>by ReleaseExpiredDiscardedReplicas()

    A->>M: PutEnd(key) or UpsertEnd(key)
    M-->>A: ILLEGAL_CLIENT or OBJECT_NOT_FOUND

2.3.7 Failure Handling

  • Client crash after UpsertStart (Case A/C): DiscardExpiredProcessingReplicas will clean up after put_start_discard_timeout_sec_, same as for PutStart. The newly allocated replicas are freed, and the metadata entry is removed.
  • Client crash after UpsertStart (Case B — in-place): All replicas are PROCESSING with no COMPLETE replicas remaining. DiscardExpiredProcessingReplicas removes the entire metadata entry — old data is lost. This is the same trade-off described in UpsertRevoke above: no rollback for in-place updates.
  • Transfer failure and UpsertRevoke (Case A/C): PutRevoke erases the newly allocated PROCESSING replicas. For Case C, the old replicas were already moved to DiscardedReplicas and will be freed after TTL. The key effectively disappears until the client retries.
  • Transfer failure and UpsertRevoke (Case B — in-place): PutRevoke erases the reused PROCESSING replicas — old data is lost. The client must retry from source. See "Known Limitations" below.
  • Allocation failure in delete-and-reallocate (Case C): Old replicas have been moved to DiscardedReplicas, but new allocation fails. The key's data is lost. The client can retry.
  • Preempted client calls UpsertEnd/PutEnd: Returns ILLEGAL_CLIENT (if metadata was reused with a new client_id) or OBJECT_NOT_FOUND (if metadata was erased and recreated). The preempted client should treat this as a no-op — the new client has taken over the key.
  • Concurrent Copy/Move/Offload: Returns OBJECT_HAS_REPLICATION_TASK. The client should retry after the replication task completes.
  • Active readers (non-zero refcnt): Returns OBJECT_REPLICA_BUSY. The client should retry after readers release their references. Note: this only blocks Case B/C; if the key is mid-write (in processing_keys), preemption proceeds regardless of refcnt on the PROCESSING replicas.
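
A client-side policy for these cases can be summarized in a small sketch (assumption: next_action is a hypothetical helper, not part of the RFC; error names follow the cases listed above):

```python
def next_action(error_code):
    """Map an Upsert error to a client-side action per the failure cases."""
    retry_later = {"OBJECT_REPLICA_BUSY",          # readers in flight
                   "OBJECT_HAS_REPLICATION_TASK"}  # Copy/Move/Offload running
    taken_over = {"ILLEGAL_CLIENT",                # preempted, metadata reused
                  "OBJECT_NOT_FOUND"}              # preempted, metadata recreated
    if error_code in retry_later:
        return "retry"       # back off, then reissue the upsert
    if error_code in taken_over:
        return "noop"        # another client owns the key now
    return "fail"            # surface to the caller

assert next_action("OBJECT_REPLICA_BUSY") == "retry"
assert next_action("ILLEGAL_CLIENT") == "noop"
assert next_action("INVALID_PARAMS") == "fail"
```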

2.3.8 Read Visibility During Upsert

During an in-place update (Case B), all replicas are marked PROCESSING. GetReplicaList only returns COMPLETE replicas, so the key is temporarily unreadable — readers receive REPLICA_IS_NOT_READY until UpsertEnd marks them COMPLETE again.

This is a deliberate design choice to avoid read-write data races on RDMA buffers. The alternatives and their trade-offs:

| Approach | Read during write | Memory overhead | Complexity |
| --- | --- | --- | --- |
| Current (mark_processing) | Not readable | 1x | Low |
| COW (allocate new, swap on complete) | Reads see old data | 2x peak | Medium |
| Shadow key (write to temp key, rename) | Reads see old data | 2x peak | High (namespace pollution) |

For model weight workloads, the current approach is acceptable because:

  1. Weight updates are coordinated operations — inference is typically paused during checkpoint loading.
  2. The unreadable window is short (duration of RDMA transfer, typically milliseconds to seconds for GB-scale data).
  3. Avoiding 2x memory overhead is critical — model weights can consume the majority of available memory.

For workloads that require read availability during updates, COW semantics should be considered in a future RFC.
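
The visibility window can be illustrated with a one-replica sketch (hypothetical, simplified — the real GetReplicaList filters a replica list rather than a single status field):

```python
class Obj:
    def __init__(self):
        self.status = "COMPLETE"

def get_replica_list(obj):
    # Only COMPLETE replicas are returned to readers.
    if obj.status != "COMPLETE":
        return "REPLICA_IS_NOT_READY"
    return ["descriptor"]

o = Obj()
assert get_replica_list(o) == ["descriptor"]          # before upsert
o.status = "PROCESSING"                               # in-place UpsertStart
assert get_replica_list(o) == "REPLICA_IS_NOT_READY"  # during RDMA write
o.status = "COMPLETE"                                 # UpsertEnd
assert get_replica_list(o) == ["descriptor"]          # readable again
```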

2.3.9 Known Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| In-place revoke loses old data | Case B UpsertRevoke erases reused replicas. No rollback to previous value. | Client must retry from source (GPU/CPU). Acceptable for model weights. |
| Key unreadable during upsert | Case B marks all replicas PROCESSING; Get returns REPLICA_IS_NOT_READY. | Short window. Acceptable for coordinated weight updates. |
| replica_num changes ignored in Case B | If upsert requests a different replica_num than the existing object, Case B only checks size equality and reuses existing replicas. | The new replica count is silently ignored; client gets fewer (or more) replicas than requested. |
| Metrics reuse put counters | RPC layer increments put_start_requests / put_end_requests for upsert operations. | Cannot distinguish put vs upsert request volume in monitoring. |

4. Client-Side API

The full API surface is defined in §2.3.2. Below are usage examples.

C++

// Single upsert (zero-copy from registered buffer)
ReplicateConfig config;
config.with_hard_pin = true;
client->upsert_from("model/policy/weights", buffer, size, config);

// Single upsert (copy semantics)
client->upsert("model/policy/weights", data_span, config);

// Batch upsert (zero-copy, pipeline — parallel RDMA)
auto results = client->batch_upsert_from(keys, buffers, sizes, config);

Python

# Raw bytes
config = ReplicateConfig(with_hard_pin=True)
store.upsert("model/weights", weight_bytes, config)
store.upsert_from("model/weights", buffer_ptr, size, config)
store.batch_upsert_from(keys, buffer_ptrs, sizes, config)

# Tensor (default ReplicateConfig)
store.upsert_tensor("model/layer0/weights", tensor)
store.batch_upsert_tensor(keys, tensors_list)

# Tensor with custom replication
config = ReplicateConfig()
config.replica_num = 2
store.upsert_pub_tensor("model/layer0/weights", tensor, config)

# Zero-copy tensor (buffer layout: [TensorMetadata][tensor data])
store.upsert_tensor_from("model/layer0/weights", buffer_ptr, size)
store.batch_upsert_tensor_from(keys, buffer_ptrs, sizes)

5. Configuration

No new configuration parameters are required. Hard pin is a per-object attribute set via ReplicateConfig.with_hard_pin at creation time. All existing configuration parameters retain their current behavior.

6. Migration and Backward Compatibility

  • Wire format: ReplicateConfig gains one new boolean field (with_hard_pin, default false). Old clients sending the old format will default to false — no hard pin, existing behavior preserved.
  • Snapshot format: A new hard_pinned field is appended to the serialized ObjectMetadata. On deserialization, if the field is absent (old snapshot), it defaults to false.
  • Existing APIs: Put, Get, Remove, PutStart/PutEnd/PutRevoke are unchanged. No existing behavior is modified.
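
The snapshot-compatibility rule amounts to a default-on-absence during deserialization. A minimal sketch, assuming a plain dict stands in for the msgpack-decoded ObjectMetadata fields:

```python
def deserialize_metadata(fields: dict) -> dict:
    # Old snapshots lack the new field; default hard_pinned to False.
    return {
        "size": fields["size"],
        "hard_pinned": fields.get("hard_pinned", False),
    }

old_snapshot = {"size": 4096}                       # written by an old master
new_snapshot = {"size": 4096, "hard_pinned": True}  # written after this RFC
assert deserialize_metadata(old_snapshot)["hard_pinned"] is False
assert deserialize_metadata(new_snapshot)["hard_pinned"] is True
```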
