[RFC]: Hard Pin and Upsert for Mooncake Store #1645

@00fish0

Description

1. Summary

This RFC proposes two independent, composable features to enhance Mooncake Store for model weight storage workloads:

  1. Hard Pin — A mechanism that guarantees an object will never be evicted, until explicitly unpinned or removed. Hard pin is set at creation time via ReplicateConfig.
  2. Upsert — An update interface that replaces an existing object's data in-place (when size is unchanged) or via delete-and-reallocate (when size differs). Concurrent in-progress writes on the same key are preempted.

Both features are designed as orthogonal primitives. They can be used independently or together, preserving flexibility for reinforcement learning, model management, and hybrid KVCache + model weight workloads.

The rest of this document first explains the design of the two features, then summarizes implementation impact, client APIs, configuration, compatibility, and testing.

2. Design

2.1 Design Principles

  1. Minimal data structure changes — Reuse existing mechanisms (replicas_, ReplicaStatus, processing_keys, DiscardedReplicas) wherever possible.
  2. Predictable reader behavior — During in-place upsert (Case B), the key temporarily has no readable replicas (all replicas are PROCESSING). Readers receive REPLICA_IS_NOT_READY, consistent with a key mid-PutStart. This is acceptable for model weight workloads where updates are coordinated and inference is paused during checkpoint loading. See §2.3.8 for a detailed discussion of this trade-off.
  3. Memory efficiency — Avoid 2x memory overhead by reusing existing buffers (in-place) or freeing before reallocating, instead of COW.
  4. Failure safety — Crashed or failed operations must be automatically cleaned up by existing timeout mechanisms.

2.2 Hard Pin

2.2.1 Data Model

Add a hard_pinned boolean to ObjectMetadata:

// master_service.h — ObjectMetadata (alongside existing lease_timeout and soft_pin_timeout)
mutable bool hard_pinned GUARDED_BY(lock){false};

Why a boolean instead of a timeout?

  • Model weights need indefinite protection. Using a timeout to approximate "forever" is the exact anti-pattern we're replacing.
  • A boolean is cheaper to check (no clock comparison) and semantically unambiguous.

2.2.2 Setting Hard Pin

Hard pin is set at creation time via ReplicateConfig:

// replica.h
struct ReplicateConfig {
    size_t replica_num{1};
    bool with_soft_pin{false};
    bool with_hard_pin{false};  // NEW
    // ...
};

When with_hard_pin=true, the object is hard-pinned from creation. This is the common path for model weight storage. Hard pin state is preserved across Upsert operations.

To unpin a hard-pinned object, the user must Remove it and re-create it without with_hard_pin.
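
The create-then-recreate lifecycle can be sketched with a runnable toy model (hypothetical names — TinyStore is not the real Mooncake API; it only models the rule that hard pin is captured once, at creation time):

```python
from dataclasses import dataclass

@dataclass
class ReplicateConfig:
    replica_num: int = 1
    with_soft_pin: bool = False
    with_hard_pin: bool = False  # NEW field from this RFC

class TinyStore:
    def __init__(self):
        self._meta = {}  # key -> {"hard_pinned": bool, "data": bytes}

    def put(self, key, data, config=ReplicateConfig()):
        # Hard pin is captured once, at creation time.
        self._meta[key] = {"hard_pinned": config.with_hard_pin, "data": data}

    def remove(self, key):
        self._meta.pop(key, None)

    def is_hard_pinned(self, key):
        return self._meta[key]["hard_pinned"]

store = TinyStore()
store.put("model/weights", b"v1", ReplicateConfig(with_hard_pin=True))
assert store.is_hard_pinned("model/weights")

# To unpin: Remove, then re-create without with_hard_pin.
store.remove("model/weights")
store.put("model/weights", b"v1")
assert not store.is_hard_pinned("model/weights")
```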

2.2.3 Eviction Behavior

The eviction loop in BatchEvict() currently has two passes:

| Pass | What it evicts |
| --- | --- |
| First | Non-soft-pinned objects with expired leases |
| Second | Soft-pinned objects (if allow_evict_soft_pinned_objects_ is true) |

With hard pin, the change is minimal — add a single check at the top of the iteration:

// master_service.cpp — BatchEvict(), in the per-object loop (3 locations)
if (it->second.IsHardPinned()) {
    ++it;
    continue;  // Skip — hard-pinned objects are NEVER evicted
}
// ... existing soft pin / lease checks unchanged

The eviction priority becomes:

Eviction Priority (first evicted → last evicted):
1. Non-pinned, lease-expired objects     ← evicted first
2. Soft-pinned objects (if allowed)      ← evicted under pressure
3. Hard-pinned objects                   ← NEVER evicted
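
The priority order above can be modeled as a runnable sketch (hypothetical and heavily simplified — no sharding, clocks, or batching; the point is only that the hard-pin check short-circuits before any lease or soft-pin logic):

```python
def evict_order(objects, allow_evict_soft_pinned=True):
    """objects: dict key -> dict(hard_pinned, soft_pinned, lease_expired).
    Returns keys in the order they become eviction candidates."""
    candidates = []
    # First pass: non-soft-pinned objects with expired leases.
    for k, o in objects.items():
        if o["hard_pinned"]:
            continue  # NEVER evicted
        if not o["soft_pinned"] and o["lease_expired"]:
            candidates.append(k)
    # Second pass: soft-pinned objects, only under memory pressure.
    if allow_evict_soft_pinned:
        for k, o in objects.items():
            if o["hard_pinned"]:
                continue  # still skipped
            if o["soft_pinned"] and o["lease_expired"]:
                candidates.append(k)
    return candidates

objs = {
    "kv/1":      dict(hard_pinned=False, soft_pinned=False, lease_expired=True),
    "hot/1":     dict(hard_pinned=False, soft_pinned=True,  lease_expired=True),
    "weights/1": dict(hard_pinned=True,  soft_pinned=False, lease_expired=True),
}
assert evict_order(objs) == ["kv/1", "hot/1"]             # weights never appear
assert evict_order(objs, allow_evict_soft_pinned=False) == ["kv/1"]
```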

2.2.4 HA Mode Serialization

The MetadataSerializer must persist the hard_pinned field so that hard pin state survives master restarts. This is a single boolean field appended to the existing msgpack serialization.

2.3 Upsert (In-Place Update / Delete-and-Reallocate)

2.3.1 Core Insight

Instead of a COW approach that requires 2x memory (old + new data coexist), Upsert takes a simpler, more memory-efficient approach:

  • When size is unchanged: reuse the existing memory buffers — mark replicas as PROCESSING, let the client overwrite the data in-place, then mark them COMPLETE again.
  • When size differs: free the old replicas first, then allocate new ones at the required size.
  • When the key has an in-progress write (Put or Upsert): preempt the in-progress operation — move its PROCESSING replicas to DiscardedReplicas for delayed release, then proceed with the new Upsert.

This design reuses existing mechanisms with minimal changes:

| Existing Mechanism | Role in Upsert |
| --- | --- |
| ReplicaStatus state machine | COMPLETE → PROCESSING → COMPLETE lifecycle for in-place updates |
| processing_keys set | Tracks keys with active Upsert/Put writes |
| DiscardedReplicas + TTL | Safe delayed release of preempted PROCESSING replicas |
| DiscardExpiredProcessingReplicas | Auto-cleanup of a failed upsert's replicas on timeout |

Required metadata changes:

  1. ObjectMetadata::put_start_time must become non-const — refreshed to now on in-place UpsertStart, so DiscardExpiredProcessingReplicas times out the current write cycle, not the original creation time.
  2. ObjectMetadata::client_id must become non-const — updated to the new client on in-place UpsertStart.
  3. ObjectMetadata::size remains const — when size changes, the metadata entry is deleted and recreated.
  4. Replica needs new methods:
    • mark_processing() — transitions COMPLETE → PROCESSING for in-place updates.
    • is_processing() / fn_is_processing() — predicate for filtering PROCESSING replicas during preemption.
    • is_busy() / fn_is_busy() — checks refcnt > 0, used as a safety gate before allowing Case B/C.

New error code:

  • OBJECT_REPLICA_BUSY (-714) — returned when UpsertStart finds replicas with non-zero refcnt (active RDMA reads in flight). The client should retry after readers finish.

No changes are needed to eviction logic or any other existing mechanism.
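
The new Replica methods listed above can be sketched as a runnable toy class (hypothetical, simplified to a bare status string and refcnt — the real Replica also tracks buffers, type, and descriptors):

```python
class Replica:
    def __init__(self):
        self.status = "PROCESSING"  # as after PutStart
        self.refcnt = 0             # active RDMA readers

    def mark_complete(self):
        self.status = "COMPLETE"

    def mark_processing(self):
        # NEW: COMPLETE -> PROCESSING, used by in-place upsert (Case B)
        assert self.status == "COMPLETE"
        self.status = "PROCESSING"

    def is_processing(self):
        return self.status == "PROCESSING"

    def is_busy(self):
        # Safety gate before Case B/C: readers still in flight
        return self.refcnt > 0

r = Replica()
r.mark_complete()            # PutEnd
assert not r.is_processing()
r.mark_processing()          # in-place UpsertStart
assert r.is_processing() and not r.is_busy()
r.refcnt += 1                # a reader pins the replica
assert r.is_busy()           # UpsertStart would return OBJECT_REPLICA_BUSY
```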

2.3.2 API Design

Master Service API

Following the existing PutStart/PutEnd/PutRevoke pattern:

// Single-key operations
auto UpsertStart(const UUID& client_id, const std::string& key,
                 const uint64_t slice_length, const ReplicateConfig& config)
    -> tl::expected<std::vector<Replica::Descriptor>, ErrorCode>;

auto UpsertEnd(const UUID& client_id, const std::string& key,
               ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;

auto UpsertRevoke(const UUID& client_id, const std::string& key,
                  ReplicaType replica_type)
    -> tl::expected<void, ErrorCode>;

// Batch variants
std::vector<tl::expected<std::vector<Replica::Descriptor>, ErrorCode>>
BatchUpsertStart(const UUID& client_id,
                 const std::vector<std::string>& keys,
                 const std::vector<uint64_t>& slice_lengths,
                 const ReplicateConfig& config);

std::vector<tl::expected<void, ErrorCode>>
BatchUpsertEnd(const UUID& client_id, const std::vector<std::string>& keys);

std::vector<tl::expected<void, ErrorCode>>
BatchUpsertRevoke(const UUID& client_id, const std::vector<std::string>& keys);

UpsertEnd and UpsertRevoke delegate to PutEnd and PutRevoke — no special logic is needed. The batch variants follow the same delegation pattern (BatchUpsertEnd → BatchPutEnd, etc.).

Client Service API (C++)

// Single-key: orchestrates UpsertStart → TransferWrite → UpsertEnd/Revoke
tl::expected<void, ErrorCode> Upsert(const ObjectKey& key,
                                     std::vector<Slice>& slices,
                                     const ReplicateConfig& config);

// Batch: pipeline — BatchUpsertStart → parallel RDMA → BatchUpsertEnd/Revoke
std::vector<tl::expected<void, ErrorCode>> BatchUpsert(
    const std::vector<ObjectKey>& keys,
    std::vector<std::vector<Slice>>& batched_slices,
    const ReplicateConfig& config);

PyClient / RealClient / DummyClient API

The upsert API surface mirrors the existing put family. Each put variant has an upsert counterpart:

| Put variant | Upsert counterpart | Description |
| --- | --- | --- |
| put(key, span) | upsert(key, span) | Single key, copy semantics |
| put_from(key, buffer, size) | upsert_from(key, buffer, size) | Single key, zero-copy from registered buffer |
| put_parts(key, spans) | upsert_parts(key, spans) | Single key, multi-part concatenation |
| put_batch(keys, spans) | upsert_batch(keys, spans) | Batch, copy semantics |
| batch_put_from(keys, buffers, sizes) | batch_upsert_from(keys, buffers, sizes) | Batch, zero-copy |

All variants accept an optional ReplicateConfig parameter.

Python Binding API

Raw bytes — mirrors put / put_from / put_batch / batch_put_from / put_parts:

store.upsert(key, value, config=None)
store.upsert_from(key, buffer_ptr, size, config=None)
store.upsert_parts(key, *parts, config=None)
store.upsert_batch(keys, values, config=None)
store.batch_upsert_from(keys, buffer_ptrs, sizes, config=None)

Tensor — mirrors put_tensor / put_tensor_from / batch_put_tensor / batch_put_tensor_from:

store.upsert_tensor(key, tensor)
store.upsert_tensor_from(key, buffer_ptr, size)
store.batch_upsert_tensor(keys, tensors_list)
store.batch_upsert_tensor_from(keys, buffer_ptrs, sizes)

Tensor with configurable replication (upsert-only, no put counterpart):

store.upsert_pub_tensor(key, tensor, config)
store.batch_upsert_pub_tensor(keys, tensors_list, config)

Note: Tensor Parallelism (TP) variants (upsert_tensor_with_tp, etc.) are not included in V1. They can be added later following the same pattern as put_tensor_with_tp.

2.3.3 Client Call Path

The upsert call path mirrors Put, with 6 layers from Python down to the master service:

Python Binding (store_py.cpp)
  │  upsert_tensor() / upsert_from() / batch_upsert_from() / ...
  │  Tensor methods: extract metadata + data → call upsert_parts / upsert_from
  │  Raw bytes methods: extract buffer info → call PyClient interface
  ▼
PyClient interface (pyclient.h) ── pure virtual base class
  ├─ RealClient (real_client.cpp) ── production path
  │    upsert_internal() / upsert_from_internal() / ...
  │    Allocates RDMA-registered buffer, copies data, splits into Slices
  │    Calls Client::Upsert(key, slices, config)
  │
  └─ DummyClient (dummy_client.cpp) ── test/single-node path
       IPC proxy → calls RealClient::upsert_dummy_helper() on server side
  ▼
Client Service (client_service.cpp)
  │  Client::Upsert() ── orchestrates the three-phase protocol:
  │    1. MasterClient::UpsertStart()
  │    2. TransferWrite() ── RDMA write to allocated buffers
  │    3. MasterClient::UpsertEnd() or UpsertRevoke() on failure
  │
  │  Client::BatchUpsert() ── pipeline (see §2.3.4):
  │    1. StartBatchUpsert()  ── one BatchUpsertStart RPC
  │    2. SubmitTransfers()   ── async RDMA submit for all keys
  │    3. WaitForTransfers()  ── wait for all RDMA to complete (parallel I/O)
  │    4. FinalizeBatchUpsert() ── BatchUpsertEnd / BatchUpsertRevoke RPCs
  ▼
Master Client (master_client.cpp) ── RPC client stub
  │  Serializes parameters, sends RPC to master service
  │  UpsertStart: sums slice_lengths into a single total_length before sending
  ▼
RPC Service (rpc_service.cpp) ── RPC server handler
  │  Deserializes, delegates to MasterService, records metrics
  ▼
Master Service (master_service.cpp) ── core metadata logic
     UpsertStart: three-way dispatch (Case A / B / C)
     UpsertEnd: delegates to PutEnd
     UpsertRevoke: delegates to PutRevoke

The middle layers (Client Service transfer phase, Master Client, RPC Service) are structurally identical to the Put path. The only new logic is in MasterService::UpsertStart (§2.3.5) and the BatchUpsert pipeline (§2.3.4).

2.3.4 BatchUpsert Pipeline

BatchUpsert uses the same pipeline architecture as BatchPut, reusing its transfer infrastructure. The pipeline replaces a serial loop (3N RPCs + N serial RDMA writes) with batched RPCs and parallel RDMA transfers:

Serial (old):                          Pipeline (new):
  for each key:                          1. BatchUpsertStart  ── 1 RPC
    UpsertStart   ── 1 RPC               2. SubmitTransfers   ── N async RDMA
    TransferWrite ── 1 RDMA (blocking)   3. WaitForTransfers  ── parallel wait
    UpsertEnd     ── 1 RPC               4. FinalizeBatchUpsert ── 1-2 RPCs
  Total: 3N RPCs, N serial RDMA          Total: 2-3 RPCs, N parallel RDMA

The implementation adds two thin wrapper methods (StartBatchUpsert, FinalizeBatchUpsert) that call the upsert-specific RPCs. The data transfer phase (SubmitTransfers, WaitForTransfers) and operation state tracking (PutOperation class) are reused from BatchPut without modification — they are operation-agnostic.

Note: BatchUpsert does not support prefer_alloc_in_same_node (returns INVALID_PARAMS), so the BatchPutWhenPreferSameNode code path is not needed.

2.3.5 Detailed Flow

flowchart TD
    Start["UpsertStart(client_id, key, slice_length, config)"]

    Start --> Validate{"Validate params:<br/>key non-empty, slice_length > 0,<br/>replica_num > 0, cachelib slice limit"}
    Validate -- "invalid" --> ErrParams["INVALID_PARAMS"]
    Validate -- "ok" --> Lock

    Lock["Lock: snapshot_mutex_(shared) + shard(exclusive)<br/>it = shard->metadata.find(key)"]
    Lock --> Found{"key found?"}

    Found -- "no" --> CaseA
    Found -- "yes" --> Stale{"CleanupStaleHandles:<br/>all handles stale?"}

    Stale -- "yes" --> EraseStale["erase metadata & processing_keys"]
    EraseStale --> CaseA
    Stale -- "no" --> ChkRepl{"replication_tasks<br/>contains key?"}

    ChkRepl -- "yes" --> ErrRepl["OBJECT_HAS_REPLICATION_TASK"]
    ChkRepl -- "no" --> ChkOffload{"offloading_tasks<br/>contains key?"}
    ChkOffload -- "yes" --> ErrRepl
    ChkOffload -- "no" --> ChkProc{"key in<br/>processing_keys?"}

    ChkProc -- "no" --> ChkRefcnt
    ChkProc -- "yes" --> Preempt["Preempt ongoing Put/Upsert:<br/>PopReplicas(PROCESSING) → discarded<br/>(delayed free after timeout)<br/>erase from processing_keys"]
    Preempt --> HasComplete{"COMPLETE replicas<br/>remain?"}
    HasComplete -- "no" --> EraseMeta["erase metadata"]
    EraseMeta --> CaseA
    HasComplete -- "yes" --> ChkRefcnt

    ChkRefcnt{"any replica<br/>refcnt > 0?"}
    ChkRefcnt -- "yes" --> ErrBusy["OBJECT_REPLICA_BUSY"]
    ChkRefcnt -- "no" --> SizeEq{"metadata.size<br/>== total_length?"}

    SizeEq -- "equal" --> CaseB
    SizeEq -- "different" --> CaseC

    CaseA["Case A: key does not exist<br/>AllocateAndInsertMetadata()<br/>— allocate new buffers, create metadata"]
    CaseB["Case B: same size — in-place update<br/>1. update client_id & put_start_time<br/>2. reconcile soft_pin<br/>3. mark_processing() all COMPLETE replicas<br/>4. return existing descriptors (same addrs)"]
    CaseC["Case C: different size — reallocate<br/>1. PopReplicas() → discarded (delayed free)<br/>2. erase old metadata<br/>3. AllocateAndInsertMetadata() — new buffers"]

    CaseA --> OK
    CaseB --> OK
    CaseC --> OK
    OK["return vector of Replica::Descriptor<br/>→ client performs RDMA write"]

    style ErrParams fill:#ddd,stroke:#888
    style ErrRepl fill:#ddd,stroke:#888
    style ErrBusy fill:#ddd,stroke:#888

UpsertStart(client_id, key, slice_length, config):

Step 0: Cleanup stale handles
    IF key exists AND all handles are stale (CleanupStaleHandles returns true):
        → Erase processing_keys entry and metadata
        → Treat as non-existent (fall through to Case A)

Step 1: Safety checks and preemption (only if key exists)
    1a. IF key has active replication_tasks (Copy/Move in progress):
        → Return OBJECT_HAS_REPLICATION_TASK
    1b. IF key has active offloading_tasks:
        → Return OBJECT_HAS_REPLICATION_TASK
    1c. IF key is in processing_keys (another Put/Upsert in progress):
        → Pop all PROCESSING replicas → DiscardedReplicas (TTL for delayed release)
        → Remove key from processing_keys
        → If no COMPLETE replicas remain, erase metadata entirely
           and fall through to Case A

Step 2: Dispatch by key existence and size

    Case A — Key does not exist (or erased by Step 0/1c):
        → Equivalent to PutStart (allocate replicas, create ObjectMetadata)
        → Return new replica descriptors

    (Safety gate before Case B/C)
    IF any replica has non-zero refcnt (is_busy):
        → Return OBJECT_REPLICA_BUSY

    Case B — Key exists, size unchanged (in-place update):
        → Mark all COMPLETE replicas → PROCESSING (via mark_processing())
        → Refresh put_start_time = now
        → Update client_id = new client_id
        → Insert key into processing_keys
        → Return existing replica descriptors (same memory addresses)

    Case C — Key exists, size changed (delete-and-reallocate):
        → Pop all replicas → DiscardedReplicas (TTL for delayed release,
          same as MoveEnd — needed because GetReplicaList doesn't use refcnt,
          so readers may still hold descriptors without incrementing refcnt)
        → Erase metadata entry
        → Allocate new replicas + create new ObjectMetadata (same as PutStart)
        → Return new replica descriptors

Client transfers data (identical to Put — TransferWrite via RDMA/TCP)
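
The UpsertStart dispatch above can be condensed into a runnable toy model (hypothetical and heavily simplified: one shard, one replica per key, no locking, leases, timeouts, or stale-handle cleanup — only the Case A/B/C dispatch, the busy gate, and preemption):

```python
OBJECT_REPLICA_BUSY = -714

class Master:
    def __init__(self):
        self.meta = {}            # key -> dict(size, client_id, status, refcnt)
        self.processing_keys = set()
        self.discarded = []       # (key, metadata) pairs awaiting TTL release

    def _allocate(self, client_id, key, size):
        # Case A body: AllocateAndInsertMetadata
        self.meta[key] = dict(size=size, client_id=client_id,
                              status="PROCESSING", refcnt=0)
        self.processing_keys.add(key)

    def upsert_start(self, client_id, key, size):
        m = self.meta.get(key)
        # Step 1c: preempt an in-progress Put/Upsert on the same key.
        if m is not None and key in self.processing_keys:
            self.discarded.append((key, m))   # delayed free after TTL
            self.processing_keys.discard(key)
            del self.meta[key]                # no COMPLETE replica in this toy
            m = None
        if m is None:                         # Case A: key does not exist
            self._allocate(client_id, key, size)
            return "CASE_A"
        if m["refcnt"] > 0:                   # busy gate before Case B/C
            return OBJECT_REPLICA_BUSY
        if m["size"] == size:                 # Case B: in-place update
            m["status"] = "PROCESSING"        # mark_processing()
            m["client_id"] = client_id        # put_start_time refresh omitted
            self.processing_keys.add(key)
            return "CASE_B"
        self.discarded.append((key, m))       # Case C: delete-and-reallocate
        del self.meta[key]
        self._allocate(client_id, key, size)
        return "CASE_C"

    def upsert_end(self, client_id, key):
        m = self.meta.get(key)
        if m is None:
            return "OBJECT_NOT_FOUND"
        if m["client_id"] != client_id:
            return "ILLEGAL_CLIENT"           # this writer was preempted
        m["status"] = "COMPLETE"
        self.processing_keys.discard(key)
        return "OK"

ms = Master()
assert ms.upsert_start("c1", "w", 100) == "CASE_A"    # new key
assert ms.upsert_end("c1", "w") == "OK"
assert ms.upsert_start("c2", "w", 100) == "CASE_B"    # same size: in-place
assert ms.upsert_end("c2", "w") == "OK"
assert ms.upsert_start("c3", "w", 200) == "CASE_C"    # size changed: realloc
assert ms.upsert_end("c3", "w") == "OK"
ms.meta["w"]["refcnt"] = 1
assert ms.upsert_start("c4", "w", 200) == OBJECT_REPLICA_BUSY
ms.meta["w"]["refcnt"] = 0
# Preemption: c6 takes over key "x" while c5's write is still in flight.
assert ms.upsert_start("c5", "x", 50) == "CASE_A"
assert ms.upsert_start("c6", "x", 50) == "CASE_A"     # c5 preempted
assert ms.upsert_end("c5", "x") == "ILLEGAL_CLIENT"
assert ms.upsert_end("c6", "x") == "OK"
```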

UpsertEnd(client_id, key, replica_type):

  • Equivalent to PutEnd — mark PROCESSING → COMPLETE, grant lease, remove from processing_keys.

UpsertRevoke(client_id, key, replica_type):

  • Delegates to PutRevoke — erases all PROCESSING replicas of the given type, cleans up metadata if empty.
  • Known limitation for Case B (in-place update): In Case B, UpsertStart reuses existing buffers by marking COMPLETE replicas as PROCESSING. If the client's transfer fails and it calls UpsertRevoke, PutRevoke erases these replicas — the old data is lost. This is a deliberate trade-off: in-place update avoids 2x memory overhead but sacrifices rollback capability. For model weight workloads, this is acceptable because the source data (GPU/CPU memory) is still available and the client can retry the upsert.

2.3.6 Swimlane Diagrams

In-Place Update Path (key exists, size unchanged)
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server

    C->>M: UpsertStart(client_id, key, same_size, config)

    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: Safety checks (replication/offloading tasks, refcnt)
    M->>M: size unchanged → in-place path
    M->>M: mark_processing() on all COMPLETE replicas
    M->>M: Refresh put_start_time, update client_id
    M->>M: Insert key into processing_keys
    Note over M: Release shard write lock
    M-->>C: Return existing replica descriptors (same addresses)

    C->>TE: TransferWrite(descriptors, new_data)
    TE->>R: RDMA Write to same buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success

    C->>M: UpsertEnd(client_id, key, MEMORY)

    Note over M: Same as PutEnd
    M->>M: mark_complete() on PROCESSING replicas
    M->>M: Remove from processing_keys
    M->>M: GrantLease (preserve hard_pin/soft_pin)
    M-->>C: Success
Delete-and-Reallocate Path (key exists, size changed)
sequenceDiagram
    participant C as Client
    participant M as Master Service
    participant TE as Transfer Engine
    participant R as Server

    C->>M: UpsertStart(client_id, key, new_size, config)

    Note over M: Acquire shard write lock
    M->>M: Find existing ObjectMetadata
    M->>M: size differs → delete-and-reallocate path
    M->>M: Check refcnt → reject if busy (OBJECT_REPLICA_BUSY)
    M->>M: Pop all replicas → DiscardedReplicas (TTL delayed release)
    M->>M: Erase old metadata
    M->>M: Allocate new replicas (same as PutStart)
    M->>M: Create new ObjectMetadata
    Note over M: Release shard write lock
    M-->>C: Return new replica descriptors

    C->>TE: TransferWrite(descriptors, data)
    TE->>R: RDMA Write to new buffer
    R-->>TE: Write complete
    TE-->>C: Transfer success

    C->>M: UpsertEnd(client_id, key, MEMORY)
    M->>M: Same as PutEnd
    M-->>C: Success
Preemption Path (key has in-progress write)
sequenceDiagram
    participant A as Client A
    participant B as Client B
    participant M as Master Service

    A->>M: PutStart(key) or UpsertStart(key)
    M-->>A: Return descriptors
    Note over A: Transfer in progress...

    B->>M: UpsertStart(key, ...)

    Note over M: key is in processing_keys
    M->>M: Pop A's PROCESSING replicas → DiscardedReplicas (with TTL)
    M->>M: Remove from processing_keys
    M->>M: Continue with normal Upsert flow (Case A/B/C)
    M-->>B: Return descriptors

    Note over M: A's replicas released after TTL<br/>by ReleaseExpiredDiscardedReplicas()

    A->>M: PutEnd(key) or UpsertEnd(key)
    M-->>A: ILLEGAL_CLIENT or OBJECT_NOT_FOUND

2.3.7 Failure Handling

  • Client crash after UpsertStart (Case A/C): DiscardExpiredProcessingReplicas will clean up after put_start_discard_timeout_sec_, same as for PutStart. The newly allocated replicas are freed, and the metadata entry is removed.
  • Client crash after UpsertStart (Case B — in-place): All replicas are PROCESSING with no COMPLETE replicas remaining. DiscardExpiredProcessingReplicas removes the entire metadata entry — old data is lost. This is the same trade-off described in UpsertRevoke above: no rollback for in-place updates.
  • Transfer failure and UpsertRevoke (Case A/C): PutRevoke erases the newly allocated PROCESSING replicas. For Case C, the old replicas were already moved to DiscardedReplicas and will be freed after TTL. The key effectively disappears until the client retries.
  • Transfer failure and UpsertRevoke (Case B — in-place): PutRevoke erases the reused PROCESSING replicas — old data is lost. The client must retry from source. See "Known Limitations" below.
  • Allocation failure in delete-and-reallocate (Case C): Old replicas have been moved to DiscardedReplicas, but new allocation fails. The key's data is lost. The client can retry.
  • Preempted client calls UpsertEnd/PutEnd: Returns ILLEGAL_CLIENT (if metadata was reused with a new client_id) or OBJECT_NOT_FOUND (if metadata was erased and recreated). The preempted client should treat this as a no-op — the new client has taken over the key.
  • Concurrent Copy/Move/Offload: Returns OBJECT_HAS_REPLICATION_TASK. The client should retry after the replication task completes.
  • Active readers (non-zero refcnt): Returns OBJECT_REPLICA_BUSY. The client should retry after readers release their references. Note: this only blocks Case B/C; if the key is mid-write (in processing_keys), preemption proceeds regardless of refcnt on the PROCESSING replicas.
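
A client-side policy for these cases can be summarized in a small sketch (assumption: next_action is a hypothetical helper, not part of the RFC; error names follow the cases listed above):

```python
def next_action(error_code):
    """Map an Upsert error to a client-side action per the failure cases."""
    retry_later = {"OBJECT_REPLICA_BUSY",          # readers in flight
                   "OBJECT_HAS_REPLICATION_TASK"}  # Copy/Move/Offload running
    taken_over = {"ILLEGAL_CLIENT",                # preempted, metadata reused
                  "OBJECT_NOT_FOUND"}              # preempted, metadata recreated
    if error_code in retry_later:
        return "retry"       # back off, then reissue the upsert
    if error_code in taken_over:
        return "noop"        # another client owns the key now
    return "fail"            # surface to the caller

assert next_action("OBJECT_REPLICA_BUSY") == "retry"
assert next_action("ILLEGAL_CLIENT") == "noop"
assert next_action("INVALID_PARAMS") == "fail"
```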

2.3.8 Read Visibility During Upsert

During an in-place update (Case B), all replicas are marked PROCESSING. GetReplicaList only returns COMPLETE replicas, so the key is temporarily unreadable — readers receive REPLICA_IS_NOT_READY until UpsertEnd marks them COMPLETE again.

This is a deliberate design choice to avoid read-write data races on RDMA buffers. The alternatives and their trade-offs:

| Approach | Read during write | Memory overhead | Complexity |
| --- | --- | --- | --- |
| Current (mark_processing) | Not readable | 1x | Low |
| COW (allocate new, swap on complete) | Reads see old data | 2x peak | Medium |
| Shadow key (write to temp key, rename) | Reads see old data | 2x peak | High (namespace pollution) |

For model weight workloads, the current approach is acceptable because:

  1. Weight updates are coordinated operations — inference is typically paused during checkpoint loading.
  2. The unreadable window is short (duration of RDMA transfer, typically milliseconds to seconds for GB-scale data).
  3. Avoiding 2x memory overhead is critical — model weights can consume the majority of available memory.

For workloads that require read availability during updates, COW semantics should be considered in a future RFC.
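
The visibility window can be illustrated with a one-replica sketch (hypothetical, simplified — the real GetReplicaList filters a replica list rather than a single status field):

```python
class Obj:
    def __init__(self):
        self.status = "COMPLETE"

def get_replica_list(obj):
    # Only COMPLETE replicas are returned to readers.
    if obj.status != "COMPLETE":
        return "REPLICA_IS_NOT_READY"
    return ["descriptor"]

o = Obj()
assert get_replica_list(o) == ["descriptor"]          # before upsert
o.status = "PROCESSING"                               # in-place UpsertStart
assert get_replica_list(o) == "REPLICA_IS_NOT_READY"  # during RDMA write
o.status = "COMPLETE"                                 # UpsertEnd
assert get_replica_list(o) == ["descriptor"]          # readable again
```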

2.3.9 Known Limitations

| Limitation | Description | Impact |
| --- | --- | --- |
| In-place revoke loses old data | Case B UpsertRevoke erases reused replicas. No rollback to previous value. | Client must retry from source (GPU/CPU). Acceptable for model weights. |
| Key unreadable during upsert | Case B marks all replicas PROCESSING; Get returns REPLICA_IS_NOT_READY. | Short window. Acceptable for coordinated weight updates. |
| replica_num changes ignored in Case B | If upsert requests a different replica_num than the existing object, Case B only checks size equality and reuses existing replicas. | The new replica count is silently ignored; client gets fewer (or more) replicas than requested. |
| Metrics reuse put counters | RPC layer increments put_start_requests / put_end_requests for upsert operations. | Cannot distinguish put vs upsert request volume in monitoring. |

4. Client-Side API

The full API surface is defined in §2.3.2. Below are usage examples.

C++

// Single upsert (zero-copy from registered buffer)
ReplicateConfig config;
config.with_hard_pin = true;
client->upsert_from("model/policy/weights", buffer, size, config);

// Single upsert (copy semantics)
client->upsert("model/policy/weights", data_span, config);

// Batch upsert (zero-copy, pipeline — parallel RDMA)
auto results = client->batch_upsert_from(keys, buffers, sizes, config);

Python

# Raw bytes
config = ReplicateConfig(with_hard_pin=True)
store.upsert("model/weights", weight_bytes, config)
store.upsert_from("model/weights", buffer_ptr, size, config)
store.batch_upsert_from(keys, buffer_ptrs, sizes, config)

# Tensor (default ReplicateConfig)
store.upsert_tensor("model/layer0/weights", tensor)
store.batch_upsert_tensor(keys, tensors_list)

# Tensor with custom replication
config = ReplicateConfig()
config.replica_num = 2
store.upsert_pub_tensor("model/layer0/weights", tensor, config)

# Zero-copy tensor (buffer layout: [TensorMetadata][tensor data])
store.upsert_tensor_from("model/layer0/weights", buffer_ptr, size)
store.batch_upsert_tensor_from(keys, buffer_ptrs, sizes)

5. Configuration

No new configuration parameters are required. Hard pin is a per-object attribute set via ReplicateConfig.with_hard_pin at creation time. All existing configuration parameters retain their current behavior.

6. Migration and Backward Compatibility

  • Wire format: ReplicateConfig gains one new boolean field (with_hard_pin, default false). Old clients sending the old format will default to false — no hard pin, existing behavior preserved.
  • Snapshot format: A new hard_pinned field is appended to the serialized ObjectMetadata. On deserialization, if the field is absent (old snapshot), it defaults to false.
  • Existing APIs: Put, Get, Remove, PutStart/PutEnd/PutRevoke are unchanged. No existing behavior is modified.
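
The snapshot-compatibility rule amounts to a default-on-absence during deserialization. A minimal sketch, assuming a plain dict stands in for the msgpack-decoded ObjectMetadata fields:

```python
def deserialize_metadata(fields: dict) -> dict:
    # Old snapshots lack the new field; default hard_pinned to False.
    return {
        "size": fields["size"],
        "hard_pinned": fields.get("hard_pinned", False),
    }

old_snapshot = {"size": 4096}                       # written by an old master
new_snapshot = {"size": 4096, "hard_pinned": True}  # written after this RFC
assert deserialize_metadata(old_snapshot)["hard_pinned"] is False
assert deserialize_metadata(new_snapshot)["hard_pinned"] is True
```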
