Skip to content

[Store] Add hard pin mechanism for eviction-protected objects#1728

Merged
stmatengss merged 5 commits intokvcache-ai:mainfrom
he-yufeng:feat/hard-pin
Mar 25, 2026
Merged

[Store] Add hard pin mechanism for eviction-protected objects#1728
stmatengss merged 5 commits intokvcache-ai:mainfrom
he-yufeng:feat/hard-pin

Conversation

@he-yufeng
Copy link
Copy Markdown
Contributor

RFC: #1645
Related: #1621

Implements the Hard Pin feature (P0 scope from the RFC) — objects created with ReplicateConfig.with_hard_pin = true are guaranteed to never be evicted by BatchEvict().

This is the companion to PR #1662 which covers Upsert. Hard Pin is independent and doesn't conflict with those changes.

What changed

replica.hReplicateConfig gains a with_hard_pin bool (default false).

master_service.hObjectMetadata gets a hard_pinned field set at creation time. Added IsHardPinned() accessor. MetadataAccessorRW::Create forwards the flag.

master_service.cpp:

  • PutStart: passes config.with_hard_pin through to ObjectMetadata constructor
  • BatchEvict: hard-pinned objects are skipped in all 4 eviction loops (first pass candidate collection, first pass eviction, second pass A, second pass B)
  • SerializeMetadata: appends hard_pinned bool to the serialized array
  • DeserializeMetadata: reads hard_pinned if present, defaults to false for old snapshots

Tests — 3 new test cases:

  • HardPinObjectNotEvicted: fills memory, waits for lease expiry + eviction, confirms hard-pinned object survives. Also verifies explicit Remove() still works.
  • HardPinWithSoftPinEvictionOrder: mixed hard pin + soft pin + normal objects under pressure — hard-pinned object always survives.
  • HardPinDefaultIsFalse: sanity check that the default path is unaffected.

Design decisions

  • Boolean, not timeout — model weights need indefinite protection. A timeout-based approach is the exact anti-pattern we're replacing (per RFC discussion).
  • Backward compatible serialization — old snapshots without the hard_pinned field deserialize correctly (defaults to false). New snapshots include it as an extra array element.
  • Remove still works — hard pin only protects against automatic eviction. Explicit Remove() is unaffected, matching the RFC's design intent.

Module

  • Mooncake Store (mooncake-store)

Type of Change

  • New feature

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Objects created with ReplicateConfig.with_hard_pin=true are never
evicted by the eviction policy, providing guaranteed persistence
for model weights in RL and model management workloads.

Changes:
- ReplicateConfig: add with_hard_pin field (default false)
- ObjectMetadata: add hard_pinned boolean, set at creation via PutStart
- BatchEvict: skip hard-pinned objects in all eviction passes
- Serialization: persist hard_pinned in snapshots (backward compatible
  with old format that lacks the field)
- Tests: verify hard-pinned objects survive eviction, coexist with
  soft pin, and can still be explicitly removed

Refs: kvcache-ai#1621, kvcache-ai#1645
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a critical 'hard pin' feature for the Mooncake Store, allowing specific objects to be permanently protected from automatic eviction. This enhancement ensures the stability of essential data, such as model weights, by preventing their removal during memory pressure, thereby improving the reliability and predictability of the store's behavior. The changes are carefully integrated to maintain compatibility and are thoroughly tested.

Highlights

  • Hard Pin Mechanism: Introduced a 'hard pin' mechanism that prevents objects from being automatically evicted by the BatchEvict() process. Objects created with ReplicateConfig.with_hard_pin = true are guaranteed to persist in memory.
  • Object Metadata Extension: The ObjectMetadata structure now includes a hard_pinned boolean field and an IsHardPinned() accessor to track the pin status of an object.
  • Eviction Logic Update: Modified the BatchEvict function to explicitly skip any objects marked as hard-pinned across all eviction passes, ensuring they are never removed due to memory pressure.
  • Serialization Backward Compatibility: Updated metadata serialization and deserialization to include the new hard_pinned flag while maintaining backward compatibility with older snapshots that do not contain this field.
  • Comprehensive Testing: Added three new test cases to validate the hard pin functionality, covering scenarios like hard-pinned objects surviving eviction, interaction with soft-pinned objects, and default behavior.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request successfully implements the hard-pin mechanism for eviction-protected objects, aligning with the RFC's P0 scope. The changes are well-integrated across the ReplicateConfig structure, ObjectMetadata class, and the MasterService's eviction and serialization logic. Backward compatibility for deserialization is correctly handled, and the addition of comprehensive test cases ensures the new functionality behaves as expected. The overall implementation is robust and addresses the design goals effectively.

Comment on lines +3512 to +3514
size_t array_size = 8; // client_id, put_start_time, size, lease_timeout,
// has_soft_pin_timeout, soft_pin_timeout,
// replicas_count + hard_pinned
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment for array_size calculation could be made clearer. Currently, it lists replicas_count + hard_pinned, which might be misinterpreted as hard_pinned being added to replicas_count. It would be more precise to list hard_pinned as a separate fixed field, making it clear that 8 is the base count of fixed fields before adding the variable replicas_count.

Suggested change
size_t array_size = 8; // client_id, put_start_time, size, lease_timeout,
// has_soft_pin_timeout, soft_pin_timeout,
// replicas_count + hard_pinned
size_t array_size = 8; // client_id, put_start_time, size, lease_timeout,
// has_soft_pin_timeout, soft_pin_timeout,
// replicas_count, hard_pinned

GetReplicaList grants a new lease, so non-force Remove hits
OBJECT_HAS_LEASE. force=true is the right way to test explicit
removal of hard-pinned objects.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 23, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 96.03175% with 5 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
mooncake-store/include/master_service.h 50.00% 2 Missing ⚠️
mooncake-store/src/master_service.cpp 89.47% 2 Missing ⚠️
mooncake-store/include/replica.h 0.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Hard Pin support to Mooncake Store so objects created with ReplicateConfig.with_hard_pin=true are never evicted by MasterService::BatchEvict(), and persists this state via snapshot serialization.

Changes:

  • Extend ReplicateConfig and ObjectMetadata with a hard-pin boolean, and thread it through object creation (PutStart / MetadataAccessorRW::Create).
  • Update BatchEvict() to skip hard-pinned objects across all eviction passes.
  • Persist/restore hard_pinned in msgpack metadata serialization, and add unit tests covering eviction behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
mooncake-store/include/replica.h Adds with_hard_pin to replication config and logs it in operator<<.
mooncake-store/include/master_service.h Stores hard-pin state in ObjectMetadata and exposes IsHardPinned().
mooncake-store/src/master_service.cpp Threads hard-pin through PutStart, skips hard-pinned objects during eviction, and appends hard-pin to metadata serialization.
mooncake-store/tests/master_service_test.cpp Adds tests intended to validate hard-pin eviction behavior and defaults.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 3642 to 3649
// Create ObjectMetadata instance
bool enable_soft_pin = has_soft_pin_timeout;
auto metadata = std::make_unique<ObjectMetadata>(
client_id,
std::chrono::system_clock::time_point(
std::chrono::milliseconds(put_start_time_timestamp)),
size, std::move(replicas), enable_soft_pin);
size, std::move(replicas), enable_soft_pin, is_hard_pinned);
metadata->lease_timeout = std::chrono::system_clock::time_point(
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeserializeMetadata() correctly parses the optional hard_pinned flag and passes it into the temporary ObjectMetadata instance, but the restore path in DeserializeShard() reconstructs shard metadata by emplacing a new ObjectMetadata from metadata_ptr without forwarding hard_pinned, so hard-pinned objects will come back as not hard pinned after restoring a snapshot. Please ensure the shard-level reconstruction preserves hard_pinned (either pass it into the constructor or set the field under lock after emplace).

Copilot uses AI. Check for mistakes.
}

// Serialize hard_pinned flag (appended for backward compatibility)
packer.pack(metadata.hard_pinned);
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SerializeMetadata() packs metadata.hard_pinned directly, but hard_pinned is annotated GUARDED_BY(lock). This introduces a new unguarded access to a guarded field. Either read it via IsHardPinned() / SpinLocker (ideally snapshot the guarded fields once under the lock and then pack), or make hard_pinned immutable (e.g., const bool) so it doesn’t need the per-object lock.

Suggested change
packer.pack(metadata.hard_pinned);
packer.pack(metadata.IsHardPinned());

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use IsHardPinned instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the comment, and switched to IsHardPinned() as you suggested.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — also caught another direct access in DeserializeShard (L3495) and fixed that too.

Comment on lines +4275 to +4322
TEST_F(MasterServiceTest, HardPinObjectNotEvicted) {
// Hard-pinned objects must survive eviction under memory pressure,
// even after lease expires and all non-pinned objects are gone.
const uint64_t kv_lease_ttl = 200;
auto service_config = MasterServiceConfig::builder()
.set_default_kv_lease_ttl(kv_lease_ttl)
.build();
std::unique_ptr<MasterService> service_(new MasterService(service_config));
const UUID client_id = generate_uuid();

constexpr size_t buffer = 0x300000000;
constexpr size_t segment_size = 1024 * 1024 * 16;
constexpr size_t value_size = 1024 * 1024;
[[maybe_unused]] const auto context =
PrepareSimpleSegment(*service_, "test_segment", buffer, segment_size);

// Put a hard-pinned object
{
ReplicateConfig config;
config.replica_num = 1;
config.with_hard_pin = true;
auto result =
service_->PutStart(client_id, "pinned_model", value_size, config);
ASSERT_TRUE(result.has_value());
ASSERT_TRUE(
service_->PutEnd(client_id, "pinned_model", ReplicaType::MEMORY)
.has_value());
}

// Fill remaining space with normal objects to trigger eviction
for (int i = 0; i < 20; i++) {
std::string key = "filler_" + std::to_string(i);
ReplicateConfig config;
config.replica_num = 1;
auto result = service_->PutStart(client_id, key, value_size, config);
if (result.has_value()) {
service_->PutEnd(client_id, key, ReplicaType::MEMORY);
}
}

// Wait for leases to expire and eviction to kick in
std::this_thread::sleep_for(std::chrono::milliseconds(kv_lease_ttl + 500));

// Hard-pinned object must still be there
auto get_result = service_->GetReplicaList("pinned_model");
ASSERT_TRUE(get_result.has_value())
<< "Hard-pinned object was evicted, but it should never be";

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn’t currently demonstrate “lease expiry + eviction under memory pressure”: PutEnd() sets the initial lease to 0 (expired immediately), and the test never grants a lease (via GetReplicaList/ExistKey) before sleeping. Also, the test never asserts that eviction actually happened (e.g., that at least one non-pinned filler key was evicted). Consider explicitly granting leases to the non-pinned objects, waiting for expiry, and asserting that some filler keys are gone while the hard-pinned key remains.

Copilot uses AI. Check for mistakes.
Comment on lines +4333 to +4399
TEST_F(MasterServiceTest, HardPinWithSoftPinEvictionOrder) {
// Verify eviction priority: non-pinned first, then soft-pinned,
// and hard-pinned objects are never evicted even under extreme pressure.
const uint64_t kv_lease_ttl = 200;
const uint64_t kv_soft_pin_ttl = 10000;
const bool allow_evict_soft_pinned_objects = true;
auto service_config = MasterServiceConfig::builder()
.set_default_kv_lease_ttl(kv_lease_ttl)
.set_default_kv_soft_pin_ttl(kv_soft_pin_ttl)
.set_allow_evict_soft_pinned_objects(
allow_evict_soft_pinned_objects)
.set_eviction_ratio(0.5)
.build();
std::unique_ptr<MasterService> service_(new MasterService(service_config));
const UUID client_id = generate_uuid();

constexpr size_t buffer = 0x300000000;
constexpr size_t segment_size = 1024 * 1024 * 16;
constexpr size_t value_size = 1024 * 1024;
[[maybe_unused]] const auto context =
PrepareSimpleSegment(*service_, "test_segment", buffer, segment_size);

// Put a hard-pinned object
{
ReplicateConfig config;
config.replica_num = 1;
config.with_hard_pin = true;
ASSERT_TRUE(
service_->PutStart(client_id, "hard_pinned", value_size, config)
.has_value());
ASSERT_TRUE(
service_->PutEnd(client_id, "hard_pinned", ReplicaType::MEMORY)
.has_value());
}

// Put a soft-pinned object
{
ReplicateConfig config;
config.replica_num = 1;
config.with_soft_pin = true;
ASSERT_TRUE(
service_->PutStart(client_id, "soft_pinned", value_size, config)
.has_value());
ASSERT_TRUE(
service_->PutEnd(client_id, "soft_pinned", ReplicaType::MEMORY)
.has_value());
}

// Fill the rest
for (int i = 0; i < 20; i++) {
std::string key = "normal_" + std::to_string(i);
ReplicateConfig config;
config.replica_num = 1;
auto result = service_->PutStart(client_id, key, value_size, config);
if (result.has_value()) {
service_->PutEnd(client_id, key, ReplicaType::MEMORY);
}
}

// Let leases expire, trigger eviction
std::this_thread::sleep_for(std::chrono::milliseconds(kv_lease_ttl + 500));

// Hard-pinned always survives
ASSERT_TRUE(service_->GetReplicaList("hard_pinned").has_value())
<< "Hard-pinned object was evicted";

std::this_thread::sleep_for(std::chrono::milliseconds(kv_lease_ttl));
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the name/comment (“eviction priority”), this test only asserts that the hard-pinned key exists; it doesn’t verify that eviction occurred or that non-pinned keys are evicted before soft-pinned keys (or that soft-pinned keys are evicted when allowed). To make the test meaningful, add assertions that some normal_* keys disappear after eviction, and (if you want to validate ordering) that soft_pinned survives the first-pass eviction and is only evicted in the soft-pin-allowed pass under pressure.

Copilot uses AI. Check for mistakes.
Comment on lines +692 to +695
SpinLocker locker(&lock);
return hard_pinned;
}

Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard_pinned is set only at creation time and never mutated elsewhere (no assignments beyond the constructor). Keeping it GUARDED_BY(lock) forces IsHardPinned() to take the spinlock on every eviction scan. Consider making hard_pinned immutable (e.g., const bool) and not guarded, so eviction and serialization can read it without lock overhead and without thread-safety annotation violations.

Suggested change
SpinLocker locker(&lock);
return hard_pinned;
}
return hard_pinned;
}

Copilot uses AI. Check for mistakes.
- hard_pinned is immutable after creation, no need for GUARDED_BY(lock)
- DeserializeShard was not forwarding hard_pinned to the new ObjectMetadata
@stmatengss
Copy link
Copy Markdown
Collaborator

Please update docs as well and fix master_service.cpp L3565. others LGTM

- Replace direct hard_pinned member access with IsHardPinned()
  in SerializeMetadata and DeserializeShard
- Remove redundant comment in SerializeMetadata
- Add Hard Pin section to mooncake-store.md docs
- Update ReplicateConfig struct docs with with_hard_pin field
@he-yufeng he-yufeng requested a review from ShangmingCai as a code owner March 25, 2026 02:13
@he-yufeng
Copy link
Copy Markdown
Contributor Author

Updated — added Hard Pin section to mooncake-store.md docs (after Soft Pin) and updated both ReplicateConfig struct references. Also fixed L3565 and another direct access in DeserializeShard.


size_t array_size = 7; // size, lease_timeout, has_soft_pin_timeout,
// soft_pin_timeout, replicas_count
size_t array_size = 8; // client_id, put_start_time, size, lease_timeout,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid hardcoding here? Use size_t array_size = sizeof(struct xxx) + sizeof(struct xxx)

@stmatengss stmatengss merged commit 4c19802 into kvcache-ai:main Mar 25, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants