Ensure LRU cache is properly bounded. #159

@hoijnet

Description

LRU Archive Cache Byte-Limit Eviction Is Broken (self.current Never Updated)

Summary

The LruArchiveBackend in terminus-store tracks its current byte usage in a current: usize field, but this field is initialized to 0 and never updated when entries are added or evicted. As a result, the byte-limit eviction logic in ensure_enough_cache_space always sees current == 0, assumes the cache is empty, and never triggers byte-based eviction. Because the cache is constructed with the lru crate's unbounded constructor, there is no entry-count limit either, so the cache grows without bound.

Affected Code

File: terminusdb-store/src/storage/archive.rs

Struct definition

#[derive(Clone)]
pub struct LruArchiveBackend<M, D> {
    cache: Arc<tokio::sync::Mutex<LruCache<[u32; 5], CacheEntry>>>,
    limit: usize,        // configured limit in MB
    current: usize,      // intended to track current usage in bytes — never updated
    metadata_origin: M,
    data_origin: D,
}

Initialization — current set to 0, never mutated again

impl<M, D> LruArchiveBackend<M, D> {
    pub fn new(metadata_origin: M, data_origin: D, limit: usize) -> Self {
        let cache = Arc::new(tokio::sync::Mutex::new(LruCache::unbounded()));
        Self {
            cache,
            limit,
            current: 0,      // <-- initialized to 0
            metadata_origin,
            data_origin,
        }
    }
}

The eviction check — always sees current == 0

In get_layer_bytes, when a new entry is resolved and about to be cached:

if ensure_enough_cache_space(
    &mut *cache,
    self.limit_bytes(),
    self.current,          // <-- always 0
    bytes.len(),
) {
    let cached = cache
        .get_mut(&id)
        .expect("layer resolving entry not found in cache");
    *cached = CacheEntry::Resolved(bytes.clone());
    // NOTE: self.current is NOT incremented here
}

The eviction function — never evicts because remaining == limit

fn ensure_enough_cache_space(
    cache: &mut LruCache<[u32; 5], CacheEntry>,
    limit: usize,
    current: usize,     // always 0
    required: usize,
) -> bool {
    if required > limit {
        return false;    // only triggers for single entries > limit
    }

    let remaining = limit - current;  // remaining == limit, always
    if remaining < required {         // never true: required > limit already returned above
        ensure_additional_cache_space(cache, required - remaining);
    }

    true
}

Because current is always 0, remaining always equals limit, so remaining < required is never true (the single-entry required > limit guard above already returns false for that case). The eviction path ensure_additional_cache_space is dead code in practice.
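The unreachability can be shown with a minimal standalone sketch (simplified signatures, not the real types): the same guard-then-compare logic, with the second return value reporting whether the eviction branch would have run.

```rust
// Simplified model of ensure_enough_cache_space: returns
// (would_cache, would_evict) instead of mutating a real cache.
fn ensure_enough_cache_space(limit: usize, current: usize, required: usize) -> (bool, bool) {
    if required > limit {
        return (false, false); // single entry larger than the whole limit
    }
    let remaining = limit - current;
    let would_evict = remaining < required;
    (true, would_evict)
}

fn main() {
    let limit = 512 * 1024 * 1024; // 512 MB, the default
    // With current stuck at 0, no required size in (0, limit] triggers eviction.
    for required in [1, 1024, limit / 2, limit] {
        let (cached, evicted) = ensure_enough_cache_space(limit, 0, required);
        assert!(cached && !evicted);
    }
    // Only an entry larger than the entire limit is rejected outright.
    assert_eq!(ensure_enough_cache_space(limit, 0, limit + 1), (false, false));
}
```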

Impact

  • The LRU cache configured via TERMINUSDB_LRU_CACHE_SIZE (default 512 MB) has no effective byte limit. It will cache every resolved layer archive indefinitely, bounded only by available memory.
  • In deployments with many distinct layers accessed over time, this can lead to unbounded memory growth that is invisible to operators (the configured limit suggests memory is bounded when it is not).
  • The ensure_additional_cache_space function and its LRU eviction logic are effectively dead code and have never been exercised in production.

Root Cause

self.current is never mutated after initialization. It should be:

  • Incremented by bytes.len() after a CacheEntry::Resolved(bytes) is inserted into the cache.
  • Decremented by the byte size of each entry evicted by ensure_additional_cache_space or drop_from_cache.

However, self.current is a plain usize on a Clone-derived struct, and get_layer_bytes takes &self (not &mut self). Updating it requires either:

  1. Changing current to an Arc<AtomicUsize> (preferred — lock-free, no contention with the cache mutex).
  2. Moving current inside the Mutex-guarded cache struct.
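Option 1 can be sketched as follows (field and method names beyond current and used_bytes are illustrative, with the surrounding backend types elided). The key property is that clones of the struct share one counter through the Arc, and &self methods can update it without a mutable borrow:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Stand-in for the backend: only the byte-accounting part is shown.
#[derive(Clone)]
struct ByteCounter {
    current: Arc<AtomicUsize>,
}

impl ByteCounter {
    fn new() -> Self {
        Self { current: Arc::new(AtomicUsize::new(0)) }
    }
    fn add(&self, n: usize) {
        // Relaxed is enough: the counter is advisory, and all structural
        // cache mutations happen under the cache mutex anyway.
        self.current.fetch_add(n, Ordering::Relaxed);
    }
    fn sub(&self, n: usize) {
        self.current.fetch_sub(n, Ordering::Relaxed);
    }
    fn used_bytes(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }
}

fn main() {
    let a = ByteCounter::new();
    let b = a.clone(); // clones share the same underlying counter
    a.add(1000);
    b.sub(250);
    assert_eq!(a.used_bytes(), 750);
    assert_eq!(b.used_bytes(), 750);
}
```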

Proposed Fix

Replace current: usize with current: Arc<std::sync::atomic::AtomicUsize> and update it at every cache mutation point:

  1. After inserting a resolved entry in get_layer_bytes:

    *cached = CacheEntry::Resolved(bytes.clone());
    self.current.fetch_add(bytes.len(), Ordering::Relaxed);
  2. Inside ensure_additional_cache_space — the evicted entry sizes are already computed; pass the AtomicUsize in and decrement:

    if let CacheEntry::Resolved(entry) = entry {
        current.fetch_sub(entry.len(), Ordering::Relaxed);
        // ...existing required tracking...
    }
  3. In drop_from_cache — decrement when an uncacheable or failed entry is removed:

    fn drop_from_cache(
        cache: &mut LruCache<[u32; 5], CacheEntry>,
        id: [u32; 5],
        current: &AtomicUsize,
    ) {
        // Demote the entry to the LRU position *before* popping, so pop_lru
        // removes the intended entry; decrement only if it held resolved bytes.
        if cache.contains(&id) {
            cache.demote(&id);
            if let Some((_, CacheEntry::Resolved(bytes))) = cache.pop_lru() {
                current.fetch_sub(bytes.len(), Ordering::Relaxed);
            }
        }
    }
  4. Update used_bytes() to read from the atomic instead of iterating the cache (faster, always available):

    pub fn used_bytes(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }

Testing

  • Add a unit test that creates an LruArchiveBackend with a small limit (e.g., 1 MB), inserts entries exceeding that limit, and verifies that older entries are evicted.
  • Verify used_bytes() returns accurate totals after insertions and evictions.
  • Verify that entries larger than limit_bytes() are correctly rejected (not cached).
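The behavior those tests should pin down can be modeled with a self-contained stand-in (a byte-limited LRU as a queue; ByteLru and its fields are hypothetical, not the real backend API):

```rust
use std::collections::VecDeque;

// Hypothetical model of a byte-limited LRU cache: front = least recent.
struct ByteLru {
    limit: usize,
    current: usize,
    entries: VecDeque<(u32, usize)>, // (layer id, byte size)
}

impl ByteLru {
    fn insert(&mut self, id: u32, size: usize) -> bool {
        if size > self.limit {
            return false; // oversized entries are rejected, not cached
        }
        while self.limit - self.current < size {
            let (_, evicted) = self.entries.pop_front().expect("accounting out of sync");
            self.current -= evicted; // decrement on eviction
        }
        self.entries.push_back((id, size));
        self.current += size; // increment on insert
        true
    }
    fn used_bytes(&self) -> usize {
        self.current
    }
}

fn main() {
    let mb = 1024 * 1024;
    let mut cache = ByteLru { limit: mb, current: 0, entries: VecDeque::new() };
    for id in 0..10 {
        assert!(cache.insert(id, 200 * 1024)); // 10 x 200 KB exceeds the 1 MB limit
    }
    assert!(cache.used_bytes() <= mb); // usage stays bounded
    assert_eq!(cache.entries.front().unwrap().0, 5); // oldest entries were evicted
    assert!(!cache.insert(99, 2 * mb)); // entry larger than the limit is rejected
}
```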

Labels

bug, terminus-store, memory
