Distributed Locking
Available since v0.3.0
Related: See Architecture: L1+L2 Caching for how distributed locking fits into the overall cache architecture.
Distributed locking prevents "cache stampede" - when multiple pods simultaneously call an expensive function on cache miss. With locking, only one pod calls the function; others wait for the cache result.
@cache(ttl=300)  # Distributed locking enabled by default (via LockableBackend)
def expensive_query(key):
    return db.expensive_query(key)

# 1000 pods call simultaneously on L2 miss
# Only 1 pod calls expensive_query()
# 999 pods wait for L2 cache to be populated

Distributed locking is enabled by default when the backend supports it:
from cachekit import cache

@cache(ttl=300)  # Locking active on LockableBackend (e.g. RedisBackend)
def get_report(date):
    return db.generate_report(date)  # Expensive operation

# Multiple pods calling simultaneously on cache miss
# Only one executes generate_report()
report = get_report("2025-01-15")

Note
Locking requires a backend that implements the LockableBackend protocol (e.g. RedisBackend). Backends that don't support locking (HTTP, FileBackend) silently skip lock acquisition; the function still works, just without stampede protection.
Cache stampede scenario:
Cache miss happens (L1 and L2 miss)
1000 pods call expensive function simultaneously
→ 1000x the load on the database (BAD)
→ Database overloaded, queries slow/fail (BAD)
→ Cache takes longer to populate (BAD)
→ More stampedes happen (BAD cascade)
With distributed locking:
1000 pods call expensive function
Distributed lock acquired by Pod A
999 pods wait for lock
Pod A calls function once
Pod A populates L2 cache
Pod A releases lock
999 pods wake up, read from L2 cache
→ Function called 1 time instead of 1000 (GOOD)
→ Database handles 1 query instead of 1000 (GOOD)
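The flow above can be sketched in a single process. This is a minimal simulation, not cachekit code: a `threading.Lock` stands in for the distributed lock, a plain dict (`l2_cache`, a hypothetical name) stands in for the shared L2 cache, and 50 threads stand in for pods. The key detail is the second cache check after the lock is acquired, which is what lets the waiting workers read the populated value instead of recomputing it.

```python
import threading
import time

l2_cache = {}                   # stands in for the shared L2 cache
lock = threading.Lock()         # stands in for the distributed lock
call_count = 0                  # how many times the expensive function really ran


def expensive_query(key):
    """Simulated expensive function."""
    global call_count
    call_count += 1             # only ever executed while holding the lock
    time.sleep(0.05)
    return f"value-for-{key}"


def get_with_lock(key):
    # Fast path: cache hit, no lock needed.
    if key in l2_cache:
        return l2_cache[key]
    with lock:
        # Re-check after acquiring: another worker may have populated the cache
        # while this one was waiting.
        if key not in l2_cache:
            l2_cache[key] = expensive_query(key)
        return l2_cache[key]


threads = [threading.Thread(target=get_with_lock, args=("report",)) for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(call_count)  # 1
```

Without the re-check inside the lock, every waiting worker would call the function in turn once it acquired the lock, which serializes the stampede rather than preventing it.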
Production scenario: Popular data being cached. Cache expires simultaneously across all pods.
Without locking:
Cache miss
1000 pods hit database simultaneously
Database gets 1000 queries for same data
Database overloaded
Queries timeout
Stampede cascades
With locking:
Cache miss
1000 pods contend for lock
1 pod wins, queries database (normal load)
999 pods wait
Database serves 1 query
Lock released, cache populated
999 pods read from cache
No overload, no cascade
Real example: News site, trending story expires from cache
- Without locking: 10,000 requests = 10,000 DB queries
- With locking: 10,000 requests = 1 DB query
Note
Scenarios where locking adds overhead without benefit:
- Inexpensive functions (<1ms execution): Lock overhead isn't worth it
- Low concurrency (1-2 pods): No stampede risk
- Cache always hits (TTL never expires): Locking never used
When locking overhead matters, use a backend that doesn't implement LockableBackend, or raise an issue; a per-decorator toggle is being tracked.
@cache(ttl=300)
def operation(x):
    return slow_compute(x)  # Takes 10 seconds

# If the lock's blocking_timeout expires before slow_compute() finishes,
# waiting pods fall through without the lock.
# Solution: Ensure your function completes within the backend's lock timeout.
# The AdaptiveTimeoutManager adjusts lock timeouts automatically based on
# observed lock operation durations.

# Pod A acquires lock
# Pod A crashes while holding lock
# 999 pods wait until lock TTL expires
# Solution: Redis expiry + blocking_timeout handles this automatically

@cache(ttl=5)  # 5 second TTL
def operation(x):
    time.sleep(2)
    return slow_compute(x)  # Takes 2 seconds

# Lock acquired, Pod B waits 2 seconds
# TTL expires while Pod B waits
# Solution: Ensure TTL > function execution time

@cache(ttl=3600)  # Locking enabled by default on LockableBackend
def get_leaderboard():
    return db.expensive_leaderboard_query()

# 1000 users request leaderboard simultaneously
# Only 1 computes leaderboard
# 999 wait for result
leaderboard = get_leaderboard()

from cachekit import cache
from cachekit.backends.redis import RedisBackend

backend = RedisBackend()  # Implements LockableBackend

@cache(ttl=300, backend=backend)
def generate_stats(date):
    # Computation takes <30 seconds
    return stats_engine.compute(date)

# Use a non-LockableBackend for operations where stampede isn't a concern,
# or just accept the minimal overhead; locking only activates on cache miss.
@cache(ttl=300)
def cheap_lookup(x):
    # <1ms operation; even if 1000 pods hit simultaneously, DB load is trivial
    return simple_dict.get(x)

The LockableBackend protocol defines how backends provide distributed locking:
async def acquire_lock(
    self,
    key: str,  # Lock key, e.g. "lock:function_name:args_hash"
    timeout: float,  # How long to hold the lock (seconds)
    blocking_timeout: Optional[float] = None,  # Max wait to acquire (None = non-blocking)
) -> AsyncIterator[bool]:
    # Yields True if lock acquired, False if timeout waiting
    ...

Lock flow:
1. Try to SET lock key (NX - only if not exists)
2. If SET succeeds → lock acquired, yield True
3. If SET fails → lock held, wait up to blocking_timeout
4. On context exit: DEL lock key (only if still holder)
Lock auto-expires via Redis TTL if holder crashes
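The four steps of the lock flow can be illustrated without a Redis server. The sketch below is an assumption-laden stand-in, not the RedisBackend implementation: `InMemoryLockStore`, `set_nx`, `delete_if_holder`, and `poll_interval` are hypothetical names invented here, and a dict with expiry timestamps mimics SET NX with a TTL. The `acquire_lock` signature follows the protocol shown earlier, with the store passed explicitly in place of `self`.

```python
import asyncio
import time
import uuid
from contextlib import asynccontextmanager
from typing import AsyncIterator, Dict, Optional, Tuple


class InMemoryLockStore:
    """Hypothetical stand-in for Redis: key -> (holder token, expiry time)."""

    def __init__(self) -> None:
        self._locks: Dict[str, Tuple[str, float]] = {}

    def set_nx(self, key: str, token: str, ttl: float) -> bool:
        """Mimic SET NX with a TTL: succeed only if the key is absent or expired."""
        now = time.monotonic()
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return False
        self._locks[key] = (token, now + ttl)
        return True

    def delete_if_holder(self, key: str, token: str) -> None:
        """Remove the key only if this token still holds the lock."""
        current = self._locks.get(key)
        if current is not None and current[0] == token:
            del self._locks[key]


@asynccontextmanager
async def acquire_lock(
    store: InMemoryLockStore,
    key: str,
    timeout: float,
    blocking_timeout: Optional[float] = None,
    poll_interval: float = 0.01,  # assumed polling cadence, for illustration
) -> AsyncIterator[bool]:
    token = uuid.uuid4().hex
    deadline = time.monotonic() + (blocking_timeout or 0.0)
    acquired = store.set_nx(key, token, timeout)          # steps 1 and 2
    while not acquired and time.monotonic() < deadline:   # step 3
        await asyncio.sleep(poll_interval)
        acquired = store.set_nx(key, token, timeout)
    try:
        yield acquired
    finally:
        if acquired:
            store.delete_if_holder(key, token)            # step 4


async def demo() -> Tuple[bool, bool, bool]:
    store = InMemoryLockStore()
    async with acquire_lock(store, "lock:demo", timeout=5.0) as first:
        # A non-blocking attempt while the lock is held fails immediately.
        async with acquire_lock(store, "lock:demo", timeout=5.0) as second:
            pass
    # After release, the lock can be taken again.
    async with acquire_lock(store, "lock:demo", timeout=5.0) as third:
        pass
    return first, second, third


results = asyncio.run(demo())
print(results)  # (True, False, True)
```

The per-holder token is what makes step 4 safe: if the lock expired and another caller took it, the original holder's release does not delete the new holder's lock.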
Lock timeouts are managed by AdaptiveTimeoutManager, which adjusts based on:
- Average lock operation duration
- Lock contention levels (inferred from wait times)
- Success rate trends
This prevents both premature timeouts (function takes longer than expected) and excessive waits (hanging on a crashed holder).
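To make the idea of duration-based adjustment concrete, here is a deliberately simple sketch. It is not the AdaptiveTimeoutManager algorithm, which this page does not specify: the class name, the averaging window, the `headroom` multiplier, and the floor and ceiling values are all assumptions chosen for illustration. It covers only the first input listed above (average operation duration) and ignores contention and success rate.

```python
from collections import deque


class SimpleAdaptiveTimeout:
    """Hypothetical illustration only; not cachekit's AdaptiveTimeoutManager."""

    def __init__(self, initial=1.0, floor=0.1, ceiling=30.0, headroom=2.0, window=100):
        self.initial = initial      # timeout used before any observations exist
        self.floor = floor          # never wait less than this
        self.ceiling = ceiling      # never wait longer than this
        self.headroom = headroom    # multiplier applied to the average duration
        self.durations = deque(maxlen=window)  # most recent observed durations

    def record(self, duration):
        """Record how long one lock operation took, in seconds."""
        self.durations.append(duration)

    def current_timeout(self):
        """Average of recent durations times headroom, clamped to [floor, ceiling]."""
        if not self.durations:
            return self.initial
        average = sum(self.durations) / len(self.durations)
        return max(self.floor, min(self.ceiling, average * self.headroom))


manager = SimpleAdaptiveTimeout()
manager.record(0.5)
manager.record(1.5)
print(manager.current_timeout())  # 2.0
```

The ceiling is what bounds the wait on a crashed holder, and the headroom multiplier is what guards against premature timeouts when the function runs slower than average.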
L1 miss, L2 miss detected
Distributed lock acquisition begins (via backend.acquire_lock)
Only one pod wins lock
That pod calls function
Function executes
Result written to L1 and L2
Lock released
Other pods read from L2 (now populated)
- Lock already held: Polling at blocking_timeout interval
- Lock acquisition: <10ms (Redis SET NX operation)
- Lock release: <5ms (Redis DEL operation)
- Waiting cost: Function execution cost saved * (pods_waiting - 1)
Example: 1000 pods, 10s function call, 999 waiting
- Cost without locking: 10,000 seconds total CPU
- Cost with locking: 10 seconds + lock overhead ≈ 60 seconds total CPU
- Savings: 99.4% reduction
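The arithmetic behind these figures can be checked directly. The helper below is a hypothetical illustration, not part of cachekit. The 50 seconds of total lock overhead is an assumption inferred from the 60-second figure above, which works out to roughly 50 ms per pod across 1000 pods.

```python
def stampede_cost(pods, function_seconds, lock_overhead_seconds):
    """Compare total CPU seconds with and without stampede protection."""
    without_locking = pods * function_seconds                 # every pod runs the function
    with_locking = function_seconds + lock_overhead_seconds   # one run plus lock overhead
    savings_pct = 100.0 * (1 - with_locking / without_locking)
    return without_locking, with_locking, savings_pct


without, with_lock, savings = stampede_cost(
    pods=1000, function_seconds=10, lock_overhead_seconds=50
)
print(without, with_lock, round(savings, 1))  # 10000 60 99.4
```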
Distributed Locking + Circuit Breaker:
@cache(ttl=300)  # Both enabled
def operation(x):
    # L2 backend down while holding lock
    # Circuit breaker catches error
    # Lock TTL ensures lock eventually expires
    return compute(x)

Distributed Locking + Encryption:
@cache.secure(ttl=300)  # Both enabled
def fetch_sensitive(x):
    # Lock protects function execution
    # Encryption happens on write to L2
    # Both work transparently together
    return compute(x)

cachekit_lock_acquisitions_total{function="get_leaderboard"}
# How many times lock was acquired
cachekit_lock_timeouts_total{function="get_leaderboard"}
# How many times lock timeout occurred
cachekit_lock_wait_duration_seconds{function="get_leaderboard"}
# How long waiting pods waited for lock
# If metrics show:
# - High cache_misses_total
# - Low lock_acquisitions_total (relative to misses)
# → Stampede is happening
# Check log:
# - "lock_timeout" errors
# → Lock timeout is too short relative to function execution time

Q: Getting "lock_timeout" errors
A: Your function takes longer than the lock's blocking timeout. Ensure function execution time is well under the backend's configured lock timeout.
Q: Locking doesn't seem to be working
A: Verify your backend implements LockableBackend. Check with from cachekit.backends.base import LockableBackend; isinstance(backend, LockableBackend).
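An isinstance check against a protocol only works if the protocol is runtime-checkable. The sketch below assumes LockableBackend is a `typing.Protocol` decorated with `@runtime_checkable`, which the check above suggests but this page does not confirm. `SupportsAcquireLock` and the two backend classes are hypothetical stand-ins so the example runs without cachekit installed.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SupportsAcquireLock(Protocol):
    """Hypothetical stand-in for a protocol like LockableBackend."""

    def acquire_lock(self, key: str, timeout: float): ...


class BackendWithLock:
    """Satisfies the protocol structurally; no inheritance required."""

    def acquire_lock(self, key: str, timeout: float):
        return True


class BackendWithoutLock:
    """Has no acquire_lock method, so it does not satisfy the protocol."""

    def get(self, key: str):
        return None


has_lock = isinstance(BackendWithLock(), SupportsAcquireLock)
no_lock = isinstance(BackendWithoutLock(), SupportsAcquireLock)
print(has_lock, no_lock)  # True False
```

A runtime protocol check only verifies that the method exists, not its signature or behavior, so a passing check confirms that locking is available but not that it is configured correctly.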
Q: How do I know if stampedes are happening?
A: Check Prometheus: rate(cachekit_cache_misses_total[1m]) spike = stampede risk.
- Circuit Breaker - Prevents cascading failures
- Adaptive Timeouts - Auto-tune Redis timeouts
- Prometheus Metrics - Monitor lock performance
- Comparison Guide - Only cachekit + dogpile.cache have locking