fix: make PodLockManager.release_lock atomic compare-and-delete (re-land #21226) by joereyna · Pull Request #24466 · BerriAI/litellm

joereyna · 2026-03-24T03:55:29Z

Summary

Re-lands #21226, which was reverted in #21469 due to e2e tests failing in CI (no Redis connection available). Those tests now have @pytest.mark.skip(reason="Requires Redis connection.") added by commit 3dd58b42, so this re-land is safe.

Root cause fixed

release_lock() was doing GET → compare → DEL in three separate Redis calls. Between the GET and DEL, another pod could acquire the lock (after TTL expiry), and the stale owner would then DEL a lock it no longer owns.

Fix

Atomic Lua compare-and-delete script:

if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end

Script registration is cached per PodLockManager instance (registered once, reused)
Falls back to GET + DEL for cache backends that don't expose async_register_script
If Lua execution raises at runtime (e.g. Redis restart cleared loaded scripts), the cached script handle is reset and execution falls through to the GET + DEL fallback — preserving pre-existing resilience

Test plan

All 13 existing unit tests pass (poetry run pytest tests/test_litellm/proxy/db/db_transaction_queue/test_pod_lock_manager.py)
Added test_release_lock_uses_atomic_compare_delete_script_when_available — verifies Lua path is taken when async_register_script is available, and GET/DEL are not called
Added test_release_lock_reuses_registered_script — verifies script is registered once and reused across calls
Added test_release_lock_emits_event_on_lua_success — verifies released-lock event still fires on the Lua path
Added test_release_lock_falls_back_to_get_del_on_lua_failure — verifies that if Lua raises, execution falls back to GET+DEL rather than leaving the lock held

vercel · 2026-03-24T03:55:34Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
litellm	Ready	Preview, Comment	Apr 15, 2026 11:08pm

codspeed-hq · 2026-03-24T03:57:13Z

Merging this PR will not alter performance

✅ 16 untouched benchmarks

_{Comparing joereyna:fix/pod-lock-atomic-release (badd8c8) with main (72a461b)}

greptile-apps · 2026-03-24T04:10:52Z

Greptile Summary

This PR re-lands the atomic compare-and-delete fix for PodLockManager.release_lock() (originally #21226, reverted in #21469). It replaces the three-step GET → compare → DEL with a single-round-trip Lua script so a stale owner can never delete a lock it no longer holds. The implementation gracefully falls back to GET + DEL when async_register_script is absent from the cache backend or when the Lua call raises at runtime (e.g., after a Redis restart that clears loaded scripts), and caches the registered script handle to avoid re-registering on every call. New tests cover the Lua path, script-handle reuse, event emission, and the runtime-failure fallback; all previously noted concerns from the prior review thread have been addressed.

Confidence Score: 5/5

Safe to merge — the atomic Lua fix is correct, the fallback is present, and all four new test cases pass through the logic cleanly.

No P0/P1 issues found. The Lua script is semantically correct, the exception-based fallback restores pre-PR behavior on Lua failures, and the script-handle caching is safe under Python's cooperative async scheduler.

No files require special attention.

Important Files Changed

Filename	Overview
litellm/proxy/db/db_transaction_queue/pod_lock_manager.py	Replaces non-atomic GET→compare→DEL with atomic Lua compare-and-delete; adds GET+DEL fallback for unsupported backends and for Lua execution failures (e.g., after Redis restart clears scripts).
tests/test_litellm/proxy/db/db_transaction_queue/test_pod_lock_manager.py	Adds four new tests covering the Lua atomic path, script-handle caching, event emission on Lua success, and the GET+DEL fallback when Lua execution fails; existing tests continue to exercise the fallback path via MockRedisCache (no async_register_script).

Sequence Diagram

sequenceDiagram
    participant Pod
    participant PLM as PodLockManager
    participant Redis

    Pod->>PLM: release_lock(cronjob_id)
    PLM->>PLM: _compare_and_delete_lock(lock_key)

    alt async_register_script present on cache
        PLM->>PLM: register Lua script once (cached)
        PLM->>Redis: EVALSHA compare-and-delete(lock_key, pod_id)
        alt Lua succeeds
            Redis-->>PLM: 1 (deleted) or 0 (not owner / missing)
        else Lua fails (NOSCRIPT, scripting disabled, etc.)
            Redis-->>PLM: exception
            PLM->>PLM: reset _release_lock_script = None
            Note over PLM: fall through to GET+DEL
            PLM->>Redis: GET lock_key
            Redis-->>PLM: current_value
            PLM->>Redis: DEL lock_key (only if current_value == pod_id)
            Redis-->>PLM: result (1 or 0)
        end
    else No async_register_script (non-Redis backend)
        PLM->>Redis: GET lock_key
        Redis-->>PLM: current_value
        PLM->>Redis: DEL lock_key (only if current_value == pod_id)
        Redis-->>PLM: result (1 or 0)
    end

    alt result == 1
        PLM->>Pod: _emit_released_lock_event
    else result == 0
        PLM->>Pod: debug log (lock missing or held by another pod)
    end

_{Reviews (4): Last reviewed commit: "fix: add Lua fallback on execution error..." | Re-trigger Greptile}

greptile-apps · 2026-03-24T04:10:59Z

+        if callable(script_register):
+            if self._release_lock_script is None:
+                self._release_lock_script = script_register(
+                    self._COMPARE_AND_DELETE_LOCK_SCRIPT
+                )
+            result = await self._release_lock_script(
+                keys=[lock_key], args=[self.pod_id]
+            )
+            return int(result or 0)


No fallback when Lua script execution fails

The GET + DEL fallback is only reached when async_register_script is entirely absent from the backend. If the attribute is present but the script call at await self._release_lock_script(...) raises at runtime (for example, after a Redis restart that clears loaded scripts), the exception propagates directly out of _compare_and_delete_lock without ever reaching the fallback code below line 168. The outer try/except in release_lock catches it, logs an error, and returns — leaving the lock held until TTL expiry.

Before this PR, those same configurations would have used GET + DEL and released the lock successfully. Wrapping just the Lua execution block in its own try/except and then continuing to the fallback code on failure would preserve the old resilience while still preferring the atomic path.

Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Re-lands BerriAI#21226 (reverted in BerriAI#21469). release_lock() previously did GET + compare + DEL in separate calls, leaving a window where another pod could reacquire the lock between the GET and DEL, causing a stale owner to delete a live lock. Fix: use a Redis Lua script for atomic compare-and-delete. Script registration is cached per PodLockManager instance. Falls back to the old GET+DEL path for cache backends that don't expose async_register_script. Original revert was due to e2e tests running in CI without Redis. Those tests now carry @pytest.mark.skip(reason="Requires Redis connection.") so this re-land is safe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-04-15T23:10:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset.

vercel Bot deployed to Preview March 24, 2026 03:57 View deployment

greptile-apps Bot reviewed Mar 24, 2026

View reviewed changes

vercel Bot deployed to Preview March 24, 2026 05:56 View deployment

joereyna force-pushed the fix/pod-lock-atomic-release branch from b6458d6 to 7a8fc3f Compare March 31, 2026 23:07

vercel Bot deployed to Preview March 31, 2026 23:09 View deployment

joereyna and others added 2 commits April 15, 2026 16:06

joereyna force-pushed the fix/pod-lock-atomic-release branch from 7a8fc3f to badd8c8 Compare April 15, 2026 23:07

vercel Bot deployed to Preview April 15, 2026 23:08 View deployment

krrish-berri-2 changed the base branch from main to litellm_internal_staging April 16, 2026 00:31

krrish-berri-2 enabled auto-merge (squash) April 16, 2026 00:32

krrish-berri-2 approved these changes Apr 16, 2026

View reviewed changes

krrish-berri-2 merged commit f92490c into BerriAI:litellm_internal_staging Apr 16, 2026
50 of 53 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: make PodLockManager.release_lock atomic compare-and-delete (re-land #21226)#24466

fix: make PodLockManager.release_lock atomic compare-and-delete (re-land #21226)#24466
krrish-berri-2 merged 2 commits intoBerriAI:litellm_internal_stagingfrom
joereyna:fix/pod-lock-atomic-release

joereyna commented Mar 24, 2026 •

edited

Loading

Uh oh!

vercel Bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Mar 24, 2026 •

edited

Loading

Uh oh!

greptile-apps Bot commented Mar 24, 2026 •

edited

Loading

Important Files Changed

Uh oh!

greptile-apps Bot Mar 24, 2026

Uh oh!

codecov Bot commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

joereyna commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause fixed

Fix

Test plan

Uh oh!

vercel Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codspeed-hq Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Uh oh!

greptile-apps Bot commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps Bot Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

joereyna commented Mar 24, 2026 •

edited

Loading

vercel Bot commented Mar 24, 2026 •

edited

Loading

codspeed-hq Bot commented Mar 24, 2026 •

edited

Loading

greptile-apps Bot commented Mar 24, 2026 •

edited

Loading

codecov Bot commented Apr 15, 2026 •

edited

Loading