fix: make PodLockManager.release_lock atomic compare-and-delete (re-land #21226)#24466
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Greptile SummaryThis PR re-lands the atomic compare-and-delete fix for Confidence Score: 5/5Safe to merge — the atomic Lua fix is correct, the fallback is present, and all four new test cases pass through the logic cleanly. No P0/P1 issues found. The Lua script is semantically correct, the exception-based fallback restores pre-PR behavior on Lua failures, and the script-handle caching is safe under Python's cooperative async scheduler. No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/proxy/db/db_transaction_queue/pod_lock_manager.py | Replaces non-atomic GET→compare→DEL with atomic Lua compare-and-delete; adds GET+DEL fallback for unsupported backends and for Lua execution failures (e.g., after Redis restart clears scripts). |
| tests/test_litellm/proxy/db/db_transaction_queue/test_pod_lock_manager.py | Adds four new tests covering the Lua atomic path, script-handle caching, event emission on Lua success, and the GET+DEL fallback when Lua execution fails; existing tests continue to exercise the fallback path via MockRedisCache (no async_register_script). |
Sequence Diagram
sequenceDiagram
participant Pod
participant PLM as PodLockManager
participant Redis
Pod->>PLM: release_lock(cronjob_id)
PLM->>PLM: _compare_and_delete_lock(lock_key)
alt async_register_script present on cache
PLM->>PLM: register Lua script once (cached)
PLM->>Redis: EVALSHA compare-and-delete(lock_key, pod_id)
alt Lua succeeds
Redis-->>PLM: 1 (deleted) or 0 (not owner / missing)
else Lua fails (NOSCRIPT, scripting disabled, etc.)
Redis-->>PLM: exception
PLM->>PLM: reset _release_lock_script = None
Note over PLM: fall through to GET+DEL
PLM->>Redis: GET lock_key
Redis-->>PLM: current_value
PLM->>Redis: DEL lock_key (only if current_value == pod_id)
Redis-->>PLM: result (1 or 0)
end
else No async_register_script (non-Redis backend)
PLM->>Redis: GET lock_key
Redis-->>PLM: current_value
PLM->>Redis: DEL lock_key (only if current_value == pod_id)
Redis-->>PLM: result (1 or 0)
end
alt result == 1
PLM->>Pod: _emit_released_lock_event
else result == 0
PLM->>Pod: debug log (lock missing or held by another pod)
end
Reviews (4): Last reviewed commit: "fix: add Lua fallback on execution error..." | Re-trigger Greptile
| if callable(script_register): | ||
| if self._release_lock_script is None: | ||
| self._release_lock_script = script_register( | ||
| self._COMPARE_AND_DELETE_LOCK_SCRIPT | ||
| ) | ||
| result = await self._release_lock_script( | ||
| keys=[lock_key], args=[self.pod_id] | ||
| ) | ||
| return int(result or 0) |
There was a problem hiding this comment.
No fallback when Lua script execution fails
The GET + DEL fallback is only reached when async_register_script is entirely absent from the backend. If the attribute is present but the script call at await self._release_lock_script(...) raises at runtime (for example, after a Redis restart that clears loaded scripts), the exception propagates directly out of _compare_and_delete_lock without ever reaching the fallback code below line 168. The outer try/except in release_lock catches it, logs an error, and returns — leaving the lock held until TTL expiry.
Before this PR, those same configurations would have used GET + DEL and released the lock successfully. Wrapping just the Lua execution block in its own try/except and then continuing to the fallback code on failure would preserve the old resilience while still preferring the atomic path.
Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
b6458d6 to
7a8fc3f
Compare
Re-lands BerriAI#21226 (reverted in BerriAI#21469). release_lock() previously did GET + compare + DEL in separate calls, leaving a window where another pod could reacquire the lock between the GET and DEL, causing a stale owner to delete a live lock. Fix: use a Redis Lua script for atomic compare-and-delete. Script registration is cached per PodLockManager instance. Falls back to the old GET+DEL path for cache backends that don't expose async_register_script. Original revert was due to e2e tests running in CI without Redis. Those tests now carry @pytest.mark.skip(reason="Requires Redis connection.") so this re-land is safe. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
7a8fc3f to
badd8c8
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
f92490c
into
BerriAI:litellm_internal_staging
Address Greptile review feedback on BerriAI#24466: 1. Wrap Lua script execution in try/except — if Redis clears loaded scripts (restart) or scripting is disabled, fall back to GET+DEL rather than letting the exception propagate and leave the lock held until TTL. Reset cached script handle so the next call re-registers. 2. Add test_release_lock_lua_path_emits_released_event — verifies _emit_released_lock_event is called when Lua path returns 1. 3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails — verifies the fallback path is taken and script handle is reset.
Summary
Re-lands #21226, which was reverted in #21469 due to e2e tests failing in CI (no Redis connection available). Those tests now have
@pytest.mark.skip(reason="Requires Redis connection.")added by commit3dd58b42, so this re-land is safe.Root cause fixed
release_lock()was doing GET → compare → DEL in three separate Redis calls. Between the GET and DEL, another pod could acquire the lock (after TTL expiry), and the stale owner would then DEL a lock it no longer owns.Fix
Atomic Lua compare-and-delete script:
PodLockManagerinstance (registered once, reused)async_register_scriptTest plan
poetry run pytest tests/test_litellm/proxy/db/db_transaction_queue/test_pod_lock_manager.py)test_release_lock_uses_atomic_compare_delete_script_when_available— verifies Lua path is taken whenasync_register_scriptis available, and GET/DEL are not calledtest_release_lock_reuses_registered_script— verifies script is registered once and reused across callstest_release_lock_emits_event_on_lua_success— verifies released-lock event still fires on the Lua pathtest_release_lock_falls_back_to_get_del_on_lua_failure— verifies that if Lua raises, execution falls back to GET+DEL rather than leaving the lock held