Skip to content

fix: make PodLockManager.release_lock atomic compare-and-delete (re-land #21226)#24466

Merged
krrish-berri-2 merged 2 commits intoBerriAI:litellm_internal_stagingfrom
joereyna:fix/pod-lock-atomic-release
Apr 16, 2026
Merged

fix: make PodLockManager.release_lock atomic compare-and-delete (re-land #21226)#24466
krrish-berri-2 merged 2 commits intoBerriAI:litellm_internal_stagingfrom
joereyna:fix/pod-lock-atomic-release

Conversation

@joereyna
Copy link
Copy Markdown
Contributor

@joereyna joereyna commented Mar 24, 2026

Summary

Re-lands #21226, which was reverted in #21469 due to e2e tests failing in CI (no Redis connection available). Those tests now have @pytest.mark.skip(reason="Requires Redis connection.") added by commit 3dd58b42, so this re-land is safe.

Root cause fixed

release_lock() was doing GET → compare → DEL in three separate Redis calls. Between the GET and DEL, another pod could acquire the lock (after TTL expiry), and the stale owner would then DEL a lock it no longer owns.

Fix

Atomic Lua compare-and-delete script:

if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
  • Script registration is cached per PodLockManager instance (registered once, reused)
  • Falls back to GET + DEL for cache backends that don't expose async_register_script
  • If Lua execution raises at runtime (e.g. Redis restart cleared loaded scripts), the cached script handle is reset and execution falls through to the GET + DEL fallback — preserving pre-existing resilience

Test plan

  • All 13 existing unit tests pass (poetry run pytest tests/test_litellm/proxy/db/db_transaction_queue/test_pod_lock_manager.py)
  • Added test_release_lock_uses_atomic_compare_delete_script_when_available — verifies Lua path is taken when async_register_script is available, and GET/DEL are not called
  • Added test_release_lock_reuses_registered_script — verifies script is registered once and reused across calls
  • Added test_release_lock_emits_event_on_lua_success — verifies released-lock event still fires on the Lua path
  • Added test_release_lock_falls_back_to_get_del_on_lua_failure — verifies that if Lua raises, execution falls back to GET+DEL rather than leaving the lock held

@vercel
Copy link
Copy Markdown

vercel Bot commented Mar 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Ready Ready Preview, Comment Apr 15, 2026 11:08pm

Request Review

@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Mar 24, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing joereyna:fix/pod-lock-atomic-release (badd8c8) with main (72a461b)

Open in CodSpeed

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 24, 2026

Greptile Summary

This PR re-lands the atomic compare-and-delete fix for PodLockManager.release_lock() (originally #21226, reverted in #21469). It replaces the three-step GET → compare → DEL with a single-round-trip Lua script so a stale owner can never delete a lock it no longer holds. The implementation gracefully falls back to GET + DEL when async_register_script is absent from the cache backend or when the Lua call raises at runtime (e.g., after a Redis restart that clears loaded scripts), and caches the registered script handle to avoid re-registering on every call. New tests cover the Lua path, script-handle reuse, event emission, and the runtime-failure fallback; all previously noted concerns from the prior review thread have been addressed.

Confidence Score: 5/5

Safe to merge — the atomic Lua fix is correct, the fallback is present, and all four new test cases pass through the logic cleanly.

No P0/P1 issues found. The Lua script is semantically correct, the exception-based fallback restores pre-PR behavior on Lua failures, and the script-handle caching is safe under Python's cooperative async scheduler.

No files require special attention.

Important Files Changed

Filename Overview
litellm/proxy/db/db_transaction_queue/pod_lock_manager.py Replaces non-atomic GET→compare→DEL with atomic Lua compare-and-delete; adds GET+DEL fallback for unsupported backends and for Lua execution failures (e.g., after Redis restart clears scripts).
tests/test_litellm/proxy/db/db_transaction_queue/test_pod_lock_manager.py Adds four new tests covering the Lua atomic path, script-handle caching, event emission on Lua success, and the GET+DEL fallback when Lua execution fails; existing tests continue to exercise the fallback path via MockRedisCache (no async_register_script).

Sequence Diagram

sequenceDiagram
    participant Pod
    participant PLM as PodLockManager
    participant Redis

    Pod->>PLM: release_lock(cronjob_id)
    PLM->>PLM: _compare_and_delete_lock(lock_key)

    alt async_register_script present on cache
        PLM->>PLM: register Lua script once (cached)
        PLM->>Redis: EVALSHA compare-and-delete(lock_key, pod_id)
        alt Lua succeeds
            Redis-->>PLM: 1 (deleted) or 0 (not owner / missing)
        else Lua fails (NOSCRIPT, scripting disabled, etc.)
            Redis-->>PLM: exception
            PLM->>PLM: reset _release_lock_script = None
            Note over PLM: fall through to GET+DEL
            PLM->>Redis: GET lock_key
            Redis-->>PLM: current_value
            PLM->>Redis: DEL lock_key (only if current_value == pod_id)
            Redis-->>PLM: result (1 or 0)
        end
    else No async_register_script (non-Redis backend)
        PLM->>Redis: GET lock_key
        Redis-->>PLM: current_value
        PLM->>Redis: DEL lock_key (only if current_value == pod_id)
        Redis-->>PLM: result (1 or 0)
    end

    alt result == 1
        PLM->>Pod: _emit_released_lock_event
    else result == 0
        PLM->>Pod: debug log (lock missing or held by another pod)
    end
Loading

Reviews (4): Last reviewed commit: "fix: add Lua fallback on execution error..." | Re-trigger Greptile

Comment on lines +158 to +166
if callable(script_register):
if self._release_lock_script is None:
self._release_lock_script = script_register(
self._COMPARE_AND_DELETE_LOCK_SCRIPT
)
result = await self._release_lock_script(
keys=[lock_key], args=[self.pod_id]
)
return int(result or 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 No fallback when Lua script execution fails

The GET + DEL fallback is only reached when async_register_script is entirely absent from the backend. If the attribute is present but the script call at await self._release_lock_script(...) raises at runtime (for example, after a Redis restart that clears loaded scripts), the exception propagates directly out of _compare_and_delete_lock without ever reaching the fallback code below line 168. The outer try/except in release_lock catches it, logs an error, and returns — leaving the lock held until TTL expiry.

Before this PR, those same configurations would have used GET + DEL and released the lock successfully. Wrapping just the Lua execution block in its own try/except and then continuing to the fallback code on failure would preserve the old resilience while still preferring the atomic path.

joereyna added a commit to joereyna/litellm that referenced this pull request Mar 24, 2026
Address Greptile review feedback on BerriAI#24466:

1. Wrap Lua script execution in try/except — if Redis clears loaded
   scripts (restart) or scripting is disabled, fall back to GET+DEL
   rather than letting the exception propagate and leave the lock held
   until TTL. Reset cached script handle so the next call re-registers.

2. Add test_release_lock_lua_path_emits_released_event — verifies
   _emit_released_lock_event is called when Lua path returns 1.

3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails
   — verifies the fallback path is taken and script handle is reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
joereyna added a commit to joereyna/litellm that referenced this pull request Mar 31, 2026
Address Greptile review feedback on BerriAI#24466:

1. Wrap Lua script execution in try/except — if Redis clears loaded
   scripts (restart) or scripting is disabled, fall back to GET+DEL
   rather than letting the exception propagate and leave the lock held
   until TTL. Reset cached script handle so the next call re-registers.

2. Add test_release_lock_lua_path_emits_released_event — verifies
   _emit_released_lock_event is called when Lua path returns 1.

3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails
   — verifies the fallback path is taken and script handle is reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joereyna joereyna force-pushed the fix/pod-lock-atomic-release branch from b6458d6 to 7a8fc3f Compare March 31, 2026 23:07
joereyna and others added 2 commits April 15, 2026 16:06
Re-lands BerriAI#21226 (reverted in BerriAI#21469).

release_lock() previously did GET + compare + DEL in separate calls,
leaving a window where another pod could reacquire the lock between
the GET and DEL, causing a stale owner to delete a live lock.

Fix: use a Redis Lua script for atomic compare-and-delete. Script
registration is cached per PodLockManager instance. Falls back to
the old GET+DEL path for cache backends that don't expose
async_register_script.

Original revert was due to e2e tests running in CI without Redis.
Those tests now carry @pytest.mark.skip(reason="Requires Redis connection.")
so this re-land is safe.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Address Greptile review feedback on BerriAI#24466:

1. Wrap Lua script execution in try/except — if Redis clears loaded
   scripts (restart) or scripting is disabled, fall back to GET+DEL
   rather than letting the exception propagate and leave the lock held
   until TTL. Reset cached script handle so the next call re-registers.

2. Add test_release_lock_lua_path_emits_released_event — verifies
   _emit_released_lock_event is called when Lua path returns 1.

3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails
   — verifies the fallback path is taken and script handle is reset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@krrish-berri-2 krrish-berri-2 changed the base branch from main to litellm_internal_staging April 16, 2026 00:31
@krrish-berri-2 krrish-berri-2 enabled auto-merge (squash) April 16, 2026 00:32
@krrish-berri-2 krrish-berri-2 merged commit f92490c into BerriAI:litellm_internal_staging Apr 16, 2026
50 of 53 checks passed
joereyna added a commit to joereyna/litellm that referenced this pull request Apr 17, 2026
Address Greptile review feedback on BerriAI#24466:

1. Wrap Lua script execution in try/except — if Redis clears loaded
   scripts (restart) or scripting is disabled, fall back to GET+DEL
   rather than letting the exception propagate and leave the lock held
   until TTL. Reset cached script handle so the next call re-registers.

2. Add test_release_lock_lua_path_emits_released_event — verifies
   _emit_released_lock_event is called when Lua path returns 1.

3. Add test_release_lock_falls_back_to_get_del_when_lua_execution_fails
   — verifies the fallback path is taken and script handle is reset.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants