[Fix] Proxy: reconnect Prisma DB without blocking the event loop #26225
yuneng-berri merged 4 commits into litellm_internal_staging from …
Conversation
When the DB becomes unreachable the reconnect path calls
`prisma.disconnect()`, which ultimately invokes prisma-client-py's
synchronous `subprocess.Popen.wait()` on the query engine subprocess.
That call does not yield to asyncio, so the event loop freezes for
however long the Rust engine takes to shut down (30-120+ seconds in
production when the engine is stuck on TCP close). During the freeze
`/health/liveliness` becomes unresponsive, and in Kubernetes the
liveness probe fails and the pod is SIGKILL'd.
Replace `disconnect()` in the reconnect paths with a direct, non-blocking
kill of the engine subprocess (SIGTERM -> 0.5s asyncio-yielding sleep ->
SIGKILL) followed by a fresh Prisma client and a new `connect()`. Both
`recreate_prisma_client` and the formerly-separate "direct reconnect"
path go through the same kill-then-recreate flow.
Also validate `_get_engine_pid` returns an int (defensive; prevents a
MagicMock leak under unit-test mocking).
Tests that encoded the old blocking behavior are updated or removed;
the deleted `test_lightweight_reconnect_skips_kill_on_successful_disconnect`
invariant ("don't kill on successful disconnect") was part of the bug.
Greptile Summary

Confidence Score: 5/5. Safe to merge; the fix is well-reasoned, well-tested, and the only finding is a minor path-consistency P2. All four changed files have correct logic. No security issues, no data-loss risk, no backwards-incompatible API changes. The one P2 comment (engine watcher left unguarded on a mid-reconnect timeout cancellation) is a low-probability edge case that self-heals on the next retry cycle. P2s do not lower the score. No files require special attention.
| Filename | Overview |
|---|---|
| litellm/proxy/db/prisma_client.py | Core fix: recreate_prisma_client now kills the engine via SIGTERM→SIGKILL instead of calling blocking disconnect(). _get_engine_pid hardened with isinstance(pid, int) to prevent mock objects from leaking into os.kill. Docstring and log message updated to match new semantics. |
| litellm/proxy/utils.py | The "direct reconnect" path is refactored to delegate to recreate_prisma_client (same as the heavy path) instead of calling disconnect()+connect(). Now also properly manages the engine watcher lifecycle (_cleanup_engine_watcher / _start_engine_watcher) on the direct path. |
| tests/litellm/proxy/test_prisma_engine_watchdog.py | Tests updated to reflect unified reconnect path: assert recreate_prisma_client is called and disconnect is not. DATABASE_URL patched via patch.dict(os.environ) and _start_engine_watcher mocked in all affected tests. |
| tests/test_litellm/proxy/db/test_prisma_self_heal.py | Tests correctly updated: test_lightweight_reconnect_skips_kill_on_successful_disconnect removed (it tested the blocking behavior that was the bug itself), replaced by test_recreate_prisma_client_kills_old_engine_without_disconnect which asserts os.kill(SIGTERM) is called and disconnect is never awaited. |
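The replacement test described in the last row can be sketched with `unittest.mock`. Note that `recreate_prisma_client` here is a simplified hypothetical stand-in (the real method takes different arguments and also SIGKILLs after a 0.5s sleep); the sketch only illustrates the two assertions the new test makes, that `os.kill(SIGTERM)` is called and `disconnect` is never awaited:

```python
import asyncio
import os
import signal
from unittest.mock import AsyncMock, MagicMock, patch

async def recreate_prisma_client(client):
    """Hypothetical stand-in for the real reconnect path: kill the old
    engine directly instead of awaiting the blocking disconnect()."""
    os.kill(client.engine_pid, signal.SIGTERM)
    await asyncio.sleep(0)  # real code sleeps 0.5s, then SIGKILLs if alive
    await client.connect()  # fresh connection after the kill

def test_kills_old_engine_without_disconnect():
    client = MagicMock()
    client.engine_pid = 4242
    client.connect = AsyncMock()
    client.disconnect = AsyncMock()
    with patch("os.kill") as mock_kill:
        asyncio.run(recreate_prisma_client(client))
    # The engine is killed directly...
    mock_kill.assert_called_once_with(4242, signal.SIGTERM)
    # ...and the blocking disconnect() path is never taken.
    client.disconnect.assert_not_awaited()
```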
Reviews (2): last reviewed commit "docs(proxy): clarify _kill_engine_proces…"
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Greptile review on #26225 (P2): the docstring said "Called when disconnect() fails", and the SIGTERM warning log read "after failed disconnect", but both were stale — `_kill_engine_process` is now invoked on every routine reconnect (via the unified `recreate_prisma_client` path), not as a disconnect-failure recovery branch. The misleading wording would have produced confusing log lines on every reconnect cycle in production. Update the docstring to explain the actual reason (avoiding the blocking `disconnect()` event-loop freeze) and reword the SIGTERM warning to "during reconnect" so it matches reality. No behavior change; logs only.
Relevant issues
Summary
Failure Path (Before Fix)
When the database becomes unreachable the DB reconnect logic calls `await self.db.disconnect()`, which invokes prisma-client-py's synchronous `subprocess.Popen.wait()` on the query engine subprocess. That call does not yield to the asyncio event loop, so the loop freezes for as long as the Rust engine takes to shut down (30-120+ seconds in production when the engine is stuck on TCP close operations). During the freeze no coroutines run, including `/health/liveliness`. In Kubernetes the liveness probe fails and the kubelet SIGKILLs the pod. The surrounding `asyncio.wait_for()` does not help because it can only cancel at `await` points, and `process.wait()` never yields.

Fix
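The failure mode above can be reproduced with a toy coroutine, independent of Prisma: a synchronous `time.sleep()` stands in for `Popen.wait()`, and `asyncio.wait_for` cannot interrupt it (a minimal sketch, not litellm code):

```python
import asyncio
import time

async def blocking_disconnect():
    # Stand-in for prisma-client-py's synchronous Popen.wait():
    # time.sleep() never yields control back to the event loop.
    time.sleep(1.0)

async def main() -> float:
    start = time.monotonic()
    try:
        # The 0.1s timeout cannot fire while the loop is frozen:
        # wait_for can only cancel a coroutine at an await point,
        # and blocking_disconnect() never reaches one.
        await asyncio.wait_for(blocking_disconnect(), timeout=0.1)
    except asyncio.TimeoutError:
        pass  # may not even be raised; the task can finish before the cancel lands
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = asyncio.run(main())
    print(f"loop was frozen for ~{elapsed:.1f}s despite a 0.1s timeout")
```

During the frozen second, no other coroutine (including a health-check handler) gets scheduled, which is exactly why the liveness probe times out in production.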
Replace `disconnect()` in both reconnect paths (`recreate_prisma_client` and the formerly-separate "direct reconnect") with a direct, non-blocking kill of the engine subprocess: SIGTERM -> `await asyncio.sleep(0.5)` -> SIGKILL. A fresh `Prisma()` client is then constructed and `connect()`ed. The direct-reconnect path now delegates to `recreate_prisma_client`, so both engine-alive and engine-dead paths use the same kill-then-recreate flow. `_get_engine_pid` is hardened to only return values that are actually integers, so any future callers that depend on numeric comparisons behave sanely under unit-test mocks as well.

Testing
Verified end-to-end against a local proxy with Postgres in Docker, using `docker pause` to simulate an unresponsive database; `/health/liveliness` latency was compared against `main` under a prod-like slow engine close (5s injected).

After the simulated outage ends, `/health/readiness` returns `db: "connected"` and `/key/list` returns live rows — reconnect works end-to-end.

Unit tests in `tests/test_litellm/proxy/db/test_prisma_self_heal.py` and `tests/litellm/proxy/test_prisma_engine_watchdog.py` are updated to reflect the new code path. The removed `test_lightweight_reconnect_skips_kill_on_successful_disconnect` encoded the old "preserve the engine on successful disconnect" invariant that was itself part of the bug — prisma-client-py's `aclose()` kills the engine regardless. All 40 tests in the two files pass; the broader proxy suite has no new failures (the 5 failing tests fail identically on `main`).

Type
🐛 Bug Fix
✅ Test
Fixes LIT-2613