[Fix] Proxy: reconnect Prisma DB without blocking the event loop#26225

Merged
yuneng-berri merged 4 commits into litellm_internal_staging from litellm_dbReconnectNonBlocking on Apr 29, 2026

Conversation


@yuneng-berri yuneng-berri commented Apr 22, 2026

Relevant issues

Summary

Failure Path (Before Fix)

When the database becomes unreachable the DB reconnect logic calls await self.db.disconnect(), which invokes prisma-client-py's synchronous subprocess.Popen.wait() on the query engine subprocess. That call does not yield to the asyncio event loop, so the loop freezes for as long as the Rust engine takes to shut down — 30-120+ seconds in production when the engine is stuck on TCP close operations.

During the freeze no coroutines run, including /health/liveliness. In Kubernetes the liveness probe fails and the kubelet SIGKILLs the pod. The surrounding asyncio.wait_for() does not help because it can only cancel at await points, and process.wait() never yields.
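The failure mode is easy to reproduce in isolation. The sketch below is a minimal stand-in (not litellm code): `time.sleep()` plays the role of the synchronous `Popen.wait()`, and it shows that `asyncio.wait_for()` cannot enforce its timeout against a coroutine that never reaches an await point.

```python
import asyncio
import time

async def blocking_disconnect():
    # Stand-in for prisma-client-py's synchronous Popen.wait():
    # time.sleep() holds the event loop without ever yielding.
    time.sleep(2)

async def main() -> float:
    start = time.monotonic()
    try:
        # The 0.5 s timeout cannot fire mid-call: cancellation is only
        # delivered at an await point, and this coroutine has none.
        await asyncio.wait_for(blocking_disconnect(), timeout=0.5)
    except asyncio.TimeoutError:
        pass
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = asyncio.run(main())
    # elapsed is ~2 s, not 0.5 s: every other coroutine (including a
    # /health/liveliness handler) was frozen for the full duration.
    print(f"blocked for {elapsed:.2f}s")
```

With the sleep bumped to 30-120 s, this is exactly the window in which the liveness probe times out.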

Fix

Replace disconnect() in both reconnect paths (recreate_prisma_client and the formerly-separate "direct reconnect") with a direct, non-blocking kill of the engine subprocess: SIGTERM -> await asyncio.sleep(0.5) -> SIGKILL. A fresh Prisma() client is then constructed and connect() is awaited on it. The direct-reconnect path now delegates to recreate_prisma_client, so the engine-alive and engine-dead paths share the same kill-then-recreate flow.

_get_engine_pid is hardened to only return values that are actually integers, so any future callers that depend on numeric comparisons behave sanely under unit-test mocks as well.

Testing

Verified end-to-end against a local proxy with Postgres in Docker, using docker pause to simulate an unresponsive database:

| Condition | max /health/liveliness latency | 2xx |
| --- | --- | --- |
| Current main, prod-like slow close (5s injected) | 10006 ms (probe timeout) | 99.7% |
| This PR, same slow close injected | 52.7 ms | 100% |
| This PR, natural run (no injection) | 78.8 ms | 100% |

After the simulated outage ends, /health/readiness returns db:"connected" and /key/list returns live rows — reconnect works end-to-end.

Unit tests in tests/test_litellm/proxy/db/test_prisma_self_heal.py and tests/litellm/proxy/test_prisma_engine_watchdog.py are updated to reflect the new code path. The removed test_lightweight_reconnect_skips_kill_on_successful_disconnect encoded the old "preserve the engine on successful disconnect" invariant that was itself part of the bug — prisma-client-py's aclose() kills the engine regardless. All 40 tests in the two files pass; the broader proxy suite has no new failures (the 5 failing tests fail identically on main).

Type

🐛 Bug Fix
✅ Test

Fixes LIT-2613

When the DB becomes unreachable the reconnect path calls
`prisma.disconnect()`, which ultimately invokes prisma-client-py's
synchronous `subprocess.Popen.wait()` on the query engine subprocess.
That call does not yield to asyncio, so the event loop freezes for
however long the Rust engine takes to shut down (30-120+ seconds in
production when the engine is stuck on TCP close). During the freeze
`/health/liveliness` becomes unresponsive, and in Kubernetes the
liveness probe fails and the pod is SIGKILL'd.

Replace `disconnect()` in the reconnect paths with a direct, non-blocking
kill of the engine subprocess (SIGTERM -> 0.5s asyncio-yielding sleep ->
SIGKILL) followed by a fresh Prisma client and a new `connect()`. Both
`recreate_prisma_client` and the formerly-separate "direct reconnect"
path go through the same kill-then-recreate flow.

Also validate `_get_engine_pid` returns an int (defensive; prevents a
MagicMock leak under unit-test mocking).

Tests that encoded the old blocking behavior are updated or removed;
the deleted `test_lightweight_reconnect_skips_kill_on_successful_disconnect`
invariant ("don't kill on successful disconnect") was part of the bug.

CLAassistant commented Apr 22, 2026

CLA assistant check
All committers have signed the CLA.


greptile-apps Bot commented Apr 22, 2026

Greptile Summary

Replaces the blocking disconnect() call in both Prisma reconnect paths with a direct, non-blocking SIGTERM→SIGKILL sequence on the engine subprocess PID, preventing the asyncio event loop from freezing for 30–120 s and causing Kubernetes liveness-probe failures. The "direct reconnect" path is unified with the "heavy reconnect" path through recreate_prisma_client, and the engine watcher lifecycle is now properly managed on both paths. Test updates accurately reflect the new behavior and do not weaken coverage.

Confidence Score: 5/5

Safe to merge; the fix is well-reasoned, well-tested, and the only finding is a minor path-consistency P2.

All four changed files have correct logic. No security issues, no data-loss risk, no backwards-incompatible API changes. The one P2 comment (engine watcher left unguarded on a mid-reconnect timeout cancellation) is a low-probability edge case that self-heals on the next retry cycle. P2s do not lower the score.

No files require special attention; litellm/proxy/utils.py has the minor P2 watcher-consistency note.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/db/prisma_client.py | Core fix: recreate_prisma_client now kills the engine via SIGTERM→SIGKILL instead of calling blocking disconnect(). _get_engine_pid hardened with isinstance(pid, int) to prevent mock objects from leaking into os.kill. Docstring and log message updated to match the new semantics. |
| litellm/proxy/utils.py | The "direct reconnect" path is refactored to delegate to recreate_prisma_client (same as the heavy path) instead of calling disconnect() + connect(). Now also properly manages the engine watcher lifecycle (_cleanup_engine_watcher / _start_engine_watcher) on the direct path. |
| tests/litellm/proxy/test_prisma_engine_watchdog.py | Tests updated to reflect the unified reconnect path: assert recreate_prisma_client is called and disconnect is not. DATABASE_URL patched via patch.dict(os.environ) and _start_engine_watcher mocked in all affected tests. |
| tests/test_litellm/proxy/db/test_prisma_self_heal.py | test_lightweight_reconnect_skips_kill_on_successful_disconnect removed (it tested the blocking behavior that was the bug itself), replaced by test_recreate_prisma_client_kills_old_engine_without_disconnect, which asserts os.kill(SIGTERM) is called and disconnect is never awaited. |



codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


… path

Greptile review on #26225 (P2): the docstring said "Called when disconnect()
fails", and the SIGTERM warning log read "after failed disconnect", but
both were stale — `_kill_engine_process` is now invoked on every routine
reconnect (via the unified `recreate_prisma_client` path), not as a
disconnect-failure recovery branch. The misleading wording would have
produced confusing log lines on every reconnect cycle in production.

Update the docstring to explain the actual reason (avoiding the blocking
`disconnect()` event-loop freeze) and reword the SIGTERM warning to "during
reconnect" so it matches reality.

No behavior change; logs only.
@yuneng-berri
Collaborator Author

@greptile

…itellm_dbReconnectNonBlocking_local

# Conflicts:
#	tests/test_litellm/proxy/db/test_prisma_self_heal.py
@yuneng-berri yuneng-berri merged commit fc0cc9c into litellm_internal_staging Apr 29, 2026
115 checks passed
@yuneng-berri yuneng-berri deleted the litellm_dbReconnectNonBlocking branch April 29, 2026 23:09
yuneng-berri added a commit that referenced this pull request May 7, 2026