[Fix] Proxy: reconnect Prisma DB without blocking the event loop#26225

Merged
yuneng-berri merged 4 commits into litellm_internal_staging from litellm_dbReconnectNonBlocking on Apr 29, 2026

Conversation


@yuneng-berri yuneng-berri commented Apr 22, 2026

Relevant issues

Summary

Failure Path (Before Fix)

When the database becomes unreachable the DB reconnect logic calls await self.db.disconnect(), which invokes prisma-client-py's synchronous subprocess.Popen.wait() on the query engine subprocess. That call does not yield to the asyncio event loop, so the loop freezes for as long as the Rust engine takes to shut down — 30-120+ seconds in production when the engine is stuck on TCP close operations.

During the freeze no coroutines run, including /health/liveliness. In Kubernetes the liveness probe fails and the kubelet SIGKILLs the pod. The surrounding asyncio.wait_for() does not help because it can only cancel at await points, and process.wait() never yields.
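The failure mode is easy to reproduce in isolation. The sketch below is a minimal stand-in (not litellm code): `time.sleep()` plays the role of the synchronous `Popen.wait()`, and it shows that `asyncio.wait_for()` cannot enforce its timeout against a coroutine that never reaches an await point.

```python
import asyncio
import time

async def blocking_disconnect():
    # Stand-in for prisma-client-py's synchronous Popen.wait():
    # time.sleep() holds the event loop without ever yielding.
    time.sleep(2)

async def main() -> float:
    start = time.monotonic()
    try:
        # The 0.5 s timeout cannot fire mid-call: cancellation is only
        # delivered at an await point, and this coroutine has none.
        await asyncio.wait_for(blocking_disconnect(), timeout=0.5)
    except asyncio.TimeoutError:
        pass
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = asyncio.run(main())
    # elapsed is ~2 s, not 0.5 s: every other coroutine (including a
    # /health/liveliness handler) was frozen for the full duration.
    print(f"blocked for {elapsed:.2f}s")
```

With the sleep bumped to 30-120 s, this is exactly the window in which the liveness probe times out.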

Fix

Replace disconnect() in both reconnect paths (recreate_prisma_client and the formerly-separate "direct reconnect") with a direct, non-blocking kill of the engine subprocess: SIGTERM -> await asyncio.sleep(0.5) -> SIGKILL. A fresh Prisma() client is then constructed and connect() is awaited on it. The direct-reconnect path now delegates to recreate_prisma_client, so the engine-alive and engine-dead paths share the same kill-then-recreate flow.

_get_engine_pid is hardened to only return values that are actually integers, so any future callers that depend on numeric comparisons behave sanely under unit-test mocks as well.

Testing

Verified end-to-end against a local proxy with Postgres in Docker, using docker pause to simulate an unresponsive database:

| Condition | max /health/liveliness latency | 2xx |
| --- | --- | --- |
| Current main, prod-like slow close (5s injected) | 10006 ms (probe timeout) | 99.7% |
| This PR, same slow close injected | 52.7 ms | 100% |
| This PR, natural run (no injection) | 78.8 ms | 100% |

After the simulated outage ends, /health/readiness returns db:"connected" and /key/list returns live rows — reconnect works end-to-end.

Unit tests in tests/test_litellm/proxy/db/test_prisma_self_heal.py and tests/litellm/proxy/test_prisma_engine_watchdog.py are updated to reflect the new code path. The removed test_lightweight_reconnect_skips_kill_on_successful_disconnect encoded the old "preserve the engine on successful disconnect" invariant that was itself part of the bug — prisma-client-py's aclose() kills the engine regardless. All 40 tests in the two files pass; the broader proxy suite has no new failures (the 5 failing tests fail identically on main).

Type

🐛 Bug Fix
✅ Test

Fixes LIT-2613

When the DB becomes unreachable the reconnect path calls
`prisma.disconnect()`, which ultimately invokes prisma-client-py's
synchronous `subprocess.Popen.wait()` on the query engine subprocess.
That call does not yield to asyncio, so the event loop freezes for
however long the Rust engine takes to shut down (30-120+ seconds in
production when the engine is stuck on TCP close). During the freeze
`/health/liveliness` becomes unresponsive, and in Kubernetes the
liveness probe fails and the pod is SIGKILL'd.

Replace `disconnect()` in the reconnect paths with a direct, non-blocking
kill of the engine subprocess (SIGTERM -> 0.5s asyncio-yielding sleep ->
SIGKILL) followed by a fresh Prisma client and a new `connect()`. Both
`recreate_prisma_client` and the formerly-separate "direct reconnect"
path go through the same kill-then-recreate flow.

Also validate `_get_engine_pid` returns an int (defensive; prevents a
MagicMock leak under unit-test mocking).

Tests that encoded the old blocking behavior are updated or removed;
the deleted `test_lightweight_reconnect_skips_kill_on_successful_disconnect`
invariant ("don't kill on successful disconnect") was part of the bug.

CLAassistant commented Apr 22, 2026

CLA assistant check
All committers have signed the CLA.


greptile-apps Bot commented Apr 22, 2026

Greptile Summary

Replaces the blocking disconnect() call in both Prisma reconnect paths with a direct, non-blocking SIGTERM→SIGKILL sequence on the engine subprocess PID, preventing the asyncio event loop from freezing for 30–120 s and causing Kubernetes liveness-probe failures. The "direct reconnect" path is unified with the "heavy reconnect" path through recreate_prisma_client, and the engine watcher lifecycle is now properly managed on both paths. Test updates accurately reflect the new behavior and do not weaken coverage.

Confidence Score: 5/5

Safe to merge; the fix is well-reasoned, well-tested, and the only finding is a minor path-consistency P2.

All four changed files have correct logic. No security issues, no data-loss risk, no backwards-incompatible API changes. The one P2 comment (engine watcher left unguarded on a mid-reconnect timeout cancellation) is a low-probability edge case that self-heals on the next retry cycle. P2s do not lower the score.

No files require special attention; litellm/proxy/utils.py has the minor P2 watcher-consistency note.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/db/prisma_client.py | Core fix: recreate_prisma_client now kills the engine via SIGTERM→SIGKILL instead of calling blocking disconnect(). _get_engine_pid hardened with isinstance(pid, int) to prevent mock objects from leaking into os.kill. Docstring and log message updated to match the new semantics. |
| litellm/proxy/utils.py | The "direct reconnect" path is refactored to delegate to recreate_prisma_client (same as the heavy path) instead of calling disconnect() + connect(). Now also properly manages the engine watcher lifecycle (_cleanup_engine_watcher / _start_engine_watcher) on the direct path. |
| tests/litellm/proxy/test_prisma_engine_watchdog.py | Tests updated to reflect the unified reconnect path: assert recreate_prisma_client is called and disconnect is not. DATABASE_URL patched via patch.dict(os.environ) and _start_engine_watcher mocked in all affected tests. |
| tests/test_litellm/proxy/db/test_prisma_self_heal.py | test_lightweight_reconnect_skips_kill_on_successful_disconnect removed (it tested the blocking behavior that was the bug itself), replaced by test_recreate_prisma_client_kills_old_engine_without_disconnect, which asserts os.kill(SIGTERM) is called and disconnect is never awaited. |



codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


… path

Greptile review on #26225 (P2): the docstring said "Called when disconnect()
fails", and the SIGTERM warning log read "after failed disconnect", but
both were stale — `_kill_engine_process` is now invoked on every routine
reconnect (via the unified `recreate_prisma_client` path), not as a
disconnect-failure recovery branch. The misleading wording would have
produced confusing log lines on every reconnect cycle in production.

Update the docstring to explain the actual reason (avoiding the blocking
`disconnect()` event-loop freeze) and reword the SIGTERM warning to "during
reconnect" so it matches reality.

No behavior change; logs only.
@yuneng-berri
Collaborator Author

@greptile

…itellm_dbReconnectNonBlocking_local

# Conflicts:
#	tests/test_litellm/proxy/db/test_prisma_self_heal.py
@yuneng-berri yuneng-berri merged commit fc0cc9c into litellm_internal_staging Apr 29, 2026
115 checks passed
@yuneng-berri yuneng-berri deleted the litellm_dbReconnectNonBlocking branch April 29, 2026 23:09
yuneng-berri added a commit that referenced this pull request May 7, 2026