Pgsql fix: Add TCP keepalives so dead connections are detected in ~80s #231
Merged
Azure PostgreSQL silently drops idle TCP connections without sending RST. With Linux's default 2h tcp_keepalive_time, pg_dump hangs on a read for ~2 hours before failing, which exhausts retries and delays downstream jobs. Pass keepalive params via libpq conninfo (keepalives_idle=30, keepalives_interval=10, keepalives_count=5) so dead sockets are detected in ~80s and the retry loop can actually make progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
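For reference, the ~80s figure follows from the three keepalive values: the kernel waits for the idle period, then sends probes at the configured interval until the probe count is exhausted. A minimal sketch of that arithmetic is below. The function name is hypothetical and not part of this PR, and the second call assumes the stock Linux values for `tcp_keepalive_intvl` (75) and `tcp_keepalive_probes` (9), which the PR text does not state.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate time for the kernel to declare an idle, dead socket.

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    count:    unanswered probes tolerated before the socket is reset
    """
    return idle + interval * count


# Values passed via conninfo in this PR: 30 + 10 * 5
print(keepalive_detection_seconds(30, 10, 5))

# Assumed stock Linux defaults (7200 / 75 / 9), roughly the "~2 hours" above
print(keepalive_detection_seconds(7200, 75, 9))
```

The estimate is an upper bound for a connection that is already idle when the peer disappears; a socket with unacknowledged outbound data is governed by retransmission timeouts instead.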
CI Results
Tool Smoke Tests
8 passed, 0 failed
Last run: 2026-04-22 04:50:28 UTC
Problem
Azure PostgreSQL (and its proxy layer) can drop idle TCP connections without sending RST. With Linux's default `tcp_keepalive_time = 7200`, `pg_dump` sits on a read syscall for ~2 hours before the kernel notices the socket is dead. Meanwhile, retries are exhausted and jobs downstream of `pgsql` never get a chance to start in a reasonable window.

Change
Pass libpq TCP keepalive parameters via a conninfo string to both `psql` (database listing) and `pg_dump`:
- `keepalives=1`: enable TCP keepalive on the socket.
- `keepalives_idle=30`: start probing after 30s idle.
- `keepalives_interval=10`: probe every 10s.
- `keepalives_count=5`: give up after 5 failed probes.

Net effect: a dead socket is detected in ~80s instead of ~2h. The existing retry loop (`db_retries`, default 3) can then actually retry within a useful time budget.

Short-lived connectivity checks in `config.py` (`verify_pgsql`, etc.) use `PGCONNECT_TIMEOUT=5` and don't need keepalives, so they are left unchanged to keep the diff minimal.

Test plan
- `pytest tests/test_runners.py tests/test_unit.py`: 210 passed, 7 skipped
- `ruff check clouddump tests`: clean
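
For readers unfamiliar with libpq conninfo strings, the sketch below shows one way the keepalive parameters from the Change section could be appended to a connection string and handed to `pg_dump`. It is an illustration under stated assumptions, not the code in this diff: `build_conninfo`, `quote_conninfo_value`, `pg_dump_argv`, and the example host, user, and database names are all hypothetical. Only the four `keepalives*` keys and their values come from the PR.

```python
# Keepalive settings taken from the PR description.
KEEPALIVE_PARAMS = {
    "keepalives": 1,            # enable TCP keepalive on the socket
    "keepalives_idle": 30,      # start probing after 30s idle
    "keepalives_interval": 10,  # probe every 10s
    "keepalives_count": 5,      # give up after 5 failed probes
}


def quote_conninfo_value(value) -> str:
    """Single-quote a value for a libpq keyword/value connection string.

    libpq requires backslashes and single quotes inside a quoted value
    to be escaped with a backslash.
    """
    text = str(value)
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_conninfo(host: str, port: int, user: str, dbname: str) -> str:
    """Hypothetical helper: join connection settings and keepalive params."""
    params = {
        "host": host,
        "port": port,
        "user": user,
        "dbname": dbname,
        **KEEPALIVE_PARAMS,
    }
    return " ".join(f"{key}={quote_conninfo_value(val)}" for key, val in params.items())


def pg_dump_argv(conninfo: str) -> list[str]:
    """Hypothetical helper: pg_dump accepts a conninfo string as --dbname."""
    return ["pg_dump", "--dbname", conninfo]


if __name__ == "__main__":
    # Placeholder connection details, for illustration only.
    conninfo = build_conninfo("db.example.com", 5432, "backup", "app")
    print(conninfo)
    print(pg_dump_argv(conninfo))
```

Quoting every value keeps the builder simple and stays valid for hosts or database names that contain spaces. The argument list is meant to be passed to `subprocess.run` without a shell, so no additional shell escaping is needed.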