fix(s3): add retry with exponential backoff for transient S3 503/500 errors #25530
Conversation
S3 occasionally returns 503 "Slow Down" during PUT operations when request rates spike above partition limits. The current code makes a single upload attempt via httpx; unlike boto3, httpx has no built-in retry for transient S3 errors, so a failed upload permanently loses the request's audit/logging data. This PR adds exponential backoff retry (3 attempts, 1s/2s delays) for S3 500/503 responses in both `async_upload_data_to_s3` and `upload_data_to_s3`. Each retry logs a warning that includes the S3 object key for observability. In production we observed ~18 permanent S3 upload failures per day (124 over 7 days), all transient 503s that would have succeeded on a single retry.
Tests cover:
- Async retry on 503 (succeeds on second attempt)
- Async retry on 500
- Exhausted retries on persistent 503 (calls `handle_callback_failure`)
- No retry on 4xx errors (403)
- Sync retry on 503
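The async retry behavior described above can be sketched as follows. This is a minimal illustration of the pattern (3 attempts, 1s/2s backoff, retry only on 500/503), not litellm's actual code: `upload_with_retry`, `put_object`, and `FakeResponse` are hypothetical names, and the injectable `sleep` exists only to make the sketch testable without real delays.

```python
import asyncio

MAX_ATTEMPTS = 3
RETRYABLE = {500, 503}

class FakeResponse:
    """Stand-in for an httpx.Response with just the pieces the sketch needs."""
    def __init__(self, status_code):
        self.status_code = status_code
    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

async def upload_with_retry(put_object, key, sleep=asyncio.sleep):
    """PUT via `put_object(key)` up to 3 times, backing off 2**attempt seconds."""
    for attempt in range(MAX_ATTEMPTS):
        response = await put_object(key)
        if response.status_code in RETRYABLE and attempt < MAX_ATTEMPTS - 1:
            # transient S3 error: warn with the object key, wait 1s then 2s
            print(f"retrying S3 upload, key={key}, attempt={attempt + 1}")
            await sleep(2 ** attempt)
            continue
        # last attempt falls through here, so a persistent 503 raises
        response.raise_for_status()
        return response

# demo: first attempt returns 503, second succeeds
async def main():
    calls, delays = [], []
    async def flaky_put(key):
        calls.append(key)
        return FakeResponse(503 if len(calls) == 1 else 200)
    async def fake_sleep(seconds):
        delays.append(seconds)
    resp = await upload_with_retry(flaky_put, "logs/req-1.json", sleep=fake_sleep)
    return len(calls), delays, resp.status_code

print(asyncio.run(main()))
```

Note how the exhaustion case needs no separate branch: on the final attempt the retry condition is false, so `raise_for_status()` runs and surfaces the error to the caller.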
Greptile Summary
This PR adds exponential backoff retry logic (up to 3 attempts, 1 s / 2 s delays) to both the async and sync S3 PUT paths in the `s3_v2` integration.
Confidence Score: 5/5
Safe to merge: retry logic is correct, all edge cases are tested with mocks, and the previously flagged inline import has been resolved. No P0 or P1 findings remain. The retry loop correctly handles the exhaustion case by falling through to `raise_for_status` on the final attempt. SigV4 signatures are signed once before the loop but remain valid well within the 1–3 s retry window. The five new tests use only mocks, satisfying the no-real-network-calls rule. The `import time` issue from the prior review is now resolved at module level. No files require special attention.
| Filename | Overview |
|---|---|
| litellm/integrations/s3_v2.py | Retry loop (max 3, backoff 1s/2s) added to both async_upload_data_to_s3 and upload_data_to_s3 for 500/503 responses; import time moved to module level. |
| tests/test_litellm/integrations/test_s3_v2.py | Five new unit tests cover: async 503 retry, async 500 retry, retry exhaustion, no-retry on 4xx, and sync 503 retry — all using mocks with no real network calls. |
Sequence Diagram
```mermaid
sequenceDiagram
    participant C as Caller (async_send_batch)
    participant S as S3Logger
    participant H as httpx client
    participant S3 as AWS S3
    C->>S: async_upload_data_to_s3(element)
    S->>S: Sign request (SigV4, once)
    loop attempt 0..2
        S->>H: PUT /bucket/key
        H->>S3: HTTP PUT
        S3-->>H: 503 / 500
        H-->>S: response (503/500)
        alt attempt < 2
            S->>S: asyncio.sleep(2^attempt)
        else attempt == 2
            S->>S: raise_for_status() → Exception
            S->>S: handle_callback_failure()
        end
    end
    S3-->>H: 200 OK (on successful retry)
    H-->>S: response (200)
    S->>S: raise_for_status() → OK
    S-->>C: return
```
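The diagram's exhaustion branch (final `raise_for_status()` raises, then `handle_callback_failure()` runs) can be sketched in the sync path like this. All names here are illustrative, not litellm's actual API; in particular, `on_failure` stands in for `handle_callback_failure`, and whether the real handler swallows or re-raises is an assumption of this sketch.

```python
import time

class SyncResp:
    """Minimal stand-in for an httpx.Response."""
    def __init__(self, status_code):
        self.status_code = status_code
    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

def upload_sync_with_retry(put_object, key, on_failure,
                           sleep=time.sleep, max_attempts=3):
    """Sync mirror of the diagram: retry 500/503 with 1s/2s backoff;
    on the final attempt raise_for_status() raises and on_failure() runs."""
    try:
        for attempt in range(max_attempts):
            response = put_object(key)
            if response.status_code in (500, 503) and attempt < max_attempts - 1:
                sleep(2 ** attempt)  # 1 s, then 2 s
                continue
            response.raise_for_status()  # persistent failure raises here
            return response
    except Exception:
        on_failure()  # stand-in for handle_callback_failure(); assumed to swallow
        return None
```

Note that `time` is imported at module level, matching the style fix requested in review.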
Reviews (2): Last reviewed commit: "style(s3): move time import to module le..."
Address review feedback: move `import time` from inside upload_data_to_s3 to the top-level imports per project style guide.
Merged commit 2fe615b into BerriAI:litellm_internal_staging_04_11_2026
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Summary
Add exponential backoff retry (3 attempts, 1s/2s delays) for transient S3 500/503 responses in the `s3_v2` callback logger.

Problem
`S3Logger.async_upload_data_to_s3` and `upload_data_to_s3` make a single PUT request via httpx. Unlike boto3, httpx has no built-in retry for transient S3 errors. When S3 returns a 503 Slow Down (expected at scale when request rates spike above partition limits), the upload fails permanently and the request's audit/logging data is lost.

In our production environment (~1.4M requests/day), we observed ~18 permanent S3 upload failures per day (124 over 7 days), all transient 503s that would have succeeded on a single retry.

Changes
- `litellm/integrations/s3_v2.py`: wrap the S3 PUT in both `async_upload_data_to_s3` and `upload_data_to_s3` with a retry loop (max 3 attempts) for HTTP 500/503 responses

Tests
Added 5 unit tests in `tests/test_litellm/integrations/test_s3_v2.py`:
- `test_async_upload_retries_on_s3_503`: async retry succeeds on second attempt
- `test_async_upload_retries_on_s3_500`: async retry on 500
- `test_async_upload_exhausts_retries_on_persistent_503`: calls `handle_callback_failure` after 3 failed attempts
- `test_async_upload_no_retry_on_4xx`: no retry for client errors (403)
- `test_sync_upload_retries_on_s3_503`: sync method retry

All 28 tests in `test_s3_v2.py` pass.
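The no-real-network-calls testing style noted in the review can be sketched with `unittest.mock`: patch the HTTP client so `put` never leaves the process, then assert on call counts. `SimpleUploader` below is a hypothetical stand-in for the logger under test, not litellm's `S3Logger`; the real tests mock httpx directly.

```python
from unittest.mock import MagicMock

class SimpleUploader:
    """Illustrative uploader with the same retry shape: 3 attempts, 500/503 only."""
    def __init__(self, client, sleep=lambda s: None):
        self.client = client
        self.sleep = sleep
    def upload(self, key):
        for attempt in range(3):
            resp = self.client.put(key)
            if resp.status_code in (500, 503) and attempt < 2:
                self.sleep(2 ** attempt)
                continue
            return resp

def test_no_retry_on_403():
    client = MagicMock()
    client.put.return_value = MagicMock(status_code=403)
    SimpleUploader(client).upload("k")
    assert client.put.call_count == 1  # client errors are not retried

def test_retry_on_503_then_200():
    client = MagicMock()
    client.put.side_effect = [MagicMock(status_code=503),
                              MagicMock(status_code=200)]
    resp = SimpleUploader(client).upload("k")
    assert client.put.call_count == 2  # one retry, then success
    assert resp.status_code == 200

test_no_retry_on_403()
test_retry_on_503_then_200()
```

Injecting a no-op `sleep` keeps the retry tests fast: the backoff delays are asserted as values rather than actually waited on.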