
fix(s3): add retry with exponential backoff for transient S3 503/500 errors #25530

Merged
krrish-berri-2 merged 3 commits into BerriAI:litellm_internal_staging_04_11_2026 from jimmychen-p72:fix/s3-upload-retry-on-503
Apr 11, 2026

Conversation

@jimmychen-p72
Contributor

Summary

Add exponential backoff retry (3 attempts, 1s/2s delays) for transient S3 500/503 responses in the s3_v2 callback logger.

Problem

S3Logger.async_upload_data_to_s3 and upload_data_to_s3 make a single PUT request via httpx. Unlike boto3, httpx has no built-in retry for transient S3 errors. When S3 returns a 503 Slow Down (expected at scale when request rates spike above partition limits), the upload fails permanently and the request's audit/logging data is lost.

In our production environment (~1.4M requests/day), we observed ~18 permanent S3 upload failures per day (124 over 7 days) — all transient 503s that would have succeeded on a single retry.

Changes

  • litellm/integrations/s3_v2.py: Wrap the S3 PUT in both async_upload_data_to_s3 and upload_data_to_s3 with a retry loop (max 3 attempts) for HTTP 500/503 responses
  • Exponential backoff: 1s, 2s (matches AWS SDK retry behavior)
  • Logs a warning on each retry with the S3 object key for observability
  • No behavioral change for non-retryable errors (4xx, other 5xx)
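The retry shape described above can be sketched as follows. This is an illustrative sketch only, assuming an httpx-style async client; the function name `put_with_retry` and the pre-built `url`/`headers`/`body` arguments are stand-ins, not the PR's actual code:

```python
import asyncio
import logging

RETRYABLE_STATUS_CODES = {500, 503}
MAX_ATTEMPTS = 3

logger = logging.getLogger(__name__)

async def put_with_retry(client, url, headers, body, object_key):
    """PUT with exponential backoff (1s, 2s) for transient S3 500/503 responses."""
    for attempt in range(MAX_ATTEMPTS):
        response = await client.put(url, headers=headers, content=body)
        if response.status_code not in RETRYABLE_STATUS_CODES:
            break  # success or non-retryable error: fall through to raise_for_status
        if attempt < MAX_ATTEMPTS - 1:
            delay = 2 ** attempt  # 1s after the first failure, 2s after the second
            logger.warning(
                "S3 upload got %s for key=%s, retrying in %ss (attempt %s/%s)",
                response.status_code, object_key, delay, attempt + 1, MAX_ATTEMPTS,
            )
            await asyncio.sleep(delay)
    # On exhaustion the last (failed) response falls through here and raises,
    # so existing error handling (e.g. handle_callback_failure) still fires.
    response.raise_for_status()
    return response
```

Note the loop never swallows the final failure: after the last retryable attempt, `raise_for_status()` raises as before, which is what keeps behavior unchanged for non-retryable errors.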

Tests

Added 5 unit tests in tests/test_litellm/integrations/test_s3_v2.py:

  • test_async_upload_retries_on_s3_503 — async retry succeeds on second attempt
  • test_async_upload_retries_on_s3_500 — async retry on 500
  • test_async_upload_exhausts_retries_on_persistent_503 — calls handle_callback_failure after 3 failed attempts
  • test_async_upload_no_retry_on_4xx — no retry for client errors (403)
  • test_sync_upload_retries_on_s3_503 — sync method retry

All 28 tests in test_s3_v2.py pass.
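The mock-based pattern such tests typically use can be sketched as below. This is illustrative only — the inline `upload_with_retry` helper is a minimal stand-in for the real S3Logger method, not the PR's actual test code. Patching `asyncio.sleep` keeps the backoff from adding real delay to the test run:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock, patch

async def upload_with_retry(client):
    # Minimal stand-in for the retry loop under test.
    for attempt in range(3):
        response = await client.put("https://bucket.s3.amazonaws.com/key")
        if response.status_code not in (500, 503):
            break
        if attempt < 2:
            await asyncio.sleep(2 ** attempt)
    response.raise_for_status()
    return response

def test_retries_on_503_then_succeeds():
    ok = MagicMock(status_code=200)
    ok.raise_for_status = MagicMock()
    slow_down = MagicMock(status_code=503)
    client = MagicMock()
    client.put = AsyncMock(side_effect=[slow_down, ok])  # 503 first, then 200
    with patch("asyncio.sleep", new=AsyncMock()) as fake_sleep:
        result = asyncio.run(upload_with_retry(client))
    assert result.status_code == 200
    assert client.put.await_count == 2        # retried exactly once
    fake_sleep.assert_awaited_once_with(1)    # first backoff delay is 2**0 = 1s
```

Using `AsyncMock(side_effect=[...])` is what lets a single test cover "fails, then succeeds on retry" without any real network calls.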

…errors

S3 occasionally returns 503 "Slow Down" during PUT operations when
request rates spike above partition limits. The current code makes a
single upload attempt via httpx — unlike boto3, httpx has no built-in
retry for transient S3 errors. Failed uploads permanently lose the
request's audit/logging data.

Add exponential backoff retry (3 attempts, 1s/2s delays) for S3
500/503 responses in both async_upload_data_to_s3 and
upload_data_to_s3. Logs a warning on each retry with the S3 object
key for observability.

In production we observed ~18 permanent S3 upload failures per day
(124 over 7 days) — all transient 503s that would have succeeded on
a single retry.
Tests cover:
- Async retry on 503 (succeeds on second attempt)
- Async retry on 500
- Exhausted retries on persistent 503 (calls handle_callback_failure)
- No retry on 4xx errors (403)
- Sync retry on 503
@vercel

vercel bot commented Apr 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project   Deployment   Actions            Updated (UTC)
litellm   Ready        Preview, Comment   Apr 10, 2026 8:49pm


@greptile-apps
Contributor

greptile-apps bot commented Apr 10, 2026

Greptile Summary

This PR adds exponential backoff retry logic (up to 3 attempts, 1 s / 2 s delays) to both the async and sync S3 PUT paths in the s3_v2 logger, guarding against transient 500/503 responses. The previously flagged inline import time has since been moved to module level in a follow-up commit.

Confidence Score: 5/5

Safe to merge — retry logic is correct, all edge cases are tested with mocks, and the previously flagged inline import has been resolved.

No P0 or P1 findings remain. The retry loop correctly handles the exhaustion case by falling through to raise_for_status on the final attempt. SigV4 signatures are signed once before the loop but remain valid well within the 1–3 s retry window. The five new tests use only mocks, satisfying the no-real-network-calls rule. The import time issue from the prior review is now at module level.

No files require special attention.

Important Files Changed

Filename Overview
litellm/integrations/s3_v2.py Retry loop (max 3, backoff 1s/2s) added to both async_upload_data_to_s3 and upload_data_to_s3 for 500/503 responses; import time moved to module level.
tests/test_litellm/integrations/test_s3_v2.py Five new unit tests cover: async 503 retry, async 500 retry, retry exhaustion, no-retry on 4xx, and sync 503 retry — all using mocks with no real network calls.

Sequence Diagram

sequenceDiagram
    participant C as Caller (async_send_batch)
    participant S as S3Logger
    participant H as httpx client
    participant S3 as AWS S3

    C->>S: async_upload_data_to_s3(element)
    S->>S: Sign request (SigV4, once)
    loop attempt 0..2
        S->>H: PUT /bucket/key
        H->>S3: HTTP PUT
        S3-->>H: 503 / 500
        H-->>S: response (503/500)
        alt attempt < 2
            S->>S: asyncio.sleep(2^attempt)
        else attempt == 2
            S->>S: raise_for_status() → Exception
            S->>S: handle_callback_failure()
        end
    end
    S3-->>H: 200 OK (on successful retry)
    H-->>S: response (200)
    S->>S: raise_for_status() → OK
    S-->>C: return

Reviews (2): Last reviewed commit: "style(s3): move time import to module le..."

Comment thread litellm/integrations/s3_v2.py Outdated
Address review feedback: move `import time` from inside
upload_data_to_s3 to the top-level imports per project style guide.

@krrish-berri-2 krrish-berri-2 changed the base branch from main to litellm_internal_staging_04_11_2026 April 11, 2026 16:38
@krrish-berri-2 krrish-berri-2 merged commit 2fe615b into BerriAI:litellm_internal_staging_04_11_2026 Apr 11, 2026
35 of 37 checks passed
@codspeed-hq
Contributor

codspeed-hq bot commented Apr 11, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing jimmychen-p72:fix/s3-upload-retry-on-503 (3ecfe38) with litellm_internal_staging_04_11_2026 (2fe615b) [1]

Open in CodSpeed

Footnotes

[1] No successful run was found on litellm_internal_staging_04_11_2026 (9e4352a) during the generation of this report, so bf6ea8d was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@codecov

codecov bot commented Apr 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



2 participants