
fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25227

Merged
ishaan-berri merged 2 commits into litellm_ishaan_april6 from fix/stale-object-cleanup-batch-limit
Apr 6, 2026

Conversation

@ishaan-berri
Contributor

Relevant issues

Closes #24451

What changed

  • Adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000, configurable via env var) to cap how many stale rows are marked per poll cycle
  • Replaces the unbounded update_many in _cleanup_stale_managed_objects with a single execute_raw using a subquery SELECT ... LIMIT — zero rows loaded into Python memory, one DB round-trip
  • Extracts _expire_stale_rows as a testable method
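The constant described above can be sketched as follows. This is a minimal illustration based on the pattern quoted in the Greptile summary below; the exact env-var name and clamping behavior are assumptions, not copied from `litellm/constants.py`.

```python
import os

# Batch cap for stale managed-object cleanup. Default 1000, overridable via
# env var; clamped to at least 1 so a misconfigured value of 0 or a negative
# number cannot disable cleanup entirely. (Env-var name is an assumption.)
STALE_OBJECT_CLEANUP_BATCH_SIZE = max(
    1, int(os.getenv("STALE_OBJECT_CLEANUP_BATCH_SIZE", "1000"))
)

print(STALE_OBJECT_CLEANUP_BATCH_SIZE)
```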

Pre-Submission checklist

  • My PR title starts with a Conventional Commit prefix (feat, fix, chore, refactor, etc.)
  • I have added at least 1 test to tests/litellm/
  • make test-unit passes locally

CI (LiteLLM team)

  • passes

Type

  • Bug Fix

Changes

  • litellm/constants.py — new STALE_OBJECT_CLEANUP_BATCH_SIZE constant
  • enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py — bounded SQL cleanup, extracted _expire_stale_rows

Test plan

  • E2E tested against real Neon Postgres DB: seeded 5 stale rows (backdated 30 days), called _expire_stale_rows(batch_size=3) three times — got 3, 2, 0 rows updated respectively
  • Confirmed batch LIMIT is enforced (never exceeded batch_size)
  • Confirmed terminal-state rows are not re-processed

Configurable batch limit (default 1000) for stale managed object cleanup,
preventing unbounded UPDATE queries from hitting 300K+ rows at once.
Two fixes to _cleanup_stale_managed_objects:

1. Replace unbounded update_many with a single execute_raw using a
   subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE
   rows. Zero rows loaded into Python memory — everything stays in Postgres.
   Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py
   (the proxy requires PostgreSQL per schema.prisma).

2. Extract _expire_stale_rows as a separate method for testability.

Keeps the file_purpose='response' filter to avoid incorrectly expiring
long-running batch or fine-tune jobs that legitimately exceed the
staleness cutoff.
@vercel

vercel Bot commented Apr 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Ready Ready Preview, Comment Apr 6, 2026 4:27pm


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codspeed-hq
Contributor

codspeed-hq Bot commented Apr 6, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing fix/stale-object-cleanup-batch-limit (dd0449c) with main (9088b46) [1]

Open in CodSpeed

Footnotes

  [1] No successful run was found on main (d251238) during the generation of this report, so 9088b46 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 6, 2026

Greptile Summary

This PR caps the previously unbounded stale-managed-object cleanup to prevent mass UPDATE statements (e.g. 300K rows) from overwhelming the database on large deployments. It adds a STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1,000, overridable via env var) and replaces the Prisma update_many in _cleanup_stale_managed_objects with a PostgreSQL execute_raw using a SELECT ... LIMIT subquery — the same pattern already used in spend_log_cleanup.py. The _expire_stale_rows method is extracted to make the DB call independently testable.

  • Adds STALE_OBJECT_CLEANUP_BATCH_SIZE = max(1, int(os.getenv(..., 1000))) in litellm/constants.py
  • Replaces update_many in stale-object cleanup with a bounded execute_raw UPDATE/SELECT subquery
  • Extracts _expire_stale_rows(cutoff, batch_size) for isolated testing
  • The existing unit tests were not updated — they still mock update_many for cleanup and assert on its call counts, but the new code uses execute_raw, causing the assertions to break across seven tests in test_check_responses_cost.py

Confidence Score: 4/5

The production fix is sound and follows established patterns (mirrors spend_log_cleanup.py), but the unchanged test suite contains incorrect assertions that will fail when the proxy unit tests are run.

The execute_raw UPDATE/SELECT subquery correctly solves the 300K-row problem and follows a pre-existing precedent. However, seven tests in test_check_responses_cost.py still assert update_many call counts and payloads for the stale-cleanup step. Because execute_raw is not mocked as an AsyncMock, the cleanup silently errors and invalidates call-count assumptions across all those tests. This is a verifiable, present defect in test coverage that should be resolved before merging.

tests/proxy_unit_tests/test_check_responses_cost.py — all tests that assert update_many.call_args_list positions need execute_raw mocked as AsyncMock and their call-count assertions updated.
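The test fix Greptile asks for can be sketched as below: mock `execute_raw` as an `AsyncMock` so the cleanup step awaits cleanly instead of silently erroring. The fixture shape (`prisma_client.db.execute_raw`) mirrors the review comment; the real test setup in `test_check_responses_cost.py` may wire the mock differently.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

# Mocked Prisma client: execute_raw must be an AsyncMock, since the new
# cleanup path awaits it directly instead of calling update_many.
prisma_client = MagicMock()
prisma_client.db.execute_raw = AsyncMock(return_value=3)  # rows updated

async def run_cleanup():
    # Stand-in for _expire_stale_rows: one bounded UPDATE round-trip.
    # (SQL text and arguments are placeholders for illustration.)
    return await prisma_client.db.execute_raw("UPDATE ...", "cutoff", 3)

rows = asyncio.run(run_cleanup())
assert rows == 3
prisma_client.db.execute_raw.assert_awaited_once()
```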

Important Files Changed

Filename Overview
enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py Switches stale-row cleanup from Prisma update_many to execute_raw with a bounded LIMIT subquery; the production fix is correct but the existing test suite was not updated and will fail.
litellm/constants.py Adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000, env-configurable) consistently with neighbouring constants.

Sequence Diagram

sequenceDiagram
    participant BG as Background Poller
    participant CRC as CheckResponsesCost
    participant DB as PostgreSQL (execute_raw)
    participant ORM as Prisma ORM (update_many)

    BG->>CRC: check_responses_cost()
    CRC->>CRC: _cleanup_stale_managed_objects()
    CRC->>CRC: _expire_stale_rows(cutoff, batch_size=1000)
    CRC->>DB: UPDATE LiteLLM_ManagedObjectTable SET status='stale_expired'<br/>WHERE id IN (SELECT id ... LIMIT batch_size)
    DB-->>CRC: rows_updated (int)
    CRC-->>BG: cleanup complete (exception swallowed on error)

    BG->>ORM: find_many(status in [queued,in_progress], take=MAX_OBJECTS_PER_POLL_CYCLE)
    ORM-->>BG: jobs[]

    loop For each job
        BG->>BG: litellm.aget_responses(response_id)
        BG-->>BG: response.status
    end

    BG->>ORM: update_many(id in completed_ids, status=completed)
    ORM-->>BG: done

Reviews (2): Last reviewed commit: "Batch-limit stale managed object cleanup..."

Comment on lines +49 to +64
return await self.prisma_client.db.execute_raw(
    """
    UPDATE "LiteLLM_ManagedObjectTable"
    SET "status" = 'stale_expired'
    WHERE "id" IN (
        SELECT "id" FROM "LiteLLM_ManagedObjectTable"
        WHERE "file_purpose" = 'response'
          AND "status" NOT IN ('completed', 'complete', 'failed', 'expired', 'cancelled', 'stale_expired')
          AND "created_at" < $1::timestamptz
        ORDER BY "created_at" ASC
        LIMIT $2
    )
    """,
    cutoff,
    batch_size,
)

P2 execute_raw violates the documented no-raw-SQL convention

CLAUDE.md explicitly states:

Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw. This avoids schema/client drift, keeps code testable with simple mocks, and matches patterns used in spend logs.

While the motivation here is valid (Prisma's update_many lacks LIMIT support), and the PR notes precedent from spend_log_cleanup.py, the use of execute_raw is the direct cause of the test breakage above and ties the implementation to PostgreSQL-specific syntax ($1::timestamptz, double-quoted identifiers). An alternative that stays within Prisma's API is a two-step approach: find_many with take=batch_size to get IDs, then update_many with id: {in: ids}. This loads at most batch_size IDs into Python memory (a small list), uses no raw SQL, and keeps tests simple.

Context Used: CLAUDE.md (source)
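The reviewer's two-step alternative can be illustrated with an in-memory stand-in: `find_many` with `take=batch_size` to fetch at most `batch_size` IDs, then `update_many` restricted to those IDs. `FakeTable`, `Row`, and the method signatures below are stand-ins for the Prisma client, purely for illustration; LiteLLM's real model and field names may differ.

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: int
    status: str

TERMINAL = {"completed", "complete", "failed", "expired", "cancelled", "stale_expired"}

class FakeTable:
    """In-memory stand-in for the Prisma model (illustration only)."""
    def __init__(self, rows):
        self.rows = rows

    def find_many(self, exclude_statuses, take):
        # mirrors find_many(where={"status": {"not_in": [...]}}, take=batch_size)
        return [r for r in self.rows if r.status not in exclude_statuses][:take]

    def update_many(self, ids, new_status):
        # mirrors update_many(where={"id": {"in": ids}}, data={"status": ...})
        hits = [r for r in self.rows if r.id in ids]
        for r in hits:
            r.status = new_status
        return len(hits)

def expire_stale_rows(table, batch_size):
    # Step 1: load at most batch_size IDs (a small Python list).
    stale = table.find_many(TERMINAL, take=batch_size)
    if not stale:
        return 0
    # Step 2: bounded UPDATE restricted to those IDs -- no raw SQL needed.
    return table.update_many({r.id for r in stale}, "stale_expired")

table = FakeTable([Row(i, "in_progress") for i in range(5)])
print(expire_stale_rows(table, 3), expire_stale_rows(table, 3), expire_stale_rows(table, 3))
# prints: 3 2 0
```

The trade-off against the raw-SQL version is one extra round-trip and a small ID list in memory, in exchange for staying mockable with the existing `update_many` test fixtures.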

@ishaan-berri
Contributor Author

@greptile review

@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@ishaan-berri ishaan-berri changed the base branch from main to litellm_ishaan_april6 April 6, 2026 20:26
@ishaan-berri ishaan-berri merged commit c2f486a into litellm_ishaan_april6 Apr 6, 2026
58 of 63 checks passed
@ishaan-berri ishaan-berri deleted the fix/stale-object-cleanup-batch-limit branch April 6, 2026 20:26
@ishaan-berri ishaan-berri mentioned this pull request Apr 6, 2026
@ishaan-berri
Contributor Author

E2E Test: _expire_stale_rows against real Postgres DB

DB: postgresql://localhost:5432/litellm
Table: LiteLLM_ManagedObjectTable

Setup

Parameter Value
Rows seeded 5
file_purpose response
Initial status in_progress
created_at backdated 31 days ago
Cutoff 30 days ago
batch_size 3

Results

Iteration Rows updated Batch limit enforced?
1 3 ✓ (≤ 3)
2 2 ✓ (≤ 3)
3 0 — (done)
  • Total updated: 5 / 5
  • Final status on all rows: stale_expired
  • Rows loaded into Python memory: 0 (UPDATE runs entirely in Postgres)

Assertions

  • All 5 rows updated across iterations
  • No iteration exceeded batch_size=3
  • Terminal-state rows skipped on subsequent iterations (iteration 3 → 0)
  • file_purpose = 'response' filter respected
  • Test rows cleaned up after run

PASS — bounded subquery SELECT ... LIMIT enforces the batch cap correctly.
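The E2E loop above can be expressed as a small driver that calls `_expire_stale_rows` until it returns 0, asserting the batch cap on every pass. `expire` below is a stand-in that replays the observed per-iteration counts; the real call hits Postgres.

```python
def drain(expire, batch_size=3):
    """Repeatedly expire stale rows until a pass updates nothing."""
    total, per_iteration = 0, []
    while True:
        updated = expire(batch_size)
        assert updated <= batch_size  # batch limit enforced on every pass
        per_iteration.append(updated)
        if updated == 0:
            break
        total += updated
    return total, per_iteration

counts = iter([3, 2, 0])  # observed: 5 seeded rows, batch_size=3
total, per_iter = drain(lambda bs: next(counts))
print(total, per_iter)  # prints: 5 [3, 2, 0]
```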

@ishaan-berri ishaan-berri mentioned this pull request Apr 7, 2026