fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE by ishaan-berri · Pull Request #25257 · BerriAI/litellm

ishaan-berri · 2026-04-07T01:40:14Z

Fixes unbounded UPDATE queries in _cleanup_stale_managed_objects that could hit 300K+ rows at once.

Two changes:

Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000) to litellm/constants.py
Replace unbounded update_many with a single execute_raw using a subquery LIMIT — zero rows loaded into Python memory, everything stays in Postgres. Same pattern as spend_log_cleanup.py.

Also extracts _expire_stale_rows as a separate method for testability.

Relevant issues

Pre-Submission checklist

Added testing in tests/test_litellm/
PR passes all unit tests on make test-unit

CI (LiteLLM team)

Branch creation CI run
Link:
CI run for the last commit
Link:

Configurable batch limit (default 1000) for stale managed object cleanup, preventing unbounded UPDATE queries from hitting 300K+ rows at once.

Two fixes to _cleanup_stale_managed_objects: 1. Replace unbounded update_many with a single execute_raw using a subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE rows. Zero rows loaded into Python memory — everything stays in Postgres. Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py (the proxy requires PostgreSQL per schema.prisma). 2. Extract _expire_stale_rows as a separate method for testability. Keeps the file_purpose='response' filter to avoid incorrectly expiring long-running batch or fine-tune jobs that legitimately exceed the staleness cutoff.

vercel · 2026-04-07T01:40:20Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
litellm	Building	Preview, Comment	Apr 7, 2026 1:40am

codspeed-hq · 2026-04-07T01:41:55Z

Merging this PR will not alter performance

✅ 16 untouched benchmarks

_{Comparing fix/stale-object-cleanup-batch-limit-main (3815678) with main (3ac61a5)¹}

No successful run was found on litellm_april_6_3 (3ac61a5) during the generation of this report, so main (3ac61a5) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report. ↩

codecov · 2026-04-07T01:43:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

greptile-apps · 2026-04-07T01:45:08Z

Greptile Summary

This PR adds a STALE_OBJECT_CLEANUP_BATCH_SIZE constant and replaces the unbounded Prisma update_many stale-cleanup call with a raw SQL execute_raw query using a subquery LIMIT, aiming to prevent 300K-row UPDATE blasts. While the intent is sound, the implementation violates the project's explicit prohibition on raw SQL for proxy DB operations, and the existing tests are left in a broken state.

litellm/constants.py: Adds STALE_OBJECT_CLEANUP_BATCH_SIZE = max(1, int(os.getenv(..., 1000))) — clean, env-configurable, no issues
_expire_stale_rows uses execute_raw with PostgreSQL-specific syntax, directly violating CLAUDE.md: "Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw"
All tests in test_check_responses_cost.py that assert on stale-cleanup behavior are broken: the fixture sets client.db = MagicMock() (not AsyncMock), so await prisma_client.db.execute_raw(...) raises TypeError (swallowed by the exception handler), and the tests then check update_many.call_args_list[0] which is empty — failing with IndexError or AssertionError
Because tests/proxy_unit_tests/ is outside the tests/test_litellm/ directory run by make test-unit, these failures are not caught by the CI gate referenced in the PR

Confidence Score: 3/5

Not safe to merge — violates explicit CLAUDE.md raw SQL prohibition and leaves all stale-cleanup tests broken

Two P1 issues: (1) execute_raw directly contradicts the project's CLAUDE.md rule against raw SQL in proxy DB operations; (2) all stale-cleanup-related tests fail with IndexError/AssertionError when run because execute_raw is not mocked as AsyncMock in the fixture, and these failures are hidden from make test-unit

enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py (raw SQL usage) and tests/proxy_unit_tests/test_check_responses_cost.py (stale mocks)

Important Files Changed

Filename	Overview
enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py	Adds _expire_stale_rows using execute_raw (violates CLAUDE.md raw SQL prohibition); corresponding tests are broken due to missing AsyncMock setup for execute_raw in the fixture
litellm/constants.py	Correctly adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant with env override and max(1,...) guard — no issues

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[check_responses_cost] --> B[_cleanup_stale_managed_objects]
    B --> C[cutoff = now - STALENESS_CUTOFF_DAYS]
    C --> D[_expire_stale_rows\ncutoff, BATCH_SIZE]
    D --> E[execute_raw UPDATE\nvia subquery LIMIT]
    E --> F{rows_updated > 0?}
    F -- Yes --> G[log warning]
    F -- No --> H[continue to job poll]
    G --> H
    H --> I[find_many queued/in_progress jobs]
    I --> J[process each job]
    J --> K[update_many completed jobs]

Comments Outside Diff (1)

tests/proxy_unit_tests/test_check_responses_cost.py, line 19-24 (link)

Tests are stale and will fail against the new implementation

The fixture sets client.db = MagicMock(), which means client.db.execute_raw is a plain MagicMock — not an AsyncMock. In Python 3.8+, await MagicMock() raises TypeError: object MagicMock can't be used in 'await' expression.

When _expire_stale_rows is called, that TypeError is silently swallowed by the except Exception handler in check_responses_cost. As a result every test that asserts on stale-cleanup behaviour fails:
- test_cleanup_stale_managed_objects (line 86): update_many is never called for stale cleanup, so calls[0] raises IndexError
- test_check_responses_cost_with_completed_response (line 108), test_check_responses_cost_with_failed_response (line 155), test_check_responses_cost_with_cancelled_response (line 196), test_check_responses_cost_multiple_jobs (line 353): each expects len(calls) == 2 (stale + completion) but only sees 1 → AssertionError
- test_check_responses_cost_with_in_progress_response (line 238), test_check_responses_cost_with_queued_response (line 279), test_check_responses_cost_with_exception (line 319): each checks calls[0][1]["data"] == {"status": "stale_expired"} but calls is empty → IndexError
These failures are invisible to make test-unit because that command only runs tests/test_litellm/, not tests/proxy_unit_tests/.

Regardless of whether execute_raw is kept or replaced, the fixture must be updated:
```
client.db.execute_raw = AsyncMock(return_value=0)
```
and assertions should verify execute_raw call args instead of update_many.

_{Reviews (1): Last reviewed commit: "Batch-limit stale managed object cleanup..." | Re-trigger Greptile}

greptile-apps · 2026-04-07T01:45:12Z

+        return await self.prisma_client.db.execute_raw(
+            """
+            UPDATE "LiteLLM_ManagedObjectTable"
+            SET "status" = 'stale_expired'
+            WHERE "id" IN (
+                SELECT "id" FROM "LiteLLM_ManagedObjectTable"
+                WHERE "file_purpose" = 'response'
+                AND "status" NOT IN ('completed', 'complete', 'failed', 'expired', 'cancelled', 'stale_expired')
+                AND "created_at" < $1::timestamptz
+                ORDER BY "created_at" ASC
+                LIMIT $2
+            )
+            """,
+            cutoff,
+            batch_size,
+        )


Raw SQL violates project coding standard

The project's CLAUDE.md explicitly prohibits this:

"Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw."

The _expire_stale_rows method calls execute_raw with PostgreSQL-specific syntax ($1::timestamptz, double-quoted identifiers, LIMIT in a subquery) — exactly the pattern the rule was written to prevent.

The compliant alternative is a two-step Prisma approach that still keeps the UPDATE bounded to batch_size rows with only IDs flowing through Python:

async def _expire_stale_rows(self, cutoff: datetime, batch_size: int) -> int: stale_rows = await self.prisma_client.db.litellm_managedobjecttable.find_many( where={ "file_purpose": "response", "status": {"not_in": ["completed", "complete", "failed", "expired", "cancelled", "stale_expired"]}, "created_at": {"lt": cutoff}, }, take=batch_size, order={"created_at": "asc"}, select={"id": True}, ) if not stale_rows: return 0 return await self.prisma_client.db.litellm_managedobjecttable.update_many( where={"id": {"in": [row.id for row in stale_rows]}}, data={"status": "stale_expired"}, )

This keeps everything in Prisma ORM, matches the project's existing patterns, and still bounds the UPDATE to batch_size rows.

Context Used: CLAUDE.md (source)

#25258) * fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE (#25257) * Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant Configurable batch limit (default 1000) for stale managed object cleanup, preventing unbounded UPDATE queries from hitting 300K+ rows at once. * Batch-limit stale managed object cleanup with single bounded SQL query Two fixes to _cleanup_stale_managed_objects: 1. Replace unbounded update_many with a single execute_raw using a subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE rows. Zero rows loaded into Python memory — everything stays in Postgres. Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py (the proxy requires PostgreSQL per schema.prisma). 2. Extract _expire_stale_rows as a separate method for testability. Keeps the file_purpose='response' filter to avoid incorrectly expiring long-running batch or fine-tune jobs that legitimately exceed the staleness cutoff. * docs: add STALE_OBJECT_CLEANUP_BATCH_SIZE to env vars reference * test: remove deprecated embed-english-v2.0 cohere embedding tests

BerriAI#25258) * fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE (BerriAI#25257) * Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant Configurable batch limit (default 1000) for stale managed object cleanup, preventing unbounded UPDATE queries from hitting 300K+ rows at once. * Batch-limit stale managed object cleanup with single bounded SQL query Two fixes to _cleanup_stale_managed_objects: 1. Replace unbounded update_many with a single execute_raw using a subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE rows. Zero rows loaded into Python memory — everything stays in Postgres. Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py (the proxy requires PostgreSQL per schema.prisma). 2. Extract _expire_stale_rows as a separate method for testability. Keeps the file_purpose='response' filter to avoid incorrectly expiring long-running batch or fine-tune jobs that legitimately exceed the staleness cutoff. * docs: add STALE_OBJECT_CLEANUP_BATCH_SIZE to env vars reference * test: remove deprecated embed-english-v2.0 cohere embedding tests

ishaan-berri added 2 commits April 6, 2026 18:40

Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant

3bf6acf

Configurable batch limit (default 1000) for stale managed object cleanup, preventing unbounded UPDATE queries from hitting 300K+ rows at once.

ishaan-berri changed the base branch from main to litellm_april_6_3 April 7, 2026 01:41

ishaan-berri merged commit 2d83267 into litellm_april_6_3 Apr 7, 2026
34 of 38 checks passed

ishaan-berri deleted the fix/stale-object-cleanup-batch-limit-main branch April 7, 2026 01:41

ishaan-berri mentioned this pull request Apr 7, 2026

fix: batch-limit stale managed object cleanup to prevent 300K row UPD… #25258

Merged

7 tasks

vercel Bot deployed to Preview April 7, 2026 01:41 View deployment

greptile-apps Bot reviewed Apr 7, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25257

fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25257
ishaan-berri merged 2 commits intolitellm_april_6_3from
fix/stale-object-cleanup-batch-limit-main

ishaan-berri commented Apr 7, 2026

Uh oh!

vercel Bot commented Apr 7, 2026

Uh oh!

Uh oh!

codspeed-hq Bot commented Apr 7, 2026

Uh oh!

codecov Bot commented Apr 7, 2026

Uh oh!

greptile-apps Bot commented Apr 7, 2026 •

edited

Loading

Important Files Changed

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ishaan-berri commented Apr 7, 2026

Relevant issues

Pre-Submission checklist

CI (LiteLLM team)

Uh oh!

vercel Bot commented Apr 7, 2026

Uh oh!

Uh oh!

codspeed-hq Bot commented Apr 7, 2026

Merging this PR will not alter performance

Footnotes

Uh oh!

codecov Bot commented Apr 7, 2026

Codecov Report

Uh oh!

greptile-apps Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented Apr 7, 2026 •

edited

Loading