
fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25227

Merged
ishaan-berri merged 2 commits into litellm_ishaan_april6 from fix/stale-object-cleanup-batch-limit
Apr 6, 2026

Conversation

@ishaan-berri
Contributor

Relevant issues

Closes #24451

What changed

  • Adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000, configurable via env var) to cap how many stale rows are marked per poll cycle
  • Replaces the unbounded update_many in _cleanup_stale_managed_objects with a single execute_raw using a subquery SELECT ... LIMIT — zero rows loaded into Python memory, one DB round-trip
  • Extracts _expire_stale_rows as a testable method
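The constant described above can be sketched as follows. This is a minimal illustration based on the pattern quoted in the Greptile summary below; the exact env-var name and clamping behavior are assumptions, not copied from `litellm/constants.py`.

```python
import os

# Batch cap for stale managed-object cleanup. Default 1000, overridable via
# env var; clamped to at least 1 so a misconfigured value of 0 or a negative
# number cannot disable cleanup entirely. (Env-var name is an assumption.)
STALE_OBJECT_CLEANUP_BATCH_SIZE = max(
    1, int(os.getenv("STALE_OBJECT_CLEANUP_BATCH_SIZE", "1000"))
)

print(STALE_OBJECT_CLEANUP_BATCH_SIZE)
```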

Pre-Submission checklist

  • My PR title starts with a Conventional Commit prefix (feat, fix, chore, refactor, etc.)
  • I have added at least 1 test to tests/litellm/
  • make test-unit passes locally

CI (LiteLLM team)

  • passes

Type

  • Bug Fix

Changes

  • litellm/constants.py — new STALE_OBJECT_CLEANUP_BATCH_SIZE constant
  • enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py — bounded SQL cleanup, extracted _expire_stale_rows

Test plan

  • E2E tested against real Neon Postgres DB: seeded 5 stale rows (backdated 30 days), called _expire_stale_rows(batch_size=3) three times — got 3, 2, 0 rows updated respectively
  • Confirmed batch LIMIT is enforced (never exceeded batch_size)
  • Confirmed terminal-state rows are not re-processed

Configurable batch limit (default 1000) for stale managed object cleanup,
preventing unbounded UPDATE queries from hitting 300K+ rows at once.
Two fixes to _cleanup_stale_managed_objects:

1. Replace unbounded update_many with a single execute_raw using a
   subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE
   rows. Zero rows loaded into Python memory — everything stays in Postgres.
   Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py
   (the proxy requires PostgreSQL per schema.prisma).

2. Extract _expire_stale_rows as a separate method for testability.

Keeps the file_purpose='response' filter to avoid incorrectly expiring
long-running batch or fine-tune jobs that legitimately exceed the
staleness cutoff.
@vercel

vercel Bot commented Apr 6, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Ready Ready Preview, Comment Apr 6, 2026 4:27pm


@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codspeed-hq
Contributor

codspeed-hq Bot commented Apr 6, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing fix/stale-object-cleanup-batch-limit (dd0449c) with main (9088b46) [1]

Open in CodSpeed

Footnotes

  [1] No successful run was found on main (d251238) during the generation of this report, so 9088b46 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 6, 2026

Greptile Summary

This PR caps the previously unbounded stale-managed-object cleanup to prevent mass UPDATE statements (e.g. 300K rows) from overwhelming the database on large deployments. It adds a STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1,000, overridable via env var) and replaces the Prisma update_many in _cleanup_stale_managed_objects with a PostgreSQL execute_raw using a SELECT ... LIMIT subquery — the same pattern already used in spend_log_cleanup.py. The _expire_stale_rows method is extracted to make the DB call independently testable.

  • Adds STALE_OBJECT_CLEANUP_BATCH_SIZE = max(1, int(os.getenv(..., 1000))) in litellm/constants.py
  • Replaces update_many in stale-object cleanup with a bounded execute_raw UPDATE/SELECT subquery
  • Extracts _expire_stale_rows(cutoff, batch_size) for isolated testing
  • The existing unit tests were not updated — they still mock update_many for cleanup and assert on its call counts, but the new code uses execute_raw, causing the assertions to break across seven tests in test_check_responses_cost.py

Confidence Score: 4/5

The production fix is sound and follows established patterns (mirrors spend_log_cleanup.py), but the unchanged test suite contains incorrect assertions that will fail when the proxy unit tests are run.

The execute_raw UPDATE/SELECT subquery correctly solves the 300K-row problem and follows a pre-existing precedent. However, seven tests in test_check_responses_cost.py still assert update_many call counts and payloads for the stale-cleanup step. Because execute_raw is not mocked as an AsyncMock, the cleanup silently errors and invalidates call-count assumptions across all those tests. This is a verifiable, present defect in test coverage that should be resolved before merging.

tests/proxy_unit_tests/test_check_responses_cost.py — all tests that assert update_many.call_args_list positions need execute_raw mocked as AsyncMock and their call-count assertions updated.
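The test fix Greptile asks for can be sketched as below: mock `execute_raw` as an `AsyncMock` so the cleanup step awaits cleanly instead of silently erroring. The fixture shape (`prisma_client.db.execute_raw`) mirrors the review comment; the real test setup in `test_check_responses_cost.py` may wire the mock differently.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

# Mocked Prisma client: execute_raw must be an AsyncMock, since the new
# cleanup path awaits it directly instead of calling update_many.
prisma_client = MagicMock()
prisma_client.db.execute_raw = AsyncMock(return_value=3)  # rows updated

async def run_cleanup():
    # Stand-in for _expire_stale_rows: one bounded UPDATE round-trip.
    # (SQL text and arguments are placeholders for illustration.)
    return await prisma_client.db.execute_raw("UPDATE ...", "cutoff", 3)

rows = asyncio.run(run_cleanup())
assert rows == 3
prisma_client.db.execute_raw.assert_awaited_once()
```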

Important Files Changed

Filename Overview
enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py Switches stale-row cleanup from Prisma update_many to execute_raw with a bounded LIMIT subquery; the production fix is correct but the existing test suite was not updated and will fail.
litellm/constants.py Adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000, env-configurable) consistently with neighbouring constants.

Sequence Diagram

sequenceDiagram
    participant BG as Background Poller
    participant CRC as CheckResponsesCost
    participant DB as PostgreSQL (execute_raw)
    participant ORM as Prisma ORM (update_many)

    BG->>CRC: check_responses_cost()
    CRC->>CRC: _cleanup_stale_managed_objects()
    CRC->>CRC: _expire_stale_rows(cutoff, batch_size=1000)
    CRC->>DB: UPDATE LiteLLM_ManagedObjectTable SET status='stale_expired'<br/>WHERE id IN (SELECT id ... LIMIT batch_size)
    DB-->>CRC: rows_updated (int)
    CRC-->>BG: cleanup complete (exception swallowed on error)

    BG->>ORM: find_many(status in [queued,in_progress], take=MAX_OBJECTS_PER_POLL_CYCLE)
    ORM-->>BG: jobs[]

    loop For each job
        BG->>BG: litellm.aget_responses(response_id)
        BG-->>BG: response.status
    end

    BG->>ORM: update_many(id in completed_ids, status=completed)
    ORM-->>BG: done

Reviews (2): Last reviewed commit: "Batch-limit stale managed object cleanup..."

Comment on lines +49 to +64
return await self.prisma_client.db.execute_raw(
    """
    UPDATE "LiteLLM_ManagedObjectTable"
    SET "status" = 'stale_expired'
    WHERE "id" IN (
        SELECT "id" FROM "LiteLLM_ManagedObjectTable"
        WHERE "file_purpose" = 'response'
          AND "status" NOT IN ('completed', 'complete', 'failed', 'expired', 'cancelled', 'stale_expired')
          AND "created_at" < $1::timestamptz
        ORDER BY "created_at" ASC
        LIMIT $2
    )
    """,
    cutoff,
    batch_size,
)

P2 execute_raw violates the documented no-raw-SQL convention

CLAUDE.md explicitly states:

Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw. This avoids schema/client drift, keeps code testable with simple mocks, and matches patterns used in spend logs.

While the motivation here is valid (Prisma's update_many lacks LIMIT support), and the PR notes precedent from spend_log_cleanup.py, the use of execute_raw is the direct cause of the test breakage above and ties the implementation to PostgreSQL-specific syntax ($1::timestamptz, double-quoted identifiers). An alternative that stays within Prisma's API is a two-step approach: find_many with take=batch_size to get IDs, then update_many with id: {in: ids}. This loads at most batch_size IDs into Python memory (a small list), uses no raw SQL, and keeps tests simple.

Context Used: CLAUDE.md (source)
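The reviewer's two-step alternative can be illustrated with an in-memory stand-in: `find_many` with `take=batch_size` to fetch at most `batch_size` IDs, then `update_many` restricted to those IDs. `FakeTable`, `Row`, and the method signatures below are stand-ins for the Prisma client, purely for illustration; LiteLLM's real model and field names may differ.

```python
from dataclasses import dataclass

@dataclass
class Row:
    id: int
    status: str

TERMINAL = {"completed", "complete", "failed", "expired", "cancelled", "stale_expired"}

class FakeTable:
    """In-memory stand-in for the Prisma model (illustration only)."""
    def __init__(self, rows):
        self.rows = rows

    def find_many(self, exclude_statuses, take):
        # mirrors find_many(where={"status": {"not_in": [...]}}, take=batch_size)
        return [r for r in self.rows if r.status not in exclude_statuses][:take]

    def update_many(self, ids, new_status):
        # mirrors update_many(where={"id": {"in": ids}}, data={"status": ...})
        hits = [r for r in self.rows if r.id in ids]
        for r in hits:
            r.status = new_status
        return len(hits)

def expire_stale_rows(table, batch_size):
    # Step 1: load at most batch_size IDs (a small Python list).
    stale = table.find_many(TERMINAL, take=batch_size)
    if not stale:
        return 0
    # Step 2: bounded UPDATE restricted to those IDs -- no raw SQL needed.
    return table.update_many({r.id for r in stale}, "stale_expired")

table = FakeTable([Row(i, "in_progress") for i in range(5)])
print(expire_stale_rows(table, 3), expire_stale_rows(table, 3), expire_stale_rows(table, 3))
# prints: 3 2 0
```

The trade-off against the raw-SQL version is one extra round-trip and a small ID list in memory, in exchange for staying mockable with the existing `update_many` test fixtures.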

@ishaan-berri
Contributor Author

@greptile review

@codecov

codecov Bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@ishaan-berri ishaan-berri changed the base branch from main to litellm_ishaan_april6 April 6, 2026 20:26
@ishaan-berri ishaan-berri merged commit c2f486a into litellm_ishaan_april6 Apr 6, 2026
58 of 63 checks passed
@ishaan-berri ishaan-berri deleted the fix/stale-object-cleanup-batch-limit branch April 6, 2026 20:26
@ishaan-berri ishaan-berri mentioned this pull request Apr 6, 2026
@ishaan-berri
Contributor Author

E2E Test: _expire_stale_rows against real Postgres DB

DB: postgresql://localhost:5432/litellm
Table: LiteLLM_ManagedObjectTable

Setup

Parameter Value
Rows seeded 5
file_purpose response
Initial status in_progress
created_at backdated 31 days ago
Cutoff 30 days ago
batch_size 3

Results

Iteration Rows updated Batch limit enforced?
1 3 ✓ (≤ 3)
2 2 ✓ (≤ 3)
3 0 — (done)
  • Total updated: 5 / 5
  • Final status on all rows: stale_expired
  • Rows loaded into Python memory: 0 (UPDATE runs entirely in Postgres)

Assertions

  • All 5 rows updated across iterations
  • No iteration exceeded batch_size=3
  • Terminal-state rows skipped on subsequent iterations (iteration 3 → 0)
  • file_purpose = 'response' filter respected
  • Test rows cleaned up after run

PASS — bounded subquery SELECT ... LIMIT enforces the batch cap correctly.
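The E2E loop above can be expressed as a small driver that calls `_expire_stale_rows` until it returns 0, asserting the batch cap on every pass. `expire` below is a stand-in that replays the observed per-iteration counts; the real call hits Postgres.

```python
def drain(expire, batch_size=3):
    """Repeatedly expire stale rows until a pass updates nothing."""
    total, per_iteration = 0, []
    while True:
        updated = expire(batch_size)
        assert updated <= batch_size  # batch limit enforced on every pass
        per_iteration.append(updated)
        if updated == 0:
            break
        total += updated
    return total, per_iteration

counts = iter([3, 2, 0])  # observed: 5 seeded rows, batch_size=3
total, per_iter = drain(lambda bs: next(counts))
print(total, per_iter)  # prints: 5 [3, 2, 0]
```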

@ishaan-berri ishaan-berri mentioned this pull request Apr 7, 2026