fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25227
Conversation
Configurable batch limit (default 1000) for stale managed object cleanup, preventing unbounded UPDATE queries from hitting 300K+ rows at once.
Two fixes to _cleanup_stale_managed_objects: 1. Replace unbounded update_many with a single execute_raw using a subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE rows. Zero rows loaded into Python memory — everything stays in Postgres. Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py (the proxy requires PostgreSQL per schema.prisma). 2. Extract _expire_stale_rows as a separate method for testability. Keeps the file_purpose='response' filter to avoid incorrectly expiring long-running batch or fine-tune jobs that legitimately exceed the staleness cutoff.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
|
Greptile SummaryThis PR caps the previously unbounded stale-managed-object cleanup to prevent mass
Confidence Score: 4/5The production fix is sound and follows established patterns (mirrors spend_log_cleanup.py), but the unchanged test suite contains incorrect assertions that will fail when the proxy unit tests are run. The
|
| Filename | Overview |
|---|---|
| enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py | Switches stale-row cleanup from Prisma update_many to execute_raw with a bounded LIMIT subquery; the production fix is correct but the existing test suite was not updated and will fail. |
| litellm/constants.py | Adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000, env-configurable) consistently with neighbouring constants. |
Sequence Diagram
sequenceDiagram
participant BG as Background Poller
participant CRC as CheckResponsesCost
participant DB as PostgreSQL (execute_raw)
participant ORM as Prisma ORM (update_many)
BG->>CRC: check_responses_cost()
CRC->>CRC: _cleanup_stale_managed_objects()
CRC->>CRC: _expire_stale_rows(cutoff, batch_size=1000)
CRC->>DB: UPDATE LiteLLM_ManagedObjectTable SET status='stale_expired'<br/>WHERE id IN (SELECT id ... LIMIT batch_size)
DB-->>CRC: rows_updated (int)
CRC-->>BG: cleanup complete (exception swallowed on error)
BG->>ORM: find_many(status in [queued,in_progress], take=MAX_OBJECTS_PER_POLL_CYCLE)
ORM-->>BG: jobs[]
loop For each job
BG->>BG: litellm.aget_responses(response_id)
BG-->>BG: response.status
end
BG->>ORM: update_many(id in completed_ids, status=completed)
ORM-->>BG: done
Reviews (2): Last reviewed commit: "Batch-limit stale managed object cleanup..." | Re-trigger Greptile
| return await self.prisma_client.db.execute_raw( | ||
| """ | ||
| UPDATE "LiteLLM_ManagedObjectTable" | ||
| SET "status" = 'stale_expired' | ||
| WHERE "id" IN ( | ||
| SELECT "id" FROM "LiteLLM_ManagedObjectTable" | ||
| WHERE "file_purpose" = 'response' | ||
| AND "status" NOT IN ('completed', 'complete', 'failed', 'expired', 'cancelled', 'stale_expired') | ||
| AND "created_at" < $1::timestamptz | ||
| ORDER BY "created_at" ASC | ||
| LIMIT $2 | ||
| ) | ||
| """, | ||
| cutoff, | ||
| batch_size, | ||
| ) |
There was a problem hiding this comment.
execute_raw violates the documented no-raw-SQL convention
CLAUDE.md explicitly states:
Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of
execute_raw/query_raw. This avoids schema/client drift, keeps code testable with simple mocks, and matches patterns used in spend logs.
While the motivation here is valid (Prisma's update_many lacks LIMIT support), and the PR notes precedent from spend_log_cleanup.py, the use of execute_raw is the direct cause of the test breakage above and ties the implementation to PostgreSQL-specific syntax ($1::timestamptz, double-quoted identifiers). An alternative that stays within Prisma's API is a two-step approach: find_many with take=batch_size to get IDs, then update_many with id: {in: ids}. This loads at most batch_size IDs into Python memory (a small list), uses no raw SQL, and keeps tests simple.
Context Used: CLAUDE.md (source)
|
@greptile review |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
E2E Test:
|
| Parameter | Value |
|---|---|
| Rows seeded | 5 |
file_purpose |
response |
| Initial status | in_progress |
created_at backdated |
31 days ago |
| Cutoff | 30 days ago |
batch_size |
3 |
Results
| Iteration | Rows updated | Batch limit enforced? |
|---|---|---|
| 1 | 3 | ✓ (≤ 3) |
| 2 | 2 | ✓ (≤ 3) |
| 3 | 0 | — (done) |
- Total updated: 5 / 5
- Final status on all rows:
stale_expired - Rows loaded into Python memory: 0 (UPDATE runs entirely in Postgres)
Assertions
- All 5 rows updated across iterations
- No iteration exceeded
batch_size=3 - Terminal-state rows skipped on subsequent iterations (iteration 3 → 0)
-
file_purpose = 'response'filter respected - Test rows cleaned up after run
PASS — bounded subquery SELECT ... LIMIT enforces the batch cap correctly.
Relevant issues
Closes #24451
What changed
STALE_OBJECT_CLEANUP_BATCH_SIZEconstant (default 1000, configurable via env var) to cap how many stale rows are marked per poll cycleupdate_manyin_cleanup_stale_managed_objectswith a singleexecute_rawusing a subquerySELECT ... LIMIT— zero rows loaded into Python memory, one DB round-trip_expire_stale_rowsas a testable methodPre-Submission checklist
feat,fix,chore,refactor, etc.)tests/litellm/make test-unitpasses locallyCI (LiteLLM team)
Type
Changes
litellm/constants.py— newSTALE_OBJECT_CLEANUP_BATCH_SIZEconstantenterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py— bounded SQL cleanup, extracted_expire_stale_rowsTest plan
_expire_stale_rows(batch_size=3)three times — got 3, 2, 0 rows updated respectivelyLIMITis enforced (never exceededbatch_size)