Skip to content

fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25257

Merged
ishaan-berri merged 2 commits intolitellm_april_6_3from
fix/stale-object-cleanup-batch-limit-main
Apr 7, 2026
Merged

fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE#25257
ishaan-berri merged 2 commits intolitellm_april_6_3from
fix/stale-object-cleanup-batch-limit-main

Conversation

@ishaan-berri
Copy link
Copy Markdown
Contributor

Fixes unbounded UPDATE queries in _cleanup_stale_managed_objects that could hit 300K+ rows at once.

Two changes:

  1. Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant (default 1000) to litellm/constants.py
  2. Replace unbounded update_many with a single execute_raw using a subquery LIMIT — zero rows loaded into Python memory, everything stays in Postgres. Same pattern as spend_log_cleanup.py.

Also extracts _expire_stale_rows as a separate method for testability.

Relevant issues

Pre-Submission checklist

  • Added testing in tests/test_litellm/
  • PR passes all unit tests on make test-unit

CI (LiteLLM team)

  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

Configurable batch limit (default 1000) for stale managed object cleanup,
preventing unbounded UPDATE queries from hitting 300K+ rows at once.
Two fixes to _cleanup_stale_managed_objects:

1. Replace unbounded update_many with a single execute_raw using a
   subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE
   rows. Zero rows loaded into Python memory — everything stays in Postgres.
   Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py
   (the proxy requires PostgreSQL per schema.prisma).

2. Extract _expire_stale_rows as a separate method for testability.

Keeps the file_purpose='response' filter to avoid incorrectly expiring
long-running batch or fine-tune jobs that legitimately exceed the
staleness cutoff.
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 7, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Building Building Preview, Comment Apr 7, 2026 1:40am

Request Review

@ishaan-berri ishaan-berri changed the base branch from main to litellm_april_6_3 April 7, 2026 01:41
@ishaan-berri ishaan-berri merged commit 2d83267 into litellm_april_6_3 Apr 7, 2026
34 of 38 checks passed
@ishaan-berri ishaan-berri deleted the fix/stale-object-cleanup-batch-limit-main branch April 7, 2026 01:41
@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Apr 7, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing fix/stale-object-cleanup-batch-limit-main (3815678) with main (3ac61a5)1

Open in CodSpeed

Footnotes

  1. No successful run was found on litellm_april_6_3 (3ac61a5) during the generation of this report, so main (3ac61a5) was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 7, 2026

Greptile Summary

This PR adds a STALE_OBJECT_CLEANUP_BATCH_SIZE constant and replaces the unbounded Prisma update_many stale-cleanup call with a raw SQL execute_raw query using a subquery LIMIT, aiming to prevent 300K-row UPDATE blasts. While the intent is sound, the implementation violates the project's explicit prohibition on raw SQL for proxy DB operations, and the existing tests are left in a broken state.

  • litellm/constants.py: Adds STALE_OBJECT_CLEANUP_BATCH_SIZE = max(1, int(os.getenv(..., 1000))) — clean, env-configurable, no issues
  • _expire_stale_rows uses execute_raw with PostgreSQL-specific syntax, directly violating CLAUDE.md: "Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw"
  • All tests in test_check_responses_cost.py that assert on stale-cleanup behavior are broken: the fixture sets client.db = MagicMock() (not AsyncMock), so await prisma_client.db.execute_raw(...) raises TypeError (swallowed by the exception handler), and the tests then check update_many.call_args_list[0] which is empty — failing with IndexError or AssertionError
  • Because tests/proxy_unit_tests/ is outside the tests/test_litellm/ directory run by make test-unit, these failures are not caught by the CI gate referenced in the PR

Confidence Score: 3/5

Not safe to merge — violates explicit CLAUDE.md raw SQL prohibition and leaves all stale-cleanup tests broken

Two P1 issues: (1) execute_raw directly contradicts the project's CLAUDE.md rule against raw SQL in proxy DB operations; (2) all stale-cleanup-related tests fail with IndexError/AssertionError when run because execute_raw is not mocked as AsyncMock in the fixture, and these failures are hidden from make test-unit

enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py (raw SQL usage) and tests/proxy_unit_tests/test_check_responses_cost.py (stale mocks)

Important Files Changed

Filename Overview
enterprise/litellm_enterprise/proxy/common_utils/check_responses_cost.py Adds _expire_stale_rows using execute_raw (violates CLAUDE.md raw SQL prohibition); corresponding tests are broken due to missing AsyncMock setup for execute_raw in the fixture
litellm/constants.py Correctly adds STALE_OBJECT_CLEANUP_BATCH_SIZE constant with env override and max(1,...) guard — no issues

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[check_responses_cost] --> B[_cleanup_stale_managed_objects]
    B --> C[cutoff = now - STALENESS_CUTOFF_DAYS]
    C --> D[_expire_stale_rows\ncutoff, BATCH_SIZE]
    D --> E[execute_raw UPDATE\nvia subquery LIMIT]
    E --> F{rows_updated > 0?}
    F -- Yes --> G[log warning]
    F -- No --> H[continue to job poll]
    G --> H
    H --> I[find_many queued/in_progress jobs]
    I --> J[process each job]
    J --> K[update_many completed jobs]
Loading

Comments Outside Diff (1)

  1. tests/proxy_unit_tests/test_check_responses_cost.py, line 19-24 (link)

    P1 Tests are stale and will fail against the new implementation

    The fixture sets client.db = MagicMock(), which means client.db.execute_raw is a plain MagicMock — not an AsyncMock. In Python 3.8+, await MagicMock() raises TypeError: object MagicMock can't be used in 'await' expression.

    When _expire_stale_rows is called, that TypeError is silently swallowed by the except Exception handler in check_responses_cost. As a result every test that asserts on stale-cleanup behaviour fails:

    • test_cleanup_stale_managed_objects (line 86): update_many is never called for stale cleanup, so calls[0] raises IndexError
    • test_check_responses_cost_with_completed_response (line 108), test_check_responses_cost_with_failed_response (line 155), test_check_responses_cost_with_cancelled_response (line 196), test_check_responses_cost_multiple_jobs (line 353): each expects len(calls) == 2 (stale + completion) but only sees 1 → AssertionError
    • test_check_responses_cost_with_in_progress_response (line 238), test_check_responses_cost_with_queued_response (line 279), test_check_responses_cost_with_exception (line 319): each checks calls[0][1]["data"] == {"status": "stale_expired"} but calls is empty → IndexError

    These failures are invisible to make test-unit because that command only runs tests/test_litellm/, not tests/proxy_unit_tests/.

    Regardless of whether execute_raw is kept or replaced, the fixture must be updated:

    client.db.execute_raw = AsyncMock(return_value=0)

    and assertions should verify execute_raw call args instead of update_many.

Reviews (1): Last reviewed commit: "Batch-limit stale managed object cleanup..." | Re-trigger Greptile

Comment on lines +49 to +64
return await self.prisma_client.db.execute_raw(
"""
UPDATE "LiteLLM_ManagedObjectTable"
SET "status" = 'stale_expired'
WHERE "id" IN (
SELECT "id" FROM "LiteLLM_ManagedObjectTable"
WHERE "file_purpose" = 'response'
AND "status" NOT IN ('completed', 'complete', 'failed', 'expired', 'cancelled', 'stale_expired')
AND "created_at" < $1::timestamptz
ORDER BY "created_at" ASC
LIMIT $2
)
""",
cutoff,
batch_size,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Raw SQL violates project coding standard

The project's CLAUDE.md explicitly prohibits this:

"Do not write raw SQL for proxy DB operations. Use Prisma model methods instead of execute_raw / query_raw."

The _expire_stale_rows method calls execute_raw with PostgreSQL-specific syntax ($1::timestamptz, double-quoted identifiers, LIMIT in a subquery) — exactly the pattern the rule was written to prevent.

The compliant alternative is a two-step Prisma approach that still keeps the UPDATE bounded to batch_size rows with only IDs flowing through Python:

async def _expire_stale_rows(self, cutoff: datetime, batch_size: int) -> int:
    stale_rows = await self.prisma_client.db.litellm_managedobjecttable.find_many(
        where={
            "file_purpose": "response",
            "status": {"not_in": ["completed", "complete", "failed", "expired", "cancelled", "stale_expired"]},
            "created_at": {"lt": cutoff},
        },
        take=batch_size,
        order={"created_at": "asc"},
        select={"id": True},
    )
    if not stale_rows:
        return 0
    return await self.prisma_client.db.litellm_managedobjecttable.update_many(
        where={"id": {"in": [row.id for row in stale_rows]}},
        data={"status": "stale_expired"},
    )

This keeps everything in Prisma ORM, matches the project's existing patterns, and still bounds the UPDATE to batch_size rows.

Context Used: CLAUDE.md (source)

krrish-berri-2 pushed a commit that referenced this pull request Apr 7, 2026
#25258)

* fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE (#25257)

* Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant

Configurable batch limit (default 1000) for stale managed object cleanup,
preventing unbounded UPDATE queries from hitting 300K+ rows at once.

* Batch-limit stale managed object cleanup with single bounded SQL query

Two fixes to _cleanup_stale_managed_objects:

1. Replace unbounded update_many with a single execute_raw using a
   subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE
   rows. Zero rows loaded into Python memory — everything stays in Postgres.
   Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py
   (the proxy requires PostgreSQL per schema.prisma).

2. Extract _expire_stale_rows as a separate method for testability.

Keeps the file_purpose='response' filter to avoid incorrectly expiring
long-running batch or fine-tune jobs that legitimately exceed the
staleness cutoff.

* docs: add STALE_OBJECT_CLEANUP_BATCH_SIZE to env vars reference

* test: remove deprecated embed-english-v2.0 cohere embedding tests
harish876 pushed a commit to harish876/litellm that referenced this pull request Apr 8, 2026
BerriAI#25258)

* fix: batch-limit stale managed object cleanup to prevent 300K row UPDATE (BerriAI#25257)

* Add STALE_OBJECT_CLEANUP_BATCH_SIZE constant

Configurable batch limit (default 1000) for stale managed object cleanup,
preventing unbounded UPDATE queries from hitting 300K+ rows at once.

* Batch-limit stale managed object cleanup with single bounded SQL query

Two fixes to _cleanup_stale_managed_objects:

1. Replace unbounded update_many with a single execute_raw using a
   subquery LIMIT, capping each poll cycle to STALE_OBJECT_CLEANUP_BATCH_SIZE
   rows. Zero rows loaded into Python memory — everything stays in Postgres.
   Uses the same PostgreSQL raw-SQL pattern as spend_log_cleanup.py
   (the proxy requires PostgreSQL per schema.prisma).

2. Extract _expire_stale_rows as a separate method for testability.

Keeps the file_purpose='response' filter to avoid incorrectly expiring
long-running batch or fine-tune jobs that legitimately exceed the
staleness cutoff.

* docs: add STALE_OBJECT_CLEANUP_BATCH_SIZE to env vars reference

* test: remove deprecated embed-english-v2.0 cohere embedding tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant