
Issues with Key-Level Model Limits (TPM/RPM/Budget) Enforcement and Router Fallback (DB Mode, Enterprise) #10052

@mkagit

Description


What happened?

Environment:

LiteLLM Version: 1.66.0
Mode: Proxy Server with Database (--db)
Configuration Method: Primarily via UI/API/DB (not using config.yaml at all).
License: Enterprise License

Describe the Bug:

I am encountering several issues related to the enforcement of model-specific limits (TPM, RPM, Budget) defined at the API key level and the router fallback mechanism not triggering as expected when these limits are supposedly hit. These settings are being configured via the API (POST /key/generate or POST /key/update) and stored in the database.
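For reference, the update call that sets these limits looks roughly like this (a sketch; the proxy base URL and keys are placeholders, and only the TPM/RPM fields from issue 1 are shown):

```python
import json
import urllib.request

PROXY_BASE = "http://localhost:4000"   # placeholder proxy URL
MASTER_KEY = "sk-master-placeholder"   # placeholder admin key

def build_limit_payload(api_key: str) -> dict:
    """Body for POST /key/update setting per-model TPM/RPM limits on a key."""
    return {
        "key": api_key,
        "model_tpm_limit": {"openrouter/google/gemini-flash-1.5": 2000},
        "model_rpm_limit": {"openrouter/google/gemini-flash-1.5": 10},
    }

def update_key(api_key: str) -> dict:
    """POST the limit payload to the proxy's /key/update endpoint."""
    req = urllib.request.Request(
        f"{PROXY_BASE}/key/update",
        data=json.dumps(build_limit_payload(api_key)).encode(),
        headers={
            "Authorization": f"Bearer {MASTER_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```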

Issues Observed:

Incorrect Response Structure for Model TPM/RPM Limits:
When setting model_tpm_limit and model_rpm_limit via POST /key/update, the limits are at least partially applied (see below), but the API response (from GET /key/info or from the update call itself) places these limit objects inside the metadata field instead of as top-level fields on the key object.
At the same time, the top-level model_tpm_limit and model_rpm_limit fields in the same response are null, which is inconsistent.
Example Snippet from API Response:

"metadata": {
    // ... other metadata ...
    "model_rpm_limit": {
        "openrouter/google/gemini-flash-1.5": 10 // Should NOT be here
    },
    "model_tpm_limit": {
        "openrouter/google/gemini-flash-1.5": 2000 // Should NOT be here
    }
},
// ... other key fields ...
"model_rpm_limit": null, // Incorrectly null at top level
"model_tpm_limit": null, // Incorrectly null at top level
"model_max_budget": { // Correctly placed at top level
     "openrouter/google/gemini-flash-1.5": { ... }
}
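As a client-side workaround until the response shape is fixed, the limits can be read from whichever location they appear in (a sketch; the fallback to metadata matches the v1.66.0 behavior shown above, and the nesting under "info" is how GET /key/info wraps the key fields in my environment):

```python
def get_model_limit(key_info_response: dict, limit_name: str):
    """Read a per-model limit ("model_tpm_limit" / "model_rpm_limit") from a
    /key/info response, checking the top-level field first and falling back
    to metadata, where v1.66.0 sometimes nests it."""
    info = key_info_response.get("info", key_info_response)
    value = info.get(limit_name)
    if value is not None:
        return value
    return (info.get("metadata") or {}).get(limit_name)
```

For the response snippet above, `get_model_limit(resp, "model_rpm_limit")` returns the dict from metadata even though the top-level field is null.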

Delayed/Inaccurate TPM Limit Enforcement:

The model_tpm_limit does eventually trigger a rate limit, but not at the configured threshold.
For a model configured with a model_tpm_limit of 2000, I consistently receive a 429 only after exceeding the limit by roughly 1000 tokens (about 3000 tokens are processed within the minute before the 429 is returned). The RPM limit is somewhat more accurate, but still not precise.
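The overshoot can be quantified by tallying accepted tokens until the first 429 (a sketch of the bookkeeping only; the (status, total_tokens) pairs would come from each request's HTTP status and usage.total_tokens):

```python
def tokens_accepted_before_429(results):
    """Sum total_tokens across requests that succeeded before the first 429.
    results: iterable of (status_code, total_tokens) pairs, one per request,
    all sent within the same one-minute window."""
    total = 0
    for status_code, total_tokens in results:
        if status_code == 429:
            break
        total += total_tokens
    return total

# With a 2000 TPM limit, this tally consistently lands near 3000 in my tests,
# i.e. roughly 1000 tokens over the configured threshold.
```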

Router Fallback Not Triggering on Limits:

I have configured router_settings in my DB.
Expected Behavior: When the TPM, RPM, or Model Budget limit for the primary model (openrouter/google/gemini-flash-1.5 in my tests) is hit (even with the delay mentioned above), the router should catch the resulting RateLimitError or BudgetExceededError and automatically retry the request with the fallback model (mistralai/mistral-small-3.1-24b-instruct).
Actual Behavior: When the limit is hit and the 429 error (or potentially a budget error, though hard to trigger consistently) occurs, the request simply fails and returns the error to the client. The fallback model is never attempted. The router fallback logic seems to ignore these specific key-level limit exceptions.
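For reference, the fallback configuration stored in my DB router_settings is roughly the following (model names are the ones from my tests; the exact JSON shape of the fallbacks list is my best reconstruction):

```json
{
  "router_settings": {
    "fallbacks": [
      {
        "openrouter/google/gemini-flash-1.5": [
          "mistralai/mistral-small-3.1-24b-instruct"
        ]
      }
    ]
  }
}
```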

model_max_budget Not Enforced:

I have configured model_max_budget with specific max_budget and budget_duration values per model directly on the API key.
The configuration appears correctly in the API response (as a top-level field model_max_budget and also works inside the litellm_budget_table object associated with the key).
Actual Behavior: Requests continue to succeed even after the defined max_budget for that specific model within the budget_duration should have been exceeded. The spend limit per model per key does not seem to be reliably enforced. The duration reset might also be inconsistent.
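Concretely, the per-model budget on the key looks like this (the dollar and duration values here are placeholders; the max_budget/budget_duration field names are the ones accepted by /key/update in my setup):

```json
"model_max_budget": {
    "openrouter/google/gemini-flash-1.5": {
        "max_budget": 0.05,
        "budget_duration": "1d"
    }
}
```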

Expected Behavior:

API responses should consistently show model_rpm_limit, model_tpm_limit, and model_max_budget as top-level fields in the key object, not nested within metadata.
TPM/RPM limits should be enforced accurately at the specified threshold.
Hitting any of the configured key-level limits (TPM, RPM, Budget - general or model-specific) should trigger the appropriate exception (RateLimitError, BudgetExceededError).
The LiteLLM Router should catch these specific exceptions and trigger the configured fallback mechanism to the secondary model.
model_max_budget limits should be reliably enforced according to their defined max_budget and budget_duration.

Actual Behavior:

Incorrect response structure for limits in API GET responses.
Inaccurate TPM enforcement (triggers late).
Router fallback does not trigger on limit errors.
model_max_budget enforcement seems unreliable.

Relevant log output

Are you a ML Ops Team?

No

What LiteLLM version are you on?

v1.66.0

Twitter / LinkedIn details

No response

Metadata


Labels

bug (Something isn't working)
