[WIP][Perf] Litellm prometheus improvements #25934
ishaan-berri merged 8 commits into litellm_internal_staging
Conversation
…s, reduces CPU time by ~5%
… function calls from the hot path, replacing them with lightweight Python-native dataclasses. TODO: override the Pydantic `__repr__` method in all Pydantic objects and compare performance. Improves CPU time in the hot path by 2x.
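As a rough illustration of the Pydantic-to-dataclass swap described above (the class and field names here are hypothetical, not the actual LiteLLM types), a plain dataclass does no per-field validation and none of Pydantic's model machinery on every construction in the hot path:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a migrated label-values type:
# construction is a plain attribute assignment, with no
# validation or coercion work per field.
@dataclass
class LabelValues:
    model: Optional[str] = None
    hashed_api_key: Optional[str] = None
    team: Optional[str] = None

lv = LabelValues(model="gpt-4", hashed_api_key="abc123")
print(lv.model)
```

The trade-off is that dataclasses silently skip the validation (and `extra="ignore"` behaviour) that Pydantic provided, which is exactly the backward-compat issue flagged later in the review.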
Greptile Summary: This PR optimises the Prometheus logging hot path by introducing a per-request label cache (`PrometheusLabelFactoryContext`).
Confidence Score: 4/5 — Safe to merge after addressing the `PrometheusMetricsConfig` backward-compat break; all other findings are style/annotation issues. One P1 backward-compatibility issue: the `PrometheusMetricsConfig` dataclass raises `TypeError` on extra YAML config keys, which would break deployments that happen to have extra fields in their `prometheus_metrics_config`. The core amortisation logic and sentinel fix are correct. Remaining findings are P2. See litellm/types/integrations/prometheus.py — `PrometheusMetricsConfig` construction and `tags` field annotation.
| Filename | Overview |
|---|---|
| litellm/integrations/prometheus.py | Amortizes label computation via PrometheusLabelFactoryContext and adds _inc_labeled_counter helper; all counter sites use an unusual unbound-method call style instead of self._inc_labeled_counter(...). |
| litellm/integrations/prometheus_helpers.py | New module introducing PrometheusLabelFactoryContext: per-request cache for sanitized labels, custom metadata, tag labels, and lazy end-user resolution. Uses a dedicated sentinel to correctly cache None result. Clean implementation. |
| litellm/types/integrations/prometheus.py | Migrates UserAPIKeyLabelValues, PrometheusMetricsConfig, and PrometheusSettings from Pydantic to dataclasses. PrometheusMetricsConfig loses Pydantic's extra="ignore", breaking YAML configs with extra keys. tags field annotation is misleadingly broad. |
| tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py | Updates tests to use hashed_api_key (correcting previously-silently-dropped api_key_hash kwarg) and fixes label name in assertion comment. Changes are correct and improve test accuracy. |
| tests/test_litellm/types/test_prometheus_label_value_sanitize.py | New unit tests for the optimised _sanitize_prometheus_label_value covering newlines, unicode separators, backslash/quote escapes, and non-string coercions. Good coverage for the rewritten translate-based implementation. |
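The translate-based sanitizer the new tests cover could look roughly like this single-pass sketch (the translation-table entries and the function name are assumptions based on the cases listed above — newlines, unicode separators, backslash/quote escapes, non-string coercion — not the actual implementation):

```python
# Hypothetical translate-based label sanitizer.
_TABLE = str.maketrans({
    "\n": "\\n",
    "\r": "\\r",
    "\u2028": "\\n",   # unicode line separator
    "\u2029": "\\n",   # unicode paragraph separator
    "\\": "\\\\",
    '"': '\\"',
})

def sanitize_label_value(value):
    if not isinstance(value, str):
        value = str(value)          # coerce non-strings
    return value.translate(_TABLE)  # single pass, one new string

print(sanitize_label_value('a\n"b"'))
```

Because `str.translate` substitutes everything in a single pass, the inserted backslashes are never re-escaped — a chain of `.replace()` calls would have to be ordered carefully to avoid double-escaping, and allocates an intermediate string per call.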
Sequence Diagram
sequenceDiagram
participant LS as async_log_success_event
participant CTX as PrometheusLabelFactoryContext
participant LF as prometheus_label_factory
participant PM as Prometheus Counter
LS->>CTX: "__init__(enum_values) — model_dump + sanitize all fields"
Note over CTX: Amortised once per request
loop "Each metric (8-10x per request)"
LS->>LF: "_inc_labeled_counter(counter, metric_name, ctx)"
LF->>CTX: "_prometheus_labels_from_context(supported_labels)"
CTX-->>LF: "subset of _sanitized_enum + lazy end_user"
LF-->>LS: "{label: value, ...}"
LS->>PM: "counter.labels(**labels).inc(amount)"
end
Note over CTX: "end_user resolved lazily on first metric that needs it"
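The amortisation in the diagram can be sketched minimally: sanitize every label value exactly once per request, then hand each of the 8-10 metrics only the subset of labels it supports (all names here are illustrative, not the actual LiteLLM helpers):

```python
# Once per request: sanitize every enum value a single time.
enum_values = {"model": "gpt-4\n", "team": "core", "hashed_api_key": "abc"}
sanitized = {k: str(v).replace("\n", "\\n") for k, v in enum_values.items()}

# 8-10x per request, one call per metric: a cheap subset lookup,
# with no re-sanitizing of values already computed.
def labels_for(supported_labels):
    return {k: sanitized[k] for k in supported_labels if k in sanitized}

print(labels_for(["model", "team"]))
```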
Comments Outside Diff (1)
- litellm/types/integrations/prometheus.py, lines 1044-1050: `PrometheusMetricsConfig` dataclass breaks on extra config keys. The old `BaseModel` used Pydantic v2's default `extra="ignore"`, silently dropping any unknown keys in the YAML config dict. The plain `@dataclass` raises `TypeError: __init__() got an unexpected keyword argument '<key>'` for any extra field. If a user's YAML `prometheus_metrics_config` block has any undocumented key, the proxy will fail to start after this change. The construction site in prometheus.py is: `parsed_config = PrometheusMetricsConfig(**group_config)`
A backwards-compatible fix is to strip unknown keys before construction:
    _known = {"group", "metrics", "include_labels"}
    parsed_config = PrometheusMetricsConfig(
        **{k: v for k, v in group_config.items() if k in _known}
    )
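To see why the dataclass breaks on extra keys, and how the key-filtering fix restores the old tolerant behaviour, here is a self-contained sketch (field names are taken from the `_known` set above; the full shape of the real config class is an assumption):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PrometheusMetricsConfig:
    # Fields mirror the _known set from the suggested fix;
    # the real class may have more fields.
    group: str
    metrics: List[str]
    include_labels: Optional[List[str]] = None

cfg = {"group": "g", "metrics": ["m"], "legacy_key": True}

try:
    PrometheusMetricsConfig(**cfg)  # plain dataclass: TypeError on legacy_key
except TypeError as exc:
    print(f"rejected: {exc}")

# Backwards-compatible construction: strip unknown keys first,
# mimicking Pydantic's extra="ignore".
_known = {"group", "metrics", "include_labels"}
ok = PrometheusMetricsConfig(**{k: v for k, v in cfg.items() if k in _known})
print(ok.group)
```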
Rule used: avoid backwards-incompatible changes without...
Reviews (5): Last reviewed commit: "resolving code comments to move helper l..."
Improved sentinel handling flagged by greptile
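The sentinel pattern this commit refers to, reduced to a minimal generic sketch: caching with an `is None` check would recompute forever when the real answer is None, so a dedicated sentinel object marks "not yet computed" (the class and names here are illustrative, not the actual helper):

```python
_SENTINEL = object()  # distinguishes "not computed" from a cached None

class LazyValue:
    def __init__(self, compute):
        self._compute = compute
        self._value = _SENTINEL

    def get(self):
        # Using "self._value is None" here would re-run compute()
        # on every call whenever the real result is None.
        if self._value is _SENTINEL:
            self._value = self._compute()
        return self._value

calls = []
lazy = LazyValue(lambda: calls.append(1))  # compute() returns None
lazy.get()
lazy.get()
print(len(calls))  # computed once despite a None result
```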
Review comment on `class PrometheusLabelFactoryContext:`: always place a class in its own file; this can be in prometheus_helpers.py.
Merged commit 6ed2929 into litellm_internal_staging
Relevant issues
This PR makes 2 optimizations to the Prometheus logging module.
`Logging.async_success_handler` drops from 22% to ~12.2% of CPU time, with positive changes in request latency over 5.80 mins of runtime metrics across ~35-40K total requests. A further perf improvement identified is the synchronous nature of the callbacks under `PrometheusLogger.async_log_success_event`, even though the parent handler function is async. The caveat here is that there is a lot of idle time at the tail of this function, which needs further investigation.
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement - see details)
- My PR passes all unit tests on make test-unit
- I have tagged @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes
- `replace()` creates a new string every time.
- `__repr__` class overridden.
- `PrometheusMetricsConfig` validation. The previous TODO, if successful, will fix this.

Flamegraph metrics
Perf Diff
Latency improvements (Single Instance - 4 vCPU + 16GB RAM - with external LLM API)
BASELINE
WITH label_factory amortization + Pydantic -> @dataclass