
[WIP][Perf] Litellm prometheus improvements #25934

Merged
ishaan-berri merged 8 commits into litellm_internal_staging from litellm_prometheus_improvements
Apr 17, 2026

harish-berri (Collaborator) commented Apr 17, 2026

Relevant issues

This PR makes two optimizations to the Prometheus logging module.

  • The label_factory computation is amortized per request, which avoids repeating the dict comprehension and recomputing labels for every metric.
  • Pydantic models that were not used in a critical capacity have been migrated to Python dataclasses, with improvements. This reduces the CPU time of Logging.async_success_handler from 22% to ~12.2%, with improved request latency, measured over 5.80 minutes of runtime metrics across ~35-40K total requests.
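The amortization idea in the first bullet can be illustrated with a minimal sketch. The class and function names here (`LabelContext`, `sanitize`, `labels_for`) are hypothetical and are not the PR's actual API; the sketch only assumes that label values are sanitized once per request and that each metric then selects the subset of labels it supports.

```python
from typing import Dict, Iterable, Optional


def sanitize(value: object) -> Optional[str]:
    """Hypothetical stand-in for the real label sanitizer."""
    if value is None:
        return None
    return str(value).replace("\n", " ")


class LabelContext:
    """Hypothetical per-request cache: sanitize every label value once."""

    def __init__(self, raw_values: Dict[str, object]) -> None:
        # Single pass over all fields, done once per request.
        self._sanitized = {name: sanitize(value) for name, value in raw_values.items()}

    def labels_for(self, supported_labels: Iterable[str]) -> Dict[str, Optional[str]]:
        # Each metric does a cheap subset lookup instead of re-sanitizing.
        return {name: self._sanitized.get(name) for name in supported_labels}


ctx = LabelContext({"model": "gpt-4o", "team": "core\nteam", "user": None})
requests_labels = ctx.labels_for(["model", "team"])
latency_labels = ctx.labels_for(["model", "user"])
```

With several metric increments per request, the sanitization cost is paid once rather than once per metric.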

A further perf improvement that was identified concerns the synchronous nature of the callbacks under PrometheusLogger.async_log_success_event, even though the parent handler function is async. The caveat is that there is a lot of idle time at the tail of this function, which needs further investigation.

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Delays in PR merge?

If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).

CI (LiteLLM team)

CI status guideline:

  • 50-55 passing tests: main is stable with minor issues.
  • 45-49 passing tests: acceptable but needs attention
  • <= 40 passing tests: unstable; be careful with your merges and assess the risk.
  • Branch creation CI run
    Link:

  • CI run for the last commit
    Link:

  • Merge / cherry-pick CI run
    Links:

Screenshots / Proof of Fix

Type

🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test

Changes

  • Amortize label_factory function
  • Replace non-critical Pydantic classes with Python dataclasses.
  • Improve the sanitization function, which reduces memory allocations by 6x. replace() creates a new string every time.
  • TODO: Measure the difference between native Python dataclasses and Pydantic classes with __repr__ overridden.
  • This PR acknowledges 1 breaking change to PrometheusMetricsConfig validation. The previous TODO, if successful, will fix this.
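As a rough illustration of the sanitization change, the sketch below contrasts chained `replace()` calls with a single `str.translate()` pass. The set of characters handled is an assumption for illustration only; the PR's actual function may escape or strip a different set.

```python
# Hypothetical translation table; the real function may handle other characters.
_LABEL_TRANSLATION = str.maketrans(
    {
        "\n": " ",
        "\r": " ",
        "\u2028": " ",  # Unicode line separator
        "\u2029": " ",  # Unicode paragraph separator
        "\\": "\\\\",
        '"': '\\"',
    }
)


def sanitize_with_replace(value: object) -> str:
    # Every replace() that changes the text builds another intermediate string.
    text = str(value)
    text = text.replace("\\", "\\\\")
    text = text.replace('"', '\\"')
    for separator in ("\n", "\r", "\u2028", "\u2029"):
        text = text.replace(separator, " ")
    return text


def sanitize_with_translate(value: object) -> str:
    # One pass over the string and one result string.
    return str(value).translate(_LABEL_TRANSLATION)


sample = 'a\nb"c\\d'
translated = sanitize_with_translate(sample)
replaced = sanitize_with_replace(sample)
```

Both helpers produce the same output for this input; the translate-based version avoids the chain of intermediate strings.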

Flamegraph metrics

Perf Diff

Screenshot 2026-04-16 at 10 13 47 PM

Latency improvements (Single Instance - 4 vCPU + 16GB RAM - with external LLM API)

BASELINE

| Metric | Value |
| --- | --- |
| checks_total | 36,008 |
| http_req_duration avg | 589.75 ms |
| http_req_duration med | 544.31 ms |
| http_req_duration p90 | 1.06 s |
| http_req_duration p95 | 1.32 s |

WITH label_factory amortization + Pydantic -> @dataclass

| Metric | Average |
| --- | --- |
| checks_total | 41,322.5 |
| http_req_duration avg | 492.09 ms |
| http_req_duration med | 428.66 ms |
| http_req_duration p90 | 742.98 ms |
| http_req_duration p95 | 923.62 ms |
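The relative latency reductions implied by the two result sets can be checked with a short calculation. The figures are copied from the results above, with the p90 and p95 baselines converted from the rounded values 1.06 s and 1.32 s, so those two percentages are approximate.

```python
# Latency figures in milliseconds, taken from the benchmark results above.
baseline_ms = {"avg": 589.75, "med": 544.31, "p90": 1060.0, "p95": 1320.0}
optimized_ms = {"avg": 492.09, "med": 428.66, "p90": 742.98, "p95": 923.62}


def relative_reduction(before: float, after: float) -> float:
    """Return the latency reduction as a percentage of the baseline."""
    return (before - after) / before * 100.0


reductions = {
    name: relative_reduction(baseline_ms[name], optimized_ms[name])
    for name in baseline_ms
}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower")
```

This gives roughly 16.6% for the average, 21.2% for the median, and about 30% for p90 and p95.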

… function calls from the hotpath and replacing them with lightweight python native dataclasses.

TODO: Override pydantic __repr__ class in all pydantic objects and compare performance.

Improves CPU time perf by 2X in the hotpath

vercel Bot commented Apr 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| litellm | Ready | Preview, Comment | Apr 17, 2026 7:51pm |


harish-berri changed the title from "[Perf] Litellm prometheus improvements" to "[WIP][Perf] Litellm prometheus improvements" on Apr 17, 2026

greptile-apps Bot (Contributor) commented Apr 17, 2026

Greptile Summary

This PR optimises the Prometheus logging hot path by introducing PrometheusLabelFactoryContext, which amortises one model_dump() + sanitisation pass across all metric increments in a single request, and migrates UserAPIKeyLabelValues / PrometheusMetricsConfig from Pydantic to frozen dataclasses to reduce per-request allocation overhead. The benchmark data shows meaningful latency improvements (~17% avg, ~20% median).

  • P1: PrometheusMetricsConfig converted to a plain @dataclass raises TypeError for any extra key in a YAML config dict, whereas the previous BaseModel silently ignored extra fields; users with any undocumented key in their prometheus_metrics_config block will fail at proxy start-up.
  • P2: All counter increments use the unbound call form PrometheusLogger._inc_labeled_counter(self, ...) instead of self._inc_labeled_counter(...) — the two are equivalent here but the pattern is unusual and silently bypasses subclass dispatch.
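The dispatch difference described in the P2 finding can be reproduced with a small sketch. The class and method names below are hypothetical and unrelated to the real PrometheusLogger; the sketch only shows that naming the class explicitly skips a subclass override, while a call through `self` does not.

```python
class BaseLogger:
    def _inc(self) -> str:
        return "base"

    def log_unbound(self) -> str:
        # The class is named explicitly, so a subclass override is skipped.
        return BaseLogger._inc(self)

    def log_bound(self) -> str:
        # Normal attribute lookup on self, so a subclass override is used.
        return self._inc()


class CustomLogger(BaseLogger):
    def _inc(self) -> str:
        return "custom"


logger = CustomLogger()
unbound_result = logger.log_unbound()  # "base"
bound_result = logger.log_bound()      # "custom"
```

When no subclass overrides the helper, both forms behave identically, which matches the review's note that the two are equivalent here.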

Confidence Score: 4/5

Safe to merge after addressing the PrometheusMetricsConfig backward-compat break; all other findings are style/annotation issues.

One P1 backward-compatibility issue: the PrometheusMetricsConfig dataclass raises TypeError on extra YAML config keys, which would silently break deployments that happen to have extra fields in their prometheus_metrics_config. The core amortisation logic and sentinel fix are correct. Remaining findings are P2.

litellm/types/integrations/prometheus.py — PrometheusMetricsConfig construction and tags field annotation.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/integrations/prometheus.py | Amortizes label computation via PrometheusLabelFactoryContext and adds _inc_labeled_counter helper; all counter sites use unusual unbound-method call style instead of self._inc_labeled_counter(...). |
| litellm/integrations/prometheus_helpers.py | New module introducing PrometheusLabelFactoryContext: per-request cache for sanitized labels, custom metadata, tag labels, and lazy end-user resolution. Uses a dedicated sentinel to correctly cache None result. Clean implementation. |
| litellm/types/integrations/prometheus.py | Migrates UserAPIKeyLabelValues, PrometheusMetricsConfig, and PrometheusSettings from Pydantic to dataclasses. PrometheusMetricsConfig loses Pydantic's extra="ignore", breaking YAML configs with extra keys. tags field annotation is misleadingly broad. |
| tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py | Updates tests to use hashed_api_key (correcting previously-silently-dropped api_key_hash kwarg) and fixes label name in assertion comment. Changes are correct and improve test accuracy. |
| tests/test_litellm/types/test_prometheus_label_value_sanitize.py | New unit tests for the optimised _sanitize_prometheus_label_value covering newlines, unicode separators, backslash/quote escapes, and non-string coercions. Good coverage for the rewritten translate-based implementation. |
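The "dedicated sentinel to correctly cache None result" mentioned for prometheus_helpers.py refers to a common pattern, sketched below under assumptions. `LazyEndUser` and its resolver are hypothetical names, not the PR's code; the point is that using `None` as the "not yet computed" marker would re-run the resolver every time the resolved value is itself `None`.

```python
from typing import Callable, Optional

# Distinct marker object; None cannot be used because it is a valid result.
_UNSET = object()


class LazyEndUser:
    """Hypothetical lazy holder that resolves a value at most once."""

    def __init__(self, resolver: Callable[[], Optional[str]]) -> None:
        self._resolver = resolver
        self._cached: object = _UNSET
        self.resolver_calls = 0

    def get(self) -> Optional[str]:
        if self._cached is _UNSET:
            # First access: resolve and remember the result, even if it is None.
            self.resolver_calls += 1
            self._cached = self._resolver()
        return self._cached  # type: ignore[return-value]


lazy = LazyEndUser(lambda: None)
first = lazy.get()
second = lazy.get()
```

The resolver runs once even though it returned `None`, so later metrics in the same request reuse the cached result.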

Sequence Diagram

sequenceDiagram
    participant LS as async_log_success_event
    participant CTX as PrometheusLabelFactoryContext
    participant LF as prometheus_label_factory
    participant PM as Prometheus Counter

    LS->>CTX: "__init__(enum_values) — model_dump + sanitize all fields"
    Note over CTX: Amortised once per request

    loop "Each metric (8-10x per request)"
        LS->>LF: "_inc_labeled_counter(counter, metric_name, ctx)"
        LF->>CTX: "_prometheus_labels_from_context(supported_labels)"
        CTX-->>LF: "subset of _sanitized_enum + lazy end_user"
        LF-->>LS: "{label: value, ...}"
        LS->>PM: "counter.labels(**labels).inc(amount)"
    end

    Note over CTX: "end_user resolved lazily on first metric that needs it"

Comments Outside Diff (1)

  1. litellm/types/integrations/prometheus.py, line 1044-1050 (link)

    P1 PrometheusMetricsConfig dataclass breaks on extra config keys

    The old BaseModel used Pydantic v2's default extra="ignore", silently dropping any unknown keys in the YAML config dict. The plain @dataclass raises TypeError: __init__() got an unexpected keyword argument '<key>' for any extra field. If a user's YAML prometheus_metrics_config block has any undocumented key, the proxy will fail to start after this change.

    The construction site in prometheus.py is:

    parsed_config = PrometheusMetricsConfig(**group_config)

    A backwards-compatible fix is to strip unknown keys before construction:

    _known = {"group", "metrics", "include_labels"}
    parsed_config = PrometheusMetricsConfig(**{k: v for k, v in group_config.items() if k in _known})

    Rule Used: What: avoid backwards-incompatible changes without... (source)
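A variation on the fix suggested in this comment is to derive the known keys from the dataclass itself, so the filter does not drift when fields are added. This is a sketch under assumptions: `MetricsConfig` is a hypothetical stand-in that reuses the three field names from the snippet above, and the field types and the metric name are invented for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class MetricsConfig:
    """Hypothetical stand-in for PrometheusMetricsConfig; field types are assumed."""

    group: str
    metrics: List[str]
    include_labels: Optional[List[str]] = None


def build_config(raw: Dict[str, Any]) -> MetricsConfig:
    # Keep only the keys that match declared dataclass fields.
    known = {f.name for f in fields(MetricsConfig)}
    return MetricsConfig(**{k: v for k, v in raw.items() if k in known})


raw_config = {"group": "proxy", "metrics": ["example_requests_total"], "note": "extra"}

try:
    MetricsConfig(**raw_config)
    strict_error = None
except TypeError as exc:
    # A plain dataclass rejects the unknown "note" key.
    strict_error = str(exc)

config = build_config(raw_config)
```

Direct construction fails on the extra key, while the filtered construction succeeds and silently drops it, mirroring the previous ignore-extra behaviour described in the comment.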

Reviews (5): Last reviewed commit: "resolving code comments to move helper l..." | Re-trigger Greptile

Comment thread litellm/integrations/prometheus.py Outdated
Comment thread litellm/types/integrations/prometheus.py
Improved sentinel handling flagged by greptile

ishaan-berri (Contributor) left a comment

nit

Comment thread litellm/integrations/prometheus.py Outdated
)


class PrometheusLabelFactoryContext:

ishaan-berri (Contributor) commented:

always place a class in its own file, this can be in prometheus_helpers.py

harish-berri (Collaborator, Author) replied:

noted.

harish-berri (Collaborator, Author) replied:

resolved.

@ishaan-berri ishaan-berri merged commit 6ed2929 into litellm_internal_staging Apr 17, 2026
95 of 100 checks passed
@ishaan-berri ishaan-berri deleted the litellm_prometheus_improvements branch April 17, 2026 20:28

2 participants