[WIP][Perf] Litellm prometheus improvements #25934
ishaan-berri merged 8 commits into litellm_internal_staging
Conversation
…s, reduces CPU time by ~5%
… function calls from the hot path, replacing them with lightweight Python-native dataclasses. TODO: override the Pydantic `__repr__` method in all Pydantic objects and compare performance. Improves CPU time in the hot path by 2x.
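As a rough illustration of the Pydantic-to-dataclass swap described above (the class and field names here are hypothetical, not the actual LiteLLM types), a plain dataclass does no per-field validation and none of Pydantic's model machinery on every construction in the hot path:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a migrated label-values type:
# construction is a plain attribute assignment, with no
# validation or coercion work per field.
@dataclass
class LabelValues:
    model: Optional[str] = None
    hashed_api_key: Optional[str] = None
    team: Optional[str] = None

lv = LabelValues(model="gpt-4", hashed_api_key="abc123")
print(lv.model)
```

The trade-off is that dataclasses silently skip the validation (and `extra="ignore"` behaviour) that Pydantic provided, which is exactly the backward-compat issue flagged later in the review.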
Greptile Summary: This PR optimises the Prometheus logging hot path by introducing a per-request label cache (`PrometheusLabelFactoryContext`).
Confidence Score: 4/5 — Safe to merge after addressing the `PrometheusMetricsConfig` backward-compat break; all other findings are style/annotation issues. One P1 backward-compatibility issue: the `PrometheusMetricsConfig` dataclass raises `TypeError` on extra YAML config keys, which would break deployments that happen to have extra fields in their `prometheus_metrics_config`. The core amortisation logic and sentinel fix are correct. Remaining findings are P2. See litellm/types/integrations/prometheus.py — `PrometheusMetricsConfig` construction and `tags` field annotation.
| Filename | Overview |
|---|---|
| litellm/integrations/prometheus.py | Amortizes label computation via PrometheusLabelFactoryContext and adds _inc_labeled_counter helper; all counter sites use an unusual unbound-method call style instead of self._inc_labeled_counter(...). |
| litellm/integrations/prometheus_helpers.py | New module introducing PrometheusLabelFactoryContext: per-request cache for sanitized labels, custom metadata, tag labels, and lazy end-user resolution. Uses a dedicated sentinel to correctly cache None result. Clean implementation. |
| litellm/types/integrations/prometheus.py | Migrates UserAPIKeyLabelValues, PrometheusMetricsConfig, and PrometheusSettings from Pydantic to dataclasses. PrometheusMetricsConfig loses Pydantic's extra="ignore", breaking YAML configs with extra keys. tags field annotation is misleadingly broad. |
| tests/enterprise/litellm_enterprise/enterprise_callbacks/test_prometheus_logging_callbacks.py | Updates tests to use hashed_api_key (correcting previously-silently-dropped api_key_hash kwarg) and fixes label name in assertion comment. Changes are correct and improve test accuracy. |
| tests/test_litellm/types/test_prometheus_label_value_sanitize.py | New unit tests for the optimised _sanitize_prometheus_label_value covering newlines, unicode separators, backslash/quote escapes, and non-string coercions. Good coverage for the rewritten translate-based implementation. |
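The translate-based sanitizer the new tests cover could look roughly like this single-pass sketch (the translation-table entries and the function name are assumptions based on the cases listed above — newlines, unicode separators, backslash/quote escapes, non-string coercion — not the actual implementation):

```python
# Hypothetical translate-based label sanitizer.
_TABLE = str.maketrans({
    "\n": "\\n",
    "\r": "\\r",
    "\u2028": "\\n",   # unicode line separator
    "\u2029": "\\n",   # unicode paragraph separator
    "\\": "\\\\",
    '"': '\\"',
})

def sanitize_label_value(value):
    if not isinstance(value, str):
        value = str(value)          # coerce non-strings
    return value.translate(_TABLE)  # single pass, one new string

print(sanitize_label_value('a\n"b"'))
```

Because `str.translate` substitutes everything in a single pass, the inserted backslashes are never re-escaped — a chain of `.replace()` calls would have to be ordered carefully to avoid double-escaping, and allocates an intermediate string per call.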
Sequence Diagram
sequenceDiagram
participant LS as async_log_success_event
participant CTX as PrometheusLabelFactoryContext
participant LF as prometheus_label_factory
participant PM as Prometheus Counter
LS->>CTX: "__init__(enum_values) — model_dump + sanitize all fields"
Note over CTX: Amortised once per request
loop "Each metric (8-10x per request)"
LS->>LF: "_inc_labeled_counter(counter, metric_name, ctx)"
LF->>CTX: "_prometheus_labels_from_context(supported_labels)"
CTX-->>LF: "subset of _sanitized_enum + lazy end_user"
LF-->>LS: "{label: value, ...}"
LS->>PM: "counter.labels(**labels).inc(amount)"
end
Note over CTX: "end_user resolved lazily on first metric that needs it"
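The amortisation in the diagram can be sketched minimally: sanitize every label value exactly once per request, then hand each of the 8-10 metrics only the subset of labels it supports (all names here are illustrative, not the actual LiteLLM helpers):

```python
# Once per request: sanitize every enum value a single time.
enum_values = {"model": "gpt-4\n", "team": "core", "hashed_api_key": "abc"}
sanitized = {k: str(v).replace("\n", "\\n") for k, v in enum_values.items()}

# 8-10x per request, one call per metric: a cheap subset lookup,
# with no re-sanitizing of values already computed.
def labels_for(supported_labels):
    return {k: sanitized[k] for k in supported_labels if k in sanitized}

print(labels_for(["model", "team"]))
```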
Comments Outside Diff (1)
- litellm/types/integrations/prometheus.py, lines 1044-1050: `PrometheusMetricsConfig` dataclass breaks on extra config keys. The old `BaseModel` used Pydantic v2's default `extra="ignore"`, silently dropping any unknown keys in the YAML config dict. The plain `@dataclass` raises `TypeError: __init__() got an unexpected keyword argument '<key>'` for any extra field. If a user's YAML `prometheus_metrics_config` block has any undocumented key, the proxy will fail to start after this change. The construction site in prometheus.py is: `parsed_config = PrometheusMetricsConfig(**group_config)`
A backwards-compatible fix is to strip unknown keys before construction:
    _known = {"group", "metrics", "include_labels"}
    parsed_config = PrometheusMetricsConfig(
        **{k: v for k, v in group_config.items() if k in _known}
    )
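To see why the dataclass breaks on extra keys, and how the key-filtering fix restores the old tolerant behaviour, here is a self-contained sketch (field names are taken from the `_known` set above; the full shape of the real config class is an assumption):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PrometheusMetricsConfig:
    # Fields mirror the _known set from the suggested fix;
    # the real class may have more fields.
    group: str
    metrics: List[str]
    include_labels: Optional[List[str]] = None

cfg = {"group": "g", "metrics": ["m"], "legacy_key": True}

try:
    PrometheusMetricsConfig(**cfg)  # plain dataclass: TypeError on legacy_key
except TypeError as exc:
    print(f"rejected: {exc}")

# Backwards-compatible construction: strip unknown keys first,
# mimicking Pydantic's extra="ignore".
_known = {"group", "metrics", "include_labels"}
ok = PrometheusMetricsConfig(**{k: v for k, v in cfg.items() if k in _known})
print(ok.group)
```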
Rule used: avoid backwards-incompatible changes without...
Reviews (5): Last reviewed commit: "resolving code comments to move helper l..."
Improved sentinel handling flagged by greptile
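The sentinel pattern this commit refers to, reduced to a minimal generic sketch: caching with an `is None` check would recompute forever when the real answer is None, so a dedicated sentinel object marks "not yet computed" (the class and names here are illustrative, not the actual helper):

```python
_SENTINEL = object()  # distinguishes "not computed" from a cached None

class LazyValue:
    def __init__(self, compute):
        self._compute = compute
        self._value = _SENTINEL

    def get(self):
        # Using "self._value is None" here would re-run compute()
        # on every call whenever the real result is None.
        if self._value is _SENTINEL:
            self._value = self._compute()
        return self._value

calls = []
lazy = LazyValue(lambda: calls.append(1))  # compute() returns None
lazy.get()
lazy.get()
print(len(calls))  # computed once despite a None result
```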
Review comment on `class PrometheusLabelFactoryContext:`: always place a class in its own file; this can be in prometheus_helpers.py.
Merged commit 6ed2929 into litellm_internal_staging
Relevant issues
This PR makes 2 optimizations to the Prometheus logging module.
`Logging.async_success_handler` drops from 22% to ~12.2% of CPU time, with positive changes in request latency over 5.80 mins of runtime metrics across ~35-40K total requests. A further perf improvement identified is the synchronous nature of the callbacks under `PrometheusLogger.async_log_success_event`, even though the parent handler function is async. The caveat here is that there is a lot of idle time at the tail of this function, which needs further investigation.
Pre-Submission checklist
Please complete all items before asking a LiteLLM maintainer to review your PR
- I have added testing in the tests/test_litellm/ directory (adding at least 1 test is a hard requirement - see details)
- My PR passes all unit tests on make test-unit
- I have tagged @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review
If you're seeing a delay in your PR being merged, ping the LiteLLM Team on Slack (#pr-review).
CI (LiteLLM team)
Branch creation CI run
Link:
CI run for the last commit
Link:
Merge / cherry-pick CI run
Links:
Screenshots / Proof of Fix
Type
🆕 New Feature
🐛 Bug Fix
🧹 Refactoring
📖 Documentation
🚄 Infrastructure
✅ Test
Changes
- `replace()` creates a new string every time.
- `__repr__` class overridden.
- `PrometheusMetricsConfig` validation. The previous TODO, if successful, will fix this.

Flamegraph metrics
Perf Diff
Latency improvements (Single Instance - 4 vCPU + 16GB RAM - with external LLM API)
BASELINE
WITH label_factory amortization + Pydantic -> @dataclass