fix(presidio): use correct text positions in anonymize_text by Dmitry-Kucher · Pull Request #24998 · BerriAI/litellm

Dmitry-Kucher · 2026-04-02T14:20:33Z

Relevant issues

What this PR does

The Presidio /anonymize endpoint returns items[] with start/end positions that reference the anonymized output text, not the original input text. anonymize_text() was applying these positions to the original text, producing garbled output with remnants of un-masked PII data.

Example (before fix):

Input:    "My name is John Smith, my email is john@example.com, phone 555-867-5309"
Expected: "My name is <PERSON>, my email is <EMAIL_ADDRESS>, phone <PHONE_NUMBER>"
Actual:   "My name is <PERSON>th, my email i<EMAIL_ADDRESS>com, pho<PHONE_NUMBER>9"
                      ^^              ^                ^^^     ^^^^^^^^^^^^^^^ ^
                      PII remnants leaking through due to offset mismatch

Root cause:
The anonymizer returns item positions relative to its own output (e.g. <PERSON> starts at char 11 in the anonymized text), but the code used these offsets to splice the original text where John Smith starts at char 11 but is 10 chars long vs 8 for <PERSON> — a 2-char shift that cascades through all subsequent entities.

Fix (two paths):

output_parse_pii=False — return redacted_text["text"] directly from the anonymizer (it's already correctly masked)
output_parse_pii=True — use analyze_results positions (which reference the original text) to build numbered replacement tokens and the pii_tokens mapping

Pre-Submission checklist

I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
My PR passes all unit tests on make test-unit
My PR's scope is as isolated as possible, it only solves 1 specific problem
I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

litellm/proxy/guardrails/guardrail_hooks/presidio.py — anonymize_text() method:
- When output_parse_pii=False: return redacted_text["text"] directly instead of manual splicing with wrong positions
- When output_parse_pii=True: iterate analyze_results (original-text positions) instead of redacted_text["items"] (output-text positions) for building numbered tokens
tests/test_litellm/proxy/guardrails/guardrail_hooks/test_presidio.py — added 2 regression tests:
- test_anonymize_text_uses_correct_positions_no_parse_pii — verifies clean masking without PII remnants
- test_anonymize_text_uses_correct_positions_with_parse_pii — verifies numbered tokens map to correct original PII values

…24160) The Presidio anonymizer endpoint returns items with start/end positions that reference the *anonymized output* text, not the original input. anonymize_text() was applying these positions to the original text, causing garbled output with remnants of un-masked PII data. When output_parse_pii is False, return redacted_text["text"] directly from the anonymizer response instead of manually splicing. When output_parse_pii is True, use analyze_results positions (which correctly reference the original text) to build numbered replacement tokens and the pii_tokens mapping.

vercel · 2026-04-02T14:20:40Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
litellm	Ready	Preview, Comment	Apr 2, 2026 2:32pm

Dmitry-Kucher · 2026-04-02T14:20:55Z

@greptileai review

CLAassistant · 2026-04-02T14:20:59Z

All committers have signed the CLA.

codspeed-hq · 2026-04-02T14:23:02Z

Merging this PR will not alter performance

✅ 16 untouched benchmarks

_{Comparing Dmitry-Kucher:fix/presidio-anonymizer-offset-bug (1112791) with main (d1df4e8)}

greptile-apps · 2026-04-02T14:24:19Z

Greptile Summary

This PR fixes a correctness bug in _OPTIONAL_PresidioPIIMasking.anonymize_text() where offset positions returned by the Presidio /anonymize endpoint (which reference the anonymized output text) were incorrectly applied to the original input text, producing garbled output with residual PII fragments leaking through.

Key changes:

output_parse_pii=False path: returns redacted_text[\"text\"] directly from Presidio instead of manually splicing the original text with mismatched offsets.
output_parse_pii=True path: builds numbered tokens using positions from analyze_results (which reference the original input text) instead of redacted_text[\"items\"] (which reference the anonymized output). Sequential token numbering (<PERSON_1>, <EMAIL_ADDRESS_2>, …) is now assigned in left-to-right order of appearance in the original text.
Previously flagged dead code (anon_item_by_entity dict) has been removed; a minor dead else branch remains on line 544.
Two regression tests added in tests/test_litellm/ covering both code paths with mocked HTTP sessions — no real network calls.

Confidence Score: 5/5

Safe to merge — the fix is correct, well-explained, and regression-tested. Only a trivial dead else branch remains.

The root cause is clearly identified and the two-path fix (direct return vs. position-correct splicing) is correct. Both code paths are covered by new mock-based regression tests that assert against the exact garbled-output scenario. Prior concerns (dead anon_item_by_entity dict, reverse token numbering) have been addressed. The sole remaining finding is a dead else branch that can never execute — a P2 style issue that does not affect correctness.

No files require special attention.

Important Files Changed

Filename	Overview
litellm/proxy/guardrails/guardrail_hooks/presidio.py	Core bug fix: `anonymize_text` now returns Presidio's own output directly when `output_parse_pii=False`, and uses `analyze_results` (original-text positions) instead of `redacted_text["items"]` (anonymized-text positions) when `output_parse_pii=True`. Fix is correct; only a dead `else` branch remains.
tests/test_litellm/proxy/guardrails/guardrail_hooks/test_presidio.py	Adds two focused regression tests for issue #24160 — one for each code path (`output_parse_pii=False/True`). Both use mocked HTTP sessions (no network calls), correctly assert against the garbled-output scenario, and verify that numbered tokens map to the right original PII values.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[anonymize_text called] --> B{analyze_results empty?}
    B -- Yes --> C[return original text]
    B -- No --> D[POST to Presidio /anonymize]
    D --> E{redacted_text is None?}
    E -- Yes --> F[raise Exception]
    E -- No --> G{output_parse_pii?}
    G -- False --> H[iterate redacted_text items\nupdate masked_entity_count]
    H --> I[return redacted_text text directly\n✅ correct Presidio output]
    G -- True --> J[sort analyze_results forward\nbuild seq_map with original positions]
    J --> K[iterate reversed sorted_forward\napply replacements to new_text\nusing ORIGINAL text positions]
    K --> L[store pii_tokens mapping\nnumbered tokens → original values]
    L --> M[return new_text with numbered tokens]

_{Reviews (3): Last reviewed commit: "address review: remove dead code, fix to..." | Re-trigger Greptile}

- Remove unused `anon_item_by_entity` dict (Greptile P2) - Number tokens left-to-right (<PERSON_1> first in text, not last) - Add assertion for token numbering order in test

codecov · 2026-04-02T14:48:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Dmitry-Kucher · 2026-04-03T10:46:21Z

@greptileai review

Dmitry-Kucher · 2026-04-14T14:15:16Z

Hi @ishaan-jaff @krrish-berri-2 — this PR was merged into litellm_oss_staging_04_04_2026 on April 5, but that staging branch was never merged into main (it diverged by 336 commits and has since been deleted).

As a result, this security fix for PII data leaking through the Presidio guardrail is not present in main or in any current staging branch (litellm_oss_staging_04_13_2026_p1 does not include it either).

Could you either:

Cherry-pick commit fc75380 into main or the current staging branch, or
Let me know if I should open a new PR targeting main directly?

This is a data-safety issue — without this fix, Presidio anonymization produces corrupted output with PII fragments leaking through due to offset mismatches. Happy to help however needed.

vercel bot deployed to Preview April 2, 2026 14:22 View deployment

greptile-apps bot reviewed Apr 2, 2026

View reviewed changes

Comment thread litellm/proxy/guardrails/guardrail_hooks/presidio.py Outdated

Comment thread litellm/proxy/guardrails/guardrail_hooks/presidio.py Outdated

address review: remove dead code, fix token numbering order

1112791

- Remove unused `anon_item_by_entity` dict (Greptile P2) - Number tokens left-to-right (<PERSON_1> first in text, not last) - Add assertion for token numbering order in test

vercel bot deployed to Preview April 2, 2026 14:32 View deployment

Dmitry-Kucher mentioned this pull request Apr 3, 2026

Presidio output_parse_pii garbles non-ASCII text (byte/char offset mismatch) #24160

Closed

krrish-berri-2 changed the base branch from main to litellm_oss_staging_04_04_2026 April 5, 2026 01:46

krrish-berri-2 merged commit fc75380 into BerriAI:litellm_oss_staging_04_04_2026 Apr 5, 2026
56 of 62 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(presidio): use correct text positions in anonymize_text#24998

fix(presidio): use correct text positions in anonymize_text#24998
krrish-berri-2 merged 2 commits intoBerriAI:litellm_oss_staging_04_04_2026from
Dmitry-Kucher:fix/presidio-anonymizer-offset-bug

Dmitry-Kucher commented Apr 2, 2026 •

edited

Loading

Uh oh!

vercel bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

Dmitry-Kucher commented Apr 2, 2026

Uh oh!

CLAassistant commented Apr 2, 2026 •

edited

Loading

Uh oh!

codspeed-hq bot commented Apr 2, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Apr 2, 2026 •

edited

Loading

Important Files Changed

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Apr 2, 2026

Uh oh!

Dmitry-Kucher commented Apr 3, 2026

Uh oh!

Uh oh!

Dmitry-Kucher commented Apr 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Dmitry-Kucher commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Relevant issues

What this PR does

Pre-Submission checklist

Type

Changes

Uh oh!

vercel bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Dmitry-Kucher commented Apr 2, 2026

Uh oh!

CLAassistant commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codspeed-hq bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Uh oh!

greptile-apps bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

codecov bot commented Apr 2, 2026

Codecov Report

Uh oh!

Dmitry-Kucher commented Apr 3, 2026

Uh oh!

Uh oh!

Dmitry-Kucher commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Dmitry-Kucher commented Apr 2, 2026 •

edited

Loading

vercel bot commented Apr 2, 2026 •

edited

Loading

CLAassistant commented Apr 2, 2026 •

edited

Loading

codspeed-hq bot commented Apr 2, 2026 •

edited

Loading

greptile-apps bot commented Apr 2, 2026 •

edited

Loading

Dmitry-Kucher commented Apr 14, 2026 •

edited

Loading