fix(presidio): use correct text positions in anonymize_text#24998
Conversation
…24160) The Presidio anonymizer endpoint returns items with start/end positions that reference the *anonymized output* text, not the original input. anonymize_text() was applying these positions to the original text, causing garbled output with remnants of un-masked PII data. When output_parse_pii is False, return redacted_text["text"] directly from the anonymizer response instead of manually splicing. When output_parse_pii is True, use analyze_results positions (which correctly reference the original text) to build numbered replacement tokens and the pii_tokens mapping.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
@greptileai review |
Greptile SummaryThis PR fixes a correctness bug in Key changes:
Confidence Score: 5/5Safe to merge — the fix is correct, well-explained, and regression-tested. Only a trivial dead The root cause is clearly identified and the two-path fix (direct return vs. position-correct splicing) is correct. Both code paths are covered by new mock-based regression tests that assert against the exact garbled-output scenario. Prior concerns (dead No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/proxy/guardrails/guardrail_hooks/presidio.py | Core bug fix: anonymize_text now returns Presidio's own output directly when output_parse_pii=False, and uses analyze_results (original-text positions) instead of redacted_text["items"] (anonymized-text positions) when output_parse_pii=True. Fix is correct; only a dead else branch remains. |
| tests/test_litellm/proxy/guardrails/guardrail_hooks/test_presidio.py | Adds two focused regression tests for issue #24160 — one for each code path (output_parse_pii=False/True). Both use mocked HTTP sessions (no network calls), correctly assert against the garbled-output scenario, and verify that numbered tokens map to the right original PII values. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[anonymize_text called] --> B{analyze_results empty?}
B -- Yes --> C[return original text]
B -- No --> D[POST to Presidio /anonymize]
D --> E{redacted_text is None?}
E -- Yes --> F[raise Exception]
E -- No --> G{output_parse_pii?}
G -- False --> H[iterate redacted_text items\nupdate masked_entity_count]
H --> I[return redacted_text text directly\n✅ correct Presidio output]
G -- True --> J[sort analyze_results forward\nbuild seq_map with original positions]
J --> K[iterate reversed sorted_forward\napply replacements to new_text\nusing ORIGINAL text positions]
K --> L[store pii_tokens mapping\nnumbered tokens → original values]
L --> M[return new_text with numbered tokens]
Reviews (3): Last reviewed commit: "address review: remove dead code, fix to..." | Re-trigger Greptile
- Remove unused `anon_item_by_entity` dict (Greptile P2) - Number tokens left-to-right (<PERSON_1> first in text, not last) - Add assertion for token numbering order in test
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
@greptileai review |
fc75380
into
BerriAI:litellm_oss_staging_04_04_2026
|
Hi @ishaan-jaff @krrish-berri-2 — this PR was merged into As a result, this security fix for PII data leaking through the Presidio guardrail is not present in Could you either:
This is a data-safety issue — without this fix, Presidio anonymization produces corrupted output with PII fragments leaking through due to offset mismatches. Happy to help however needed. |
Relevant issues
Fixes #24160
What this PR does
The Presidio
/anonymizeendpoint returnsitems[]withstart/endpositions that reference the anonymized output text, not the original input text.anonymize_text()was applying these positions to the original text, producing garbled output with remnants of un-masked PII data.Example (before fix):
Root cause:
The anonymizer returns item positions relative to its own output (e.g.
<PERSON>starts at char 11 in the anonymized text), but the code used these offsets to splice the original text whereJohn Smithstarts at char 11 but is 10 chars long vs 8 for<PERSON>— a 2-char shift that cascades through all subsequent entities.Fix (two paths):
output_parse_pii=False— returnredacted_text["text"]directly from the anonymizer (it's already correctly masked)output_parse_pii=True— useanalyze_resultspositions (which reference the original text) to build numbered replacement tokens and thepii_tokensmappingPre-Submission checklist
tests/test_litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unit@greptileaiand received a Confidence Score of at least 4/5 before requesting a maintainer reviewType
🐛 Bug Fix
Changes
litellm/proxy/guardrails/guardrail_hooks/presidio.py—anonymize_text()method:output_parse_pii=False: returnredacted_text["text"]directly instead of manual splicing with wrong positionsoutput_parse_pii=True: iterateanalyze_results(original-text positions) instead ofredacted_text["items"](output-text positions) for building numbered tokenstests/test_litellm/proxy/guardrails/guardrail_hooks/test_presidio.py— added 2 regression tests:test_anonymize_text_uses_correct_positions_no_parse_pii— verifies clean masking without PII remnantstest_anonymize_text_uses_correct_positions_with_parse_pii— verifies numbered tokens map to correct original PII values