Skip to content

fix(presidio): use correct text positions in anonymize_text#24998

Merged
krrish-berri-2 merged 2 commits intoBerriAI:litellm_oss_staging_04_04_2026from
Dmitry-Kucher:fix/presidio-anonymizer-offset-bug
Apr 5, 2026
Merged

fix(presidio): use correct text positions in anonymize_text#24998
krrish-berri-2 merged 2 commits intoBerriAI:litellm_oss_staging_04_04_2026from
Dmitry-Kucher:fix/presidio-anonymizer-offset-bug

Conversation

@Dmitry-Kucher
Copy link
Copy Markdown
Contributor

@Dmitry-Kucher Dmitry-Kucher commented Apr 2, 2026

Relevant issues

Fixes #24160

What this PR does

The Presidio /anonymize endpoint returns items[] with start/end positions that reference the anonymized output text, not the original input text. anonymize_text() was applying these positions to the original text, producing garbled output with remnants of un-masked PII data.

Example (before fix):

Input:    "My name is John Smith, my email is john@example.com, phone 555-867-5309"
Expected: "My name is <PERSON>, my email is <EMAIL_ADDRESS>, phone <PHONE_NUMBER>"
Actual:   "My name is <PERSON>th, my email i<EMAIL_ADDRESS>com, pho<PHONE_NUMBER>9"
                      ^^              ^                ^^^     ^^^^^^^^^^^^^^^ ^
                      PII remnants leaking through due to offset mismatch

Root cause:
The anonymizer returns item positions relative to its own output (e.g. <PERSON> starts at char 11 in the anonymized text), but the code used these offsets to splice the original text where John Smith starts at char 11 but is 10 chars long vs 8 for <PERSON> — a 2-char shift that cascades through all subsequent entities.

Fix (two paths):

  1. output_parse_pii=False — return redacted_text["text"] directly from the anonymizer (it's already correctly masked)
  2. output_parse_pii=True — use analyze_results positions (which reference the original text) to build numbered replacement tokens and the pii_tokens mapping

Pre-Submission checklist

  • I have Added testing in the tests/test_litellm/ directory, Adding at least 1 test is a hard requirement - see details
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem
  • I have requested a Greptile review by commenting @greptileai and received a Confidence Score of at least 4/5 before requesting a maintainer review

Type

🐛 Bug Fix

Changes

  • litellm/proxy/guardrails/guardrail_hooks/presidio.pyanonymize_text() method:
    • When output_parse_pii=False: return redacted_text["text"] directly instead of manual splicing with wrong positions
    • When output_parse_pii=True: iterate analyze_results (original-text positions) instead of redacted_text["items"] (output-text positions) for building numbered tokens
  • tests/test_litellm/proxy/guardrails/guardrail_hooks/test_presidio.py — added 2 regression tests:
    • test_anonymize_text_uses_correct_positions_no_parse_pii — verifies clean masking without PII remnants
    • test_anonymize_text_uses_correct_positions_with_parse_pii — verifies numbered tokens map to correct original PII values

…24160)

The Presidio anonymizer endpoint returns items with start/end positions
that reference the *anonymized output* text, not the original input.
anonymize_text() was applying these positions to the original text,
causing garbled output with remnants of un-masked PII data.

When output_parse_pii is False, return redacted_text["text"] directly
from the anonymizer response instead of manually splicing.

When output_parse_pii is True, use analyze_results positions (which
correctly reference the original text) to build numbered replacement
tokens and the pii_tokens mapping.
@vercel
Copy link
Copy Markdown

vercel bot commented Apr 2, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
litellm Ready Ready Preview, Comment Apr 2, 2026 2:32pm

Request Review

@Dmitry-Kucher
Copy link
Copy Markdown
Contributor Author

@greptileai review

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 2, 2026

CLA assistant check
All committers have signed the CLA.

@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq bot commented Apr 2, 2026

Merging this PR will not alter performance

✅ 16 untouched benchmarks


Comparing Dmitry-Kucher:fix/presidio-anonymizer-offset-bug (1112791) with main (d1df4e8)

Open in CodSpeed

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps bot commented Apr 2, 2026

Greptile Summary

This PR fixes a correctness bug in _OPTIONAL_PresidioPIIMasking.anonymize_text() where offset positions returned by the Presidio /anonymize endpoint (which reference the anonymized output text) were incorrectly applied to the original input text, producing garbled output with residual PII fragments leaking through.

Key changes:

  • output_parse_pii=False path: returns redacted_text[\"text\"] directly from Presidio instead of manually splicing the original text with mismatched offsets.
  • output_parse_pii=True path: builds numbered tokens using positions from analyze_results (which reference the original input text) instead of redacted_text[\"items\"] (which reference the anonymized output). Sequential token numbering (<PERSON_1>, <EMAIL_ADDRESS_2>, …) is now assigned in left-to-right order of appearance in the original text.
  • Previously flagged dead code (anon_item_by_entity dict) has been removed; a minor dead else branch remains on line 544.
  • Two regression tests added in tests/test_litellm/ covering both code paths with mocked HTTP sessions — no real network calls.

Confidence Score: 5/5

Safe to merge — the fix is correct, well-explained, and regression-tested. Only a trivial dead else branch remains.

The root cause is clearly identified and the two-path fix (direct return vs. position-correct splicing) is correct. Both code paths are covered by new mock-based regression tests that assert against the exact garbled-output scenario. Prior concerns (dead anon_item_by_entity dict, reverse token numbering) have been addressed. The sole remaining finding is a dead else branch that can never execute — a P2 style issue that does not affect correctness.

No files require special attention.

Important Files Changed

Filename Overview
litellm/proxy/guardrails/guardrail_hooks/presidio.py Core bug fix: anonymize_text now returns Presidio's own output directly when output_parse_pii=False, and uses analyze_results (original-text positions) instead of redacted_text["items"] (anonymized-text positions) when output_parse_pii=True. Fix is correct; only a dead else branch remains.
tests/test_litellm/proxy/guardrails/guardrail_hooks/test_presidio.py Adds two focused regression tests for issue #24160 — one for each code path (output_parse_pii=False/True). Both use mocked HTTP sessions (no network calls), correctly assert against the garbled-output scenario, and verify that numbered tokens map to the right original PII values.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[anonymize_text called] --> B{analyze_results empty?}
    B -- Yes --> C[return original text]
    B -- No --> D[POST to Presidio /anonymize]
    D --> E{redacted_text is None?}
    E -- Yes --> F[raise Exception]
    E -- No --> G{output_parse_pii?}
    G -- False --> H[iterate redacted_text items\nupdate masked_entity_count]
    H --> I[return redacted_text text directly\n✅ correct Presidio output]
    G -- True --> J[sort analyze_results forward\nbuild seq_map with original positions]
    J --> K[iterate reversed sorted_forward\napply replacements to new_text\nusing ORIGINAL text positions]
    K --> L[store pii_tokens mapping\nnumbered tokens → original values]
    L --> M[return new_text with numbered tokens]
Loading

Reviews (3): Last reviewed commit: "address review: remove dead code, fix to..." | Re-trigger Greptile

Comment thread litellm/proxy/guardrails/guardrail_hooks/presidio.py Outdated
Comment thread litellm/proxy/guardrails/guardrail_hooks/presidio.py Outdated
- Remove unused `anon_item_by_entity` dict (Greptile P2)
- Number tokens left-to-right (<PERSON_1> first in text, not last)
- Add assertion for token numbering order in test
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@Dmitry-Kucher
Copy link
Copy Markdown
Contributor Author

@greptileai review

@krrish-berri-2 krrish-berri-2 changed the base branch from main to litellm_oss_staging_04_04_2026 April 5, 2026 01:46
@krrish-berri-2 krrish-berri-2 merged commit fc75380 into BerriAI:litellm_oss_staging_04_04_2026 Apr 5, 2026
56 of 62 checks passed
@Dmitry-Kucher
Copy link
Copy Markdown
Contributor Author

Dmitry-Kucher commented Apr 14, 2026

Hi @ishaan-jaff @krrish-berri-2 — this PR was merged into litellm_oss_staging_04_04_2026 on April 5, but that staging branch was never merged into main (it diverged by 336 commits and has since been deleted).

As a result, this security fix for PII data leaking through the Presidio guardrail is not present in main or in any current staging branch (litellm_oss_staging_04_13_2026_p1 does not include it either).

Could you either:

  1. Cherry-pick commit fc75380 into main or the current staging branch, or
  2. Let me know if I should open a new PR targeting main directly?

This is a data-safety issue — without this fix, Presidio anonymization produces corrupted output with PII fragments leaking through due to offset mismatches. Happy to help however needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Presidio output_parse_pii garbles non-ASCII text (byte/char offset mismatch)

3 participants