PII/PHI detection and redaction solution accelerator for Databricks.
Disclaimer: This is a Databricks Solution Accelerator -- a starting point to accelerate your project. dbxredact is high quality and fully functioning end-to-end, but you should evaluate, test, and modify this code for your specific use case. Detection and redaction results will vary depending on your data and configuration.
dbxredact provides tools for detecting, evaluating, and redacting Protected Health Information (PHI) and Personally Identifiable Information (PII) in text data on Databricks.
- Multiple Detection Methods: Presidio (rule-based), AI Query (LLM-based), and GLiNER (NER-based)
- Multi-language Support: the AI Query and rule-based approaches also support Spanish
- Entity Alignment: Combine results from multiple detection methods with confidence scoring
- Flexible Redaction: Generic (`[REDACTED]`) or typed (`[PERSON]`, `[EMAIL]`) strategies
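To make the two strategies concrete, span-based redaction can be sketched in a few lines of plain Python. This is an illustration of the idea, not the library's implementation, and `redact_spans` is a hypothetical name:

```python
def redact_spans(text, entities, strategy="typed"):
    # Replace spans from right to left so earlier character offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{e['entity_type']}]" if strategy == "typed" else "[REDACTED]"
        text = text[:e["start"]] + label + text[e["end"]:]
    return text

text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
print(redact_spans(text, entities, strategy="generic"))
# Patient [REDACTED] (SSN: [REDACTED]) visited on 2024-01-15.
print(redact_spans(text, entities, strategy="typed"))
# Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15.
```

The typed strategy preserves what kind of information was removed, which is often useful for downstream analytics; the generic strategy minimizes information leakage.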
- Configure environment:

  ```bash
  cp example.env dev.env
  ```

  Edit `dev.env`:

  ```bash
  DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
  CATALOG=your_catalog
  SCHEMA=redaction
  ```

- Deploy:

  ```bash
  ./deploy.sh dev
  ```

  This builds the wheel, uploads it to a volume, and deploys the Databricks Asset Bundle.

- Run the redaction pipeline:

  ```bash
  databricks bundle run redaction_pipeline -t dev \
    --notebook-params source_table=catalog.schema.source_table,text_column=text,output_table=catalog.schema.redacted_output
  ```
- Clone to a Databricks Git Folder
- Install dbxredact (choose one):

  ```python
  # From GitHub directly
  %pip install git+https://github.com/databricks-industry-solutions/dbxredact.git

  # Or download the wheel from releases, upload it to a volume, then:
  %pip install /Volumes/your_catalog/your_schema/wheels/dbxredact-<version>-py3-none-any.whl
  ```

- Open `notebooks/4_redaction_pipeline.py`, configure the widgets, and run
| Method | Description | Use Case |
|---|---|---|
| Presidio | Rule-based with spaCy NLP | Fast, deterministic, no API calls |
| AI Query | LLM-based via Databricks endpoints | Context-aware, complex patterns |
| GLiNER | NER with HuggingFace models | Biomedical focus, GPU acceleration |
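When more than one of these methods runs on the same text, their spans have to be reconciled into a single entity list (the Entity Alignment feature above). A minimal sketch of overlap-based alignment over character offsets — illustrative only, with a hypothetical `align_entities` name, not dbxredact's actual algorithm:

```python
def align_entities(detections):
    # Sort by start offset, then merge any span that overlaps the previous one.
    merged = []
    for d in sorted(detections, key=lambda d: d["start"]):
        if merged and d["start"] < merged[-1]["end"]:  # overlaps previous span
            prev = merged[-1]
            prev["end"] = max(prev["end"], d["end"])
            prev["sources"].add(d["source"])
            prev["score"] = max(prev["score"], d["score"])  # keep the best score
        else:
            merged.append({**d, "sources": {d["source"]}})
    return merged

detections = [
    {"start": 8, "end": 18, "entity_type": "PERSON", "score": 0.85, "source": "presidio"},
    {"start": 8, "end": 18, "entity_type": "PERSON", "score": 0.90, "source": "gliner"},
    {"start": 25, "end": 36, "entity_type": "US_SSN", "score": 0.99, "source": "ai_query"},
]
aligned = align_entities(detections)
# Two merged spans; the PERSON span is backed by both presidio and gliner.
```

A real alignment step would also resolve conflicting entity types and weight detector agreement into the confidence score; the point here is only the span-merging mechanic.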
| Notebook | Description |
|---|---|
| `4_redaction_pipeline.py` | End-to-end detection and redaction |
| `1_benchmarking_detection.py` | Run detection benchmarks (requires benchmark data) |
| `2_benchmarking_evaluation.py` | Evaluate detection performance (requires benchmark data) |
| `3_benchmarking_redaction.py` | Apply redaction to results (requires benchmark data) |
Note: Notebooks 1-3 are benchmarking tools that require external evaluation data (e.g. the JSL benchmark dataset). This data is not included in the repository because it is not synthetic. To use these notebooks, supply your own labeled evaluation dataset and update the widget defaults accordingly.
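Evaluation against a labeled dataset typically reduces to span-level precision and recall. A self-contained sketch of exact-match scoring — a hypothetical helper, not the evaluation notebooks' actual code or schema:

```python
def span_prf(gold, predicted):
    # Exact-match span scoring: a prediction counts only if (start, end, type) all match.
    gold_set = {(g["start"], g["end"], g["entity_type"]) for g in gold}
    pred_set = {(p["start"], p["end"], p["entity_type"]) for p in predicted}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"start": 8, "end": 18, "entity_type": "PERSON"},
        {"start": 25, "end": 36, "entity_type": "US_SSN"}]
pred = [{"start": 8, "end": 18, "entity_type": "PERSON"}]
print(span_prf(gold, pred))  # precision 1.0, recall 0.5
```

For PHI redaction, recall is usually the metric that matters most: a missed entity is a leak, while a spurious redaction only costs utility.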
```python
from dbxredact import run_detection_pipeline

result_df = run_detection_pipeline(
    spark=spark,
    source_df=source_df,
    doc_id_column="doc_id",
    text_column="text",
    use_presidio=True,
    use_ai_query=True,
    endpoint="databricks-gpt-oss-120b"
)
```

```python
from dbxredact import run_redaction_pipeline

result_df = run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.medical_notes",
    text_column="note_text",
    output_table="catalog.schema.medical_notes_redacted",
    redaction_strategy="typed"  # or "generic"
)
```

```python
from dbxredact import redact_text

text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
result = redact_text(text, entities, strategy="typed")
# "Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15."
```

```
dbxredact/
    databricks.yml.template   # DAB config template
    deploy.sh                 # Build and deploy script
    pyproject.toml            # Poetry dependencies
    src/dbxredact/            # Core library
    notebooks/                # Databricks notebooks
    tests/                    # Unit and integration tests
```
```bash
pytest tests/ -v
```

Use an ML cluster; the pipeline does not currently work on serverless compute. GLiNER models benefit from GPU acceleration, but the other detection methods do not.
| Library | Version | License | Description | PyPI |
|---|---|---|---|---|
| presidio-analyzer | 2.2.358 | MIT | Microsoft Presidio PII detection engine | PyPI |
| presidio-anonymizer | 2.2.358 | MIT | Microsoft Presidio anonymization engine | PyPI |
| spacy | 3.8.7 | MIT | Industrial-strength NLP library | PyPI |
| gliner | >=0.1.0 | Apache 2.0 | Generalist NER using bidirectional transformers | PyPI |
| rapidfuzz | >=3.0.0 | MIT | Fast fuzzy string matching | PyPI |
| pydantic | >=2.0.0 | MIT | Data validation using Python type hints | PyPI |
| pyyaml | >=6.0.1 | MIT | YAML parser and emitter | PyPI |
| databricks-sdk | >=0.30.0 | Apache 2.0 | Databricks SDK for Python | PyPI |
| Model | License | Description | Install |
|---|---|---|---|
| en_core_web_sm | MIT | English NLP model | spaCy Models |
| es_core_news_sm | MIT | Spanish NLP model | spaCy Models |
| Library | License | Description |
|---|---|---|
| pandas | BSD-3-Clause | Data manipulation library |
| pyspark | Apache 2.0 | Apache Spark Python API |
| pyarrow | Apache 2.0 | Apache Arrow Python bindings |
All dependencies use permissive open-source licenses (MIT, Apache 2.0, BSD-3-Clause). No copyleft (GPL) dependencies.
This is a solution accelerator -- it provides tooling to assist with PII/PHI detection and redaction, but all compliance obligations remain with the user. This includes but is not limited to:
- HIPAA: You are responsible for ensuring your deployment meets HIPAA requirements (encryption, access controls, audit logging, BAAs, etc.)
- GDPR, CCPA, and other privacy regulations: Evaluate whether your use of this tool satisfies applicable data protection laws
- Validation: You must verify that redaction results are complete and accurate for your specific data and use case
- Data Encryption: Enable encryption at rest and in transit in your Databricks workspace
- Access Controls: Configure appropriate table/catalog permissions in Unity Catalog
- Audit Logging: Enable workspace audit logs for compliance tracking
Databricks makes no guarantees that use of this tool alone is sufficient for regulatory compliance.