PII/PHI detection and redaction solution accelerator for Databricks.
Disclaimer: This is a Databricks Solution Accelerator -- a starting point to accelerate your project. dbxredact is high quality and fully functioning end-to-end, but you should evaluate, test, and modify this code for your specific use case. Detection and redaction results will vary depending on your data and configuration.
dbxredact provides tools for detecting, evaluating, and redacting Protected Health Information (PHI) and Personally Identifiable Information (PII) in text data on Databricks.
- Multiple Detection Methods: Presidio (rule-based), AI Query (LLM-based), and GLiNER (NER-based)
- Multi-language Support: the AI Query and rule-based approaches also support Spanish
- Entity Alignment: Combine results from multiple detection methods with confidence scoring
- Flexible Redaction: Generic (`[REDACTED]`) or typed (`[PERSON]`, `[EMAIL]`) strategies
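To make the two strategies concrete, span-based redaction can be sketched in a few lines of plain Python. This is an illustration of the idea, not the library's implementation, and `redact_spans` is a hypothetical name:

```python
def redact_spans(text, entities, strategy="typed"):
    # Replace spans from right to left so earlier character offsets stay valid.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = f"[{e['entity_type']}]" if strategy == "typed" else "[REDACTED]"
        text = text[:e["start"]] + label + text[e["end"]:]
    return text

text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
print(redact_spans(text, entities, strategy="generic"))
# Patient [REDACTED] (SSN: [REDACTED]) visited on 2024-01-15.
print(redact_spans(text, entities, strategy="typed"))
# Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15.
```

The typed strategy preserves what kind of information was removed, which is often useful for downstream analytics; the generic strategy minimizes information leakage.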
- Configure environment:

  ```bash
  cp example.env dev.env
  ```

  Edit `dev.env`:

  ```bash
  DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
  CATALOG=your_catalog
  SCHEMA=redaction
  ```

- Deploy:

  ```bash
  ./deploy.sh dev
  ```

  This builds the wheel, uploads it to a volume, and deploys the Databricks Asset Bundle.

- Run the redaction pipeline:

  ```bash
  databricks bundle run redaction_pipeline -t dev \
    --notebook-params source_table=catalog.schema.source_table,text_column=text,output_table=catalog.schema.redacted_output
  ```
- Clone to a Databricks Git Folder
- Install dbxredact (choose one):

  ```python
  # From GitHub directly
  %pip install git+https://github.com/databricks-industry-solutions/dbxredact.git

  # Or download the wheel from releases, upload it to a volume, then:
  %pip install /Volumes/your_catalog/your_schema/wheels/dbxredact-<version>-py3-none-any.whl
  ```

- Open `notebooks/4_redaction_pipeline.py`, configure the widgets, and run
| Method | Description | Use Case |
|---|---|---|
| Presidio | Rule-based with spaCy NLP | Fast, deterministic, no API calls |
| AI Query | LLM-based via Databricks endpoints | Context-aware, complex patterns |
| GLiNER | NER with HuggingFace models | Biomedical focus, GPU acceleration |
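When more than one of these methods runs on the same text, their spans have to be reconciled into a single entity list (the Entity Alignment feature above). A minimal sketch of overlap-based alignment over character offsets — illustrative only, with a hypothetical `align_entities` name, not dbxredact's actual algorithm:

```python
def align_entities(detections):
    # Sort by start offset, then merge any span that overlaps the previous one.
    merged = []
    for d in sorted(detections, key=lambda d: d["start"]):
        if merged and d["start"] < merged[-1]["end"]:  # overlaps previous span
            prev = merged[-1]
            prev["end"] = max(prev["end"], d["end"])
            prev["sources"].add(d["source"])
            prev["score"] = max(prev["score"], d["score"])  # keep the best score
        else:
            merged.append({**d, "sources": {d["source"]}})
    return merged

detections = [
    {"start": 8, "end": 18, "entity_type": "PERSON", "score": 0.85, "source": "presidio"},
    {"start": 8, "end": 18, "entity_type": "PERSON", "score": 0.90, "source": "gliner"},
    {"start": 25, "end": 36, "entity_type": "US_SSN", "score": 0.99, "source": "ai_query"},
]
aligned = align_entities(detections)
# Two merged spans; the PERSON span is backed by both presidio and gliner.
```

A real alignment step would also resolve conflicting entity types and weight detector agreement into the confidence score; the point here is only the span-merging mechanic.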
| Notebook | Description |
|---|---|
| `4_redaction_pipeline.py` | End-to-end detection and redaction |
| `1_benchmarking_detection.py` | Run detection benchmarks (requires benchmark data) |
| `2_benchmarking_evaluation.py` | Evaluate detection performance (requires benchmark data) |
| `3_benchmarking_redaction.py` | Apply redaction to results (requires benchmark data) |
Note: Notebooks 1-3 are benchmarking tools that require external evaluation data (e.g. the JSL benchmark dataset). This data is not included in the repository because it is not synthetic. To use these notebooks, supply your own labeled evaluation dataset and update the widget defaults accordingly.
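Evaluation against a labeled dataset typically reduces to span-level precision and recall. A self-contained sketch of exact-match scoring — a hypothetical helper, not the evaluation notebooks' actual code or schema:

```python
def span_prf(gold, predicted):
    # Exact-match span scoring: a prediction counts only if (start, end, type) all match.
    gold_set = {(g["start"], g["end"], g["entity_type"]) for g in gold}
    pred_set = {(p["start"], p["end"], p["entity_type"]) for p in predicted}
    tp = len(gold_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [{"start": 8, "end": 18, "entity_type": "PERSON"},
        {"start": 25, "end": 36, "entity_type": "US_SSN"}]
pred = [{"start": 8, "end": 18, "entity_type": "PERSON"}]
print(span_prf(gold, pred))  # precision 1.0, recall 0.5
```

For PHI redaction, recall is usually the metric that matters most: a missed entity is a leak, while a spurious redaction only costs utility.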
```python
from dbxredact import run_detection_pipeline

result_df = run_detection_pipeline(
    spark=spark,
    source_df=source_df,
    doc_id_column="doc_id",
    text_column="text",
    use_presidio=True,
    use_ai_query=True,
    endpoint="databricks-gpt-oss-120b"
)
```

```python
from dbxredact import run_redaction_pipeline

result_df = run_redaction_pipeline(
    spark=spark,
    source_table="catalog.schema.medical_notes",
    text_column="note_text",
    output_table="catalog.schema.medical_notes_redacted",
    redaction_strategy="typed"  # or "generic"
)
```

```python
from dbxredact import redact_text

text = "Patient John Smith (SSN: 123-45-6789) visited on 2024-01-15."
entities = [
    {"entity": "John Smith", "start": 8, "end": 18, "entity_type": "PERSON"},
    {"entity": "123-45-6789", "start": 25, "end": 36, "entity_type": "US_SSN"},
]
result = redact_text(text, entities, strategy="typed")
# "Patient [PERSON] (SSN: [US_SSN]) visited on 2024-01-15."
```

```
dbxredact/
    databricks.yml.template   # DAB config template
    deploy.sh                 # Build and deploy script
    pyproject.toml            # Poetry dependencies
    src/dbxredact/            # Core library
    notebooks/                # Databricks notebooks
    tests/                    # Unit and integration tests
```
```bash
pytest tests/ -v
```

Use an ML cluster; the pipeline does not currently work on serverless compute. GLiNER models benefit from GPU acceleration, but the other detection methods do not.
| Library | Version | License | Description | PyPI |
|---|---|---|---|---|
| presidio-analyzer | 2.2.358 | MIT | Microsoft Presidio PII detection engine | PyPI |
| presidio-anonymizer | 2.2.358 | MIT | Microsoft Presidio anonymization engine | PyPI |
| spacy | 3.8.7 | MIT | Industrial-strength NLP library | PyPI |
| gliner | >=0.1.0 | Apache 2.0 | Generalist NER using bidirectional transformers | PyPI |
| rapidfuzz | >=3.0.0 | MIT | Fast fuzzy string matching | PyPI |
| pydantic | >=2.0.0 | MIT | Data validation using Python type hints | PyPI |
| pyyaml | >=6.0.1 | MIT | YAML parser and emitter | PyPI |
| databricks-sdk | >=0.30.0 | Apache 2.0 | Databricks SDK for Python | PyPI |
| Model | License | Description | Install |
|---|---|---|---|
| en_core_web_sm | MIT | English NLP model | spaCy Models |
| es_core_news_sm | MIT | Spanish NLP model | spaCy Models |
| Library | License | Description |
|---|---|---|
| pandas | BSD-3-Clause | Data manipulation library |
| pyspark | Apache 2.0 | Apache Spark Python API |
| pyarrow | Apache 2.0 | Apache Arrow Python bindings |
All dependencies use permissive open-source licenses (MIT, Apache 2.0, BSD-3-Clause). No copyleft (GPL) dependencies.
This is a solution accelerator -- it provides tooling to assist with PII/PHI detection and redaction, but all compliance obligations remain with the user. This includes but is not limited to:
- HIPAA: You are responsible for ensuring your deployment meets HIPAA requirements (encryption, access controls, audit logging, BAAs, etc.)
- GDPR, CCPA, and other privacy regulations: Evaluate whether your use of this tool satisfies applicable data protection laws
- Validation: You must verify that redaction results are complete and accurate for your specific data and use case
- Data Encryption: Enable encryption at rest and in transit in your Databricks workspace
- Access Controls: Configure appropriate table/catalog permissions in Unity Catalog
- Audit Logging: Enable workspace audit logs for compliance tracking
Databricks makes no guarantees that use of this tool alone is sufficient for regulatory compliance.