A secure, serverless file vault deployed on Google Cloud that enforces strict per-user isolation, irreversible PII redaction, and intelligent data extraction using Vertex AI (Gemini).
This project is a strict implementation of the Technical Assessment Spec for building a secure File Vault. It ensures that sensitive PII (Personally Identifiable Information) is visually and structurally removed from documents before they are permanently stored or analyzed.
Core Philosophy:

- Zero Trust Redaction: PII is removed via rasterization (converting the PDF to images) and DLP masking.
- Least Privilege: The extraction AI (Gemini) never sees the original file, only the redacted artifact.
- Strict Isolation: User data is isolated using a "Broker" pattern and unique GCS paths.
The solution uses a Serverless Monolith approach on Cloud Run to minimize complexity while leveraging fully managed Google Cloud services.
- Frontend: React (Vite) Single Page Application.
- Backend: FastAPI (Python).
- Storage: Google Cloud Storage (Quarantine & Vault buckets).
- Sanitization: Google Cloud DLP (Data Loss Prevention) + `pdf2image` (rasterization).
- Extraction: Vertex AI (Gemini 2.0 Flash).
- Database: PostgreSQL (Cloud SQL) or SQLite (Demo).
- Secure Upload: User uploads a PDF → `[project]-quarantine` bucket (Raw).
- Irreversible Redaction:
  - The PDF is converted to images (destroying metadata/text layers).
  - Cloud DLP identifies PII (SSN, Name, Address).
  - Black rectangles are drawn over the PII.
  - The images are re-assembled into a new "Redacted" PDF.
- Human Approval: User views the redacted preview via a signed URL.
- Vault Storage:
  - If Approved: The redacted file moves to `[project]-vault`. The original file is immediately deleted.
  - If Rejected: Both files are deleted.
- AI Extraction: Vertex AI reads the redacted file from the Vault and extracts 5 specific tax fields.
- Persistence: The 5 extracted fields + `user_id` are written to the SQL database.
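The approve/reject branch of the workflow above can be sketched with an in-memory stand-in for the two buckets. The function name and dict-based "buckets" are illustrative only; the real service talks to GCS through the Storage API.

```python
# Minimal sketch of the Vault Storage decision. The dicts stand in for the
# GCS quarantine and vault buckets; in the deployed service these would be
# google-cloud-storage bucket clients.

def finalize(decision: str, doc_id: str, user_id: str,
             quarantine: dict, vault: dict) -> None:
    raw_key = f"{user_id}/{doc_id}.pdf"                # original upload
    redacted_key = f"{user_id}/{doc_id}_redacted.pdf"  # rasterized artifact

    if decision == "approve":
        # The redacted artifact moves to the vault...
        vault[redacted_key] = quarantine.pop(redacted_key)
        # ...and the raw original is deleted, never copied anywhere.
        del quarantine[raw_key]
    elif decision == "reject":
        # Both artifacts are destroyed.
        quarantine.pop(raw_key, None)
        quarantine.pop(redacted_key, None)
    else:
        raise ValueError(f"unknown decision: {decision}")
```

Note that on approval only the redacted key ever reaches the vault dict; the raw key is popped and discarded, mirroring the "Original file is immediately deleted" guarantee.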
- Google Cloud Project with Billing Enabled.
- `gcloud` CLI installed and authenticated (`gcloud auth login`).
The deploy.sh script automates the entire process: enabling APIs, creating buckets, setting up IAM, building the frontend, and deploying to Cloud Run.
```bash
# 1. Set your Project ID
export GOOGLE_CLOUD_PROJECT=your-project-id

# 2. Run Deployment
./deploy.sh
```
What happens?
- Access the live application via the Service URL printed at the end of the script.
You can run the entire stack locally using Docker Compose. We mock Google Cloud services to avoid needing active credentials for local UI testing.
```bash
docker compose up --build
```

- Frontend: http://localhost:5173
- Backend API: http://localhost:8080/docs
We guarantee PII is not recoverable because we rasterize the PDF. The process converts vector text into flat pixels. The redaction is not just a "layer" on top; the pixels themselves are overwritten with black rectangles before being saved as a new image-based PDF. Hidden metadata and text layers are destroyed during the conversion.
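The "pixels themselves are overwritten" claim can be illustrated with a toy grayscale page represented as a grid of integers. The real pipeline does the same thing at scale via `pdf2image` and image drawing; this stdlib-only sketch just shows why the original text is unrecoverable after redaction.

```python
# Toy illustration of destructive redaction: a page is a grid of pixel
# values, and redaction overwrites a rectangle with black (0) in place.
# The original values inside the box are gone, not hidden under a layer.

def redact_region(page: list[list[int]],
                  top: int, left: int, bottom: int, right: int) -> None:
    """Overwrite the box [top:bottom, left:right] with black pixels."""
    for y in range(top, bottom):
        for x in range(left, right):
            page[y][x] = 0  # 0 = black; the prior pixel value is destroyed

page = [[255] * 6 for _ in range(4)]  # 4x6 all-white "page"
redact_region(page, 1, 2, 3, 5)       # black out a 2x3 region
```

Because the new image-based PDF is built from these flattened pixels, there is no text layer or annotation to peel back.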
- Storage: Files are stored at GCS paths `gs://bucket/{user_id}/{doc_id}.pdf`.
- Logic: The backend enforces `X-User-ID` checks. A user can only access blobs nested under their own User ID folder.
- Database: SQL queries are always filtered by `WHERE user_id = :user_id`.
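The broker check can be sketched in a few lines. The function names and exception type here are hypothetical; the actual backend wires an equivalent guard into its request handling.

```python
# Sketch of the per-user path guard: every blob key must live under the
# caller's own prefix, derived from the X-User-ID header.

class Forbidden(Exception):
    """Raised when a caller requests a blob outside their own folder."""

def authorized_blob_path(x_user_id: str, doc_id: str) -> str:
    """Build the only GCS path this caller may touch for doc_id."""
    return f"{x_user_id}/{doc_id}.pdf"

def assert_owner(x_user_id: str, blob_path: str) -> None:
    """Reject access to any blob not nested under the caller's user_id."""
    if not blob_path.startswith(f"{x_user_id}/"):
        raise Forbidden(f"user {x_user_id} may not access {blob_path}")
```

The trailing slash in the prefix check matters: without it, user `u1` could match paths owned by a user `u1evil`.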
- Lifecycle: The `quarantine` bucket has a 1-hour lifecycle policy to auto-delete orphaned files.
- Atomic Move: Upon approval, the raw file is explicitly deleted via the Storage API. It is never moved to the Vault.
- Database: The SQL model strictly defines only the 5 tax fields. No metadata, filenames, or raw text blobs are stored in the database.
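The data-minimization rule can be sketched against the demo SQLite backend. The five column names below are placeholders, not the spec's actual fields, and the table/function names are illustrative.

```python
import sqlite3

# Demo-mode persistence sketch: the table holds user_id plus exactly five
# extracted tax fields and nothing else -- no filename, no raw text blob.
# Column names are illustrative placeholders, not the spec's field names.
SCHEMA = """
CREATE TABLE tax_records (
    user_id TEXT NOT NULL,
    field_1 TEXT, field_2 TEXT, field_3 TEXT, field_4 TEXT, field_5 TEXT
)
"""

def save_record(conn: sqlite3.Connection, user_id: str,
                fields: list[str]) -> None:
    if len(fields) != 5:
        raise ValueError("exactly 5 extracted fields are expected")
    conn.execute(
        "INSERT INTO tax_records VALUES (?, ?, ?, ?, ?, ?)",
        [user_id, *fields],
    )

def records_for(conn: sqlite3.Connection, user_id: str) -> list[tuple]:
    # Every read is filtered by user_id, mirroring WHERE user_id = :user_id.
    return conn.execute(
        "SELECT * FROM tax_records WHERE user_id = ?", (user_id,)
    ).fetchall()
```

Parameterized queries are used throughout, so the `user_id` filter cannot be escaped via injected input.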
- Architecture Diagram: Implemented as code in `deploy.sh` and `docker-compose.yml`.
- Working Deployment: `deploy.sh` pushes a production-ready container to Cloud Run.
- Demo Steps:
  1. Upload `training_file.pdf`.
  2. See the "Review" screen with redacted PII.
  3. Click "Approve".
  4. View the extracted data in the Dashboard.
- Write-up: See the "Security & Compliance Guarantees" section above.