A Python-based Robotic Process Automation (RPA) bot for automatically processing PDF invoices. Extracts key data, validates information, and generates reports.
- PDF Text Extraction: Uses pdfplumber for direct text extraction
- OCR Support: Falls back to EasyOCR for scanned/image-based PDFs
- Data Validation: Validates dates, amounts, and vendor information
- GUI: Simple Tkinter interface for folder selection and processing
- CSV Reporting: Generates structured CSV reports of extracted data
- Email Alerts: Sends summary emails when validation thresholds are exceeded
- Configurable: Uses YAML config file for easy customization
- Logging: Comprehensive logging to files and console
- Docker Ready: Includes Docker configuration for containerized deployment
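The extraction and OCR fallback described above could look roughly like the sketch below. It is not the bot's actual code: the function names and the `min_chars` cutoff are assumptions for illustration. Only `pdfplumber.open`, `page.extract_text`, `page.to_image` and `easyocr.Reader.readtext` are real library calls.

```python
def extract_with_pdfplumber(pdf_path):
    """Return the embedded text layer of every page, joined by newlines."""
    import pdfplumber  # imported lazily so the fallback logic runs without it

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None or "" for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_with_easyocr(pdf_path, languages=("en",)):
    """Render each page to an image and run EasyOCR on it."""
    import easyocr
    import numpy as np
    import pdfplumber

    reader = easyocr.Reader(list(languages))
    lines = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Render at 300 DPI, matching the scan quality recommended for OCR
            image = page.to_image(resolution=300).original.convert("RGB")
            # detail=0 makes readtext return plain strings instead of boxes
            lines.extend(reader.readtext(np.array(image), detail=0))
    return "\n".join(lines)


def extract_text(pdf_path, direct=extract_with_pdfplumber,
                 ocr=extract_with_easyocr, min_chars=20):
    """Try direct extraction first; fall back to OCR if too little text comes back.

    min_chars is a hypothetical cutoff for "this PDF has no usable text layer".
    Returns a (text, method) tuple where method is "text" or "ocr".
    """
    text = direct(pdf_path)
    if len(text.strip()) >= min_chars:
        return text, "text"
    return ocr(pdf_path), "ocr"
```

Passing the two extractors as parameters keeps the fallback decision separate from the libraries, so it can be exercised with stub functions.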
- Python 3.10+
- pip (Python package installer)
- Clone or download the project
- Navigate to the project directory
- Install dependencies:
pip install -r requirements.txt
Run the GUI application:
python run_bot.py
This launches a simple GUI where you can:
- Select an input folder containing PDF invoices
- Click "Start Processing" to process all PDFs
- View progress and results
Edit config.yaml to customize:
- Input/output directories
- Email settings (sender, recipient, SMTP credentials)
- OCR settings (languages, retry counts)
- Regex patterns for data extraction
- Validation thresholds
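One way to read such a config is to overlay it on built-in defaults, so a partial config.yaml still yields a complete settings dict. The sketch below assumes hypothetical key names (`input_dir`, `ocr.retries`, `validation.max_failures`); check the shipped config.yaml for the real ones.

```python
import copy

# Hypothetical defaults; the real key names in config.yaml may differ.
DEFAULTS = {
    "input_dir": "samples/invoices",
    "output_dir": "output",
    "ocr": {"languages": ["en"], "retries": 2},
    "validation": {"max_failures": 3},
}


def merge_config(defaults, overrides):
    """Recursively overlay user settings on defaults without mutating either."""
    merged = copy.deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them whole
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def load_config(path="config.yaml"):
    """Load the YAML file and fill in any missing settings from DEFAULTS."""
    import yaml  # PyYAML, listed in the requirements

    with open(path, encoding="utf-8") as fh:
        # safe_load returns None for an empty file, hence the "or {}"
        return merge_config(DEFAULTS, yaml.safe_load(fh) or {})
```

With this approach, a config file that only sets `ocr.languages` keeps the default retry count rather than dropping it.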
- Make sure you have Tesseract installed if using external OCR tools.
- Scanned PDFs should be at least 300 DPI for best results.
- If the bot skips a PDF, check logs/bot.log for details on extraction failure.
- Ensure the output/ and logs/ folders are writable.
- On Windows, avoid running the bot from restricted folders like Program Files.
- If the vendor or amount isn't being picked up, check the regex_patterns in config.yaml.
- You can test your patterns on regex101.com.
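Patterns can also be checked locally against a snippet of extracted text. The patterns and the sample text below are made up for illustration; substitute the entries from regex_patterns in your own config.yaml.

```python
import re

# Hypothetical patterns; replace with the ones from regex_patterns in config.yaml.
PATTERNS = {
    "vendor": r"Vendor:\s*(?P<value>.+)",
    "date": r"Date:\s*(?P<value>\d{4}-\d{2}-\d{2})",
    "total": r"Total:\s*\$?(?P<value>-?[\d,]+\.\d{2})",
}


def extract_fields(text, patterns):
    """Apply each pattern and return {field: matched value or None}."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        fields[name] = match.group("value").strip() if match else None
    return fields


sample = "Vendor: Acme Supplies Ltd\nDate: 2024-03-15\nTotal: $1,250.00\n"
for field, value in extract_fields(sample, PATTERNS).items():
    # A value of None means the pattern did not match the sample text
    print(f"{field}: {value}")
```

To debug a real invoice, paste the text it produced into `sample` and look for fields that come back as `None`.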
The samples/invoices/ folder contains 6 sample PDF files with various formats and validation scenarios:
- invoice_001_valid.pdf: Valid invoice
- invoice_002_invalid_date.pdf: Invalid date format
- invoice_003_missing_total.pdf: Missing total amount
- invoice_004_scanned.pdf: Image-based (needs OCR)
- invoice_005_alt_format.pdf: Different layout
- invoice_006_negative_total.pdf: Invalid negative amount
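The scenarios these samples cover (bad date, missing total, negative total) map onto a small set of checks. The following is a minimal sketch of such checks, not the bot's own validator. The field names, the error strings and the `%Y-%m-%d` date format are assumptions.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def validate_invoice(fields, date_format="%Y-%m-%d"):
    """Return a list of validation errors; an empty list means the invoice passed.

    `fields` is assumed to be a dict with string (or None) values under the
    hypothetical keys "vendor", "date" and "total".
    """
    errors = []

    date_text = fields.get("date")
    if not date_text:
        errors.append("missing date")
    else:
        try:
            datetime.strptime(date_text, date_format)
        except ValueError:
            errors.append("invalid date")

    total_text = fields.get("total")
    if not total_text:
        errors.append("missing total")
    else:
        try:
            # Strip thousands separators and currency sign before parsing
            total = Decimal(total_text.replace(",", "").replace("$", ""))
        except InvalidOperation:
            errors.append("invalid total")
        else:
            if total < 0:
                errors.append("negative total")

    if not fields.get("vendor"):
        errors.append("missing vendor")
    return errors
```

`Decimal` is used instead of `float` so that monetary amounts are parsed exactly.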
Processing generates:
- output/invoice_report.csv: Structured CSV with extracted data and validation results
- logs/bot.log: Detailed processing logs
- Email alerts (if configured and validation threshold exceeded)
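As a rough illustration of how the report and the alert relate, the sketch below writes one CSV row per invoice and builds a summary email only when the number of failed invoices exceeds a threshold. It uses the standard-library `csv` module rather than pandas to stay dependency-free. The column names, the threshold semantics and the addresses are assumptions, not the bot's actual schema.

```python
import csv
import io
from email.message import EmailMessage

# Hypothetical report columns
FIELDNAMES = ["file", "vendor", "date", "total", "errors"]


def write_report(rows, stream):
    """Write one CSV row per invoice; the error list is joined into one cell."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "errors": "; ".join(row.get("errors", []))})


def build_alert(rows, threshold, sender, recipient):
    """Return a summary EmailMessage if failures exceed threshold, else None."""
    failed = [r for r in rows if r.get("errors")]
    if len(failed) <= threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Invoice bot: {len(failed)} of {len(rows)} invoices failed validation"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(f"{r['file']}: {', '.join(r['errors'])}" for r in failed))
    return msg


rows = [
    {"file": "a.pdf", "vendor": "Acme", "date": "2024-03-15", "total": "10.00", "errors": []},
    {"file": "b.pdf", "vendor": "Acme", "date": "bad", "total": "10.00", "errors": ["invalid date"]},
]
buffer = io.StringIO()  # stands in for open("output/invoice_report.csv", "w", newline="")
write_report(rows, buffer)
print(buffer.getvalue())
```

A message returned by `build_alert` can be sent with `smtplib.SMTP(...).send_message(msg)` using the SMTP settings from the config.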
Build and run with Docker:
# Build image
docker build -t rpa-invoice-bot .
# Run with volumes
docker run -v "$(pwd)/samples:/app/samples" -v "$(pwd)/output:/app/output" -v "$(pwd)/logs:/app/logs" rpa-invoice-bot
- Python 3.10+
- pdfplumber
- easyocr
- pyyaml
- pandas
- Pillow
Install with: pip install -r requirements.txt
MIT License