Python RPA bot that extracts data from PDF invoices using pdfplumber and EasyOCR, validates it, and generates CSV reports

Sid-V5/rpa_bot

RPA Invoice Processing Bot

A Python-based Robotic Process Automation (RPA) bot that automatically processes PDF invoices. It extracts key fields, validates them, and generates CSV reports.

Features

  • PDF Text Extraction: Uses pdfplumber for direct text extraction
  • OCR Support: Falls back to EasyOCR for scanned/image-based PDFs
  • Data Validation: Validates dates, amounts, and vendor information
  • GUI Interface: Simple Tkinter GUI for folder selection and processing
  • CSV Reporting: Generates structured CSV reports of extracted data
  • Email Alerts: Sends summary emails when validation thresholds are exceeded
  • Configurable: Uses YAML config file for easy customization
  • Logging: Comprehensive logging to files and console
  • Docker Ready: Includes Docker configuration for containerized deployment
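
The extraction-with-fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the function names, the `min_chars` cutoff, and the idea of passing the OCR step in as a callable are all assumptions made for the example. The only third-party call used is pdfplumber's `open()` / `page.extract_text()`.

```python
from typing import Callable


def extract_with_fallback(
    pdf_path: str,
    direct: Callable[[str], str],
    ocr: Callable[[str], str],
    min_chars: int = 20,  # hypothetical cutoff for "direct extraction found real text"
) -> tuple[str, str]:
    """Return (text, method): try direct extraction first, fall back to OCR."""
    text = (direct(pdf_path) or "").strip()
    if len(text) >= min_chars:
        return text, "direct"
    # Too little text usually means a scanned or image-based PDF.
    return (ocr(pdf_path) or "").strip(), "ocr"


def pdfplumber_text(pdf_path: str) -> str:
    """Direct extraction with pdfplumber; pages without a text layer yield ''."""
    import pdfplumber  # imported lazily so the fallback logic runs without it

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)
```

In use, `direct` would be `pdfplumber_text` and `ocr` would be whatever wrapper the project puts around EasyOCR; keeping both as parameters makes the fallback rule easy to test with stubs.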

Quick Start

Requirements

  • Python 3.10+
  • pip (Python package installer)

Installation

  1. Clone or download the project
  2. Navigate to the project directory
  3. Install dependencies:
    pip install -r requirements.txt

Usage

Run the GUI application:

python run_bot.py

This launches a simple GUI where you can:

  • Select an input folder containing PDF invoices
  • Click "Start Processing" to process all PDFs
  • View progress and results

Configuration

Edit config.yaml to customize:

  • Input/output directories
  • Email settings (sender, recipient, SMTP credentials)
  • OCR settings (languages, retry counts)
  • Regex patterns for data extraction
  • Validation thresholds
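
As a rough sketch of how such settings might be loaded, the snippet below merges a user's config.yaml over a set of defaults. All key names (`paths`, `ocr`, `validation`, `alert_threshold`, and so on) are hypothetical and chosen only for illustration; check the shipped config.yaml for the real ones.

```python
from copy import deepcopy

# Hypothetical defaults; the actual keys in config.yaml may differ.
DEFAULTS = {
    "paths": {"input_dir": "samples/invoices", "output_dir": "output"},
    "ocr": {"languages": ["en"], "max_retries": 2},
    "validation": {"alert_threshold": 0.2},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user settings on defaults without mutating either."""
    merged = deepcopy(defaults)
    for key, value in (overrides or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(path: str = "config.yaml") -> dict:
    """Read the YAML file and fill in any missing settings from DEFAULTS."""
    import yaml  # PyYAML, imported lazily

    with open(path, encoding="utf-8") as fh:
        return merge_config(DEFAULTS, yaml.safe_load(fh) or {})
```

A deep merge like this lets a config file override a single nested value (for example, only the OCR retry count) while leaving the other defaults in place.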

Troubleshooting & Tips

OCR Not Working?

  • EasyOCR does not depend on Tesseract; you only need Tesseract installed if you use external OCR tools that rely on it.
  • Scanned PDFs should be at least 300 DPI for best results.
  • If the bot skips a PDF, check logs/bot.log for details on extraction failure.

Permission Errors

  • Ensure the output/ and logs/ folders are writable.
  • On Windows, avoid running the bot from restricted folders like Program Files.
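
One way to check writability up front is to create the folder and try writing a temporary file into it. The helper below is a sketch for diagnosing the problem, not part of the bot; `ensure_writable` is a hypothetical name.

```python
import tempfile
from pathlib import Path


def ensure_writable(directory: str) -> bool:
    """Create the directory if needed and confirm a file can be written into it."""
    path = Path(directory)
    try:
        path.mkdir(parents=True, exist_ok=True)
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        # Covers permission errors, read-only locations, and invalid paths.
        return False


for folder in ("output", "logs"):
    print(folder, "writable:", ensure_writable(folder))
```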

Customizing Extraction

  • If the vendor or amount isn't being picked up, check the regex_patterns in config.yaml.
  • You can test your patterns on regex101.com.
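
Patterns can also be tried locally before editing config.yaml. The sketch below uses made-up patterns and a made-up sample text; the pattern names and the `value` capture group are assumptions for the example and do not reflect the actual `regex_patterns` entries.

```python
import re

# Hypothetical patterns; replace with the ones from your config.yaml.
PATTERNS = {
    "invoice_number": r"Invoice\s*(?:No\.?|#)\s*[:\-]?\s*(?P<value>[A-Z0-9\-]+)",
    "total": r"\bTotal\s*:?\s*\$?\s*(?P<value>-?\d[\d,]*\.?\d*)",
    "vendor": r"Vendor\s*[:\-]\s*(?P<value>.+)",
}


def extract_fields(text: str, patterns: dict) -> dict:
    """Apply each pattern to the text; fields without a match map to None."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        fields[name] = match.group("value").strip() if match else None
    return fields


sample = "Vendor: Acme Supplies\nInvoice No: INV-001\nTotal: $1,250.00\n"
print(extract_fields(sample, PATTERNS))
```

A field that comes back as `None` means that pattern did not match the extracted text, which is usually the first thing to check when a vendor or amount is missing from the report.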

Sample Data

The samples/invoices/ folder contains 6 sample PDF files with various formats and validation scenarios:

  • invoice_001_valid.pdf: Valid invoice
  • invoice_002_invalid_date.pdf: Invalid date format
  • invoice_003_missing_total.pdf: Missing total amount
  • invoice_004_scanned.pdf: Image-based (needs OCR)
  • invoice_005_alt_format.pdf: Different layout
  • invoice_006_negative_total.pdf: Invalid negative amount
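
The scenarios above map onto a few simple checks. The following sketch shows one plausible shape for them; the field names, the `%Y-%m-%d` date format, and the error strings are assumptions for illustration, not the bot's actual validation rules.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def validate_invoice(fields: dict, date_format: str = "%Y-%m-%d") -> list[str]:
    """Return a list of validation errors; an empty list means the invoice passed."""
    errors = []
    if not (fields.get("vendor") or "").strip():
        errors.append("missing vendor")
    try:
        datetime.strptime(fields.get("date") or "", date_format)
    except ValueError:
        errors.append("invalid date")
    raw_total = fields.get("total")
    if raw_total in (None, ""):
        errors.append("missing total")
    else:
        try:
            # Strip thousands separators before parsing, e.g. "1,250.00".
            if Decimal(str(raw_total).replace(",", "")) < 0:
                errors.append("negative total")
        except InvalidOperation:
            errors.append("unparseable total")
    return errors


print(validate_invoice({"vendor": "Acme", "date": "15/01/2024", "total": "-5.00"}))
```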

Output

Processing generates:

  • output/invoice_report.csv: Structured CSV with extracted data and validation results
  • logs/bot.log: Detailed processing logs
  • Email alerts (if configured and validation threshold exceeded)
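
To make the relationship between the report and the alert concrete, here is a small sketch that writes report rows and decides whether an alert is due. It uses the standard-library `csv` module rather than pandas to stay dependency-free, and the column names and the "share of invalid invoices" threshold rule are assumptions, not the bot's documented behavior.

```python
import csv
import io

# Hypothetical column layout for the report.
FIELDNAMES = ["file", "vendor", "date", "total", "valid", "errors"]


def write_report(rows: list[dict], stream) -> None:
    """Write one CSV row per invoice, joining the error list into one cell."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "errors": "; ".join(row.get("errors", []))})


def should_alert(rows: list[dict], threshold: float = 0.2) -> bool:
    """Alert when the share of invalid invoices exceeds the threshold."""
    if not rows:
        return False
    invalid = sum(1 for row in rows if not row["valid"])
    return invalid / len(rows) > threshold


rows = [
    {"file": "a.pdf", "vendor": "Acme", "date": "2024-01-15",
     "total": "10.00", "valid": True, "errors": []},
    {"file": "b.pdf", "vendor": "", "date": "", "total": "",
     "valid": False, "errors": ["missing vendor", "invalid date"]},
]
buffer = io.StringIO()
write_report(rows, buffer)
print(buffer.getvalue())
print("send alert:", should_alert(rows))
```

When writing to a real file instead of a `StringIO`, open it with `newline=""` as the `csv` module documentation recommends.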

Docker

Build and run with Docker:

# Build image
docker build -t rpa-invoice-bot .

# Run with volumes
docker run -v "$(pwd)/samples:/app/samples" -v "$(pwd)/output:/app/output" -v "$(pwd)/logs:/app/logs" rpa-invoice-bot

Dependencies

  • Python 3.10+
  • pdfplumber
  • easyocr
  • pyyaml
  • pandas
  • Pillow

Install with: pip install -r requirements.txt

License

MIT License
