This script processes various file types within a directory, extracting and organizing data, especially from image files. It supports formats such as JSON, CSV, XML, TXT, Markdown, DOCX, and common image formats like BMP, JPEG, PNG, GIF, and more. The script includes features like OCR (Optical Character Recognition) for images and provides descriptive data about each processed file.
- Python 3.x
- Libraries: Pillow, Pytesseract, Tqdm, Termcolor, PyYAML, Python-docx
- Tesseract OCR (for OCR functionalities)
Install the required Python libraries using pip:
pip install Pillow pytesseract tqdm termcolor pyyaml python-docx
Install Tesseract OCR. Follow the instructions here: Tesseract OCR Installation.
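Since pytesseract only wraps the Tesseract executable, it can help to confirm the binary is reachable before processing images. The following is a minimal sketch, assuming the executable is named `tesseract` and is expected on the PATH; the helper name `tesseract_available` is hypothetical and not part of the script.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


# Report the result so a missing install is noticed early.
if tesseract_available():
    print("Tesseract found")
else:
    print("Tesseract not found on PATH; OCR will likely fail")
```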
- Place the script in the directory containing the files you want to process.
- Run the script using Python:
python script_name.py
- Optional flags:
--debug: Enable debug mode for verbose logging.
- Optional venv setup:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
On Windows, activate the environment with this command instead of the source line:
.\venv\Scripts\activate
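The `--debug` flag described above could be wired up roughly as follows. This is a sketch using the standard `argparse` and `logging` modules; the function names `build_parser` and `log_level_for` are illustrative assumptions, and the script's actual argument handling may differ.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    """Create a parser that understands the optional --debug flag."""
    parser = argparse.ArgumentParser(description="Process files in a directory.")
    parser.add_argument(
        "--debug",
        action="store_true",
        help="Enable debug mode for verbose logging.",
    )
    return parser


def log_level_for(args: argparse.Namespace) -> int:
    """Map the parsed flag to a logging level."""
    return logging.DEBUG if args.debug else logging.INFO


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["--debug"])
logging.basicConfig(level=log_level_for(args))
logging.debug("Debug mode enabled")
```

Passing an explicit list to `parse_args` keeps the example independent of the real command line; in a script, calling `parse_args()` with no arguments reads `sys.argv` instead.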
- Gather knowledge sources into a folder.
  - gpt-crawler
  - custom gpts
  - reddit posts
  - documentation
  - code examples
  - images, pdfs, etc.
- Run the script with python . to process all files in the folder.
- Extract text from all files.
- Extract data from structured files (json, csv, xml, etc.)
- Extract data from images (OCR)
- Organize data into a structured format.
- Log progress and important information.
- Handle errors gracefully.
- Runs gpt-crawler, for example with the following command:
python ./src --url "https://scriptgpt.wiki/" --match "https://scriptgpt.wiki/**" --project "ScriptGPT"
- Processes files in various formats, extracting relevant data.
- Performs OCR on image files to extract text.
- Organizes extracted data into a structured format.
- Logs progress and important information with colored outputs.
- Handles errors gracefully, providing informative messages for troubleshooting.
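As an illustration of extracting data from structured files, the sketch below reads JSON, CSV, and XML with the standard library only. The function name `extract_structured` and the shape of the returned data are assumptions made for this example, not the script's actual output format.

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_structured(path: Path):
    """Return parsed content for a .json, .csv, or .xml file."""
    suffix = path.suffix.lower()
    if suffix == ".json":
        with path.open(encoding="utf-8") as fh:
            return json.load(fh)
    if suffix == ".csv":
        # newline="" lets the csv module handle line endings itself.
        with path.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))
    if suffix == ".xml":
        root = ET.parse(str(path)).getroot()
        # Summarize the document as its root tag plus direct child tags.
        return {"tag": root.tag, "children": [child.tag for child in root]}
    raise ValueError(f"Unsupported structured format: {suffix}")
```

Raising `ValueError` for unknown extensions leaves it to the caller to log the problem and continue with the next file.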
- Text Files: .txt, .md
- Document Files: .doc, .docx
- Data Files: .json, .csv, .xml, .yaml
- Image Files: .bmp, .jpeg, .png, .gif, .ico, .svg, .psd, .pdf
- Additional image processing and OCR support for image files.
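One way to route files to the right handler is a lookup from extension to category. The sketch below mirrors the extension list above; the names `FILE_CATEGORIES` and `categorize` are hypothetical, and the script itself may organize this differently.

```python
from pathlib import Path

# Extension groups copied from the supported file types listed above.
FILE_CATEGORIES = {
    "text": {".txt", ".md"},
    "document": {".doc", ".docx"},
    "data": {".json", ".csv", ".xml", ".yaml"},
    "image": {".bmp", ".jpeg", ".png", ".gif", ".ico", ".svg", ".psd", ".pdf"},
}


def categorize(path):
    """Return the category name for a file path, or None if unsupported."""
    suffix = Path(path).suffix.lower()
    for category, extensions in FILE_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


print(categorize("scan.PNG"))
print(categorize("archive.zip"))
```

Lowercasing the suffix means `scan.PNG` and `scan.png` are treated the same way.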
The script can be configured via a config.json file, allowing customization of log levels, output paths, and other settings.
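A configuration loader along these lines could merge `config.json` with built-in defaults. This is only a sketch: the keys `log_level` and `output_path` and the function name `load_config` are assumptions chosen for illustration, since the actual schema of `config.json` is not documented here.

```python
import json
from pathlib import Path

# Hypothetical defaults; the real script may use different keys.
DEFAULT_CONFIG = {"log_level": "INFO", "output_path": "output"}


def load_config(path="config.json"):
    """Return defaults overridden by values from the JSON file, if present."""
    config = dict(DEFAULT_CONFIG)
    config_file = Path(path)
    if config_file.is_file():
        with config_file.open(encoding="utf-8") as fh:
            config.update(json.load(fh))
    return config
```

Falling back to defaults when the file is missing lets the script run without any configuration at all.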
Detailed logging is provided, including debug information if the debug mode is enabled. Logs can be viewed in the console and optionally saved to a file.
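Console logging with an optional log file can be set up with the standard `logging` module, roughly as sketched below. The logger name `file_processor` and the `setup_logging` signature are assumptions for this example, and the colored output that the script produces with termcolor is left out to keep the sketch dependency-free.

```python
import logging


def setup_logging(debug=False, log_file=None):
    """Configure a logger that writes to the console and optionally a file."""
    logger = logging.getLogger("file_processor")
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    # Remove handlers from earlier calls so messages are not duplicated.
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()

    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file:
        file_handler = logging.FileHandler(log_file, encoding="utf-8")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    return logger


logger = setup_logging(debug=True)
logger.debug("Debug logging is active")
```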
Contributions that enhance the script's capabilities are welcome. Please follow the existing code structure and style for consistency.
[Specify your license or terms of use]
[Your Contact Information]
- Tesseract OCR for OCR capabilities.
- Pillow and other Python libraries that made this script possible.