A lightweight PDF outline extractor that identifies document headings and generates a JSON outline of title, levels, text, and page number.
- Multilingual Support: Works seamlessly on PDFs in multiple languages (e.g., English, French, Chinese, German).
- High Performance: Processes documents of up to 50 pages in roughly 5–6 seconds on average.

Text and Feature Extraction
- Use PyMuPDF (fitz) to parse each PDF page into text blocks and lines.
- For each line, extract raw features such as font size, position (x0, y0), text length, uppercase ratio, bold/italic flags, numeric or punctuation patterns, and title-case flag.
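
As a rough illustration of the feature extraction step, the sketch below computes several of the listed features from one line in the dictionary layout returned by PyMuPDF's page.get_text("dict"). The function name line_features, the feature names, and the exact definitions of the numeric-prefix and title-case flags are assumptions made for illustration, not the repository's actual code.

```python
import re


def line_features(line, page_number):
    """Compute raw features for one line in PyMuPDF's "dict" layout.

    `line` is expected to have a "bbox" (x0, y0, x1, y1) and a list of
    "spans", each with "text", "size" and "flags". Feature names are
    hypothetical.
    """
    spans = line["spans"]
    text = "".join(span["text"] for span in spans).strip()
    letters = [ch for ch in text if ch.isalpha()]
    alpha_words = [w for w in text.split() if w[0].isalpha()]
    return {
        "page": page_number,
        "text": text,
        # Assumption: a line's size is the largest span size on it.
        "font_size": max(span["size"] for span in spans),
        "x0": line["bbox"][0],
        "y0": line["bbox"][1],
        "text_length": len(text),
        "uppercase_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
        # In PyMuPDF span flags, bit value 16 marks bold and 2 marks italic.
        "is_bold": any(span["flags"] & 16 for span in spans),
        "is_italic": any(span["flags"] & 2 for span in spans),
        # Assumed numeric pattern: a leading section number such as "1.2 ".
        "starts_with_number": bool(re.match(r"\d+(\.\d+)*[.)]?\s", text)),
        "ends_with_punctuation": text.endswith((".", ",", ";", ":", "!", "?")),
        # Assumed title-case rule: every alphabetic word starts uppercase.
        "is_title_case": bool(alpha_words) and all(w[0].isupper() for w in alpha_words),
    }


# With PyMuPDF installed, the line dicts would come from something like:
#   import fitz
#   doc = fitz.open("sample.pdf")  # hypothetical file name
#   for page_number, page in enumerate(doc, start=1):
#       for block in page.get_text("dict")["blocks"]:
#           for line in block.get("lines", []):  # image blocks have no lines
#               features = line_features(line, page_number)
```

Keeping the function independent of the PDF library makes it easy to test on hand-written line dictionaries.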

Batch Predictions for Heading Detection
- Aggregate all line features into a single Pandas DataFrame.
- Perform a single heading_model.predict() over the entire DataFrame to flag heading lines.
- This reduces per-line DataFrame construction and model-call overhead, improving performance on large documents.
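
The batching idea can be sketched without any dependencies. The example below builds one feature matrix for all lines and calls predict() exactly once. It uses a plain list of rows instead of a Pandas DataFrame, and SizeThresholdModel is a hypothetical stand-in for the trained classifier. FEATURE_COLUMNS and flag_headings are likewise illustrative names, not the project's actual ones.

```python
# Hypothetical subset of feature columns, in the order the model expects.
FEATURE_COLUMNS = ["font_size", "uppercase_ratio", "is_bold"]


class SizeThresholdModel:
    """Stand-in exposing a scikit-learn style predict(); not the real model."""

    def __init__(self, min_size):
        self.min_size = min_size
        self.calls = 0  # counts predict() invocations to show the batching

    def predict(self, rows):
        self.calls += 1
        return [1 if row[0] >= self.min_size else 0 for row in rows]


def flag_headings(lines, model):
    """Return the lines labelled as headings, using a single predict() call."""
    if not lines:
        return []
    # One matrix for the whole document instead of one tiny input per line.
    matrix = [[float(line[col]) for col in FEATURE_COLUMNS] for line in lines]
    labels = model.predict(matrix)
    return [line for line, label in zip(lines, labels) if label == 1]
```

With Pandas available, the matrix would presumably be built as pd.DataFrame(lines)[FEATURE_COLUMNS] and passed to the loaded classifier in the same single call.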

Merge Split/Overlapping Lines
- Iteratively merge consecutive fragments if they have the same font size and their vertical positions (y0) fall within a threshold.
- Prevent duplicate text fragments by checking if the new fragment is already contained in the current heading entry.
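
A minimal sketch of this merge rule follows. The threshold value of 2.0 points, the same-page check, and the field names (text, font_size, y0, page) are assumptions, since the text does not state them.

```python
def merge_fragments(fragments, y_threshold=2.0):
    """Merge consecutive heading fragments that belong to the same heading.

    Two consecutive fragments are merged when they share a page and font
    size and their y0 values differ by at most `y_threshold` (assumed value).
    """
    merged = []
    last_y0 = None  # y0 of the most recently seen fragment
    for frag in fragments:
        current = merged[-1] if merged else None
        same_heading = (
            current is not None
            and current["page"] == frag["page"]
            and current["font_size"] == frag["font_size"]
            and abs(frag["y0"] - last_y0) <= y_threshold
        )
        if same_heading:
            # Skip text that the current heading entry already contains.
            if frag["text"] not in current["text"]:
                current["text"] = f'{current["text"]} {frag["text"]}'
        else:
            merged.append(dict(frag))  # copy so the input stays untouched
        last_y0 = frag["y0"]
    return merged
```

Comparing each fragment with the previous one, rather than with the first fragment of the heading, lets a chain of slightly offset fragments collapse into a single entry.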

Map Font Sizes to Heading Levels (H1–H6)
- Identify the unique font sizes among detected headings in descending order.
- Assign the largest font size to H1, the second-largest to H2, and so on up to H6.
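
The size-to-level mapping can be sketched as below. How headings beyond the sixth distinct size are handled is not stated, so clamping them to H6 is an assumption, as are the function and field names.

```python
def assign_levels(headings, max_levels=6):
    """Attach an "H1".."H6" level to each heading based on font-size rank."""
    # Unique sizes, largest first: rank 1 becomes H1, rank 2 becomes H2, ...
    sizes = sorted({h["font_size"] for h in headings}, reverse=True)
    level_by_size = {
        # Assumption: any size ranked below the sixth is clamped to H6.
        size: f"H{min(rank, max_levels)}"
        for rank, size in enumerate(sizes, start=1)
    }
    return [{**h, "level": level_by_size[h["font_size"]]} for h in headings]
```

Because the ranking is relative to the sizes found in each document, the same absolute font size can map to different levels in different PDFs.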

De-duplicate and Finalize Outline
- Remove any remaining duplicate heading texts.
- Select the first H1 as the document title.
- Produce a JSON outline containing a title string and an array of { level, text, page } objects.
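
The final step might look like the sketch below. The key name "outline" for the array, exact-match de-duplication on stripped text, and the empty-string title when no H1 exists are all assumptions, as the text only specifies a title string and an array of { level, text, page } objects.

```python
import json


def build_outline(headings):
    """De-duplicate headings and build the final outline dictionary."""
    seen = set()
    unique = []
    for h in headings:
        text = h["text"].strip()
        if text in seen:  # assumption: duplicates are exact text matches
            continue
        seen.add(text)
        unique.append({"level": h["level"], "text": text, "page": h["page"]})
    # The first H1 becomes the title; fall back to "" if there is none.
    title = next((h["text"] for h in unique if h["level"] == "H1"), "")
    return {"title": title, "outline": unique}


def to_json(outline):
    """Serialize the outline, keeping non-ASCII text readable."""
    return json.dumps(outline, ensure_ascii=False, indent=2)
```

Passing ensure_ascii=False keeps headings in languages such as German or Chinese readable in the output file rather than escaped.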

Models
- heading_classifier.pkl: Binary classifier (e.g., Random Forest or Logistic Regression) to detect heading vs. body text.
- level_classifier.pkl: Multiclass model for heading levels; currently provided for extensibility but not used in the main batch pipeline.

Python Libraries
- PyMuPDF (fitz) for PDF parsing.
- Pandas for DataFrame-based batching.
- scikit-learn for loading and predicting with pre-trained models.
- joblib for efficient model serialization and loading.

- Clone the Repository

      git clone https://github.com/AdamyaSingh7/AdobeChallenge_1A.git
      cd AdobeChallenge_1A

- Set Up Prerequisites
  - Ensure Docker is installed (version 20+ recommended).
  - (Optional) If running locally, use Python 3.8+ and install dependencies:

        pip install -r requirements.txt

In Bash (build):

    docker build --platform linux/amd64 -t adobe1a-outline-extractor:latest .

In PowerShell (build):

    docker build --platform linux/amd64 -t adobe1a-outline-extractor:latest .

⚠️ Note: The build process takes approximately 1 to 1.5 minutes, depending on your system.

In Bash (run):

    docker run --rm \
      -v "$(pwd)/input:/app/input" \
      -v "$(pwd)/output:/app/output" \
      --network none \
      adobe1a-outline-extractor:latest

In PowerShell (run):

    docker run --rm `
      -v "${PWD}\input:/app/input" `
      -v "${PWD}\output:/app/output" `
      --network none `
      adobe1a-outline-extractor:latest

The tool will process all PDFs in /app/input and write JSON outlines to /app/output.

Note: The /input folder also includes extra test PDFs beyond the Adobe-provided samples:
- A 50-page document to benchmark performance.
- A German-language PDF to validate multilingual support.