ADOBE ROUND 1A

A lightweight PDF outline extractor that identifies document headings and generates a JSON outline containing the document title plus each heading's level, text, and page number.

Multilingual Support: Handles PDFs in multiple languages (e.g., English, French, Chinese, German).
High Performance: Processes documents of up to 50 pages in about 5–6 seconds on average.

Approach

Text and Feature Extraction
- Use PyMuPDF (fitz) to parse each PDF page into text blocks and lines.
- For each line, extract raw features such as font size, position (x0, y0), text length, uppercase ratio, bold/italic flags, numeric or punctuation patterns, and title-case flag.
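
As a rough illustration of the per-line features listed above, the sketch below builds one feature row from values that PyMuPDF exposes for a span (text, font size, origin, flags). The function name, the feature names, and the regular expression are assumptions made for this example and are not taken from the repository. The bold and italic bit values follow the PyMuPDF documentation for span flags.

```python
import re

# PyMuPDF span "flags" bits (per its documentation): 2 = italic, 16 = bold.
ITALIC_BIT = 2
BOLD_BIT = 16


def line_features(text, font_size, x0, y0, flags):
    """Build one feature row for a line of text (feature names are hypothetical)."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    words = text.split()
    return {
        "font_size": font_size,
        "x0": x0,
        "y0": y0,
        "text_len": len(text),
        # Share of alphabetic characters that are uppercase (0.0 if none).
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "is_bold": bool(flags & BOLD_BIT),
        "is_italic": bool(flags & ITALIC_BIT),
        # Matches numbering prefixes such as "1 ", "2.1 " or "3. ".
        "starts_with_number": bool(re.match(r"^\d+(\.\d+)*\.?\s", text)),
        "ends_with_punct": text.rstrip().endswith((".", ":", ";", ",")),
        # Every word that starts with a letter starts with an uppercase letter.
        "is_title_case": bool(words)
        and all(w[0].isupper() for w in words if w[0].isalpha()),
    }


row = line_features("2.1 Related Work", 14.0, 72.0, 120.5, 16)
print(row)
```

In the actual pipeline the inputs would come from the parsed page; here they are passed in directly so the example runs without a PDF.
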
Batch Predictions for Heading Detection
- Aggregate all line features into a single Pandas DataFrame.
- Perform a single heading_model.predict() over the entire DataFrame to flag heading lines.
- This reduces per-line DataFrame construction and model-call overhead, improving performance on large documents.
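
The batching idea can be sketched without the trained model. In the example below, StubHeadingModel is a hypothetical stand-in for the real classifier, and a plain list of dicts stands in for the Pandas DataFrame so the snippet has no dependencies. The point it shows is that predict() is called once for all lines rather than once per line.

```python
class StubHeadingModel:
    """Hypothetical stand-in for the trained classifier; counts predict() calls."""

    def __init__(self):
        self.calls = 0

    def predict(self, rows):
        self.calls += 1
        # Toy rule for illustration only: large or bold lines count as headings.
        return [1 if r["font_size"] >= 14 or r["is_bold"] else 0 for r in rows]


def flag_headings(rows, model):
    """Run a single predict() over all rows and keep those flagged as headings."""
    if not rows:
        return []
    labels = model.predict(rows)  # one batched call, not one call per line
    return [r for r, label in zip(rows, labels) if label == 1]


rows = [
    {"text": "Introduction", "font_size": 16.0, "is_bold": True},
    {"text": "This section gives background.", "font_size": 10.0, "is_bold": False},
    {"text": "Methods", "font_size": 14.0, "is_bold": False},
]
model = StubHeadingModel()
headings = flag_headings(rows, model)
print([h["text"] for h in headings], "predict calls:", model.calls)
```
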
Merge Split/Overlapping Lines
- Iteratively merge consecutive fragments if they have the same font size and their vertical positions (y0) differ by no more than a small threshold.
- Prevent duplicate text fragments by checking if the new fragment is already contained in the current heading entry.
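
A minimal sketch of the merge step described above, under some assumptions: each fragment is a dict with text, font_size, y0, and page keys; the threshold y_tol and its default of 2.0 points are hypothetical; and the same-page check is an addition for safety that the description does not mention.

```python
def merge_fragments(lines, y_tol=2.0):
    """Merge consecutive fragments with equal font size and nearby y0.

    y_tol is a hypothetical threshold in PDF points.
    """
    merged = []
    for line in lines:
        prev = merged[-1] if merged else None
        same_row = (
            prev is not None
            and prev["page"] == line["page"]
            and prev["font_size"] == line["font_size"]
            and abs(prev["y0"] - line["y0"]) <= y_tol
        )
        if same_row:
            # Skip fragments whose text is already contained in the current entry.
            if line["text"] not in prev["text"]:
                prev["text"] = f'{prev["text"]} {line["text"]}'
        else:
            merged.append(dict(line))  # copy so the input is not mutated
    return merged


fragments = [
    {"text": "Chapter 1", "font_size": 18.0, "y0": 100.0, "page": 1},
    {"text": "Overview", "font_size": 18.0, "y0": 100.8, "page": 1},
    {"text": "Overview", "font_size": 18.0, "y0": 101.0, "page": 1},
    {"text": "Scope", "font_size": 14.0, "y0": 160.0, "page": 1},
]
merged = merge_fragments(fragments)
print([m["text"] for m in merged])
```

The overlapping duplicate "Overview" is dropped by the containment check, and "Scope" starts a new entry because its font size differs.
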

Post-Processing

Map Font Sizes to Heading Levels (H1–H6)
- Identify the unique font sizes among detected headings in descending order.
- Assign the largest font size to H1, the second-largest to H2, and so on up to H6.
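
The font-size mapping can be sketched as follows. The function name and the dict layout are assumptions for this example. The README does not say what happens when more than six distinct sizes occur; this sketch clamps every further size to H6, which is an assumption.

```python
def assign_levels(headings, max_levels=6):
    """Map distinct heading font sizes to H1..H6, largest size first."""
    sizes = sorted({h["font_size"] for h in headings}, reverse=True)
    # Assumption: sizes beyond the sixth are clamped to the deepest level.
    level_of = {
        size: f"H{min(rank + 1, max_levels)}" for rank, size in enumerate(sizes)
    }
    return [{**h, "level": level_of[h["font_size"]]} for h in headings]


detected = [
    {"text": "Annual Report", "font_size": 24.0, "page": 1},
    {"text": "Summary", "font_size": 18.0, "page": 2},
    {"text": "Revenue", "font_size": 14.0, "page": 3},
    {"text": "Outlook", "font_size": 18.0, "page": 4},
]
leveled = assign_levels(detected)
print([(h["text"], h["level"]) for h in leveled])
```
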
De-duplicate and Finalize Outline
- Remove any remaining duplicate heading texts.
- Select the first H1 as the document title.
- Produce a JSON outline containing a title string and an array of { level, text, page } objects.
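
The finalization step might look like the sketch below. The key name "outline" for the array is an assumption, since the README only states that the JSON holds a title string and an array of { level, text, page } objects. Keeping the title heading inside the array is also an assumption. ensure_ascii=False is used so that non-English heading text stays readable in the output.

```python
import json


def build_outline(headings):
    """Drop duplicate heading texts, pick the first H1 as title, build the result."""
    seen = set()
    outline = []
    title = ""
    for h in headings:
        text = h["text"].strip()
        if not text or text in seen:
            continue  # skip empty or already-seen heading texts
        seen.add(text)
        if not title and h["level"] == "H1":
            title = text
        outline.append({"level": h["level"], "text": text, "page": h["page"]})
    # "outline" is a hypothetical key name for the array of headings.
    return {"title": title, "outline": outline}


sample = [
    {"level": "H1", "text": "Einführung", "page": 1},
    {"level": "H2", "text": "Ziele", "page": 2},
    {"level": "H2", "text": "Ziele", "page": 2},
    {"level": "H1", "text": "Methoden", "page": 3},
]
result = build_outline(sample)
print(json.dumps(result, ensure_ascii=False, indent=2))
```
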

Models & Libraries Used

Models
- heading_classifier.pkl: Binary classifier (e.g., Random Forest or Logistic Regression) to detect heading vs. body text.
- level_classifier.pkl: Multiclass model for heading levels; currently provided for extensibility but not used in the main batch pipeline.
Python Libraries
- PyMuPDF (fitz) for PDF parsing.
- Pandas for DataFrame-based batching.
- scikit-learn for loading and predicting with pre-trained models.
- joblib for efficient model serialization and loading.

Installation

Clone the Repository

git clone https://github.com/AdamyaSingh7/AdobeChallenge_1A.git
cd AdobeChallenge_1A

Set Up Prerequisites

Ensure Docker is installed (version 20+ recommended).
(Optional) If running locally, use Python 3.8+ and install dependencies:

pip install -r requirements.txt

Build & Run

Build Docker Image

In Bash or PowerShell (build):

docker build --platform linux/amd64 -t adobe1a-outline-extractor:latest .

⚠️ Note: The build process takes approximately 1 to 1.5 minutes, depending on your system.

Run Container

In Bash (run):

docker run --rm \
  -v "$(pwd)/input:/app/input" \
  -v "$(pwd)/output:/app/output" \
  --network none \
  adobe1a-outline-extractor:latest

In PowerShell (run):

docker run --rm `
  -v "${PWD}\input:/app/input" `
  -v "${PWD}\output:/app/output" `
  --network none `
  adobe1a-outline-extractor:latest

The tool will process all PDFs in /app/input and write JSON outlines to /app/output.

Note: The input folder also includes extra test PDFs beyond the Adobe-provided samples:
- A 50-page document to benchmark performance.
- A German-language PDF to validate multilingual support.
