[Feature Request]: Replace Docling Parser's CLI subprocess with Python API #222

@hanlianlu

Feature Request Description

Summary

  • Replace the current DoclingParser implementation, which shells out to the
    docling CLI via subprocess.run, with direct calls to the Docling Python API
    (docling.document_converter.DocumentConverter).
     
    This removes process-spawning overhead, avoids disk I/O round-trips for
    intermediate JSON/Markdown files, and lets loaded models be reused in memory
    across consecutive parse calls. Multi-document workloads should benefit the
    most, since models are no longer reloaded for every document, and the
    public API stays backward compatible.
     

 

Motivation / Problem

 
The current DoclingParser in HKUDS/RAG-Anything invokes Docling through
its command-line interface:
 

# Current approach (raganything/parser.py – _run_docling_command)
cmd = [
    "docling",
    "--output", str(file_output_dir),
    "--to", "json",
    "--to", "md",
    str(input_path),
]
result = subprocess.run(cmd, **docling_subprocess_kwargs)

 

After the subprocess completes, the output files are read back from disk
(_read_output_files). This pattern has several drawbacks:
 
| Issue | Impact |
|-------|--------|
| Process-spawn overhead | Each parse_* call starts a new process, loads the Python interpreter, and re-initializes all Docling models from scratch. |
| Disk I/O round-trip | Docling writes JSON + Markdown to disk; the parser then reads them back. This is unnecessary when the data is consumed in memory right away. |
| No model reuse | Docling's deep-learning models (table structure, OCR, layout) are loaded fresh on every invocation, which is the most expensive part of the pipeline. |
| Fragile error handling | Errors surface as subprocess.CalledProcessError with stderr strings rather than typed Python exceptions with full stack traces. |
| Platform quirks | Windows requires CREATE_NO_WINDOW flags; PATH must include the docling entry point. Neither is needed when calling Python directly. |
| Limited pipeline control | CLI flags expose only a subset of Docling's configuration surface. The Python API offers fine-grained control over pipeline options, format options, and OCR settings. |
     

 

Proposed Solution

 

Core Idea

 
Use docling.document_converter.DocumentConverter directly in Python:
 

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

# Build pipeline options from user kwargs
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.do_ocr = True

# Create converter (reused across calls)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)

# Parse document: no subprocess, no disk I/O
result = converter.convert(str(input_path))
doc_dict = result.document.export_to_dict()

# Convert to MinerU-compatible content-list format
content_list = self.read_from_block_recursive(
    doc_dict["body"], "body", img_output_dir, 0, "0", doc_dict
)
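
The "reused across calls" part could be handled by a small cache keyed on the options that affect the pipeline. This is only a rough sketch: ConverterCache and the factory argument are hypothetical names, not part of RAG-Anything or Docling. The factory stands in for whatever function builds a configured DocumentConverter.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


class ConverterCache:
    """Build a converter once per distinct option set and reuse it afterwards."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        # `factory` is hypothetical: any callable that builds the expensive
        # object, e.g. a function returning a configured DocumentConverter.
        self._factory = factory
        self._cache: Dict[Tuple[Tuple[str, Hashable], ...], Any] = {}

    def get(self, **options: Hashable) -> Any:
        # Sort the options so the key does not depend on keyword order.
        key = tuple(sorted(options.items()))
        if key not in self._cache:
            self._cache[key] = self._factory(**options)
        return self._cache[key]
```

With something like this, two parse calls using the same options would share one converter (and its loaded models), while a call with different options would get its own instance.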

 ---
 

Backward Compatibility

 

  • No breaking changes to the public parse_pdf(), parse_document(),
    parse_office_doc(), parse_html() signatures.
  • Callers passing env={"KEY": "VAL"} (previously used for the subprocess
    environment) will still have the type validated, but the value is silently
    ignored because no subprocess is spawned.
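
One way the env handling could look is sketched below. The helper name validate_ignored_env, the choice of TypeError, and the debug log line are all assumptions made for illustration; the proposal itself only says the type is validated and the value ignored.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger(__name__)


def validate_ignored_env(env: Optional[Any]) -> None:
    """Type-check a legacy `env` kwarg, then discard it (hypothetical helper)."""
    if env is None:
        return
    if not isinstance(env, Mapping):
        raise TypeError("env must be a mapping of str to str")
    for key, value in env.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("env keys and values must be strings")
    # Assumption: a debug-level note is quiet enough to count as "silent".
    logger.debug("env has no effect with the in-process Docling API; ignoring it")
```

Callers that really need those variables could set them in os.environ before the first parse call, since the converter now runs in the same process.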
     

Output Format

 

  • The content-list output format is identical to the current
    implementation. The same read_from_block_recursive() /
    read_from_block() methods are used to transform the Docling document
    dict into the MinerU-compatible structure.
     

 

Dependencies

 

  • No new required dependencies. docling remains an optional package.
  • The check_installation() method gracefully reports whether the package
    is available.
  • pip install docling is the only setup step for users who want to use
    the Docling parser backend.
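
For the availability check, something along these lines might be enough. is_package_available is a hypothetical helper, not the existing check_installation() code; it only uses the standard library and does not import docling itself, so it stays cheap.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if the top-level package `name` can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken or partially initialized modules.
        return False
```

check_installation() could then return is_package_available("docling") and log an install hint when the result is False.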
     

 

Migration Guide

 

For End Users

 
No action required. The public API is unchanged:
 

from raganything.parser import DoclingParser
 
parser = DoclingParser()
content = parser.parse_document("report.pdf", output_dir="./output")

 

For Callers Passing env

 
The env kwarg is still accepted but no longer has any effect:
 

# Before (CLI subprocess):
parser.parse_pdf("doc.pdf", env={"DOCLING_CACHE": "/tmp"})
# After (Python API): accepted without error, but env is ignored.

 

For Advanced Configuration

 
The Python API exposes more options than the CLI:
 

parser.parse_pdf(
    "doc.pdf",
    table_mode="accurate",      # TableFormerMode.ACCURATE
    tables=True,                # do_table_structure = True
    allow_ocr=True,             # do_ocr = True
    artifacts_path="/models",   # custom model artifacts directory
)
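
To make the kwarg mapping in the comments above concrete, here is a rough sketch that translates the parser kwargs into pipeline option names. pdf_options_from_kwargs is a hypothetical helper, and it returns a plain dict rather than a PdfPipelineOptions object so it can run without docling installed. The assumption that the valid table modes are "fast" and "accurate" is mine and should be checked against TableFormerMode.

```python
from typing import Any, Dict

# Assumed to mirror the TableFormerMode values; verify against docling.
_TABLE_MODES = {"fast", "accurate"}


def pdf_options_from_kwargs(**kwargs: Any) -> Dict[str, Any]:
    """Map parser kwargs to pipeline option names (hypothetical helper)."""
    settings: Dict[str, Any] = {}
    if "allow_ocr" in kwargs:
        settings["do_ocr"] = bool(kwargs["allow_ocr"])
    if "tables" in kwargs:
        settings["do_table_structure"] = bool(kwargs["tables"])
    if "table_mode" in kwargs:
        mode = str(kwargs["table_mode"]).lower()
        if mode not in _TABLE_MODES:
            raise ValueError(f"unknown table_mode: {mode!r}")
        settings["table_structure_options.mode"] = mode
    if "artifacts_path" in kwargs:
        settings["artifacts_path"] = str(kwargs["artifacts_path"])
    return settings
```

Kwargs that are not passed are left out of the result, so Docling's own defaults would apply to them when the settings are copied onto the real options object.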

 

 

Checklist

 

  • Backward compatibility preserved for all existing kwargs
  • Output format identical to current CLI-based implementation
  • No new required dependencies introduced 

 
