[Feature Request]: Replace Docling Parser's CLI subprocess with Python API #222

### Do you need to file a feature request?

- [x] I have searched the existing feature request and this feature request is not already filed.
- [x] I believe this is a legitimate feature request, not just a question or bug.

### Feature Request Description

## Summary 

Replace the current `DoclingParser` implementation that shells out to the
`docling` CLI via `subprocess.run` with a direct integration through the
[Docling Python API](https://github.com/DS4SD/docling)
(`docling.document_converter.DocumentConverter`).
 
This eliminates process-spawning overhead, avoids disk I/O round-trips for
intermediate JSON/Markdown files, and enables in-memory model reuse across
consecutive parse calls. Multi-document workloads should benefit most,
because models are loaded once rather than once per file, and the public
API remains backward compatible.
 
---
 
## Motivation / Problem
 
The current `DoclingParser` in HKUDS/RAG-Anything invokes Docling through
its command-line interface:
 
```python
# Current approach (raganything/parser.py – _run_docling_command)
cmd = [
    "docling",
    "--output", str(file_output_dir),
    "--to", "json",
    "--to", "md",
    str(input_path),
]
result = subprocess.run(cmd, **docling_subprocess_kwargs)
```
 
After the subprocess completes, output files are read back from disk
(`_read_output_files`). This pattern has several drawbacks:
 
| Issue | Impact |
|-------|--------|
| **Process-spawn overhead** | Each `parse_*` call forks a new process, loads the Python interpreter, and re-initializes all Docling models from scratch. |
| **Disk I/O round-trip** | Docling writes JSON + Markdown to disk; the parser then reads them back. This is unnecessary when the data is immediately consumed in-memory. |
| **No model reuse** | Docling's deep-learning models (table structure, OCR, layout) are loaded fresh on every invocation, which is the most expensive part of the pipeline. |
| **Fragile error handling** | Errors surface as `subprocess.CalledProcessError` with stderr strings rather than typed Python exceptions with full stack traces. |
| **Platform quirks** | Windows requires `CREATE_NO_WINDOW` flags; PATH must include the `docling` entry-point. These are unnecessary when calling Python directly. |
| **Limited pipeline control** | CLI flags expose only a subset of Docling's configuration surface. The Python API offers fine-grained control over pipeline options, format options, and OCR settings. |
 
---
 
## Proposed Solution
 
### Core Idea
 
Use `docling.document_converter.DocumentConverter` directly in Python:
 
```python
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
 
# Build pipeline options from user kwargs
pipeline_options = PdfPipelineOptions()
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
pipeline_options.do_ocr = True
 
# Create converter (reused across calls)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)
 
# Parse document — no subprocess, no disk I/O
result = converter.convert(str(input_path))
doc_dict = result.document.export_to_dict()
 
# Convert to MinerU-compatible content-list format
content_list = self.read_from_block_recursive(
    doc_dict["body"], "body", img_output_dir, 0, "0", doc_dict
)
```
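
The "reused across calls" comment above is where most of the savings would come from. One possible way to keep a converter alive between parse calls is a small cache keyed by the option set. This is only a sketch: `ConverterCache` and its `factory` argument are hypothetical names, not part of RAG-Anything or Docling, and in practice the factory would wrap the `DocumentConverter(...)` construction shown above.

```python
from typing import Any, Callable, Dict, Tuple


class ConverterCache:
    """Hypothetical helper: build one converter per distinct option set."""

    def __init__(self, factory: Callable[..., Any]) -> None:
        # `factory` is an assumption; in practice it would wrap the
        # DocumentConverter(...) construction from the snippet above.
        self._factory = factory
        self._cache: Dict[Tuple[Tuple[str, Any], ...], Any] = {}

    def get(self, **options: Any) -> Any:
        # Sort items so keyword order does not produce distinct keys.
        # Option values must be hashable (str, bool, int, ...).
        key = tuple(sorted(options.items()))
        if key not in self._cache:
            self._cache[key] = self._factory(**options)
        return self._cache[key]
```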

---
 
## Backward Compatibility
 
- **No breaking changes** to the public `parse_pdf()`, `parse_document()`,
`parse_office_doc()`, `parse_html()` signatures.
- Callers passing `env={"KEY": "VAL"}` (previously forwarded to the
subprocess environment) will still have the type validated, but the value
is silently ignored because no subprocess is spawned.
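
As a rough illustration of the `env` handling described above, the kwarg could be validated and then dropped before the remaining options are processed. The helper name `discard_legacy_env` and the exact validation rules are assumptions made for this sketch, not the existing implementation.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def discard_legacy_env(kwargs: Dict[str, Any]) -> None:
    """Hypothetical helper: validate and drop the legacy `env` kwarg in place."""
    env = kwargs.pop("env", None)
    if env is None:
        return
    if not isinstance(env, dict):
        raise TypeError("env must be a dict of str to str")
    for key, value in env.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise TypeError("env keys and values must be strings")
    # No subprocess is spawned any more, so the value has no effect.
    logger.debug("Ignoring env kwarg (%d entries)", len(env))
```

Logging at debug level keeps the default behaviour silent while still leaving a trace for anyone wondering why their environment variables are not applied.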
 
### Output Format
 
- The content-list output format is **identical** to the current
implementation. The same `read_from_block_recursive()` /
`read_from_block()` methods are used to transform the Docling document
dict into the MinerU-compatible structure.
 
---
 
### Dependencies
 
- **No new required dependencies.** `docling` remains an optional package.
- The `check_installation()` method gracefully reports whether the package
is available.
- `pip install docling` is the only setup step for users who want to use
the Docling parser backend.
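
For the availability check, one lightweight option is to look up the package spec without importing it, so that no models are loaded just to answer the question. The function name `is_docling_available` is hypothetical, and the actual `check_installation()` may well do something different.

```python
import importlib.util


def is_docling_available(package: str = "docling") -> bool:
    """Return True if the top-level package can be found, without importing it."""
    return importlib.util.find_spec(package) is not None
```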
 
---
 
## Migration Guide
 
### For End Users
 
No action required. The public API is unchanged:
 
```python
from raganything.parser import DoclingParser
 
parser = DoclingParser()
content = parser.parse_document("report.pdf", output_dir="./output")
```
 
### For Callers Passing `env`
 
The `env` kwarg is still accepted but no longer has any effect:
 
```python
# Before (CLI subprocess):
parser.parse_pdf("doc.pdf", env={"DOCLING_CACHE": "/tmp"})
# After (Python API): accepted without error, but env is ignored.
```
 
### For Advanced Configuration
 
The Python API exposes more options than the CLI:
 
```python
parser.parse_pdf(
    "doc.pdf",
    table_mode="accurate",      # TableFormerMode.ACCURATE
    tables=True,                # do_table_structure = True
    allow_ocr=True,             # do_ocr = True
    artifacts_path="/models",   # custom model artifacts directory
)
```
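
To make the kwarg-to-option mapping in the comments above concrete, here is a minimal sketch that translates the parser kwargs into a plain dict of settings. The function name `kwargs_to_pipeline_settings`, the dict keys, and the assumption that `"fast"` is the only other accepted `table_mode` value are all illustrative; a real implementation would set the corresponding attributes on `PdfPipelineOptions` instead.

```python
from typing import Any, Dict


def kwargs_to_pipeline_settings(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical mapping from parser kwargs to pipeline option names."""
    settings: Dict[str, Any] = {}
    if "tables" in kwargs:
        settings["do_table_structure"] = bool(kwargs["tables"])
    if "allow_ocr" in kwargs:
        settings["do_ocr"] = bool(kwargs["allow_ocr"])
    if "table_mode" in kwargs:
        mode = str(kwargs["table_mode"]).lower()
        # Assumed set of valid modes; adjust to whatever Docling supports.
        if mode not in {"fast", "accurate"}:
            raise ValueError(f"unsupported table_mode: {mode!r}")
        settings["table_mode"] = mode
    if "artifacts_path" in kwargs:
        settings["artifacts_path"] = str(kwargs["artifacts_path"])
    return settings
```

Only kwargs that the caller actually passed end up in the result, so Docling's own defaults would apply to everything else.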
 
---
 
## Checklist
 
- [x] Backward compatibility preserved for all existing kwargs
- [x] Output format identical to current CLI-based implementation
- [x] No new required dependencies introduced

---
 
## References
 
- Docling Python API: https://github.com/DS4SD/docling
- `DocumentConverter` usage: https://ds4sd.github.io/docling/

### Additional Context

_No response_
