[Feature Request]:

### Do you need to file a feature request?

- [x] I have searched the existing feature requests and this feature request is not already filed.
- [x] I believe this is a legitimate feature request, not just a question or bug.

### Feature Request Description

RAG-Anything already advertises MinerU integration and direct content list insertion, and there has also been recent compatibility work for MinerU 2.x (for example, the hybrid backend output directory issue and the merged MinerU-2 field-name fix).

However, from inspecting current MinerU outputs, I see that a ZIP bundle may now contain both:

- legacy flat `*_content_list.json`
- newer `content_list_v2.json`

In my case, the current pipeline still ends up consuming the legacy flat `*_content_list.json`, while `content_list_v2.json` appears to be a richer and more structured intermediate representation.

Why this matters

The legacy flat content list contains many low-semantic structural blocks such as:

- `page_number`
- `header`
- `list`
- `footer`

When these are passed into the current RAG-Anything ingestion path, they can amplify noise in downstream semantic extraction and multimodal graph construction.

By contrast, `content_list_v2.json` appears to provide a more semantically organized representation (for example, grouped blocks and higher-level block types such as `title`, `paragraph`, `image`, etc.), which seems more suitable as the upstream document representation for RAG ingestion.

Observed behavior

In MinerU output bundles, I can see files like:

- `*_content_list.json`
- `content_list_v2.json`
- `full.md`
- `layout.json`

However, the current ingestion path still appears to prefer the legacy `*_content_list.json`.

Request

Please consider adding **native support for `content_list_v2.json`** in RAG-Anything, including:

1. Detect and prefer `content_list_v2.json` when available
2. Add a native parser/adapter for the v2 schema
3. Clarify whether v2 is officially supported, experimental, or unsupported
4. Document the expected MinerU output contract(s) for current RAG-Anything versions
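
To illustrate item 1, here is a minimal sketch of the detection order I have in mind. It is not taken from the RAG-Anything codebase: `pick_content_list`, `V2_NAME`, and `LEGACY_SUFFIX` are hypothetical names, and the file-name patterns are assumed from the bundles I inspected.

```python
from pathlib import PurePosixPath
from typing import Iterable, Optional, Tuple

# File-name patterns assumed from the bundles described in this issue.
V2_NAME = "content_list_v2.json"
LEGACY_SUFFIX = "_content_list.json"


def pick_content_list(names: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (member_name, schema) for the preferred content list, or None.

    Prefers a file ending in content_list_v2.json and falls back to the
    first legacy flat *_content_list.json if no v2 file is present.
    """
    legacy = None
    for name in names:
        base = PurePosixPath(name).name
        if base.endswith(V2_NAME):
            # v2 wins as soon as it is seen, regardless of listing order.
            return name, "v2"
        if legacy is None and base.endswith(LEGACY_SUFFIX):
            legacy = name
    return (legacy, "legacy") if legacy is not None else None
```

With a ZIP bundle, the member names could come from `zipfile.ZipFile(path).namelist()`. The returned schema tag (`"v2"` or `"legacy"`) would then select the matching parser/adapter from item 2.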

Suggested acceptance criteria

- RAG-Anything can ingest MinerU bundles where `content_list_v2.json` is present
- The ingestion path does not require falling back to legacy flat `*_content_list.json`
- Core semantic content types from v2 are properly routed into text / multimodal processing
- Structural-only blocks (such as page numbers or similar layout artifacts) are not accidentally amplified as semantic entities
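
The last two criteria could look roughly like the sketch below. I have not verified the v2 schema, so this assumes each block is a dict with a `type` key; `route_blocks` and the three type sets are hypothetical, and `table` and `equation` are my guesses beyond the `title`, `paragraph`, and `image` types mentioned above.

```python
from typing import Any, Dict, Iterable, List

# Assumed type names; the real v2 schema may use different or additional ones.
TEXT_TYPES = {"title", "paragraph"}
MULTIMODAL_TYPES = {"image", "table", "equation"}  # table/equation are guesses
STRUCTURAL_TYPES = {"page_number", "header", "footer"}

Block = Dict[str, Any]


def route_blocks(blocks: Iterable[Block]) -> Dict[str, List[Block]]:
    """Split blocks into text, multimodal, skipped, and unknown buckets."""
    routed: Dict[str, List[Block]] = {
        "text": [], "multimodal": [], "skipped": [], "unknown": []
    }
    for block in blocks:
        kind = block.get("type")
        if kind in STRUCTURAL_TYPES:
            # Layout artifacts are kept out of semantic extraction.
            routed["skipped"].append(block)
        elif kind in TEXT_TYPES:
            routed["text"].append(block)
        elif kind in MULTIMODAL_TYPES:
            routed["multimodal"].append(block)
        else:
            # Unrecognized types are surfaced rather than silently dropped.
            routed["unknown"].append(block)
    return routed
```

Keeping an explicit `unknown` bucket is a deliberate choice here: it makes schema drift in future MinerU releases visible instead of letting new block types leak into entity extraction.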



### Additional Context

_No response_

[Feature Request]: #224

Do you need to file a feature request?

Feature Request Description

Additional Context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Feature Request]: #224

Description

Do you need to file a feature request?

Feature Request Description

Additional Context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions