Do you need to file a feature request?
Feature Request Description
RAG-Anything already advertises MinerU integration and direct content list insertion, and there has also been recent compatibility work for MinerU 2.x (for example, the hybrid backend output directory issue and the merged MinerU-2 field-name fix).
However, from practical inspection of current MinerU outputs, a ZIP bundle may now contain both:
- legacy flat *_content_list.json
- newer content_list_v2.json
In my case, the current pipeline still ends up consuming the legacy flat *_content_list.json, while content_list_v2.json appears to be a richer and more structured intermediate representation.
Why this matters
The legacy flat content list contains many low-semantic structural blocks such as:
page_number
header
list
footer
When these are passed into the current RAG-Anything ingestion path, they can amplify noise in downstream semantic extraction and multimodal graph construction.
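As a stopgap on the legacy file, the structural blocks could be filtered out before insertion. The sketch below is illustrative only: it assumes each entry in the flat content list is a dict with a "type" key, and drop_structural_blocks is a hypothetical helper, not an existing RAG-Anything function. "list" is named above as well, but it is left out of the default set here because list blocks may carry real content; that is a judgment call and can be overridden.

```python
from typing import Any, Dict, Iterable, List, Set

# Block types treated as layout-only by default (assumption based on the
# types observed in the legacy flat content list).
DEFAULT_STRUCTURAL_TYPES: Set[str] = {"page_number", "header", "footer"}


def drop_structural_blocks(
    blocks: Iterable[Dict[str, Any]],
    structural_types: Set[str] = DEFAULT_STRUCTURAL_TYPES,
) -> List[Dict[str, Any]]:
    """Return the blocks whose "type" is not in structural_types.

    Assumes every block is a dict with an optional "type" key; blocks
    without a "type" are kept.
    """
    return [b for b in blocks if b.get("type") not in structural_types]
```

The filtered list could then be passed to the existing content list insertion path in place of the raw file contents.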
By contrast, content_list_v2.json appears to provide a more semantically organized representation (for example, grouped blocks and higher-level block types such as title, paragraph, image, etc.), which seems more suitable as the upstream document representation for RAG ingestion.
Observed behavior
In MinerU output bundles, I can see files like:
*_content_list.json
content_list_v2.json
full.md
layout.json
But the current ingestion path appears to still prefer the legacy *_content_list.json.
Request
Please consider adding native support for content_list_v2.json in RAG-Anything, including:
- Detect and prefer content_list_v2.json when available
- Add a native parser/adapter for the v2 schema
- Clarify whether v2 is officially supported, experimental, or unsupported
- Document the expected MinerU output contract(s) for current RAG-Anything versions
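The "detect and prefer" step could look roughly like the following. This is a minimal sketch that assumes the MinerU ZIP has already been extracted to a directory; pick_content_list is a hypothetical name, not part of the current RAG-Anything API.

```python
from pathlib import Path
from typing import Optional


def pick_content_list(bundle_dir: Path) -> Optional[Path]:
    """Pick the content list file to ingest from an extracted MinerU bundle.

    Prefers content_list_v2.json; falls back to the legacy flat
    *_content_list.json; returns None if neither is present.
    """
    v2_files = sorted(bundle_dir.rglob("content_list_v2.json"))
    if v2_files:
        return v2_files[0]
    # The legacy pattern does not match content_list_v2.json because that
    # name ends in "_v2.json", not "_content_list.json".
    legacy_files = sorted(bundle_dir.rglob("*_content_list.json"))
    return legacy_files[0] if legacy_files else None
```

The caller would then choose the v2 adapter or the legacy parser based on the returned file name.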
Suggested acceptance criteria
- RAG-Anything can ingest MinerU bundles where content_list_v2.json is present
- The ingestion path does not require falling back to legacy flat *_content_list.json
- Core semantic content types from v2 are properly routed into text / multimodal processing
- Structural-only blocks (such as page numbers or similar layout artifacts) are not accidentally amplified as semantic entities
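To make the last two criteria concrete, here is a rough routing sketch. It assumes v2 blocks have already been flattened into dicts with a "type" key and uses only the type names mentioned in this issue (title, paragraph, image); the real v2 schema may differ, and route_blocks is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Type names taken from the examples in this issue; the actual v2 schema
# may define more types or different names (assumption).
TEXT_TYPES = {"title", "paragraph"}
MULTIMODAL_TYPES = {"image"}


def route_blocks(
    blocks: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split blocks into (text_blocks, multimodal_blocks).

    Blocks of any other type, such as page numbers, are dropped so they
    are never turned into semantic entities.
    """
    text_blocks: List[Dict[str, Any]] = []
    multimodal_blocks: List[Dict[str, Any]] = []
    for block in blocks:
        kind = block.get("type")
        if kind in TEXT_TYPES:
            text_blocks.append(block)
        elif kind in MULTIMODAL_TYPES:
            multimodal_blocks.append(block)
    return text_blocks, multimodal_blocks
```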
Additional Context
No response