
[Feature Request]: #224

@vega-he

Do you need to file a feature request?

  • I have searched the existing feature requests and this feature request is not already filed.
  • I believe this is a legitimate feature request, not just a question or bug.

Feature Request Description

RAG-Anything already advertises MinerU integration and direct content list insertion, and there has also been recent compatibility work for MinerU 2.x (for example, the hybrid backend output directory issue and the merged MinerU-2 field-name fix).

However, after inspecting current MinerU outputs, I found that a ZIP bundle may now contain both:

  • legacy flat *_content_list.json
  • newer content_list_v2.json

In my case, the current pipeline still ends up consuming the legacy flat *_content_list.json, while content_list_v2.json appears to be a richer and more structured intermediate representation.

Why this matters

The legacy flat content list contains many low-semantic structural blocks such as:

  • page_number
  • header
  • list
  • footer

When these are passed into the current RAG-Anything ingestion path, they can amplify noise in downstream semantic extraction and multimodal graph construction.
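As a stopgap on the legacy path, the structural blocks can be filtered out before ingestion. The following is a minimal sketch, not RAG-Anything code: it assumes each entry in the flat content list is a dict with a "type" key, and the helper name drop_structural_blocks is hypothetical. "list" is left out of the default drop set because list blocks may still carry real content; callers can add it if they want.

```python
from typing import Iterable

# Assumption: block types treated as structural noise. "list" is deliberately
# not included by default, since list items may contain semantic text.
STRUCTURAL_TYPES = frozenset({"page_number", "header", "footer"})


def drop_structural_blocks(
    blocks: Iterable[dict],
    structural_types: frozenset = STRUCTURAL_TYPES,
) -> list:
    """Return only the blocks whose "type" is not in structural_types."""
    return [b for b in blocks if b.get("type") not in structural_types]


# Made-up sample entries, for illustration only.
sample = [
    {"type": "header", "text": "ACME Annual Report"},
    {"type": "text", "text": "Revenue grew in the third quarter."},
    {"type": "page_number", "text": "7"},
]
kept = drop_structural_blocks(sample)
print([b["type"] for b in kept])  # ['text']
```

This only reduces noise on the legacy format; it does not recover the grouping that content_list_v2.json appears to provide.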

By contrast, content_list_v2.json appears to provide a more semantically organized representation (for example, grouped blocks and higher-level block types such as title, paragraph, image, etc.), which seems more suitable as the upstream document representation for RAG ingestion.

Observed behavior

In MinerU output bundles, I can see files like:

  • *_content_list.json
  • content_list_v2.json
  • full.md
  • layout.json

But the current ingestion path still appears to prefer the legacy *_content_list.json.

Request

Please consider adding native support for content_list_v2.json in RAG-Anything, including:

  1. Detect and prefer content_list_v2.json when available
  2. Add a native parser/adapter for the v2 schema
  3. Clarify whether v2 is officially supported, experimental, or unsupported
  4. Document the expected MinerU output contract(s) for current RAG-Anything versions
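To illustrate item 1, here is a rough sketch of the detection order I have in mind, assuming the bundle has already been extracted to a directory. The function name pick_content_list is hypothetical and is not part of RAG-Anything or MinerU. The glob pattern also accepts a prefixed name such as <doc>_content_list_v2.json, which is an assumption on my side.

```python
from pathlib import Path
from typing import Optional


def pick_content_list(bundle_dir: Path) -> Optional[Path]:
    """Prefer a v2 content list; fall back to the legacy flat one."""
    # "*content_list_v2.json" matches both "content_list_v2.json" and a
    # prefixed variant (the prefixed form is an assumption).
    v2_candidates = sorted(bundle_dir.rglob("*content_list_v2.json"))
    if v2_candidates:
        return v2_candidates[0]

    # The legacy pattern cannot match the v2 file, because the v2 name
    # ends in "_v2.json" rather than "_content_list.json".
    legacy_candidates = sorted(bundle_dir.rglob("*_content_list.json"))
    if legacy_candidates:
        return legacy_candidates[0]

    return None
```

Returning None when neither file exists lets the caller decide whether to fail or fall back to something else such as full.md.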

Suggested acceptance criteria

  • RAG-Anything can ingest MinerU bundles where content_list_v2.json is present
  • The ingestion path does not require falling back to legacy flat *_content_list.json
  • Core semantic content types from v2 are properly routed into text / multimodal processing
  • Structural-only blocks (such as page numbers or similar layout artifacts) are not accidentally amplified as semantic entities
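The routing criteria above could look roughly like the sketch below. It is only an illustration under strong assumptions: I have not verified the v2 schema, so the code pretends the v2 blocks have already been flattened into dicts with a "type" key, and the type names (title, paragraph, image) are just the examples mentioned earlier in this issue. route_blocks and the bucket names are hypothetical.

```python
# Assumed v2 type names, taken from the examples in this issue only.
TEXT_TYPES = {"title", "paragraph"}
MULTIMODAL_TYPES = {"image"}


def route_blocks(blocks):
    """Split blocks into text, multimodal, and skipped buckets by type."""
    routed = {"text": [], "multimodal": [], "skipped": []}
    for block in blocks:
        kind = block.get("type")
        if kind in TEXT_TYPES:
            routed["text"].append(block)
        elif kind in MULTIMODAL_TYPES:
            routed["multimodal"].append(block)
        else:
            # Unknown or structural-only types (e.g. page_number) are kept
            # out of semantic extraction instead of being silently ingested.
            routed["skipped"].append(block)
    return routed
```

Keeping a separate skipped bucket, rather than dropping those blocks, makes it easy to log what was excluded and to check that nothing semantic is lost.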

Additional Context

No response
