
VLM Enhanced Query#86

Merged
LarFii merged 3 commits into main from vlm_enhanced_query
Aug 15, 2025

Conversation

@LarFii
Collaborator

@LarFii LarFii commented Aug 15, 2025

Description

This pull request introduces a VLM Enhanced Query mode to RAGAnything, enabling automatic multimodal analysis when documents contain images. The system can now pass images directly to Vision Language Models (VLMs) alongside the retrieved text context for comprehensive analysis.

Related Issues

N/A

Changes Made

  • Added VLM Enhanced Query Mode: New query type that automatically processes images in retrieved context
  • Updated vision_model_func signature: Added messages parameter to support multimodal VLM communication format
  • Enhanced README documentation:
    • Updated both English and Chinese READMEs with VLM enhanced query examples
    • Changed query types from "two types" to "three types"
    • Added comprehensive usage examples for VLM enhanced queries
  • Updated example code: Modified raganything_example.py to demonstrate the new VLM functionality
  • Version bump: Updated version from 1.2.6 to 1.2.7 in __init__.py
  • Backward compatibility: Maintained support for traditional single image format and pure text queries
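The updated `vision_model_func` signature described above might look like the following sketch. The helper `fake_vlm` and the exact keyword names are illustrative assumptions, not the project's actual implementation; the point is the dual format: a new `messages` parameter for multimodal calls, with the legacy `prompt` + single-image call still supported.

```python
def fake_vlm(messages):
    """Stand-in for a real VLM call; just reports what it received."""
    n_images = sum(
        1
        for m in messages
        for part in m["content"]
        if part.get("type") == "image_url"
    )
    return f"received {len(messages)} message(s), {n_images} image(s)"


def vision_model_func(prompt=None, image_data=None, messages=None):
    """Sketch of the dual-format signature (names are assumptions).

    New path:    `messages` carries a pre-built multimodal message list.
    Legacy path: `prompt` plus a single base64-encoded `image_data`
                 string, kept for backward compatibility.
    """
    if messages is not None:
        # New VLM-enhanced path: forward the multimodal messages as-is.
        return fake_vlm(messages)
    # Legacy path: wrap the single image into the multimodal format.
    content = [{"type": "text", "text": prompt or ""}]
    if image_data is not None:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        })
    return fake_vlm([{"role": "user", "content": content}])
```

Because the legacy call is rewritten into the same message format internally, existing callers that pass `prompt` and `image_data` keep working unchanged.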

Key Features Added:

  • Automatic image detection and base64 encoding from retrieved context
  • Support for both automatic VLM enhancement (when vision_model_func is available) and manual control via vlm_enhanced parameter
  • Comprehensive multimodal analysis combining text context and images
  • Fallback to normal queries when no images are found
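The detection-encoding-fallback flow above can be sketched roughly as follows. The helper name, message layout, and the idea of receiving raw image bytes are assumptions for illustration; in the actual feature the images come from the retrieved context.

```python
import base64


def build_vlm_messages(query, context_text, image_blobs):
    """Assemble OpenAI-style multimodal messages (hypothetical helper).

    `image_blobs` stands in for image bytes found in the retrieved
    context. Returns None when there are no images, signalling the
    caller to fall back to a normal text-only query.
    """
    if not image_blobs:
        return None  # fallback: no images found in context

    content = [{
        "type": "text",
        "text": f"Context:\n{context_text}\n\nQuestion: {query}",
    }]
    for blob in image_blobs:
        # Base64-encode each image so it can be embedded in the request.
        encoded = base64.b64encode(blob).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
        })
    return [{"role": "user", "content": content}]
```

The `None` return is what makes the fallback explicit: when no images are detected, the query proceeds through the normal text-only pipeline.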

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (README.md and README_zh.md)
  • Example code updated
  • Version number incremented
  • Backward compatibility maintained

Additional Notes

This feature significantly enhances RAGAnything's multimodal capabilities by enabling seamless integration of visual content analysis within the RAG pipeline. Users can now ask questions about charts, diagrams, and other visual elements in documents without additional preprocessing steps. The implementation maintains full backward compatibility with existing functionality while providing powerful new capabilities for multimodal document understanding.

@LarFii LarFii merged commit 79078b2 into main Aug 15, 2025
1 check passed
@LarFii LarFii deleted the vlm_enhanced_query branch August 15, 2025 12:18
Kirky-X pushed a commit to Kirky-X/RAG-Anything that referenced this pull request Dec 18, 2025
