An intelligent, multi-agent system for automated academic literature review generation. This tool searches arXiv, extracts content from PDFs, builds a RAG (Retrieval-Augmented Generation) pipeline, and synthesizes comprehensive literature reviews.
- Features
- Architecture
- Installation
- Quick Start
- Project Structure
- Usage
- Configuration
- API Reference
- Troubleshooting
- Contributing
- License
- Contact
## Features

- **Automated Paper Discovery**: Search and retrieve papers from arXiv based on your research query
- **PDF Processing**: Intelligent extraction and cleaning of academic PDFs
- **RAG Pipeline**: Vector-based retrieval using FAISS for semantic search
- **AI Summarization**: Powered by Google Gemini for high-quality summaries
- **Literature Synthesis**: Automatic generation of structured literature reviews
- **Multi-Agent Architecture**: Specialized agents for different tasks
- **Multiple Interfaces**: CLI workflow and Flask web interface
## Architecture

The system uses a multi-agent architecture with specialized components:
```text
┌────────────────────────────────────────────────────┐
│                   Research Query                   │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                    SEARCH AGENT                    │
│  • Queries arXiv API                               │
│  • Downloads PDFs                                  │
│  • Extracts metadata                               │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                  EXTRACTION AGENT                  │
│  • Parses PDF content                              │
│  • Cleans text (removes refs, figures, equations)  │
│  • Extracts abstract                               │
│  • Chunks body text                                │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                    RAG PIPELINE                    │
│  • Generates embeddings (SentenceTransformer)      │
│  • Builds FAISS index                              │
│  • Semantic similarity search                      │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                  SUMMARIZER AGENT                  │
│  • Summarizes abstract + body chunks               │
│  • Query-based chunk selection                     │
│  • Google Gemini API integration                   │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                 SYNTHESIZER AGENT                  │
│  • Aggregates paper summaries                      │
│  • Generates structured literature review          │
│  • Markdown formatting                             │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│            Literature Review (Markdown)            │
└────────────────────────────────────────────────────┘
```
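The hand-off between agents can be sketched as a plain function pipeline. The stubs below are illustrative stand-ins for the real agent classes, showing only how data flows from one stage to the next:

```python
# Illustrative sketch of the agent hand-off. The stub functions stand in
# for the real SearchAgent / ExtractionAgent / SummarizerAgent /
# SynthesizerAgent; only the shape of the data flow matches the diagram.

def search_stub(query):
    # SEARCH AGENT: would query the arXiv API and download PDFs
    return [{"id": "p1", "title": f"A paper on {query}", "pdf_path": "pdfs/p1.pdf"}]

def extract_stub(paper):
    # EXTRACTION AGENT: would parse and clean the PDF
    return {"abstract": f"Abstract of {paper['title']}", "chunks": [{"text": "..."}]}

def summarize_stub(parsed, query):
    # SUMMARIZER AGENT: would call Gemini on the retrieved chunks
    return f"Summary ({query}): {parsed['abstract']}"

def synthesize_stub(papers, summaries):
    # SYNTHESIZER AGENT: would aggregate summaries into a review
    lines = ["# Literature Review"]
    for paper, summary in zip(papers, summaries):
        lines.append(f"## {paper['title']}\n\n{summary}")
    return "\n\n".join(lines)

def run_pipeline(query):
    papers = search_stub(query)
    parsed = [extract_stub(p) for p in papers]
    summaries = [summarize_stub(d, query) for d in parsed]
    return synthesize_stub(papers, summaries)

review = run_pipeline("graph neural networks")
print(review.splitlines()[0])  # -> "# Literature Review"
```

Each real agent follows this same contract: it consumes the previous stage's output and returns plain dicts and strings, which is what makes the stages easy to swap or test in isolation.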
## Installation

**Prerequisites:**

- Python 3.8 or higher
- pip package manager
- Google Gemini API key (available from Google AI Studio)

Clone the repository and set up a virtual environment:

```bash
git clone https://github.com/yourusername/agentic-research-assistant.git
cd agentic-research-assistant

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Create a `.env` file in the root directory:

```ini
API_KEY=your_gemini_api_key_here
```

Or export the key as an environment variable:

```bash
# On Windows (Command Prompt):
set API_KEY=your_gemini_api_key_here

# On Windows (PowerShell):
$env:API_KEY="your_gemini_api_key_here"

# On macOS/Linux:
export API_KEY="your_gemini_api_key_here"
```

## Quick Start

```python
from src.main import run

# Generate a literature review for your research topic
review = run("machine learning optimization")
```

The generated review will be saved in `outputs/sample_review_optimized.md`.

To use the Flask web interface instead:

```bash
python app/app.py
```

Then open your browser to http://localhost:5001.
## Project Structure

```text
Agentic-Research-Assistant/
│
├── src/
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── search_agent.py        # arXiv search & PDF download
│   │   ├── extraction_agent.py    # PDF parsing & cleaning
│   │   ├── summarizer_agent.py    # AI-powered summarization
│   │   └── synthesizer_agent.py   # Literature review generation
│   │
│   ├── rag_pipeline.py            # FAISS + embeddings + retrieval
│   ├── utils.py                   # Logging, config helpers
│   └── main.py                    # Orchestrator (ties all agents)
│
├── app/
│   ├── app.py                     # Flask backend
│   └── templates/
│       └── llm.html               # Web UI template
│
├── outputs/
│   └── sample_review_optimized.md # Generated literature reviews
│
├── pdfs/                          # Downloaded PDFs (auto-created)
│
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment variables template
├── .gitignore
└── readme.md                      # This file
```
## Usage

### Basic Usage

```python
from src.main import run

# Basic usage
review = run("deep learning")

# The function will:
# 1. Search arXiv for relevant papers
# 2. Download and parse PDFs
# 3. Extract and chunk content
# 4. Generate summaries using AI
# 5. Create a structured literature review
# 6. Save to outputs/sample_review_optimized.md
```

### Working with Individual Agents

```python
from src.agents.search_agent import SearchAgent
from src.agents.extraction_agent import ExtractionAgent
from src.agents.summarizer_agent import SummarizerAgent
from src.rag_pipeline import RAGPipeline
import faiss

# Initialize agents
search = SearchAgent(pdf_dir="pdfs")
extraction = ExtractionAgent()
summarizer = SummarizerAgent(api_key="your_api_key")

# Search for papers
papers = search.search_arxiv("neural networks", max_results=5)

# Process each paper
for paper in papers:
    if paper.get("pdf_path"):
        # Extract content
        parsed = extraction.parse_pdf(paper["pdf_path"], paper["id"])
        # Generate summary
        summary = summarizer.summarize_chunks(parsed, query="neural networks")
        print(f"Title: {paper['title']}")
        print(f"Summary: {summary}\n")
```

### Customization

```python
# Customize number of papers
papers = search.search_arxiv("quantum computing", max_results=10)

# Customize chunk size
chunks = extraction.chunk_text(text, chunk_size=1500)

# Customize summary length
summary = summarizer._summarize_text(text, max_output_tokens=500)

# Customize RAG retrieval
papers, summaries = rag.query(
    query="machine learning",
    top_k_chunks=200,
    top_k_papers=5,
    chunks_per_paper=15
)
```

### Batch Processing

```python
from src.main import run

queries = [
    "machine learning optimization",
    "deep learning computer vision",
    "natural language processing transformers"
]

for query in queries:
    print(f"Processing: {query}")
    review = run(query)
    print(f"Review saved for: {query}\n")
```

## Configuration

### Environment Variables

Create a `.env` file:
```ini
# Required
API_KEY=your_gemini_api_key

# Backwards-compatible alternative
GEMINI_API_KEY=your_gemini_api_key

# Optional
PDF_DIR=pdfs
OUTPUT_DIR=outputs
LOG_LEVEL=INFO
MAX_PAPERS=5
CHUNK_SIZE=2000
```

### Pipeline Parameters

Edit parameters in `src/main.py`:
```python
# Search configuration
papers = search.search_arxiv(query, max_results=3)     # Number of papers

# Chunking configuration
chunks = extraction.chunk_text(text, chunk_size=2000)  # Chunk size

# Summarization configuration
summary = summarizer._summarize_text(
    text,
    max_output_tokens=300  # Summary length
)

# RAG configuration
papers, summaries = rag.query(
    query=query,
    top_k_chunks=200,      # Chunks to retrieve
    top_k_papers=3,        # Papers to include
    chunks_per_paper=10    # Chunks per paper
)
```

### Logging

In `src/utils.py`:
```python
def get_logger(name):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # Change to DEBUG for verbose output
    return logger
```

## API Reference

### SearchAgent

Initialize:
```python
search = SearchAgent(pdf_dir="pdfs")
```

Methods:

#### `search_arxiv(query, max_results)`

Search arXiv for papers matching the query.

Parameters:

- `query` (str): Search query
- `max_results` (int): Maximum number of papers to retrieve

Returns:

- List of paper dictionaries with keys: `id`, `title`, `summary`, `authors`, `pdf_path`, `url`, `published`

Example:

```python
papers = search.search_arxiv("machine learning", max_results=5)
for paper in papers:
    print(f"Title: {paper['title']}")
    print(f"Authors: {', '.join(paper['authors'])}")
```

#### PDF download

Download a PDF from the given URL.

Parameters:

- `pdf_url` (str): URL of the PDF
- `paper_id` (str): Unique identifier for the paper

Returns:

- str: Path to the downloaded PDF, or `None` if the download failed
Initialize:
extraction = ExtractionAgent()Methods:
Extract and clean text from a PDF file.
Parameters:
pdf_path(str): Path to PDF filepaper_id(str, optional): Paper identifier for metadata
Returns:
- dict:
{"abstract": str, "chunks": List[dict]}
Example:
parsed = extraction.parse_pdf("pdfs/paper123.pdf", paper_id="paper123")
print(f"Abstract: {parsed['abstract'][:200]}...")
print(f"Number of chunks: {len(parsed['chunks'])}")Split text into chunks for processing.
Parameters:
text(str): Text to chunkpaper_id(str, optional): Paper identifierchunk_size(int): Number of words per chunk
Returns:
- List[dict]: Chunks with metadata
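Since `chunk_size` counts words, chunking reduces to slicing the word list in fixed-size windows. The sketch below illustrates this idea; it is an assumption about the implementation, not the project's actual code:

```python
def chunk_text_sketch(text, paper_id=None, chunk_size=2000):
    """Split text into word-count chunks with minimal metadata (illustrative)."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size):
        chunks.append({
            "paper_id": paper_id,
            "chunk_index": len(chunks),
            "text": " ".join(words[start:start + chunk_size]),
        })
    return chunks

# 4500 words at 2000 words per chunk -> chunks of 2000, 2000, and 500 words
chunks = chunk_text_sketch("word " * 4500, paper_id="p1", chunk_size=2000)
print(len(chunks))  # -> 3
```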
#### Text cleaning

Remove equations, citations, figures, and other noise from text.

Parameters:

- `text` (str): Raw text

Returns:

- str: Cleaned text
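A plausible regex-based sketch of this kind of cleaning — stripping bracketed citations and figure/table caption lines, then collapsing whitespace — is shown below. The exact rules the agent applies may differ:

```python
import re

def clean_text_sketch(text):
    """Illustrative cleaning pass; the real agent's rules may differ."""
    # Remove bracketed citations like [12] or [3, 4]
    text = re.sub(r"\s*\[\d+(?:\s*,\s*\d+)*\]", "", text)
    # Remove lines that are figure/table captions
    text = re.sub(r"^(Figure|Table)\s+\d+.*$", "", text, flags=re.MULTILINE)
    # Collapse runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

raw = "Results improve [1, 2].\nFigure 3: Loss curves.\nSee Table 1 for details."
print(clean_text_sketch(raw))  # -> "Results improve. See Table 1 for details."
```

Note that inline mentions like "See Table 1" survive; only whole caption lines are dropped.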
### SummarizerAgent

Initialize:

```python
summarizer = SummarizerAgent(api_key="your_gemini_api_key")
```

Methods:

#### `summarize_chunks(paper_data, query, k)`

Generate a summary from paper content.

Parameters:

- `paper_data` (dict): `{"abstract": str, "chunks": List[dict]}`
- `query` (str, optional): Query for relevance-based chunk selection
- `k` (int): Number of top chunks to use

Returns:

- str: Generated summary

Example:

```python
summary = summarizer.summarize_chunks(
    paper_data=parsed,
    query="machine learning",
    k=5
)
print(summary)
```

#### Embedding generation

Generate embeddings for text chunks.

Parameters:

- `chunks` (List[str]): List of text chunks
- `normalize` (bool): Whether to normalize embeddings

Returns:

- np.ndarray: Array of embeddings
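The `normalize` flag matters because the pipeline stores vectors in a FAISS inner-product index (`IndexFlatIP`): with unit-length embeddings, inner product equals cosine similarity. A NumPy sketch of the normalization step:

```python
import numpy as np

def normalize_rows(embeddings):
    """L2-normalize each row so inner product becomes cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero

emb = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_rows(emb)
print(unit[0])  # -> [0.6 0.8]
```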
### RAGPipeline

Initialize:

```python
from src.rag_pipeline import RAGPipeline
import faiss

dim = summarizer.embedding_model.get_sentence_embedding_dimension()
index = faiss.IndexFlatIP(dim)

rag = RAGPipeline(
    search_agent=search,
    extraction_agent=extraction,
    summarizer_agent=summarizer,
    index=index,
    id_to_metadata={}
)
```

Methods:

#### `build_index(chunks, paper_info)`

Add chunks to the FAISS index.

Parameters:

- `chunks` (List[dict]): Chunks with text and metadata
- `paper_info` (dict, optional): Paper metadata

Example:

```python
rag.build_index(chunks, paper_info=paper)
```

#### `query(query, top_k_chunks, top_k_papers, chunks_per_paper)`

Retrieve and summarize relevant papers.
Parameters:

- `query` (str): Search query
- `top_k_chunks` (int): Total chunks to retrieve
- `top_k_papers` (int): Number of papers to return
- `chunks_per_paper` (int): Chunks per paper for summarization

Returns:

- Tuple[List[dict], List[str]]: (papers, summaries)
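Under the hood, an `IndexFlatIP` search is an exact inner-product top-k over the stored chunk embeddings. The equivalent NumPy computation (illustrative, not the project's code) is:

```python
import numpy as np

# Toy chunk embeddings (already L2-normalized) and a query embedding
chunk_vecs = np.array([
    [1.0, 0.0],   # chunk 0
    [0.0, 1.0],   # chunk 1
    [0.6, 0.8],   # chunk 2
])
query_vec = np.array([0.0, 1.0])

# Inner-product scores; with unit vectors this is cosine similarity
scores = chunk_vecs @ query_vec

# Indices of the top-2 chunks, best first
top_k = np.argsort(-scores)[:2]
print(top_k)  # -> [1 2]  (chunk 1 scores 1.0, chunk 2 scores 0.8)
```

The pipeline then groups the retrieved chunks by paper to produce the `(papers, summaries)` tuple.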
### SynthesizerAgent

Initialize:

```python
from src.agents.synthesizer_agent import SynthesizerAgent

synthesizer = SynthesizerAgent()
```

Methods:

#### Review synthesis

Generate a structured literature review.

Parameters:

- `papers` (List[dict]): List of paper metadata
- `summaries` (List[str]): List of paper summaries

Returns:

- str: Markdown-formatted literature review
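A minimal sketch of how per-paper summaries could be assembled into a Markdown review is shown below. This is illustrative only — the real agent delegates the synthesis itself to Gemini rather than simple concatenation:

```python
def assemble_review(papers, summaries, topic="Literature Review"):
    """Join per-paper summaries into one Markdown document (illustrative)."""
    parts = [f"# {topic}", ""]
    for paper, summary in zip(papers, summaries):
        parts.append(f"## {paper['title']}")
        authors = ", ".join(paper.get("authors", []))
        if authors:
            parts.append(f"*{authors}*")
        parts.append(summary)
        parts.append("")
    return "\n".join(parts)

papers = [{"title": "Paper A", "authors": ["Ada", "Bob"]}]
review = assemble_review(papers, ["A short summary."])
print(review.splitlines()[0])  # -> "# Literature Review"
```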
## Troubleshooting

### API rate limit or quota errors

Cause: Gemini API rate limiting or quota exceeded

Solutions:

```python
# Solution 1: Reduce the number of papers
papers = search.search_arxiv(query, max_results=2)

# Solution 2: Use the optimized main.py (fewer API calls)
from src.main import run  # Uses the optimized version

# Solution 3: Check your API quota
# Visit: https://makersuite.google.com/app/apikey

# Solution 4: Add manual delays
import time
time.sleep(2)  # Between API calls
```

### PDF download failures

Cause: PDF download failed or the paper has no PDF available

Solutions:

- Check your internet connection
- Some papers don't have PDFs (the abstract will be used instead)
- Check `pdfs/` directory permissions:

```bash
# On macOS/Linux:
chmod 755 pdfs/
# On Windows:
# Right-click the pdfs folder → Properties → Security → Edit
```
### No chunks extracted

Cause: No valid text chunks were extracted

Solutions:

```python
# Check whether PDFs downloaded
import os
print(os.listdir("pdfs"))

# Check extraction
parsed = extraction.parse_pdf(pdf_path, paper_id)
print(f"Abstract: {bool(parsed['abstract'])}")
print(f"Chunks: {len(parsed['chunks'])}")

# Debug extraction
if not parsed['chunks']:
    print("No chunks extracted - PDF might be image-based or corrupted")
```

### High memory usage

Solutions:

```python
# Solution 1: Reduce chunk size
chunks = extraction.chunk_text(text, chunk_size=1000)

# Solution 2: Limit the number of chunks
chunks = chunks[:50]

# Solution 3: Process papers one at a time, freeing memory between them
import gc

for paper in papers:
    # ... process the paper, then clear memory ...
    del parsed, summary
    gc.collect()
```

### Installation errors

Cause: Missing dependencies or wrong Python version
Solutions:

```bash
# Reinstall dependencies
pip install --upgrade -r requirements.txt

# Check Python version
python --version  # Should be 3.8+

# Create a fresh virtual environment
deactivate
rm -rf venv
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### Import errors

Cause: Python path not set correctly
Solutions:

```bash
# Solution 1: Add the project root to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

# Solution 2: Run from the project root
cd Agentic-Research-Assistant/
python -c "from src.main import run; run('test')"

# Solution 3: Install as a package
pip install -e .
```

### Debug mode

Enable detailed logging:
```python
# In your script
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
```

```bash
# Or set it in the environment
export LOG_LEVEL=DEBUG  # macOS/Linux
set LOG_LEVEL=DEBUG     # Windows
```

### Getting help

If you encounter issues:
- **Check logs**: Look at the console output for error messages
- **Enable debug mode**: Set `LOG_LEVEL=DEBUG`
- **Check API quota**: Visit Google AI Studio
- **Open an issue**: GitHub Issues
## Contributing

Contributions are welcome! Please follow these steps:

1. **Fork the repository**

   ```bash
   # Click "Fork" on GitHub, then:
   git clone https://github.com/your-username/agentic-research-assistant.git
   cd agentic-research-assistant
   ```

2. **Create a feature branch**

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. **Make your changes**

   - Write code
   - Add tests
   - Update documentation

4. **Commit your changes**

   ```bash
   git add .
   git commit -m 'Add amazing feature'
   ```

5. **Push to the branch**

   ```bash
   git push origin feature/amazing-feature
   ```

6. **Open a Pull Request**

   - Go to GitHub
   - Click "New Pull Request"
   - Describe your changes

### Code Style

- Follow the PEP 8 style guide
- Add docstrings to all functions
- Write unit tests for new features
- Update the README if needed

### Development Setup

```bash
# Install development dependencies
pip install pytest flake8 black mypy

# Run tests
pytest tests/

# Run linting
flake8 src/

# Format code
black src/

# Type checking
mypy src/
```

## License

This project is licensed under the MIT License - see below for details.
MIT License
Copyright (c) 2024 Raghav Agarwal
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- arXiv for providing open access to research papers
- Google Gemini for AI summarization capabilities
- Sentence Transformers for embedding models
- FAISS for efficient similarity search
- PyMuPDF for PDF processing
## Contact

For questions, suggestions, or collaboration:
- Email: agarwal1996raghav@gmail.com
- GitHub: @raghav-567
- Issues: Report a bug
- Discussions: Start a discussion
- ✅ arXiv integration
- ✅ PDF extraction
- ✅ RAG pipeline with FAISS
- ✅ AI summarization
- ✅ Literature review generation
- Support for PubMed and Google Scholar
- Citation graph analysis
- Interactive web UI improvements
- Multi-language support
- Custom prompt templates
- Export to LaTeX/Word
- Collaborative filtering
- Paper recommendation system
- Knowledge graph visualization
- Real-time collaboration features
- Integration with reference managers (Zotero, Mendeley)
Typical performance metrics:
| Operation | Time | API Calls |
|---|---|---|
| Search 5 papers | ~5s | 1 |
| Download PDFs | ~10s | 0 |
| Extract & Chunk | ~15s | 0 |
| Generate Summaries | ~30s | 5 |
| Build RAG Index | ~5s | 0 |
| Total Pipeline | ~65s | 6 |
Note: Times vary based on paper length and network speed