An intelligent, multi-agent system for automated academic literature review generation. This tool searches arXiv, extracts content from PDFs, builds a RAG (Retrieval-Augmented Generation) pipeline, and synthesizes comprehensive literature reviews.
- Features
- Architecture
- Installation
- Quick Start
- Project Structure
- Usage
- Configuration
- API Reference
- Troubleshooting
- Contributing
- License
- Contact
## Features

- **Automated Paper Discovery**: Search and retrieve papers from arXiv based on your research query
- **PDF Processing**: Intelligent extraction and cleaning of academic PDFs
- **RAG Pipeline**: Vector-based retrieval using FAISS for semantic search
- **AI Summarization**: Powered by Google Gemini for high-quality summaries
- **Literature Synthesis**: Automatic generation of structured literature reviews
- **Multi-Agent Architecture**: Specialized agents for different tasks
- **Multiple Interfaces**: CLI workflow and Flask web interface
## Architecture

The system uses a multi-agent architecture with specialized components:
```text
┌────────────────────────────────────────────────────┐
│                   Research Query                   │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                    SEARCH AGENT                    │
│  • Queries arXiv API                               │
│  • Downloads PDFs                                  │
│  • Extracts metadata                               │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                  EXTRACTION AGENT                  │
│  • Parses PDF content                              │
│  • Cleans text (removes refs, figures, equations)  │
│  • Extracts abstract                               │
│  • Chunks body text                                │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                    RAG PIPELINE                    │
│  • Generates embeddings (SentenceTransformer)      │
│  • Builds FAISS index                              │
│  • Semantic similarity search                      │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                  SUMMARIZER AGENT                  │
│  • Summarizes abstract + body chunks               │
│  • Query-based chunk selection                     │
│  • Google Gemini API integration                   │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                 SYNTHESIZER AGENT                  │
│  • Aggregates paper summaries                      │
│  • Generates structured literature review          │
│  • Markdown formatting                             │
└─────────────────────────┬──────────────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│            Literature Review (Markdown)            │
└────────────────────────────────────────────────────┘
```
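The hand-off between agents can be sketched as a plain function pipeline. The stubs below are illustrative stand-ins for the real agent classes, showing only how data flows from one stage to the next:

```python
# Illustrative sketch of the agent hand-off. The stub functions stand in
# for the real SearchAgent / ExtractionAgent / SummarizerAgent /
# SynthesizerAgent; only the shape of the data flow matches the diagram.

def search_stub(query):
    # SEARCH AGENT: would query the arXiv API and download PDFs
    return [{"id": "p1", "title": f"A paper on {query}", "pdf_path": "pdfs/p1.pdf"}]

def extract_stub(paper):
    # EXTRACTION AGENT: would parse and clean the PDF
    return {"abstract": f"Abstract of {paper['title']}", "chunks": [{"text": "..."}]}

def summarize_stub(parsed, query):
    # SUMMARIZER AGENT: would call Gemini on the retrieved chunks
    return f"Summary ({query}): {parsed['abstract']}"

def synthesize_stub(papers, summaries):
    # SYNTHESIZER AGENT: would aggregate summaries into a review
    lines = ["# Literature Review"]
    for paper, summary in zip(papers, summaries):
        lines.append(f"## {paper['title']}\n\n{summary}")
    return "\n\n".join(lines)

def run_pipeline(query):
    papers = search_stub(query)
    parsed = [extract_stub(p) for p in papers]
    summaries = [summarize_stub(d, query) for d in parsed]
    return synthesize_stub(papers, summaries)

review = run_pipeline("graph neural networks")
print(review.splitlines()[0])  # -> "# Literature Review"
```

Each real agent follows this same contract: it consumes the previous stage's output and returns plain dicts and strings, which is what makes the stages easy to swap or test in isolation.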
## Installation

**Prerequisites:**

- Python 3.8 or higher
- pip package manager
- Google Gemini API key (available from Google AI Studio)

Clone the repository and set up a virtual environment:

```bash
git clone https://github.com/yourusername/agentic-research-assistant.git
cd agentic-research-assistant

# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

Create a `.env` file in the root directory:

```ini
API_KEY=your_gemini_api_key_here
```

Or export the key as an environment variable:

```bash
# On Windows (Command Prompt):
set API_KEY=your_gemini_api_key_here

# On Windows (PowerShell):
$env:API_KEY="your_gemini_api_key_here"

# On macOS/Linux:
export API_KEY="your_gemini_api_key_here"
```

## Quick Start

```python
from src.main import run

# Generate a literature review for your research topic
review = run("machine learning optimization")
```

The generated review will be saved in `outputs/sample_review_optimized.md`.

To use the Flask web interface instead:

```bash
python app/app.py
```

Then open your browser to http://localhost:5001.
## Project Structure

```text
Agentic-Research-Assistant/
│
├── src/
│   ├── agents/
│   │   ├── __init__.py
│   │   ├── search_agent.py        # arXiv search & PDF download
│   │   ├── extraction_agent.py    # PDF parsing & cleaning
│   │   ├── summarizer_agent.py    # AI-powered summarization
│   │   └── synthesizer_agent.py   # Literature review generation
│   │
│   ├── rag_pipeline.py            # FAISS + embeddings + retrieval
│   ├── utils.py                   # Logging, config helpers
│   └── main.py                    # Orchestrator (ties all agents)
│
├── app/
│   ├── app.py                     # Flask backend
│   └── templates/
│       └── llm.html               # Web UI template
│
├── outputs/
│   └── sample_review_optimized.md # Generated literature reviews
│
├── pdfs/                          # Downloaded PDFs (auto-created)
│
├── requirements.txt               # Python dependencies
├── .env.example                   # Environment variables template
├── .gitignore
└── readme.md                      # This file
```
## Usage

### Basic Usage

```python
from src.main import run

# Basic usage
review = run("deep learning")

# The function will:
# 1. Search arXiv for relevant papers
# 2. Download and parse PDFs
# 3. Extract and chunk content
# 4. Generate summaries using AI
# 5. Create a structured literature review
# 6. Save to outputs/sample_review_optimized.md
```

### Working with Individual Agents

```python
from src.agents.search_agent import SearchAgent
from src.agents.extraction_agent import ExtractionAgent
from src.agents.summarizer_agent import SummarizerAgent
from src.rag_pipeline import RAGPipeline
import faiss

# Initialize agents
search = SearchAgent(pdf_dir="pdfs")
extraction = ExtractionAgent()
summarizer = SummarizerAgent(api_key="your_api_key")

# Search for papers
papers = search.search_arxiv("neural networks", max_results=5)

# Process each paper
for paper in papers:
    if paper.get("pdf_path"):
        # Extract content
        parsed = extraction.parse_pdf(paper["pdf_path"], paper["id"])
        # Generate summary
        summary = summarizer.summarize_chunks(parsed, query="neural networks")
        print(f"Title: {paper['title']}")
        print(f"Summary: {summary}\n")
```

### Customization

```python
# Customize number of papers
papers = search.search_arxiv("quantum computing", max_results=10)

# Customize chunk size
chunks = extraction.chunk_text(text, chunk_size=1500)

# Customize summary length
summary = summarizer._summarize_text(text, max_output_tokens=500)

# Customize RAG retrieval
papers, summaries = rag.query(
    query="machine learning",
    top_k_chunks=200,
    top_k_papers=5,
    chunks_per_paper=15
)
```

### Batch Processing

```python
from src.main import run

queries = [
    "machine learning optimization",
    "deep learning computer vision",
    "natural language processing transformers"
]

for query in queries:
    print(f"Processing: {query}")
    review = run(query)
    print(f"Review saved for: {query}\n")
```

## Configuration

### Environment Variables

Create a `.env` file:
```ini
# Required
API_KEY=your_gemini_api_key

# Backwards-compatible alternative
GEMINI_API_KEY=your_gemini_api_key

# Optional
PDF_DIR=pdfs
OUTPUT_DIR=outputs
LOG_LEVEL=INFO
MAX_PAPERS=5
CHUNK_SIZE=2000
```

### Pipeline Parameters

Edit parameters in `src/main.py`:
```python
# Search configuration
papers = search.search_arxiv(query, max_results=3)     # Number of papers

# Chunking configuration
chunks = extraction.chunk_text(text, chunk_size=2000)  # Chunk size

# Summarization configuration
summary = summarizer._summarize_text(
    text,
    max_output_tokens=300  # Summary length
)

# RAG configuration
papers, summaries = rag.query(
    query=query,
    top_k_chunks=200,      # Chunks to retrieve
    top_k_papers=3,        # Papers to include
    chunks_per_paper=10    # Chunks per paper
)
```

### Logging

In `src/utils.py`:
```python
def get_logger(name):
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # Change to DEBUG for verbose output
    return logger
```

## API Reference

### SearchAgent

Initialize:
```python
search = SearchAgent(pdf_dir="pdfs")
```

Methods:

#### `search_arxiv(query, max_results)`

Search arXiv for papers matching the query.

Parameters:

- `query` (str): Search query
- `max_results` (int): Maximum number of papers to retrieve

Returns:

- List of paper dictionaries with keys: `id`, `title`, `summary`, `authors`, `pdf_path`, `url`, `published`

Example:

```python
papers = search.search_arxiv("machine learning", max_results=5)
for paper in papers:
    print(f"Title: {paper['title']}")
    print(f"Authors: {', '.join(paper['authors'])}")
```

#### PDF download

Download a PDF from the given URL.

Parameters:

- `pdf_url` (str): URL of the PDF
- `paper_id` (str): Unique identifier for the paper

Returns:

- str: Path to the downloaded PDF, or `None` if the download failed
Initialize:
extraction = ExtractionAgent()Methods:
Extract and clean text from a PDF file.
Parameters:
pdf_path(str): Path to PDF filepaper_id(str, optional): Paper identifier for metadata
Returns:
- dict:
{"abstract": str, "chunks": List[dict]}
Example:
parsed = extraction.parse_pdf("pdfs/paper123.pdf", paper_id="paper123")
print(f"Abstract: {parsed['abstract'][:200]}...")
print(f"Number of chunks: {len(parsed['chunks'])}")Split text into chunks for processing.
Parameters:
text(str): Text to chunkpaper_id(str, optional): Paper identifierchunk_size(int): Number of words per chunk
Returns:
- List[dict]: Chunks with metadata
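Since `chunk_size` counts words, chunking reduces to slicing the word list in fixed-size windows. The sketch below illustrates this idea; it is an assumption about the implementation, not the project's actual code:

```python
def chunk_text_sketch(text, paper_id=None, chunk_size=2000):
    """Split text into word-count chunks with minimal metadata (illustrative)."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), chunk_size):
        chunks.append({
            "paper_id": paper_id,
            "chunk_index": len(chunks),
            "text": " ".join(words[start:start + chunk_size]),
        })
    return chunks

# 4500 words at 2000 words per chunk -> chunks of 2000, 2000, and 500 words
chunks = chunk_text_sketch("word " * 4500, paper_id="p1", chunk_size=2000)
print(len(chunks))  # -> 3
```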
#### Text cleaning

Remove equations, citations, figures, and other noise from text.

Parameters:

- `text` (str): Raw text

Returns:

- str: Cleaned text
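A plausible regex-based sketch of this kind of cleaning — stripping bracketed citations and figure/table caption lines, then collapsing whitespace — is shown below. The exact rules the agent applies may differ:

```python
import re

def clean_text_sketch(text):
    """Illustrative cleaning pass; the real agent's rules may differ."""
    # Remove bracketed citations like [12] or [3, 4]
    text = re.sub(r"\s*\[\d+(?:\s*,\s*\d+)*\]", "", text)
    # Remove lines that are figure/table captions
    text = re.sub(r"^(Figure|Table)\s+\d+.*$", "", text, flags=re.MULTILINE)
    # Collapse runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

raw = "Results improve [1, 2].\nFigure 3: Loss curves.\nSee Table 1 for details."
print(clean_text_sketch(raw))  # -> "Results improve. See Table 1 for details."
```

Note that inline mentions like "See Table 1" survive; only whole caption lines are dropped.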
### SummarizerAgent

Initialize:

```python
summarizer = SummarizerAgent(api_key="your_gemini_api_key")
```

Methods:

#### `summarize_chunks(paper_data, query, k)`

Generate a summary from paper content.

Parameters:

- `paper_data` (dict): `{"abstract": str, "chunks": List[dict]}`
- `query` (str, optional): Query for relevance-based chunk selection
- `k` (int): Number of top chunks to use

Returns:

- str: Generated summary

Example:

```python
summary = summarizer.summarize_chunks(
    paper_data=parsed,
    query="machine learning",
    k=5
)
print(summary)
```

#### Embedding generation

Generate embeddings for text chunks.

Parameters:

- `chunks` (List[str]): List of text chunks
- `normalize` (bool): Whether to normalize embeddings

Returns:

- np.ndarray: Array of embeddings
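The `normalize` flag matters because the pipeline stores vectors in a FAISS inner-product index (`IndexFlatIP`): with unit-length embeddings, inner product equals cosine similarity. A NumPy sketch of the normalization step:

```python
import numpy as np

def normalize_rows(embeddings):
    """L2-normalize each row so inner product becomes cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)  # avoid division by zero

emb = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = normalize_rows(emb)
print(unit[0])  # -> [0.6 0.8]
```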
### RAGPipeline

Initialize:

```python
from src.rag_pipeline import RAGPipeline
import faiss

dim = summarizer.embedding_model.get_sentence_embedding_dimension()
index = faiss.IndexFlatIP(dim)

rag = RAGPipeline(
    search_agent=search,
    extraction_agent=extraction,
    summarizer_agent=summarizer,
    index=index,
    id_to_metadata={}
)
```

Methods:

#### `build_index(chunks, paper_info)`

Add chunks to the FAISS index.

Parameters:

- `chunks` (List[dict]): Chunks with text and metadata
- `paper_info` (dict, optional): Paper metadata

Example:

```python
rag.build_index(chunks, paper_info=paper)
```

#### `query(query, top_k_chunks, top_k_papers, chunks_per_paper)`

Retrieve and summarize relevant papers.
Parameters:

- `query` (str): Search query
- `top_k_chunks` (int): Total chunks to retrieve
- `top_k_papers` (int): Number of papers to return
- `chunks_per_paper` (int): Chunks per paper for summarization

Returns:

- Tuple[List[dict], List[str]]: (papers, summaries)
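Under the hood, an `IndexFlatIP` search is an exact inner-product top-k over the stored chunk embeddings. The equivalent NumPy computation (illustrative, not the project's code) is:

```python
import numpy as np

# Toy chunk embeddings (already L2-normalized) and a query embedding
chunk_vecs = np.array([
    [1.0, 0.0],   # chunk 0
    [0.0, 1.0],   # chunk 1
    [0.6, 0.8],   # chunk 2
])
query_vec = np.array([0.0, 1.0])

# Inner-product scores; with unit vectors this is cosine similarity
scores = chunk_vecs @ query_vec

# Indices of the top-2 chunks, best first
top_k = np.argsort(-scores)[:2]
print(top_k)  # -> [1 2]  (chunk 1 scores 1.0, chunk 2 scores 0.8)
```

The pipeline then groups the retrieved chunks by paper to produce the `(papers, summaries)` tuple.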
### SynthesizerAgent

Initialize:

```python
from src.agents.synthesizer_agent import SynthesizerAgent

synthesizer = SynthesizerAgent()
```

Methods:

#### Review synthesis

Generate a structured literature review.

Parameters:

- `papers` (List[dict]): List of paper metadata
- `summaries` (List[str]): List of paper summaries

Returns:

- str: Markdown-formatted literature review
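A minimal sketch of how per-paper summaries could be assembled into a Markdown review is shown below. This is illustrative only — the real agent delegates the synthesis itself to Gemini rather than simple concatenation:

```python
def assemble_review(papers, summaries, topic="Literature Review"):
    """Join per-paper summaries into one Markdown document (illustrative)."""
    parts = [f"# {topic}", ""]
    for paper, summary in zip(papers, summaries):
        parts.append(f"## {paper['title']}")
        authors = ", ".join(paper.get("authors", []))
        if authors:
            parts.append(f"*{authors}*")
        parts.append(summary)
        parts.append("")
    return "\n".join(parts)

papers = [{"title": "Paper A", "authors": ["Ada", "Bob"]}]
review = assemble_review(papers, ["A short summary."])
print(review.splitlines()[0])  # -> "# Literature Review"
```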
## Troubleshooting

### API rate limit or quota errors

Cause: Gemini API rate limiting or quota exceeded

Solutions:

```python
# Solution 1: Reduce the number of papers
papers = search.search_arxiv(query, max_results=2)

# Solution 2: Use the optimized main.py (fewer API calls)
from src.main import run  # Uses the optimized version

# Solution 3: Check your API quota
# Visit: https://makersuite.google.com/app/apikey

# Solution 4: Add manual delays
import time
time.sleep(2)  # Between API calls
```

### PDF download failures

Cause: PDF download failed or the paper has no PDF available

Solutions:

- Check your internet connection
- Some papers don't have PDFs (the abstract will be used instead)
- Check `pdfs/` directory permissions:

```bash
# On macOS/Linux:
chmod 755 pdfs/
# On Windows:
# Right-click the pdfs folder → Properties → Security → Edit
```
### No chunks extracted

Cause: No valid text chunks were extracted

Solutions:

```python
# Check whether PDFs downloaded
import os
print(os.listdir("pdfs"))

# Check extraction
parsed = extraction.parse_pdf(pdf_path, paper_id)
print(f"Abstract: {bool(parsed['abstract'])}")
print(f"Chunks: {len(parsed['chunks'])}")

# Debug extraction
if not parsed['chunks']:
    print("No chunks extracted - PDF might be image-based or corrupted")
```

### High memory usage

Solutions:

```python
# Solution 1: Reduce chunk size
chunks = extraction.chunk_text(text, chunk_size=1000)

# Solution 2: Limit the number of chunks
chunks = chunks[:50]

# Solution 3: Process papers one at a time, freeing memory between them
import gc

for paper in papers:
    # ... process the paper, then clear memory ...
    del parsed, summary
    gc.collect()
```

### Installation errors

Cause: Missing dependencies or wrong Python version
Solutions:

```bash
# Reinstall dependencies
pip install --upgrade -r requirements.txt

# Check Python version
python --version  # Should be 3.8+

# Create a fresh virtual environment
deactivate
rm -rf venv
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

### Import errors

Cause: Python path not set correctly
Solutions:

```bash
# Solution 1: Add the project root to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

# Solution 2: Run from the project root
cd Agentic-Research-Assistant/
python -c "from src.main import run; run('test')"

# Solution 3: Install as a package
pip install -e .
```

### Debug mode

Enable detailed logging:
```python
# In your script
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
```

```bash
# Or set it in the environment
export LOG_LEVEL=DEBUG  # macOS/Linux
set LOG_LEVEL=DEBUG     # Windows
```

### Getting help

If you encounter issues:
- **Check logs**: Look at the console output for error messages
- **Enable debug mode**: Set `LOG_LEVEL=DEBUG`
- **Check API quota**: Visit Google AI Studio
- **Open an issue**: GitHub Issues
## Contributing

Contributions are welcome! Please follow these steps:

1. **Fork the repository**

   ```bash
   # Click "Fork" on GitHub, then:
   git clone https://github.com/your-username/agentic-research-assistant.git
   cd agentic-research-assistant
   ```

2. **Create a feature branch**

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. **Make your changes**

   - Write code
   - Add tests
   - Update documentation

4. **Commit your changes**

   ```bash
   git add .
   git commit -m 'Add amazing feature'
   ```

5. **Push to the branch**

   ```bash
   git push origin feature/amazing-feature
   ```

6. **Open a Pull Request**

   - Go to GitHub
   - Click "New Pull Request"
   - Describe your changes

### Code Style

- Follow the PEP 8 style guide
- Add docstrings to all functions
- Write unit tests for new features
- Update the README if needed

### Development Setup

```bash
# Install development dependencies
pip install pytest flake8 black mypy

# Run tests
pytest tests/

# Run linting
flake8 src/

# Format code
black src/

# Type checking
mypy src/
```

## License

This project is licensed under the MIT License - see below for details.
MIT License
Copyright (c) 2024 Raghav Agarwal
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- arXiv for providing open access to research papers
- Google Gemini for AI summarization capabilities
- Sentence Transformers for embedding models
- FAISS for efficient similarity search
- PyMuPDF for PDF processing
## Contact

For questions, suggestions, or collaboration:
- Email: agarwal1996raghav@gmail.com
- GitHub: @raghav-567
- Issues: Report a bug
- Discussions: Start a discussion
- ✅ arXiv integration
- ✅ PDF extraction
- ✅ RAG pipeline with FAISS
- ✅ AI summarization
- ✅ Literature review generation
- Support for PubMed and Google Scholar
- Citation graph analysis
- Interactive web UI improvements
- Multi-language support
- Custom prompt templates
- Export to LaTeX/Word
- Collaborative filtering
- Paper recommendation system
- Knowledge graph visualization
- Real-time collaboration features
- Integration with reference managers (Zotero, Mendeley)
Typical performance metrics:
| Operation | Time | API Calls |
|---|---|---|
| Search 5 papers | ~5s | 1 |
| Download PDFs | ~10s | 0 |
| Extract & Chunk | ~15s | 0 |
| Generate Summaries | ~30s | 5 |
| Build RAG Index | ~5s | 0 |
| Total Pipeline | ~65s | 6 |
Note: Times vary based on paper length and network speed