Skip to content

gejifeng/daily-arxiv

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Daily arXiv – AI Research Tracker πŸ“šπŸ€–

Python 3.12 License: MIT arXiv

English Document | δΈ­ζ–‡ζ–‡ζ‘£

Automatically track the latest AI research papers on arXiv each day, use LLMs for intelligent summarization, and generate research trend analysis reports.

✨ Features

Core Functions

  • πŸ” Intelligent Crawling: Daily automatic fetching of the newest papers from arXiv in specified fields

    • Supports multiple research areas (cs.AI, cs.LG, cs.CV, etc.)
    • Keyword filtering
    • TF‑IDF based smart selection
  • πŸ€– Multi‑Model Summarization: Use LLMs to generate concise paper summaries

    • Supports 5 LLM providers: OpenAI, Gemini, Claude, DeepSeek, vLLM
    • Bilingual (Chinese & English) summaries
    • Concurrent processing for higher efficiency
  • πŸ“Š Trend Analysis: In‑depth analysis of research hot topics and technological trends

    • TF‑IDF keyword extraction
    • LDA topic modeling
    • Word‑cloud visualization
    • LLM deep analysis (research hotspots, technology trends, future directions)
  • 🌐 Web Interface: Modern responsive web UI

    • Built with BootstrapΒ 5
    • Real‑time data display
    • Detailed paper view
    • Pagination and filtering
  • ⏰ Scheduled Execution: Various scheduling options

    • APScheduler (recommended)
    • Linux cron jobs
    • Systemd service
  • πŸ“§ Email Notifications: Execution status via email

    • Elegant HTML email templates
    • Separate success/failure notices
    • Detailed statistics

πŸ“Έ Interface Preview

alt textalt textalt text

πŸš€ Quick Start

Prerequisites

  • PythonΒ 3.12+
  • Conda (recommended) or virtualenv
  • LLM API keys (OpenAI / Gemini / Claude / DeepSeek / vLLM)

1. Clone the repository

git clone https://github.com/yourusername/daily-arxiv.git
cd daily-arxiv

2. Create a virtual environment

# Using Conda (recommended)
conda create -n daily-arxiv python=3.12 -y
conda activate daily-arxiv

# Or using venv
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

3. Install dependencies

pip install uv
uv pip install -r requirements.txt

4. Configure environment variables

# Copy the example file
cp .env.example .env

# Edit the .env file
nano .env

Add your API keys:

# OpenAI
OPENAI_API_KEY=sk-...

# Google Gemini
GEMINI_API_KEY=...

# Anthropic Claude
ANTHROPIC_API_KEY=...

# DeepSeek
DEEPSEEK_API_KEY=...

# vLLM (local deployment)
VLLM_API_KEY=EMPTY

# Email notifications (optional)
EMAIL_PASSWORD=your-app-password

5. Configure config.yaml

Edit config/config.yaml:

# Research fields
arxiv:
  categories:
    - "cs.AI"  # Artificial Intelligence
    - "cs.LG"  # Machine Learning
  
  keywords:
    - "large language model"
    - "transformer"
  
  max_results: 20

# LLM provider
llm:
  provider: "vllm"  # openai, gemini, claude, deepseek, vllm

# Scheduler settings
scheduler:
  enabled: true
  run_time: "09:00"
  timezone: "Asia/Shanghai"

6. Run tests

# Test paper fetching
python test/test_fetcher.py

# Test LLM summarization
python test/test_summarizer.py

# Test trend analysis
python test/test_analyzer.py

# Test web service
python test/test_web.py

# Test scheduler
python test/test_scheduler.py

7. Execute the full workflow

# Manual single run
python main.py

8. Start the web service

# Development mode
python src/web/app.py

# Open http://localhost:5000

9. Launch scheduled execution

# Recommended: use the start script
./deploy/start.sh

# Or run directly
python scheduler.py

Visit http://localhost:5000 to view results.

πŸ“‚ Project Structure

daily-arxiv/
β”œβ”€β”€ config/
β”‚   └── config.yaml              # Main configuration file
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ crawler/
β”‚   β”‚   └── arxiv_fetcher.py    # arXiv paper crawler
β”‚   β”œβ”€β”€ summarizer/
β”‚   β”‚   β”œβ”€β”€ base_llm_client.py  # Base LLM class
β”‚   β”‚   β”œβ”€β”€ openai_client.py    # OpenAI client
β”‚   β”‚   β”œβ”€β”€ gemini_client.py    # Gemini client
β”‚   β”‚   β”œβ”€β”€ claude_client.py    # Claude client
β”‚   β”‚   β”œβ”€β”€ deepseek_client.py  # DeepSeek client
β”‚   β”‚   β”œβ”€β”€ vllm_client.py      # vLLM client
β”‚   β”‚   β”œβ”€β”€ llm_factory.py      # LLM factory
β”‚   β”‚   └── paper_summarizer.py # Paper summarizer
β”‚   β”œβ”€β”€ analyzer/
β”‚   β”‚   └── trend_analyzer.py   # Trend analysis
β”‚   β”œβ”€β”€ web/
β”‚   β”‚   β”œβ”€β”€ app.py             # Flask web app
β”‚   β”‚   └── templates/
β”‚   β”‚       └── index.html     # Web UI page
β”‚   β”œβ”€β”€ notifier/
β”‚   β”‚   └── email_notifier.py  # Email notifier
β”‚   └── utils.py               # Utility functions
β”œβ”€β”€ static/
β”‚   └── js/
β”‚       └── main.js            # Front‑end JavaScript
β”œβ”€β”€ data/                      # Data storage
β”‚   β”œβ”€β”€ papers/               # Paper JSON files
β”‚   β”œβ”€β”€ summaries/            # Summary JSON files
β”‚  ──/ # word‑cloud images
β”œβ”€β”€ logs/                     # Log files
β”œβ”€β”€ deploy/                   # Deployment scripts
β”‚   β”œβ”€β”€ start.sh             # Start script
β”‚   β”œβ”€β”€ daily-arxiv.service  # Systemd service
β”‚   └── crontab.example      # Cron example
β”œβ”€β”€ docs/                     # Documentation
β”‚   β”œβ”€β”€ arxiv_fetcher_guide.md
β”‚   β”œβ”€β”€ trend_analyzer_guide.md
β”‚   β”œβ”€β”€ web_interface_guide.md
β”‚   └── scheduler_guide.md
β”œβ”€β”€ main.py                   # Main entry point
β”œβ”€β”€ scheduler.py              # APScheduler dispatcher
β”œβ”€β”€ test_*.py                # Test scripts
β”œβ”€β”€ requirements.txt         # Python dependencies
β”œβ”€β”€ .env.example            # Example env file
└── README.md               # Project overview

βš™οΈ Configuration Details

arXiv Category Codes

Common Computer Science categories:

  • cs.AI – Artificial Intelligence
  • cs.LG – Machine Learning
  • cs.CV – Computer Vision
  • cs.CL – Computation and Language (NLP)
  • cs.NE – Neural and Evolutionary Computing
  • stat.ML – Machine Learning (Statistics)

See the full list at: https://arxiv.org/category_taxonomy

LLM Providers

Supported providers:

  • OpenAI: GPT‑4, GPT‑3.5‑turbo
  • Gemini: Gemini models
  • Anthropic: Claude
  • DeepSeek: DeepSeek models
  • vLLM: Locally run open‑source models (OpenAI‑compatible API)

πŸ“ Development Roadmap

  • Project scaffolding βœ…
  • arXiv crawling βœ…
  • LLM summarization βœ…
    • Support OpenAI, Gemini, Claude, DeepSeek, vLLM
  • Trend analysis βœ…
    • Keyword extraction, topic modeling, word‑cloud generation
    • LLM‑driven deep analysis (hotspots, trends, innovations)
  • Web UI development
  • Scheduling functionality
  • Testing & optimization
  • UI beautification
  • Add WeChat public account integration

πŸ§ͺ Testing

# Test paper crawler
python test/test_fetcher.py

# Test summarizer
python test/test_summarizer.py

# Test trend analyzer
python test/test_analyzer.py

# Run full pipeline
python main.py

πŸ“Š Generated Files

data/
β”œβ”€β”€ papers/
β”‚   β”œβ”€β”€ papers_YYYY-MM-DD.json   # Daily paper data
β”‚   └── latest.json              # Latest paper data
β”œβ”€β”€ summaries/
β”‚   β”œβ”€β”€ summaries_YYYY-MM-DD.json# Daily summaries
β”‚   └── latest.json              # Latest summaries
└── analysis/
    β”œβ”€β”€ wordcloud_YYYY-MM-DD.png # Word‑cloud image
    β”œβ”€β”€ analysis_YYYY-MM-DD.json # Analysis results
    β”œβ”€β”€ report_YYYY-MM-DD.md     # Markdown report
    └── latest.json              # Latest analysis data

πŸ“– Documentation

🀝 Contributing

Feel free to open Issues and submit Pull Requests!

πŸ“„ License

MIT License

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors