cat > README.md << 'EOF'
# Secure Web Crawling & Feed Aggregation Service

A production-ready news aggregator built with Go that crawls websites, deduplicates content, and provides a personalized feed API.

## Features
- JWT Authentication - Secure user authentication with role-based access
- Web Crawler - Asynchronous HTML parsing and article extraction
- Content Deduplication - SHA-256 hashing to prevent duplicate articles
- RESTful API - Clean API with pagination support
- MongoDB Integration - Optimized with indexes for fast queries
- Background Jobs - Non-blocking crawl operations with goroutines
- Subscription Management - Users can subscribe to multiple sources
## Tech Stack

- Language: Go 1.21+
- Framework: Gin
- Database: MongoDB
- Authentication: JWT
- Web Scraping: goquery
## Prerequisites

- Go 1.21 or higher
- MongoDB 4.4 or higher
- Git
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Bavithbabu/Secure-Web-Crawling-Feed-Aggregation-Service.git
   cd Secure-Web-Crawling-Feed-Aggregation-Service
   ```

2. Install dependencies:

   ```bash
   go mod download
   ```

3. Create a `.env` file:

   ```env
   PORT=9000
   MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/
   SECRET_KEY=your_secret_key_here
   ```

4. Run the application:

   ```bash
   go run main.go
   ```

The server starts at http://localhost:9000.
## API Endpoints

### Sign up

```http
POST /users/signup
Content-Type: application/json

{
  "first_name": "John",
  "last_name": "Doe",
  "email": "john@example.com",
  "password": "Password@123",
  "phone": "1234567890",
  "user_type": "USER"
}
```

### Log in

```http
POST /users/login
Content-Type: application/json

{
  "email": "john@example.com",
  "password": "Password@123"
}
```

### Add a subscription

```http
POST /api/subscriptions
token: <your_jwt_token>
Content-Type: application/json

{
  "url": "https://news.ycombinator.com"
}
```

### List subscriptions

```http
GET /api/subscriptions
token: <your_jwt_token>
```

### Delete a subscription

```http
DELETE /api/subscriptions/:id
token: <your_jwt_token>
```

### Crawl one subscription

```http
POST /api/crawl/:subscription_id
token: <your_jwt_token>
```

### Crawl all subscriptions

```http
POST /api/crawl/all
token: <your_jwt_token>
```

### Get the feed

```http
GET /api/feed?page=1&limit=20
token: <your_jwt_token>
```

## Project Structure

```
go-lang-jwt/
├── controllers/   # HTTP request handlers
├── database/      # MongoDB connection & indexes
├── helpers/       # JWT & hashing utilities
├── middleware/    # Authentication middleware
├── models/        # Data structures
├── routes/        # API route definitions
├── services/      # Business logic
├── main.go        # Application entry point
├── go.mod         # Go module dependencies
└── .env           # Environment variables
```
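The `page`/`limit` query parameters on the feed endpoint translate into a skip/limit pair for the database query. Here is a dependency-free sketch of that translation; the defaults (page 1, limit 20) match the example request, but the cap of 100 is an illustrative assumption, not a value taken from the repository.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageParams converts ?page=&limit= query values into the skip/limit
// pair a paginated MongoDB find would use. Invalid or missing values
// fall back to page 1 and a limit of 20 (capped at 100 here as an
// illustrative guard against oversized requests).
func pageParams(pageStr, limitStr string) (skip, limit int64) {
	page, err := strconv.ParseInt(pageStr, 10, 64)
	if err != nil || page < 1 {
		page = 1
	}
	limit, err = strconv.ParseInt(limitStr, 10, 64)
	if err != nil || limit < 1 || limit > 100 {
		limit = 20
	}
	return (page - 1) * limit, limit
}

func main() {
	skip, limit := pageParams("3", "20")
	fmt.Println(skip, limit) // 40 20
}
```

The resulting pair maps directly onto the driver's `Skip`/`Limit` find options, so page 3 with a limit of 20 skips the first 40 articles.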
## Database Collections

- users - User accounts with hashed passwords
- sources - Crawled website sources
- subscriptions - User-source mappings
- articles - Extracted and deduplicated articles
## Security

- Bcrypt password hashing
- JWT token authentication
- Role-based access control (ADMIN/USER)
- Input validation
- MongoDB injection prevention
## Performance

- Database indexes on frequently queried fields
- Pagination for large datasets
- Background crawling with goroutines
- Content deduplication (SHA-256)
- Article limit per source (50 max)
## Supported Sites
Currently optimized for:
- Hacker News
- Lobsters
- Any site with standard HTML structure
## Contributing

Pull requests are welcome! For major changes, please open an issue first.
## License

MIT License
## Author

Bavith Babu
- GitHub: @Bavithbabu
## Future Enhancements

- RSS feed parser
- Scheduled crawling with cron
- Email notifications
- Full-text search
- Docker containerization
- Rate limiting & caching
⭐ Star this repo if you found it helpful!
EOF