cat > README.md << 'EOF'
# Secure Web Crawling & Feed Aggregation Service

A production-ready news aggregator built with Go that crawls websites, deduplicates content, and provides a personalized feed API.

## Features
- JWT Authentication - Secure user authentication with role-based access
- Web Crawler - Asynchronous HTML parsing and article extraction
- Content Deduplication - SHA-256 hashing to prevent duplicate articles
- RESTful API - Clean API with pagination support
- MongoDB Integration - Optimized with indexes for fast queries
- Background Jobs - Non-blocking crawl operations with goroutines
- Subscription Management - Users can subscribe to multiple sources
## Tech Stack

- Language: Go 1.21+
- Framework: Gin
- Database: MongoDB
- Authentication: JWT
- Web Scraping: goquery
## Prerequisites

- Go 1.21 or higher
- MongoDB 4.4 or higher
- Git
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Bavithbabu/Secure-Web-Crawling-Feed-Aggregation-Service.git
   cd Secure-Web-Crawling-Feed-Aggregation-Service
   ```

2. Install dependencies:

   ```bash
   go mod download
   ```

3. Create a `.env` file:

   ```env
   PORT=9000
   MONGODB_URL=mongodb+srv://username:password@cluster.mongodb.net/
   SECRET_KEY=your_secret_key_here
   ```

4. Run the application:

   ```bash
   go run main.go
   ```

The server starts at http://localhost:9000.
## API Endpoints

### Sign up

```http
POST /users/signup
Content-Type: application/json

{
  "first_name": "John",
  "last_name": "Doe",
  "email": "john@example.com",
  "password": "Password@123",
  "phone": "1234567890",
  "user_type": "USER"
}
```

### Log in

```http
POST /users/login
Content-Type: application/json

{
  "email": "john@example.com",
  "password": "Password@123"
}
```

### Add a subscription

```http
POST /api/subscriptions
token: <your_jwt_token>
Content-Type: application/json

{
  "url": "https://news.ycombinator.com"
}
```

### List subscriptions

```http
GET /api/subscriptions
token: <your_jwt_token>
```

### Delete a subscription

```http
DELETE /api/subscriptions/:id
token: <your_jwt_token>
```

### Crawl one subscription

```http
POST /api/crawl/:subscription_id
token: <your_jwt_token>
```

### Crawl all subscriptions

```http
POST /api/crawl/all
token: <your_jwt_token>
```

### Get the feed

```http
GET /api/feed?page=1&limit=20
token: <your_jwt_token>
```

## Project Structure

```
go-lang-jwt/
├── controllers/   # HTTP request handlers
├── database/      # MongoDB connection & indexes
├── helpers/       # JWT & hashing utilities
├── middleware/    # Authentication middleware
├── models/        # Data structures
├── routes/        # API route definitions
├── services/      # Business logic
├── main.go        # Application entry point
├── go.mod         # Go module dependencies
└── .env           # Environment variables
```
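The `page`/`limit` query parameters on the feed endpoint translate into a skip/limit pair for the database query. Here is a dependency-free sketch of that translation; the defaults (page 1, limit 20) match the example request, but the cap of 100 is an illustrative assumption, not a value taken from the repository.

```go
package main

import (
	"fmt"
	"strconv"
)

// pageParams converts ?page=&limit= query values into the skip/limit
// pair a paginated MongoDB find would use. Invalid or missing values
// fall back to page 1 and a limit of 20 (capped at 100 here as an
// illustrative guard against oversized requests).
func pageParams(pageStr, limitStr string) (skip, limit int64) {
	page, err := strconv.ParseInt(pageStr, 10, 64)
	if err != nil || page < 1 {
		page = 1
	}
	limit, err = strconv.ParseInt(limitStr, 10, 64)
	if err != nil || limit < 1 || limit > 100 {
		limit = 20
	}
	return (page - 1) * limit, limit
}

func main() {
	skip, limit := pageParams("3", "20")
	fmt.Println(skip, limit) // 40 20
}
```

The resulting pair maps directly onto the driver's `Skip`/`Limit` find options, so page 3 with a limit of 20 skips the first 40 articles.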
## Database Collections

- users - User accounts with hashed passwords
- sources - Crawled website sources
- subscriptions - User-source mappings
- articles - Extracted and deduplicated articles
## Security

- Bcrypt password hashing
- JWT token authentication
- Role-based access control (ADMIN/USER)
- Input validation
- MongoDB injection prevention
## Performance

- Database indexes on frequently queried fields
- Pagination for large datasets
- Background crawling with goroutines
- Content deduplication (SHA-256)
- Article limit per source (50 max)
## Supported Sites
Currently optimized for:
- Hacker News
- Lobsters
- Any site with standard HTML structure
## Contributing

Pull requests are welcome! For major changes, please open an issue first.
## License

MIT License
## Author

Bavith Babu
- GitHub: @Bavithbabu
## Future Enhancements

- RSS feed parser
- Scheduled crawling with cron
- Email notifications
- Full-text search
- Docker containerization
- Rate limiting & caching
⭐ Star this repo if you found it helpful!
EOF