
Conversation

@FJiangArthur

No description provided.

This commit adds full support for monitoring Western media sources,
especially U.S. political news from left, right, and center
perspectives, plus social media platforms and technical news sources.

New Features:
- Western news RSS collection (CNN, Fox News, Reuters, NYT, WaPo, etc.; collection loop sketched below)
- Reddit crawler with API support (political, tech, news subreddits)
- Twitter/X crawler (scraper-based, no API needed but high IP ban risk)
- YouTube crawler with API support (political channels, tech content)
- HackerNews crawler (free API, technical news)
- Google News integration (various topics)
- Comprehensive rate limiting and IP protection for home use
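
A minimal sketch of the RSS collection loop, assuming feedparser and a fixed inter-request delay; the feed URLs, helper name, and delay value are illustrative, not the PR's actual code:

```python
import time
import feedparser

# Illustrative subset of the feeds listed above.
FEEDS = {
    "cnn": "http://rss.cnn.com/rss/cnn_topstories.rss",
    "nyt": "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
}

MIN_DELAY_SECONDS = 5  # conservative default for home-IP safety

def collect_western_news():
    articles = []
    for source, url in FEEDS.items():
        feed = feedparser.parse(url)
        for entry in feed.entries:
            articles.append({
                "source": source,
                "title": entry.get("title"),
                "link": entry.get("link"),
                "published": entry.get("published"),
            })
        time.sleep(MIN_DELAY_SECONDS)  # minimum delay between feed fetches
    return articles
```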

Platform Coverage:
Left-leaning news: CNN, MSNBC, NYT, Washington Post, NPR
Right-leaning news: Fox News, Breitbart, Daily Wire, NY Post
Center/Balanced: Reuters, AP, BBC, WSJ
Tech news: TechCrunch, The Verge, Wired, HackerNews
Social: Reddit, Twitter/X, YouTube

IP Protection Features:
- Per-platform rate limiting (configurable requests/hour; see the limiter sketch after this list)
- Minimum delay enforcement between requests
- Request counting and quota management
- User agent rotation
- Conservative defaults to prevent home IP bans
- Special protection for high-risk platforms (Twitter)
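
A minimal sketch of the protections just listed, assuming fake-useragent for rotation; the class name, quota, and delay defaults are illustrative:

```python
import time
from fake_useragent import UserAgent

class PlatformLimiter:
    """Per-platform request throttling with user agent rotation."""

    def __init__(self, max_per_hour=30, min_delay=10.0):
        self.max_per_hour = max_per_hour  # configurable requests/hour
        self.min_delay = min_delay        # minimum delay between requests
        self.timestamps = []              # request times in the last hour
        self.ua = UserAgent()

    def wait_and_headers(self):
        now = time.time()
        # Drop requests older than an hour, then enforce the hourly quota.
        self.timestamps = [t for t in self.timestamps if now - t < 3600]
        if len(self.timestamps) >= self.max_per_hour:
            time.sleep(3600 - (now - self.timestamps[0]))
        # Enforce the minimum inter-request delay.
        if self.timestamps and time.time() - self.timestamps[-1] < self.min_delay:
            time.sleep(self.min_delay - (time.time() - self.timestamps[-1]))
        self.timestamps.append(time.time())
        return {"User-Agent": self.ua.random}  # rotate user agent per request
```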

Database Changes:
- New tables for reddit_post, reddit_comment (reddit_post is sketched below)
- New tables for twitter_tweet
- New tables for youtube_video, youtube_comment
- New tables for hackernews_post, hackernews_comment
- New table for western_news_article
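
For illustration, here is reddit_post sketched as a SQLAlchemy model; the PR may declare these tables differently, and the column set is a guess at typical crawler fields:

```python
from sqlalchemy import Column, DateTime, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class RedditPost(Base):
    __tablename__ = "reddit_post"

    id = Column(Integer, primary_key=True)
    post_id = Column(String(32), unique=True)   # Reddit's base36 post id
    subreddit = Column(String(64), index=True)
    title = Column(Text)
    score = Column(Integer)
    created_utc = Column(DateTime)
```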

Configuration:
- Updated .env.example with Western platform API credentials
- Added rate limiting configuration options
- Platform-specific delay and quota settings
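
A hedged sketch of how the new settings might be consumed; every variable name here is a hypothetical stand-in for whatever .env.example actually defines:

```python
import os

# Hypothetical credential names for the Western platform APIs.
REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
YOUTUBE_API_KEY = os.getenv("YOUTUBE_API_KEY")

# Platform-specific delay and quota settings, with conservative defaults.
RATE_LIMITS = {
    "reddit":  {"requests_per_hour": int(os.getenv("REDDIT_RPH", "60")),  "min_delay": 5},
    "twitter": {"requests_per_hour": int(os.getenv("TWITTER_RPH", "20")), "min_delay": 30},
}
```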

Documentation:
- Comprehensive WESTERN_MEDIA_SETUP.md guide
- API setup instructions (Reddit, YouTube)
- IP protection best practices
- Usage examples for all platforms
- Troubleshooting guide

Dependencies:
- praw (Reddit API)
- feedparser (RSS feeds)
- google-api-python-client (YouTube API)
- tweepy (Twitter API, optional)
- ntscraper (Twitter scraper)
- ratelimit (rate limiting)
- fake-useragent (user agent rotation)

Testing:
- test_western_crawlers.py for quick validation
- Individual crawler test functions

Note: TikTok US support is planned but not fully implemented yet
due to the complexity of TikTok's anti-scraping measures.

This implementation prioritizes IP safety for home use, with
aggressive rate limiting to prevent bans.

Critical Fix:
- ntscraper is non-functional (the Nitter instances it relied on were shut down by Twitter)
- Replaced it with twikit (free) and Apify API (paid) options

New Implementation:
- twitter_crawler_v2.py with dual backend support
  * twikit: free scraping (requires a Twitter account and ongoing maintenance)
  * Apify API: paid scraping (~$0.30 per 1,000 tweets, reliable)
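
A hedged sketch of the dual-backend idea; the real twitter_crawler_v2.py interface may differ, and the Apify actor id below is a placeholder rather than a specific recommendation:

```python
import asyncio
import os

async def fetch_tweets_twikit(query, limit=50):
    from twikit import Client  # free backend: needs a Twitter account login
    client = Client("en-US")
    await client.login(
        auth_info_1=os.getenv("TWITTER_USERNAME"),
        auth_info_2=os.getenv("TWITTER_EMAIL"),
        password=os.getenv("TWITTER_PASSWORD"),
    )
    tweets = await client.search_tweet(query, "Latest")
    return [t.text for t in tweets][:limit]

def fetch_tweets_apify(query, limit=50):
    from apify_client import ApifyClient  # paid backend: ~$0.30/1,000 tweets
    client = ApifyClient(os.getenv("APIFY_TOKEN"))
    run = client.actor("example/tweet-scraper").call(  # placeholder actor id
        run_input={"searchTerms": [query], "maxItems": limit}
    )
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())

def fetch_tweets(query, backend="twikit"):
    if backend == "twikit":
        return asyncio.run(fetch_tweets_twikit(query))
    return fetch_tweets_apify(query)
```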

Files Added:
- twitter_crawler_v2.py: New working Twitter crawler
- TWITTER_MIGRATION_GUIDE.md: Comprehensive migration guide

Updated:
- requirements.txt: replaced ntscraper with twikit
- Marked ntscraper as deprecated

Recommended solutions for monitoring 100 publishers daily:
1. Brand24 ($49-99/month) - Best overall, multi-platform
2. Apify API (~$10/month) - Good value, reliable
3. twikit (free) - Budget option, requires maintenance
4. Twitter API ($200/month) - Official but expensive

See TWITTER_MIGRATION_GUIDE.md for detailed setup instructions.
Implements an industry-standard multi-agent system based on Anthropic's
best practices for coordinating specialized agents in Western media
monitoring.

Documentation (docs/):
- PROJECT_STRUCTURE.md: Complete system architecture and directory structure
- PROJECT_PLAN.md: 12-week implementation plan with phases and milestones
- AGENT_PERSONAS.md: Detailed personas for 20+ specialized agents
- INTER_AGENT_COMMUNICATION.md: Message protocols and communication patterns
- MULTI_AGENT_SYSTEM_README.md: Comprehensive getting started guide

Agent Framework (agents/):
- shared/base_agent.py: Base class template for all agents
  * Standard lifecycle management (IDLE → READY → WORKING → COMPLETED)
  * Message bus communication (pub/sub pattern)
  * Health monitoring and metrics
  * Error handling and reporting
  * Logging and observability
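
A hedged sketch of that lifecycle and metrics surface; class and method names are illustrative, not necessarily those in agents/shared/base_agent.py:

```python
import logging
from enum import Enum

class AgentState(Enum):
    IDLE = "idle"
    READY = "ready"
    WORKING = "working"
    COMPLETED = "completed"

class BaseAgent:
    def __init__(self, name, message_bus):
        self.name = name
        self.bus = message_bus            # pub/sub client, e.g. Redis
        self.state = AgentState.IDLE
        self.metrics = {"tasks_done": 0, "errors": 0}
        self.log = logging.getLogger(name)

    def run_task(self, task):
        self.state = AgentState.WORKING
        try:
            result = self.execute(task)   # subclasses implement execute()
            self.metrics["tasks_done"] += 1
            self.state = AgentState.COMPLETED
            return result
        except Exception:
            self.metrics["errors"] += 1
            self.log.exception("task failed")
            raise

    def execute(self, task):
        raise NotImplementedError

    def health(self):
        return {"agent": self.name, "state": self.state.value, **self.metrics}
```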

- platform_agents/reddit_agent.py: Example concrete implementation
  * Monitors political and tech subreddits
  * Integrates with reddit_crawler.py
  * Demonstrates agent personality and expertise
  * Shows task execution patterns
  * Includes health checks

Agent Categories:
1. Coordinator Agents (3): Project Manager, Task Dispatcher, Status Tracker
2. Platform Agents (6): Reddit, Twitter, YouTube, HackerNews, TikTok, RSS News
3. Data Agents (4): Pipeline, Storage, Validation, Deduplication
4. Analysis Agents (3): Sentiment, Topic, Bias
5. Protection Agents (3): Rate Limiter, Health Monitor, Error Recovery
6. QA Agents (2): Test, Monitoring

Key Features:
- Message Bus Architecture: Redis-based pub/sub for agent communication (sketched below)
- Standard Message Formats: JSON schemas for all message types
- Priority System: 1-5 priority levels for message processing
- Circuit Breaker Pattern: Prevents cascading failures
- Rate Limiting: Per-platform IP protection
- Health Monitoring: Real-time agent health checks
- Error Recovery: Automatic retry and recovery strategies
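
A minimal sketch of the message bus with a JSON envelope carrying the priority field; the channel name and schema fields are assumptions drawn from the docs above, not the actual message format:

```python
import json
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def publish(sender, channel, msg_type, payload, priority=3):
    message = {
        "sender": sender,
        "type": msg_type,
        "priority": priority,   # 1 (highest) to 5 (lowest)
        "timestamp": time.time(),
        "payload": payload,
    }
    r.publish(channel, json.dumps(message))

# Example: a platform agent announcing new data to subscribers.
publish("reddit_agent", "agents.events", "new_posts", {"count": 42}, priority=2)
```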

Communication Patterns:
- Request-Response: Synchronous operations with timeout
- Pub-Sub: Event broadcasting to multiple agents
- Task Queue: Distributed work distribution
- Circuit Breaker: Error isolation and recovery
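
A minimal sketch of the circuit-breaker pattern named above; the failure threshold and reset window are illustrative defaults:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=60.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time()  # open the circuit
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```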

Project Timeline:
- Phase 1 (Weeks 1-3): Foundation and core crawlers ✓ (mostly done)
- Phase 2 (Weeks 4-6): Agent implementation
- Phase 3 (Weeks 7-9): Integration and testing
- Phase 4 (Weeks 10-12): Production deployment

Benefits:
- Scalability: Agents can be scaled independently
- Modularity: Easy to add new platforms/features
- Reliability: Isolated failures, automatic recovery
- Observability: Full visibility into system state
- Maintainability: Clear separation of concerns
- Best Practices: Based on Anthropic's agent guidelines

Next Steps:
1. Implement remaining platform agents (Twitter, YouTube, etc.)
2. Implement data processing agents
3. Set up Redis message bus
4. Create agent coordinator
5. Write integration tests

See docs/MULTI_AGENT_SYSTEM_README.md for getting started guide.
@sonarqubecloud

Quality Gate failed

Failed conditions
5 Security Hotspots

See analysis details on SonarQube Cloud

