Claude/adapt western media sources 01 e4q ej pwxa zxqmo wh81 puok #358
Open
FJiangArthur wants to merge 4 commits into 666ghj:main from FJiangArthur:claude/adapt-western-media-sources-01E4qEjPWXAZxqmoWH81Puok
This commit adds full support for monitoring Western media sources, especially USA political news from left/right/center perspectives, plus social media platforms and technical news sources.

New Features:
- Western news RSS collection (CNN, Fox News, Reuters, NYT, WaPo, etc.)
- Reddit crawler with API support (political, tech, news subreddits)
- Twitter/X crawler (scraper-based, no API needed but high IP ban risk)
- YouTube crawler with API support (political channels, tech content)
- HackerNews crawler (free API, technical news)
- Google News integration (various topics)
- Comprehensive rate limiting and IP protection for home use

Platform Coverage:
Left-leaning news: CNN, MSNBC, NYT, Washington Post, NPR
Right-leaning news: Fox News, Breitbart, Daily Wire, NY Post
Center/Balanced: Reuters, AP, BBC, WSJ
Tech news: TechCrunch, The Verge, Wired, HackerNews
Social: Reddit, Twitter/X, YouTube

IP Protection Features:
- Per-platform rate limiting (configurable requests/hour)
- Minimum delay enforcement between requests
- Request counting and quota management
- User agent rotation
- Conservative defaults to prevent home IP bans
- Special protection for high-risk platforms (Twitter)

Database Changes:
- New tables for reddit_post, reddit_comment
- New tables for twitter_tweet
- New tables for youtube_video, youtube_comment
- New tables for hackernews_post, hackernews_comment
- New table for western_news_article

Configuration:
- Updated .env.example with Western platform API credentials
- Added rate limiting configuration options
- Platform-specific delay and quota settings

Documentation:
- Comprehensive WESTERN_MEDIA_SETUP.md guide
- API setup instructions (Reddit, YouTube)
- IP protection best practices
- Usage examples for all platforms
- Troubleshooting guide

Dependencies:
- praw (Reddit API)
- feedparser (RSS feeds)
- google-api-python-client (YouTube API)
- tweepy (Twitter API, optional)
- ntscraper (Twitter scraper)
- ratelimit (rate limiting)
- fake-useragent (user agent rotation)

Testing:
- test_western_crawlers.py for quick validation
- Individual crawler test functions

Note: TikTok US support is planned but not fully implemented yet due to complexity of TikTok's anti-scraping measures.

This implementation prioritizes IP safety for home use with aggressive rate limiting to prevent bans.
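
The IP protection features in the commit message above (a requests/hour quota combined with a minimum delay between requests) can be illustrated with a small sketch. This is not the PR's actual implementation: the class name `PlatformRateLimiter`, its parameters, and the injectable `clock` argument are assumptions made for illustration, and the PR itself lists the `ratelimit` package as its dependency for this purpose.

```python
import time
from collections import deque
from typing import Callable, Deque


class PlatformRateLimiter:
    """Hypothetical limiter: minimum delay plus a rolling one-hour quota."""

    def __init__(self, max_per_hour: int, min_delay_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_per_hour < 1:
            raise ValueError("max_per_hour must be at least 1")
        self.max_per_hour = max_per_hour
        self.min_delay_s = min_delay_s
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._sent: Deque[float] = deque()  # timestamps of requests in the last hour

    def wait_time(self) -> float:
        """Return the seconds to wait before the next request (0.0 if none)."""
        now = self._clock()
        # Drop timestamps that have aged out of the one-hour window.
        while self._sent and now - self._sent[0] >= 3600:
            self._sent.popleft()
        wait = 0.0
        if self._sent:
            # Enforce the minimum gap since the most recent request.
            wait = max(wait, self._sent[-1] + self.min_delay_s - now)
        if len(self._sent) >= self.max_per_hour:
            # Quota exhausted: wait until the oldest request leaves the window.
            wait = max(wait, self._sent[0] + 3600 - now)
        return wait

    def record(self) -> None:
        """Register that a request was just sent."""
        self._sent.append(self._clock())
```

A crawler would keep one limiter per platform, call `time.sleep(limiter.wait_time())` before each request, and call `limiter.record()` right after sending it. High-risk platforms such as Twitter would get a lower `max_per_hour` and a longer `min_delay_s`; the concrete values are left to the configuration.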
Critical Fix:
- ntscraper is non-functional (Nitter instances shut down by Twitter)
- Replaced with twikit (free) and Apify API (paid) options

New Implementation:
- twitter_crawler_v2.py with dual backend support
  * twikit: Free scraping (requires Twitter account, maintenance)
  * Apify API: Paid scraping (~$0.30/1000 tweets, reliable)

Files Added:
- twitter_crawler_v2.py: New working Twitter crawler
- TWITTER_MIGRATION_GUIDE.md: Comprehensive migration guide

Updated:
- requirements.txt: Replace ntscraper with twikit
- Mark ntscraper as deprecated

Recommended Solutions for 100 publishers daily:
1. Brand24 ($49-99/month) - Best overall, multi-platform
2. Apify API (~$10/month) - Good value, reliable
3. twikit (free) - Budget option, requires maintenance
4. Twitter API ($200/month) - Official but expensive

See TWITTER_MIGRATION_GUIDE.md for detailed setup instructions.
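
As a rough check on how the quoted Apify rate (~$0.30 per 1000 tweets) relates to the ~$10/month figure for 100 publishers, the arithmetic can be sketched as below. The function names and the figure of 10 tweets per publisher per day are assumptions for illustration; the commit message does not state the expected volume, and actual pricing may differ.

```python
APIFY_USD_PER_1000_TWEETS = 0.30  # approximate rate quoted in the commit message


def monthly_tweet_volume(publishers: int, tweets_per_publisher_per_day: float,
                         days: int = 30) -> int:
    """Estimate how many tweets are collected in one month (30 days assumed)."""
    return round(publishers * tweets_per_publisher_per_day * days)


def estimated_apify_cost_usd(tweets: int) -> float:
    """Estimate the monthly cost in USD at the quoted per-1000-tweets rate."""
    return round(tweets / 1000 * APIFY_USD_PER_1000_TWEETS, 2)


# Assumed workload: 100 publishers posting about 10 tweets per day each.
volume = monthly_tweet_volume(publishers=100, tweets_per_publisher_per_day=10)
cost = estimated_apify_cost_usd(volume)
print(f"{volume} tweets/month, about ${cost:.2f}")
```

Under these assumptions the volume is 30,000 tweets per month, which costs about $9 and is in line with the ~$10/month estimate above. A heavier posting rate scales the cost linearly.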
Implements industry-standard multi-agent system based on Anthropic's best practices for coordinating specialized agents in Western media monitoring.

Documentation (docs/):
- PROJECT_STRUCTURE.md: Complete system architecture and directory structure
- PROJECT_PLAN.md: 12-week implementation plan with phases and milestones
- AGENT_PERSONAS.md: Detailed personas for 20+ specialized agents
- INTER_AGENT_COMMUNICATION.md: Message protocols and communication patterns
- MULTI_AGENT_SYSTEM_README.md: Comprehensive getting started guide

Agent Framework (agents/):
- shared/base_agent.py: Base class template for all agents
  * Standard lifecycle management (IDLE → READY → WORKING → COMPLETED)
  * Message bus communication (pub/sub pattern)
  * Health monitoring and metrics
  * Error handling and reporting
  * Logging and observability
- platform_agents/reddit_agent.py: Example concrete implementation
  * Monitors political and tech subreddits
  * Integrates with reddit_crawler.py
  * Demonstrates agent personality and expertise
  * Shows task execution patterns
  * Includes health checks

Agent Categories:
1. Coordinator Agents (1): Project Manager, Task Dispatcher, Status Tracker
2. Platform Agents (6): Reddit, Twitter, YouTube, HackerNews, TikTok, RSS News
3. Data Agents (4): Pipeline, Storage, Validation, Deduplication
4. Analysis Agents (3): Sentiment, Topic, Bias
5. Protection Agents (3): Rate Limiter, Health Monitor, Error Recovery
6. QA Agents (2): Test, Monitoring

Key Features:
- Message Bus Architecture: Redis-based pub/sub for agent communication
- Standard Message Formats: JSON schemas for all message types
- Priority System: 1-5 priority levels for message processing
- Circuit Breaker Pattern: Prevents cascading failures
- Rate Limiting: Per-platform IP protection
- Health Monitoring: Real-time agent health checks
- Error Recovery: Automatic retry and recovery strategies

Communication Patterns:
- Request-Response: Synchronous operations with timeout
- Pub-Sub: Event broadcasting to multiple agents
- Task Queue: Distributed work distribution
- Circuit Breaker: Error isolation and recovery

Project Timeline:
- Phase 1 (Weeks 1-3): Foundation and core crawlers ✓ (mostly done)
- Phase 2 (Weeks 4-6): Agent implementation
- Phase 3 (Weeks 7-9): Integration and testing
- Phase 4 (Weeks 10-12): Production deployment

Benefits:
- Scalability: Agents can be scaled independently
- Modularity: Easy to add new platforms/features
- Reliability: Isolated failures, automatic recovery
- Observability: Full visibility into system state
- Maintainability: Clear separation of concerns
- Best Practices: Based on Anthropic's agent guidelines

Next Steps:
1. Implement remaining platform agents (Twitter, YouTube, etc.)
2. Implement data processing agents
3. Set up Redis message bus
4. Create agent coordinator
5. Write integration tests

See docs/MULTI_AGENT_SYSTEM_README.md for getting started guide.
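
The circuit breaker pattern named in the commit message above can be sketched in a few lines. This is a generic illustration of the pattern, not the code in this PR: the names `CircuitBreaker` and `CircuitOpenError`, the default thresholds, and the injectable `clock` are all assumptions.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after repeated failures, retries after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # timeout elapsed: allow one trial call
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("circuit open; skipping call")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A platform agent could wrap each crawler request in `breaker.call(...)` so that a platform returning repeated errors is skipped for `reset_timeout_s` seconds instead of being retried immediately, which keeps one failing platform from stalling the others.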
Labels:
- documentation: Improvements or additions to documentation
- improvement: New feature or request
- size:XXL: This PR changes 1000+ lines, ignoring generated files.
No description provided.