Add HackerNews support and improve data quality metrics#12
Closed
ARJ999 wants to merge 2 commits into mvanhorn:main from
This commit implements comprehensive enhancements for highest quality research:

## Model Selection
- Reduced model cache TTL from 7 days to 1 day for faster model updates
- Extended fallback chain to include GPT-5.5, GPT-5.3 series
- Dynamic xAI model selection with version parsing

## Scoring Algorithm
- Exponential recency scoring: strongly prioritizes fresh content (days 0-3 = premium)
- Quality-focused engagement formulas:
  - Reddit: 45% score + 30% upvote_ratio + 25% comments (ratio = agreement signal)
  - X: 40% reposts + 35% likes + 15% replies + 10% quotes (reposts = deep engagement)
- Added engagement verification bonus (+8) for verified data
- Reduced default engagement from 35 to 20, increased penalty to 15

## Date Filtering
- Strict 30-day enforcement in Reddit prompt
- Hard date validation with explicit from_date/to_date constraints

## Engagement Validation
- Added engagement_verified flag to Reddit and X items
- Validation checks for reasonable engagement ranges
- Detection of deleted posts (comments > 0 but score = 0)

## Performance
- Parallelized Reddit enrichment (5 workers) - 15s → 3s improvement

## New Source: HackerNews
- Added hn.py with Algolia API integration (free, no auth required)
- HNItem schema with points/comments engagement
- Normalized scoring formula: 60% points + 40% comments

## Data Quality Metrics
- DataQuality tracking: verified dates %, engagement %, avg recency
- Sources available/failed tracking
- Quality metrics displayed in compact output

## Tests
- Updated tests for exponential recency scoring
- Updated tests for new xAI model aliases
- All 95 tests passing

https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27
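For readers skimming the scoring changes, here is a minimal sketch of how the weighted engagement formulas and exponential recency scoring could look. It is not the PR's actual code. The function names, the 0-100 scale, the `half_life_days` parameter, and the assumption that every input is already normalized to the range 0 to 1 are all illustrative. Only the weights come from the commit message.

```python
import math


def recency_score(age_days: float, half_life_days: float = 7.0) -> float:
    """Exponential decay on a 0-100 scale.

    half_life_days is an assumed tuning knob, not a value from the PR.
    """
    age = max(0.0, age_days)  # treat future-dated items as brand new
    return 100.0 * math.exp(-math.log(2) * age / half_life_days)


def reddit_engagement(score: float, upvote_ratio: float, comments: float) -> float:
    """45% score + 30% upvote_ratio + 25% comments (inputs assumed in [0, 1])."""
    return 100.0 * (0.45 * score + 0.30 * upvote_ratio + 0.25 * comments)


def x_engagement(reposts: float, likes: float, replies: float, quotes: float) -> float:
    """40% reposts + 35% likes + 15% replies + 10% quotes (inputs assumed in [0, 1])."""
    return 100.0 * (0.40 * reposts + 0.35 * likes + 0.15 * replies + 0.10 * quotes)


def hn_engagement(points: float, comments: float) -> float:
    """60% points + 40% comments (inputs assumed in [0, 1])."""
    return 100.0 * (0.60 * points + 0.40 * comments)
```

With these assumptions a post loses half its recency score every `half_life_days`, so content from the first few days dominates the ranking, which matches the stated intent of prioritizing days 0-3.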
mvanhorn added a commit that referenced this pull request Feb 28, 2026
zl190 pushed a commit to zl190/last30days-skill that referenced this pull request Apr 4, 2026
Shoutout to @ARJ999 (first HN submission, PR mvanhorn#12), @wkbaran (PR mvanhorn#26, referenced in planning), and @gbessoni for endorsing HN as the right addition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
This PR adds HackerNews as a new data source for research reports and introduces comprehensive data quality metrics to provide transparency about data verification and freshness. It also includes performance improvements for Reddit enrichment and refinements to scoring algorithms.
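As a rough illustration of what the data quality metrics might compute, the sketch below aggregates per-item verification flags into percentages and an average recency. The `Item` class, the field names on `DataQuality`, and the `summarize` helper are hypothetical. The PR's actual schema may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Item:
    """Hypothetical stand-in for a normalized Reddit/X/HN item."""
    date_verified: bool
    engagement_verified: bool
    age_days: Optional[float] = None


@dataclass
class DataQuality:
    """Assumed shape: field names are illustrative, not taken from the PR."""
    verified_dates_pct: float
    verified_engagement_pct: float
    avg_recency_days: Optional[float]
    sources_available: List[str] = field(default_factory=list)
    sources_failed: List[str] = field(default_factory=list)


def summarize(items: Sequence[Item], available: Sequence[str],
              failed: Sequence[str]) -> DataQuality:
    """Aggregate per-item flags into report-level quality metrics."""
    n = len(items)
    if n == 0:
        # No items: report zero coverage rather than dividing by zero.
        return DataQuality(0.0, 0.0, None, list(available), list(failed))
    ages = [i.age_days for i in items if i.age_days is not None]
    return DataQuality(
        verified_dates_pct=100.0 * sum(i.date_verified for i in items) / n,
        verified_engagement_pct=100.0 * sum(i.engagement_verified for i in items) / n,
        avg_recency_days=sum(ages) / len(ages) if ages else None,
        sources_available=list(available),
        sources_failed=list(failed),
    )
```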
Key Changes
New Features
- `scripts/lib/hn.py` module to search HackerNews via Algolia API with configurable depth levels (quick/default/deep)
- `DataQuality` schema class that tracks verified dates, verified engagement, average recency, and available/failed sources
- `HNItem` dataclass for normalized HackerNews items with engagement metrics (points, comments)
Performance & Reliability
- `ThreadPoolExecutor` with 5 workers for better throughput while maintaining per-item error handling
- `engagement_verified` flag to track whether engagement data comes from actual API responses vs. estimates
Scoring & Ranking Improvements
Data Validation
- `is_engagement_valid()` function to detect deleted posts and invalid metrics
UI & Reporting
Model Selection
- `grok-4-1-fast` to `grok-4-1` for higher quality, with proper x_search capability detection
Implementation Details
- `as_completed()` for progress updates
Testing
https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27
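The parallel enrichment and engagement validation described in this PR could be sketched roughly as follows. This is an illustration under assumptions, not the PR's code. `enrich_all` and its `enrich_one` callback are hypothetical names, and `is_engagement_valid` here only implements the deleted-post pattern from the commit message (comments > 0 but score = 0) plus assumed checks for missing or negative values.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def is_engagement_valid(score: Optional[int], num_comments: Optional[int]) -> bool:
    """Reject missing, negative, or deleted-post-like engagement numbers."""
    if score is None or num_comments is None:
        return False
    if score < 0 or num_comments < 0:  # assumed range check
        return False
    if num_comments > 0 and score == 0:  # deleted-post pattern from the PR
        return False
    return True


def enrich_all(items: Sequence[T], enrich_one: Callable[[T], T],
               max_workers: int = 5) -> List[T]:
    """Enrich items concurrently; a failure affects only its own item."""
    results = list(items)  # fall back to the unenriched item on error
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(enrich_one, item): idx
                   for idx, item in enumerate(items)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                results[idx] = future.result()
            except Exception:
                # Per-item error handling: keep the original item.
                pass
    return results
```

Mapping each future back to its index preserves the input order even though `as_completed()` yields results in completion order, and that loop is also a natural place to emit progress updates.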