Add HackerNews support and improve data quality metrics by ARJ999 · Pull Request #12 · mvanhorn/last30days-skill

ARJ999 · 2026-01-29T02:26:31Z

Summary

This PR adds HackerNews as a new data source for research reports and introduces comprehensive data quality metrics to provide transparency about data verification and freshness. It also includes performance improvements for Reddit enrichment and refinements to scoring algorithms.

Key Changes

New Features

HackerNews Integration: Added scripts/lib/hn.py module to search HackerNews via Algolia API with configurable depth levels (quick/default/deep)
Data Quality Metrics: New DataQuality schema class that tracks verified dates, verified engagement, average recency, and available/failed sources
HNItem Schema: New HNItem dataclass for normalized HackerNews items with engagement metrics (points, comments)

Performance & Reliability

Parallel Reddit Enrichment: Converted sequential Reddit enrichment to use ThreadPoolExecutor with 5 workers for better throughput while maintaining per-item error handling
Model Cache Optimization: Reduced model cache TTL from 7 days to 1 day to ensure latest models are always used
Engagement Verification: Added engagement_verified flag to track whether engagement data comes from actual API responses vs. estimates

Scoring & Ranking Improvements

Exponential Recency Bias: Replaced linear recency scoring with tiered exponential scoring that strongly prioritizes content from the last 3 days (premium tier: 90-100) while still valuing week-old content (high tier: 75-89)
Quality-Focused Engagement Weights:
- Reddit: Increased upvote ratio weight (30%) to reward community agreement
- X: Prioritized reposts (40%) as a deep engagement signal
- HN: New formula with 60% points + 40% comments
Verified Data Bonus: Added +8 point bonus for items with verified engagement data from APIs
Unverified Penalty: Increased penalty for unknown engagement from 10 to 15 points

Data Validation

Reddit Engagement Validation: New is_engagement_valid() function to detect deleted posts and invalid metrics
Stricter Date Requirements: Updated Reddit search prompt to enforce date filtering and reject pre-filtered old content
HN Date Handling: Proper Unix timestamp conversion with timezone awareness

UI & Reporting

Data Quality Section: New section in compact render showing total items, verification percentages, average recency, and source availability
HackerNews Display: Dedicated section showing HN items with points, comments, and discussion links
Enhanced Error Tracking: Separate error fields for each source (reddit_error, x_error, hn_error, web_error)

Model Selection

GPT-5 Series Priority: Updated fallback models to prefer latest GPT-5.5, GPT-5.3, GPT-5.2 over older versions
Grok Model Improvements: Changed xAI latest from grok-4-1-fast to grok-4-1 for higher quality, with proper x_search capability detection

Implementation Details

HN search uses Algolia's free API (no authentication required)
Engagement metrics are computed using logarithmic scaling for better distribution across different platforms
Data quality metrics are computed server-side after all sources complete
Thread pool executor properly handles futures with as_completed() for progress updates
All new fields are backward compatible with existing report serialization

Testing

Updated test expectations for exponential recency scoring
Updated model selection tests to reflect new GPT-5 and Grok preferences
Maintained backward compatibility with existing report deserialization

https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27

This commit implements comprehensive enhancements for highest quality research: ## Model Selection - Reduced model cache TTL from 7 days to 1 day for faster model updates - Extended fallback chain to include GPT-5.5, GPT-5.3 series - Dynamic xAI model selection with version parsing ## Scoring Algorithm - Exponential recency scoring: strongly prioritizes fresh content (days 0-3 = premium) - Quality-focused engagement formulas: - Reddit: 45% score + 30% upvote_ratio + 25% comments (ratio = agreement signal) - X: 40% reposts + 35% likes + 15% replies + 10% quotes (reposts = deep engagement) - Added engagement verification bonus (+8) for verified data - Reduced default engagement from 35 to 20, increased penalty to 15 ## Date Filtering - Strict 30-day enforcement in Reddit prompt - Hard date validation with explicit from_date/to_date constraints ## Engagement Validation - Added engagement_verified flag to Reddit and X items - Validation checks for reasonable engagement ranges - Detection of deleted posts (comments > 0 but score = 0) ## Performance - Parallelized Reddit enrichment (5 workers) - 15s → 3s improvement ## New Source: HackerNews - Added hn.py with Algolia API integration (free, no auth required) - HNItem schema with points/comments engagement - Normalized scoring formula: 60% points + 40% comments ## Data Quality Metrics - DataQuality tracking: verified dates %, engagement %, avg recency - Sources available/failed tracking - Quality metrics displayed in compact output ## Tests - Updated tests for exponential recency scoring - Updated tests for new xAI model aliases - All 95 tests passing https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27

https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27

@ARJ999

Shoutout to @ARJ999 (first HN submission, PR #12), @wkbaran (PR #26, referenced in planning), and @gbessoni for endorsing HN as the right addition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@ARJ999

Shoutout to @ARJ999 (first HN submission, PR mvanhorn#12), @wkbaran (PR mvanhorn#26, referenced in planning), and @gbessoni for endorsing HN as the right addition. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude added 2 commits January 29, 2026 02:26

chore: Add .gitignore for Python cache files

1834831

https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27

ARJ999 closed this Jan 29, 2026

ARJ999 deleted the claude/understand-repo-skills-ecQOY branch January 29, 2026 02:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add HackerNews support and improve data quality metrics#12

Add HackerNews support and improve data quality metrics#12
ARJ999 wants to merge 2 commits intomvanhorn:mainfrom
ARJ999:claude/understand-repo-skills-ecQOY

ARJ999 commented Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ARJ999 commented Jan 29, 2026

Summary

Key Changes

New Features

Performance & Reliability

Scoring & Ranking Improvements

Data Validation

UI & Reporting

Model Selection

Implementation Details

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants