Skip to content

Add HackerNews support and improve data quality metrics#12

Closed
ARJ999 wants to merge 2 commits intomvanhorn:mainfrom
ARJ999:claude/understand-repo-skills-ecQOY
Closed

Add HackerNews support and improve data quality metrics#12
ARJ999 wants to merge 2 commits intomvanhorn:mainfrom
ARJ999:claude/understand-repo-skills-ecQOY

Conversation

@ARJ999
Copy link
Copy Markdown

@ARJ999 ARJ999 commented Jan 29, 2026

Summary

This PR adds HackerNews as a new data source for research reports and introduces comprehensive data quality metrics to provide transparency about data verification and freshness. It also includes performance improvements for Reddit enrichment and refinements to scoring algorithms.

Key Changes

New Features

  • HackerNews Integration: Added scripts/lib/hn.py module to search HackerNews via Algolia API with configurable depth levels (quick/default/deep)
  • Data Quality Metrics: New DataQuality schema class that tracks verified dates, verified engagement, average recency, and available/failed sources
  • HNItem Schema: New HNItem dataclass for normalized HackerNews items with engagement metrics (points, comments)

Performance & Reliability

  • Parallel Reddit Enrichment: Converted sequential Reddit enrichment to use ThreadPoolExecutor with 5 workers for better throughput while maintaining per-item error handling
  • Model Cache Optimization: Reduced model cache TTL from 7 days to 1 day to ensure latest models are always used
  • Engagement Verification: Added engagement_verified flag to track whether engagement data comes from actual API responses vs. estimates

Scoring & Ranking Improvements

  • Exponential Recency Bias: Replaced linear recency scoring with tiered exponential scoring that strongly prioritizes content from the last 3 days (premium tier: 90-100) while still valuing week-old content (high tier: 75-89)
  • Quality-Focused Engagement Weights:
    • Reddit: Increased upvote ratio weight (30%) to reward community agreement
    • X: Prioritized reposts (40%) as a deep engagement signal
    • HN: New formula with 60% points + 40% comments
  • Verified Data Bonus: Added +8 point bonus for items with verified engagement data from APIs
  • Unverified Penalty: Increased penalty for unknown engagement from 10 to 15 points

Data Validation

  • Reddit Engagement Validation: New is_engagement_valid() function to detect deleted posts and invalid metrics
  • Stricter Date Requirements: Updated Reddit search prompt to enforce date filtering and reject pre-filtered old content
  • HN Date Handling: Proper Unix timestamp conversion with timezone awareness

UI & Reporting

  • Data Quality Section: New section in compact render showing total items, verification percentages, average recency, and source availability
  • HackerNews Display: Dedicated section showing HN items with points, comments, and discussion links
  • Enhanced Error Tracking: Separate error fields for each source (reddit_error, x_error, hn_error, web_error)

Model Selection

  • GPT-5 Series Priority: Updated fallback models to prefer latest GPT-5.5, GPT-5.3, GPT-5.2 over older versions
  • Grok Model Improvements: Changed xAI latest from grok-4-1-fast to grok-4-1 for higher quality, with proper x_search capability detection

Implementation Details

  • HN search uses Algolia's free API (no authentication required)
  • Engagement metrics are computed using logarithmic scaling for better distribution across different platforms
  • Data quality metrics are computed server-side after all sources complete
  • Thread pool executor properly handles futures with as_completed() for progress updates
  • All new fields are backward compatible with existing report serialization

Testing

  • Updated test expectations for exponential recency scoring
  • Updated model selection tests to reflect new GPT-5 and Grok preferences
  • Maintained backward compatibility with existing report deserialization

https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27

This commit implements comprehensive enhancements for highest quality research:

## Model Selection
- Reduced model cache TTL from 7 days to 1 day for faster model updates
- Extended fallback chain to include GPT-5.5, GPT-5.3 series
- Dynamic xAI model selection with version parsing

## Scoring Algorithm
- Exponential recency scoring: strongly prioritizes fresh content (days 0-3 = premium)
- Quality-focused engagement formulas:
  - Reddit: 45% score + 30% upvote_ratio + 25% comments (ratio = agreement signal)
  - X: 40% reposts + 35% likes + 15% replies + 10% quotes (reposts = deep engagement)
- Added engagement verification bonus (+8) for verified data
- Reduced default engagement from 35 to 20, increased penalty to 15

## Date Filtering
- Strict 30-day enforcement in Reddit prompt
- Hard date validation with explicit from_date/to_date constraints

## Engagement Validation
- Added engagement_verified flag to Reddit and X items
- Validation checks for reasonable engagement ranges
- Detection of deleted posts (comments > 0 but score = 0)

## Performance
- Parallelized Reddit enrichment (5 workers) - 15s → 3s improvement

## New Source: HackerNews
- Added hn.py with Algolia API integration (free, no auth required)
- HNItem schema with points/comments engagement
- Normalized scoring formula: 60% points + 40% comments

## Data Quality Metrics
- DataQuality tracking: verified dates %, engagement %, avg recency
- Sources available/failed tracking
- Quality metrics displayed in compact output

## Tests
- Updated tests for exponential recency scoring
- Updated tests for new xAI model aliases
- All 95 tests passing

https://claude.ai/code/session_01YNFJJXPx5fDWRdfHAz7k27
@ARJ999 ARJ999 closed this Jan 29, 2026
@ARJ999 ARJ999 deleted the claude/understand-repo-skills-ecQOY branch January 29, 2026 02:28
mvanhorn added a commit that referenced this pull request Feb 28, 2026
Shoutout to @ARJ999 (first HN submission, PR #12), @wkbaran
(PR #26, referenced in planning), and @gbessoni for endorsing
HN as the right addition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zl190 pushed a commit to zl190/last30days-skill that referenced this pull request Apr 4, 2026
Shoutout to @ARJ999 (first HN submission, PR mvanhorn#12), @wkbaran
(PR mvanhorn#26, referenced in planning), and @gbessoni for endorsing
HN as the right addition.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants