Benchmark: GSM8K Math Reasoning#83

Merged
saschabuehrle merged 1 commit into main from feature/benchmark-gsm8k
Nov 20, 2025

Conversation

@saschabuehrle
Collaborator

Summary

Implements a GSM8K benchmark for evaluating CascadeFlow on grade-school math word problems that require multi-step reasoning. It tests the cost-effectiveness and accuracy of cascading on mathematical tasks.

Dataset

GSM8K-10: a subset of 10 representative problems drawn from the full GSM8K dataset (8,500 problems)

Problem Distribution:

  • Easy (4): Single-step arithmetic (e.g., "Janet's ducks")
  • Medium (4): Multi-step reasoning (e.g., percentage calculations, sequential operations)
  • Hard (2): Complex multi-step problems requiring careful tracking (e.g., rate problems)
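As a sketch, the distribution above could be represented as a list of tagged problems. The field names below are assumptions for illustration, not the benchmark's actual schema:

```python
# Illustrative shape of the GSM8K-10 subset; field names are assumed.
GSM8K_10 = [
    {
        "question": "Janet's ducks lay 16 eggs per day...",  # truncated
        "difficulty": "easy",
        "answer": 18.0,  # gold numerical answer
    },
    # ...9 more problems: 3 easy, 4 medium, 2 hard
]

def by_difficulty(problems):
    """Group problems by their difficulty tag."""
    groups = {}
    for p in problems:
        groups.setdefault(p["difficulty"], []).append(p)
    return groups
```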

Evaluation Method

  1. Answer Extraction: Robust extraction from multiple formats

    • GSM8K standard format: #### 18
    • Natural language: "The answer is 18"
    • Fallback: Last number in response
  2. Comparison: Numeric comparison with floating point tolerance (0.01)

  3. Correctness: Binary pass/fail based on answer match
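The three-tier extraction plus tolerance comparison described above might look like the following sketch (the regexes and helper names are illustrative, not the benchmark's actual code):

```python
import re

def extract_answer(text):
    """Extract a numeric answer, trying formats in priority order."""
    # 1. GSM8K standard format: "#### 18"
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return float(m.group(1).replace(",", ""))
    # 2. Natural language: "The answer is 18"
    m = re.search(r"answer is\s*(-?[\d,]+(?:\.\d+)?)", text, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(",", ""))
    # 3. Fallback: last number anywhere in the response
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def is_correct(predicted, gold, tol=0.01):
    """Binary pass/fail: numeric match within floating-point tolerance."""
    return predicted is not None and abs(predicted - gold) <= tol
```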

Key Metrics

  • Accuracy: % of problems with correct numerical answer
  • Acceptance Rate: % handled by drafter
  • Cost Reduction: Savings vs. always-verifier baseline
  • Drafter Accuracy: Correctness when drafter accepted
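The four metrics could be computed from per-problem run records roughly as follows (the record fields `correct`, `accepted`, and `cost` are assumptions for this sketch):

```python
def summarize(runs, baseline_cost_per_call):
    """Compute accuracy, acceptance rate, cost reduction, drafter accuracy."""
    n = len(runs)
    accepted = [r for r in runs if r["accepted"]]  # drafter answer was used
    total_cost = sum(r["cost"] for r in runs)
    baseline = baseline_cost_per_call * n  # always-verifier baseline
    return {
        "accuracy": sum(r["correct"] for r in runs) / n,
        "acceptance_rate": len(accepted) / n,
        "cost_reduction": 1 - total_cost / baseline,
        "drafter_accuracy": (
            sum(r["correct"] for r in accepted) / len(accepted)
            if accepted else 0.0
        ),
    }
```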

Research Questions

  1. Can cheaper models (gpt-4o-mini) handle multi-step math reasoning?
  2. What cost savings are possible on math word problems?
  3. How does quality scoring perform on math vs. code/text tasks?
  4. What is the optimal quality threshold for mathematical reasoning?
  5. Does mathematical reasoning benefit from cascade pattern?

Expected Results

Based on preliminary analysis:

  • Accuracy: 80-90% (math reasoning is challenging)
  • Drafter Acceptance: 50-60% (medium problems escalate more)
  • Cost Reduction: 40-50% (less than code due to lower acceptance)
  • Quality Threshold: the default of 0.7 may need adjustment for math

Test Plan

To run the benchmark:

cd tests/benchmarks
export OPENAI_API_KEY="your-key"
python3 gsm8k.py

Output includes:

  • Correctness per problem
  • Acceptance vs. escalation breakdown
  • Cost analysis with ROI
  • Performance metrics
  • Findings on math reasoning challenges

Integration

Uses benchmark framework from PR #80:

  • Inherits from Benchmark base class
  • Uses BenchmarkResult and BenchmarkSummary
  • Compatible with reporting and visualization
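A hypothetical sketch of how the benchmark could plug into that framework. The `Benchmark` and `BenchmarkResult` shapes below are assumed stand-ins for illustration, not the actual PR #80 classes:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:  # assumed shape, not the real class
    problem_id: str
    correct: bool
    accepted: bool
    cost: float

class Benchmark:  # stand-in for the PR #80 base class
    name = "base"
    def run(self):
        raise NotImplementedError

class GSM8KBenchmark(Benchmark):
    name = "gsm8k"

    def __init__(self, problems, solver):
        self.problems = problems
        # solver: callable question -> (answer, accepted, cost)
        self.solver = solver

    def run(self):
        results = []
        for i, p in enumerate(self.problems):
            answer, accepted, cost = self.solver(p["question"])
            results.append(BenchmarkResult(
                problem_id=f"gsm8k-{i}",
                correct=abs(answer - p["answer"]) <= 0.01,
                accepted=accepted,
                cost=cost,
            ))
        return results
```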

🤖 Generated with Claude Code

saschabuehrle merged commit 9720333 into main Nov 20, 2025
19 checks passed
saschabuehrle deleted the feature/benchmark-gsm8k branch November 20, 2025 20:21
