Benchmark: GSM8K Math Reasoning#83

Merged
saschabuehrle merged 1 commit into main from feature/benchmark-gsm8k
Nov 20, 2025

Conversation

@saschabuehrle
Collaborator

Summary

Implements a GSM8K benchmark for evaluating CascadeFlow on grade-school math word problems that require multi-step reasoning. It tests the cost-effectiveness and accuracy of cascading on mathematical tasks.

Dataset

GSM8K-10: a subset of 10 representative problems drawn from the full GSM8K dataset (8,500 problems)

Problem Distribution:

  • Easy (4): Single-step arithmetic (e.g., "Janet's ducks")
  • Medium (4): Multi-step reasoning (e.g., percentage calculations, sequential operations)
  • Hard (2): Complex multi-step problems requiring careful tracking (e.g., rate problems)
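As a sketch, the distribution above could be represented as a list of tagged problems. The field names below are assumptions for illustration, not the benchmark's actual schema:

```python
# Illustrative shape of the GSM8K-10 subset; field names are assumed.
GSM8K_10 = [
    {
        "question": "Janet's ducks lay 16 eggs per day...",  # truncated
        "difficulty": "easy",
        "answer": 18.0,  # gold numerical answer
    },
    # ...9 more problems: 3 easy, 4 medium, 2 hard
]

def by_difficulty(problems):
    """Group problems by their difficulty tag."""
    groups = {}
    for p in problems:
        groups.setdefault(p["difficulty"], []).append(p)
    return groups
```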

Evaluation Method

  1. Answer Extraction: Robust extraction from multiple formats

    • GSM8K standard format: #### 18
    • Natural language: "The answer is 18"
    • Fallback: Last number in response
  2. Comparison: Numeric comparison with floating point tolerance (0.01)

  3. Correctness: Binary pass/fail based on answer match
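The three-tier extraction plus tolerance comparison described above might look like the following sketch (the regexes and helper names are illustrative, not the benchmark's actual code):

```python
import re

def extract_answer(text):
    """Extract a numeric answer, trying formats in priority order."""
    # 1. GSM8K standard format: "#### 18"
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return float(m.group(1).replace(",", ""))
    # 2. Natural language: "The answer is 18"
    m = re.search(r"answer is\s*(-?[\d,]+(?:\.\d+)?)", text, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(",", ""))
    # 3. Fallback: last number anywhere in the response
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(nums[-1]) if nums else None

def is_correct(predicted, gold, tol=0.01):
    """Binary pass/fail: numeric match within floating-point tolerance."""
    return predicted is not None and abs(predicted - gold) <= tol
```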

Key Metrics

  • Accuracy: % of problems with correct numerical answer
  • Acceptance Rate: % handled by drafter
  • Cost Reduction: Savings vs. always-verifier baseline
  • Drafter Accuracy: Correctness when drafter accepted
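The four metrics could be computed from per-problem run records roughly as follows (the record fields `correct`, `accepted`, and `cost` are assumptions for this sketch):

```python
def summarize(runs, baseline_cost_per_call):
    """Compute accuracy, acceptance rate, cost reduction, drafter accuracy."""
    n = len(runs)
    accepted = [r for r in runs if r["accepted"]]  # drafter answer was used
    total_cost = sum(r["cost"] for r in runs)
    baseline = baseline_cost_per_call * n  # always-verifier baseline
    return {
        "accuracy": sum(r["correct"] for r in runs) / n,
        "acceptance_rate": len(accepted) / n,
        "cost_reduction": 1 - total_cost / baseline,
        "drafter_accuracy": (
            sum(r["correct"] for r in accepted) / len(accepted)
            if accepted else 0.0
        ),
    }
```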

Research Questions

  1. Can cheaper models (gpt-4o-mini) handle multi-step math reasoning?
  2. What cost savings are possible on math word problems?
  3. How does quality scoring perform on math vs. code/text tasks?
  4. What is the optimal quality threshold for mathematical reasoning?
  5. Does mathematical reasoning benefit from cascade pattern?

Expected Results

Based on preliminary analysis:

  • Accuracy: 80-90% (math reasoning is challenging)
  • Drafter Acceptance: 50-60% (medium problems escalate more)
  • Cost Reduction: 40-50% (less than code due to lower acceptance)
  • Quality Threshold: the default of 0.7 may need adjustment for math

Test Plan

To run the benchmark:

cd tests/benchmarks
export OPENAI_API_KEY="your-key"
python3 gsm8k.py

Output includes:

  • Correctness per problem
  • Acceptance vs. escalation breakdown
  • Cost analysis with ROI
  • Performance metrics
  • Findings on math reasoning challenges

Integration

Uses benchmark framework from PR #80:

  • Inherits from Benchmark base class
  • Uses BenchmarkResult and BenchmarkSummary
  • Compatible with reporting and visualization
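A hypothetical sketch of how the benchmark could plug into that framework. The `Benchmark` and `BenchmarkResult` shapes below are assumed stand-ins for illustration, not the actual PR #80 classes:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkResult:  # assumed shape, not the real class
    problem_id: str
    correct: bool
    accepted: bool
    cost: float

class Benchmark:  # stand-in for the PR #80 base class
    name = "base"
    def run(self):
        raise NotImplementedError

class GSM8KBenchmark(Benchmark):
    name = "gsm8k"

    def __init__(self, problems, solver):
        self.problems = problems
        # solver: callable question -> (answer, accepted, cost)
        self.solver = solver

    def run(self):
        results = []
        for i, p in enumerate(self.problems):
            answer, accepted, cost = self.solver(p["question"])
            results.append(BenchmarkResult(
                problem_id=f"gsm8k-{i}",
                correct=abs(answer - p["answer"]) <= 0.01,
                accepted=accepted,
                cost=cost,
            ))
        return results
```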

🤖 Generated with Claude Code

saschabuehrle merged commit 9720333 into main Nov 20, 2025
19 checks passed
saschabuehrle deleted the feature/benchmark-gsm8k branch November 20, 2025 20:21
