Merged
Implements a benchmark for evaluating CascadeFlow on grade school math word problems:
- 10 problems from the GSM8K dataset (varying difficulty)
- Tests the drafter's ability to handle multi-step mathematical reasoning
- Evaluates numerical answer correctness
- Measures cost savings vs. an always-verifier baseline

Problem Categories:
- Easy: single-step arithmetic (4 problems)
- Medium: multi-step reasoning (4 problems)
- Hard: complex multi-step problems (2 problems)

Evaluation Method:
- Extracts the numerical answer from the model response
- Handles multiple answer formats:
  - GSM8K format: "#### 18"
  - Natural language: "The answer is 18"
  - Last number in the text
- Numeric comparison with floating-point tolerance

Key Metrics:
- Accuracy: % of problems solved correctly
- Acceptance rate: % handled by the drafter
- Cost reduction: savings vs. baseline
- Drafter accuracy: correctness when accepted

Research Questions:
1. Can cheaper models handle multi-step math reasoning?
2. What cost savings are possible on math word problems?
3. How does quality scoring perform on math vs. code/text?
4. What is the optimal threshold for mathematical tasks?

Co-Authored-By: Claude <noreply@anthropic.com>
Summary
Implements GSM8K benchmark for evaluating CascadeFlow on grade school math word problems requiring multi-step reasoning. Tests cost-effectiveness and accuracy of cascading on mathematical tasks.
Dataset
GSM8K-10: Subset of 10 representative problems from the GSM8K dataset (8,500 total problems)
Problem Distribution:
- Easy: single-step arithmetic (4 problems)
- Medium: multi-step reasoning (4 problems)
- Hard: complex multi-step problems (2 problems)
Evaluation Method
Answer Extraction: robust extraction from multiple formats:
- GSM8K format: `#### 18`
- Natural language: "The answer is 18"
- Last number in the text

Comparison: numeric comparison with floating-point tolerance (0.01)

Correctness: binary pass/fail based on answer match
Key Metrics
- Accuracy: % of problems solved correctly
- Acceptance rate: % of problems handled by the drafter
- Cost reduction: savings vs. the always-verifier baseline
- Drafter accuracy: correctness on problems where the drafter's answer was accepted
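One way these four metrics could be computed from per-problem records is sketched below. The record field names (`correct`, `accepted`, `cost`, `verifier_cost`) are illustrative assumptions, not the benchmark's actual schema.

```python
def summarize(results: list[dict]) -> dict:
    """Compute benchmark metrics from per-problem records.

    Each record is assumed to hold: 'correct' (bool), 'accepted'
    (bool, True if the drafter's answer was used), 'cost' (float,
    actual spend), and 'verifier_cost' (float, what the
    always-verifier baseline would have paid).
    """
    n = len(results)
    accepted = [r for r in results if r["accepted"]]
    total_cost = sum(r["cost"] for r in results)
    baseline_cost = sum(r["verifier_cost"] for r in results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "acceptance_rate": len(accepted) / n,
        "cost_reduction": 1 - total_cost / baseline_cost,
        "drafter_accuracy": (
            sum(r["correct"] for r in accepted) / len(accepted)
            if accepted else 0.0
        ),
    }
```

Note that drafter accuracy conditions only on accepted problems, so it can diverge sharply from overall accuracy when the acceptance threshold is loose.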
Research Questions
1. Can cheaper models handle multi-step math reasoning?
2. What cost savings are possible on math word problems?
3. How does quality scoring perform on math vs. code/text?
4. What is the optimal threshold for mathematical tasks?
Expected Results
Based on preliminary analysis:
Test Plan
To run the benchmark:
Output includes:
Integration
Uses benchmark framework from PR #80:
- `Benchmark` base class
- `BenchmarkResult` and `BenchmarkSummary`

🤖 Generated with Claude Code
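The shape of that integration might look like the sketch below. The actual interfaces live in the benchmark framework from PR #80; every name, field, and method here is an assumption for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    # Hypothetical per-problem record (fields are assumptions)
    problem_id: str
    correct: bool
    accepted: bool  # True if the drafter's answer was used
    cost: float

@dataclass
class BenchmarkSummary:
    # Hypothetical aggregate record
    accuracy: float
    acceptance_rate: float
    results: list = field(default_factory=list)

class Benchmark:
    """Sketch of a base class: subclasses supply problems and scoring."""

    def run(self, solve) -> BenchmarkSummary:
        results = [self.evaluate(p, solve(p)) for p in self.problems()]
        n = len(results)
        return BenchmarkSummary(
            accuracy=sum(r.correct for r in results) / n,
            acceptance_rate=sum(r.accepted for r in results) / n,
            results=results,
        )

class GSM8KBenchmark(Benchmark):
    def problems(self):
        # Two toy problems standing in for the GSM8K-10 subset
        return [("p1", "2 + 3", 5.0), ("p2", "4 * 4", 16.0)]

    def evaluate(self, problem, answer) -> BenchmarkResult:
        pid, _, expected = problem
        return BenchmarkResult(pid, abs(answer - expected) <= 0.01, True, 0.01)
```

Keeping the GSM8K subclass limited to problem loading and scoring, with aggregation in the base class, is what lets the same framework serve the code and text benchmarks as well.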