📖 Introduction • 📊 Main Results • 🚀 Getting Started • 📜 Citation
As large language models (LLMs) continue to scale, evaluating their capabilities on comprehensive benchmarks has become computationally expensive. SparseEval is a novel framework that formulates efficient evaluation as a sparse optimization problem.
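Concretely (in our own notation; the paper may formalize it differently), one natural way to pose evaluation as a sparse optimization problem is:

$$
\min_{\mathbf{w}\in\mathbb{R}^{n}} \; \big\lVert M\mathbf{w} - \mathbf{s} \big\rVert_2^2 \quad \text{s.t.} \quad \lVert \mathbf{w} \rVert_0 \le k,
$$

where $M \in \mathbb{R}^{m \times n}$ stacks the scores of $m$ models on $n$ benchmark items, $\mathbf{s}$ is the vector of full-benchmark scores, and the $k$ items with nonzero weight act as anchors.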
Key contributions of SparseEval include:
- Sparsity Discovery: We reveal that evaluation matrices exhibit inherent sparsity, where a small subset of "anchor" items can effectively represent the entire benchmark.
- Anchor Optimization: We introduce a gradient descent-based method to optimize the weights of selected anchor points for accurate performance estimation.
- Task-Aware Refinement: We leverage a proxy model to iteratively refine anchor selection, ensuring high relevance to the downstream task.
Fig. 2: The overall framework of SparseEval.
SparseEval enables efficient benchmarking by selecting the most informative samples, achieving a balance between computational efficiency and evaluation reliability.
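To make the anchor-weight idea concrete, here is a minimal sketch (our own simplification on synthetic data, not the released implementation): given a model × item score matrix and a fixed set of anchor items, plain gradient descent learns weights so that the weighted anchor scores approximate each model's full-benchmark score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic model x item accuracy matrix: 50 "models" scored on 500 items.
M = rng.random((50, 500))
full_scores = M.mean(axis=1)  # ground-truth full-benchmark score per model

anchors = rng.choice(500, size=20, replace=False)  # pre-selected anchor items
A = M[:, anchors]                                  # anchor sub-matrix (50 x 20)

# Learn anchor weights w by gradient descent on the squared estimation error.
w = np.full(len(anchors), 1.0 / len(anchors))
lr = 0.05
for _ in range(2000):
    pred = A @ w
    grad = 2 * A.T @ (pred - full_scores) / len(full_scores)
    w -= lr * grad

mae = np.abs(A @ w - full_scores).mean()
print(f"MAE of anchor-based estimate: {mae:.4f}")
```

The paper's method additionally replaces the linear predictor with an MLP and optimizes the anchor set itself; this sketch only shows the weight-fitting step.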
Our analysis of model-item performance matrices reveals distinct clustering patterns (Evaluation Sparsity). As shown in our studies, items within the same cluster exhibit high similarity in model response patterns, allowing us to select representative anchors to predict performance on the full dataset accurately.
Fig. 1: Motivation of SparseEval.
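The clustering intuition can be sketched as follows (a toy illustration with plain k-means, not the paper's exact procedure): treat each item's response vector across models as a feature vector, cluster the items, take one representative ("anchor") per cluster, and weight it by cluster size.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model x item matrix with planted item clusters: items in the same
# cluster elicit similar response patterns across the 30 models.
centers = rng.random((5, 30))
items = np.repeat(centers, 40, axis=0) + 0.05 * rng.standard_normal((200, 30))
M = items.T                                 # 30 models x 200 items

X = M.T                                     # one row per item
k = 5

# Farthest-point initialization, then plain Lloyd iterations.
cent = [X[0]]
for _ in range(k - 1):
    d = np.min([((X - c) ** 2).sum(1) for c in cent], axis=0)
    cent.append(X[d.argmax()])
cent = np.array(cent)
for _ in range(20):
    d = ((X[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(1)
    cent = np.array([X[labels == j].mean(0) if (labels == j).any() else cent[j]
                     for j in range(k)])

# Anchor = medoid of each cluster; weight = cluster size / total items.
anchors, weights = [], []
for j in range(k):
    idx = np.flatnonzero(labels == j)
    anchors.append(idx[((X[idx] - cent[j]) ** 2).sum(1).argmin()])
    weights.append(len(idx) / len(X))

est = M[:, anchors] @ np.array(weights)     # per-model estimated score
print("estimation MAE:", np.abs(est - M.mean(1)).mean())
```

With clearly clustered items, five anchors already track the full 200-item average closely, which is the sparsity the figure illustrates.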
SparseEval consistently outperforms traditional baseline methods in terms of accuracy-efficiency tradeoffs. As shown in the figure below, our method achieves superior performance across multiple metrics.
Fig. 3: Main Results of SparseEval.
- High Correlation: Maintains a Kendall’s $\tau > 0.9$ with full benchmark scores while using significantly fewer samples.
- Low Estimation Error: Achieves significantly lower Mean Absolute Error (MAE) than baselines.
- Efficiency: Reduces inference costs by orders of magnitude (e.g., evaluating on only 100 instances) without sacrificing ranking consistency.
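The two reported metrics are standard and easy to reproduce; a small sketch (toy numbers, not results from the paper) of how they are computed:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(2)

full = rng.random(30)                        # full-benchmark score per model
est = full + 0.01 * rng.standard_normal(30)  # anchor-based estimate (toy)

tau, _ = kendalltau(est, full)               # ranking consistency
mae = np.abs(est - full).mean()              # estimation error
print(f"Kendall tau = {tau:.3f}, MAE = {mae:.4f}")
```

Kendall's $\tau$ checks that model *rankings* are preserved, while MAE checks that the estimated *scores* themselves are close.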
We demonstrate robustness across various benchmarks including MMLU, ARC, GSM8K, HellaSwag, TruthfulQA, and Winogrande.
Please download the data from Hugging Face:
- preprocess_data (Necessary): Download Link. Contains the processed data files ready for evaluation.
- benchmark_data (Optional): Download Link. Contains the raw, unprocessed prediction files.
Supported Datasets: arc, gsm8k, hellaswag, mmlu, truthfulqa, winogrande.
Place the downloaded folders in the root of the repository. The expected directory structure is:
```
.
├── benchmark_data/   # Raw prediction files (CSV, Optional)
├── preprocess_data/  # Processed data files (Tensor)
└── ...
```
Execute the evaluation methods using the provided scripts. Experimental parameters (number of anchors, learning rates) can be modified within these scripts.
This is the primary method proposed in the paper, utilizing gradient-based optimization with an MLP predictor.
```bash
bash SparseEval/run/gd_cluster_mlp.sh <dataset_name> <num_anchors>
# Example: bash SparseEval/run/gd_cluster_mlp.sh gsm8k 100
```

We provide several baseline methods for comparison:
1. Optimization-based Linear Weighting
Uses gradient descent to optimize weights for interpretable performance prediction.
```bash
bash SparseEval/run/gd_cluster_linear.sh <dataset_name> <num_anchors>
```

2. Anchor Point Selection
Selects representative samples and uses cluster sizes as weights.
```bash
bash SparseEval/run/gd_cluster_anchor_points.sh <dataset_name> <num_anchors>
```

You can view the aggregated results (Error and Tau) using the provided statistics script:
```bash
python SparseEval/stat/stat.py
```

If you find this work helpful, please cite us.
```bibtex
@article{zhang2026sparseeval,
  title={SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization},
  author={Zhang, Taolin and Guo, Hang and Lu, Wang and Dai, Tao and Xia, Shu-Tao and Wang, Jindong},
  journal={arXiv preprint arXiv:2602.07909},
  year={2026}
}
```