BenchPress Predicts LLM Performance with Fewer Evaluations

Yuchen Zeng, Dimitris Papailiopoulos · June 24, 2026

Summary

Researchers found that large language model performance across many benchmarks is largely explained by just two underlying factors. Building on that finding, they developed BenchPress, a logit-space rank-2 matrix completion method that predicts held-out scores from a small subset of benchmarks and so reduces evaluation costs.

Evaluating large language models (LLMs) typically means running them across dozens, sometimes hundreds, of benchmarks. The process consumes significant compute and time, both during development and for final model releases.

A new study analyzed a public score matrix of 84 frontier models across 133 benchmarks and found a surprising underlying structure: a model's performance across all these evaluations can be largely explained by just two latent factors. In other words, a model's full scorecard is, to a good approximation, determined by two numbers.

Building on this insight, the researchers developed BenchPress, a logit-space rank-2 matrix completion method. BenchPress recovers held-out scores to within 4.6 points and provides a confidence layer for its predictions. It identified a minimal subset of five benchmarks (e.g., GPQA-D, HLE, Codeforces, MMLU-Pro, ARC-AGI-1) that predicts a model's full public scorecard to within 3.93 points, and an even cheaper set for tighter inference budgets. The tool is meant to cut the computational burden of LLM evaluation substantially.
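To make the phrase "logit-space rank-2 matrix completion" concrete, here is a minimal sketch in Python. It is not the authors' implementation: the alternating ridge least-squares solver, the function name `complete_rank2`, and the `ridge`, `iters`, and `eps` settings are all assumptions chosen for illustration. Scores are assumed to be percentages in [0, 100], with NaN marking entries that were never measured.

```python
import numpy as np


def logit(p, eps=1e-3):
    # Clip so that scores of exactly 0 or 100 do not map to infinity.
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def complete_rank2(scores, rank=2, ridge=0.1, iters=200, seed=0):
    """Fill in a (models x benchmarks) score matrix with a low-rank fit.

    Illustrative sketch only (assumed solver, not BenchPress itself):
    alternating ridge least squares on the observed entries, in logit space.
    """
    scores = np.asarray(scores, dtype=float)
    mask = ~np.isnan(scores)
    # Missing cells get a placeholder; they are never used in the fit.
    Z = logit(np.where(mask, scores, 50.0) / 100.0)

    rng = np.random.default_rng(seed)
    n, m = scores.shape
    U = rng.normal(scale=0.1, size=(n, rank))  # per-model factors
    V = rng.normal(scale=0.1, size=(m, rank))  # per-benchmark factors
    reg = ridge * np.eye(rank)

    for _ in range(iters):
        # Update each model's factors from the benchmarks it was run on.
        for i in range(n):
            obs = mask[i]
            if obs.any():
                A = V[obs]
                U[i] = np.linalg.solve(A.T @ A + reg, A.T @ Z[i, obs])
        # Update each benchmark's factors from the models that ran it.
        for j in range(m):
            obs = mask[:, j]
            if obs.any():
                A = U[obs]
                V[j] = np.linalg.solve(A.T @ A + reg, A.T @ Z[obs, j])

    # Map the rank-2 logit reconstruction back to the 0-100 score scale.
    return 100.0 * sigmoid(U @ V.T)
```

Calling `complete_rank2(scores)` on a matrix with NaN gaps returns a dense array of the same shape. Working in logit space keeps every prediction strictly between 0 and 100, which a plain linear low-rank fit would not guarantee.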

Why it matters

If a handful of benchmarks can stand in for a full evaluation suite, AI developers and researchers can predict LLM performance with far fewer evaluations, saving substantial time and computational resources.

How to implement this in your domain

  1. Utilize BenchPress to streamline your LLM evaluation pipeline, reducing the number of benchmarks run.
  2. Identify the most informative subset of benchmarks for your specific model development goals.
  3. Integrate BenchPress predictions into your model tracking and checkpoint selection processes.
  4. Allocate saved computational resources to other critical development or research areas.
  5. Contribute to or leverage the public score matrix and tools for broader model comparison.
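Steps 1 through 3 above could look roughly like the following sketch in Python, assuming a rank-2 model has already been fitted so that each benchmark has a two-number factor vector. The function `predict_scorecard`, the factor matrix `V`, and the `ridge` value are hypothetical names and settings for illustration, not BenchPress's actual interface.

```python
import numpy as np


def predict_scorecard(V, subset_idx, subset_scores, ridge=0.1, eps=1e-3):
    """Predict a model's scores on all benchmarks from a measured subset.

    V             : (benchmarks x 2) factor matrix from an earlier fit (assumed given)
    subset_idx    : indices of the benchmarks actually run on the new model
    subset_scores : the measured scores on those benchmarks, in [0, 100]
    """
    V = np.asarray(V, dtype=float)
    # Move the measured scores into logit space, clipping away 0 and 100.
    p = np.clip(np.asarray(subset_scores, dtype=float) / 100.0, eps, 1 - eps)
    z = np.log(p / (1 - p))

    # Ridge least squares for the new model's two latent coordinates.
    A = V[subset_idx]
    u = np.linalg.solve(A.T @ A + ridge * np.eye(V.shape[1]), A.T @ z)

    # Project onto every benchmark and map back to the 0-100 scale.
    return 100.0 / (1.0 + np.exp(-(V @ u)))
```

For checkpoint selection, the same call can be repeated for each checkpoint using only the subset scores, and the predicted scorecards compared. Because only two coordinates are estimated per model, even a five-benchmark subset leaves the small least-squares problem overdetermined.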

Who benefits

AI/ML Development, Cloud Computing, Research & Development, Software Testing, Data Science

Key takeaways

  • LLM performance across many benchmarks is largely explained by two underlying factors.
  • BenchPress is a method to predict full LLM scorecards from a small subset of evaluations.
  • It significantly reduces the computational cost and time of model evaluation.
  • A minimal set of 5-6 benchmarks can accurately predict broader model performance.

Original post by Yuchen Zeng, Dimitris Papailiopoulos

"arXiv:2606.24020v1 Announce Type: new Abstract: A modern model release reports scores on 40+ benchmarks and the same evaluations were run many more times before it: to track training progress, compare design choices, and select the checkpoint for the release. But do we need to ru…"


Originally posted by Yuchen Zeng, Dimitris Papailiopoulos on X
