New Standard Schema and Repository for AI Evaluation Results Launched

Jan Batzner, Sree Harsha Nelaturu, Anastassia Kornilova, Jon Crall, Tommaso Cerruti, Yanan Long, Yifan Mai, Sanchit Ahuja, Asaf Yehudai, Marek Šuppa, John P. Lalor, Oluwagbemike Olowe, Jatin Ganhotra, Brian H. Hu, Eliya Habba, Andrew M. Bean, Chang Liu, Sander Land, Steven Dillmann, Aniketh Garikaparthi, Elron Bandel, Saki Imai, James Edgell, Wm. Matthew Kennedy, Jenny Chim, Patrick Meusling, Asteria Kaeberlein, Venkata Ramachandra Karthik Chundi, Manasi Patwardhan, Martin Ku, Austin Meek, Leon Knauer, Brian Wingenroth, Srishti Yadav, Usman Gohar, Felix Friedrich, Michelle Lin, Jennifer Mickel, Arman Cohan, Stella Biderman, Irene Solaiman, Zeerak Talat, Anka Reuel, Mubashara Akhtar, Gjergji Kasneci, Avijit Ghosh, Leshem Choshen · June 15, 2026

Summary

A new initiative called "Every Eval Ever" introduces a shared schema and community-crowdsourced repository to standardize AI evaluation results. This aims to address inconsistencies across diverse evaluation frameworks and formats, making it easier to compare and analyze AI model performance.

The "Every Eval Ever" project has been launched to standardize how AI evaluation results are stored and compared. It addresses the current fragmentation, in which evaluation outcomes are scattered across incompatible formats such as leaderboards, research papers, and custom logs. The project provides a unified, source-agnostic JSON schema for representing AI evaluation results. The schema is designed to ingest data from diverse evaluation harnesses and publications, and it can optionally store per-instance outputs for more granular analysis. Key contributions include a community-governed metadata schema, automatic converters for popular formats, and a crowdsourced database hosted on Hugging Face. The repository currently covers over 22,000 models, 2,200 benchmarks, and 30 evaluation formats. Its goals are to enable better comparison, reduce costs, and promote reuse in AI evaluation science.
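To make the idea of a source-agnostic record with a converter concrete, here is a minimal Python sketch. It is not the actual "Every Eval Ever" schema: every field name (`model`, `benchmark`, `metric`, `score`, `source`, `instances`), the `from_harness_log` helper, and the input log layout are assumptions invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InstanceResult:
    # Optional per-instance detail; field names are illustrative assumptions.
    instance_id: str
    output: str
    score: float


@dataclass
class EvalRecord:
    # Hypothetical unified record, NOT the real Every Eval Ever schema.
    model: str
    benchmark: str
    metric: str
    score: float
    source: str  # where the number came from, e.g. "harness" or "paper"
    instances: list[InstanceResult] = field(default_factory=list)


def from_harness_log(log: dict) -> EvalRecord:
    """Convert one made-up harness log layout into the unified record."""
    instances = [
        InstanceResult(
            instance_id=str(i),
            output=sample["pred"],
            score=float(sample["correct"]),  # True -> 1.0, False -> 0.0
        )
        for i, sample in enumerate(log.get("samples", []))
    ]
    return EvalRecord(
        model=log["model_name"],
        benchmark=log["task"],
        metric="accuracy",
        score=log["results"]["acc"],
        source="harness",
        instances=instances,
    )


# Made-up input in a hypothetical harness-specific format.
harness_log = {
    "model_name": "example-model",
    "task": "example-benchmark",
    "results": {"acc": 0.5},
    "samples": [
        {"pred": "A", "correct": True},
        {"pred": "B", "correct": False},
    ],
}

record = from_harness_log(harness_log)
print(json.dumps(asdict(record), indent=2))
```

In this sketch, each additional input format would get its own small converter function targeting the same record type, so downstream analysis only ever sees one layout.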

Why it matters

For AI professionals, standardizing evaluation results is crucial for reliable model comparison, benchmarking, and understanding true progress. This initiative can significantly streamline workflows, reduce redundant evaluation efforts, and improve the transparency and reproducibility of AI research and development.

How to implement this in your domain

  1. Adopt the "Every Eval Ever" schema for storing and sharing your AI model evaluation results.
  2. Use the provided automatic converters to integrate existing evaluation data into the unified repository.
  3. Contribute your own evaluation results to the community database on Hugging Face to enhance collective knowledge.
  4. Use the standardized data to perform more consistent and comparable analyses of different AI models and benchmarks.
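The last step above can be sketched in a few lines of Python. Once results from different sources share one layout, comparing models reduces to grouping and sorting. The record fields, model names, benchmark names, and scores below are all made up for illustration and do not come from the actual repository.

```python
from collections import defaultdict

# Hypothetical records, assumed to be already converted to one shared layout.
records = [
    {"model": "model-a", "benchmark": "bench-1", "metric": "accuracy",
     "score": 0.71, "source": "leaderboard"},
    {"model": "model-b", "benchmark": "bench-1", "metric": "accuracy",
     "score": 0.64, "source": "paper"},
    {"model": "model-a", "benchmark": "bench-2", "metric": "accuracy",
     "score": 0.40, "source": "harness"},
    {"model": "model-b", "benchmark": "bench-2", "metric": "accuracy",
     "score": 0.55, "source": "harness"},
]


def rank_by_benchmark(records):
    """Group by (benchmark, metric) and sort models by descending score."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["benchmark"], r["metric"])].append((r["model"], r["score"]))
    return {
        key: sorted(rows, key=lambda row: row[1], reverse=True)
        for key, rows in grouped.items()
    }


rankings = rank_by_benchmark(records)
for (benchmark, metric), rows in rankings.items():
    print(benchmark, metric, rows)
```

Grouping on both benchmark and metric keeps scores measured with different metrics from being ranked against each other, which is one of the comparisons a shared schema is meant to make safe.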

Who benefits

AI Research, Software Development, Data Science, Academia, Quality Assurance

Key takeaways

  • AI evaluation results are often inconsistent and fragmented.
  • "Every Eval Ever" offers a unified schema and repository.
  • The schema standardizes how evaluations are represented in JSON.
  • It aims to improve comparison, reduce costs, and promote reuse in AI evaluation.

Original post by Jan Batzner et al.

"arXiv:2606.14516v1. Abstract: AI evaluations are widely used for testing and understanding progress. However, the diverse evaluators bring with them inconsistencies that challenge analysis and comparison. First, results are saved in incompatible formats, scatter…"

