New Solver Outperforms GPT-5.2 Pro on ARC-AGI-2 Benchmark

Johan Land · July 1, 2026

Summary

A new solver for the ARC-AGI-2 visual reasoning benchmark achieves 72.9% accuracy on the ARC Prize semi-private evaluation set, well ahead of standalone frontier models such as GPT-5.2 Pro (54.2%) and Gemini 3 Pro (54.0%). It uses modality-driven search to generate diverse candidates and a holistic judging model to compare all reasoning traces.

Large language models (LLMs) can generate fluent, internally consistent reasoning for abstract tasks and still produce confidently incorrect answers. The primary challenge is therefore not just generating hypotheses but selecting the correct one from multiple candidates. A new solver for ARC-AGI-2, a few-shot visual reasoning benchmark, addresses this with two core principles.

First, it treats different reasoning modalities (text, image, and code) as distinct search operators, generating candidate solutions independently in each channel so that a diverse range of potential answers is considered.

Second, it uses context-preserving holistic judging: a dedicated judge model compares all generated reasoning traces at once within a single long-context prompt. Unlike simpler self-consistency or majority voting, this method can identify correct minority hypotheses even when the most common answer is wrong.

On the ARC Prize semi-private evaluation set, the solver achieved 72.9% accuracy, compared with 54.2% for GPT-5.2 Pro and 54.0% for Gemini 3 Pro run as standalone models. On the public evaluation set it scored 76.1%. The author also shares extensive negative results, noting that prescriptive prompting and iterative refinement can reduce hypothesis diversity and degrade performance.
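To make the contrast between majority voting and holistic judging concrete, here is a minimal Python sketch. It is not the solver's actual code: the `Candidate` class, the `Generator` and judge interfaces, the prompt layout, and the toy generators are all assumptions invented for illustration, and the model calls are replaced by stand-in functions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Grid = Tuple[Tuple[int, ...], ...]  # an ARC-style grid of color indices


@dataclass(frozen=True)
class Candidate:
    modality: str  # "text", "image", or "code"
    trace: str     # the reasoning that led to the answer
    answer: Grid


# Hypothetical interface: a generator maps a task description to (trace, answer).
Generator = Callable[[str], Tuple[str, Grid]]


def generate_candidates(task: str, generators: Dict[str, Generator]) -> List[Candidate]:
    """Run each modality independently, so no channel sees another's guess."""
    candidates = []
    for modality, generate in generators.items():
        trace, answer = generate(task)
        candidates.append(Candidate(modality, trace, answer))
    return candidates


def majority_vote(candidates: List[Candidate]) -> Grid:
    """Baseline: return the most frequent answer, ignoring the reasoning traces."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


def build_judge_prompt(task: str, candidates: List[Candidate]) -> str:
    """Put the task and every trace into one prompt so they can be compared together."""
    sections = [f"TASK:\n{task}"]
    for i, c in enumerate(candidates):
        sections.append(f"CANDIDATE {i} ({c.modality}):\n{c.trace}\nANSWER: {c.answer}")
    sections.append("Reply with the index of the best-supported candidate.")
    return "\n\n".join(sections)


def holistic_select(task: str, candidates: List[Candidate],
                    judge: Callable[[str], int]) -> Grid:
    """Ask the judge (assumed interface: prompt -> candidate index) to pick one answer."""
    return candidates[judge(build_judge_prompt(task, candidates))].answer


# Toy stand-ins for model calls: two modalities agree on a wrong grid.
WRONG: Grid = ((1, 1), (1, 1))
RIGHT: Grid = ((1, 0), (0, 1))
generators: Dict[str, Generator] = {
    "text": lambda task: ("Fill every cell.", WRONG),
    "image": lambda task: ("The picture looks uniformly filled.", WRONG),
    "code": lambda task: ("Program reproduces all training pairs: draw the diagonal.", RIGHT),
}


def toy_judge(prompt: str) -> int:
    """Stand-in judge: prefers the trace that was checked against the training pairs."""
    for section in prompt.split("\n\n"):
        if section.startswith("CANDIDATE") and "training pairs" in section:
            return int(section.split()[1])
    return 0


candidates = generate_candidates("demo task", generators)
print(majority_vote(candidates) == RIGHT)                       # False
print(holistic_select("demo task", candidates, toy_judge) == RIGHT)  # True
```

In this toy run the wrong grid wins the vote two to one, while the judge can still choose the minority answer because it sees every trace side by side. A real judge would be a long-context model call rather than a keyword check.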

Why it matters

For AI researchers and engineers, this work suggests that large gains in abstract visual reasoning can come from robust candidate selection rather than from generation alone. It offers a blueprint for building more reliable and accurate AI systems for complex problem-solving.

How to implement this in your domain

  1. Explore multi-modal reasoning approaches for complex problem-solving tasks within your AI systems.
  2. Implement a dedicated "judge" component that holistically evaluates multiple candidate solutions generated by different modalities or methods.
  3. Prioritize generating diverse hypotheses over iterative refinement in initial solution exploration phases.
  4. Benchmark current AI systems against visual reasoning challenges like ARC-AGI-2 to identify areas for improvement.
  5. Study the open-sourced code and negative results to understand effective and ineffective strategies for abstract reasoning.
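For the benchmarking step, a small scoring harness might look like the Python sketch below. It assumes tasks follow the ARC-style JSON layout (a `train` and a `test` list of `input`/`output` grid pairs) and scores by exact grid match with up to two attempts per test input. The `Solver` interface, the `max_attempts` default, and the toy tasks are assumptions made for illustration, and official ARC Prize scoring may aggregate results differently.

```python
from typing import Callable, Dict, List

Grid = List[List[int]]
Pair = Dict[str, Grid]        # {"input": grid, "output": grid}
Task = Dict[str, List[Pair]]  # {"train": [pairs], "test": [pairs]}

# Hypothetical solver interface: (training pairs, test input) -> ranked attempts.
Solver = Callable[[List[Pair], Grid], List[Grid]]


def score_tasks(tasks: List[Task], solver: Solver, max_attempts: int = 2) -> float:
    """Fraction of test inputs solved by an exact grid match within max_attempts."""
    solved = 0
    total = 0
    for task in tasks:
        for pair in task["test"]:
            attempts = solver(task["train"], pair["input"])[:max_attempts]
            solved += any(attempt == pair["output"] for attempt in attempts)
            total += 1
    return solved / total if total else 0.0


def mirror_solver(train: List[Pair], test_input: Grid) -> List[Grid]:
    """Toy solver that always mirrors each row, regardless of the training pairs."""
    return [[row[::-1] for row in test_input]]


# Two invented tasks: the first is a mirror rule, the second a transpose rule.
tasks: List[Task] = [
    {
        "train": [{"input": [[1, 0]], "output": [[0, 1]]}],
        "test": [{"input": [[2, 0, 0]], "output": [[0, 0, 2]]}],
    },
    {
        "train": [{"input": [[0, 5], [0, 0]], "output": [[0, 0], [5, 0]]}],
        "test": [{"input": [[1, 2], [3, 4]], "output": [[1, 3], [2, 4]]}],
    },
]

print(score_tasks(tasks, mirror_solver))  # 0.5
```

Swapping `mirror_solver` for a wrapper around your own system gives a quick per-input accuracy figure to compare across configurations.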

Who benefits

AI Development, Research & Development, Robotics, Gaming, Education

Key takeaways

  • Selecting the correct hypothesis among candidates, not just generating hypotheses, is the central challenge in abstract reasoning.
  • Modality-driven search generates diverse candidates across text, image, and code.
  • Holistic judging of all candidates in a single context improves accuracy.
  • On the ARC-AGI-2 semi-private evaluation set, the solver scores 72.9%, nearly 19 percentage points above GPT-5.2 Pro (54.2%) and Gemini 3 Pro (54.0%).

Original post by Johan Land

"arXiv:2606.31543v1 Announce Type: new Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I prese…"

Originally posted by Johan Land on X
