New Solver Outperforms GPT-5.2 Pro on ARC-AGI-2 Benchmark

Johan Land · July 1, 2026

Summary

A new solver for the ARC-AGI-2 visual reasoning benchmark achieves 72.9% accuracy, significantly surpassing top frontier models like GPT-5.2 Pro and Gemini 3 Pro. It uses modality-driven search to generate diverse candidates and a holistic judging model to compare all reasoning traces.

Large language models (LLMs) can generate fluent and internally consistent reasoning for abstract tasks, yet they often produce confidently incorrect answers. The central challenge is therefore not only generating hypotheses but selecting the correct one from multiple candidates. A new solver for ARC-AGI-2, a few-shot visual reasoning benchmark, addresses this with two core principles.

First, it treats different reasoning modalities (text, image, and code) as distinct search operators, generating candidate solutions independently in each channel so that a wide range of potential answers is considered. Second, it uses context-preserving holistic judging: a dedicated judge model compares all generated reasoning traces at once within a single long-context prompt. Unlike simpler self-consistency or majority voting, this method can identify a correct minority hypothesis even when the most common answer is wrong.

On the ARC Prize semi-private evaluation set, the solver achieved 72.9% accuracy, roughly 19 percentage points ahead of leading standalone models such as GPT-5.2 Pro (54.2%) and Gemini 3 Pro (54.0%). On the public evaluation set it reached 76.1%. The author also shared extensive negative results, noting that prescriptive prompting and iterative refinement can reduce hypothesis diversity and degrade performance.
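The difference between majority voting and holistic judging can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the `Candidate` structure, the `majority_vote` and `build_judge_prompt` helpers, and the prompt wording are all assumptions introduced here, and the call to an actual judge model is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate solution (hypothetical structure, not from the paper)."""
    modality: str  # "text", "image", or "code"
    trace: str     # the reasoning that led to the answer
    answer: str    # the proposed output, serialized as text


def majority_vote(candidates: list[Candidate]) -> str:
    """Baseline: return the most frequent answer, ignoring the reasoning."""
    counts = Counter(c.answer for c in candidates)
    return counts.most_common(1)[0][0]


def build_judge_prompt(task: str, candidates: list[Candidate]) -> str:
    """Place every trace in one prompt so a judge can compare them together."""
    parts = [f"Task:\n{task}", "Candidates:"]
    for i, c in enumerate(candidates, start=1):
        parts.append(
            f"[{i}] modality={c.modality}\nreasoning: {c.trace}\nanswer: {c.answer}"
        )
    parts.append("Compare all candidates and reply with the number of the best one.")
    return "\n\n".join(parts)


# Toy data: two candidates agree on "A", one minority candidate proposes "B".
candidates = [
    Candidate("text", "Objects are mirrored left to right.", "A"),
    Candidate("image", "The picture looks flipped horizontally.", "A"),
    Candidate("code", "A rotate-by-180 program reproduces every training pair.", "B"),
]
vote_answer = majority_vote(candidates)
judge_prompt = build_judge_prompt("Infer the grid transformation.", candidates)
```

In this toy case the vote settles on "A" because two of three candidates agree, whereas the judge prompt keeps the minority candidate's reasoning visible next to the others, which is what would let a judge model prefer it when its evidence is stronger.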

Why it matters

For AI researchers and engineers, this work demonstrates a significant leap in abstract reasoning capabilities, particularly in visual domains, by focusing on robust candidate selection rather than just generation. It offers a blueprint for building more reliable and accurate AI systems for complex problem-solving.

How to implement this in your domain

1. Explore multi-modal reasoning approaches for complex problem-solving tasks within your AI systems.
2. Implement a dedicated "judge" component that holistically evaluates multiple candidate solutions generated by different modalities or methods.
3. Prioritize generating diverse hypotheses over iterative refinement in initial solution exploration phases.
4. Benchmark current AI systems against visual reasoning challenges like ARC-AGI-2 to identify areas for improvement.
5. Study the open-sourced code and negative results to understand effective and ineffective strategies for abstract reasoning.
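As a starting point for the steps above, the following Python skeleton shows one way to wire independent per-modality generators to a single judge. It is a sketch under stated assumptions: the `solve` function, the `Generator` and `Judge` type aliases, and the stub generators and judge are hypothetical placeholders for real model calls, not part of the released solver.

```python
from typing import Callable

# A generator maps a task description to (reasoning trace, answer).
Generator = Callable[[str], tuple[str, str]]
# A judge maps (task, candidates) to the index of the chosen candidate.
Judge = Callable[[str, list[dict]], int]


def solve(task: str, generators: dict[str, Generator], judge: Judge,
          samples: int = 2) -> str:
    """Sample each modality independently, then let one judge pick among all."""
    candidates = []
    for modality, generate in generators.items():
        for _ in range(samples):
            # Each sample is drawn on its own; there is no refinement loop.
            trace, answer = generate(task)
            candidates.append(
                {"modality": modality, "trace": trace, "answer": answer}
            )
    chosen = judge(task, candidates)
    return candidates[chosen]["answer"]


# Stub generators stand in for real model calls (assumption for illustration).
stub_generators = {
    "text": lambda task: ("described the rule in words", "A"),
    "image": lambda task: ("inspected a rendered picture", "A"),
    "code": lambda task: ("wrote a program that fits all examples", "B"),
}


def stub_judge(task: str, candidates: list[dict]) -> int:
    """Toy rule: prefer a trace that claims to fit all examples."""
    for i, c in enumerate(candidates):
        if "fits all examples" in c["trace"]:
            return i
    return 0


result = solve("Infer the grid transformation.", stub_generators, stub_judge)
```

With these stubs the judge selects the code-based candidate even though it is outnumbered four to two. In a real system the judge would be a long-context model call that sees every trace at once rather than a keyword rule.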

Who benefits

AI Development, Research & Development, Robotics, Gaming, Education

Key takeaways

Selecting correct hypotheses is more critical than just generating them for abstract reasoning.
Modality-driven search generates diverse candidates across text, image, and code.
Holistic judging of all candidates in a single context improves accuracy.
The new solver significantly outperforms frontier LLMs on ARC-AGI-2.

Original post by Johan Land

"arXiv:2606.31543v1 Announce Type: new Abstract: Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I prese…"

Originally posted by Johan Land on X
