New Benchmark Tests LLM Reasoning, Diplomacy Under Fog of War

Arnaud Ricci · June 24, 2026

Summary

"Age of LLM" is a new turn-based 1v1 benchmark where two LLMs compete to destroy an enemy base on a grid, incorporating stressors like fog of war, full diplomacy, and strict JSON schema adherence for actions. The benchmark reveals insights into LLM reasoning, belief-tracking, and spontaneous deception.

A novel benchmark named "Age of LLM" has been introduced to evaluate large language models (LLMs) in a strategic, adversarial environment. This turn-based 1v1 game pits two LLMs against each other on a grid, with the objective of destroying the opponent's base. The benchmark incorporates several challenging elements: a "fog of war" that limits information, full diplomatic capabilities including messages and ceasefires, and a strict requirement for actions to conform to a JSON schema, with illegal actions being silently discarded. The engine is private, and each match uses a fresh random map and opponent to prevent data contamination.

Initial findings from benchmarking 15 models across 54 matches indicate that a "nuclear rush" strategy dominates, often due to mechanical execution rather than cognitive deterrence failure. Diplomacy is frequent but rarely successful, and a significant portion of illegal actions stem from fog/state errors, providing a measure of belief-tracking. The research also suggests a weak link between reliability and winning, and the turn-by-turn traces offer a unique lens into LLM reasoning under uncertainty, including belief-tracking and spontaneous deception.
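To make the "strict schema, silent discard" rule concrete, here is a minimal sketch of how an engine might screen one turn of proposed actions. The benchmark's engine is private, so everything below is an assumption made for illustration: the action names, the `target` field, the `check_action` and `apply_turn` helpers, and the split of drop reasons into `schema` and `fog_state`. Only the 13x7 grid size comes from the abstract.

```python
from dataclasses import dataclass

GRID_W, GRID_H = 13, 7  # grid size quoted in the abstract

# Hypothetical action names; the real schema is not public.
ALLOWED_TYPES = {"move", "attack", "message"}


@dataclass(frozen=True)
class Verdict:
    legal: bool
    reason: str  # "ok", "schema", or "fog_state"


def check_action(action, visible_cells):
    """Classify one proposed action (a value parsed from the model's JSON)."""
    kind = action.get("type") if isinstance(action, dict) else None
    if not isinstance(kind, str) or kind not in ALLOWED_TYPES:
        return Verdict(False, "schema")
    if kind == "message":
        ok = isinstance(action.get("text"), str)
        return Verdict(ok, "ok" if ok else "schema")
    target = action.get("target")
    well_formed = (
        isinstance(target, list)
        and len(target) == 2
        and all(isinstance(v, int) and not isinstance(v, bool) for v in target)
    )
    if not well_formed:
        return Verdict(False, "schema")
    x, y = target
    if not (0 <= x < GRID_W and 0 <= y < GRID_H):
        return Verdict(False, "schema")
    # Well-formed, but aimed at a cell hidden by fog: this points to a
    # belief-tracking slip rather than a formatting slip.
    if kind == "attack" and (x, y) not in visible_cells:
        return Verdict(False, "fog_state")
    return Verdict(True, "ok")


def apply_turn(actions, visible_cells):
    """Keep legal actions; drop the rest without telling the model why."""
    kept, drop_reasons = [], []
    for action in actions:
        verdict = check_action(action, visible_cells)
        if verdict.legal:
            kept.append(action)
        else:
            drop_reasons.append(verdict.reason)
    return kept, drop_reasons


visible = {(0, 0), (1, 0), (1, 1)}
proposed = [
    {"type": "move", "target": [1, 1]},
    {"type": "attack", "target": [12, 6]},  # in bounds but hidden by fog
    {"type": "nuke"},                       # unknown action type
]
kept, drop_reasons = apply_turn(proposed, visible)
```

Logging the drop reason separately from the feedback given to the model is what would let an evaluator count fog/state errors as a belief-tracking signal while the player itself sees nothing.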

Why it matters

This benchmark provides a robust, adversarial environment to rigorously test and understand LLM capabilities in complex strategic reasoning, reliability, and social interaction, crucial for developing more sophisticated and trustworthy AI agents.

How to implement this in your domain

1. Utilize the benchmark's methodology to evaluate custom LLM agents for strategic decision-making.
2. Analyze replay data to identify patterns in LLM behavior under uncertainty and adversarial conditions.
3. Develop LLM training strategies that specifically address belief-tracking and adherence to action schemas.
4. Explore multi-agent architectures where different LLMs handle strategic planning, diplomacy, and action execution.
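The replay-analysis step can be sketched as a small aggregation over match records. The record format below is hypothetical (the benchmark's actual replay format is not described here), and the three records are toy values chosen to show the arithmetic, not results from the paper.

```python
from collections import defaultdict

# Toy records in an assumed format: one dict per model per match.
replays = [
    {"model": "model_a", "won": True, "actions": 40, "illegal": 4},
    {"model": "model_a", "won": False, "actions": 60, "illegal": 6},
    {"model": "model_b", "won": True, "actions": 50, "illegal": 25},
]


def summarize(records):
    """Return per-model win rate and legal-action rate."""
    totals = defaultdict(
        lambda: {"matches": 0, "wins": 0, "actions": 0, "illegal": 0}
    )
    for rec in records:
        t = totals[rec["model"]]
        t["matches"] += 1
        t["wins"] += int(rec["won"])
        t["actions"] += rec["actions"]
        t["illegal"] += rec["illegal"]
    return {
        model: {
            "win_rate": t["wins"] / t["matches"],
            # Share of proposed actions that were legal; 0.0 if none proposed.
            "legal_rate": 1 - t["illegal"] / t["actions"] if t["actions"] else 0.0,
        }
        for model, t in totals.items()
    }


summary = summarize(replays)
for model, stats in sorted(summary.items()):
    print(f"{model}: win_rate={stats['win_rate']:.2f} legal_rate={stats['legal_rate']:.2f}")
```

Putting the two rates side by side per model is one simple way to check, on your own agents, whether reliability and winning move together.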

Who benefits

AI Development, Gaming, Defense, Cybersecurity, Robotics

Key takeaways

"Age of LLM" is a new benchmark for evaluating LLMs in strategic 1v1 combat under uncertainty.
It tests reasoning, diplomacy, and reliability with fog of war and strict action schemas.
Initial results show a dominant "nuclear rush" strategy and frequent but ineffective diplomacy.
The benchmark provides insights into LLM belief-tracking and potential for deception.

Original post by Arnaud Ricci

"arXiv:2606.24391v1 Announce Type: new Abstract: We introduce Age of LLM, a turn-based 1v1 benchmark in which two LLMs face off on a 13x7 grid to destroy the enemy base. Three stressors are deliberate: fog of war, full diplomacy (messages, ceasefires, ultimatums; uranium kept secr…"

Originally posted by Arnaud Ricci on X
