LLMs Struggle with Physics Reasoning in Unfamiliar Worlds

Dong Zhang · July 2, 2026

Summary

A new four-stage diagnostic evaluates frontier LLMs' physics literacy in counterfactual and historical physics frameworks, revealing significant limitations in genuine reasoning beyond recall. Models frequently fail quantitative predictions despite understanding qualitative directions.

Physics benchmarks for large language models (LLMs) are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem types. This research introduces an auditable four-stage diagnostic that assesses whether an LLM can reason within an unfamiliar physics framework. The four stages are induction, formulation, prediction, and review, run under strict protocols such as locked pre-registrations and fresh sessions.

The diagnostic was applied to three "parallel physics worlds": a single-equation counterfactual world (F=mv), a historical framework (Aristotelian mechanics), and a four-domain counterfactual world (Decay World). Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro all showed low composite PASS rates across these worlds, which indicates that the models had difficulty adapting to new physical rules.

A key empirical finding was a qualitative-versus-quantitative asymmetry: in Decay World, the models rarely predicted the wrong direction of change but frequently got the quantitative calculation wrong, often by reverting to standard-physics relations. The study also highlights two methodological issues: LLM judges were unreliable across frameworks, and LLM self-review was weak, with models often failing to identify their own errors.
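To make the idea of "reverting to standard-physics relations" concrete, here is a minimal sketch for the F=mv world. It assumes the counterfactual rule means force is proportional to velocity rather than acceleration; this summary does not give the paper's exact task wording, so the function names and numbers below are hypothetical illustrations, not the study's test items.

```python
def displacement_counterfactual(force, mass, t):
    """Assumed F = m*v world: a constant force gives a constant velocity v = F/m."""
    velocity = force / mass
    return velocity * t


def displacement_newtonian(force, mass, t):
    """Standard F = m*a, starting from rest: x = 0.5 * (F/m) * t**2."""
    acceleration = force / mass
    return 0.5 * acceleration * t ** 2


# Hypothetical inputs chosen only for illustration.
force, mass, t = 6.0, 2.0, 4.0

in_world = displacement_counterfactual(force, mass, t)
reverted = displacement_newtonian(force, mass, t)

print(in_world)   # 12.0
print(reverted)   # 24.0
```

In both worlds a larger force moves the object farther, so a model that slips back into Newtonian mechanics still gets the direction right while reporting the wrong magnitude. That is the same shape of error the study reports for Decay World.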

Why it matters

Professionals developing or deploying LLMs for scientific, engineering, or complex reasoning tasks must understand their limitations in adapting to novel rule sets and performing accurate quantitative reasoning beyond pattern matching.

How to implement this in your domain

1. Adopt rigorous diagnostic protocols for evaluating LLMs beyond simple accuracy metrics, especially for critical applications.
2. Design custom benchmarks that test LLMs' ability to reason in novel or counterfactual scenarios relevant to specific domains.
3. Implement human-in-the-loop auditing for LLM outputs, particularly for quantitative tasks where models may "hallucinate" incorrect calculations.
4. Train LLMs with more diverse and abstract reasoning tasks to improve their adaptability to unfamiliar frameworks.
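One way to go beyond a single accuracy number in a custom benchmark is to score direction and magnitude separately, so the qualitative-versus-quantitative gap becomes visible. The sketch below is an assumption about how such a scorer might look, not the paper's protocol; `score_prediction`, its parameters, and the 5% tolerance are all hypothetical.

```python
import math


def score_prediction(baseline, expected, predicted, rel_tol=0.05):
    """Score one numeric answer on two axes.

    baseline:  value of the quantity before the change
    expected:  ground truth under the custom rule set
    predicted: the model's numeric answer
    rel_tol:   relative tolerance for a magnitude match (assumed 5%)
    """
    def sign(x):
        return (x > 0) - (x < 0)

    # Direction: did the model predict a change with the correct sign?
    direction_ok = sign(expected - baseline) == sign(predicted - baseline)
    # Magnitude: is the number itself within tolerance of the ground truth?
    magnitude_ok = math.isclose(predicted, expected, rel_tol=rel_tol)
    return {"direction_ok": direction_ok, "magnitude_ok": magnitude_ok}


# Hypothetical case: right direction, wrong magnitude.
result = score_prediction(baseline=10.0, expected=12.0, predicted=24.0)
print(result)  # {'direction_ok': True, 'magnitude_ok': False}
```

Reporting the two flags separately lets a human auditor focus on the cases where the direction is correct but the number is not, which is where the study found models most often went wrong.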

Who benefits

AI/ML Development, Scientific Research, Aerospace, Engineering, Education

Key takeaways

Current LLMs struggle with genuine physics reasoning in unfamiliar frameworks.
They often revert to standard physics rules when faced with counterfactual scenarios.
LLMs show a qualitative-quantitative asymmetry, performing better on direction than magnitude.
LLM self-review mechanisms are currently weak and unreliable for error detection.

Original post by Dong Zhang

"arXiv:2607.00276v1 Announce Type: new Abstract: Current large-language-model (LLM) physics benchmarks are usually scored by answer accuracy, which cannot distinguish genuine reasoning from recall of familiar problem patterns and reveals little about where a model's reasoning brea…"

Originally posted by Dong Zhang on X
