LLM-as-a-Judge Evaluations Show Significant Instability and Bias

Abel Yagubyan · June 15, 2026

Summary

This study reveals significant run-to-run unreliability and bias in LLM-as-a-Judge evaluations, with pairwise preferences flipping frequently and a notable first-position bias. It suggests that single-trial LLM judging is often too noisy for high-stakes evaluation, necessitating multi-trial aggregation and position randomization.

The widespread adoption of "LLM-as-a-Judge" for ranking model outputs, training reward models, and populating leaderboards has prompted a critical examination of its reliability and potential biases. This research systematically investigates the consistency of such evaluations by conducting repeated identical trials across 29 tasks and 10 categories, using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini). The study involved 50 pairwise and 50 pointwise trials per question, complemented by ablations for temperature and prompt sensitivity.

The findings reveal substantial instability: pairwise preferences flipped on average 13.6% of the time, with nearly a third of questions exceeding a 20% flip rate, and one question showing a 56% flip rate. Furthermore, GPT-4o-mini exhibited a significant first-position bias, favoring the first option in 72% of cases. Although mean pointwise score gaps were small, judges frequently declared a pairwise winner even when their scalar scores indicated minimal quality differences, highlighting a discrepancy between pairwise and pointwise assessments. Beyond within-judge inconsistency, cross-judge agreement was only 76%, and minor changes to prompt templates altered majority outcomes in 25% of cases. While deterministic decoding reduced inconsistency, it did not eliminate it.

A reliability curve analysis suggests that, on average, 11 repeated trials are needed to recover the 50-trial reference verdict with 95% probability, increasing to 15 for high-variance questions. These results underscore that single-trial LLM judging is often too noisy for critical evaluations, and they support standard practices such as multi-trial aggregation, position randomization, and explicit uncertainty reporting.
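To make the two headline metrics concrete, here is a minimal Python sketch of how a flip rate and a first-position rate could be computed from repeated trials. The summary does not give the study's exact definitions, so both are assumptions: flip rate is taken as the share of trials that disagree with the most common verdict (with "tie" counted as a third possible verdict), and first-position rate as the share of decided trials won by whichever output was shown first. The function names and the sample data are hypothetical.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of trials whose verdict differs from the most common verdict.

    Assumed definition; the study may define flip rate differently.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    modal_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - modal_count) / len(verdicts)


def first_position_rate(trials):
    """Share of decided trials won by whichever output was shown first.

    Each trial is a (shown_first, winner) pair; winner may be "tie",
    in which case the trial is excluded from the rate.
    """
    decided = [(first, winner) for first, winner in trials if winner != "tie"]
    if not decided:
        raise ValueError("no decided trials")
    return sum(first == winner for first, winner in decided) / len(decided)


# Hypothetical data: 10 repeated trials on a single question.
verdicts = ["A"] * 7 + ["B"] * 2 + ["tie"]
trials = (
    [("A", "A")] * 4      # A shown first, A wins
    + [("B", "B")] * 3    # B shown first, B wins
    + [("A", "B")] * 2    # A shown first, B wins
    + [("B", "tie")]      # undecided trial
)

print(flip_rate(verdicts))
print(first_position_rate(trials))
```

When presentation order is randomized, a first-position rate well above 0.5 is one signal of position bias, since an unbiased judge would pick the first-shown output about half the time.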

Why it matters

Professionals relying on LLM-as-a-Judge for model evaluation, A/B testing, or leaderboard rankings must be aware of its inherent unreliability and biases. Implementing recommended practices like multiple trials and randomization is crucial for obtaining trustworthy and actionable evaluation results.

How to implement this in your domain

1. Implement multi-trial evaluations: Conduct multiple repeated trials for each LLM-as-a-Judge assessment to improve reliability.
2. Randomize output positions: Always randomize the order of model outputs when presenting them to an LLM judge to mitigate position bias.
3. Report uncertainty: Include confidence intervals or flip rates alongside LLM-as-a-Judge scores to reflect evaluation uncertainty.
4. Cross-validate with human judges: Periodically compare LLM-as-a-Judge results with human evaluations for critical tasks to ensure alignment.
5. Standardize prompt templates: Develop and adhere to robust, semantically equivalent prompt templates to minimize prompt sensitivity issues.
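Steps 1 to 3 can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the study's code: `judge` is a hypothetical callable that receives the two outputs in presentation order and returns "first", "second", or "tie"; `stub_judge` is a toy stand-in for a real model call; and the Wilson score interval is one common choice for reporting uncertainty on a win rate, not necessarily the one the authors used.

```python
import math
import random
from collections import Counter


def judge_pair(judge, output_a, output_b, n_trials=11, seed=0):
    """Run repeated pairwise trials with randomized order; return verdict counts."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_trials):
        # Randomize which output is shown first on every trial.
        a_first = rng.random() < 0.5
        first, second = (output_a, output_b) if a_first else (output_b, output_a)
        pick = judge(first, second)  # expected: "first", "second", or "tie"
        if pick == "tie":
            counts["tie"] += 1
        elif (pick == "first") == a_first:
            counts["A"] += 1  # the picked slot held output A
        else:
            counts["B"] += 1
    return counts


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def stub_judge(first, second):
    """Toy judge (hypothetical): prefers the longer text, first slot on equal length."""
    return "first" if len(first) >= len(second) else "second"


counts = judge_pair(stub_judge, "a longer, more detailed answer", "short", n_trials=11)
verdict = counts.most_common(1)[0][0]
low, high = wilson_interval(counts["A"], sum(counts.values()))
print(verdict, dict(counts), (round(low, 3), round(high, 3)))
```

Reporting the majority verdict together with the raw counts and the interval lets a reader see whether a win was decisive or close to a coin flip.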

Who benefits

AI Development, Software Testing, Product Management, Research & Development, Quality Assurance

Key takeaways

LLM-as-a-Judge evaluations are prone to significant run-to-run unreliability.
Position bias and prompt sensitivity can heavily influence outcomes.
Single-trial judging is often insufficient for high-stakes decisions.
Multi-trial aggregation, randomization, and uncertainty reporting are essential for reliable results.
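To estimate how many trials are enough for a given question, one option is to resample from a larger reference set of trials and check how often the majority of a small sample matches the reference majority. The Python sketch below follows that idea. It is an assumption about how such a reliability curve could be built, since the summary does not describe the study's procedure; the function names, the resampling scheme (sampling with replacement), and the example data are all hypothetical.

```python
import random
from collections import Counter


def majority(verdicts):
    """Most common verdict; equal counts resolve to the label seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def recovery_probability(reference_trials, k, n_resamples=2000, seed=0):
    """Estimate how often the majority of k resampled trials matches
    the majority verdict of the full reference set."""
    rng = random.Random(seed)
    target = majority(reference_trials)
    hits = sum(
        majority(rng.choices(reference_trials, k=k)) == target
        for _ in range(n_resamples)
    )
    return hits / n_resamples


def trials_needed(reference_trials, threshold=0.95, max_k=49):
    """Smallest odd k whose estimated recovery probability meets the threshold.

    Returns None if no k up to max_k is sufficient.
    """
    for k in range(1, max_k + 1, 2):
        if recovery_probability(reference_trials, k) >= threshold:
            return k
    return None


# Hypothetical reference set: 50 trials, 40 for A and 10 for B.
reference = ["A"] * 40 + ["B"] * 10
for k in (1, 3, 7):
    print(k, recovery_probability(reference, k))
print("trials needed:", trials_needed(reference))
```

Questions whose reference verdicts are closer to an even split need more trials to reach the same threshold, which is consistent with the higher trial count reported for high-variance questions.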

Original post by Abel Yagubyan

"arXiv:2606.13685v1 Announce Type: cross Abstract: LLM-as-a-Judge is now widely used to rank model outputs, train reward models, and populate public leaderboards, but its run-to-run reliability remains under-characterized. We study repeated identical evaluations on 29 tasks spanni…"

Originally posted by Abel Yagubyan on X
