LLM-as-Judge Safety Evaluations Lack Reproducibility, Even at Zero Temperature

Hiroki Tamba · June 26, 2026

Summary

A study finds that LLM-as-judge safety evaluations are often not reproducible. When a harness leaves the temperature unset, the provider's default applies silently; when the temperature is pinned to zero, some borderline items still flip between runs. Harnesses that report a single-run verdict without variance can therefore present noise as a safety property.

This research uncovers significant reproducibility problems in LLM-as-judge safety evaluations, now a standard input to AI deployment decisions. A common assumption is that setting the sampling temperature to 0 makes grading deterministic. Testing against a real-world safety evaluation codebase (Japan AISI's aisev) showed that the assumption fails on two levels.

First, many harnesses invoke graders without explicitly setting temperature or seed, so providers silently apply a default temperature of 1.0. For items near the decision boundary, this produces substantial per-item disagreement (up to 50%) across otherwise identical runs.

Second, non-reproducibility persists even when temperature is explicitly pinned to 0. Across 690 API calls spanning several providers and models, 1-2 of 7 borderline items remained non-reproducible under forced greedy decoding. The deprecation of temperature control in newer models such as Claude Opus 4.7/4.8 makes mitigation harder still. Together, these findings point to a structural gap: a single-run verdict reported without variance metrics can pass off noise as a safety property.
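The summary does not say how per-item disagreement is computed. One plausible reading, sketched below as an assumption, is the share of repeated runs whose verdict differs from the most common one; under that definition a two-way pass/fail verdict can reach at most 0.5. The function names and the `runs_by_item` layout are illustrative and do not come from aisev or the paper.

```python
from collections import Counter
from typing import Dict, List


def disagreement_rate(verdicts: List[str]) -> float:
    """Fraction of runs whose verdict differs from the most common verdict.

    Assumed definition (not taken from the paper): 0.0 means every run agreed.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return 1.0 - majority_count / len(verdicts)


def non_reproducible_items(runs_by_item: Dict[str, List[str]]) -> List[str]:
    """Ids of items whose repeated runs did not all return the same verdict."""
    return sorted(
        item for item, verdicts in runs_by_item.items() if len(set(verdicts)) > 1
    )
```

Feeding the recorded verdicts from repeated identical runs into `non_reproducible_items` gives a count comparable in spirit to the "1-2 of 7 borderline items" figure, provided the harness stores every run rather than only the last one.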

Why it matters

For AI developers and deployers, the finding means that a single LLM-as-judge verdict may not be reliable evidence of safety. Testing methodologies need to be revisited, and variance metrics reported alongside verdicts, before those verdicts are trusted as a basis for deployment.

How to implement this in your domain

  1. Always explicitly set temperature and seed parameters when using LLM-as-judge components in evaluation harnesses.
  2. Conduct multiple runs for each evaluation item and report variance or disagreement metrics alongside average scores.
  3. Develop internal guidelines for acceptable levels of grader disagreement in safety evaluations.
  4. Advocate for AI model providers to offer more transparent control over sampling parameters and report reproducibility guarantees.
  5. Explore alternative or complementary evaluation methods that are less susceptible to LLM variability for critical safety assessments.
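The first three steps can be sketched as a small wrapper around the grader. This is a minimal illustration, not the aisev implementation: `GraderConfig`, `evaluate_item`, and the `grade_fn(item, temperature=..., seed=...)` signature are all hypothetical, and `make_flaky_stub` stands in for a real provider call so the example runs offline. It assumes binary "pass"/"fail" verdicts.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class GraderConfig:
    temperature: float = 0.0       # always passed explicitly, never left to a provider default
    seed: int = 1234               # likewise explicit; a real provider may ignore it
    n_runs: int = 5                # repeated identical calls per item
    max_disagreement: float = 0.0  # internal guideline for acceptable disagreement


def evaluate_item(grade_fn: Callable[..., str], item: str, cfg: GraderConfig) -> Dict[str, object]:
    """Grade one item cfg.n_runs times and report the spread, not a single verdict."""
    verdicts = [
        grade_fn(item, temperature=cfg.temperature, seed=cfg.seed)
        for _ in range(cfg.n_runs)
    ]
    pass_rate = sum(v == "pass" for v in verdicts) / len(verdicts)
    # For binary verdicts, disagreement is the share of runs in the minority.
    disagreement = min(pass_rate, 1.0 - pass_rate)
    return {
        "item": item,
        "verdicts": verdicts,
        "pass_rate": pass_rate,
        "disagreement": disagreement,
        "stable": disagreement <= cfg.max_disagreement,
    }


def make_flaky_stub() -> Callable[..., str]:
    """Hypothetical stand-in grader that flips every third call despite fixed settings."""
    cycle = itertools.cycle(["pass", "pass", "fail"])

    def grade(item: str, *, temperature: float, seed: int) -> str:
        return next(cycle)

    return grade


if __name__ == "__main__":
    report = evaluate_item(make_flaky_stub(), "borderline-item", GraderConfig(n_runs=6))
    print(report["verdicts"], report["stable"])
```

The wrapper measures disagreement instead of assuming that pinned parameters remove it, which matches the finding that temperature 0 alone did not make borderline items reproducible. An item marked unstable is a candidate for the complementary evaluation methods mentioned in step 5.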

Who benefits

AI/ML Development · Software Testing · Regulatory Compliance · Cybersecurity · Product Management

Key takeaways

  • LLM-as-judge safety evaluations are often non-reproducible, even at temperature 0.
  • Default provider settings and inherent model variability contribute to this issue.
  • Reporting single-run verdicts without variance can misrepresent safety properties.
  • Evaluation harnesses should treat grader disagreement as a first-class health metric.

Original post by Hiroki Tamba

"arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampl…"

Originally posted by Hiroki Tamba on X
