LLM-as-Judge Safety Evaluations Lack Reproducibility, Even at Zero Temperature
Summary
A study reveals that LLM-as-judge safety evaluations are often non-reproducible, even when temperature is set to zero, due to default provider settings and inherent model variability. This exposes a critical flaw where evaluation harnesses report single-run verdicts without variance, potentially misrepresenting safety properties.
Why it matters
For AI developers and deployers, the finding means that LLM-as-judge safety evaluations may be less reliable than single-run results suggest. Testing methodologies should be revisited, and variance metrics reported alongside verdicts, before a pass/fail result is trusted to gate deployment.
How to implement this in your domain
1. Always explicitly set temperature and seed parameters when using LLM-as-judge components in evaluation harnesses.
2. Conduct multiple runs for each evaluation item and report variance or disagreement metrics alongside average scores.
3. Develop internal guidelines for acceptable levels of grader disagreement in safety evaluations.
4. Advocate for AI model providers to offer more transparent control over sampling parameters and report reproducibility guarantees.
5. Explore alternative or complementary evaluation methods that are less susceptible to LLM variability for critical safety assessments.
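The first two steps can be sketched in a few lines of Python. This is a minimal illustration, not the study's method: `grade_repeatedly`, `summarize`, and `make_flaky_grader` are hypothetical names, and the stand-in grader simply simulates a judge whose verdict occasionally flips even though temperature and seed are passed explicitly. In a real harness the grader callable would wrap whatever provider client you use.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical grader signature: (item, temperature, seed) -> "pass" or "fail".
Grader = Callable[..., str]


def grade_repeatedly(grader: Grader, item: str, n_runs: int = 5,
                     temperature: float = 0.0, seed: int = 0) -> List[str]:
    """Call the grader n_runs times with explicitly pinned sampling parameters."""
    return [grader(item, temperature=temperature, seed=seed) for _ in range(n_runs)]


def summarize(verdicts: List[str]) -> Dict[str, object]:
    """Report the majority verdict together with the share of runs that disagree."""
    counts = Counter(verdicts)
    majority, majority_count = counts.most_common(1)[0]
    n = len(verdicts)
    return {
        "majority": majority,
        "disagreement": (n - majority_count) / n,
        "n_runs": n,
    }


def make_flaky_grader(flip_every: int) -> Grader:
    """Stand-in for a real LLM judge (assumption): returns "fail" on every
    flip_every-th call and "pass" otherwise, ignoring temperature and seed."""
    calls = {"n": 0}

    def grader(item: str, temperature: float = 0.0, seed: int = 0) -> str:
        calls["n"] += 1
        return "fail" if calls["n"] % flip_every == 0 else "pass"

    return grader


if __name__ == "__main__":
    verdicts = grade_repeatedly(make_flaky_grader(5), "example item", n_runs=10)
    print(summarize(verdicts))
```

Reporting `disagreement` next to the majority verdict keeps an unstable pass from being read as a clean pass.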
Key takeaways
- LLM-as-judge safety evaluations are often non-reproducible, even at temperature 0.
- Default provider settings and inherent model variability contribute to this issue.
- Reporting single-run verdicts without variance can misrepresent safety properties.
- Evaluation harnesses should treat grader disagreement as a first-class health metric.
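One way to treat grader disagreement as a health metric is to track, across the whole evaluation set, the fraction of items whose repeated verdicts are not unanimous. The sketch below assumes verdicts have already been collected per item; `grader_health` and the 5% threshold are illustrative choices, not values from the study, and an acceptable threshold would come from your own internal guidelines.

```python
from typing import Dict, List


def grader_health(verdicts_by_item: Dict[str, List[str]],
                  max_unstable_fraction: float = 0.05) -> Dict[str, object]:
    """Flag the harness as unhealthy when too many items get mixed verdicts.

    An item counts as unstable if its repeated runs produced more than one
    distinct verdict. The default threshold is a placeholder assumption.
    """
    unstable = sorted(
        item_id for item_id, verdicts in verdicts_by_item.items()
        if len(set(verdicts)) > 1
    )
    total = len(verdicts_by_item)
    fraction = len(unstable) / total if total else 0.0
    return {
        "unstable_items": unstable,
        "unstable_fraction": fraction,
        "healthy": fraction <= max_unstable_fraction,
    }


if __name__ == "__main__":
    runs = {
        "item-1": ["pass", "pass", "pass"],
        "item-2": ["pass", "fail", "pass"],  # verdict flipped between runs
        "item-3": ["fail", "fail", "fail"],
        "item-4": ["pass", "pass", "pass"],
    }
    print(grader_health(runs))
```

A harness could fail its own run, or at least emit a warning, whenever `healthy` is false, so that grader instability is surfaced before any safety verdict is reported.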
Original post by Hiroki Tamba
"arXiv:2606.26185v1 Announce Type: new Abstract: LLM-as-judge ("grader") components are now standard in evaluation harnesses, including safety evaluations where a pass/fail verdict may gate downstream deployment decisions. A widespread assumption is that setting the grader's sampl…"
Originally posted on X.