New Method Improves LLM Safety with Low-Agreeableness Persona Conditioning

Austin MY Cheung, Yi Yang· June 29, 2026 View original

Summary

Researchers propose a persona-driven rewriting pipeline that conditions user inputs on low agreeableness, paired with warm assistant responses, to enhance LLM safety against jailbreaks and harmful outputs while preserving conversational warmth.

Recent studies indicate that fine-tuning large language models (LLMs) for social warmth can inadvertently compromise their factual accuracy and increase their susceptibility to adversarial attacks. This research explores whether this vulnerability is an inherent trade-off or a consequence of current data construction methods. The authors introduce a novel approach involving a persona-driven rewriting pipeline. This pipeline conditions user prompts to reflect a "low agreeableness" persona, which is then combined with warm, de-escalating responses from the assistant. Experiments across multiple models demonstrate that this technique significantly reduces the likelihood of jailbreaks and the generation of harmful content, all while maintaining the desired conversational warmth. The findings suggest that safer empathetic fine-tuning is achievable through innovative data design, eliminating the need for explicit safety labels, harm detectors, or modifications to the core training objective. This work provides evidence that the geometric alignment between warmth and compliance in the model's latent space can be reduced through this conditioning.

Why it matters

Professionals developing or deploying LLMs need robust methods to ensure model safety and prevent harmful outputs without sacrificing desirable traits like warmth. This research offers a data-centric solution to a critical problem.

How to implement this in your domain

  1. 1Develop a persona-driven data rewriting pipeline to modify user prompts for LLM fine-tuning.
  2. 2Integrate "low agreeableness" conditioning into user input generation for safety training.
  3. 3Design assistant responses to be warm and de-escalating when paired with conditioned user inputs.
  4. 4Evaluate LLM safety metrics (jailbreak susceptibility, harmful output rates) after applying this fine-tuning method.

Who benefits

AI DevelopmentCybersecurityContent ModerationCustomer Service AI

Key takeaways

  • Warmth fine-tuning in LLMs can inadvertently increase vulnerability to jailbreaks.
  • A new method uses "low-agreeableness" persona conditioning to enhance LLM safety.
  • This approach reduces harmful outputs and jailbreak susceptibility while maintaining conversational warmth.
  • Safer empathetic LLM fine-tuning can be achieved through data design alone, without extra safety labels.

Original post by Austin MY Cheung, Yi Yang

"arXiv:2606.27709v1 Announce Type: cross Abstract: Recent work has shown that fine-tuning large language models (LLMs) for social warmth degrades factual reliability and increases sycophancy. We investigate a related but distinct failure mode: warmth fine-tuning also weakens adver…"

View on X

Originally posted by Austin MY Cheung, Yi Yang on X · view source

Want to go deeper?

Turn these trends into skills with Learnijoy's hands-on AI & tech courses.

Explore courses