PASE Enhances Cloud Healing with LLM-Generated Recovery Plans

Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang, Xinyue Luo, Tianyu Shen, Zeyu Qiao · July 3, 2026

Summary

A new framework called PASE (Planning-Aware Semantic self-healing engine) improves cloud system reliability by using an LLM to generate structured recovery plans. It employs a neural-symbolic world model for plan verification and a DRL-trained meta-prompt optimizer for adaptive guidance.

Keeping large-scale cloud-based AI systems reliable is hard, particularly when it comes to detecting faults quickly and recovering from them adaptively. Existing approaches often pair Large Language Models (LLMs) for semantic understanding with Deep Reinforcement Learning (DRL) for policy optimization, but their sequential architectures can limit them. PASE instead treats fault self-healing as a neuro-symbolic program synthesis task. An LLM acts as the central Plan Synthesis Engine, composing structured recovery plans from a library of semantic primitives. A Neural-Symbolic World Model then simulates each plan to verify that it is feasible. A Meta-Prompt Optimizer, trained with DRL, learns to generate the prompts that guide the LLM's planning, so recovery strategies can be dynamic and context-aware rather than limited to predefined actions. According to the authors, experiments show that PASE significantly reduces recovery time and improves fault detection accuracy in unknown scenarios.
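To make the plan-then-verify loop concrete, here is a minimal sketch, not the authors' implementation. It assumes a tiny hypothetical primitive library (`drain_traffic`, `restart_service`, `restore_traffic`), a two-field system state, and a canned `stub_planner` standing in for the LLM call. The verifier here is purely symbolic (precondition and effect checks), whereas the paper describes a neural-symbolic world model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical system state: a flat dict of facts about one service.
State = dict


@dataclass(frozen=True)
class Primitive:
    """One semantic primitive: a named action with a precondition and an effect."""
    name: str
    precondition: Callable[[State], bool]
    effect: Callable[[State], State]


# Illustrative primitive library; names and semantics are assumptions.
PRIMITIVES = {
    p.name: p
    for p in [
        Primitive("drain_traffic",
                  lambda s: s["serving"],
                  lambda s: {**s, "serving": False}),
        Primitive("restart_service",
                  lambda s: not s["serving"],
                  lambda s: {**s, "healthy": True}),
        Primitive("restore_traffic",
                  lambda s: s["healthy"] and not s["serving"],
                  lambda s: {**s, "serving": True}),
    ]
}


def simulate(plan, state):
    """Symbolically execute a plan; return (feasible, final_state, reason)."""
    for step in plan:
        prim = PRIMITIVES.get(step)
        if prim is None:
            return False, state, f"unknown primitive: {step}"
        if not prim.precondition(state):
            return False, state, f"precondition failed: {step}"
        state = prim.effect(state)
    return True, state, "ok"


def goal_reached(state):
    return state["healthy"] and state["serving"]


def stub_planner(prompt, attempt):
    """Stand-in for the LLM: returns canned candidate plans, ignoring the prompt."""
    candidates = [
        ["restart_service", "restore_traffic"],  # infeasible while still serving
        ["drain_traffic", "restart_service", "restore_traffic"],
    ]
    return candidates[min(attempt, len(candidates) - 1)]


def synthesize_verified_plan(state, max_attempts=3):
    """Ask the planner for plans until one passes simulation and reaches the goal."""
    prompt = "Service unhealthy; propose a recovery plan."
    for attempt in range(max_attempts):
        plan = stub_planner(prompt, attempt)
        feasible, final_state, reason = simulate(plan, state)
        if feasible and goal_reached(final_state):
            return plan
        # Feed the rejection reason back, as a real planner prompt might.
        prompt += f" Previous plan rejected ({reason})."
    return None


faulty = {"healthy": False, "serving": True}
plan = synthesize_verified_plan(faulty)
print(plan)
```

In this sketch the first candidate is rejected because `restart_service` requires traffic to be drained first, and the second candidate passes. The point of the design is that no plan reaches the real system until it has survived simulation.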

Why it matters

This research offers a path to more resilient and autonomous cloud infrastructure, reducing downtime and operational costs for AI-powered services by enabling systems to self-heal more effectively and adaptively.

How to implement this in your domain

  1. Evaluate current cloud fault recovery mechanisms for their adaptability and speed.
  2. Explore integrating LLM-driven plan generation into incident response playbooks.
  3. Develop or adapt neural-symbolic world models for simulating recovery plan feasibility.
  4. Investigate DRL-based meta-prompt optimization to enhance LLM guidance in critical systems.
  5. Pilot PASE-like frameworks in non-production environments to assess performance and safety.
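For a first non-production pilot of prompt optimization, a much simpler stand-in than DRL can show the feedback loop: an epsilon-greedy bandit that picks among a few fixed prompt templates and is rewarded with negative recovery time. This is a swapped-in technique, not the paper's Meta-Prompt Optimizer. The templates, the `MEAN_RECOVERY_SECONDS` figures, and the simulated environment below are all made up for illustration.

```python
import random

# Hypothetical prompt templates and made-up mean recovery times (seconds).
TEMPLATES = [
    "List recovery steps.",
    "Check preconditions, then list recovery steps.",
    "Propose the shortest recovery plan.",
]
MEAN_RECOVERY_SECONDS = [120.0, 45.0, 80.0]


class PromptBandit:
    """Epsilon-greedy bandit over a fixed set of prompt templates."""

    def __init__(self, templates, epsilon=0.1, seed=0):
        self.templates = list(templates)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.templates)
        self.values = [0.0] * len(self.templates)  # running mean reward per arm

    def _greedy(self):
        return max(range(len(self.templates)), key=lambda i: self.values[i])

    def select(self):
        # Try every template once before trusting the estimates.
        for i, count in enumerate(self.counts):
            if count == 0:
                return i
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.templates))
        return self._greedy()

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def best(self):
        return self.templates[self._greedy()]


def simulated_recovery_time(arm, rng):
    """Toy environment: mean recovery time plus bounded noise."""
    return MEAN_RECOVERY_SECONDS[arm] + rng.uniform(-5.0, 5.0)


def run_pilot(episodes=200, seed=0):
    bandit = PromptBandit(TEMPLATES, seed=seed)
    env_rng = random.Random(seed + 1)
    for _ in range(episodes):
        arm = bandit.select()
        # Reward is negative recovery time: faster recovery scores higher.
        bandit.update(arm, -simulated_recovery_time(arm, env_rng))
    return bandit


bandit = run_pilot()
print(bandit.best())
```

In a real pilot, the simulated environment would be replaced by measured recovery times from a staging cluster, and the reward would likely need to penalize unsafe or rejected plans as well as slow ones.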

Who benefits

Cloud Computing, IT Operations, Telecommunications, Cybersecurity, Manufacturing

Key takeaways

  • PASE introduces a neuro-symbolic approach for autonomous cloud fault self-healing.
  • LLMs generate structured recovery plans, verified by a neural-symbolic world model.
  • A DRL-trained optimizer guides the LLM for adaptive, context-aware recovery.
  • The authors report significantly shorter recovery times and better fault detection accuracy in unknown scenarios.

Original post by Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang, Xinyue Luo, Tianyu Shen, Zeyu Qiao

"arXiv:2607.01595v1 Announce Type: new Abstract: As the scale and complexity of cloud-based AI systems continue to escalate, ensuring service reliability through rapid fault detection and adaptive recovery has become a critical challenge. While existing approaches integrate Large…"
