New Method Enhances LLM Self-Improvement with Procedural Memory

Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty, Semih Yavuz · July 3, 2026

Summary

Procedural Memory Distillation (PMD) allows language models to convert cross-episode signals into reusable procedural memory, which is then distilled into the policy's weights during training. This online reflection mechanism significantly improves performance on complex tasks by enabling the model to internalize strategies and lessons.

Current reinforcement learning methods for language models, such as RLVR and SDPO, score each rollout with a verifier and update the policy from that episode-level signal alone, discarding the richer procedural information the rollouts contain. This research introduces Procedural Memory Distillation (PMD), which retains cross-episode information, converts it into a structured procedural memory, and distills that memory directly into the model's weights during training.

The memory is organized at three levels: raw trajectories, self-reflected strategies, and recurring behavioral patterns, all extracted online from the model's own interactions. A memory-conditioned self-teacher uses this accumulated experience to supervise the student model, which progressively internalizes the procedural knowledge. The design is one of "co-evolution": the policy updates the memory, and the memory shapes the policy updates.

The authors report that PMD outperforms SDPO on SCIKNOWEVAL and LIVECODEBENCH, with gains of 3.8-5.5% and 7.9-13.6% respectively. Co-evolution is crucial: freezing either the memory or the policy leads to substantial performance drops. Because the knowledge is absorbed into the policy itself, the resulting model needs no memory at inference.
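The three-level memory and the co-evolution loop described above could be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the names ProceduralMemory, co_evolution_step, reflect, and verifier are hypothetical, and the rule that promotes a strategy to a pattern once it recurs is an assumption made only to keep the example concrete. The distillation loss itself is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ProceduralMemory:
    """Hypothetical three-level memory: trajectories, strategies, patterns."""
    trajectories: List[str] = field(default_factory=list)  # level 1: raw rollouts
    strategies: List[str] = field(default_factory=list)    # level 2: self-reflections
    patterns: List[str] = field(default_factory=list)      # level 3: recurring behavior

    def update(self, rollout: str, reward: float,
               reflect: Callable[[str, float], str],
               min_repeats: int = 2) -> None:
        self.trajectories.append(rollout)
        strategy = reflect(rollout, reward)
        self.strategies.append(strategy)
        # Assumed promotion rule: a strategy seen min_repeats times is a pattern.
        if (self.strategies.count(strategy) >= min_repeats
                and strategy not in self.patterns):
            self.patterns.append(strategy)

    def as_context(self, max_items: int = 3) -> str:
        lines = (["Patterns:"] + self.patterns[-max_items:]
                 + ["Strategies:"] + self.strategies[-max_items:])
        return "\n".join(lines)


def student_prompt(task: str) -> str:
    # The student never sees the memory, so inference stays memory-free.
    return "Task: " + task


def teacher_prompt(task: str, memory: ProceduralMemory) -> str:
    # The self-teacher is the same policy, conditioned on accumulated memory.
    return memory.as_context() + "\n\n" + student_prompt(task)


def co_evolution_step(task: str,
                      policy: Callable[[str], str],
                      verifier: Callable[[str, str], float],
                      reflect: Callable[[str, float], str],
                      memory: ProceduralMemory) -> Dict[str, str]:
    rollout = policy(student_prompt(task))        # policy acts
    reward = verifier(task, rollout)              # episode-level signal
    memory.update(rollout, reward, reflect)       # policy updates memory
    target = policy(teacher_prompt(task, memory)) # memory shapes supervision
    # This (input, target) pair would feed a distillation update of the policy.
    return {"input": student_prompt(task), "target": target}
```

In a real pipeline, policy would be the language model being trained and each returned pair would drive a weight update, so that the next rollout (and therefore the next memory update) comes from the improved policy.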

Why it matters

For professionals developing or deploying AI, this method offers a path to more robust and efficient self-improving language models, potentially reducing the need for constant human supervision and improving performance on complex, multi-step tasks.

How to implement this in your domain

1. Investigate integrating PMD principles into existing reinforcement learning pipelines for LLM fine-tuning.
2. Experiment with different memory abstraction levels for specific domain tasks to optimize knowledge retention.
3. Evaluate the performance gains of PMD-trained models against current state-of-the-art methods on internal benchmarks.
4. Consider developing tools to visualize and analyze the procedural memory generated by PMD to gain insights into model learning.
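For the evaluation step, a small helper along these lines may be useful. It is a sketch under stated assumptions: the function names are hypothetical, and each benchmark item is assumed to yield a simple pass/fail outcome. Because the summary above does not say whether the reported gains are absolute or relative, the helper returns both so the two are not confused.

```python
from typing import Dict, Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Accuracy in percent over per-item pass/fail results."""
    if not correct:
        raise ValueError("need at least one result")
    return 100.0 * sum(correct) / len(correct)


def compare(baseline: Sequence[bool], candidate: Sequence[bool]) -> Dict[str, float]:
    """Compare two models evaluated on the same benchmark items."""
    if len(baseline) != len(candidate):
        raise ValueError("both models must be scored on the same items")
    base, cand = accuracy(baseline), accuracy(candidate)
    if base == 0:
        raise ValueError("relative gain is undefined for a 0% baseline")
    return {
        "absolute_points": cand - base,              # percentage points
        "relative_percent": 100.0 * (cand - base) / base,
    }
```

For example, a baseline that passes 1 of 4 items and a candidate that passes 2 of 4 differ by 25 percentage points, which is a 100% relative gain.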

Who benefits

Software Development, AI/ML Engineering, Robotics, Customer Service Automation, EdTech

Key takeaways

Procedural Memory Distillation (PMD) enhances LLM self-improvement by leveraging cross-episode information.
PMD converts rich procedural data into reusable memory, which is distilled into the model's weights.
The co-evolution of policy and memory is critical for significant performance gains.
This approach results in a memory-free model at inference, having internalized complex strategies.

Original post by Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty, Semih Yavuz

"arXiv:2607.01480v1 Announce Type: new Abstract: Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer pr…"
