Active-GRPO Boosts Molecular Optimization with Adaptive Learning

Xuefeng Liu, Mingxuan Cao, Qinan Huang, Thomas Brettin, Rick Stevens, Le Cong · July 2, 2026

Summary

Active-GRPO introduces an adaptive imitation and self-improving reasoning paradigm for molecular optimization, allowing policies to dynamically switch between imitating references and reinforcing their own discoveries, significantly improving performance over prior methods.

Training large language models for scientific reasoning, particularly instruction-based molecular optimization, faces two challenges: supervised fine-tuning tends to collapse multi-step reasoning, and reinforcement learning suffers from sparse feedback. Reference-guided Policy Optimization (RePO) methods mitigate this by using dataset-provided references, but their effectiveness is capped by the quality of those references.

This research proposes Active-GRPO, an active reasoning paradigm designed to overcome that limitation. Active-GRPO lets the policy decide, instance by instance, whether to imitate a reference or to reinforce its own novel discoveries. It does so through two coupled mechanisms: active imitate-reinforce and active referencing. The former performs imitation learning while the reference is superior, then shifts to self-improvement via reinforcement learning once the policy generates better candidates. The latter continuously upgrades the reference itself with the best policy-generated molecules, so that guidance remains informative and the performance target rises throughout training.

The authors report substantial gains on molecular optimization metrics over existing methods such as GRPO and RePO.
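The two mechanisms described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Candidate` type, the scalar `reward`, the function names, and the tie-breaking rule are all assumptions made for illustration, and the actual loss terms used for imitation and reinforcement are not specified in this summary.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    smiles: str    # molecule written as a SMILES string
    reward: float  # hypothetical scalar score, higher is better


def choose_mode(reference: Candidate, samples: List[Candidate]) -> str:
    """Active imitate-reinforce (sketch): imitate while the reference beats
    every sampled candidate, otherwise reinforce the policy's own samples.
    Ties go to "reinforce"; that choice is an assumption."""
    best = max(samples, key=lambda c: c.reward)
    return "imitate" if reference.reward > best.reward else "reinforce"


def upgrade_reference(reference: Candidate, samples: List[Candidate]) -> Candidate:
    """Active referencing (sketch): keep the better of the current reference
    and the best policy-generated candidate."""
    best = max(samples, key=lambda c: c.reward)
    return best if best.reward > reference.reward else reference


def active_step(reference: Candidate, samples: List[Candidate]) -> Tuple[str, Candidate]:
    """Pick the training mode for this instance, then refresh its reference."""
    mode = choose_mode(reference, samples)
    return mode, upgrade_reference(reference, samples)
```

Calling `active_step` once per instruction per training step returns the mode to train with and the reference to carry into the next step, so the imitation target can only stay the same or improve.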

Why it matters

AI researchers and drug discovery professionals can leverage Active-GRPO to develop more robust and efficient AI systems for molecular design, accelerating the discovery of novel compounds with desired properties.

How to implement this in your domain

  1. Integrate Active-GRPO's adaptive learning mechanisms into your AI models for molecular optimization tasks.
  2. Experiment with the active imitate-reinforce strategy to dynamically balance exploration and exploitation in your generative models.
  3. Implement active referencing to continuously improve the quality of guidance provided to your AI systems during training.
  4. Apply Active-GRPO to specific molecular design challenges, such as optimizing drug candidates for specific properties.
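As a rough starting point for the steps above, the following sketch plans one training batch: for each task it selects imitation or reinforcement, computes GRPO-style group-relative advantages when reinforcing, and raises the stored reference score. The function names, the dictionary layout, and the use of population standard deviation are assumptions for illustration; the paper's actual objective and hyperparameters are not given in this summary.

```python
import statistics
from typing import Dict, List, Optional


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group.
    Population standard deviation is an assumption; implementations vary."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def plan_batch(
    ref_rewards: Dict[str, float],
    group_rewards: Dict[str, List[float]],
) -> Dict[str, Dict[str, Optional[object]]]:
    """For each task, choose a mode and update the reference score in place.

    ref_rewards   maps a hypothetical task id to its current reference reward.
    group_rewards maps the same task id to rewards of sampled candidates.
    """
    plan: Dict[str, Dict[str, Optional[object]]] = {}
    for task_id, rewards in group_rewards.items():
        best = max(rewards)
        ref = ref_rewards[task_id]
        if ref > best:
            # Reference is still better: train by imitation on this task.
            plan[task_id] = {"mode": "imitate", "advantages": None}
        else:
            # Policy matched or beat the reference: reinforce its own samples.
            plan[task_id] = {
                "mode": "reinforce",
                "advantages": group_relative_advantages(rewards),
            }
        # Active referencing: the target never gets worse.
        ref_rewards[task_id] = max(ref, best)
    return plan
```

In a real pipeline the rewards would come from a property oracle or scoring function for the molecular task, and the plan would feed the corresponding loss for each task.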

Who benefits

Pharmaceuticals, Biotechnology, Materials Science, Chemical Engineering

Key takeaways

  • Active-GRPO improves molecular optimization by adaptively combining imitation and self-improvement.
  • It dynamically switches between learning from references and reinforcing novel discoveries.
  • Active referencing continuously upgrades the imitation target, preventing performance plateaus.
  • The method significantly outperforms prior reference-guided policy optimization techniques.

Original post by Xuefeng Liu, Mingxuan Cao, Qinan Huang, Thomas Brettin, Rick Stevens, Le Cong

"arXiv:2607.00531v1 Announce Type: new Abstract: Scientific reasoning is an increasingly important capability of large language models, yet improving the robustness and efficiency of training such reasoning remains a key open challenge. We study this problem in instruction-based m…"

