Large-Scale Autoregressive Pretraining Enables Controllable Catalyst Inverse Design

Dong Hyeon Mok, Jonggeol Na, Seoin Back · June 17, 2026

Summary

This paper introduces a conditional catalyst generative model based on a GPT architecture with a numerical embedding layer, which lets it generate catalyst structures conditioned on both categorical and continuous properties. Pretrained on 133 million structures and fine-tuned on approximately 460,000, the model reaches 98% structural validity and improves screening efficiency for reaction-targeted catalyst discovery by 1.5 to 4 times.

Designing heterogeneous catalysts remains difficult because their structures are complex and the chemical space is vast, which conventional screening struggles to explore efficiently. Machine learning has accelerated catalyst discovery, but its efficiency diminishes as the search space expands, which motivates generative models that directly construct catalysts with desired properties. This research presents a conditional catalyst generative model built on a Generative Pretrained Transformer (GPT) architecture augmented with a numerical embedding layer. This design lets the model generate catalyst structures conditioned on both categorical and continuous properties within a unified autoregressive framework. The model was pretrained on 133 million catalyst structures and then fine-tuned on approximately 460,000 optimized structures with associated properties. It reaches 98% structural validity and 95% optimization validity, with strong categorical condition fidelity. For binding energy conditioning, the model achieved a four-fold improvement over baseline distributions, which translated into a 1.5 to 4-fold increase in screening efficiency for targeted catalyst discovery without further fine-tuning. The authors present this as a practical pathway towards controllable catalyst generation and accelerated discovery.
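The idea of feeding categorical and continuous conditions into one autoregressive sequence can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary, the `<num>` placeholder token, the embedding width, the random weights, and the example binding energy of -0.67 are all hypothetical, and the numerical embedding is assumed here to be a simple linear projection of the scalar added to the placeholder's embedding.

```python
import random

D_MODEL = 8  # hypothetical embedding width

# Hypothetical vocabulary mixing condition tokens and structure tokens.
VOCAB = {"<bos>": 0, "<adsorbate:CO>": 1, "<num>": 2, "Pt": 3, "Cu": 4}

rng = random.Random(0)
# Lookup table for categorical tokens: one vector per vocabulary entry.
token_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in VOCAB]
# Assumed numerical embedding layer: a linear map from a scalar to D_MODEL values.
num_weight = [rng.gauss(0, 0.02) for _ in range(D_MODEL)]
num_bias = [0.0] * D_MODEL


def embed_token(name):
    """Look up the embedding of a categorical token."""
    return list(token_table[VOCAB[name]])


def embed_number(value):
    """Project a scalar and add it to the <num> placeholder embedding."""
    base = token_table[VOCAB["<num>"]]
    return [b + w * value + c for b, w, c in zip(base, num_weight, num_bias)]


def embed_sequence(items):
    """Embed a mixed list of string tokens and float condition values."""
    return [
        embed_number(x) if isinstance(x, float) else embed_token(x)
        for x in items
    ]


# Hypothetical prompt: an adsorbate tag, a target binding energy, then structure tokens.
prompt = ["<bos>", "<adsorbate:CO>", -0.67, "Pt", "Cu"]
embedded = embed_sequence(prompt)
print(len(embedded), len(embedded[0]))
```

Every position ends up as a vector of the same width, so a single transformer stack can attend over categorical and continuous conditions alike, which is the property the summary attributes to the numerical embedding layer.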

Why it matters

For professionals in materials science, chemistry, and manufacturing, this work offers an AI tool to accelerate the discovery and design of new catalysts. It could reduce the time and cost of experimental screening, leading to faster innovation in areas like sustainable energy, chemical production, and pharmaceuticals.

How to implement this in your domain

  1. Explore integrating this generative AI approach into catalyst R&D pipelines to accelerate material discovery.
  2. Utilize the model's conditional generation capabilities to design catalysts with specific target properties for industrial applications.
  3. Assess the potential for reducing experimental screening costs and time by leveraging AI-driven inverse design.
  4. Collaborate with AI researchers to adapt and fine-tune similar models for proprietary material design challenges.
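One way to approach step 3 is to compare hit rates between a baseline candidate pool and a conditionally generated pool. The sketch below assumes that "screening efficiency" means the fraction of candidates whose binding energy lands within a tolerance of the target; the source does not define the metric, so this definition, the function names, and all numbers are illustrative assumptions rather than results from the paper.

```python
def hit_rate(energies, target, tolerance):
    """Fraction of candidates whose binding energy is within tolerance of target."""
    if not energies:
        raise ValueError("energies must not be empty")
    hits = sum(1 for e in energies if abs(e - target) <= tolerance)
    return hits / len(energies)


def screening_gain(generated, baseline, target, tolerance):
    """Ratio of the generated pool's hit rate to the baseline pool's hit rate."""
    base = hit_rate(baseline, target, tolerance)
    if base == 0:
        raise ValueError("baseline has no hits; the gain is undefined")
    return hit_rate(generated, target, tolerance) / base


# Made-up binding energies for illustration only.
baseline = [-1.2, -0.9, -0.7, -0.3, 0.1, 0.4, -1.5, -0.65]
generated = [-0.7, -0.62, -0.8, -0.3, -0.66, -0.71, -1.0, -0.6]

gain = screening_gain(generated, baseline, target=-0.67, tolerance=0.1)
print(f"screening gain: {gain:.2f}x")
```

With these made-up numbers, 5 of 8 generated candidates and 2 of 8 baseline candidates fall in the window, giving a ratio of 2.5. A ratio above 1 means fewer candidates need to be evaluated per hit, which is the sense in which a generative model could cut screening cost.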

Who benefits

Chemicals, Pharmaceuticals, Energy, Materials Science, Manufacturing

Key takeaways

  • A new GPT-based model enables controllable inverse design of catalysts.
  • It generates structures conditioned on both categorical and continuous properties.
  • Large-scale pretraining significantly improves structural validity and property matching.
  • The model accelerates catalyst discovery and improves screening efficiency.

Original post by Dong Hyeon Mok, Jonggeol Na, Seoin Back

"arXiv:2606.17445v1 Announce Type: new Abstract: Inverse design of heterogeneous catalysts remains challenging because catalyst surfaces exhibit substantial structural complexity with coupled surface-adsorbate interactions across a vast chemical space that is difficult to explore…"
