KernelPro Optimizes GPU Kernels with LLMs and Micro-Profiling

Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang, Vihang Patil, Haoyang Fang, Bernie Wang, Huzefa Rangwala, George Karypis · June 26, 2026

Summary

KernelPro is a closed-loop multi-agent system that automates GPU kernel optimization by integrating LLM code generation with hardware profiler feedback and pluggable bottleneck detection tools. It achieves state-of-the-art speedups, outperforms hand-tuned kernels, and is the first system of its kind to optimize for energy efficiency.

Optimizing GPU kernel code is a specialized, time-consuming task that typically requires expert human intervention. KernelPro is a closed-loop multi-agent system designed to automate it by combining the code generation capabilities of large language models (LLMs) with detailed hardware profiler feedback and specialized bottleneck detection tools.

The design has three main parts. A semantic feedback operator translates raw hardware metrics into actionable natural-language guidance, acting as a surrogate for an expert. A two-stage tool invocation architecture uses a roofline-based classifier to select which specialized profiling tools (such as `ncu`, SASS, and `nsys`) to deploy for kernel-, instruction-, or system-level analysis. A domain-adapted Monte Carlo Tree Search (MCTS) algorithm with progressive widening and other enhancements guides the code optimization process, which includes direct CuTe source-level code generation.

KernelPro achieved state-of-the-art performance on KernelBench, with geometric mean speedups of up to 5.30x on challenging optimization levels. It also surpassed expert-optimized MoE training kernels from VeOmni by 1.23x, generating a raw-CUDA+CuTe Hopper WGMMA kernel from scratch. Beyond speed, KernelPro is the first system of its kind to optimize for energy efficiency, achieving an 11.6% measured energy reduction at matched speed. These results highlight the contribution of its micro-profiling tools, MCTS search, and proactive tool orchestration.
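To make the roofline-based tool selection more concrete, here is a minimal Python sketch of the general idea. It is not KernelPro's implementation: the `KernelStats` fields, the 10% latency cutoff, the `TOOLS_BY_CLASS` mapping, and the peak hardware numbers in the usage note are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class KernelStats:
    flops: float        # floating-point operations executed by the kernel
    bytes_moved: float  # bytes transferred to and from DRAM
    runtime_s: float    # measured kernel runtime in seconds


def classify_roofline(stats, peak_flops, peak_bandwidth):
    """Label a kernel by comparing its arithmetic intensity to the ridge point."""
    intensity = stats.flops / stats.bytes_moved   # FLOP per byte
    ridge = peak_flops / peak_bandwidth           # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    achieved = stats.flops / stats.runtime_s
    # Assumed cutoff: far below the roofline suggests latency or launch overhead.
    if achieved < 0.1 * attainable:
        return "latency_bound"
    return "memory_bound" if intensity < ridge else "compute_bound"


# Hypothetical mapping from bottleneck class to the profiling level to run next.
TOOLS_BY_CLASS = {
    "memory_bound": ["ncu"],           # kernel-level metrics
    "compute_bound": ["ncu", "sass"],  # add instruction-level analysis
    "latency_bound": ["nsys"],         # system-level timeline
}


def select_tools(stats, peak_flops, peak_bandwidth):
    """Second stage: pick tools based on the first-stage classification."""
    return TOOLS_BY_CLASS[classify_roofline(stats, peak_flops, peak_bandwidth)]
```

With made-up peaks of `1e13` FLOP/s and `1e12` bytes/s, a kernel with `KernelStats(flops=1e9, bytes_moved=1e9, runtime_s=2e-3)` sits left of the ridge point and is routed to kernel-level profiling. Classifying first keeps the expensive instruction-level and system-level tools for the cases that need them.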

Why it matters

For professionals in high-performance computing, AI infrastructure, and deep learning, KernelPro offers an automated approach to GPU kernel optimization. It takes over a complex task, delivers measurable speedups and energy savings, and reduces the need for highly specialized manual tuning, which can shorten AI development and deployment cycles.

How to implement this in your domain

1. Investigate integrating KernelPro's methodology into existing GPU kernel development workflows.
2. Explore using LLMs in conjunction with hardware profilers for automated code optimization in other domains.
3. Develop custom micro-profiling tools to translate specific hardware metrics into actionable feedback for LLM agents.
4. Evaluate the potential energy savings and performance gains for critical GPU-accelerated applications.
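As a starting point for the third step, the following Python sketch shows one simple way to turn profiler metrics into natural-language hints for an LLM agent. It is a rule-based stand-in, not KernelPro's semantic feedback operator: the metric keys (`dram_throughput_pct`, `achieved_occupancy_pct`, `shared_bank_conflicts`) and the thresholds are hypothetical names and values, not actual `ncu` metric identifiers.

```python
def metrics_to_feedback(metrics):
    """Translate a dict of profiler metrics into bullet-point guidance text."""
    hints = []
    # All keys and thresholds below are illustrative assumptions.
    if metrics.get("dram_throughput_pct", 0.0) > 80.0:
        hints.append(
            "DRAM bandwidth is near saturation; reduce global memory traffic, "
            "for example by tiling into shared memory or fusing kernels."
        )
    if metrics.get("achieved_occupancy_pct", 100.0) < 50.0:
        hints.append(
            "Occupancy is low; check register and shared memory usage per block."
        )
    if metrics.get("shared_bank_conflicts", 0) > 0:
        hints.append(
            "Shared memory bank conflicts detected; consider padding or swizzling "
            "the shared memory layout."
        )
    if not hints:
        hints.append("No rule fired; collect finer-grained metrics before editing.")
    return "\n".join(f"- {hint}" for hint in hints)


if __name__ == "__main__":
    sample = {"dram_throughput_pct": 92.0, "achieved_occupancy_pct": 35.0}
    print(metrics_to_feedback(sample))
```

The returned text can be appended to the prompt for the next code generation round. Keeping the rules in plain data-driven checks makes it easy to add domain-specific metrics later.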

Who benefits

AI/ML Infrastructure, High-Performance Computing, Cloud Computing, Gaming, Scientific Research

Key takeaways

KernelPro automates GPU kernel optimization using LLMs and hardware profiler feedback.
It employs semantic feedback and a two-stage tool invocation architecture for bottleneck detection.
The system achieves state-of-the-art speedups, outperforming expert-tuned kernels.
KernelPro is the first system of its kind to optimize for energy efficiency, with an 11.6% measured energy reduction at matched speed.
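For readers who want to compute the same kinds of figures on their own kernels, here is a small Python sketch of a geometric mean speedup and a percentage energy reduction. The function names and the sample numbers are illustrative assumptions and are not taken from the paper's data.

```python
import math


def geometric_mean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-kernel speedups (baseline time / optimized time)."""
    ratios = [b / o for b, o in zip(baseline_times, optimized_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def energy_reduction_pct(baseline_joules, optimized_joules):
    """Percentage of energy saved relative to the baseline measurement."""
    return 100.0 * (baseline_joules - optimized_joules) / baseline_joules


if __name__ == "__main__":
    # Made-up timings in seconds for two kernels: speedups of 4x and 1x.
    print(geometric_mean_speedup([4.0, 1.0], [1.0, 1.0]))
    # Made-up energy readings in joules.
    print(energy_reduction_pct(200.0, 150.0))
```

The geometric mean is the usual choice for averaging speedup ratios because a single very large speedup does not dominate the result the way it would in an arithmetic mean.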

Original post by Jiading Gai, Shuai Zhang, Kaj Bostrom, Jin Huang, Vihang Patil, Haoyang Fang, Bernie Wang, Huzefa Rangwala, George Karypis

"arXiv:2606.26453v1 Announce Type: new Abstract: We present KernelPro, a closed-loop multi-agent system that automatically generates, profiles, and iteratively optimizes GPU kernel code by integrating large language model (LLM) code generation with hardware profiler feedback and p…"
