Hawk Boosts NPU Kernel Generation with Hardware-Aware Knowledge

Junyi Wen, Ruiyan Zhuang, Yongjia Xu, Pengtu Li, Rui Zou, Hongyi Chen, Chingman Wan, Puxu Yang, Wuhui Chen, Yanlin Wang · July 3, 2026

Summary

Hawk is a training-free framework for generating high-performance kernels for Neural Processing Units (NPUs). It compensates for the lack of hardware-specific priors in LLMs by synthesizing knowledge from runtime feedback, retrieving it in a bottleneck-aware way, and distilling it through LLM-driven semantic arbitration.

Developing high-performance kernels for Neural Processing Units (NPUs) is a major industry bottleneck: developers must manually navigate implicit hardware constraints and strict memory hierarchies. Large Language Models (LLMs) could automate this work, but they lack hardware-specific knowledge of NPUs, so the code they generate often crashes at runtime or performs poorly even when it compiles.

Hawk, a training-free framework, addresses this by integrating hardware-aware knowledge through three core modules. First, a Run-Time Knowledge Synthesis Module uses a Triple-Part Executable Knowledge Representation to link error contexts with executable semantics. Second, a Bottleneck-Aware Knowledge Retrieval Module employs a 2D-Retrieval paradigm, projecting queries into both syntactic and hardware-aligned semantic spaces. Finally, an Effect-Driven Knowledge Distillation Module uses LLM-driven semantic arbitration to refine the knowledge base continuously, pruning erroneous entries and consolidating redundant ones based on empirical execution feedback.

Evaluations on real-world NPU workloads show that Hawk raises generation accuracy from 49.4% to 80.0% and achieves up to a 2.2x execution speedup over current baselines.
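To make the knowledge-refinement idea more concrete, here is a minimal Python sketch of a knowledge entry that is pruned or merged based on execution feedback. Everything in it is an assumption for illustration: the summary does not say what the three parts of the knowledge representation are, so the fields `error_context`, `explanation`, and `executable_fix` are guesses, and the names `KnowledgeEntry` and `distill` are hypothetical. Simple success counters and exact-match merging stand in for the LLM-driven semantic arbitration that Hawk is described as using.

```python
from dataclasses import dataclass, replace


@dataclass
class KnowledgeEntry:
    """Hypothetical knowledge entry; field names are illustrative guesses."""
    error_context: str    # the runtime error that was observed
    explanation: str      # assumed third part: why the error occurs
    executable_fix: str   # code change that resolved the error
    successes: int = 0    # kernels that ran correctly after using this entry
    failures: int = 0     # kernels that still failed after using this entry

    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0


def distill(entries, min_trials=3, min_rate=0.5):
    """Prune entries that keep failing and merge exact duplicates.

    Stand-in for semantic arbitration: an entry is dropped once it has
    at least `min_trials` recorded uses and a success rate below
    `min_rate`; entries with the same error context and fix are merged.
    """
    merged = {}
    for entry in entries:
        trials = entry.successes + entry.failures
        if trials >= min_trials and entry.success_rate() < min_rate:
            continue  # pruned: empirical feedback says this entry misleads
        key = (entry.error_context, entry.executable_fix)
        if key in merged:
            merged[key].successes += entry.successes
            merged[key].failures += entry.failures
        else:
            merged[key] = replace(entry)  # copy so inputs are not mutated
    return list(merged.values())
```

The thresholds `min_trials` and `min_rate` are arbitrary defaults. A closer approximation of the described method would replace the exact-match key with a model that judges whether two entries say the same thing.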

Why it matters

Reliable automated kernel generation would shorten the development and optimization cycle for AI applications on specialized hardware, so neural networks could be deployed sooner and run more efficiently on NPUs.

How to implement this in your domain

  1. Evaluate current NPU kernel development workflows for efficiency and performance bottlenecks.
  2. Explore integrating hardware-aware code generation frameworks like Hawk into your toolchain.
  3. Develop internal knowledge bases that couple error contexts with executable semantics for NPU programming.
  4. Implement 2D-retrieval systems to access both syntactic and hardware-specific semantic information.
  5. Pilot Hawk-like approaches for optimizing specific NPU workloads to measure performance gains.
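The retrieval step in the list above can be prototyped in a few lines. The sketch below is one possible reading of "2D retrieval", not Hawk's implementation: it scores each knowledge entry along two axes and blends them with a weight `alpha`. Token overlap stands in for the syntactic space and overlap of hand-assigned bottleneck tags stands in for the hardware-aligned semantic space. The function names, the `"text"` and `"tags"` keys, and the tag vocabulary are all hypothetical.

```python
import re


def tokens(text):
    """Lowercased word tokens; a crude proxy for syntactic features."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_2d(query_text, query_tags, entries, alpha=0.5, top_k=2):
    """Rank entries by a weighted mix of two similarity scores.

    entries: list of dicts with "text" (str) and "tags" (set of str).
    alpha weights the syntactic score; (1 - alpha) weights the
    hardware-tag score. Both the weighting and the tags are assumptions.
    """
    q_tokens = tokens(query_text)
    q_tags = set(query_tags)
    scored = []
    for entry in entries:
        syntactic = jaccard(q_tokens, tokens(entry["text"]))
        hardware = jaccard(q_tags, set(entry["tags"]))
        scored.append((alpha * syntactic + (1 - alpha) * hardware, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:top_k]]


knowledge_base = [
    {"text": "matmul tile exceeds on-chip buffer", "tags": {"memory_bound"}},
    {"text": "matmul result mismatch after cast", "tags": {"precision"}},
    {"text": "vector add slow due to unaligned load",
     "tags": {"memory_bound", "alignment"}},
]
hits = retrieve_2d("matmul kernel crashes, tile too large",
                   {"memory_bound"}, knowledge_base)
```

With these toy entries, the query ranks the buffer-overflow entry first and the unaligned-load entry second. The second hit shares no words with the query and is found only through its bottleneck tag, which is the case a purely syntactic search would miss.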

Who benefits

Semiconductor, AI Hardware, Automotive, Edge AI, Cloud Computing

Key takeaways

  • Hawk is a training-free framework for high-performance NPU kernel generation.
  • It addresses LLM limitations by incorporating hardware-aware knowledge.
  • The framework uses runtime knowledge synthesis, bottleneck-aware retrieval, and effect-driven distillation.
  • On real-world NPU workloads, Hawk raises generation accuracy from 49.4% to 80.0% and achieves up to a 2.2x execution speedup over current baselines.

Original post by Junyi Wen, Ruiyan Zhuang, Yongjia Xu, Pengtu Li, Rui Zou, Hongyi Chen, Chingman Wan, Puxu Yang, Wuhui Chen, Yanlin Wang

"arXiv:2607.01590v1 Announce Type: new Abstract: Developing high-performance kernels for Neural Processing Units (NPUs) is a critical industry bottleneck, requiring developers to manually navigate implicit hardware constraints and strict memory hierarchies. While large language mo…"
