Optimizing 3D Generative Diffusion Models on NVIDIA GPUs

Jeeho Ryoo, Yongchan Jung, Muhammad Ali Khaliq, Weidong Zhang, Jiatong Han, Byeong Kil Lee · June 19, 2026

Summary

This paper provides a comprehensive performance analysis of the Med-DDPM 3D medical diffusion model across NVIDIA GPU architectures, identifying inefficiencies in cuDNN convolution and implicit-GEMM kernels. It demonstrates that architecture-aware optimizations like TF32 Tensor Core activation and a 3D channels-last layout can significantly reduce SM cycles and dynamic instructions without degrading synthesis quality.

3D generative diffusion models have become important for high-fidelity MRI synthesis, but they demand substantial GPU resources: each sample requires hundreds of U-Net evaluations, and the kernels involved behave very differently from one another. This research analyzes the performance of Med-DDPM, a state-of-the-art medical diffusion model, across three generations of NVIDIA GPU architectures. The analysis covers kernel-level runtime breakdowns, instruction mix, memory-system utilization, and warp-level activity. It finds that training time is dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies stemming from memory-access patterns, tensor-layout conversions, and underutilized Tensor Cores.

Based on these findings, the study evaluates two architecture-aware optimizations: TF32 Tensor Core activation and a 3D channels-last layout. These are reported to reduce Streaming Multiprocessor (SM) cycles by up to 100x and dynamic instructions by 100x. Tensor Core utilization reportedly increases as well (from 1.45x to 9.98x), and Instructions Per Cycle (IPC) improves by 7% on A100 GPUs, all without compromising the quality of the synthesized medical images.
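To give a feel for why TF32 can be switched on without visibly hurting output quality, the sketch below emulates TF32 precision in plain Python. TF32 keeps the 8 exponent bits of float32 but only 10 of its 23 mantissa bits, so the relative error per value stays below roughly 2^-10. The helper names `round_to_tf32` and `relative_error` are hypothetical, and simple truncation is assumed for illustration; real hardware rounding may differ, and this is not taken from the paper.

```python
import struct


def round_to_tf32(x: float) -> float:
    """Emulate TF32 precision by zeroing the low 13 mantissa bits of a float32.

    Assumption: truncation is used for simplicity; hardware may round differently.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits &= 0xFFFFE000  # keep sign, 8 exponent bits, top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def relative_error(x: float) -> float:
    """Relative error of the emulated TF32 value against the float32 value."""
    exact = struct.unpack("<f", struct.pack("<f", x))[0]  # x rounded to float32
    if exact == 0.0:
        return 0.0
    return abs(round_to_tf32(x) - exact) / abs(exact)


# 2**-10 is the smallest mantissa step TF32 keeps next to 1.0; 2**-11 is dropped.
print(round_to_tf32(1.0 + 2**-10) == 1.0 + 2**-10)  # True
print(round_to_tf32(1.0 + 2**-11))                  # 1.0
print(relative_error(3.14159) < 2**-10)             # True
```

The accumulation inside Tensor Cores still happens at higher precision, which this sketch does not model; it only shows how much of each input value is retained.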

Why it matters

For professionals working with 3D generative AI, especially in medical imaging, these optimizations mean faster training, reduced computational costs, and more efficient use of expensive GPU resources, accelerating research and deployment.

How to implement this in your domain

  1. Profile 3D generative diffusion model workloads on NVIDIA GPUs to identify performance bottlenecks.
  2. Investigate enabling TF32 Tensor Core activation for compatible models and hardware.
  3. Experiment with 3D channels-last memory layouts to improve memory access patterns.
  4. Optimize cuDNN convolution and implicit-GEMM kernel usage within diffusion model implementations.
  5. Collaborate with hardware vendors to leverage architecture-specific features for performance gains.
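The first three steps above can be sketched as follows, assuming a PyTorch implementation. The two `allow_tf32` flags and `torch.channels_last_3d` are existing PyTorch settings, but the small `block` module is a hypothetical stand-in and not the Med-DDPM architecture, and the tensor sizes are arbitrary. The snippet falls back to the CPU when no GPU is present, in which case the TF32 flags have no effect.

```python
import torch
import torch.nn as nn

# Step 2: allow TF32 Tensor Core math for float32 matmuls and cuDNN convolutions.
# These flags only matter on GPUs with TF32 support (Ampere or newer).
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-in for one 3D U-Net block (assumption, not Med-DDPM itself).
block = nn.Sequential(
    nn.Conv3d(4, 16, kernel_size=3, padding=1),
    nn.SiLU(),
    nn.Conv3d(16, 4, kernel_size=3, padding=1),
).to(device)

# Step 3: store weights and activations in NDHWC order (3D channels-last).
block = block.to(memory_format=torch.channels_last_3d)
volume = torch.randn(1, 4, 16, 16, 16, device=device)
volume = volume.contiguous(memory_format=torch.channels_last_3d)

# Step 1: profile one forward/backward pass to see which operators dominate.
activities = [torch.profiler.ProfilerActivity.CPU]
if device == "cuda":
    activities.append(torch.profiler.ProfilerActivity.CUDA)

with torch.profiler.profile(activities=activities) as prof:
    out = block(volume)
    out.sum().backward()

print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```

The PyTorch profiler reports operator and kernel times only. Hardware counters such as SM cycles or dynamic instruction counts would need a lower-level tool such as NVIDIA Nsight Compute.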

Who benefits

Healthcare, Scientific Research, Gaming, Automotive, Aerospace

Key takeaways

  • 3D diffusion models for MRI synthesis are computationally intensive on GPUs.
  • Inefficiencies arise from memory access, tensor layout, and low Tensor Core utilization.
  • TF32 Tensor Core activation significantly boosts performance on NVIDIA GPUs.
  • A 3D channels-last layout can drastically reduce SM cycles and dynamic instructions.
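To make the layout point concrete, the following sketch computes element strides for the same 5D tensor shape under the default channels-first order (NCDHW) and under a 3D channels-last order (NDHWC). The `strides` helper and the example shape are illustrative assumptions. In channels-last, the channel dimension has stride 1, so all channels of one voxel sit next to each other in memory.

```python
def strides(shape, order):
    """Element strides for `shape` when dims are stored in `order` (outermost first)."""
    result = [0] * len(shape)
    step = 1
    for dim in reversed(order):  # innermost dimension gets stride 1
        result[dim] = step
        step *= shape[dim]
    return tuple(result)


# Logical dims: 0=N (batch), 1=C (channels), 2=D (depth), 3=H (height), 4=W (width)
shape = (2, 8, 4, 4, 4)

channels_first = strides(shape, order=(0, 1, 2, 3, 4))  # NCDHW storage
channels_last = strides(shape, order=(0, 2, 3, 4, 1))   # NDHWC storage

print(channels_first)  # (512, 64, 16, 4, 1)
print(channels_last)   # (512, 1, 128, 32, 8)
```

With channels-first storage, reading all 8 channels of one voxel jumps 64 elements at a time; with channels-last storage the same read is contiguous, which is the access pattern Tensor Core convolution kernels generally prefer.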

Original post by Jeeho Ryoo, Yongchan Jung, Muhammad Ali Khaliq, Weidong Zhang, Jiatong Han, Byeong Kil Lee

"arXiv:2606.19365v1 Announce Type: new Abstract: Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and a highly heterogeneous…"
