arxiv:2603.28342

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

Published on Mar 30
· Submitted by duhe on Mar 31
Abstract

Kernel-Smith is a GPU kernel generation framework that combines evolutionary algorithms with post-training reinforcement learning to optimize performance across different hardware backends.

AI-generated summary

We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and Maca on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting its potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
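The evolutionary loop described above (archive of top candidates, model-proposed revisions, structured execution feedback) can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: `evaluate` stands in for the backend evaluation service, `mutate` stands in for the LLM improver, and a toy integer tuning knob replaces real GPU kernel code.

```python
import random

def evaluate(candidate):
    """Stand-in for the evaluation service: returns (correct, speedup).

    Toy objective: the candidate is valid in (0, 128] and its simulated
    speedup peaks when the tuning knob equals 64.
    """
    correct = 0 < candidate <= 128
    speedup = max(0.0, 2.0 - abs(candidate - 64) / 32) if correct else 0.0
    return correct, speedup

def mutate(candidate, rng):
    """Stand-in for the LLM improver: propose a small local revision."""
    return candidate + rng.choice([-8, -4, -2, 2, 4, 8])

def evolve(generations=50, archive_size=4, seed=0):
    rng = random.Random(seed)
    # Archive of executable candidates (toy: integer tuning knobs).
    archive = [rng.randint(1, 128) for _ in range(archive_size)]
    for _ in range(generations):
        parent = rng.choice(archive)        # sample a program from the archive
        child = mutate(parent, rng)         # model proposes a revision
        correct, _ = evaluate(child)        # structured execution feedback
        if correct:                         # keep only valid candidates
            archive.append(child)
            # Retain the top-`archive_size` programs by speedup.
            archive.sort(key=lambda c: evaluate(c)[1], reverse=True)
            archive = archive[:archive_size]
    archive.sort(key=lambda c: evaluate(c)[1], reverse=True)
    return archive

archive = evolve()
print("best candidate:", archive[0], "score:", evaluate(archive[0]))
```

A real system would also track diversity in the archive (not just top speedup) so the improver sees varied starting points, as the summary notes.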

Community

Paper author · Paper submitter

Welcome to our new work: Kernel-Smith 🚀.

We move past the bottleneck of one-shot LLM operator generation by introducing an evolutionary agent mechanism. The core idea is simple: let the model learn how to improve kernels, step by step, inside the evolutionary loop.

  • How it works: we build stable evaluation services (NVIDIA & MetaX), capture high-gain steps from long-horizon evolution trajectories, and convert them into reinforcement learning signals.
  • Results: Kernel-Smith achieves SOTA on KernelBench. It is not only fast; more importantly, it has a high success rate during evolution.
  • Deployment: we do not chase synthetic metrics. The operators we optimized have been contributed directly to SGLang and LMDeploy.
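The "high-gain steps" selection mentioned above can be illustrated with a minimal sketch: from a trajectory of parent-to-child revisions, keep only those that preserve correctness and deliver a meaningful speedup gain. The field names and the gain threshold here are assumptions for illustration, not the paper's actual schema.

```python
def select_training_steps(trajectory, min_gain=1.05):
    """Keep correctness-preserving, high-gain revisions as training examples.

    trajectory: list of dicts describing one parent->child revision each,
    with correctness flags and measured speedups (hypothetical schema).
    """
    kept = []
    for step in trajectory:
        correctness_preserved = step["parent_correct"] and step["child_correct"]
        gain = step["child_speedup"] / step["parent_speedup"]
        if correctness_preserved and gain >= min_gain:
            # Each surviving (parent, child) pair becomes a supervision
            # or RL example for training the model as a local improver.
            kept.append((step["parent_code"], step["child_code"], gain))
    return kept

trajectory = [
    {"parent_code": "k0", "child_code": "k1", "parent_correct": True,
     "child_correct": True, "parent_speedup": 1.0, "child_speedup": 1.4},
    {"parent_code": "k1", "child_code": "k2", "parent_correct": True,
     "child_correct": False, "parent_speedup": 1.4, "child_speedup": 2.0},
    {"parent_code": "k1", "child_code": "k3", "parent_correct": True,
     "child_correct": True, "parent_speedup": 1.4, "child_speedup": 1.41},
]

print(select_training_steps(trajectory))  # only the first revision survives
```

The second step is dropped because the child breaks correctness (a fast but wrong kernel is useless), and the third because its gain is marginal.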

You are welcome to try the demo on the project page!

Get this paper in your agent:

hf papers read 2603.28342
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
