A benchmark for evaluating LLMs' ability to generate efficient GPU kernels
✨ What's New in KernelBench-v2
- Triton Integration: Native support for Triton kernels with automatic detection and a production-ready evaluation pipeline (`scripts/run_and_check_triton.py`).
- Unified Evaluation: CUDA and Triton kernels now share the same correctness and performance metrics.
- Docker Environment: Reproducible environment via the root `Dockerfile` (CUDA 12.4 + Triton + PyTorch 2.5).
- Robust Error Handling: 15+ categorized runtime/compilation error classes for easier debugging.
- Compile-Cache Builder: New `build_compile_cache_triton` helper for fast batched evaluation.
- Expanded Docs: See `TRITON_README.md` for a quick start and `TRITON_INTEGRATION_GUIDE.md` for an in-depth technical deep dive.
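Categorized error classes make it easy to aggregate failure modes across a large batch of generated kernels. As a rough illustration of the idea (all class and function names below are hypothetical stand-ins, not KernelBench's actual API):

```python
# Illustrative sketch only: the real error taxonomy in KernelBench
# uses its own class names; these are hypothetical stand-ins.
class KernelEvalError(Exception):
    """Base class for categorized evaluation failures."""

class CompilationError(KernelEvalError):
    """Kernel source failed to compile."""

class CorrectnessError(KernelEvalError):
    """Kernel output diverged from the PyTorch reference."""

def categorize(exc: Exception) -> str:
    """Map an exception to a coarse category string for aggregation."""
    if isinstance(exc, CompilationError):
        return "compilation"
    if isinstance(exc, CorrectnessError):
        return "correctness"
    return "runtime"

print(categorize(CompilationError("nvcc: syntax error")))  # compilation
```

Grouping failures by category like this is what lets a batch-evaluation run report, say, how many samples failed to compile versus how many compiled but produced wrong outputs.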
👋 Task Description
We structure the task for the LLM to transpile operators described in PyTorch into CUDA kernels, at whatever level of granularity it chooses.

We construct KernelBench with 4 levels of tasks:
- Level 1 🧱: Single-kernel operators (100 problems). The foundational building blocks of neural networks (convolutions, matrix multiplies, layer normalization).
- Level 2 🔗: Simple fusion patterns (100 problems). A fused kernel should be faster than separate kernels (Conv + Bias + ReLU, Matmul + Scale + Sigmoid).
- Level 3 ⚙️: Full model architectures (50 problems). Optimize entire model architectures end-to-end (MobileNet, VGG, MiniGPT, Mamba).
- Level 4 🤗: Hugging Face models. Optimize whole model architectures taken from Hugging Face.
⚖️ Evaluation
Methodology
To evaluate model-generated kernels, we check that they:
- are correct ✅: compare outputs against the reference torch operators `n_correctness` times on randomized inputs.
- are performant ⏱️: time against the reference torch operators over `n_trial` runs and measure the speedup between runtimes.
Check out src/eval.py for details on how we implement correctness check and timing.
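As a rough, dependency-free sketch of these two checks (the actual `src/eval.py` compares torch tensors on GPU and handles synchronization; everything below is simplified and the names are illustrative):

```python
import random
import time

def check_correctness(ref_fn, kernel_fn, gen_input, n_correctness=5, tol=1e-4):
    """Run both implementations on randomized inputs and compare outputs."""
    for _ in range(n_correctness):
        x = gen_input()
        if abs(ref_fn(x) - kernel_fn(x)) > tol:
            return False
    return True

def measure_speedup(ref_fn, kernel_fn, gen_input, n_trial=100):
    """Wall-clock ratio: reference time / kernel time (>1 means faster)."""
    x = gen_input()
    def timeit(fn):
        start = time.perf_counter()
        for _ in range(n_trial):
            fn(x)
        return time.perf_counter() - start
    return timeit(ref_fn) / timeit(kernel_fn)

# Toy stand-ins for a reference operator and a generated "kernel"
ref = lambda x: x * 2.0
fast = lambda x: x + x  # numerically identical alternative implementation
gen = lambda: random.uniform(-10, 10)
assert check_correctness(ref, fast, gen)
```

The real evaluation must also guard against kernels that crash, hang, or silently fall back to PyTorch, which is why the timing and correctness logic in `src/eval.py` is considerably more involved than this sketch.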
We provide a convenient script, scripts/run_and_check.py, to evaluate a single kernel sample against a reference source file, checking correctness and computing speedup. You can use this to evaluate a model-generated kernel.
Overall Benchmark Metric
Since we need to capture both correctness and performance, we define a metric fast_p: fraction of tasks that are both correct and have a speedup greater than threshold p; speedup is computed as the ratio of PyTorch reference wall-clock time to generated kernel time.
Some examples to illustrate this metric that filters based on speedups:
- `fast_1`: fraction of tasks where the LM-generated kernel is both correct and faster than the PyTorch baseline
- `fast_2`: fraction of tasks where the LM-generated kernel is both correct and at least 2x faster than the PyTorch baseline
- `fast_0`: fraction of tasks where the LM-generated kernel is correct (i.e., the correctness rate)
You can increase speedup threshold p to make the task more challenging.
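To make the metric concrete, here is a minimal stand-alone sketch of `fast_p` over a list of (correct, speedup) results; the names are illustrative, and the official computation lives in the analysis scripts:

```python
def fast_p(results, p):
    """Fraction of tasks that are correct AND have speedup greater than p.

    results: list of (correct: bool, speedup: float) pairs, where
    speedup = PyTorch reference wall-clock time / generated kernel time.
    """
    if not results:
        return 0.0
    wins = sum(1 for correct, speedup in results if correct and speedup > p)
    return wins / len(results)

results = [(True, 1.8), (True, 0.9), (False, 3.0), (True, 2.4)]
print(fast_p(results, 0))  # correctness rate: 0.75
print(fast_p(results, 1))  # correct and faster than baseline: 0.5
print(fast_p(results, 2))  # correct and over 2x faster: 0.25
```

Note that an incorrect kernel never counts toward `fast_p`, no matter how fast it runs: the third entry above has a 3x speedup but is excluded everywhere.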
Compute Overall Benchmark Performance
We provide a script scripts/greedy_analysis.py to compute the overall benchmark performance.
📁 Directory Structure
We organize the repo into the following structure:
KernelBench-v2/
├── assets/
│   └── figures/
├── KernelBench/          # Benchmark dataset (problems & reference PyTorch)
├── src/                  # Core benchmark logic
│   ├── unit_tests/       # PyTest unit tests
│   ├── prompts/          # Prompt templates & example kernels
│   └── …
├── scripts/              # Helper CLI entry points (generation, eval, analysis)
├── triton_test/          # Stand-alone Triton examples / kernels
├── results/              # Baseline timings & evaluation outputs
├── docker-compose.yml & Dockerfile   # Reproducible container env
└── runs/                 # Generated model outputs live here
🔧 Set up
conda create --name kernel-bench python=3.10
conda activate kernel-bench
pip install -r requirements.txt
pip install -e .
Docker Quick-Start (GPU required):
# 1. Build image & launch dev container
docker compose up --build -d   # spins up `kernelbench-triton`
# 2. Attach an interactive shell (optional)
docker exec -it kernelbench-triton bash
Note (v2): If you intend to run or evaluate Triton kernels, also install Triton with pip install triton.
To call LLM API providers, set your {INFERENCE_SERVER_PROVIDER}_API_KEY API key.
Running and profiling kernels require a GPU.
If you don't have GPU available locally, you can set up Modal. Set up your modal token after creating an account by running modal token new. Then, use the generate_and_eval_single_sample_modal.py script.
🚀 Usage
Run on a single problem
It is easier to get started with a single problem. This will fetch the problem, generate a sample, and evaluate the sample.
# for example, run level 2 problem 40 from huggingface
python3 scripts/generate_and_eval_single_sample.py dataset_src="huggingface" level=2 problem_id=40
# dataset_src could be "local" or "huggingface"
# add .verbose_logging for more visibility
🔥 NEW: Triton Kernel Support
KernelBench now supports both CUDA and Triton kernels with automatic detection!
# Evaluate Triton kernels with auto-detection
python3 scripts/run_and_check_triton.py \
    ref_origin=kernelbench \
    level=1 \
    problem_id=1 \
    kernel_src_path=your_triton_kernel.py
# See TRITON_README.md for the complete guide
Run on all problems
# 1. Generate responses and store kernels locally to runs/{run_name} directory
python3 scripts/generate_samples.py run_name=test_hf_level_1 dataset_src=huggingface level=1 num_workers=50 server_type=deepseek model_name=deepseek-chat temperature=0
# 2. Evaluate on all generated kernels in runs/{run_name} directory
python3 scripts/eval_from_generations.py run_name=test_hf_level_1 dataset_src=local level=1 num_gpu_devices=8 timeout=300
# To speed up evaluation, you can parallelize compilation on CPUs before evaluating on GPUs:
# add build_cache=True and num_cpu_workers=<num_cpu_workers> to the command
Analyze the eval results to compute Benchmark Performance
We provide scripts/benchmark_eval_analysis.py to analyze the eval results to compute success rate, timing metric, and overall benchmark performance fast_p.
python3 scripts/benchmark_eval_analysis.py run_name=test_hf_level_1 level=1 hardware=L40S_matx3 baseline=baseline_time_torch
If you are using a different hardware, you can generate the baseline time with scripts/generate_baseline_time.py script.
We provide reference baseline times for a variety of NVIDIA GPUs across generations in results/timing, but we recommend generating your own baseline times for more accurate results (cluster power, software versions, etc. all affect timing). See results/timing/README.md for more details.
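If you are curious what baseline timing involves, the core loop looks roughly like the simplified sketch below. This is an illustration only: timing real GPU kernels additionally requires device synchronization (e.g. `torch.cuda.synchronize`) around each trial, which the baseline script must handle and this sketch omits.

```python
import statistics
import time

def benchmark(fn, *args, warmup=3, n_trial=20):
    """Median wall-clock time of fn(*args), excluding warmup runs.

    Simplified CPU-only sketch; real GPU timing needs explicit
    device synchronization before reading the clock.
    """
    for _ in range(warmup):
        fn(*args)  # warm caches / JIT before measuring
    times = []
    for _ in range(n_trial):
        start = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - start)
    return statistics.median(times)

baseline = benchmark(sorted, list(range(1000)))
print(f"median runtime: {baseline * 1e6:.1f} us")
```

Using the median over many trials (rather than a single run) is what makes the baseline robust to scheduler noise and clock-frequency fluctuations.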
Multi-Turn Framework
We have also released Caesar, the test-time framework used in the multi-turn / iterative refinement experiments in our paper. You can use or modify this framework for high-throughput test-time scaling (both sequential and parallel) targeting KernelBench problems.
🛣️ Upcoming Roadmap
- Triton variant (✅ Now available! See TRITON_README.md)
- Easy-to-use Colab notebook example
- Push-button flow on Modal / cloud providers
- Integrate with more frameworks, such as ThunderKittens
- Add backward pass
- Integrate with toolchains such as NCU

See Issues for the ongoing roadmap and directions.
📊 Known Usage
- NVIDIA - Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling
- METR - Measuring Automated Kernel Engineering
- Sakana AI - AI CUDA Engineer
- Project Popcorn - Triton Support for KernelBench, Data Scaling + SFT'd Kernel LLM
- Kevin - Kevin-32B: Multi-Turn RL for Writing CUDA Kernels
- Simple Test-Time Search - by @anneouyang
If you are using KernelBench, we'd love to hear more about it!
🪪 License
MIT. Check LICENSE.md for more details.
Citation
@misc{ouyang2025kernelbenchllmswriteefficient,
  title={KernelBench: Can LLMs Write Efficient GPU Kernels?},
  author={Anne Ouyang and Simon Guo and Simran Arora and Alex L. Zhang and William Hu and Christopher Ré and Azalia Mirhoseini},
  year={2025},
  eprint={2502.10517},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.10517},
}
🧪 Testing & CI
Run the full unit-test and integration suite locally:
# Run fast unit tests
pytest -q src/unit_tests
# Sanity-check that reference problems compile & run
python scripts/verify_bench.py          # CUDA kernels
python test_triton_integration.py       # Triton kernels
All tests also run in CI via GitHub Actions (see .github/workflows/ci.yml).
