A leaderboard for tracking CUDA kernel optimization performance on the KernelBench benchmark.
KernelBench evaluates the ability of AI systems to optimize CUDA kernels across three difficulty levels:
- Level 1: 100 basic CUDA operations (matrix multiplication, activations, etc.)
- Level 2: 100 intermediate fused operations
- Level 3: 50 advanced architectures (ResNet, Transformers, etc.)
| Metric | Description |
|---|---|
| GeoMean Speedup | Geometric mean of speedup ratios across all correctly solved problems. 1.0x = baseline PyTorch performance. |
| Fast@1.0 | Percentage of kernels achieving ≥1.0x speedup (matching or exceeding baseline) |
| Fast@1.5 | Percentage of kernels achieving ≥1.5x speedup |
| Fast@2.0 | Percentage of kernels achieving ≥2.0x speedup |
| Correct Rate | Percentage of kernels that compile and produce correct outputs |
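As a rough illustration of how these metrics are derived from per-kernel results, here is a toy sketch in Python. It is not KernelBench's `src/score.py`, the speedup values are made up, and it assumes Fast@p is computed over all problems rather than only the correct ones:

```python
import math

def summarize(speedups, total_problems):
    """Toy illustration of the leaderboard metrics (not KernelBench's src/score.py).

    `speedups` holds one speedup ratio per kernel that compiled and produced
    correct outputs; incorrect kernels simply have no entry.
    """
    correct_rate = len(speedups) / total_problems
    # Geometric mean over correctly solved problems only (1.0 = baseline PyTorch).
    geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    # Fast@p: fraction of all problems whose kernel is correct AND at least p times faster.
    fast_at = {
        p: sum(s >= p for s in speedups) / total_problems
        for p in (1.0, 1.5, 2.0)
    }
    return correct_rate, geomean, fast_at

# Hypothetical speedups for 4 correct kernels out of 5 attempted problems.
print(summarize([0.9, 1.2, 1.6, 2.4], total_problems=5))
```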
Create a JSON file with this exact structure:
{
"metadata": {
"display_name": "Your Model Name",
"organization": "Your Organization",
"notes": "Optional notes about your submission"
},
"kernels": {
"level_1_problem_1_sample_0": "import torch\nimport ...\n\nclass ModelNew(nn.Module):\n ...",
"level_1_problem_2_sample_0": "...",
"level_2_problem_1_sample_0": "...",
"level_3_problem_1_sample_0": "..."
}
}

Required fields:
- `metadata.display_name`: Name shown on the leaderboard
- `kernels`: Dictionary with kernel code for each problem (an illustrative entry is sketched below)

Kernel key format: `level_{L}_problem_{P}_sample_0`
- Level 1: problems 1-100
- Level 2: problems 1-100
- Level 3: problems 1-50
Total: 250 kernels expected
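For reference, each value in the `kernels` dictionary is the full source of a Python module that defines a `ModelNew` class, which replaces the baseline PyTorch model for that problem. The sketch below shows the general shape for a hypothetical Level 1 elementwise (ReLU-style) problem using an inline CUDA kernel; the actual reference models, input shapes, and constructor arguments come from the KernelBench problem definitions:

```python
import torch
import torch.nn as nn
from torch.utils.cpp_extension import load_inline

# Hypothetical custom CUDA kernel for an elementwise ReLU (illustrative only).
cuda_src = r"""
__global__ void relu_kernel(const float* __restrict__ x, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(x[i], 0.0f);
}

torch::Tensor relu_forward(torch::Tensor x) {
    auto xc = x.contiguous();
    auto out = torch::empty_like(xc);
    int n = xc.numel();
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    relu_kernel<<<blocks, threads>>>(xc.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}
"""
cpp_src = "torch::Tensor relu_forward(torch::Tensor x);"

relu_ext = load_inline(
    name="custom_relu",
    cpp_sources=cpp_src,
    cuda_sources=cuda_src,
    functions=["relu_forward"],
    verbose=False,
)

class ModelNew(nn.Module):
    """Drop-in replacement for the baseline model of this (hypothetical) problem."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return relu_ext.relu_forward(x)
```

The string stored in the submission JSON is this module source with newlines escaped, as in the `\n`-separated example above.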
- Fork this repository
- Add your submission: `submissions/your_model_name.json`
- Open a Pull Request
- Wait for a maintainer to review and add the `evaluate` label
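Before opening the PR, you can sanity-check the submission file locally. The sketch below only mirrors the key-format and count rules listed above; it is not the CI's actual validator:

```python
import json
import sys

# Expected problem numbers per level (250 kernels total).
EXPECTED = {1: range(1, 101), 2: range(1, 101), 3: range(1, 51)}

def check_submission(path):
    with open(path) as f:
        sub = json.load(f)
    assert sub["metadata"].get("display_name"), "metadata.display_name is required"
    kernels = sub["kernels"]
    expected_keys = {
        f"level_{lvl}_problem_{p}_sample_0"
        for lvl, probs in EXPECTED.items() for p in probs
    }
    missing = expected_keys - kernels.keys()
    extra = kernels.keys() - expected_keys
    print(f"{len(kernels)} kernels found; {len(missing)} missing, {len(extra)} unexpected")
    return not missing and not extra

if __name__ == "__main__":
    ok = check_submission(sys.argv[1] if len(sys.argv) > 1 else "submissions/my_model.json")
    sys.exit(0 if ok else 1)
```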
Once a maintainer adds the `evaluate` label to your PR:
- Validation - Checks your JSON format and kernel counts
- Evaluation - Runs all 250 kernels on H100 GPUs via Modal
- Scoring - Computes metrics using KernelBench's scoring
- Results - Posts a comment with your scores on the PR
Evaluation typically takes 30-60 minutes, depending on the queue.
All submissions are evaluated on NVIDIA H100 GPUs.
- Python 3.10+
- Modal account and CLI configured
# Clone with submodules
git clone --recursive https://github.com/ScalingIntelligence/KernelBench-Leaderboard.git
cd KernelBench-Leaderboard
# Install dependencies
pip install modal pydra tqdm numpy tabulate datasets
pip install -r KernelBench/requirements.txt
# Configure Modal
modal token new

cd KernelBench
# Generate for each level
python scripts/generate_samples.py \
dataset_src=local level=1 run_name=my_model_level1 \
server_type=google model_name=gemini/gemini-2.5-flash
python scripts/generate_samples.py \
dataset_src=local level=2 run_name=my_model_level2 \
server_type=google model_name=gemini/gemini-2.5-flash
python scripts/generate_samples.py \
dataset_src=local level=3 run_name=my_model_level3 \
server_type=google model_name=gemini/gemini-2.5-flash

# Evaluate each level
python scripts/eval_from_generations.py \
run_name=my_model_level1 level=1 eval_mode=modal gpu=H100 dataset_src=local
python scripts/eval_from_generations.py \
run_name=my_model_level2 level=2 eval_mode=modal gpu=H100 dataset_src=local
python scripts/eval_from_generations.py \
run_name=my_model_level3 level=3 eval_mode=modal gpu=H100 dataset_src=local
# Analyze results
python scripts/benchmark_eval_analysis.py \
run_name=my_model_level1 level=1 hardware=H100 baseline=baseline \
baseline_file=../baselines/H100.json output_file=../data/level1_results.json
python scripts/benchmark_eval_analysis.py \
run_name=my_model_level2 level=2 hardware=H100 baseline=baseline \
baseline_file=../baselines/H100.json output_file=../data/level2_results.json
python scripts/benchmark_eval_analysis.py \
run_name=my_model_level3 level=3 hardware=H100 baseline=baseline \
baseline_file=../baselines/H100.json output_file=../data/level3_results.json

cd ..
# Bundle kernels into submission JSON
python scripts/combine_runs_to_submission.py \
--level1_run my_model_level1 \
--level2_run my_model_level2 \
--level3_run my_model_level3 \
--output submissions/my_model.json \
--display_name "My Model" \
--organization "My Organization"
# Update leaderboard
python scripts/update_leaderboard.py \
--level1 data/level1_results.json \
--level2 data/level2_results.json \
--level3 data/level3_results.json \
--submission submissions/my_model.json

This leaderboard is a thin orchestration layer. All scoring logic lives in KernelBench:
┌─────────────────────────────────────────────────────────────────┐
│ KernelBench Leaderboard │
│ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │
│ │ submission_to_ │ │ update_ │ │ combine_runs_ │ │
│ │ run.py │ │ leaderboard.py │ │ to_submission │ │
│ │ (JSON→files) │ │ (aggregation) │ │ .py │ │
│ └────────┬─────────┘ └────────┬─────────┘ └───────────────┘ │
│ │ │ │
└───────────┼─────────────────────┼───────────────────────────────┘
│ │
┌───────────▼─────────────────────▼───────────────────────────────┐
│ KernelBench (submodule) │
│ ┌──────────────────┐ ┌──────────────────┐ ┌───────────────┐ │
│ │ eval_from_ │ │ benchmark_eval_ │ │ src/score.py │ │
│ │ generations.py │ │ analysis.py │ │ (all metrics) │ │
│ │ (GPU eval) │ │ (scoring+JSON) │ │ │ │
│ └──────────────────┘ └──────────────────┘ └───────────────┘ │
└─────────────────────────────────────────────────────────────────┘
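The JSON→files step on the leaderboard side is conceptually just unpacking the submission into per-problem source files that KernelBench's evaluation scripts can read. A minimal sketch is below; the output layout is an illustrative assumption, not the exact structure `submission_to_run.py` produces:

```python
import json
from pathlib import Path

def unpack_submission(submission_path: str, runs_root: str = "runs") -> None:
    """Write each kernel string from the submission JSON to its own .py file.

    The directory layout used here is an assumption for illustration only.
    """
    sub = json.loads(Path(submission_path).read_text())
    run_dir = Path(runs_root) / sub["metadata"]["display_name"].replace(" ", "_").lower()
    run_dir.mkdir(parents=True, exist_ok=True)
    for key, source in sub["kernels"].items():
        # key looks like "level_1_problem_42_sample_0"
        (run_dir / f"{key}.py").write_text(source)

unpack_submission("submissions/my_model.json")
```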
KernelBench-Leaderboard/
├── baselines/ # Baseline timing data
│ └── H100.json
├── data/
│ ├── metadata.json # Leaderboard entries
│ └── results/ # Detailed per-submission results
├── scripts/
│ ├── submission_to_run.py
│ ├── update_leaderboard.py
│ └── combine_runs_to_submission.py
├── submissions/ # Submission JSON files
├── KernelBench/ # Submodule: benchmark & scoring
├── index.html # Leaderboard frontend
└── .github/workflows/ # CI/CD automation
This project is part of the KernelBench benchmark suite. See the KernelBench repository for licensing information.