
Conversation

@kaiming-cheng (Contributor)

This PR introduces a diagnose module that builds GPU performance analysis prompts. The module provides GPU hardware specification lookup, NCU metric schema definitions, and composable prompt section rendering for bottleneck analysis.

Core Components

1. MetricSchema (metric_schema.py)

  • Defines a single source of truth for NCU profiling metrics (keys, labels, units)
  • Organizes metrics into different sections: SM & Compute Utilization, Memory Bandwidth & Cache, Memory Access Patterns, Occupancy & Resources, Stall Metrics
  • Extensible schema design: metrics can be added, removed, or recategorized by editing the schema, supporting iterative experimentation (a minimal sketch of the shape follows below)
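
A minimal sketch of what a schema of this shape could look like; the names and metric keys here are illustrative, not the actual `metric_schema.py` API:

```python
# Hypothetical sketch of the schema shape; names and metric keys are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDef:
    key: str    # NCU metric key as reported by the profiler
    label: str  # human-readable label used in prompts
    unit: str   # display unit, e.g. "%" or "GB/s"

METRIC_SECTIONS = {
    "SM & Compute Utilization": [
        MetricDef("sm__throughput.avg.pct_of_peak_sustained_elapsed", "SM Throughput", "%"),
    ],
    "Memory Bandwidth & Cache": [
        MetricDef("dram__throughput.avg.pct_of_peak_sustained_elapsed", "DRAM Throughput", "%"),
    ],
}
```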

2. GPU Specs (gpu_specs.py)

  • GPU specifications database for NVIDIA A100, H100, RTX 4090, and RTX 5080
  • Auto-detection via nvidia-smi with fuzzy matching support
  • Hardware specs include peak compute (FP32/FP16/BF16), memory bandwidth, SM count, cache sizes, and memory type (a hypothetical entry is sketched below)
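
A hypothetical database entry, using the keys accessed in the usage example and the H100 values shown in the output below; the real `gpu_specs.py` may store more or different fields:

```python
# Hypothetical entry; field names follow the keys used in the example below.
GPU_SPECS_DATABASE = {
    "NVIDIA H100": {
        "name": "NVIDIA H100",
        "architecture": "Hopper",
        "peak_fp32_tflops": 51.0,
        "peak_memory_bw_gbps": 3352,
        "sm_count": 132,
        "memory_gb": 80,
        "memory_type": "HBM3",
    },
}
```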

3. Judger Prompts (judger_prompts.py)

  • Prompt builder for the Judge LLM's dual-bottleneck analysis
  • Integrates section renderers for composable prompt construction
  • Response extraction with multi-strategy JSON parsing (a minimal sketch is below)
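
A minimal sketch of the multi-strategy extraction idea, based on the strategy comments quoted later in this review; the function name and exact regex are assumptions:

```python
# Hypothetical sketch of multi-strategy JSON extraction; not the exact code in this PR.
import json
import re
from typing import Any, Optional

def extract_analysis(response: str) -> Optional[dict[str, Any]]:
    # Strategy 1: the whole response is valid JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find the first { ... } block containing the "bottleneck_1" field.
    match = re.search(r"\{.*\"bottleneck_1\".*\}", response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```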

Example Usage

from kernel_perf_agent.kernel_opt.diagnose_prompt import (
    get_gpu_specs,
    build_judge_optimization_prompt,
)

specs = get_gpu_specs()
print(f"\nUsing specs for: {specs['name']} ({specs.get('architecture', 'Unknown')})")
print(f"  - Peak Memory Bandwidth: {specs['peak_memory_bw_gbps']} GB/s")
print(f"  - Peak FP32 Performance: {specs['peak_fp32_tflops']} TFLOPS")
print(f"  - SM Count: {specs['sm_count']}")

Detected GPU: NVIDIA H100

Using specs for: NVIDIA H100 (Hopper)

  • Peak Memory Bandwidth: 3352 GB/s
  • Peak FP32 Performance: 51.0 TFLOPS
  • SM Count: 132

system_prompt, user_prompt = build_judge_optimization_prompt(
  kernel_code=kernel_code,
  problem_description=problem_desc,
  ncu_metrics=ncu_metrics,
  gpu_specs=specs
)

You are a senior GPU performance engineer. Analyze the target GPU spec, the current kernel, and the Nsight Compute (NCU) profiling metrics. Identify EXACTLY TWO DISTINCT bottlenecks from the hardware profiling data, and propose specific optimization methods for each. Be surgical and metrics-driven.
......

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 13, 2026
@kaiming-cheng changed the title from "Add Diagnosis Module (Prompt Builder for Hardware Bottleneck)" to "[Optimization 3/n] Add Diagnosis Module (Prompt Builder for Hardware Bottleneck)" on Jan 13, 2026
Comment on lines +148 to +96
# Return default if detection failed
if gpu_name is None:
    print("⚠️ GPU auto-detection failed, using A100 specs as fallback")
    return GPU_SPECS_DATABASE["NVIDIA A100"].copy()
Contributor:

Should we fall back to A100? Or does returning an empty dict make more sense?

Contributor (Author):

I agree returning an empty dict is cleaner, but it will also lead to a KeyError in the optimization flow. Should we decide to disable optimization if no gpu_name is found?

@Jack-Khuu (Contributor), Jan 15, 2026:

I think that makes sense; if there are setup/detection issues then we shouldn't optimize

Contributor:

Yeah explicit failure makes sense. (return None (or {}) and let caller disable diagnosis/optimization)
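
A minimal sketch of that explicit-failure variant; the body is an assumption, not the merged code, and `GPU_SPECS_DATABASE` is the constant from this PR:

```python
# Hypothetical sketch: fail explicitly instead of silently falling back to A100 specs.
import subprocess
from typing import Any, Optional

def get_gpu_specs() -> Optional[dict[str, Any]]:
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
        gpu_name = out.stdout.strip().split("\n")[0].strip()
    except (subprocess.TimeoutExpired, FileNotFoundError):
        gpu_name = None
    if not gpu_name:
        print("⚠️ GPU auto-detection failed; diagnosis will be skipped")
        return None
    return GPU_SPECS_DATABASE.get(gpu_name)  # assumes an exact-match lookup

# Caller side: disable the optimization pass rather than guessing the hardware.
specs = get_gpu_specs()
if specs is None:
    run_diagnosis = False
```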


# GPU specifications database
# Sources: NVIDIA official specifications, manufacturer datasheets
GPU_SPECS_DATABASE = {
Contributor:

Can we move this const to its own file? It makes module overriding easier

Comment on lines +158 to +109
gpu_name_lower = gpu_name.lower()
for key, specs in GPU_SPECS_DATABASE.items():
    key_lower = key.lower()
    # Check if either name contains the other
    if gpu_name_lower in key_lower or key_lower in gpu_name_lower:
        print(f"ℹ️ Matched '{gpu_name}' to '{key}' (fuzzy match)")
        return specs.copy()
Contributor:

Curious if you've encountered this case before?

Contributor (Author):

Sometimes I'll just put "a100" or "h100" in my optimization workflow. What do you think, should we force the GPU name input to match exactly?

Contributor:

Enum it
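
A minimal sketch of the enum idea, assuming the four GPUs listed in the PR description; the enum values are illustrative and may not match the actual database keys:

```python
# Hypothetical sketch of an enum over supported GPU names; values are illustrative.
from enum import Enum

class SupportedGPU(str, Enum):
    A100 = "NVIDIA A100"
    H100 = "NVIDIA H100"
    RTX_4090 = "NVIDIA GeForce RTX 4090"
    RTX_5080 = "NVIDIA GeForce RTX 5080"

# Forces callers to name a known GPU instead of relying on fuzzy substring matching.
specs = GPU_SPECS_DATABASE[SupportedGPU.H100.value]
```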

Metric definitions are in metric_schema.py.
"""

from typing import Any, Callable, Dict, List, Optional, Tuple
Contributor:

😦


for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key, "N/A")
    lines.append(f"- **{label}:** {value}{unit}")
Contributor:

Do we want the unit if the value is N/A?
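
One possible tweak for the N/A case, sketched against the loop above (`GPU_SPEC_FIELDS`, `gpu_specs`, and `lines` come from that snippet); not a required change:

```python
# Hypothetical: only append the unit when a real value is present.
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key)
    if value is None:
        lines.append(f"- **{label}:** N/A")
    else:
        lines.append(f"- **{label}:** {value}{unit}")
```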

raise ValueError("NCU metrics are empty - cannot build judge prompt")

# Extract first kernel's metrics for the metric getter
first_kernel = list(ncu_metrics.values())[0] if ncu_metrics else {}
Contributor:

We check for empty above, so the `else {}` guard here is redundant.

except json.JSONDecodeError:
    pass

# Strategy 2: Find first { ... } block with "bottleneck_1" field
Contributor:

Are strategy 2/3 typically encountered?

Not required for this PR, but we can look into forcing the LLM providers to return structured output

Contributor (Author):

Not really, actually. All my experiments return the dual-bottleneck analysis.

) and _validate_bottleneck_entry(analysis["bottleneck_2"])


VALID_CATEGORIES = frozenset(
Contributor:

frozenset isn't wrong, but we're just using it as a lookup so normal set is fine

@kaiming-cheng force-pushed the kaiming/opt_component_3 branch from ac11151 to bfa2fd0 on January 15, 2026 19:21
Kaiming Cheng added 22 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
@kaiming-cheng force-pushed the kaiming/opt_component_3 branch from af9b7af to e2c599e on January 15, 2026 19:48
@Jack-Khuu (Contributor) left a comment:

stamp to unblock

    # Take only the first GPU (nvidia-smi returns one line per GPU)
    gpu_name = result.stdout.strip().split("\n")[0].strip()
    return gpu_name
except (subprocess.TimeoutExpired, FileNotFoundError, Exception):
Contributor:

We might want to drop the try/catch here, it feels a tad too cautious.

If it errors out, there's bigger problems

Comment on lines 114 to 118
if python_executable is None:
    python_executable = sys.executable

if ncu_bin is None:
    ncu_bin = shutil.which("ncu") or "/usr/local/cuda/bin/ncu"
Contributor:

Set defaults in function signature
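
A sketch of that suggestion; the function name is an assumption, and note that the defaults are then evaluated once at import time, which is fine for `sys.executable` but fixes the `ncu` lookup per process:

```python
# Hypothetical signature with the defaults pulled out of the function body.
import shutil
import sys

def run_ncu_profile(
    python_executable: str = sys.executable,
    ncu_bin: str = shutil.which("ncu") or "/usr/local/cuda/bin/ncu",
) -> None:
    ...
```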

Comment on lines 203 to 206
except subprocess.TimeoutExpired:
    raise RuntimeError(f"NCU profiling timed out after {timeout} seconds")
except Exception as e:
    raise RuntimeError(f"NCU profiling failed: {e}")
Contributor:

  1. Wrap the timeout handling just around the subprocess call (sketched below)

  2. Is the second exception clause a redundant pass-through?
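
A minimal sketch of point 1, assuming `cmd` and `timeout` from the surrounding function; not the code in this PR:

```python
# Hypothetical: keep the try/except tight around the subprocess call only.
import subprocess

try:
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
except subprocess.TimeoutExpired:
    raise RuntimeError(f"NCU profiling timed out after {timeout} seconds")

# Parsing of `result` happens outside the narrow try block.
```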

Comment on lines 223 to 227
if df.empty:
    return df

if len(df) == 1:
    return df
Contributor:

Suggested change:
- if df.empty:
-     return df
-
- if len(df) == 1:
-     return df
+ if len(df) <= 1:
+     return df

extra_keep: Optional[Sequence[str]] = ("Kernel Name",),
coerce_numeric: bool = True,
name_list: Optional[Sequence[str]] = None,
select: Union[str, MetricSelectionPolicy] = MetricSelectionPolicy.LAST,
Contributor:

Why accept both a string and the enum?

Comment on lines 351 to 358
# Filter by kernel name list if provided
if name_list:
    sub = _filter_by_kernel_names(sub, name_list, policy, keep_cols)
else:
    # Apply selection to all rows if no name filter
    sub = _apply_selection_policy(sub, policy)

return sub
Contributor:

Suggested change:
- # Filter by kernel name list if provided
- if name_list:
-     sub = _filter_by_kernel_names(sub, name_list, policy, keep_cols)
- else:
-     # Apply selection to all rows if no name filter
-     sub = _apply_selection_policy(sub, policy)
-
- return sub
+ return (
+     _filter_by_kernel_names(sub, name_list, policy, keep_cols)
+     if name_list
+     else _apply_selection_policy(sub, policy)
+ )

Comment on lines 291 to 296
dtype_map = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
}
dtype = dtype_map[args.dtype]
Contributor:

Suggested change:
- dtype_map = {
-     "float32": torch.float32,
-     "float16": torch.float16,
-     "bfloat16": torch.bfloat16,
- }
- dtype = dtype_map[args.dtype]
+ dtype = {
+     "float32": torch.float32,
+     "float16": torch.float16,
+     "bfloat16": torch.bfloat16,
+ }[args.dtype]


def render_ncu_metrics(
ncu_metrics: dict[str, Any],
Contributor:

ncu_metrics not used?

def get_metric(key: str, default: str = "N/A") -> str:
    val = kernel_metrics.get(key, default)
    if isinstance(val, (int, float)):
        return f"{val:.2f}"
Contributor:

For ints: render as int.

For floats: keep .2f for pct-style metrics, but consider scientific notation or “humanized” units for huge values (even just "{val:.3g}" is often better for prompts).

Alternatively, keep raw strings from the profiler and don’t reformat here.
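
A minimal sketch of that formatting split; not the code in this PR, and the extra `kernel_metrics` parameter is only there to keep the sketch self-contained:

```python
# Hypothetical reworking of the formatter discussed above.
from typing import Any, Mapping

def get_metric(kernel_metrics: Mapping[str, Any], key: str, default: str = "N/A") -> str:
    val = kernel_metrics.get(key, default)
    if isinstance(val, bool):   # bool is a subclass of int, so check it first
        return str(val)
    if isinstance(val, int):
        return str(val)         # ints render without a decimal point
    if isinstance(val, float):
        return f"{val:.3g}"     # compact form for both percentages and huge counters
    return str(val)             # leave profiler-provided strings untouched
```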

"memory_gb": 16,
"memory_type": "GDDR7",
},
}
Contributor:

GPU specs table: "A100/H100" are multi-SKU; peak BW/TFLOPs and memory size can be wrong under fuzzy match/fallback (e.g., A100 80GB → 40GB). Can we avoid the silent A100 fallback and add a match_type/matched_name field (or split common SKUs)?
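
A minimal sketch of the match-metadata idea; the field names and the `detected_name` variable are illustrative assumptions:

```python
# Hypothetical: record how the spec was chosen so downstream prompts can flag uncertainty.
specs = GPU_SPECS_DATABASE["NVIDIA A100"].copy()
specs["matched_name"] = detected_name  # raw string reported by nvidia-smi (assumed variable)
specs["match_type"] = "fuzzy"          # one of "exact", "fuzzy", "fallback"
```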
