
Conversation

@kaiming-cheng (Contributor)

This PR introduces a diagnose module that builds GPU performance analysis prompts. The module provides GPU hardware specification lookup, NCU metric schema definitions, and composable prompt section rendering for bottleneck analysis.

Core Components

1. MetricSchema (metric_schema.py)

  • Defines a single source of truth for NCU profiling metrics (keys, labels, units)
  • Organizes metrics into different sections: SM & Compute Utilization, Memory Bandwidth & Cache, Memory Access Patterns, Occupancy & Resources, Stall Metrics
  • Extensible schema design: metrics can be added, removed, or recategorized by editing the schema, supporting iterative experimentation (a minimal sketch of the shape follows below)
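
A minimal sketch of what a schema of this shape could look like; the names and metric keys here are illustrative, not the actual `metric_schema.py` API:

```python
# Hypothetical sketch of the schema shape; names and metric keys are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDef:
    key: str    # NCU metric key as reported by the profiler
    label: str  # human-readable label used in prompts
    unit: str   # display unit, e.g. "%" or "GB/s"

METRIC_SECTIONS = {
    "SM & Compute Utilization": [
        MetricDef("sm__throughput.avg.pct_of_peak_sustained_elapsed", "SM Throughput", "%"),
    ],
    "Memory Bandwidth & Cache": [
        MetricDef("dram__throughput.avg.pct_of_peak_sustained_elapsed", "DRAM Throughput", "%"),
    ],
}
```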

2. GPU Specs (gpu_specs.py)

  • GPU specifications database for NVIDIA A100, H100, RTX 4090, and RTX 5080
  • Auto-detection via nvidia-smi with fuzzy matching support
  • Hardware specs include peak compute (FP32/FP16/BF16), memory bandwidth, SM count, cache sizes, and memory type (a hypothetical entry is sketched below)
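
A hypothetical database entry, using the keys accessed in the usage example and the H100 values shown in the output below; the real `gpu_specs.py` may store more or different fields:

```python
# Hypothetical entry; field names follow the keys used in the example below.
GPU_SPECS_DATABASE = {
    "NVIDIA H100": {
        "name": "NVIDIA H100",
        "architecture": "Hopper",
        "peak_fp32_tflops": 51.0,
        "peak_memory_bw_gbps": 3352,
        "sm_count": 132,
        "memory_gb": 80,
        "memory_type": "HBM3",
    },
}
```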

3. Judger Prompts (judger_prompts.py)

  • Prompt builder for the Judge LLM's dual-bottleneck analysis
  • Integrates section renderers for composable prompt construction
  • Response extraction with multi-strategy JSON parsing (a minimal sketch is below)
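
A minimal sketch of the multi-strategy extraction idea, based on the strategy comments quoted later in this review; the function name and exact regex are assumptions:

```python
# Hypothetical sketch of multi-strategy JSON extraction; not the exact code in this PR.
import json
import re
from typing import Any, Optional

def extract_analysis(response: str) -> Optional[dict[str, Any]]:
    # Strategy 1: the whole response is valid JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        pass
    # Strategy 2: find the first { ... } block containing the "bottleneck_1" field.
    match = re.search(r"\{.*\"bottleneck_1\".*\}", response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return None
```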

Example Usage

from kernel_perf_agent.kernel_opt.diagnose_prompt import (
    get_gpu_specs,
    build_judge_optimization_prompt,
)

specs = get_gpu_specs()
print(f"\nUsing specs for: {specs['name']} ({specs.get('architecture', 'Unknown')})")
print(f"  - Peak Memory Bandwidth: {specs['peak_memory_bw_gbps']} GB/s")
print(f"  - Peak FP32 Performance: {specs['peak_fp32_tflops']} TFLOPS")
print(f"  - SM Count: {specs['sm_count']}")

Detected GPU: NVIDIA H100

Using specs for: NVIDIA H100 (Hopper)

  • Peak Memory Bandwidth: 3352 GB/s
  • Peak FP32 Performance: 51.0 TFLOPS
  • SM Count: 132

system_prompt, user_prompt = build_judge_optimization_prompt(
  kernel_code=kernel_code,
  problem_description=problem_desc,
  ncu_metrics=ncu_metrics,
  gpu_specs=specs
)

You are a senior GPU performance engineer. Analyze the target GPU spec, the current kernel, and the Nsight Compute (NCU) profiling metrics. Identify EXACTLY TWO DISTINCT bottlenecks from the hardware profiling data, and propose specific optimization methods for each. Be surgical and metrics-driven.
......

@meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Jan 13, 2026
@kaiming-cheng changed the title from "Add Diagnosis Module (Prompt Builder for Hardware Bottleneck)" to "[Optimization 3/n] Add Diagnosis Module (Prompt Builder for Hardware Bottleneck)" on Jan 13, 2026
Comment on lines +148 to +96
# Return default if detection failed
if gpu_name is None:
    print("⚠️ GPU auto-detection failed, using A100 specs as fallback")
    return GPU_SPECS_DATABASE["NVIDIA A100"].copy()
Contributor:

Should we fall back to A100? Or does returning an empty dict make more sense?

Contributor (Author):

I agree returning an empty dict is cleaner, but it will also lead to a KeyError in the optimization flow. Should we decide to disable optimization if no gpu_name is found?

@Jack-Khuu (Contributor), Jan 15, 2026:

I think that makes sense; if there are setup/detection issues then we shouldn't optimize

Contributor:

Yeah explicit failure makes sense. (return None (or {}) and let caller disable diagnosis/optimization)
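
A minimal sketch of that explicit-failure variant; the body is an assumption, not the merged code, and `GPU_SPECS_DATABASE` is the constant from this PR:

```python
# Hypothetical sketch: fail explicitly instead of silently falling back to A100 specs.
import subprocess
from typing import Any, Optional

def get_gpu_specs() -> Optional[dict[str, Any]]:
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
        gpu_name = out.stdout.strip().split("\n")[0].strip()
    except (subprocess.TimeoutExpired, FileNotFoundError):
        gpu_name = None
    if not gpu_name:
        print("⚠️ GPU auto-detection failed; diagnosis will be skipped")
        return None
    return GPU_SPECS_DATABASE.get(gpu_name)  # assumes an exact-match lookup

# Caller side: disable the optimization pass rather than guessing the hardware.
specs = get_gpu_specs()
if specs is None:
    run_diagnosis = False
```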


# GPU specifications database
# Sources: NVIDIA official specifications, manufacturer datasheets
GPU_SPECS_DATABASE = {
Contributor:

Can we move this const to its own file? It makes module overriding easier

Comment on lines +158 to +109
gpu_name_lower = gpu_name.lower()
for key, specs in GPU_SPECS_DATABASE.items():
    key_lower = key.lower()
    # Check if either name contains the other
    if gpu_name_lower in key_lower or key_lower in gpu_name_lower:
        print(f"ℹ️ Matched '{gpu_name}' to '{key}' (fuzzy match)")
        return specs.copy()
Contributor:

Curious if you've encountered this case before?

Contributor (Author):

Sometimes I'll just put "a100" or "h100" in my optimization workflow. What do you think, should we force the GPU name input to match exactly?

Contributor:

Enum it
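
A minimal sketch of the enum idea, assuming the four GPUs listed in the PR description; the enum values are illustrative and may not match the actual database keys:

```python
# Hypothetical sketch of an enum over supported GPU names; values are illustrative.
from enum import Enum

class SupportedGPU(str, Enum):
    A100 = "NVIDIA A100"
    H100 = "NVIDIA H100"
    RTX_4090 = "NVIDIA GeForce RTX 4090"
    RTX_5080 = "NVIDIA GeForce RTX 5080"

# Forces callers to name a known GPU instead of relying on fuzzy substring matching.
specs = GPU_SPECS_DATABASE[SupportedGPU.H100.value]
```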

Metric definitions are in metric_schema.py.
"""

from typing import Any, Callable, Dict, List, Optional, Tuple
Contributor:

😦


for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key, "N/A")
    lines.append(f"- **{label}:** {value}{unit}")
Contributor:

Do we want the unit if the value is N/A?
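
One possible tweak for the N/A case, sketched against the loop above (`GPU_SPEC_FIELDS`, `gpu_specs`, and `lines` come from that snippet); not a required change:

```python
# Hypothetical: only append the unit when a real value is present.
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key)
    if value is None:
        lines.append(f"- **{label}:** N/A")
    else:
        lines.append(f"- **{label}:** {value}{unit}")
```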

raise ValueError("NCU metrics are empty - cannot build judge prompt")

# Extract first kernel's metrics for the metric getter
first_kernel = list(ncu_metrics.values())[0] if ncu_metrics else {}
Contributor:

We check for empty above, so the `else {}` guard here is redundant.

except json.JSONDecodeError:
    pass

# Strategy 2: Find first { ... } block with "bottleneck_1" field
Contributor:

Are strategy 2/3 typically encountered?

Not required for this PR, but we can look into forcing the LLM providers to return structured output

Contributor (Author):

Not really, actually. All my experiments return the dual-bottleneck analysis.

) and _validate_bottleneck_entry(analysis["bottleneck_2"])


VALID_CATEGORIES = frozenset(
Contributor:

frozenset isn't wrong, but we're just using it as a lookup so normal set is fine

@kaiming-cheng force-pushed the kaiming/opt_component_3 branch from ac11151 to bfa2fd0 on January 15, 2026 19:21
Kaiming Cheng added 22 commits January 15, 2026 11:44
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a
streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios

- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)

- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
  bench = Benchmark(logger, temp_dir, lock, worker_id)
  pytorch_result = bench.benchmark_pytorch(problem_file)
  kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
  speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
@kaiming-cheng force-pushed the kaiming/opt_component_3 branch from af9b7af to e2c599e on January 15, 2026 19:48
@Jack-Khuu (Contributor) left a comment:

stamp to unblock

    # Take only the first GPU (nvidia-smi returns one line per GPU)
    gpu_name = result.stdout.strip().split("\n")[0].strip()
    return gpu_name
except (subprocess.TimeoutExpired, FileNotFoundError, Exception):
Contributor:

We might want to drop the try/catch here, it feels a tad too cautious.

If it errors out, there's bigger problems

Comment on lines 114 to 118
if python_executable is None:
    python_executable = sys.executable

if ncu_bin is None:
    ncu_bin = shutil.which("ncu") or "/usr/local/cuda/bin/ncu"
Contributor:

Set defaults in function signature
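
A sketch of that suggestion; the function name is an assumption, and note that the defaults are then evaluated once at import time, which is fine for `sys.executable` but fixes the `ncu` lookup per process:

```python
# Hypothetical signature with the defaults pulled out of the function body.
import shutil
import sys

def run_ncu_profile(
    python_executable: str = sys.executable,
    ncu_bin: str = shutil.which("ncu") or "/usr/local/cuda/bin/ncu",
) -> None:
    ...
```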

Comment on lines 203 to 206
except subprocess.TimeoutExpired:
    raise RuntimeError(f"NCU profiling timed out after {timeout} seconds")
except Exception as e:
    raise RuntimeError(f"NCU profiling failed: {e}")
Contributor:

  1. Wrap the timeout handling just around the subprocess call (sketched below)

  2. Is the second exception clause a redundant pass-through?
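
A minimal sketch of point 1, assuming `cmd` and `timeout` from the surrounding function; not the code in this PR:

```python
# Hypothetical: keep the try/except tight around the subprocess call only.
import subprocess

try:
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
except subprocess.TimeoutExpired:
    raise RuntimeError(f"NCU profiling timed out after {timeout} seconds")

# Parsing of `result` happens outside the narrow try block.
```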

Comment on lines 223 to 227
if df.empty:
    return df

if len(df) == 1:
    return df
Contributor:

Suggested change:
- if df.empty:
-     return df
-
- if len(df) == 1:
-     return df
+ if len(df) <= 1:
+     return df

extra_keep: Optional[Sequence[str]] = ("Kernel Name",),
coerce_numeric: bool = True,
name_list: Optional[Sequence[str]] = None,
select: Union[str, MetricSelectionPolicy] = MetricSelectionPolicy.LAST,
Contributor:

Why accept both a string and the enum?

Comment on lines 351 to 358
# Filter by kernel name list if provided
if name_list:
    sub = _filter_by_kernel_names(sub, name_list, policy, keep_cols)
else:
    # Apply selection to all rows if no name filter
    sub = _apply_selection_policy(sub, policy)

return sub
Contributor:

Suggested change:
- # Filter by kernel name list if provided
- if name_list:
-     sub = _filter_by_kernel_names(sub, name_list, policy, keep_cols)
- else:
-     # Apply selection to all rows if no name filter
-     sub = _apply_selection_policy(sub, policy)
-
- return sub
+ return (
+     _filter_by_kernel_names(sub, name_list, policy, keep_cols)
+     if name_list
+     else _apply_selection_policy(sub, policy)
+ )

Comment on lines 291 to 296
dtype_map = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
}
dtype = dtype_map[args.dtype]
Contributor:

Suggested change:
- dtype_map = {
-     "float32": torch.float32,
-     "float16": torch.float16,
-     "bfloat16": torch.bfloat16,
- }
- dtype = dtype_map[args.dtype]
+ dtype = {
+     "float32": torch.float32,
+     "float16": torch.float16,
+     "bfloat16": torch.bfloat16,
+ }[args.dtype]


def render_ncu_metrics(
ncu_metrics: dict[str, Any],
Contributor:

ncu_metrics not used?

def get_metric(key: str, default: str = "N/A") -> str:
    val = kernel_metrics.get(key, default)
    if isinstance(val, (int, float)):
        return f"{val:.2f}"
Contributor:

For ints: render as int.

For floats: keep .2f for pct-style metrics, but consider scientific notation or “humanized” units for huge values (even just "{val:.3g}" is often better for prompts).

Alternatively, keep raw strings from the profiler and don’t reformat here.
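
A minimal sketch of that formatting split; not the code in this PR, and the extra `kernel_metrics` parameter is only there to keep the sketch self-contained:

```python
# Hypothetical reworking of the formatter discussed above.
from typing import Any, Mapping

def get_metric(kernel_metrics: Mapping[str, Any], key: str, default: str = "N/A") -> str:
    val = kernel_metrics.get(key, default)
    if isinstance(val, bool):   # bool is a subclass of int, so check it first
        return str(val)
    if isinstance(val, int):
        return str(val)         # ints render without a decimal point
    if isinstance(val, float):
        return f"{val:.3g}"     # compact form for both percentages and huge counters
    return str(val)             # leave profiler-provided strings untouched
```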

"memory_gb": 16,
"memory_type": "GDDR7",
},
}
Contributor:

GPU specs table: "A100/H100" are multi-SKU; peak BW/TFLOPs and memory size can be wrong under fuzzy match/fallback (e.g., A100 80GB → 40GB). Can we avoid the silent A100 fallback and add a match_type/matched_name field (or split common SKUs)?
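
A minimal sketch of the match-metadata idea; the field names and the `detected_name` variable are illustrative assumptions:

```python
# Hypothetical: record how the spec was chosen so downstream prompts can flag uncertainty.
specs = GPU_SPECS_DATABASE["NVIDIA A100"].copy()
specs["matched_name"] = detected_name  # raw string reported by nvidia-smi (assumed variable)
specs["match_type"] = "fuzzy"          # one of "exact", "fuzzy", "fallback"
```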
