[Optimization 3/n] Add Diagnosis Module (Prompt Builder for Hardware Bottleneck) #73
base: main
Conversation
# Return default if detection failed
if gpu_name is None:
    print("⚠️ GPU auto-detection failed, using A100 specs as fallback")
    return GPU_SPECS_DATABASE["NVIDIA A100"].copy()
Should we fall back to A100? Or does returning an empty dict make more sense?
I agree returning an empty dict is cleaner, but it will also lead to a KeyError in the optimization flow. Should we decide to disable the optimization if no gpu_name is found?
I think that makes sense; if there are setup/detection issues then we shouldn't optimize
Yeah, explicit failure makes sense (return None or {} and let the caller disable diagnosis/optimization).
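A minimal sketch of that explicit-failure behavior (the function name and signature here are assumed, not taken from the PR):

```python
from typing import Optional

def get_gpu_specs(gpu_name: Optional[str]) -> dict:
    # Sketch: surface detection failure instead of silently assuming A100.
    if gpu_name is None:
        print("⚠️ GPU auto-detection failed; skipping hardware diagnosis")
        return {}  # caller treats an empty dict as "disable diagnosis/optimization"
    ...  # normal lookup path continues here
```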
# GPU specifications database
# Sources: NVIDIA official specifications, manufacturer datasheets
GPU_SPECS_DATABASE = {
Can we move this const to its own file? It makes module overriding easier
gpu_name_lower = gpu_name.lower()
for key, specs in GPU_SPECS_DATABASE.items():
    key_lower = key.lower()
    # Check if either name contains the other
    if gpu_name_lower in key_lower or key_lower in gpu_name_lower:
        print(f"ℹ️ Matched '{gpu_name}' to '{key}' (fuzzy match)")
        return specs.copy()
Curious if you've encountered this case before?
Sometimes I'll just put 'a100' or 'h100' in my optimization workflow. What do you think, should we force the GPU name input to match exactly?
Enum it
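A minimal sketch of the enum approach (the member list is illustrative; the real database has more entries):

```python
from enum import Enum

class SupportedGPU(str, Enum):
    A100 = "NVIDIA A100"
    H100 = "NVIDIA H100"

def resolve_gpu(name: str) -> SupportedGPU:
    # "a100" / "H100" map cleanly; anything else fails loudly instead of fuzzy-matching.
    try:
        return SupportedGPU[name.strip().upper()]
    except KeyError:
        raise ValueError(f"Unsupported GPU name: {name!r}") from None
```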
Metric definitions are in metric_schema.py.
"""

from typing import Any, Callable, Dict, List, Optional, Tuple
😦
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key, "N/A")
    lines.append(f"- **{label}:** {value}{unit}")
Do we want the unit if the value is N/A?
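One way to drop the unit for missing values, as a sketch against the loop above:

```python
for label, key, unit in GPU_SPEC_FIELDS:
    value = gpu_specs.get(key)
    # Only append the unit when there is an actual value to attach it to.
    rendered = f"{value}{unit}" if value is not None else "N/A"
    lines.append(f"- **{label}:** {rendered}")
```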
raise ValueError("NCU metrics are empty - cannot build judge prompt")

# Extract first kernel's metrics for the metric getter
first_kernel = list(ncu_metrics.values())[0] if ncu_metrics else {}
We already check for empty above, so the trailing `if ncu_metrics else {}` guard is redundant.
except json.JSONDecodeError:
    pass

# Strategy 2: Find first { ... } block with "bottleneck_1" field
Are strategies 2/3 typically encountered?
Not required for this PR, but we could look into forcing the LLM providers to return structured output.
Not really, actually. All my experiments return the dual-bottleneck analysis.
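For reference, a Strategy-2 style fallback could look roughly like this (purely illustrative; the PR's actual implementation may differ):

```python
import json
from typing import Optional

def extract_bottleneck_json(text: str) -> Optional[dict]:
    # Scan for the first balanced {...} block that mentions "bottleneck_1".
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    candidate = text[start : i + 1]
                    if '"bottleneck_1"' in candidate:
                        try:
                            return json.loads(candidate)
                        except json.JSONDecodeError:
                            pass
                    break
        start = text.find("{", start + 1)
    return None
```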
) and _validate_bottleneck_entry(analysis["bottleneck_2"])


VALID_CATEGORIES = frozenset(
frozenset isn't wrong, but we're just using it as a lookup, so a normal set is fine.
Force-pushed ac11151 to bfa2fd0
Consolidates previous kernel_benchmark.py and pytorch_benchmark.py into a streamlined 3-file architecture with clear separation of concerns:

Architecture:
- benchmark.py (299 lines): Main Benchmark class with simplified API
  - benchmark_kernel(): Always uses subprocess for crash protection
  - benchmark_pytorch(): Always uses direct mode for stable code
  - BenchmarkLockManager: GPU lock management for multi-worker scenarios
- timing.py (437 lines): Complete timing infrastructure
  - Timing: time_with_cuda_events(), time_with_triton_do_bench()
  - Loading: prepare_pytorch_model(), load_kernel_function()
  - Stats: compute_timing_stats() with essential metrics (mean/std/min/max)
- kernel_subprocess.py (442 lines): Subprocess runner for kernel isolation
  - Crash protection for potentially buggy kernels
  - Clean CUDA state between runs
  - Timeout handling

Key improvements:
- Eliminated string code generation (was generating Python as strings)
- Removed unnecessary statistics (median, p25/p75/p95/p99)
- Removed confusing use_subprocess parameter (behavior now deterministic)
- Fixed dtype bug causing incorrect speedup measurements
- Reduced from 5 files to 3 files with clearer naming
- Code reduction: ~1,400 lines → 1,178 lines

Simple API:
bench = Benchmark(logger, temp_dir, lock, worker_id)
pytorch_result = bench.benchmark_pytorch(problem_file)
kernel_result = bench.benchmark_kernel(kernel_file, problem_file)
speedup = pytorch_result['stats']['mean'] / kernel_result['time_ms']
Force-pushed af9b7af to e2c599e
Jack-Khuu left a comment:
stamp to unblock
    # Take only the first GPU (nvidia-smi returns one line per GPU)
    gpu_name = result.stdout.strip().split("\n")[0].strip()
    return gpu_name
except (subprocess.TimeoutExpired, FileNotFoundError, Exception):
We might want to drop the try/except here; it feels a tad too cautious.
If it errors out, there are bigger problems.
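If any handling stays, a narrower variant might look like this sketch (the exact nvidia-smi invocation is assumed here, not copied from the diff; note that catching Exception alongside the specific exceptions makes them redundant):

```python
import subprocess
from typing import Optional

def detect_gpu_name() -> Optional[str]:
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Only the failure modes we actually expect; anything else should surface.
        return None
    # Take only the first GPU (nvidia-smi returns one line per GPU)
    return result.stdout.strip().split("\n")[0].strip()
```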
if python_executable is None:
    python_executable = sys.executable

if ncu_bin is None:
    ncu_bin = shutil.which("ncu") or "/usr/local/cuda/bin/ncu"
Set defaults in function signature
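A sketch of what that could look like (the function name and first parameter are hypothetical). One caveat: signature defaults are evaluated once at import time, which is why the None-sentinel pattern is sometimes kept for shutil.which:

```python
import shutil
import sys

def run_ncu(
    script_path: str,
    python_executable: str = sys.executable,
    ncu_bin: str = shutil.which("ncu") or "/usr/local/cuda/bin/ncu",
) -> None:
    # Defaults above are resolved at import time; if ncu can appear on PATH
    # later in the process lifetime, keep the None sentinel instead.
    ...
```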
except subprocess.TimeoutExpired:
    raise RuntimeError(f"NCU profiling timed out after {timeout} seconds")
except Exception as e:
    raise RuntimeError(f"NCU profiling failed: {e}")
- Wrap the timeout handling just around the subprocess call.
- Isn't the second exception check a redundant pass-through?
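A sketch of scoping the handler to just the call that can time out (names are illustrative); other exceptions then propagate with their original traceback instead of being re-wrapped:

```python
import subprocess

def profile_with_ncu(cmd: list[str], timeout: int) -> subprocess.CompletedProcess:
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired as e:
        raise RuntimeError(f"NCU profiling timed out after {timeout} seconds") from e
```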
if df.empty:
    return df

if len(df) == 1:
    return df
Suggested change:
if len(df) <= 1:
    return df
extra_keep: Optional[Sequence[str]] = ("Kernel Name",),
coerce_numeric: bool = True,
name_list: Optional[Sequence[str]] = None,
select: Union[str, MetricSelectionPolicy] = MetricSelectionPolicy.LAST,
Why accept both a string and the enum?
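If both are kept for caller convenience, one option is to normalize once at the top of the function so the rest of the code is enum-only (this sketch assumes MetricSelectionPolicy is a str-valued Enum):

```python
# Normalize the user-facing Union[str, MetricSelectionPolicy] to the enum.
policy = MetricSelectionPolicy(select) if isinstance(select, str) else select
```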
# Filter by kernel name list if provided
if name_list:
    sub = _filter_by_kernel_names(sub, name_list, policy, keep_cols)
else:
    # Apply selection to all rows if no name filter
    sub = _apply_selection_policy(sub, policy)

return sub
Suggested change:
return (
    _filter_by_kernel_names(sub, name_list, policy, keep_cols)
    if name_list
    else _apply_selection_policy(sub, policy)
)
dtype_map = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
}
dtype = dtype_map[args.dtype]
Suggested change:
dtype = {
    "float32": torch.float32,
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
}[args.dtype]
triton_kernel_agent/opt_worker_component/profiling/ncu_wrapper_template.j2
def render_ncu_metrics(
    ncu_metrics: dict[str, Any],
ncu_metrics not used?
def get_metric(key: str, default: str = "N/A") -> str:
    val = kernel_metrics.get(key, default)
    if isinstance(val, (int, float)):
        return f"{val:.2f}"
- For ints: render as int.
- For floats: keep .2f for pct-style metrics, but consider scientific notation or "humanized" units for huge values (even just "{val:.3g}" is often better for prompts).
- Alternatively, keep the raw strings from the profiler and don't reformat here.
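A sketch of the int/float split (the threshold and the .3g fallback are illustrative, and kernel_metrics is assumed to be in scope as in the snippet above):

```python
def get_metric(key: str, default: str = "N/A") -> str:
    val = kernel_metrics.get(key, default)
    if isinstance(val, bool):
        return str(val)
    if isinstance(val, int):
        return str(val)  # ints stay exact
    if isinstance(val, float):
        # .2f reads well for percentage-style metrics; .3g keeps huge values compact.
        return f"{val:.2f}" if abs(val) < 1e4 else f"{val:.3g}"
    return str(val)
```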
| "memory_gb": 16, | ||
| "memory_type": "GDDR7", | ||
| }, | ||
| } |
GPU specs table: "A100"/"H100" are multi-SKU, so peak BW/TFLOPs and memory size can be wrong under fuzzy match/fallback (e.g., A100 80GB → 40GB). Can we avoid the silent A100 fallback and add a match_type/matched_name field (or split out the common SKUs)?
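A sketch of surfacing match metadata instead of falling back silently (the matched_name/match_type field names are illustrative, not from the PR):

```python
def lookup_gpu_specs(gpu_name: str) -> dict:
    specs = GPU_SPECS_DATABASE.get(gpu_name)
    if specs is not None:
        return {**specs, "matched_name": gpu_name, "match_type": "exact"}
    for key, candidate in GPU_SPECS_DATABASE.items():
        if gpu_name.lower() in key.lower() or key.lower() in gpu_name.lower():
            return {**candidate, "matched_name": key, "match_type": "fuzzy"}
    return {}  # explicit failure: caller can disable diagnosis/optimization
```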
This PR introduces a diagnose module building GPU performance analysis prompts. The module provides GPU hardware specification lookup, NCU metric schema definitions, and composable prompt section rendering for bottleneck analysis.

Core Components
1. **MetricSchema** (metric_schema.py)
2. **GPU Specs** (gpu_specs.py)
3. **Judger Prompts** (judger_prompts.py)

Example Usage
Detected GPU: NVIDIA H100
Using specs for: NVIDIA H100 (Hopper)
You are a senior GPU performance engineer. Analyze the target GPU spec, the current kernel, and the Nsight Compute (NCU) profiling metrics. Identify EXACTLY TWO DISTINCT bottlenecks from the hardware profiling data, and propose specific optimization methods for each. Be surgical and metrics-driven.
......