Comprehensive runtime instrumentation for vLLM servers to capture CUDA graph and KV cache metrics.
- ✅ CUDA Graph Tracking: Capture unique graphs, replay counts, and latencies
- ✅ KV Cache Profiling: Block allocations, usage patterns, and eviction metrics
- ✅ Full BatchDescriptor Tracking: See exact graph configurations
- ✅ Automatic CSV Export: Easy analysis with pandas/Excel
- ✅ Zero Code Changes: Works via Python import hooks
- ✅ Production-Ready: Minimal overhead (<1%)
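How the zero-code-change part works: CPython's `site` module imports a module named `sitecustomize` at interpreter startup if one is importable from `sys.path`, so putting profilemate's directory on `PYTHONPATH` runs the instrumentation before vLLM's entrypoint executes. A minimal sketch of the idea (illustrative, not the actual profilemate source):

```python
# sitecustomize.py -- minimal sketch; the real file installs the profilers.
# Python imports this module automatically at startup when its directory is
# on PYTHONPATH, which is why no vLLM code changes are needed.
print("[sitecustomize] vLLM Comprehensive Instrumentation Loaded")

# From here the real module monkey-patches vLLM's CUDA graph and KV cache
# classes; the custom instrumentation example later in this guide shows the pattern.
```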
Quick start:

```bash
cd profilemate
export PYTHONPATH="$(pwd):$PYTHONPATH"

python -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --tensor-parallel-size 4 \
    --port 9999
```

Then inspect the session output:

```bash
ls /tmp/vllm_profiling/session_*/
```

After running, you'll find:
```
/tmp/vllm_profiling/session_20260124_123456/
├── metadata.json                 # Session info
│
├── CUDA Graph Files:
│   ├── cuda_graph_captures.csv   # Unique graphs captured
│   ├── cuda_graph_usage.csv      # Replay frequency per graph
│   └── cuda_graph_timeline.csv   # Detailed replay timeline
│
└── KV Cache Files:
    ├── kv_cache_usage.csv        # Usage over time
    ├── kv_cache_evictions.csv    # Eviction events
    └── kv_cache_summary.txt      # Summary statistics
```
```bash
# Change output location (default: /tmp/vllm_profiling)
export VLLM_PROFILING_DIR="/custom/path"

# Enable verbose logging
export VLLM_PROFILING_VERBOSE=1

# Adjust logging interval (default: 100 operations)
export VLLM_PROFILING_LOG_INTERVAL=50
```
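These variables are presumably consumed along these lines (an assumed sketch; check `ProfilingConfig` in `sitecustomize.py` for the authoritative defaults):

```python
# Assumed sketch of how sitecustomize reads the variables above; the
# defaults (/tmp/vllm_profiling, interval 100) are the documented ones.
import os

PROFILING_DIR = os.environ.get("VLLM_PROFILING_DIR", "/tmp/vllm_profiling")
VERBOSE = os.environ.get("VLLM_PROFILING_VERBOSE", "0") == "1"
LOG_INTERVAL = int(os.environ.get("VLLM_PROFILING_LOG_INTERVAL", "100"))
```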
To toggle individual trackers, edit `sitecustomize.py`:

```python
class ProfilingConfig:
    ENABLE_CUDA_GRAPH_TRACKING = True   # Set to False to disable
    ENABLE_KV_CACHE_TRACKING = True     # Set to False to disable
```

`cuda_graph_captures.csv`:
```csv
runtime_mode,num_tokens,num_reqs,uniform,has_lora,capture_time_sec
FULL,256,128,True,False,2.345
FULL,512,256,True,False,3.456
PIECEWISE,1024,None,False,False,4.567
```

What it tells you:
- Which unique CUDA graphs were created
- When each graph was captured (relative to start)
- Full BatchDescriptor configuration
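A quick way to summarize the captures with pandas (illustrative sketch; column names as shown above):

```python
import pandas as pd

caps = pd.read_csv('cuda_graph_captures.csv')
print(len(caps), "unique graphs captured")
print(caps.groupby('runtime_mode').size())            # graphs per runtime mode
print("last capture at", caps['capture_time_sec'].max(), "s")
```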
`cuda_graph_usage.csv`:
```csv
runtime_mode,num_tokens,num_reqs,uniform,has_lora,replay_count
FULL,256,128,True,False,5432
FULL,512,256,True,False,3210
PIECEWISE,1024,None,False,False,876
```

What it tells you:
- How often each unique graph was replayed
- Which graphs are "hot" (frequently used)
- Distribution of workload across graphs
Key insights:
- If one graph dominates → workload is uniform
- If many graphs are used → workload is diverse
- Compare with `--cudagraph-metrics` aggregated stats
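To check dominance directly, as described above (a small pandas sketch):

```python
import pandas as pd

usage = pd.read_csv('cuda_graph_usage.csv')
top_share = usage['replay_count'].max() / usage['replay_count'].sum()
print(f"Hottest graph handles {top_share:.1%} of all replays")
# Close to 100% → uniform workload; small → diverse workload
```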
`kv_cache_usage.csv`:
```csv
timestamp_sec,usage_pct,num_blocks,total_blocks
0.123,15.30,625,4096
0.456,32.45,1329,4096
0.789,58.12,2381,4096
```

What it tells you:
- Cache utilization over time
- Peak usage patterns
- Whether you're over/under-provisioned
Optimal ranges:
- 60-80%: Good balance
- <40%: Over-provisioned, reduce `max_model_len`
- >90%: Under-provisioned, increase `gpu_memory_utilization`
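Applying those thresholds programmatically (a sketch over the CSV above):

```python
import pandas as pd

kv = pd.read_csv('kv_cache_usage.csv')
peak = kv['usage_pct'].max()
if peak < 40:
    print(f"Peak {peak:.1f}%: over-provisioned, consider reducing max_model_len")
elif peak > 90:
    print(f"Peak {peak:.1f}%: under-provisioned, consider raising gpu_memory_utilization")
else:
    print(f"Peak {peak:.1f}%: healthy range")
```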
`kv_cache_evictions.csv`:
```csv
timestamp_sec,lifetime_sec,idle_sec
1.234,12.34,3.45
2.345,8.76,2.10
```

What it tells you:
- How long blocks lived before eviction
- How long blocks sat idle
- Cache churn rate
Healthy metrics:
- `lifetime_sec > 10`: Blocks are well-utilized
- `idle_sec < 5`: Efficient eviction
- Few evictions overall: Good cache sizing
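Checking a session against these targets (sketch):

```python
import pandas as pd

ev = pd.read_csv('kv_cache_evictions.csv')
print(f"{len(ev)} evictions total")
print(f"mean lifetime: {ev['lifetime_sec'].mean():.2f}s (healthy: > 10)")
print(f"mean idle:     {ev['idle_sec'].mean():.2f}s (healthy: < 5)")
```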
Analyze with pandas:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load CUDA graph usage
graphs = pd.read_csv('cuda_graph_usage.csv')

# Find most-used graphs
top_graphs = graphs.nlargest(10, 'replay_count')
print("Top 10 CUDA Graphs:")
print(top_graphs[['num_tokens', 'replay_count']])

# Load KV cache usage
kv_cache = pd.read_csv('kv_cache_usage.csv')

# Plot cache usage over time
plt.figure(figsize=(12, 6))
plt.plot(kv_cache['timestamp_sec'], kv_cache['usage_pct'])
plt.xlabel('Time (seconds)')
plt.ylabel('KV Cache Usage (%)')
plt.title('KV Cache Usage Over Time')
plt.grid(True)
plt.savefig('kv_cache_usage.png')
```

Or from the shell:

```bash
# Count unique CUDA graphs (subtract 1 for the header row)
wc -l cuda_graph_captures.csv

# Find the most-used graphs (replay_count is column 6)
sort -t',' -k6 -rn cuda_graph_usage.csv | head -5

# Calculate average KV cache usage
awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' kv_cache_usage.csv

# Check peak usage
sort -t',' -k2 -rn kv_cache_usage.csv | head -1
```

| Feature | sitecustomize.py | --cudagraph-metrics | --kv-cache-metrics |
|---|---|---|---|
| Unique CUDA graphs | ✅ Full details | ❌ Aggregated | N/A |
| Graph replay counts | ✅ Per graph | ❌ Aggregated | N/A |
| BatchDescriptor details | ✅ Complete | ❌ Partial | N/A |
| KV cache usage | ✅ Timeline | N/A | ✅ Sampled |
| Block allocations | ✅ Total count | N/A | ✅ Sampled |
| Block evictions | ✅ All events | N/A | ✅ Sampled |
| Output format | CSV (easy analysis) | Logs | Prometheus |
| Overhead | <1% | <0.1% | <1% |
Recommendation: Use both for comprehensive profiling:

```bash
export PYTHONPATH="/path/to/profilemate:$PYTHONPATH"

python -m vllm.entrypoints.openai.api_server \
    --model <model> \
    --cudagraph-metrics \
    --kv-cache-metrics \
    --port 9999
```

If profiling output doesn't appear, check:
- Verify `PYTHONPATH` is set correctly: `echo $PYTHONPATH`
- Check whether sitecustomize loaded: `python -c "import sys; print('sitecustomize' in sys.modules)"`
- Look for the startup message: `[sitecustomize] vLLM Comprehensive Instrumentation Loaded`
Solution: Ensure vLLM is installed:

```bash
pip list | grep vllm
```

Solution: Change the output directory:

```bash
export VLLM_PROFILING_DIR="$HOME/vllm_profiling"
```

Solution: Log less frequently:
```python
# Edit sitecustomize.py
class ProfilingConfig:
    LOG_INTERVAL = 1000  # Log less frequently
```

Goal: See which batch sizes are actually used
```bash
# Run with profiling
python -m vllm.entrypoints.openai.api_server --model gpt2

# Analyze results: batch-size distribution (skip the CSV header)
tail -n +2 cuda_graph_usage.csv | cut -d',' -f2 | sort -n | uniq -c
```

Goal: Determine if `max_model_len` is too large
```bash
# Run with a conservative setting
--max-model-len 2048

# Check peak usage
sort -t',' -k2 -rn kv_cache_usage.csv | head -1

# If peak < 60%, reduce max_model_len
# If peak > 90%, increase gpu_memory_utilization
```

Goal: Ensure most requests use CUDA graphs
```bash
# Compare CUDA graph replays vs. total requests
total_replays=$(awk -F',' 'NR>1 {sum+=$6} END {print sum}' cuda_graph_usage.csv)
echo "Total CUDA graph replays: $total_replays"

# A high number → good CUDA graph coverage
```

Goal: Verify prefix caching is effective
```bash
# Look for block reuse in the eviction data:
# short lifetime + many accesses = good sharing
awk -F',' 'NR>1 && $2<10 {count++} END {print count " blocks evicted quickly"}' \
    kv_cache_evictions.csv
```

Edit `sitecustomize.py` to add custom tracking:
```python
import time

def patch_custom_component():
    """Add your own instrumentation."""
    try:
        from vllm.custom.module import CustomClass  # your target class

        original_method = CustomClass.method

        def instrumented_method(self, *args, **kwargs):
            # Your tracking code here
            start = time.time()
            result = original_method(self, *args, **kwargs)
            duration = time.time() - start
            # Log or save metrics
            print(f"Custom metric: {duration}")
            return result

        CustomClass.method = instrumented_method
    except ImportError:
        pass
```
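One optional refinement (an assumption about your needs, not required by the hook): wrap the replacement with `functools.wraps` so the patched method keeps its name and docstring for anything that introspects it:

```python
import functools
import time

def make_instrumented(original_method):
    # functools.wraps preserves __name__/__doc__ of the original method
    @functools.wraps(original_method)
    def instrumented(self, *args, **kwargs):
        start = time.time()
        try:
            return original_method(self, *args, **kwargs)
        finally:
            print(f"Custom metric: {time.time() - start:.6f}")
    return instrumented

# Usage: CustomClass.method = make_instrumented(CustomClass.method)
```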
Export to Prometheus format:

```python
# convert_to_prometheus.py
import pandas as pd

kv_cache = pd.read_csv('kv_cache_usage.csv')

# Generate Prometheus metrics
for _, row in kv_cache.iterrows():
    print(f'vllm_kv_cache_usage{{timestamp="{row["timestamp_sec"]}"}} {row["usage_pct"]}')
```

- KV_CACHE_GUIDE.md: Deep dive into KV cache architecture
- CUDA_GRAPHS.md: CUDA graph metrics and tracking
- sitecustomize.py: Source code with inline comments
To add new metrics or improve tracking:
- Edit `sitecustomize.py`
- Add a profiler class (see `CUDAGraphProfiler` or `KVCacheProfiler`)
- Create a patch function (see the sketch after this list)
- Register it in `install_import_hook()`
- Test and document
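A hypothetical skeleton of those steps (every name except `install_import_hook`, `CUDAGraphProfiler`, and `KVCacheProfiler` is invented for illustration):

```python
import time

class MyProfiler:
    """Plays the same role as CUDAGraphProfiler/KVCacheProfiler."""
    def __init__(self):
        self.durations = []

    def record(self, duration):
        self.durations.append(duration)

my_profiler = MyProfiler()

def patch_my_component():
    try:
        from vllm.some.module import SomeClass  # hypothetical target
    except ImportError:
        return  # target not present in this vLLM version

    original = SomeClass.forward

    def wrapped(self, *args, **kwargs):
        start = time.time()
        result = original(self, *args, **kwargs)
        my_profiler.record(time.time() - start)
        return result

    SomeClass.forward = wrapped

# Finally, call patch_my_component() from install_import_hook().
```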
Same as vLLM project (Apache 2.0)
Built for analyzing vLLM performance in MLPerf inference benchmarks.