Comprehensive runtime instrumentation for vLLM servers to capture CUDA graph and KV cache metrics.
- ✅ CUDA Graph Tracking: Capture unique graphs, replay counts, and latencies
- ✅ KV Cache Profiling: Block allocations, usage patterns, and eviction metrics
- ✅ Full BatchDescriptor Tracking: See exact graph configurations
- ✅ Automatic CSV Export: Easy analysis with pandas/Excel
- ✅ Zero Code Changes: Works via Python import hooks
- ✅ Production-Ready: Minimal overhead (<1%)
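How the zero-code-change part works: CPython's `site` module imports a module named `sitecustomize` at interpreter startup if one is importable from `sys.path`, so putting profilemate's directory on `PYTHONPATH` runs the instrumentation before vLLM's entrypoint executes. A minimal sketch of the idea (illustrative, not the actual profilemate source):

```python
# sitecustomize.py -- minimal sketch; the real file installs the profilers.
# Python imports this module automatically at startup when its directory is
# on PYTHONPATH, which is why no vLLM code changes are needed.
print("[sitecustomize] vLLM Comprehensive Instrumentation Loaded")

# From here the real module monkey-patches vLLM's CUDA graph and KV cache
# classes; the custom instrumentation example later in this guide shows the pattern.
```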
Quick start:

```bash
cd profilemate
export PYTHONPATH="$(pwd):$PYTHONPATH"

python -m vllm.entrypoints.openai.api_server \
    --model openai/gpt-oss-120b \
    --tensor-parallel-size 4 \
    --port 9999
```

Then inspect the session output:

```bash
ls /tmp/vllm_profiling/session_*/
```

After running, you'll find:
```
/tmp/vllm_profiling/session_20260124_123456/
├── metadata.json                 # Session info
│
├── CUDA Graph Files:
│   ├── cuda_graph_captures.csv   # Unique graphs captured
│   ├── cuda_graph_usage.csv      # Replay frequency per graph
│   └── cuda_graph_timeline.csv   # Detailed replay timeline
│
└── KV Cache Files:
    ├── kv_cache_usage.csv        # Usage over time
    ├── kv_cache_evictions.csv    # Eviction events
    └── kv_cache_summary.txt      # Summary statistics
```
```bash
# Change output location (default: /tmp/vllm_profiling)
export VLLM_PROFILING_DIR="/custom/path"

# Enable verbose logging
export VLLM_PROFILING_VERBOSE=1

# Adjust logging interval (default: 100 operations)
export VLLM_PROFILING_LOG_INTERVAL=50
```
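These variables are presumably consumed along these lines (an assumed sketch; check `ProfilingConfig` in `sitecustomize.py` for the authoritative defaults):

```python
# Assumed sketch of how sitecustomize reads the variables above; the
# defaults (/tmp/vllm_profiling, interval 100) are the documented ones.
import os

PROFILING_DIR = os.environ.get("VLLM_PROFILING_DIR", "/tmp/vllm_profiling")
VERBOSE = os.environ.get("VLLM_PROFILING_VERBOSE", "0") == "1"
LOG_INTERVAL = int(os.environ.get("VLLM_PROFILING_LOG_INTERVAL", "100"))
```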
To toggle individual trackers, edit `sitecustomize.py`:

```python
class ProfilingConfig:
    ENABLE_CUDA_GRAPH_TRACKING = True   # Set to False to disable
    ENABLE_KV_CACHE_TRACKING = True     # Set to False to disable
```

`cuda_graph_captures.csv`:
```csv
runtime_mode,num_tokens,num_reqs,uniform,has_lora,capture_time_sec
FULL,256,128,True,False,2.345
FULL,512,256,True,False,3.456
PIECEWISE,1024,None,False,False,4.567
```

What it tells you:
- Which unique CUDA graphs were created
- When each graph was captured (relative to start)
- Full BatchDescriptor configuration
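A quick way to summarize the captures with pandas (illustrative sketch; column names as shown above):

```python
import pandas as pd

caps = pd.read_csv('cuda_graph_captures.csv')
print(len(caps), "unique graphs captured")
print(caps.groupby('runtime_mode').size())            # graphs per runtime mode
print("last capture at", caps['capture_time_sec'].max(), "s")
```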
`cuda_graph_usage.csv`:
```csv
runtime_mode,num_tokens,num_reqs,uniform,has_lora,replay_count
FULL,256,128,True,False,5432
FULL,512,256,True,False,3210
PIECEWISE,1024,None,False,False,876
```

What it tells you:
- How often each unique graph was replayed
- Which graphs are "hot" (frequently used)
- Distribution of workload across graphs
Key insights:
- If one graph dominates → workload is uniform
- If many graphs are used → workload is diverse
- Compare with `--cudagraph-metrics` aggregated stats
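To check dominance directly, as described above (a small pandas sketch):

```python
import pandas as pd

usage = pd.read_csv('cuda_graph_usage.csv')
top_share = usage['replay_count'].max() / usage['replay_count'].sum()
print(f"Hottest graph handles {top_share:.1%} of all replays")
# Close to 100% → uniform workload; small → diverse workload
```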
`kv_cache_usage.csv`:
```csv
timestamp_sec,usage_pct,num_blocks,total_blocks
0.123,15.30,625,4096
0.456,32.45,1329,4096
0.789,58.12,2381,4096
```

What it tells you:
- Cache utilization over time
- Peak usage patterns
- Whether you're over/under-provisioned
Optimal ranges:
- 60-80%: Good balance
- <40%: Over-provisioned, reduce `max_model_len`
- >90%: Under-provisioned, increase `gpu_memory_utilization`
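Applying those thresholds programmatically (a sketch over the CSV above):

```python
import pandas as pd

kv = pd.read_csv('kv_cache_usage.csv')
peak = kv['usage_pct'].max()
if peak < 40:
    print(f"Peak {peak:.1f}%: over-provisioned, consider reducing max_model_len")
elif peak > 90:
    print(f"Peak {peak:.1f}%: under-provisioned, consider raising gpu_memory_utilization")
else:
    print(f"Peak {peak:.1f}%: healthy range")
```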
`kv_cache_evictions.csv`:
```csv
timestamp_sec,lifetime_sec,idle_sec
1.234,12.34,3.45
2.345,8.76,2.10
```

What it tells you:
- How long blocks lived before eviction
- How long blocks sat idle
- Cache churn rate
Healthy metrics:
- `lifetime_sec > 10`: Blocks are well-utilized
- `idle_sec < 5`: Efficient eviction
- Few evictions overall: Good cache sizing
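Checking a session against these targets (sketch):

```python
import pandas as pd

ev = pd.read_csv('kv_cache_evictions.csv')
print(f"{len(ev)} evictions total")
print(f"mean lifetime: {ev['lifetime_sec'].mean():.2f}s (healthy: > 10)")
print(f"mean idle:     {ev['idle_sec'].mean():.2f}s (healthy: < 5)")
```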
Analyze with pandas:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load CUDA graph usage
graphs = pd.read_csv('cuda_graph_usage.csv')

# Find most-used graphs
top_graphs = graphs.nlargest(10, 'replay_count')
print("Top 10 CUDA Graphs:")
print(top_graphs[['num_tokens', 'replay_count']])

# Load KV cache usage
kv_cache = pd.read_csv('kv_cache_usage.csv')

# Plot cache usage over time
plt.figure(figsize=(12, 6))
plt.plot(kv_cache['timestamp_sec'], kv_cache['usage_pct'])
plt.xlabel('Time (seconds)')
plt.ylabel('KV Cache Usage (%)')
plt.title('KV Cache Usage Over Time')
plt.grid(True)
plt.savefig('kv_cache_usage.png')
```

Or from the shell:

```bash
# Count unique CUDA graphs (subtract 1 for the header row)
wc -l cuda_graph_captures.csv

# Find the most-used graphs (replay_count is column 6)
sort -t',' -k6 -rn cuda_graph_usage.csv | head -5

# Calculate average KV cache usage
awk -F',' 'NR>1 {sum+=$2; count++} END {print sum/count}' kv_cache_usage.csv

# Check peak usage
sort -t',' -k2 -rn kv_cache_usage.csv | head -1
```

| Feature | sitecustomize.py | --cudagraph-metrics | --kv-cache-metrics |
|---|---|---|---|
| Unique CUDA graphs | ✅ Full details | ❌ Aggregated | N/A |
| Graph replay counts | ✅ Per graph | ❌ Aggregated | N/A |
| BatchDescriptor details | ✅ Complete | ❌ Partial | N/A |
| KV cache usage | ✅ Timeline | N/A | ✅ Sampled |
| Block allocations | ✅ Total count | N/A | ✅ Sampled |
| Block evictions | ✅ All events | N/A | ✅ Sampled |
| Output format | CSV (easy analysis) | Logs | Prometheus |
| Overhead | <1% | <0.1% | <1% |
Recommendation: Use both for comprehensive profiling:

```bash
export PYTHONPATH="/path/to/profilemate:$PYTHONPATH"

python -m vllm.entrypoints.openai.api_server \
    --model <model> \
    --cudagraph-metrics \
    --kv-cache-metrics \
    --port 9999
```

If profiling output doesn't appear, check:
- Verify `PYTHONPATH` is set correctly: `echo $PYTHONPATH`
- Check whether sitecustomize loaded: `python -c "import sys; print('sitecustomize' in sys.modules)"`
- Look for the startup message: `[sitecustomize] vLLM Comprehensive Instrumentation Loaded`
Solution: Ensure vLLM is installed:

```bash
pip list | grep vllm
```

Solution: Change the output directory:

```bash
export VLLM_PROFILING_DIR="$HOME/vllm_profiling"
```

Solution: Log less frequently:
```python
# Edit sitecustomize.py
class ProfilingConfig:
    LOG_INTERVAL = 1000  # Log less frequently
```

Goal: See which batch sizes are actually used
```bash
# Run with profiling
python -m vllm.entrypoints.openai.api_server --model gpt2

# Analyze results: batch-size distribution (skip the CSV header)
tail -n +2 cuda_graph_usage.csv | cut -d',' -f2 | sort -n | uniq -c
```

Goal: Determine if `max_model_len` is too large
```bash
# Run with a conservative setting
--max-model-len 2048

# Check peak usage
sort -t',' -k2 -rn kv_cache_usage.csv | head -1

# If peak < 60%, reduce max_model_len
# If peak > 90%, increase gpu_memory_utilization
```

Goal: Ensure most requests use CUDA graphs
```bash
# Compare CUDA graph replays vs. total requests
total_replays=$(awk -F',' 'NR>1 {sum+=$6} END {print sum}' cuda_graph_usage.csv)
echo "Total CUDA graph replays: $total_replays"

# A high number → good CUDA graph coverage
```

Goal: Verify prefix caching is effective
```bash
# Look for block reuse in the eviction data:
# short lifetime + many accesses = good sharing
awk -F',' 'NR>1 && $2<10 {count++} END {print count " blocks evicted quickly"}' \
    kv_cache_evictions.csv
```

Edit `sitecustomize.py` to add custom tracking:
```python
import time

def patch_custom_component():
    """Add your own instrumentation."""
    try:
        from vllm.custom.module import CustomClass  # your target class

        original_method = CustomClass.method

        def instrumented_method(self, *args, **kwargs):
            # Your tracking code here
            start = time.time()
            result = original_method(self, *args, **kwargs)
            duration = time.time() - start
            # Log or save metrics
            print(f"Custom metric: {duration}")
            return result

        CustomClass.method = instrumented_method
    except ImportError:
        pass
```
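One optional refinement (an assumption about your needs, not required by the hook): wrap the replacement with `functools.wraps` so the patched method keeps its name and docstring for anything that introspects it:

```python
import functools
import time

def make_instrumented(original_method):
    # functools.wraps preserves __name__/__doc__ of the original method
    @functools.wraps(original_method)
    def instrumented(self, *args, **kwargs):
        start = time.time()
        try:
            return original_method(self, *args, **kwargs)
        finally:
            print(f"Custom metric: {time.time() - start:.6f}")
    return instrumented

# Usage: CustomClass.method = make_instrumented(CustomClass.method)
```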
Export to Prometheus format:

```python
# convert_to_prometheus.py
import pandas as pd

kv_cache = pd.read_csv('kv_cache_usage.csv')

# Generate Prometheus metrics
for _, row in kv_cache.iterrows():
    print(f'vllm_kv_cache_usage{{timestamp="{row["timestamp_sec"]}"}} {row["usage_pct"]}')
```

- KV_CACHE_GUIDE.md: Deep dive into KV cache architecture
- CUDA_GRAPHS.md: CUDA graph metrics and tracking
- sitecustomize.py: Source code with inline comments
To add new metrics or improve tracking:
- Edit `sitecustomize.py`
- Add a profiler class (see `CUDAGraphProfiler` or `KVCacheProfiler`)
- Create a patch function (see the sketch after this list)
- Register it in `install_import_hook()`
- Test and document
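A hypothetical skeleton of those steps (every name except `install_import_hook`, `CUDAGraphProfiler`, and `KVCacheProfiler` is invented for illustration):

```python
import time

class MyProfiler:
    """Plays the same role as CUDAGraphProfiler/KVCacheProfiler."""
    def __init__(self):
        self.durations = []

    def record(self, duration):
        self.durations.append(duration)

my_profiler = MyProfiler()

def patch_my_component():
    try:
        from vllm.some.module import SomeClass  # hypothetical target
    except ImportError:
        return  # target not present in this vLLM version

    original = SomeClass.forward

    def wrapped(self, *args, **kwargs):
        start = time.time()
        result = original(self, *args, **kwargs)
        my_profiler.record(time.time() - start)
        return result

    SomeClass.forward = wrapped

# Finally, call patch_my_component() from install_import_hook().
```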
Same as vLLM project (Apache 2.0)
Built for analyzing vLLM performance in MLPerf inference benchmarks.