From 7f5fdd7d0a00c6b4a3f7deaf537938b5b62a6a48 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 8 Nov 2025 11:48:12 +0000 Subject: [PATCH 01/10] Add investigation artifacts for issue #243 - test_issue_243.py: Test script to replicate VRAM duplication issue - ISSUE_243_ANALYSIS.md: Initial analysis of gunicorn/CUDA issue - ISSUE_243_REAL_WORLD_ANALYSIS.md: Analysis of whisper-wrapper implementation These are investigation/documentation artifacts, not SDK changes. --- ISSUE_243_ANALYSIS.md | 392 +++++++++++++++++++++++++ ISSUE_243_REAL_WORLD_ANALYSIS.md | 484 +++++++++++++++++++++++++++++++ test_issue_243.py | 304 +++++++++++++++++++ 3 files changed, 1180 insertions(+) create mode 100644 ISSUE_243_ANALYSIS.md create mode 100644 ISSUE_243_REAL_WORLD_ANALYSIS.md create mode 100755 test_issue_243.py diff --git a/ISSUE_243_ANALYSIS.md b/ISSUE_243_ANALYSIS.md new file mode 100644 index 0000000..da7689f --- /dev/null +++ b/ISSUE_243_ANALYSIS.md @@ -0,0 +1,392 @@ +# Issue #243 Analysis: Gunicorn, Torch, and CUDA + +## Executive Summary + +When CLAMS applications using PyTorch models run in production mode with gunicorn, each worker process loads its own copy of the model into GPU VRAM. This leads to excessive memory consumption that scales linearly with the number of workers, causing OOM errors under concurrent load. + +--- + +## The Problem in Detail + +### Architecture + +The CLAMS Python SDK uses gunicorn for production deployments: + +``` +clams/restify/__init__.py:42-78 +``` + +**Default Configuration:** +- **Workers**: `(CPU_count × 2) + 1` +- **Threads per worker**: 2 +- **Worker class**: sync (default) + +On an 8-core machine: **17 workers** + +### Root Cause + +The issue occurs due to the interaction between: + +1. **Python's fork model**: Gunicorn uses `os.fork()` to spawn workers +2. **CUDA memory allocation**: GPU memory is NOT shared via copy-on-write like CPU RAM +3. **Model loading timing**: Models are loaded in `ClamsApp.__init__()` before the fork + +**Critical Code Path:** + +```python +# Typical CLAMS app structure (clams/develop/templates/app/app.py.template:29-40) +class MyApp(ClamsApp): + def __init__(self): + super().__init__() + self.model = torch.load('model.pt') # ← Loaded BEFORE fork + +# Entry point (clams/develop/templates/app/app.py.template:53-74) +if __name__ == "__main__": + app = MyApp() # ← Model loaded here (single process) + http_app = Restifier(app, port=5000) + + if args.production: + http_app.serve_production() # ← Gunicorn forks 17 workers here + # Each worker now has its own model copy in VRAM! +``` + +### Memory Multiplication + +**Example with Whisper model (~3GB VRAM):** + +| Configuration | Workers | VRAM Usage | +|--------------|---------|------------| +| 4-core CPU | 9 | 27 GB | +| 8-core CPU | 17 | 51 GB | +| 16-core CPU | 33 | 99 GB | + +Most consumer GPUs have 8-24GB VRAM, so this quickly causes OOM errors. + +### Why `torch.cuda.empty_cache()` Doesn't Help + +The SDK includes CUDA cache cleanup (clams/app/__init__.py:389-390): + +```python +finally: + if torch_available and cuda_available: + torch.cuda.empty_cache() +``` + +However, this only clears PyTorch's **caching allocator**, not the model weights themselves. The model stays loaded in each worker's VRAM indefinitely. + +--- + +## Current Behavior vs. Expected Behavior + +### Current (Problematic) Behavior: + +1. App loads model in `__init__()` +2. Gunicorn forks N workers +3. Each worker has independent model in VRAM +4. Concurrent requests → N models active simultaneously +5. 
VRAM usage = N × model_size +6. OOM when N × model_size > available VRAM + +### Expected Behavior: + +Models should either: +1. **Load on-demand** per request and be freed after +2. **Share VRAM** across workers (if possible) +3. **Use a model server** pattern with worker pooling +4. **Limit workers** based on available VRAM, not CPU count + +--- + +## How to Replicate and Test + +### Test Script: `test_issue_243.py` + +I've created a comprehensive test script that simulates the issue without requiring an actual whisper model. + +#### Prerequisites + +```bash +# Optional: For CUDA testing +pip install torch # with CUDA support + +# For monitoring mode +pip install requests +``` + +#### Test Scenarios + +**1. Development Mode (Baseline)** +```bash +# Single process, single model in VRAM +python test_issue_243.py --mode dev --model-size 100 +``` + +Expected: One model copy (~100MB VRAM) + +**2. Production Mode (Demonstrates Issue)** +```bash +# Multiple workers, multiple models in VRAM +python test_issue_243.py --mode prod --model-size 100 +``` + +Expected: N model copies (~N × 100MB VRAM) + +**3. Concurrent Request Testing** + +Terminal 1 - Start server: +```bash +python test_issue_243.py --mode prod --model-size 100 --port 5000 +``` + +Terminal 2 - Send concurrent requests: +```bash +python test_issue_243.py --mode monitor --port 5000 +``` + +**4. Custom Worker Count** +```bash +# Test with specific number of workers +python test_issue_243.py --mode prod --workers 5 --model-size 100 +``` + +#### What to Look For + +1. **Worker PIDs**: Each request shows which worker processed it +2. **VRAM Growth**: Monitor GPU memory as workers start +3. **Multiple Model Copies**: Different workers have different model instances +4. **Concurrent Load**: When 10 requests hit simultaneously, multiple workers activate + +**With CUDA available:** +```bash +# Watch VRAM in real-time while running tests +watch -n 1 nvidia-smi +``` + +**Without CUDA:** +The script will simulate the issue and show worker-level model duplication even without GPU. + +--- + +## Key Observations from Code Analysis + +### 1. Worker Initialization (clams/restify/__init__.py:51-67) + +```python +def number_of_workers(): + return (multiprocessing.cpu_count() * 2) + 1 + +class ProductionApplication(gunicorn.app.base.BaseApplication): + def __init__(self, app, host, port, **options): + self.options = { + 'bind': f'{host}:{port}', + 'workers': number_of_workers(), # ← CPU-based, ignores GPU + 'threads': 2, + 'accesslog': '-', + } +``` + +**Issue**: Worker count is based solely on CPU cores, completely ignoring GPU memory constraints. + +### 2. CUDA Profiling (clams/app/__init__.py:349-392) + +The SDK includes CUDA memory profiling that tracks peak VRAM usage: + +```python +@staticmethod +def _profile_cuda_memory(func): + def wrapper(*args, **kwargs): + # Reset peak memory tracking + torch.cuda.reset_peak_memory_stats('cuda') + + result = func(*args, **kwargs) + + # Record peak usage per GPU + for device_id in range(device_count): + peak_memory = torch.cuda.max_memory_allocated(f'cuda:{device_id}') + cuda_profiler[key] = peak_memory + + return result, cuda_profiler + finally: + torch.cuda.empty_cache() # ← Only clears cache, not model +``` + +**Key Points:** +- Profiling is helpful for monitoring +- `empty_cache()` doesn't free model weights +- Peak memory tracking is per-request, not per-worker + +### 3. 
No Pre-Fork Hooks + +Gunicorn provides hooks like `pre_fork()`, `post_fork()`, `post_worker_init()` that could be used to: +- Delay model loading until after fork +- Implement shared model serving +- Manage worker-to-GPU assignment + +**Currently not implemented in the SDK.** + +--- + +## Related Issues and Context + +### Referenced PR: app-doctr-wrapper #6 + +The issue mentions this shares a root cause with a PR in app-doctr-wrapper. This suggests: +- Multiple CLAMS apps experience this issue +- DocTR (Document Text Recognition) models also consume significant VRAM +- The problem is systemic to the SDK, not app-specific + +### Production Environment Context + +The issue specifically mentions: +- **Hardware**: NVIDIA GPU support +- **App**: Whisper wrapper v10 +- **Trigger**: Multiple POST requests (concurrent or sequential) +- **Symptom**: Progressive GPU memory saturation → OOM + +This matches the behavior described above perfectly. + +--- + +## Verification Steps + +To verify this is happening in your production environment: + +### 1. Check Worker Count +```bash +# While app is running in production +ps aux | grep gunicorn | grep -v grep | wc -l +``` + +You should see N+1 processes (master + N workers) + +### 2. Monitor VRAM per Process +```bash +# Install nvidia-smi if not available +nvidia-smi pmon -c 1 + +# Or for continuous monitoring +watch -n 1 'nvidia-smi --query-compute-apps=pid,used_memory --format=csv' +``` + +You should see multiple PIDs each consuming ~model_size VRAM + +### 3. Test Concurrent Requests + +```python +import requests +from concurrent.futures import ThreadPoolExecutor + +url = "http://your-app:5000" +mmif_data = "{}" # minimal MMIF + +def send_request(i): + response = requests.post(url, data=mmif_data, + params={'hwFetch': 'true'}) + return response.json() + +# Send 10 concurrent requests +with ThreadPoolExecutor(max_workers=10) as executor: + results = list(executor.map(send_request, range(10))) + +# Check how many different workers responded +workers = set() +for result in results: + if 'views' in result and result['views']: + # Extract worker info from view metadata + workers.add(result['views'][0].get('metadata', {}).get('app_pid')) + +print(f"Requests distributed across {len(workers)} workers") +``` + +### 4. Compare Development vs. Production + +```bash +# Development (single process) +python app.py --port 5000 & +sleep 5 +nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits + +# Production (multiple workers) +python app.py --production --port 5001 & +sleep 10 # Give workers time to start +nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits +``` + +Production should show significantly more VRAM usage. + +--- + +## Potential Solutions (Analysis Only) + +*Note: Per your request, I'm only analyzing potential solutions, not implementing them.* + +### 1. Lazy Model Loading +Load models in `_annotate()` instead of `__init__()`, with proper cleanup. + +**Pros**: Simple to implement +**Cons**: Slow (load/unload per request), still uses same peak VRAM + +### 2. Worker-to-GPU Affinity +Limit workers to match available GPUs, assign each worker to specific GPU. + +**Pros**: Predictable VRAM usage +**Cons**: Underutilizes CPU on GPU-poor systems + +### 3. Model Server Pattern +Separate model serving process(es), workers communicate via IPC/network. + +**Pros**: True model sharing, scales independently +**Cons**: Complex architecture, adds latency + +### 4. 
Gunicorn Pre-Load + Smart Fork +Use `preload_app=True` with delayed CUDA initialization after fork. + +**Pros**: Maintains multi-worker concurrency +**Cons**: Requires careful CUDA context management + +### 5. Dynamic Worker Scaling +Calculate workers based on available VRAM, not CPU count. + +**Pros**: Prevents OOM +**Cons**: May underutilize system resources + +--- + +## Recommended Next Steps + +1. **Validate the issue** using `test_issue_243.py` on your setup +2. **Measure actual impact** in your production environment +3. **Gather requirements**: + - Typical request concurrency + - Model sizes + - Available VRAM + - Acceptable latency +4. **Evaluate solutions** based on your constraints +5. **Prototype** the most promising approach + +--- + +## Additional Resources + +### CLAMS SDK Files to Review: +- `clams/restify/__init__.py` - Gunicorn configuration +- `clams/app/__init__.py` - CUDA profiling and app lifecycle +- `clams/develop/templates/app/app.py.template` - App structure + +### Gunicorn Documentation: +- [Server Hooks](https://docs.gunicorn.org/en/stable/settings.html#server-hooks) +- [Worker Configuration](https://docs.gunicorn.org/en/stable/settings.html#worker-processes) +- [Preloading Applications](https://docs.gunicorn.org/en/stable/settings.html#preload-app) + +### PyTorch CUDA: +- [CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) +- [Memory Management](https://pytorch.org/docs/stable/notes/cuda.html#memory-management) +- [Multiprocessing](https://pytorch.org/docs/stable/notes/multiprocessing.html) + +--- + +## Summary + +Issue #243 is a **systemic architectural challenge** where the SDK's CPU-based worker scaling conflicts with GPU memory constraints. The test script (`test_issue_243.py`) provides a safe, controlled way to observe and measure this behavior without affecting your repository or production systems. + +The issue is real, measurable, and impacts any CLAMS app using large PyTorch models in production. Solutions will require careful trade-offs between simplicity, performance, and resource utilization. diff --git a/ISSUE_243_REAL_WORLD_ANALYSIS.md b/ISSUE_243_REAL_WORLD_ANALYSIS.md new file mode 100644 index 0000000..3e01908 --- /dev/null +++ b/ISSUE_243_REAL_WORLD_ANALYSIS.md @@ -0,0 +1,484 @@ +# Issue #243 Real-World Analysis: Whisper Wrapper Implementation + +## Executive Summary + +The whisper wrapper app **attempts** to solve the GPU memory issue by loading models on-demand rather than in `__init__()`. However, the implementation still suffers from **per-worker model duplication** and has a problematic **"conflict prevention" mechanism** that can load duplicate models within the same worker. + +--- + +## Actual Implementation Analysis + +### Source Code +**Repository**: https://github.com/clamsproject/app-whisper-wrapper +**File**: `app.py` + +### Model Loading Strategy + +#### 1. Initialization (Lines 28-31) +```python +def __init__(self): + super().__init__() + self.whisper_models = {} + self.model_usage = {} +``` + +**Good**: Models are NOT loaded in `__init__()`, avoiding the pre-fork duplication issue. + +**Problem**: Each worker still has its own `self.whisper_models` dict after fork, leading to per-worker caching. + +#### 2. 
On-Demand Loading in `_annotate()` (Lines 78-96) + +```python +if size not in self.whisper_models: + self.logger.debug(f'Loading model {size}') + t = time.perf_counter() + self.whisper_models[size] = whisper.load_model(size) + self.logger.debug(f'Load time: {time.perf_counter() - t:.2f} seconds\n') + self.model_usage[size] = False + +if not self.model_usage[size]: + whisper_model = self.whisper_models.get(size) + self.model_usage[size] = True + cached = True +else: + self.logger.debug(f'Loading model {size} to avoid memory conflict') + t = time.perf_counter() + whisper_model = whisper.load_model(size) + self.logger.debug(f'Load time: {time.perf_counter() - t:.2f} seconds\n') + cached = False +``` + +**Logic**: +1. First request to a worker: Load model into `self.whisper_models[size]` and cache it +2. Subsequent requests to the same worker: + - If `model_usage[size]` is False (model not in use): Use cached model + - If `model_usage[size]` is True (model in use): **Load a SECOND copy!** + +#### 3. Cleanup After Transcription (Line 128) +```python +if size in self.model_usage and cached == True: + self.model_usage[size] = False +``` + +**Intent**: Mark model as "not in use" so next request can reuse it. + +**Problem**: The second model copy (when `cached = False`) is never tracked or cleaned up! + +--- + +## Why This Is Still Problematic + +### Issue #1: Per-Worker Model Caching + +**Scenario**: 8-core CPU → 17 workers, Whisper "medium" model (~3GB VRAM) + +**What happens**: +1. First request hits Worker 1 → loads model (3GB) +2. First request hits Worker 2 → loads model (3GB) +3. ... +4. First request hits Worker 17 → loads model (3GB) + +**Result**: 17 × 3GB = **51GB VRAM** (same as the original issue!) + +**Why**: Each worker's `self.whisper_models` dict is independent. There's no cross-worker model sharing. + +### Issue #2: The "Conflict Prevention" Mechanism + +The code assumes concurrent requests within the same worker can happen. Let's analyze when this occurs: + +**Gunicorn Worker Types**: +- **sync** (default in CLAMS SDK): Single-threaded, handles one request at a time + - The `threads: 2` setting is **ignored** with sync workers! + - `model_usage` tracking is **unnecessary** with sync workers +- **gthread**: Multi-threaded, can handle concurrent requests + - If using gthread workers, the conflict prevention kicks in + +**Problem with gthread workers (2 threads per worker)**: + +**Timeline**: +``` +Worker 1, Thread 1: + T0: Load model into cache (3GB) + T1: Set model_usage = True + T2: Start transcription (takes 10 seconds) + +Worker 1, Thread 2 (concurrent request): + T3: Check model_usage → True + T4: "Loading model to avoid memory conflict" + T5: Load SECOND copy of model (3GB) + T6: Start transcription + +Worker 1 now has: 3GB + 3GB = 6GB in VRAM! +``` + +**Worst Case**: 17 workers × 2 threads × 3GB = **102GB VRAM** + +### Issue #3: Memory Leak + +```python +else: + self.logger.debug(f'Loading model {size} to avoid memory conflict') + t = time.perf_counter() + whisper_model = whisper.load_model(size) # ← Local variable + self.logger.debug(f'Load time: {time.perf_counter() - t:.2f} seconds\n') + cached = False +``` + +**Problem**: The second model copy is stored in `whisper_model` (local variable) but never explicitly deleted or freed. It becomes garbage only when: +1. The function returns +2. Python's GC runs +3. PyTorch releases the CUDA memory + +During long transcriptions, multiple copies can accumulate if requests arrive faster than transcription completes. 
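
A minimal way to bound this leak would be to tie the lifetime of the non-cached copy to the request itself. The sketch below is illustrative only — the function name, the `cached_models` argument, and the call pattern are assumptions, not the wrapper's actual code — but it shows the general `del` + `gc.collect()` + `torch.cuda.empty_cache()` pattern that lets the allocator return the duplicate model's memory to the CUDA driver as soon as transcription finishes:

```python
import gc

import torch
import whisper


def transcribe_and_release(size: str, audio_path: str, cached_models: dict) -> dict:
    """Sketch: run transcription, releasing any non-cached model copy afterwards.

    `cached_models` stands in for the wrapper's `self.whisper_models`; the
    function name and signature are illustrative, not part of the real app.
    """
    cached = size in cached_models
    model = cached_models[size] if cached else whisper.load_model(size)
    try:
        return model.transcribe(audio_path)
    finally:
        if not cached:
            # Drop the only reference to the duplicate copy, force collection,
            # then hand the freed blocks back to the CUDA driver.
            del model
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
```

Note that this only shortens the lifetime of the duplicate copy within a worker; it does nothing about the per-worker caching described in Issue #1.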
+ +--- + +## Verification Steps + +### Check If You're Hitting This Issue + +**1. Check Worker Configuration** + +Look at the actual running gunicorn config: +```bash +# While app is running +ps aux | grep gunicorn + +# You'll see something like: +# gunicorn: master [WhisperWrapper] +# gunicorn: worker [WhisperWrapper] (17 workers) +``` + +Count the workers: should be `(CPU_count × 2) + 1` + +**2. Monitor Per-Worker VRAM Usage** + +```bash +# Watch VRAM in real-time +watch -n 1 'nvidia-smi --query-compute-apps=pid,used_memory --format=csv' +``` + +Send a request and watch: +- First request to the app → 1 PID appears with ~3GB +- Second concurrent request → potentially 2 PIDs or the same PID with 6GB + +**3. Test Concurrent Requests** + +```python +import requests +import time +from concurrent.futures import ThreadPoolExecutor +import json + +url = "http://your-whisper-app:5000" + +# Create minimal MMIF with audio document +mmif = { + "metadata": {"mmif": "http://mmif.clams.ai/1.0.0"}, + "documents": [{ + "@type": "http://mmif.clams.ai/vocabulary/AudioDocument/v1", + "properties": { + "id": "d1", + "location": "file:///path/to/audio.mp3" + } + }] +} + +def send_request(i): + start = time.time() + response = requests.post(url, json=mmif, params={'model': 'medium'}) + duration = time.time() - start + + # Extract worker PID from logs if available + print(f"Request {i}: {response.status_code}, took {duration:.2f}s") + return response.json() if response.ok else None + +# Send 5 concurrent requests +print("Sending concurrent requests...") +with ThreadPoolExecutor(max_workers=5) as executor: + results = list(executor.map(send_request, range(5))) + +print(f"\nCompleted {len([r for r in results if r])} successful requests") +``` + +Watch `nvidia-smi` while this runs. You should see: +- Multiple PIDs (different workers) each consuming ~3GB, OR +- Same PID consuming 6GB+ (multiple models in one worker) + +--- + +## The Actual Problem Mechanism + +### Scenario 1: Low Concurrency (< Workers) + +**Setup**: 17 workers, 5 concurrent requests + +**What happens**: +1. Request 1 → Worker 1 loads model (3GB) +2. Request 2 → Worker 2 loads model (3GB) +3. Request 3 → Worker 3 loads model (3GB) +4. Request 4 → Worker 4 loads model (3GB) +5. Request 5 → Worker 5 loads model (3GB) + +**Total**: 5 × 3GB = 15GB VRAM + +**After requests complete**: Models stay cached in workers 1-5 + +**Next 5 requests**: Might reuse cached models OR hit different workers and load 5 more + +**Over time**: All 17 workers eventually load models → 51GB VRAM + +### Scenario 2: High Concurrency (> Workers) + +**Setup**: 17 workers, 50 concurrent requests + +**What happens**: +1. Requests 1-17 → Each worker loads one model (51GB) +2. Requests 18-34 → Reuse cached models (no new VRAM) +3. **IF using gthread workers**: Some requests queue up, then hit "in use" models + - Additional model copies loaded (potentially +51GB more!) + +**Total**: 51GB to 102GB VRAM depending on timing + +### Scenario 3: Varied Model Sizes + +**Setup**: Requests use different model parameters ('small', 'medium', 'large') + +**What happens**: +- Each worker caches ALL requested model sizes +- Worker 1: small (1GB) + medium (3GB) = 4GB +- Worker 2: medium (3GB) + large (6GB) = 9GB +- ... + +**Total**: Unpredictable, can exceed single-model worst case + +--- + +## Why the "Fix" Doesn't Work + +The whisper wrapper's approach tries to handle: +1. ✅ Avoiding pre-fork model loading (works) +2. ✅ Lazy loading on first request (works) +3. 
✅ Caching within a worker (works) +4. ❌ Concurrent requests within a worker (broken - loads duplicates) +5. ❌ Cross-worker model sharing (impossible with this architecture) + +**The fundamental issue**: Python's multiprocessing fork + CUDA don't support shared GPU memory. + +--- + +## Better Testing Strategy + +### Modified Test Script for Whisper Wrapper + +```python +#!/usr/bin/env python3 +""" +Test script specifically for whisper-wrapper issue #243 +""" +import requests +import time +import subprocess +import json +from concurrent.futures import ThreadPoolExecutor, as_completed + +BASE_URL = "http://localhost:5000" + +def get_gpu_memory(): + """Get current GPU memory usage""" + try: + result = subprocess.run( + ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'], + capture_output=True, text=True + ) + return int(result.stdout.strip()) + except: + return None + +def get_gpu_processes(): + """Get PIDs using GPU and their memory""" + try: + result = subprocess.run( + ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader,nounits'], + capture_output=True, text=True + ) + processes = {} + for line in result.stdout.strip().split('\n'): + if line: + pid, mem = line.split(', ') + processes[int(pid)] = int(mem) + return processes + except: + return {} + +def create_mmif_input(audio_path): + """Create minimal MMIF with audio document""" + return json.dumps({ + "metadata": {"mmif": "http://mmif.clams.ai/1.0.0"}, + "documents": [{ + "@type": "http://mmif.clams.ai/vocabulary/AudioDocument/v1", + "properties": { + "id": "d1", + "location": f"file://{audio_path}" + } + }] + }) + +def send_request(request_id, model_size='medium', audio_path='/path/to/test.mp3'): + """Send transcription request""" + mmif_input = create_mmif_input(audio_path) + + start = time.time() + try: + response = requests.post( + BASE_URL, + data=mmif_input, + params={'model': model_size, 'hwFetch': 'true'}, + headers={'Content-Type': 'application/json'}, + timeout=300 + ) + duration = time.time() - start + + if response.ok: + result = response.json() + worker_info = "unknown" + if 'views' in result and result['views']: + metadata = result['views'][0].get('metadata', {}) + worker_info = metadata.get('worker_pid', 'unknown') + + return { + 'id': request_id, + 'status': 'success', + 'duration': duration, + 'worker': worker_info, + 'model': model_size + } + else: + return { + 'id': request_id, + 'status': 'error', + 'error': response.status_code, + 'duration': duration + } + except Exception as e: + return { + 'id': request_id, + 'status': 'exception', + 'error': str(e), + 'duration': time.time() - start + } + +def main(): + print("\n" + "="*70) + print("WHISPER WRAPPER - Issue #243 Test") + print("="*70) + + # Initial state + print("\n1. Initial GPU State:") + initial_mem = get_gpu_memory() + initial_procs = get_gpu_processes() + print(f" Total VRAM used: {initial_mem} MB") + print(f" Processes using GPU: {len(initial_procs)}") + + # Test 1: Single request + print("\n2. Sending single request...") + result = send_request(1, model_size='medium', audio_path='/path/to/short-test.mp3') + print(f" Result: {result['status']}, Duration: {result['duration']:.2f}s") + + time.sleep(2) + after_one_mem = get_gpu_memory() + after_one_procs = get_gpu_processes() + print(f" VRAM after 1 request: {after_one_mem} MB (Δ {after_one_mem - initial_mem} MB)") + print(f" Processes: {len(after_one_procs)} - {after_one_procs}") + + # Test 2: Sequential requests (should reuse model in same worker) + print("\n3. 
Sending 5 sequential requests...") + for i in range(2, 7): + result = send_request(i, model_size='medium') + print(f" Request {i}: {result['status']}, Worker: {result.get('worker', 'unknown')}") + time.sleep(1) + + after_seq_mem = get_gpu_memory() + after_seq_procs = get_gpu_processes() + print(f" VRAM after sequential: {after_seq_mem} MB (Δ {after_seq_mem - initial_mem} MB)") + print(f" Processes: {len(after_seq_procs)} - {after_seq_procs}") + + # Test 3: Concurrent requests (this will trigger multi-worker loading) + print("\n4. Sending 10 concurrent requests...") + with ThreadPoolExecutor(max_workers=10) as executor: + futures = [executor.submit(send_request, i, 'medium') for i in range(10, 20)] + results = [f.result() for f in as_completed(futures)] + + workers_used = set(r.get('worker') for r in results if r.get('worker')) + print(f" Completed: {len([r for r in results if r['status'] == 'success'])}/10") + print(f" Unique workers: {len(workers_used)} - {workers_used}") + + time.sleep(2) + after_concurrent_mem = get_gpu_memory() + after_concurrent_procs = get_gpu_processes() + print(f" VRAM after concurrent: {after_concurrent_mem} MB (Δ {after_concurrent_mem - initial_mem} MB)") + print(f" Processes: {len(after_concurrent_procs)} - {after_concurrent_procs}") + + # Analysis + print("\n" + "="*70) + print("ANALYSIS:") + print("="*70) + print(f"Initial VRAM: {initial_mem} MB") + print(f"After 1 request: {after_one_mem} MB (1 model loaded)") + print(f"After sequential: {after_seq_mem} MB (should be same if worker reused)") + print(f"After concurrent: {after_concurrent_mem} MB ({len(workers_used)} workers loaded models)") + print(f"\nExpected VRAM per model: ~3000 MB (medium)") + print(f"Expected for {len(workers_used)} workers: ~{len(workers_used) * 3000} MB") + print(f"Actual increase: {after_concurrent_mem - initial_mem} MB") + + if len(workers_used) > 5: + print(f"\n⚠️ WARNING: {len(workers_used)} different workers loaded models!") + print(f" This demonstrates the issue: each worker loads independently") + + if len(after_concurrent_procs) > len(workers_used): + print(f"\n⚠️ WARNING: More GPU processes than unique workers!") + print(f" This suggests duplicate models within workers (conflict prevention)") + + print("="*70 + "\n") + +if __name__ == '__main__': + # NOTE: Update the audio_path in send_request() to point to a real audio file + main() +``` + +**Usage**: +1. Start whisper-wrapper in production mode +2. Update `audio_path` in the script to point to a real audio file +3. Run the test script +4. Watch memory grow as concurrent requests hit different workers + +--- + +## Summary + +The whisper wrapper's implementation reveals: + +1. **Models are NOT loaded in `__init__()`** - This is good and avoids one problem +2. **Models ARE cached per-worker** - Each worker loads its own copy on first use +3. **"Conflict prevention" loads duplicates** - If configured with threaded workers +4. **No cleanup mechanism** - Models stay in VRAM indefinitely per worker + +**The core issue remains**: With 17 workers and a 3GB model, you still get 51GB VRAM usage over time as different workers handle requests. + +This is **architectural** - the solution requires either: +- Limiting workers based on VRAM (not CPU) +- External model serving +- Different worker/threading strategy +- Or accepting high VRAM usage and scaling horizontally with multiple GPUs + +--- + +## Recommended Next Steps + +1. **Measure your actual VRAM usage** using the test script above +2. 
**Determine your concurrency needs** (how many concurrent requests?) +3. **Calculate required VRAM**: workers needed × model size +4. **Compare to available VRAM** on your GPU(s) +5. **Decide on solution approach** based on gap + +The whisper wrapper attempts to mitigate the issue but can't solve it within the current architecture. diff --git a/test_issue_243.py b/test_issue_243.py new file mode 100755 index 0000000..63457d0 --- /dev/null +++ b/test_issue_243.py @@ -0,0 +1,304 @@ +#!/usr/bin/env python3 +""" +Test script to replicate and understand issue #243: +GPU memory consumption with gunicorn workers and torch models. + +This script creates a minimal CLAMS app that simulates the problem +and provides tools to monitor VRAM usage across multiple workers. + +Usage: + # Development mode (single process, 1 model copy in VRAM) + python test_issue_243.py --mode dev + + # Production mode (multiple workers, N model copies in VRAM) + python test_issue_243.py --mode prod + + # Monitor VRAM while sending concurrent requests + python test_issue_243.py --mode prod --monitor +""" + +import argparse +import os +import sys +import time +import logging +from typing import Union + +from clams import ClamsApp, Restifier +from mmif import Mmif, View, AnnotationTypes +from clams.appmetadata import AppMetadata + +# Suppress warnings for cleaner output +import warnings +warnings.filterwarnings('ignore') + + +class DummyTorchModel: + """ + Simulates a torch model with controllable VRAM footprint. + Replace this with real torch model loading to see actual issue. + """ + def __init__(self, size_mb=100): + self.size_mb = size_mb + self.worker_id = os.getpid() + print(f"[Worker {self.worker_id}] Loading dummy model ({size_mb}MB)...", file=sys.stderr) + + # If torch is available, allocate actual VRAM + try: + import torch + if torch.cuda.is_available(): + # Allocate a tensor to consume VRAM + num_elements = (size_mb * 1024 * 1024) // 4 # 4 bytes per float32 + self.tensor = torch.randn(num_elements, device='cuda:0') + print(f"[Worker {self.worker_id}] Model loaded in VRAM: {size_mb}MB", file=sys.stderr) + else: + print(f"[Worker {self.worker_id}] CUDA not available, using CPU", file=sys.stderr) + self.tensor = None + except ImportError: + print(f"[Worker {self.worker_id}] PyTorch not available, simulating model", file=sys.stderr) + self.tensor = None + + def predict(self, data): + """Simulate inference""" + return f"Prediction from worker {self.worker_id}" + + +class TestApp(ClamsApp): + """ + Minimal CLAMS app that demonstrates the VRAM issue. + + The model is loaded in __init__, which happens BEFORE gunicorn forks workers. + This means each worker gets its own copy of the model in VRAM. 
+ """ + + def __init__(self, model_size_mb=100): + super().__init__() + self.model_size_mb = model_size_mb + + # THIS IS THE KEY ISSUE: Model loaded before worker fork + self.model = DummyTorchModel(size_mb=model_size_mb) + + print(f"[Main Process {os.getpid()}] TestApp initialized", file=sys.stderr) + + def _appmetadata(self) -> AppMetadata: + metadata = AppMetadata( + identifier='test-issue-243', + name='Issue 243 Test App', + description='Test app to demonstrate GPU memory issue with gunicorn workers', + app_version='1.0.0', + mmif_version='1.0.0' + ) + metadata.add_parameter( + name='dummy_param', + type='string', + description='A dummy parameter', + default='test' + ) + return metadata + + def _annotate(self, mmif: Union[str, Mmif], **parameters) -> Mmif: + if isinstance(mmif, str): + mmif = Mmif(mmif) + + worker_id = os.getpid() + + # Simulate inference + prediction = self.model.predict("dummy data") + + # Create a new view with worker info + view = mmif.new_view() + self.sign_view(view, parameters) + + # Add annotation showing which worker processed this + view.metadata['worker_pid'] = worker_id + view.metadata['model_worker_pid'] = self.model.worker_id + view.metadata['prediction'] = prediction + + print(f"[Worker {worker_id}] Processed request with model from worker {self.model.worker_id}", + file=sys.stderr) + + return mmif + + +def print_gpu_memory(): + """Print current GPU memory usage""" + try: + import torch + if torch.cuda.is_available(): + for i in range(torch.cuda.device_count()): + allocated = torch.cuda.memory_allocated(i) / 1024**2 + reserved = torch.cuda.memory_reserved(i) / 1024**2 + print(f" GPU {i}: Allocated={allocated:.1f}MB, Reserved={reserved:.1f}MB") + else: + print(" CUDA not available") + except ImportError: + print(" PyTorch not installed") + + +def monitor_mode(): + """ + Monitor mode: Send concurrent requests and monitor VRAM usage. + Run this AFTER starting the server in production mode. 
+ """ + import requests + import subprocess + import json + from concurrent.futures import ThreadPoolExecutor, as_completed + + base_url = "http://localhost:5000" + + print("\n" + "="*70) + print("MONITORING MODE: Sending concurrent requests") + print("="*70) + + # Create a minimal MMIF input + mmif_input = Mmif(validate=False) + mmif_str = mmif_input.serialize() + + # Get initial GPU state + print("\nInitial GPU Memory State:") + try: + result = subprocess.run(['nvidia-smi', '--query-gpu=index,name,memory.used,memory.total', + '--format=csv,noheader,nounits'], + capture_output=True, text=True) + for line in result.stdout.strip().split('\n'): + print(f" {line}") + except FileNotFoundError: + print(" nvidia-smi not available") + + print_gpu_memory() + + # Send concurrent requests + num_requests = 10 + print(f"\nSending {num_requests} concurrent requests...") + + def send_request(request_id): + try: + response = requests.post(base_url, data=mmif_str, + headers={'Content-Type': 'application/json'}) + result = response.json() + + # Extract worker info + worker_info = "N/A" + if 'views' in result and len(result['views']) > 0: + view = result['views'][0] + if 'metadata' in view: + worker_pid = view['metadata'].get('worker_pid', 'N/A') + model_pid = view['metadata'].get('model_worker_pid', 'N/A') + worker_info = f"Worker={worker_pid}, Model={model_pid}" + + return request_id, response.status_code, worker_info + except Exception as e: + return request_id, f"Error: {e}", "N/A" + + with ThreadPoolExecutor(max_workers=num_requests) as executor: + futures = [executor.submit(send_request, i) for i in range(num_requests)] + + results = [] + for future in as_completed(futures): + results.append(future.result()) + + # Display results + print("\nRequest Results:") + results.sort(key=lambda x: x[0]) + for req_id, status, worker_info in results: + print(f" Request {req_id}: Status={status}, {worker_info}") + + # Show unique workers that processed requests + workers_seen = set() + for _, _, worker_info in results: + if "Worker=" in worker_info: + workers_seen.add(worker_info.split(',')[0]) + + print(f"\nUnique workers that processed requests: {len(workers_seen)}") + for worker in sorted(workers_seen): + print(f" {worker}") + + # Get final GPU state + print("\nFinal GPU Memory State:") + try: + result = subprocess.run(['nvidia-smi', '--query-gpu=index,name,memory.used,memory.total', + '--format=csv,noheader,nounits'], + capture_output=True, text=True) + for line in result.stdout.strip().split('\n'): + print(f" {line}") + except FileNotFoundError: + print(" nvidia-smi not available") + + print_gpu_memory() + + print("\n" + "="*70) + print("KEY OBSERVATION:") + print(f"If you see {len(workers_seen)} unique workers, each has its own model copy in VRAM") + print("This demonstrates the issue: N workers = N × model size in VRAM") + print("="*70 + "\n") + + +def main(): + parser = argparse.ArgumentParser(description='Test Issue #243: Gunicorn, Torch, and CUDA') + parser.add_argument('--mode', choices=['dev', 'prod', 'monitor'], default='dev', + help='Run mode: dev (single process), prod (gunicorn), monitor (send test requests)') + parser.add_argument('--port', type=int, default=5000, + help='Port to run on (default: 5000)') + parser.add_argument('--model-size', type=int, default=100, + help='Size of dummy model in MB (default: 100)') + parser.add_argument('--workers', type=int, default=None, + help='Number of gunicorn workers (default: auto-calculated)') + + args = parser.parse_args() + + if args.mode == 'monitor': + 
monitor_mode() + return + + print("\n" + "="*70) + print(f"ISSUE #243 TEST - Mode: {args.mode.upper()}") + print("="*70) + + if args.mode == 'dev': + print("\nDEVELOPMENT MODE:") + print(" - Single process (Flask development server)") + print(" - One model copy in VRAM") + print(" - Good for testing, not production") + else: + import multiprocessing + num_workers = args.workers if args.workers else (multiprocessing.cpu_count() * 2) + 1 + print("\nPRODUCTION MODE (Gunicorn):") + print(f" - Multiple workers: {num_workers}") + print(f" - Each worker loads model independently") + print(f" - Expected VRAM usage: ~{num_workers * args.model_size}MB") + print(f" - This demonstrates the issue!") + + print(f"\nModel size: {args.model_size}MB") + print(f"Port: {args.port}") + + print("\nInitial GPU state:") + print_gpu_memory() + + print("\nCreating app instance...") + app = TestApp(model_size_mb=args.model_size) + + print("\nStarting HTTP server...") + print(f"Server will be available at: http://localhost:{args.port}") + + if args.mode == 'prod': + print("\nTo test concurrent requests, run in another terminal:") + print(f" python {sys.argv[0]} --mode monitor --port {args.port}") + + print("\n" + "="*70 + "\n") + + # Start the server + http_app = Restifier(app, port=args.port) + + if args.mode == 'prod': + options = {} + if args.workers: + options['workers'] = args.workers + http_app.serve_production(**options) + else: + app.logger.setLevel(logging.DEBUG) + http_app.run() + + +if __name__ == '__main__': + main() From b136767e1dbdc2eddb7b5ecc42e2135828e014f2 Mon Sep 17 00:00:00 2001 From: Claude Date: Sat, 8 Nov 2025 12:09:32 +0000 Subject: [PATCH 02/10] Update issue #243 investigation with SDK-level solution Removed outdated files: - test_issue_243.py (app-level test, no longer relevant) - ISSUE_243_ANALYSIS.md (superseded) - ISSUE_243_REAL_WORLD_ANALYSIS.md (superseded) New consolidated documentation: - ISSUE_243_INVESTIGATION.md: Complete investigation with SDK-level solution Key changes from previous analysis: - Focus on SDK-level VRAM management (not app-level) - Runtime VRAM checking via enhanced _profile_cuda_memory decorator - _get_model_requirements() API for apps to declare memory needs - Conservative worker count when CUDA detected - Runtime status via ?includeVRAM=true parameter - Addresses dynamic VRAM availability (not static calculation) - Process-safe torch.cuda.empty_cache() usage documented --- ISSUE_243_ANALYSIS.md | 392 ------------------ ISSUE_243_INVESTIGATION.md | 655 +++++++++++++++++++++++++++++++ ISSUE_243_REAL_WORLD_ANALYSIS.md | 484 ----------------------- test_issue_243.py | 304 -------------- 4 files changed, 655 insertions(+), 1180 deletions(-) delete mode 100644 ISSUE_243_ANALYSIS.md create mode 100644 ISSUE_243_INVESTIGATION.md delete mode 100644 ISSUE_243_REAL_WORLD_ANALYSIS.md delete mode 100755 test_issue_243.py diff --git a/ISSUE_243_ANALYSIS.md b/ISSUE_243_ANALYSIS.md deleted file mode 100644 index da7689f..0000000 --- a/ISSUE_243_ANALYSIS.md +++ /dev/null @@ -1,392 +0,0 @@ -# Issue #243 Analysis: Gunicorn, Torch, and CUDA - -## Executive Summary - -When CLAMS applications using PyTorch models run in production mode with gunicorn, each worker process loads its own copy of the model into GPU VRAM. This leads to excessive memory consumption that scales linearly with the number of workers, causing OOM errors under concurrent load. 
- ---- - -## The Problem in Detail - -### Architecture - -The CLAMS Python SDK uses gunicorn for production deployments: - -``` -clams/restify/__init__.py:42-78 -``` - -**Default Configuration:** -- **Workers**: `(CPU_count × 2) + 1` -- **Threads per worker**: 2 -- **Worker class**: sync (default) - -On an 8-core machine: **17 workers** - -### Root Cause - -The issue occurs due to the interaction between: - -1. **Python's fork model**: Gunicorn uses `os.fork()` to spawn workers -2. **CUDA memory allocation**: GPU memory is NOT shared via copy-on-write like CPU RAM -3. **Model loading timing**: Models are loaded in `ClamsApp.__init__()` before the fork - -**Critical Code Path:** - -```python -# Typical CLAMS app structure (clams/develop/templates/app/app.py.template:29-40) -class MyApp(ClamsApp): - def __init__(self): - super().__init__() - self.model = torch.load('model.pt') # ← Loaded BEFORE fork - -# Entry point (clams/develop/templates/app/app.py.template:53-74) -if __name__ == "__main__": - app = MyApp() # ← Model loaded here (single process) - http_app = Restifier(app, port=5000) - - if args.production: - http_app.serve_production() # ← Gunicorn forks 17 workers here - # Each worker now has its own model copy in VRAM! -``` - -### Memory Multiplication - -**Example with Whisper model (~3GB VRAM):** - -| Configuration | Workers | VRAM Usage | -|--------------|---------|------------| -| 4-core CPU | 9 | 27 GB | -| 8-core CPU | 17 | 51 GB | -| 16-core CPU | 33 | 99 GB | - -Most consumer GPUs have 8-24GB VRAM, so this quickly causes OOM errors. - -### Why `torch.cuda.empty_cache()` Doesn't Help - -The SDK includes CUDA cache cleanup (clams/app/__init__.py:389-390): - -```python -finally: - if torch_available and cuda_available: - torch.cuda.empty_cache() -``` - -However, this only clears PyTorch's **caching allocator**, not the model weights themselves. The model stays loaded in each worker's VRAM indefinitely. - ---- - -## Current Behavior vs. Expected Behavior - -### Current (Problematic) Behavior: - -1. App loads model in `__init__()` -2. Gunicorn forks N workers -3. Each worker has independent model in VRAM -4. Concurrent requests → N models active simultaneously -5. VRAM usage = N × model_size -6. OOM when N × model_size > available VRAM - -### Expected Behavior: - -Models should either: -1. **Load on-demand** per request and be freed after -2. **Share VRAM** across workers (if possible) -3. **Use a model server** pattern with worker pooling -4. **Limit workers** based on available VRAM, not CPU count - ---- - -## How to Replicate and Test - -### Test Script: `test_issue_243.py` - -I've created a comprehensive test script that simulates the issue without requiring an actual whisper model. - -#### Prerequisites - -```bash -# Optional: For CUDA testing -pip install torch # with CUDA support - -# For monitoring mode -pip install requests -``` - -#### Test Scenarios - -**1. Development Mode (Baseline)** -```bash -# Single process, single model in VRAM -python test_issue_243.py --mode dev --model-size 100 -``` - -Expected: One model copy (~100MB VRAM) - -**2. Production Mode (Demonstrates Issue)** -```bash -# Multiple workers, multiple models in VRAM -python test_issue_243.py --mode prod --model-size 100 -``` - -Expected: N model copies (~N × 100MB VRAM) - -**3. 
Concurrent Request Testing** - -Terminal 1 - Start server: -```bash -python test_issue_243.py --mode prod --model-size 100 --port 5000 -``` - -Terminal 2 - Send concurrent requests: -```bash -python test_issue_243.py --mode monitor --port 5000 -``` - -**4. Custom Worker Count** -```bash -# Test with specific number of workers -python test_issue_243.py --mode prod --workers 5 --model-size 100 -``` - -#### What to Look For - -1. **Worker PIDs**: Each request shows which worker processed it -2. **VRAM Growth**: Monitor GPU memory as workers start -3. **Multiple Model Copies**: Different workers have different model instances -4. **Concurrent Load**: When 10 requests hit simultaneously, multiple workers activate - -**With CUDA available:** -```bash -# Watch VRAM in real-time while running tests -watch -n 1 nvidia-smi -``` - -**Without CUDA:** -The script will simulate the issue and show worker-level model duplication even without GPU. - ---- - -## Key Observations from Code Analysis - -### 1. Worker Initialization (clams/restify/__init__.py:51-67) - -```python -def number_of_workers(): - return (multiprocessing.cpu_count() * 2) + 1 - -class ProductionApplication(gunicorn.app.base.BaseApplication): - def __init__(self, app, host, port, **options): - self.options = { - 'bind': f'{host}:{port}', - 'workers': number_of_workers(), # ← CPU-based, ignores GPU - 'threads': 2, - 'accesslog': '-', - } -``` - -**Issue**: Worker count is based solely on CPU cores, completely ignoring GPU memory constraints. - -### 2. CUDA Profiling (clams/app/__init__.py:349-392) - -The SDK includes CUDA memory profiling that tracks peak VRAM usage: - -```python -@staticmethod -def _profile_cuda_memory(func): - def wrapper(*args, **kwargs): - # Reset peak memory tracking - torch.cuda.reset_peak_memory_stats('cuda') - - result = func(*args, **kwargs) - - # Record peak usage per GPU - for device_id in range(device_count): - peak_memory = torch.cuda.max_memory_allocated(f'cuda:{device_id}') - cuda_profiler[key] = peak_memory - - return result, cuda_profiler - finally: - torch.cuda.empty_cache() # ← Only clears cache, not model -``` - -**Key Points:** -- Profiling is helpful for monitoring -- `empty_cache()` doesn't free model weights -- Peak memory tracking is per-request, not per-worker - -### 3. No Pre-Fork Hooks - -Gunicorn provides hooks like `pre_fork()`, `post_fork()`, `post_worker_init()` that could be used to: -- Delay model loading until after fork -- Implement shared model serving -- Manage worker-to-GPU assignment - -**Currently not implemented in the SDK.** - ---- - -## Related Issues and Context - -### Referenced PR: app-doctr-wrapper #6 - -The issue mentions this shares a root cause with a PR in app-doctr-wrapper. This suggests: -- Multiple CLAMS apps experience this issue -- DocTR (Document Text Recognition) models also consume significant VRAM -- The problem is systemic to the SDK, not app-specific - -### Production Environment Context - -The issue specifically mentions: -- **Hardware**: NVIDIA GPU support -- **App**: Whisper wrapper v10 -- **Trigger**: Multiple POST requests (concurrent or sequential) -- **Symptom**: Progressive GPU memory saturation → OOM - -This matches the behavior described above perfectly. - ---- - -## Verification Steps - -To verify this is happening in your production environment: - -### 1. Check Worker Count -```bash -# While app is running in production -ps aux | grep gunicorn | grep -v grep | wc -l -``` - -You should see N+1 processes (master + N workers) - -### 2. 
Monitor VRAM per Process -```bash -# Install nvidia-smi if not available -nvidia-smi pmon -c 1 - -# Or for continuous monitoring -watch -n 1 'nvidia-smi --query-compute-apps=pid,used_memory --format=csv' -``` - -You should see multiple PIDs each consuming ~model_size VRAM - -### 3. Test Concurrent Requests - -```python -import requests -from concurrent.futures import ThreadPoolExecutor - -url = "http://your-app:5000" -mmif_data = "{}" # minimal MMIF - -def send_request(i): - response = requests.post(url, data=mmif_data, - params={'hwFetch': 'true'}) - return response.json() - -# Send 10 concurrent requests -with ThreadPoolExecutor(max_workers=10) as executor: - results = list(executor.map(send_request, range(10))) - -# Check how many different workers responded -workers = set() -for result in results: - if 'views' in result and result['views']: - # Extract worker info from view metadata - workers.add(result['views'][0].get('metadata', {}).get('app_pid')) - -print(f"Requests distributed across {len(workers)} workers") -``` - -### 4. Compare Development vs. Production - -```bash -# Development (single process) -python app.py --port 5000 & -sleep 5 -nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits - -# Production (multiple workers) -python app.py --production --port 5001 & -sleep 10 # Give workers time to start -nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -``` - -Production should show significantly more VRAM usage. - ---- - -## Potential Solutions (Analysis Only) - -*Note: Per your request, I'm only analyzing potential solutions, not implementing them.* - -### 1. Lazy Model Loading -Load models in `_annotate()` instead of `__init__()`, with proper cleanup. - -**Pros**: Simple to implement -**Cons**: Slow (load/unload per request), still uses same peak VRAM - -### 2. Worker-to-GPU Affinity -Limit workers to match available GPUs, assign each worker to specific GPU. - -**Pros**: Predictable VRAM usage -**Cons**: Underutilizes CPU on GPU-poor systems - -### 3. Model Server Pattern -Separate model serving process(es), workers communicate via IPC/network. - -**Pros**: True model sharing, scales independently -**Cons**: Complex architecture, adds latency - -### 4. Gunicorn Pre-Load + Smart Fork -Use `preload_app=True` with delayed CUDA initialization after fork. - -**Pros**: Maintains multi-worker concurrency -**Cons**: Requires careful CUDA context management - -### 5. Dynamic Worker Scaling -Calculate workers based on available VRAM, not CPU count. - -**Pros**: Prevents OOM -**Cons**: May underutilize system resources - ---- - -## Recommended Next Steps - -1. **Validate the issue** using `test_issue_243.py` on your setup -2. **Measure actual impact** in your production environment -3. **Gather requirements**: - - Typical request concurrency - - Model sizes - - Available VRAM - - Acceptable latency -4. **Evaluate solutions** based on your constraints -5. 
**Prototype** the most promising approach - ---- - -## Additional Resources - -### CLAMS SDK Files to Review: -- `clams/restify/__init__.py` - Gunicorn configuration -- `clams/app/__init__.py` - CUDA profiling and app lifecycle -- `clams/develop/templates/app/app.py.template` - App structure - -### Gunicorn Documentation: -- [Server Hooks](https://docs.gunicorn.org/en/stable/settings.html#server-hooks) -- [Worker Configuration](https://docs.gunicorn.org/en/stable/settings.html#worker-processes) -- [Preloading Applications](https://docs.gunicorn.org/en/stable/settings.html#preload-app) - -### PyTorch CUDA: -- [CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) -- [Memory Management](https://pytorch.org/docs/stable/notes/cuda.html#memory-management) -- [Multiprocessing](https://pytorch.org/docs/stable/notes/multiprocessing.html) - ---- - -## Summary - -Issue #243 is a **systemic architectural challenge** where the SDK's CPU-based worker scaling conflicts with GPU memory constraints. The test script (`test_issue_243.py`) provides a safe, controlled way to observe and measure this behavior without affecting your repository or production systems. - -The issue is real, measurable, and impacts any CLAMS app using large PyTorch models in production. Solutions will require careful trade-offs between simplicity, performance, and resource utilization. diff --git a/ISSUE_243_INVESTIGATION.md b/ISSUE_243_INVESTIGATION.md new file mode 100644 index 0000000..06689e7 --- /dev/null +++ b/ISSUE_243_INVESTIGATION.md @@ -0,0 +1,655 @@ +# Issue #243 Investigation: GPU Memory Management in CLAMS SDK + +**Issue**: https://github.com/clamsproject/clams-python/issues/243 +**Status**: Investigation Complete - SDK-Level Solution Proposed +**Date**: 2025-01-08 + +--- + +## Executive Summary + +When CLAMS applications using PyTorch models run in production mode (gunicorn), each worker process independently loads models into GPU VRAM. This leads to excessive memory consumption that scales with worker count, causing OOM errors. + +**Root Cause**: Gunicorn's multi-process architecture combined with VRAM being a shared, dynamic resource that cannot be allocated statically. + +**Proposed Solution**: SDK-level VRAM management through runtime checking and unified API, rather than fragmented app-level implementations. + +--- + +## Problem Analysis + +### Current Architecture + +**Production Configuration** (`clams/restify/__init__.py:42-78`): +- Workers: `(CPU_count × 2) + 1` +- Threads per worker: 2 +- Worker class: sync (default) + +On an 8-core machine: **17 workers** + +**Model Loading Pattern** (from templates and existing apps): +```python +class MyApp(ClamsApp): + def __init__(self): + super().__init__() + # Option A: Load in __init__ (pre-fork) - BAD + # self.model = torch.load('model.pt') + + # Option B: Load on-demand in _annotate() - BETTER but still problematic + self.models = {} + + def _annotate(self, mmif, **parameters): + if model_name not in self.models: + self.models[model_name] = torch.load('model.pt') # Each worker loads independently + # ... +``` + +### Why This Causes Issues + +**Multi-Worker Duplication**: +``` +Worker 1: Loads model on first request → 3GB VRAM +Worker 2: Loads model on first request → 3GB VRAM +... 
+Worker 17: Loads model on first request → 3GB VRAM + +Total: 17 × 3GB = 51GB VRAM required +``` + +**VRAM is a Shared, Dynamic Resource**: +- Other applications can allocate VRAM at any time +- Cannot assume static VRAM availability at startup +- Must check availability at runtime before loading + +**Process Isolation**: +- Each gunicorn worker is a separate OS process +- CUDA contexts are process-isolated (no shared GPU memory) +- Workers cannot share model instances in VRAM + +### Real-World Impact + +**Example: Whisper Wrapper** (`app-whisper-wrapper/app.py`) + +The app attempts mitigation by: +1. Loading models on-demand (not in `__init__()`) +2. Caching models per worker +3. "Conflict prevention" that loads duplicate models if one is in use + +**Problems**: +- Still loads one model per worker over time (51GB for 17 workers) +- Conflict prevention can load duplicates within same worker (102GB worst case) +- No awareness of VRAM availability from other processes + +--- + +## Proposed SDK-Level Solution + +### Design Principles + +1. **Centralized Management**: SDK handles VRAM checking, not individual apps +2. **Runtime Checking**: Check VRAM availability at request time, not startup +3. **Fail Fast**: Return clear errors when VRAM unavailable +4. **Backward Compatible**: Existing apps continue working without changes +5. **Opt-In Enhancement**: Apps can declare requirements for better behavior + +### Architecture Overview + +``` +Request Flow: + HTTP POST → ClamsHTTPApi.post() + → ClamsApp.annotate() + → _profile_cuda_memory() decorator + → Check VRAM requirements (NEW) + → Call _annotate() if sufficient VRAM + → torch.cuda.empty_cache() cleanup (EXISTING) +``` + +### Component 1: Model Requirements API + +**Apps declare their model memory needs:** + +```python +# clams/app/__init__.py - Add to ClamsApp base class + +class ClamsApp(ABC): + + def _get_model_requirements(self, **parameters) -> Optional[dict]: + """ + Declare model memory requirements based on runtime parameters. + + Apps override this to enable VRAM checking. + + :param parameters: Runtime parameters from the request + :return: Dict with 'size_bytes' and optional 'name', or None + + Example: + def _get_model_requirements(self, **parameters): + model_sizes = {'small': 2*1024**3, 'large': 6*1024**3} + model = parameters.get('model', 'small') + return {'size_bytes': model_sizes[model], 'name': model} + """ + return None # Default: no specific requirements +``` + +**App Implementation Example** (whisper-wrapper): +```python +class WhisperWrapper(ClamsApp): + + MODEL_SIZES = { + 'tiny': 500 * 1024**2, + 'base': 1024 * 1024**2, + 'small': 2 * 1024**3, + 'medium': 3 * 1024**3, + 'large': 6 * 1024**3, + 'large-v2': 6 * 1024**3, + 'large-v3': 6 * 1024**3, + 'turbo': 3 * 1024**3, + } + + def _get_model_requirements(self, **parameters): + size = parameters.get('model', 'medium') + if size in self.model_size_alias: + size = self.model_size_alias[size] + + return { + 'size_bytes': self.MODEL_SIZES.get(size, 3 * 1024**3), + 'name': size + } +``` + +### Component 2: Runtime VRAM Checking + +**Enhance existing `_profile_cuda_memory()` decorator:** + +```python +# clams/app/__init__.py:349-392 - Enhanced version + +@staticmethod +def _profile_cuda_memory(func): + """ + Decorator for profiling CUDA memory usage and managing VRAM availability. 
+ """ + def wrapper(self, *args, **kwargs): + cuda_profiler = {} + torch_available = False + cuda_available = False + + try: + import torch + torch_available = True + cuda_available = torch.cuda.is_available() + except ImportError: + pass + + # NEW: Runtime VRAM checking before execution + if torch_available and cuda_available: + # Get model requirements from app + requirements = self._get_model_requirements(**kwargs) + + if requirements: + required_bytes = requirements['size_bytes'] + model_name = requirements.get('name', 'model') + + # Check if sufficient VRAM available RIGHT NOW + if not ClamsApp._check_vram_available(required_bytes): + available_gb = ClamsApp._get_available_vram() / 1024**3 + required_gb = required_bytes / 1024**3 + + error_msg = ( + f"Insufficient GPU memory for {model_name}. " + f"Required: {required_gb:.2f}GB, " + f"Available: {available_gb:.2f}GB. " + f"GPU may be in use by other processes. " + f"Please retry later." + ) + self.logger.error(error_msg) + raise RuntimeError(error_msg) + + self.logger.info( + f"VRAM check passed for {model_name}: " + f"{required_gb:.2f}GB required, " + f"{ClamsApp._get_available_vram() / 1024**3:.2f}GB available" + ) + + # Reset peak memory stats + torch.cuda.reset_peak_memory_stats('cuda') + + try: + result = func(self, *args, **kwargs) + + # Record peak memory usage (EXISTING) + if torch_available and cuda_available: + device_count = torch.cuda.device_count() + for device_id in range(device_count): + device_id_str = f'cuda:{device_id}' + peak_memory = torch.cuda.max_memory_allocated(device_id_str) + gpu_name = torch.cuda.get_device_name(device_id_str) + gpu_total = torch.cuda.get_device_properties(device_id_str).total_memory + key = ClamsApp._cuda_device_name_concat(gpu_name, gpu_total) + cuda_profiler[key] = peak_memory + + return result, cuda_profiler + finally: + # Cleanup (EXISTING) + if torch_available and cuda_available: + torch.cuda.empty_cache() + + return wrapper + +@staticmethod +def _check_vram_available(required_bytes, safety_margin=0.1): + """ + Check if sufficient VRAM is available at this moment. 
+ + :param required_bytes: Bytes needed for model + :param safety_margin: Fraction of total VRAM to keep as headroom (default 10%) + :return: True if sufficient VRAM available + """ + try: + import torch + if not torch.cuda.is_available(): + return True # No CUDA, no constraints + + device = torch.cuda.current_device() + props = torch.cuda.get_device_properties(device) + total_vram = props.total_memory + + # Get currently allocated/reserved memory + allocated = torch.cuda.memory_allocated(device) + reserved = torch.cuda.memory_reserved(device) + used = max(allocated, reserved) + + # Calculate available VRAM RIGHT NOW + available = total_vram - used + + # Apply safety margin + required_with_margin = required_bytes + (total_vram * safety_margin) + + return available >= required_with_margin + + except Exception: + # If we can't check, fail open (allow the request) + return True + +@staticmethod +def _get_available_vram(): + """Get currently available VRAM in bytes""" + try: + import torch + if not torch.cuda.is_available(): + return 0 + + device = torch.cuda.current_device() + total = torch.cuda.get_device_properties(device).total_memory + used = max(torch.cuda.memory_allocated(device), + torch.cuda.memory_reserved(device)) + return total - used + except: + return 0 +``` + +### Component 3: Conservative Worker Count + +**Adjust default worker calculation when CUDA detected:** + +```python +# clams/restify/__init__.py:51-52 - Modified + +def number_of_workers(): + """ + Calculate workers considering GPU constraints. + Use conservative count when CUDA available since VRAM is the bottleneck. + """ + import multiprocessing + + cpu_workers = (multiprocessing.cpu_count() * 2) + 1 + + # Check if CUDA available (indicates GPU workload) + try: + import torch + if torch.cuda.is_available(): + # Use conservative worker count for GPU apps + # Runtime VRAM checking will prevent OOM + # Fewer workers = less memory overhead, more predictable behavior + gpu_conservative_workers = min(4, multiprocessing.cpu_count()) + return gpu_conservative_workers + except ImportError: + pass + + return cpu_workers +``` + +### Component 4: Runtime Status API + +**Expose VRAM status through existing metadata endpoint:** + +```python +# clams/app/__init__.py - Add method + +def get_runtime_info(self) -> dict: + """ + Get runtime information including GPU/VRAM status. + Apps can override to add custom runtime info. 
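
    :return: dict with a ``gpu`` key; when CUDA is available it lists each
        device's id, name, total memory, and currently available memory in GB.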
+ """ + info = {} + + try: + import torch + if torch.cuda.is_available(): + devices = [] + for i in range(torch.cuda.device_count()): + props = torch.cuda.get_device_properties(i) + total = props.total_memory + used = max(torch.cuda.memory_allocated(i), + torch.cuda.memory_reserved(i)) + + devices.append({ + 'id': i, + 'name': props.name, + 'total_memory_gb': round(total / 1024**3, 2), + 'available_memory_gb': round((total - used) / 1024**3, 2), + }) + + info['gpu'] = {'available': True, 'devices': devices} + except: + info['gpu'] = {'available': False} + + return info +``` + +```python +# clams/restify/__init__.py:121-129 - Modify GET handler + +def get(self) -> Response: + """Maps HTTP GET verb to appmetadata with optional runtime info""" + raw_params = request.args.to_dict(flat=False) + + # Check for runtime info request + if 'includeVRAM' in raw_params or 'includeRuntime' in raw_params: + import json + metadata = json.loads(self.cla.appmetadata(**raw_params)) + metadata['runtime'] = self.cla.get_runtime_info() + return self.json_to_response(json.dumps(metadata)) + + return self.json_to_response(self.cla.appmetadata(**raw_params)) +``` + +**Usage:** +```bash +# Normal metadata +curl http://localhost:5000/ + +# Metadata + current VRAM status +curl http://localhost:5000/?includeVRAM=true +``` + +**Response example:** +```json +{ + "name": "Whisper Wrapper", + "version": "1.0.0", + "parameters": [...], + "runtime": { + "gpu": { + "available": true, + "devices": [ + { + "id": 0, + "name": "NVIDIA RTX 4090", + "total_memory_gb": 24.0, + "available_memory_gb": 18.5 + } + ] + } + } +} +``` + +--- + +## How It Works + +### Request Flow + +1. **Client sends POST request** with MMIF data and parameters (e.g., `model=large`) + +2. **SDK calls `_get_model_requirements()`** to determine memory needs + - If app implements it: Returns `{'size_bytes': 6*1024**3, 'name': 'large'}` + - If not implemented: Returns `None`, no VRAM checking + +3. **SDK checks current VRAM availability** + - Queries CUDA driver for real-time memory state + - Accounts for memory used by other processes + - Compares available vs. required (with 10% safety margin) + +4. **Decision:** + - **Sufficient VRAM**: Proceed to `_annotate()`, app loads model + - **Insufficient VRAM**: Raise `RuntimeError`, return HTTP 500 with clear message + +5. **After annotation completes**: SDK calls `torch.cuda.empty_cache()` to release cached memory + +### Error Handling + +**When VRAM is insufficient:** + +``` +HTTP 500 Internal Server Error + +{ + "error": "Insufficient GPU memory for large. Required: 6.00GB, Available: 4.50GB. GPU may be in use by other processes. Please retry later." 
+} +``` + +**Client retry logic:** +```python +import requests +import time + +def transcribe_with_retry(url, data, max_retries=3): + for attempt in range(max_retries): + response = requests.post(url, data=data) + + if response.ok: + return response.json() + + if "Insufficient GPU memory" in response.text: + wait = 5 * (2 ** attempt) # Exponential backoff + print(f"GPU busy, retrying in {wait}s...") + time.sleep(wait) + continue + + raise Exception(f"Request failed: {response.status_code}") + + raise Exception("Max retries exceeded") +``` + +--- + +## Benefits + +### ✅ Centralized Solution +- All CLAMS apps benefit from VRAM management +- No need for each app to implement separately +- Consistent behavior across ecosystem + +### ✅ Handles Dynamic VRAM +- Checks availability at request time +- Accounts for other processes using GPU +- No static assumptions about available memory + +### ✅ Backward Compatible +- Existing apps continue working without changes +- Apps without `_get_model_requirements()` skip VRAM checking +- No breaking changes to API + +### ✅ Clear Error Messages +- Clients know exactly why request failed +- Can implement retry logic +- Better than cryptic CUDA OOM errors + +### ✅ Observable +- `includeVRAM` parameter exposes current GPU state +- Monitoring systems can track VRAM usage +- Helps with capacity planning + +### ✅ Process-Safe +- `torch.cuda.empty_cache()` only affects current process +- No interference with other workers or applications +- Each worker manages its own CUDA context + +--- + +## App Migration Path + +### Phase 1: SDK Update (No App Changes Required) +1. Update SDK with VRAM checking components +2. Conservative worker count for CUDA-enabled systems +3. All apps automatically get `empty_cache()` cleanup +4. Runtime status available via `?includeVRAM=true` + +### Phase 2: App Opt-In (Enhanced Behavior) +Apps implement `_get_model_requirements()`: + +```python +class MyApp(ClamsApp): + def _get_model_requirements(self, **parameters): + # Declare memory needs + return {'size_bytes': 3 * 1024**3, 'name': 'my-model'} +``` + +Now the app gets: +- Runtime VRAM checking before model load +- Clear error messages when insufficient memory +- Automatic fail-fast behavior + +### Phase 3: Optional Enhancements +Apps can add: +- Model size estimates in metadata +- Alternative suggestions when VRAM low +- Idle model unloading after timeout + +--- + +## Verification Plan + +### 1. VRAM Isolation Test +Verify `empty_cache()` doesn't affect other processes: + +```bash +# Terminal 1: Start whisper-wrapper +python app.py --production + +# Terminal 2: Start another GPU app (e.g., another CLAMS app) +python other_app.py --production + +# Terminal 3: Monitor GPU +watch -n 1 nvidia-smi + +# Send requests to both apps simultaneously +# Verify: Each process maintains independent VRAM, no interference +``` + +### 2. Dynamic VRAM Test +Verify runtime checking handles contention: + +```python +# Start app with available VRAM +# Load large model in separate process to consume VRAM +# Send request to app → should fail with clear error +# Unload model in separate process +# Retry request → should succeed +``` + +### 3. Multi-Worker Test +Verify conservative worker count prevents overload: + +```bash +# 8-core machine, CUDA available +# Start app → verify ≤4 workers (not 17) +# Send concurrent requests +# Monitor VRAM → verify total usage stays within limits +``` + +### 4. 
Backward Compatibility Test +Verify apps without `_get_model_requirements()` still work: + +```python +# Use app that doesn't implement _get_model_requirements() +# Send requests → should process normally +# VRAM checking skipped, but cleanup still happens +``` + +--- + +## Implementation Checklist + +**SDK Changes:** +- [ ] Add `_get_model_requirements()` API to `ClamsApp` +- [ ] Add `_check_vram_available()` static method +- [ ] Add `_get_available_vram()` static method +- [ ] Enhance `_profile_cuda_memory()` decorator with VRAM checking +- [ ] Add `get_runtime_info()` method to `ClamsApp` +- [ ] Modify `number_of_workers()` for conservative GPU count +- [ ] Modify `ClamsHTTPApi.get()` to support `includeVRAM` parameter + +**Documentation:** +- [ ] Document `_get_model_requirements()` API for app developers +- [ ] Document `?includeVRAM=true` parameter for clients +- [ ] Document error handling and retry best practices +- [ ] Update app development template with example implementation + +**Testing:** +- [ ] Unit tests for VRAM checking logic +- [ ] Integration tests with mock CUDA +- [ ] Multi-process isolation verification +- [ ] Backward compatibility tests + +**App Updates (Optional):** +- [ ] Update whisper-wrapper with `_get_model_requirements()` +- [ ] Update other GPU-based apps as needed + +--- + +## Open Questions + +1. **Safety Margin**: Is 10% headroom appropriate, or should it be configurable? + +2. **Multi-GPU**: Should SDK support GPU selection/load balancing across devices? + +3. **Health Endpoint**: Add dedicated `/health` endpoint in addition to `?includeVRAM`? + +4. **Model Unloading**: Should SDK provide automatic model eviction after idle time? + +5. **Worker Override**: Should apps be able to override worker count calculation? + +--- + +## References + +**Related Code:** +- `clams/app/__init__.py:349-392` - CUDA profiling decorator +- `clams/restify/__init__.py:42-78` - Production server setup +- `app-whisper-wrapper/app.py` - Real-world example with attempted mitigation + +**Related Issues:** +- Issue #243: Main issue tracking this problem +- app-doctr-wrapper PR #6: Similar problem in different app + +**External Resources:** +- [PyTorch CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) +- [Gunicorn Settings](https://docs.gunicorn.org/en/stable/settings.html) +- [CUDA Memory Management](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) + +--- + +## Conclusion + +The proposed SDK-level solution addresses the root cause of issue #243 by: + +1. **Checking VRAM at runtime** - No static assumptions about availability +2. **Failing fast with clear errors** - Better than OOM crashes +3. **Conservative worker defaults** - Prevents overloading GPU systems +4. **Centralized implementation** - All apps benefit automatically +5. **Backward compatible** - No breaking changes + +This approach provides a robust foundation for GPU resource management in the CLAMS ecosystem while maintaining flexibility for future enhancements. diff --git a/ISSUE_243_REAL_WORLD_ANALYSIS.md b/ISSUE_243_REAL_WORLD_ANALYSIS.md deleted file mode 100644 index 3e01908..0000000 --- a/ISSUE_243_REAL_WORLD_ANALYSIS.md +++ /dev/null @@ -1,484 +0,0 @@ -# Issue #243 Real-World Analysis: Whisper Wrapper Implementation - -## Executive Summary - -The whisper wrapper app **attempts** to solve the GPU memory issue by loading models on-demand rather than in `__init__()`. 
However, the implementation still suffers from **per-worker model duplication** and has a problematic **"conflict prevention" mechanism** that can load duplicate models within the same worker. - ---- - -## Actual Implementation Analysis - -### Source Code -**Repository**: https://github.com/clamsproject/app-whisper-wrapper -**File**: `app.py` - -### Model Loading Strategy - -#### 1. Initialization (Lines 28-31) -```python -def __init__(self): - super().__init__() - self.whisper_models = {} - self.model_usage = {} -``` - -**Good**: Models are NOT loaded in `__init__()`, avoiding the pre-fork duplication issue. - -**Problem**: Each worker still has its own `self.whisper_models` dict after fork, leading to per-worker caching. - -#### 2. On-Demand Loading in `_annotate()` (Lines 78-96) - -```python -if size not in self.whisper_models: - self.logger.debug(f'Loading model {size}') - t = time.perf_counter() - self.whisper_models[size] = whisper.load_model(size) - self.logger.debug(f'Load time: {time.perf_counter() - t:.2f} seconds\n') - self.model_usage[size] = False - -if not self.model_usage[size]: - whisper_model = self.whisper_models.get(size) - self.model_usage[size] = True - cached = True -else: - self.logger.debug(f'Loading model {size} to avoid memory conflict') - t = time.perf_counter() - whisper_model = whisper.load_model(size) - self.logger.debug(f'Load time: {time.perf_counter() - t:.2f} seconds\n') - cached = False -``` - -**Logic**: -1. First request to a worker: Load model into `self.whisper_models[size]` and cache it -2. Subsequent requests to the same worker: - - If `model_usage[size]` is False (model not in use): Use cached model - - If `model_usage[size]` is True (model in use): **Load a SECOND copy!** - -#### 3. Cleanup After Transcription (Line 128) -```python -if size in self.model_usage and cached == True: - self.model_usage[size] = False -``` - -**Intent**: Mark model as "not in use" so next request can reuse it. - -**Problem**: The second model copy (when `cached = False`) is never tracked or cleaned up! - ---- - -## Why This Is Still Problematic - -### Issue #1: Per-Worker Model Caching - -**Scenario**: 8-core CPU → 17 workers, Whisper "medium" model (~3GB VRAM) - -**What happens**: -1. First request hits Worker 1 → loads model (3GB) -2. First request hits Worker 2 → loads model (3GB) -3. ... -4. First request hits Worker 17 → loads model (3GB) - -**Result**: 17 × 3GB = **51GB VRAM** (same as the original issue!) - -**Why**: Each worker's `self.whisper_models` dict is independent. There's no cross-worker model sharing. - -### Issue #2: The "Conflict Prevention" Mechanism - -The code assumes concurrent requests within the same worker can happen. Let's analyze when this occurs: - -**Gunicorn Worker Types**: -- **sync** (default in CLAMS SDK): Single-threaded, handles one request at a time - - The `threads: 2` setting is **ignored** with sync workers! - - `model_usage` tracking is **unnecessary** with sync workers -- **gthread**: Multi-threaded, can handle concurrent requests - - If using gthread workers, the conflict prevention kicks in - -**Problem with gthread workers (2 threads per worker)**: - -**Timeline**: -``` -Worker 1, Thread 1: - T0: Load model into cache (3GB) - T1: Set model_usage = True - T2: Start transcription (takes 10 seconds) - -Worker 1, Thread 2 (concurrent request): - T3: Check model_usage → True - T4: "Loading model to avoid memory conflict" - T5: Load SECOND copy of model (3GB) - T6: Start transcription - -Worker 1 now has: 3GB + 3GB = 6GB in VRAM! 
-``` - -**Worst Case**: 17 workers × 2 threads × 3GB = **102GB VRAM** - -### Issue #3: Memory Leak - -```python -else: - self.logger.debug(f'Loading model {size} to avoid memory conflict') - t = time.perf_counter() - whisper_model = whisper.load_model(size) # ← Local variable - self.logger.debug(f'Load time: {time.perf_counter() - t:.2f} seconds\n') - cached = False -``` - -**Problem**: The second model copy is stored in `whisper_model` (local variable) but never explicitly deleted or freed. It becomes garbage only when: -1. The function returns -2. Python's GC runs -3. PyTorch releases the CUDA memory - -During long transcriptions, multiple copies can accumulate if requests arrive faster than transcription completes. - ---- - -## Verification Steps - -### Check If You're Hitting This Issue - -**1. Check Worker Configuration** - -Look at the actual running gunicorn config: -```bash -# While app is running -ps aux | grep gunicorn - -# You'll see something like: -# gunicorn: master [WhisperWrapper] -# gunicorn: worker [WhisperWrapper] (17 workers) -``` - -Count the workers: should be `(CPU_count × 2) + 1` - -**2. Monitor Per-Worker VRAM Usage** - -```bash -# Watch VRAM in real-time -watch -n 1 'nvidia-smi --query-compute-apps=pid,used_memory --format=csv' -``` - -Send a request and watch: -- First request to the app → 1 PID appears with ~3GB -- Second concurrent request → potentially 2 PIDs or the same PID with 6GB - -**3. Test Concurrent Requests** - -```python -import requests -import time -from concurrent.futures import ThreadPoolExecutor -import json - -url = "http://your-whisper-app:5000" - -# Create minimal MMIF with audio document -mmif = { - "metadata": {"mmif": "http://mmif.clams.ai/1.0.0"}, - "documents": [{ - "@type": "http://mmif.clams.ai/vocabulary/AudioDocument/v1", - "properties": { - "id": "d1", - "location": "file:///path/to/audio.mp3" - } - }] -} - -def send_request(i): - start = time.time() - response = requests.post(url, json=mmif, params={'model': 'medium'}) - duration = time.time() - start - - # Extract worker PID from logs if available - print(f"Request {i}: {response.status_code}, took {duration:.2f}s") - return response.json() if response.ok else None - -# Send 5 concurrent requests -print("Sending concurrent requests...") -with ThreadPoolExecutor(max_workers=5) as executor: - results = list(executor.map(send_request, range(5))) - -print(f"\nCompleted {len([r for r in results if r])} successful requests") -``` - -Watch `nvidia-smi` while this runs. You should see: -- Multiple PIDs (different workers) each consuming ~3GB, OR -- Same PID consuming 6GB+ (multiple models in one worker) - ---- - -## The Actual Problem Mechanism - -### Scenario 1: Low Concurrency (< Workers) - -**Setup**: 17 workers, 5 concurrent requests - -**What happens**: -1. Request 1 → Worker 1 loads model (3GB) -2. Request 2 → Worker 2 loads model (3GB) -3. Request 3 → Worker 3 loads model (3GB) -4. Request 4 → Worker 4 loads model (3GB) -5. Request 5 → Worker 5 loads model (3GB) - -**Total**: 5 × 3GB = 15GB VRAM - -**After requests complete**: Models stay cached in workers 1-5 - -**Next 5 requests**: Might reuse cached models OR hit different workers and load 5 more - -**Over time**: All 17 workers eventually load models → 51GB VRAM - -### Scenario 2: High Concurrency (> Workers) - -**Setup**: 17 workers, 50 concurrent requests - -**What happens**: -1. Requests 1-17 → Each worker loads one model (51GB) -2. Requests 18-34 → Reuse cached models (no new VRAM) -3. 
**IF using gthread workers**: Some requests queue up, then hit "in use" models - - Additional model copies loaded (potentially +51GB more!) - -**Total**: 51GB to 102GB VRAM depending on timing - -### Scenario 3: Varied Model Sizes - -**Setup**: Requests use different model parameters ('small', 'medium', 'large') - -**What happens**: -- Each worker caches ALL requested model sizes -- Worker 1: small (1GB) + medium (3GB) = 4GB -- Worker 2: medium (3GB) + large (6GB) = 9GB -- ... - -**Total**: Unpredictable, can exceed single-model worst case - ---- - -## Why the "Fix" Doesn't Work - -The whisper wrapper's approach tries to handle: -1. ✅ Avoiding pre-fork model loading (works) -2. ✅ Lazy loading on first request (works) -3. ✅ Caching within a worker (works) -4. ❌ Concurrent requests within a worker (broken - loads duplicates) -5. ❌ Cross-worker model sharing (impossible with this architecture) - -**The fundamental issue**: Python's multiprocessing fork + CUDA don't support shared GPU memory. - ---- - -## Better Testing Strategy - -### Modified Test Script for Whisper Wrapper - -```python -#!/usr/bin/env python3 -""" -Test script specifically for whisper-wrapper issue #243 -""" -import requests -import time -import subprocess -import json -from concurrent.futures import ThreadPoolExecutor, as_completed - -BASE_URL = "http://localhost:5000" - -def get_gpu_memory(): - """Get current GPU memory usage""" - try: - result = subprocess.run( - ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'], - capture_output=True, text=True - ) - return int(result.stdout.strip()) - except: - return None - -def get_gpu_processes(): - """Get PIDs using GPU and their memory""" - try: - result = subprocess.run( - ['nvidia-smi', '--query-compute-apps=pid,used_memory', '--format=csv,noheader,nounits'], - capture_output=True, text=True - ) - processes = {} - for line in result.stdout.strip().split('\n'): - if line: - pid, mem = line.split(', ') - processes[int(pid)] = int(mem) - return processes - except: - return {} - -def create_mmif_input(audio_path): - """Create minimal MMIF with audio document""" - return json.dumps({ - "metadata": {"mmif": "http://mmif.clams.ai/1.0.0"}, - "documents": [{ - "@type": "http://mmif.clams.ai/vocabulary/AudioDocument/v1", - "properties": { - "id": "d1", - "location": f"file://{audio_path}" - } - }] - }) - -def send_request(request_id, model_size='medium', audio_path='/path/to/test.mp3'): - """Send transcription request""" - mmif_input = create_mmif_input(audio_path) - - start = time.time() - try: - response = requests.post( - BASE_URL, - data=mmif_input, - params={'model': model_size, 'hwFetch': 'true'}, - headers={'Content-Type': 'application/json'}, - timeout=300 - ) - duration = time.time() - start - - if response.ok: - result = response.json() - worker_info = "unknown" - if 'views' in result and result['views']: - metadata = result['views'][0].get('metadata', {}) - worker_info = metadata.get('worker_pid', 'unknown') - - return { - 'id': request_id, - 'status': 'success', - 'duration': duration, - 'worker': worker_info, - 'model': model_size - } - else: - return { - 'id': request_id, - 'status': 'error', - 'error': response.status_code, - 'duration': duration - } - except Exception as e: - return { - 'id': request_id, - 'status': 'exception', - 'error': str(e), - 'duration': time.time() - start - } - -def main(): - print("\n" + "="*70) - print("WHISPER WRAPPER - Issue #243 Test") - print("="*70) - - # Initial state - print("\n1. 
Initial GPU State:") - initial_mem = get_gpu_memory() - initial_procs = get_gpu_processes() - print(f" Total VRAM used: {initial_mem} MB") - print(f" Processes using GPU: {len(initial_procs)}") - - # Test 1: Single request - print("\n2. Sending single request...") - result = send_request(1, model_size='medium', audio_path='/path/to/short-test.mp3') - print(f" Result: {result['status']}, Duration: {result['duration']:.2f}s") - - time.sleep(2) - after_one_mem = get_gpu_memory() - after_one_procs = get_gpu_processes() - print(f" VRAM after 1 request: {after_one_mem} MB (Δ {after_one_mem - initial_mem} MB)") - print(f" Processes: {len(after_one_procs)} - {after_one_procs}") - - # Test 2: Sequential requests (should reuse model in same worker) - print("\n3. Sending 5 sequential requests...") - for i in range(2, 7): - result = send_request(i, model_size='medium') - print(f" Request {i}: {result['status']}, Worker: {result.get('worker', 'unknown')}") - time.sleep(1) - - after_seq_mem = get_gpu_memory() - after_seq_procs = get_gpu_processes() - print(f" VRAM after sequential: {after_seq_mem} MB (Δ {after_seq_mem - initial_mem} MB)") - print(f" Processes: {len(after_seq_procs)} - {after_seq_procs}") - - # Test 3: Concurrent requests (this will trigger multi-worker loading) - print("\n4. Sending 10 concurrent requests...") - with ThreadPoolExecutor(max_workers=10) as executor: - futures = [executor.submit(send_request, i, 'medium') for i in range(10, 20)] - results = [f.result() for f in as_completed(futures)] - - workers_used = set(r.get('worker') for r in results if r.get('worker')) - print(f" Completed: {len([r for r in results if r['status'] == 'success'])}/10") - print(f" Unique workers: {len(workers_used)} - {workers_used}") - - time.sleep(2) - after_concurrent_mem = get_gpu_memory() - after_concurrent_procs = get_gpu_processes() - print(f" VRAM after concurrent: {after_concurrent_mem} MB (Δ {after_concurrent_mem - initial_mem} MB)") - print(f" Processes: {len(after_concurrent_procs)} - {after_concurrent_procs}") - - # Analysis - print("\n" + "="*70) - print("ANALYSIS:") - print("="*70) - print(f"Initial VRAM: {initial_mem} MB") - print(f"After 1 request: {after_one_mem} MB (1 model loaded)") - print(f"After sequential: {after_seq_mem} MB (should be same if worker reused)") - print(f"After concurrent: {after_concurrent_mem} MB ({len(workers_used)} workers loaded models)") - print(f"\nExpected VRAM per model: ~3000 MB (medium)") - print(f"Expected for {len(workers_used)} workers: ~{len(workers_used) * 3000} MB") - print(f"Actual increase: {after_concurrent_mem - initial_mem} MB") - - if len(workers_used) > 5: - print(f"\n⚠️ WARNING: {len(workers_used)} different workers loaded models!") - print(f" This demonstrates the issue: each worker loads independently") - - if len(after_concurrent_procs) > len(workers_used): - print(f"\n⚠️ WARNING: More GPU processes than unique workers!") - print(f" This suggests duplicate models within workers (conflict prevention)") - - print("="*70 + "\n") - -if __name__ == '__main__': - # NOTE: Update the audio_path in send_request() to point to a real audio file - main() -``` - -**Usage**: -1. Start whisper-wrapper in production mode -2. Update `audio_path` in the script to point to a real audio file -3. Run the test script -4. Watch memory grow as concurrent requests hit different workers - ---- - -## Summary - -The whisper wrapper's implementation reveals: - -1. **Models are NOT loaded in `__init__()`** - This is good and avoids one problem -2. 
**Models ARE cached per-worker** - Each worker loads its own copy on first use -3. **"Conflict prevention" loads duplicates** - If configured with threaded workers -4. **No cleanup mechanism** - Models stay in VRAM indefinitely per worker - -**The core issue remains**: With 17 workers and a 3GB model, you still get 51GB VRAM usage over time as different workers handle requests. - -This is **architectural** - the solution requires either: -- Limiting workers based on VRAM (not CPU) -- External model serving -- Different worker/threading strategy -- Or accepting high VRAM usage and scaling horizontally with multiple GPUs - ---- - -## Recommended Next Steps - -1. **Measure your actual VRAM usage** using the test script above -2. **Determine your concurrency needs** (how many concurrent requests?) -3. **Calculate required VRAM**: workers needed × model size -4. **Compare to available VRAM** on your GPU(s) -5. **Decide on solution approach** based on gap - -The whisper wrapper attempts to mitigate the issue but can't solve it within the current architecture. diff --git a/test_issue_243.py b/test_issue_243.py deleted file mode 100755 index 63457d0..0000000 --- a/test_issue_243.py +++ /dev/null @@ -1,304 +0,0 @@ -#!/usr/bin/env python3 -""" -Test script to replicate and understand issue #243: -GPU memory consumption with gunicorn workers and torch models. - -This script creates a minimal CLAMS app that simulates the problem -and provides tools to monitor VRAM usage across multiple workers. - -Usage: - # Development mode (single process, 1 model copy in VRAM) - python test_issue_243.py --mode dev - - # Production mode (multiple workers, N model copies in VRAM) - python test_issue_243.py --mode prod - - # Monitor VRAM while sending concurrent requests - python test_issue_243.py --mode prod --monitor -""" - -import argparse -import os -import sys -import time -import logging -from typing import Union - -from clams import ClamsApp, Restifier -from mmif import Mmif, View, AnnotationTypes -from clams.appmetadata import AppMetadata - -# Suppress warnings for cleaner output -import warnings -warnings.filterwarnings('ignore') - - -class DummyTorchModel: - """ - Simulates a torch model with controllable VRAM footprint. - Replace this with real torch model loading to see actual issue. - """ - def __init__(self, size_mb=100): - self.size_mb = size_mb - self.worker_id = os.getpid() - print(f"[Worker {self.worker_id}] Loading dummy model ({size_mb}MB)...", file=sys.stderr) - - # If torch is available, allocate actual VRAM - try: - import torch - if torch.cuda.is_available(): - # Allocate a tensor to consume VRAM - num_elements = (size_mb * 1024 * 1024) // 4 # 4 bytes per float32 - self.tensor = torch.randn(num_elements, device='cuda:0') - print(f"[Worker {self.worker_id}] Model loaded in VRAM: {size_mb}MB", file=sys.stderr) - else: - print(f"[Worker {self.worker_id}] CUDA not available, using CPU", file=sys.stderr) - self.tensor = None - except ImportError: - print(f"[Worker {self.worker_id}] PyTorch not available, simulating model", file=sys.stderr) - self.tensor = None - - def predict(self, data): - """Simulate inference""" - return f"Prediction from worker {self.worker_id}" - - -class TestApp(ClamsApp): - """ - Minimal CLAMS app that demonstrates the VRAM issue. - - The model is loaded in __init__, which happens BEFORE gunicorn forks workers. - This means each worker gets its own copy of the model in VRAM. 
- """ - - def __init__(self, model_size_mb=100): - super().__init__() - self.model_size_mb = model_size_mb - - # THIS IS THE KEY ISSUE: Model loaded before worker fork - self.model = DummyTorchModel(size_mb=model_size_mb) - - print(f"[Main Process {os.getpid()}] TestApp initialized", file=sys.stderr) - - def _appmetadata(self) -> AppMetadata: - metadata = AppMetadata( - identifier='test-issue-243', - name='Issue 243 Test App', - description='Test app to demonstrate GPU memory issue with gunicorn workers', - app_version='1.0.0', - mmif_version='1.0.0' - ) - metadata.add_parameter( - name='dummy_param', - type='string', - description='A dummy parameter', - default='test' - ) - return metadata - - def _annotate(self, mmif: Union[str, Mmif], **parameters) -> Mmif: - if isinstance(mmif, str): - mmif = Mmif(mmif) - - worker_id = os.getpid() - - # Simulate inference - prediction = self.model.predict("dummy data") - - # Create a new view with worker info - view = mmif.new_view() - self.sign_view(view, parameters) - - # Add annotation showing which worker processed this - view.metadata['worker_pid'] = worker_id - view.metadata['model_worker_pid'] = self.model.worker_id - view.metadata['prediction'] = prediction - - print(f"[Worker {worker_id}] Processed request with model from worker {self.model.worker_id}", - file=sys.stderr) - - return mmif - - -def print_gpu_memory(): - """Print current GPU memory usage""" - try: - import torch - if torch.cuda.is_available(): - for i in range(torch.cuda.device_count()): - allocated = torch.cuda.memory_allocated(i) / 1024**2 - reserved = torch.cuda.memory_reserved(i) / 1024**2 - print(f" GPU {i}: Allocated={allocated:.1f}MB, Reserved={reserved:.1f}MB") - else: - print(" CUDA not available") - except ImportError: - print(" PyTorch not installed") - - -def monitor_mode(): - """ - Monitor mode: Send concurrent requests and monitor VRAM usage. - Run this AFTER starting the server in production mode. 
- """ - import requests - import subprocess - import json - from concurrent.futures import ThreadPoolExecutor, as_completed - - base_url = "http://localhost:5000" - - print("\n" + "="*70) - print("MONITORING MODE: Sending concurrent requests") - print("="*70) - - # Create a minimal MMIF input - mmif_input = Mmif(validate=False) - mmif_str = mmif_input.serialize() - - # Get initial GPU state - print("\nInitial GPU Memory State:") - try: - result = subprocess.run(['nvidia-smi', '--query-gpu=index,name,memory.used,memory.total', - '--format=csv,noheader,nounits'], - capture_output=True, text=True) - for line in result.stdout.strip().split('\n'): - print(f" {line}") - except FileNotFoundError: - print(" nvidia-smi not available") - - print_gpu_memory() - - # Send concurrent requests - num_requests = 10 - print(f"\nSending {num_requests} concurrent requests...") - - def send_request(request_id): - try: - response = requests.post(base_url, data=mmif_str, - headers={'Content-Type': 'application/json'}) - result = response.json() - - # Extract worker info - worker_info = "N/A" - if 'views' in result and len(result['views']) > 0: - view = result['views'][0] - if 'metadata' in view: - worker_pid = view['metadata'].get('worker_pid', 'N/A') - model_pid = view['metadata'].get('model_worker_pid', 'N/A') - worker_info = f"Worker={worker_pid}, Model={model_pid}" - - return request_id, response.status_code, worker_info - except Exception as e: - return request_id, f"Error: {e}", "N/A" - - with ThreadPoolExecutor(max_workers=num_requests) as executor: - futures = [executor.submit(send_request, i) for i in range(num_requests)] - - results = [] - for future in as_completed(futures): - results.append(future.result()) - - # Display results - print("\nRequest Results:") - results.sort(key=lambda x: x[0]) - for req_id, status, worker_info in results: - print(f" Request {req_id}: Status={status}, {worker_info}") - - # Show unique workers that processed requests - workers_seen = set() - for _, _, worker_info in results: - if "Worker=" in worker_info: - workers_seen.add(worker_info.split(',')[0]) - - print(f"\nUnique workers that processed requests: {len(workers_seen)}") - for worker in sorted(workers_seen): - print(f" {worker}") - - # Get final GPU state - print("\nFinal GPU Memory State:") - try: - result = subprocess.run(['nvidia-smi', '--query-gpu=index,name,memory.used,memory.total', - '--format=csv,noheader,nounits'], - capture_output=True, text=True) - for line in result.stdout.strip().split('\n'): - print(f" {line}") - except FileNotFoundError: - print(" nvidia-smi not available") - - print_gpu_memory() - - print("\n" + "="*70) - print("KEY OBSERVATION:") - print(f"If you see {len(workers_seen)} unique workers, each has its own model copy in VRAM") - print("This demonstrates the issue: N workers = N × model size in VRAM") - print("="*70 + "\n") - - -def main(): - parser = argparse.ArgumentParser(description='Test Issue #243: Gunicorn, Torch, and CUDA') - parser.add_argument('--mode', choices=['dev', 'prod', 'monitor'], default='dev', - help='Run mode: dev (single process), prod (gunicorn), monitor (send test requests)') - parser.add_argument('--port', type=int, default=5000, - help='Port to run on (default: 5000)') - parser.add_argument('--model-size', type=int, default=100, - help='Size of dummy model in MB (default: 100)') - parser.add_argument('--workers', type=int, default=None, - help='Number of gunicorn workers (default: auto-calculated)') - - args = parser.parse_args() - - if args.mode == 'monitor': - 
monitor_mode() - return - - print("\n" + "="*70) - print(f"ISSUE #243 TEST - Mode: {args.mode.upper()}") - print("="*70) - - if args.mode == 'dev': - print("\nDEVELOPMENT MODE:") - print(" - Single process (Flask development server)") - print(" - One model copy in VRAM") - print(" - Good for testing, not production") - else: - import multiprocessing - num_workers = args.workers if args.workers else (multiprocessing.cpu_count() * 2) + 1 - print("\nPRODUCTION MODE (Gunicorn):") - print(f" - Multiple workers: {num_workers}") - print(f" - Each worker loads model independently") - print(f" - Expected VRAM usage: ~{num_workers * args.model_size}MB") - print(f" - This demonstrates the issue!") - - print(f"\nModel size: {args.model_size}MB") - print(f"Port: {args.port}") - - print("\nInitial GPU state:") - print_gpu_memory() - - print("\nCreating app instance...") - app = TestApp(model_size_mb=args.model_size) - - print("\nStarting HTTP server...") - print(f"Server will be available at: http://localhost:{args.port}") - - if args.mode == 'prod': - print("\nTo test concurrent requests, run in another terminal:") - print(f" python {sys.argv[0]} --mode monitor --port {args.port}") - - print("\n" + "="*70 + "\n") - - # Start the server - http_app = Restifier(app, port=args.port) - - if args.mode == 'prod': - options = {} - if args.workers: - options['workers'] = args.workers - http_app.serve_production(**options) - else: - app.logger.setLevel(logging.DEBUG) - http_app.run() - - -if __name__ == '__main__': - main() From 2d7765c762b7b3f97cd243cad494fa7e382a2b0e Mon Sep 17 00:00:00 2001 From: Claude Date: Thu, 20 Nov 2025 17:05:47 +0000 Subject: [PATCH 03/10] Add automatic memory profiling with conservative first request Updated investigation document with: - Component 5: Automatic Memory Profiling - 80% VRAM requirement for first request (conservative) - Historical measurement for subsequent requests - Hash-based filenames for race-condition-safe persistence - Atomic writes via temp file + rename - Updated request flow to show 3-level priority: 1. App override (explicit) 2. Historical measurement 3. 
Conservative 80% - Updated implementation checklist with new components - Revised open questions and conclusion --- ISSUE_243_INVESTIGATION.md | 212 +++++++++++++++++++++++++++++++++---- 1 file changed, 192 insertions(+), 20 deletions(-) diff --git a/ISSUE_243_INVESTIGATION.md b/ISSUE_243_INVESTIGATION.md index 06689e7..1f72103 100644 --- a/ISSUE_243_INVESTIGATION.md +++ b/ISSUE_243_INVESTIGATION.md @@ -405,6 +405,148 @@ curl http://localhost:5000/?includeVRAM=true } ``` +### Component 5: Automatic Memory Profiling + +**When developers don't provide model requirements**, the SDK uses a conservative approach with historical profiling: + +**Strategy:** +- **First request**: Require 80% of total VRAM to be available (very conservative) +- **Subsequent requests**: Use measured peak memory from previous runs (accurate) + +**Persistence via hash-based filenames (write-once, read-many):** + +```python +# clams/app/__init__.py - Add to ClamsApp + +import hashlib +import json +import pathlib + +class ClamsApp(ABC): + + def _get_param_hash(self, **parameters): + """Create deterministic hash of parameters for filename""" + param_str = json.dumps(parameters, sort_keys=True) + return hashlib.sha256(param_str.encode()).hexdigest()[:16] + + def _get_profile_path(self, param_hash): + """Get path for memory profile file""" + cache_dir = pathlib.Path.home() / '.cache' / 'clams' / 'memory_profiles' + return cache_dir / f"memory_{param_hash}.txt" + + def _get_model_requirements(self, **parameters): + """ + Default implementation with conservative first request + historical profiling. + Apps can override for explicit model size declarations. + + :param parameters: Runtime parameters from the request + :return: Dict with 'size_bytes', 'name', and 'source' + """ + param_hash = self._get_param_hash(**parameters) + profile_path = self._get_profile_path(param_hash) + + # Check for historical measurement + if profile_path.exists(): + try: + measured = int(profile_path.read_text().strip()) + return { + 'size_bytes': int(measured * 1.2), # 20% buffer + 'name': param_hash, + 'source': 'historical' + } + except: + pass # Corrupted file, fall through to conservative + + # First request: require 80% of total VRAM + try: + import torch + if torch.cuda.is_available(): + device = torch.cuda.current_device() + total_vram = torch.cuda.get_device_properties(device).total_memory + conservative_requirement = int(total_vram * 0.8) + + self.logger.info( + f"First request for {param_hash}: " + f"requiring 80% of VRAM ({conservative_requirement/1024**3:.2f}GB) " + f"until actual usage is measured" + ) + + return { + 'size_bytes': conservative_requirement, + 'name': param_hash, + 'source': 'conservative-first-request' + } + except: + pass + + return None + + def _record_memory_usage(self, parameters, peak_bytes): + """ + Record peak memory to file using write-once pattern. + Uses atomic write (temp file + rename) to avoid race conditions. 
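+
+        The file is only rewritten when a new measurement exceeds the stored
+        value, so concurrent writers at worst replace one valid measurement with
+        another and the highest observed peak eventually wins.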
+ + :param parameters: Request parameters (used for hash) + :param peak_bytes: Measured peak VRAM usage + """ + param_hash = self._get_param_hash(**parameters) + profile_path = self._get_profile_path(param_hash) + + try: + profile_path.parent.mkdir(parents=True, exist_ok=True) + + # Only write if file doesn't exist or new measurement is higher + should_write = True + if profile_path.exists(): + try: + existing = int(profile_path.read_text().strip()) + if peak_bytes <= existing: + should_write = False # Existing value is fine + else: + self.logger.info( + f"Updating peak memory {param_hash}: " + f"{existing/1024**3:.2f}GB → {peak_bytes/1024**3:.2f}GB" + ) + except: + pass # Corrupted, overwrite + + if should_write: + # Atomic write: write to temp, then rename + temp_path = profile_path.with_suffix('.tmp') + temp_path.write_text(str(peak_bytes)) + temp_path.rename(profile_path) # Atomic on POSIX + + self.logger.info( + f"Recorded peak memory for {param_hash}: {peak_bytes/1024**3:.2f}GB" + ) + except Exception as e: + self.logger.warning(f"Failed to record memory profile: {e}") +``` + +**File structure:** +``` +~/.cache/clams/memory_profiles/ +├── memory_3a7f2b9c.txt # {model: "large", language: "en"} → "6442450944" +├── memory_8d2c1e4f.txt # {model: "medium", language: "en"} → "3221225472" +└── memory_f1a9b3e7.txt # {model: "large", language: "es"} → "6442450944" +``` + +**Race condition safety:** + +| Scenario | Behavior | Outcome | +|----------|----------|---------| +| Two workers, same params, first request | Both write similar values | Last write wins, both valid | +| Worker reads while another writes | Atomic rename | Sees old or new file, never partial | +| Two workers update with higher values | Each reads, writes higher | Highest value persists | + +**Benefits:** +- ✅ No developer effort required +- ✅ No file locking needed (write-once pattern) +- ✅ Atomic writes via temp + rename +- ✅ Shared across workers and restarts +- ✅ Self-calibrating over time +- ✅ Conservative first request prevents OOM + --- ## How It Works @@ -414,8 +556,9 @@ curl http://localhost:5000/?includeVRAM=true 1. **Client sends POST request** with MMIF data and parameters (e.g., `model=large`) 2. **SDK calls `_get_model_requirements()`** to determine memory needs - - If app implements it: Returns `{'size_bytes': 6*1024**3, 'name': 'large'}` - - If not implemented: Returns `None`, no VRAM checking + - **If app overrides with explicit values**: Uses app-provided size (e.g., `6*1024**3`) + - **If historical measurement exists**: Uses measured peak × 1.2 buffer + - **If first request (no history)**: Requires 80% of total VRAM (conservative) 3. **SDK checks current VRAM availability** - Queries CUDA driver for real-time memory state @@ -426,7 +569,26 @@ curl http://localhost:5000/?includeVRAM=true - **Sufficient VRAM**: Proceed to `_annotate()`, app loads model - **Insufficient VRAM**: Raise `RuntimeError`, return HTTP 500 with clear message -5. **After annotation completes**: SDK calls `torch.cuda.empty_cache()` to release cached memory +5. **After annotation completes**: + - SDK records peak memory usage to profile file (for future requests) + - SDK calls `torch.cuda.empty_cache()` to release cached memory + +### Memory Requirement Resolution + +``` +Priority order for _get_model_requirements(): + +1. App override (explicit) → App knows exact model sizes +2. Historical measurement → Measured from previous run +3. 
Conservative 80% → First request, no data yet +``` + +**Example progression for new parameter combination:** + +| Request | Source | Requirement | Behavior | +|---------|--------|-------------|----------| +| 1st | conservative | 19.2GB (80% of 24GB) | Fails if <19.2GB available | +| 2nd+ | historical | 3.6GB (3GB measured × 1.2) | Accurate, efficient | ### Error Handling @@ -583,44 +745,52 @@ Verify apps without `_get_model_requirements()` still work: ## Implementation Checklist -**SDK Changes:** -- [ ] Add `_get_model_requirements()` API to `ClamsApp` +**SDK Changes - Core VRAM Management:** +- [ ] Add `_get_model_requirements()` with default implementation (80% conservative + historical) +- [ ] Add `_get_param_hash()` for deterministic parameter hashing +- [ ] Add `_get_profile_path()` for profile file location +- [ ] Add `_record_memory_usage()` with atomic write pattern - [ ] Add `_check_vram_available()` static method - [ ] Add `_get_available_vram()` static method -- [ ] Enhance `_profile_cuda_memory()` decorator with VRAM checking -- [ ] Add `get_runtime_info()` method to `ClamsApp` +- [ ] Enhance `_profile_cuda_memory()` decorator with VRAM checking and recording + +**SDK Changes - Configuration:** - [ ] Modify `number_of_workers()` for conservative GPU count +- [ ] Add `get_runtime_info()` method to `ClamsApp` - [ ] Modify `ClamsHTTPApi.get()` to support `includeVRAM` parameter **Documentation:** -- [ ] Document `_get_model_requirements()` API for app developers +- [ ] Document automatic memory profiling behavior +- [ ] Document `_get_model_requirements()` override for explicit values - [ ] Document `?includeVRAM=true` parameter for clients - [ ] Document error handling and retry best practices -- [ ] Update app development template with example implementation +- [ ] Document profile file location and cleanup **Testing:** - [ ] Unit tests for VRAM checking logic +- [ ] Unit tests for hash-based file persistence +- [ ] Tests for atomic write behavior - [ ] Integration tests with mock CUDA - [ ] Multi-process isolation verification - [ ] Backward compatibility tests -**App Updates (Optional):** -- [ ] Update whisper-wrapper with `_get_model_requirements()` +**App Updates (Optional - for explicit model sizes):** +- [ ] Update whisper-wrapper to override `_get_model_requirements()` with explicit sizes - [ ] Update other GPU-based apps as needed --- ## Open Questions -1. **Safety Margin**: Is 10% headroom appropriate, or should it be configurable? +1. **Profile File Location**: Is `~/.cache/clams/memory_profiles/` appropriate for all deployment scenarios? Consider container environments with ephemeral storage. 2. **Multi-GPU**: Should SDK support GPU selection/load balancing across devices? 3. **Health Endpoint**: Add dedicated `/health` endpoint in addition to `?includeVRAM`? -4. **Model Unloading**: Should SDK provide automatic model eviction after idle time? +4. **Profile Cleanup**: Should SDK provide mechanism to clear old/stale profile files? -5. **Worker Override**: Should apps be able to override worker count calculation? +5. **Conservative Threshold**: Is 80% appropriate for first request, or should it be configurable? --- @@ -647,9 +817,11 @@ Verify apps without `_get_model_requirements()` still work: The proposed SDK-level solution addresses the root cause of issue #243 by: 1. **Checking VRAM at runtime** - No static assumptions about availability -2. **Failing fast with clear errors** - Better than OOM crashes -3. 
**Conservative worker defaults** - Prevents overloading GPU systems -4. **Centralized implementation** - All apps benefit automatically -5. **Backward compatible** - No breaking changes - -This approach provides a robust foundation for GPU resource management in the CLAMS ecosystem while maintaining flexibility for future enhancements. +2. **Automatic memory profiling** - Conservative first request (80%), then uses measured values +3. **Zero developer effort** - Works without app changes; apps can optionally override for explicit values +4. **Race-condition safe persistence** - Hash-based files with atomic writes +5. **Failing fast with clear errors** - Better than OOM crashes +6. **Conservative worker defaults** - Prevents overloading GPU systems +7. **Centralized implementation** - All apps benefit automatically + +This approach provides a robust foundation for GPU resource management in the CLAMS ecosystem while requiring no changes from app developers. Apps that want more precise control can override `_get_model_requirements()` with explicit model sizes. From a56b9617c94a71ae0ce5a28e23770a27fdb6c05b Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Thu, 20 Nov 2025 17:51:25 -0500 Subject: [PATCH 04/10] added `gpu_mem_*` fields to app metadata --- clams/appmetadata/__init__.py | 28 ++++++++++++++++++- .../templates/app/metadata.py.template | 3 ++ 2 files changed, 30 insertions(+), 1 deletion(-) diff --git a/clams/appmetadata/__init__.py b/clams/appmetadata/__init__.py index 4a7bd3c..cd8b2ca 100644 --- a/clams/appmetadata/__init__.py +++ b/clams/appmetadata/__init__.py @@ -352,9 +352,20 @@ class AppMetadata(pydantic.BaseModel): "a package name and its version in the string value at the minimum (e.g., ``clams-python==1.2.3``)." ) more: Optional[Dict[str, str]] = pydantic.Field( - None, + None, description="(optional) A string-to-string map that can be used to store any additional metadata of the app." ) + gpu_mem_min: int = pydantic.Field( + 0, + description="(optional) Minimum GPU memory required to run the app, in megabytes (MB). " + "Set to 0 (default) if the app does not use GPU." + ) + gpu_mem_typ: int = pydantic.Field( + 0, + description="(optional) Typical GPU memory usage for default parameters, in megabytes (MB). " + "Must be equal or larger than gpu_mem_min. " + "Set to 0 (default) if the app does not use GPU." + ) model_config = { 'title': 'CLAMS AppMetadata', @@ -372,6 +383,21 @@ def assign_versions(cls, data): data.mmif_version = get_mmif_specver() return data + @pydantic.model_validator(mode='after') + @classmethod + def validate_gpu_memory(cls, data): + import warnings + if data.gpu_mem_typ > 0 and data.gpu_mem_min > 0: + if data.gpu_mem_typ < data.gpu_mem_min: + warnings.warn( + f"gpu_mem_typ ({data.gpu_mem_typ} MB) is less than " + f"gpu_mem_min ({data.gpu_mem_min} MB). 
" + f"Setting gpu_mem_typ to {data.gpu_mem_min} MB.", + UserWarning + ) + data.gpu_mem_typ = data.gpu_mem_min + return data + @pydantic.field_validator('identifier', mode='before') @classmethod def append_version(cls, val): diff --git a/clams/develop/templates/app/metadata.py.template b/clams/develop/templates/app/metadata.py.template index 8b1f8c7..8506616 100644 --- a/clams/develop/templates/app/metadata.py.template +++ b/clams/develop/templates/app/metadata.py.template @@ -39,6 +39,9 @@ def appmetadata() -> AppMetadata: # this trick can also be useful (replace ANALYZER_NAME with the pypi dist name) analyzer_version=[l.strip().rsplit('==')[-1] for l in open(pathlib.Path(__file__).parent / 'requirements.txt').readlines() if re.match(r'^ANALYZER_NAME==', l)][0], analyzer_license="", # short name for a software license + # GPU memory requirements (in MB). Set to 0 if the app does not use GPU. + # gpu_mem_min=0, # minimum GPU memory required for minimal configuration parameters + # gpu_mem_rec=0, # recommended GPU memory for default parameters, must be equal or larger than gpu_mem_min ) # and then add I/O specifications: an app must have at least one input and one output metadata.add_input(DocumentTypes.Document) From d0759aca52b1605f134280f92aa3e9ff90cb19b7 Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Thu, 20 Nov 2025 18:09:21 -0500 Subject: [PATCH 05/10] gunicorn workers are configured based on the set gpu_mem_min appmetadata --- clams/restify/__init__.py | 38 ++++++++++++++++++++++++++++++++++++-- documentation/clamsapp.md | 20 ++++++++++++++++++++ 2 files changed, 56 insertions(+), 2 deletions(-) diff --git a/clams/restify/__init__.py b/clams/restify/__init__.py index ad522b8..ec321e7 100644 --- a/clams/restify/__init__.py +++ b/clams/restify/__init__.py @@ -42,14 +42,48 @@ def run(self, **options): def serve_production(self, **options): """ Runs the CLAMS app as a flask webapp, using a production-ready web server (gunicorn, https://docs.gunicorn.org/en/stable/#). - + :param options: any additional options to pass to the web server. 
""" import gunicorn.app.base import multiprocessing + import os def number_of_workers(): - return (multiprocessing.cpu_count() * 2) + 1 # +1 to make sure at least two workers are running + # Allow override via environment variable + if 'CLAMS_WORKERS' in os.environ: + return int(os.environ['CLAMS_WORKERS']) + + cpu_workers = (multiprocessing.cpu_count() * 2) + 1 + + # Get GPU memory requirement from app metadata + try: + metadata = self.cla.metadata + gpu_mem_mb = metadata.gpu_mem_min # some apps may not have this field, or devs may forget to set it + except Exception: + gpu_mem_mb = 0 + + if gpu_mem_mb <= 0: + return cpu_workers + + # Calculate workers based on total VRAM of the first CUDA device (no other GPUs are considered for now) + try: + import torch + if torch.cuda.is_available(): + total_vram_bytes = torch.cuda.get_device_properties(0).total_memory + total_vram_mb = total_vram_bytes / (1024 * 1024) + vram_workers = max(1, int(total_vram_mb // gpu_mem_mb)) + workers = min(vram_workers, cpu_workers) + self.cla.logger.info( + f"GPU detected: {total_vram_mb:.0f} MB VRAM, " + f"app requires {gpu_mem_mb} MB, " + f"using {workers} workers (max {vram_workers} by VRAM, {cpu_workers} by CPU)" + ) + return workers + except ImportError: + pass + + return cpu_workers class ProductionApplication(gunicorn.app.base.BaseApplication): diff --git a/documentation/clamsapp.md b/documentation/clamsapp.md index 27d1a6e..5e3b2b7 100644 --- a/documentation/clamsapp.md +++ b/documentation/clamsapp.md @@ -209,6 +209,26 @@ $ python app.py * Be default, the app will be running in *debugging* mode, but you can change it to *production* mode by passing `--production` option to support larger traffic volume. * As you might have noticed, the default `CMD` in the prebuilt containers is `python app.py --production --port 5000`. +##### Environment variables for production mode + +When running in production mode, the following environment variables can be used to configure the app server: + +| Variable | Description | Default | +|----------|-------------|---------| +| `CLAMS_WORKERS` | Number of gunicorn worker processes | Auto-calculated based on CPU cores and GPU memory | + +By default, the number of workers is calculated as `(CPU cores × 2) + 1`. However, for GPU-based apps that declare their GPU memory requirements in the app metadata, the SDK will automatically limit workers based on available GPU VRAM to prevent out-of-memory errors. + +To override the automatic calculation: + +```bash +# Set a fixed number of workers +$ CLAMS_WORKERS=2 python app.py --production + +# Or in docker +$ docker run -e CLAMS_WORKERS=2 -p 5000:5000 +``` + #### `metadata.py`: Getting app metadata Running `metadata.py` will print out the app metadata in JSON format. 
From 6e71270bd06f9f0704d6f2cd13e7a3ed2df3c6d5 Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Thu, 20 Nov 2025 19:39:58 -0500 Subject: [PATCH 06/10] updated cuda profiler to check available vram, reject request if vram isn't sufficient --- clams/app/__init__.py | 297 ++++++++++++++++++++++++++++++++++---- clams/restify/__init__.py | 5 +- 2 files changed, 275 insertions(+), 27 deletions(-) diff --git a/clams/app/__init__.py b/clams/app/__init__.py index 6550327..bb50e4e 100644 --- a/clams/app/__init__.py +++ b/clams/app/__init__.py @@ -8,11 +8,17 @@ from datetime import datetime from urllib import parse as urlparser -__all__ = ['ClamsApp'] +__all__ = ['ClamsApp', 'InsufficientVRAMError'] + + +class InsufficientVRAMError(RuntimeError): + """Raised when insufficient GPU memory is available for processing.""" + pass from typing import Union, Any, Optional, Dict, List, Tuple from mmif import Mmif, Document, DocumentTypes, View +from mmif.utils.cli.describe import generate_param_hash from clams.appmetadata import AppMetadata, real_valued_primitives, python_type, map_param_kv_delimiter logging.basicConfig( @@ -47,7 +53,7 @@ class ClamsApp(ABC): 'description': 'The JSON body of the HTTP response will be re-formatted with 2-space indentation', }, { - 'name': 'runningTime', 'type': 'boolean', 'choices': None, 'default': False, 'multivalued': False, + 'name': 'runningTime', 'type': 'boolean', 'choices': None, 'default': True, 'multivalued': False, 'description': 'The running time of the app will be recorded in the view metadata', }, { @@ -166,14 +172,12 @@ def annotate(self, mmif: Union[str, dict, Mmif], **runtime_params: List[str]) -> runtime_recs['cuda'] = [] # Use cuda_profiler data if available, otherwise fallback to nvidia-smi if cuda_profiler: - for gpu_info, peak_memory_bytes in cuda_profiler.items(): - # Convert peak memory to human-readable format - peak_memory_mb = peak_memory_bytes / (1000 * 1000) - if peak_memory_mb >= 1000: - peak_memory_str = f"{peak_memory_mb / 1000:.2f} GiB" - else: - peak_memory_str = f"{peak_memory_mb:.1f} MiB" - runtime_recs['cuda'].append(f"{gpu_info}, Used {self._cuda_memory_to_str(peak_memory_bytes)}") + for gpu_info, mem_info in cuda_profiler.items(): + available_str = self._cuda_memory_to_str(mem_info['available_before']) + peak_str = self._cuda_memory_to_str(mem_info['peak']) + runtime_recs['cuda'].append( + f"{gpu_info}, {available_str} available, {peak_str} peak used" + ) elif shutil.which('nvidia-smi'): for gpu in subprocess.run(['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv,noheader'], stdout=subprocess.PIPE).stdout.decode('utf-8').strip().split('\n'): @@ -345,50 +349,291 @@ def _cuda_device_name_concat(name, mem): mem = ClamsApp._cuda_memory_to_str(mem) return f"{name}, With {mem}" + def _get_profile_path(self, param_hash: str) -> pathlib.Path: + """ + Get filesystem path for memory profile file. + + Profile files are stored in a per-app directory under user's cache. + + :param param_hash: Hash of parameters from :func:`mmif.utils.cli.describe.generate_param_hash` + :return: Path to the profile file + """ + # Sanitize app identifier for filesystem use + app_id = self.metadata.identifier.replace('/', '-').replace(':', '-') + cache_dir = pathlib.Path.home() / '.cache' / 'clams' / 'memory_profiles' / app_id + return cache_dir / f"memory_{param_hash}.txt" + + @staticmethod + def _check_vram_available(required_bytes: int, safety_margin: float = 0.1) -> bool: + """ + Check if sufficient VRAM is currently available. 
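+
+        The check passes when ``required_bytes`` plus ``safety_margin`` of total VRAM
+        is no larger than ``total - max(memory_allocated, memory_reserved)`` on the
+        current device. If torch or CUDA is unavailable, or the query fails, the
+        check fails open and returns True.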
+ + :param required_bytes: Bytes needed for model + :param safety_margin: Fraction of total VRAM to keep as headroom (default 10%) + :return: True if sufficient VRAM available + """ + try: + import torch + if not torch.cuda.is_available(): + return True # No CUDA, no constraints + + device = torch.cuda.current_device() + props = torch.cuda.get_device_properties(device) + total_vram = props.total_memory + + # Get currently used memory (max of allocated and reserved) + allocated = torch.cuda.memory_allocated(device) + reserved = torch.cuda.memory_reserved(device) + used = max(allocated, reserved) + + # Calculate available VRAM right now + available = total_vram - used + + # Apply safety margin + required_with_margin = required_bytes + (total_vram * safety_margin) + + return available >= required_with_margin + + except Exception: + # If we can't check, fail open (allow the request) + return True + + @staticmethod + def _get_available_vram() -> int: + """ + Get currently available VRAM in bytes. + + :return: Available VRAM in bytes, or 0 if unavailable + """ + try: + import torch + if not torch.cuda.is_available(): + return 0 + + device = torch.cuda.current_device() + total = torch.cuda.get_device_properties(device).total_memory + used = max(torch.cuda.memory_allocated(device), + torch.cuda.memory_reserved(device)) + return total - used + except Exception: + return 0 + + def _get_estimated_vram_usage(self, **parameters) -> Optional[Dict[str, Any]]: + """ + Get model memory requirements for VRAM checking. + + Default implementation uses conservative 80% for first request, + then historical measurements for subsequent requests. + + Apps can override this to provide explicit model sizes. + + :param parameters: Runtime parameters from the request + :return: Dict with 'size_bytes', 'name', and 'source', or None + """ + param_hash = generate_param_hash(parameters) + profile_path = self._get_profile_path(param_hash) + + # Priority 1: Historical measurement + if profile_path.exists(): + try: + measured = int(profile_path.read_text().strip()) + return { + 'size_bytes': int(measured * 1.2), # 20% safety buffer + 'name': f'params:{param_hash}', + 'source': 'historical' + } + except (ValueError, IOError) as e: + self.logger.warning(f"Failed to read profile {profile_path}: {e}") + + # Priority 2: Conservative first request (80% of total VRAM) + try: + import torch + if torch.cuda.is_available(): + device = torch.cuda.current_device() + total_vram = torch.cuda.get_device_properties(device).total_memory + conservative_requirement = int(total_vram * 0.8) + + self.logger.info( + f"First request for params:{param_hash}: " + f"requesting 80% of VRAM ({conservative_requirement/1024**3:.2f}GB) " + f"until actual usage is measured" + ) + + return { + 'size_bytes': conservative_requirement, + 'name': f'params:{param_hash}', + 'source': 'conservative-first-request' + } + except ImportError: + pass + except Exception as e: + self.logger.warning(f"Failed to get CUDA info: {e}") + + return None + + def _record_vram_usage(self, parameters: dict, peak_bytes: int) -> None: + """ + Record peak memory usage to profile file. + + Uses atomic write (temp + rename) to avoid corruption from + concurrent writes. Only updates if new value is higher. 
+ + :param parameters: Request parameters (for hash) + :param peak_bytes: Measured peak VRAM usage + """ + if peak_bytes <= 0: + return + + param_hash = generate_param_hash(parameters) + profile_path = self._get_profile_path(param_hash) + + try: + profile_path.parent.mkdir(parents=True, exist_ok=True) + + # Check if we should update + should_write = True + if profile_path.exists(): + try: + existing = int(profile_path.read_text().strip()) + if peak_bytes <= existing: + should_write = False # Existing value is sufficient + else: + self.logger.debug( + f"Updating peak memory for {param_hash}: " + f"{existing/1024**3:.2f}GB -> {peak_bytes/1024**3:.2f}GB" + ) + except (ValueError, IOError): + pass # Corrupted file, overwrite + + if should_write: + # Atomic write: write to temp, then rename + temp_path = profile_path.with_suffix('.tmp') + temp_path.write_text(str(peak_bytes)) + temp_path.rename(profile_path) # Atomic on POSIX + + self.logger.info( + f"Recorded peak memory for {param_hash}: " + f"{peak_bytes/1024**3:.2f}GB" + ) + except Exception as e: + self.logger.warning(f"Failed to record memory profile: {e}") + @staticmethod def _profile_cuda_memory(func): """ - Decorator for profiling CUDA memory usage during _annotate execution. - + Decorator for profiling CUDA memory usage and managing VRAM availability. + + This decorator: + 1. Checks VRAM requirements before execution (if conditions met) + 2. Rejects requests if insufficient VRAM + 3. Records peak memory usage after execution + 4. Calls empty_cache() for cleanup + :param func: The function to wrap (typically _annotate) :return: Decorated function that returns (result, cuda_profiler) where cuda_profiler is dict with ", " keys - and peak memory usage values + and dict values containing 'available_before' and 'peak' memory in bytes """ def wrapper(*args, **kwargs): + # Get the ClamsApp instance from the bound method + app_instance = func.__self__ + cuda_profiler = {} torch_available = False cuda_available = False device_count = 0 - + available_before = {} + try: import torch # pytype: disable=import-error torch_available = True cuda_available = torch.cuda.is_available() device_count = torch.cuda.device_count() - if cuda_available: - # Reset peak memory stats for all devices - torch.cuda.reset_peak_memory_stats('cuda') except ImportError: pass - + + # VRAM checking: only when torch available, CUDA available, and app declares GPU usage + should_check_vram = ( + torch_available and + cuda_available and + hasattr(app_instance, 'metadata') and + getattr(app_instance.metadata, 'gpu_mem_min', 0) > 0 + ) + + if should_check_vram: + requirements = app_instance._get_estimated_vram_usage(**kwargs) + + if requirements: + required_bytes = requirements['size_bytes'] + model_name = requirements.get('name', 'model') + source = requirements.get('source', 'unknown') + + # Check if sufficient VRAM available RIGHT NOW + if not ClamsApp._check_vram_available(required_bytes): + available_gb = ClamsApp._get_available_vram() / 1024**3 + required_gb = required_bytes / 1024**3 + + error_msg = ( + f"Insufficient GPU memory for {model_name}. " + f"Requested: {required_gb:.2f}GB, " + f"Available: {available_gb:.2f}GB. " + ) + if source == 'conservative-first-request': + error_msg += ( + "This is a first request with this parameter set. " + "Conservative 80% VRAM requirement applied. " + ) + error_msg += ( + "GPU may be in use by other processes. " + "Please retry later." 
+ ) + + app_instance.logger.error(error_msg) + raise InsufficientVRAMError(error_msg) + + app_instance.logger.info( + f"VRAM check passed for {model_name} ({source}): " + f"{required_bytes/1024**3:.2f}GB requested, " + f"{ClamsApp._get_available_vram()/1024**3:.2f}GB available" + ) + + # Capture available VRAM before execution and reset stats + if torch_available and cuda_available: + for device_id in range(device_count): + device_id_str = f'cuda:{device_id}' + total = torch.cuda.get_device_properties(device_id_str).total_memory + allocated = torch.cuda.memory_allocated(device_id_str) + available_before[device_id] = total - allocated + # Reset peak memory stats for all devices + torch.cuda.reset_peak_memory_stats('cuda') + try: result = func(*args, **kwargs) - + + # Record peak memory usage + total_peak = 0 if torch_available and cuda_available and device_count > 0: for device_id in range(device_count): - device_id = f'cuda:{device_id}' - peak_memory = torch.cuda.max_memory_allocated(device_id) - gpu_name = torch.cuda.get_device_name(device_id) - gpu_total_memory = torch.cuda.get_device_properties(device_id).total_memory + device_id_str = f'cuda:{device_id}' + peak_memory = torch.cuda.max_memory_allocated(device_id_str) + total_peak = max(total_peak, peak_memory) + gpu_name = torch.cuda.get_device_name(device_id_str) + gpu_total_memory = torch.cuda.get_device_properties(device_id_str).total_memory key = ClamsApp._cuda_device_name_concat(gpu_name, gpu_total_memory) - cuda_profiler[key] = peak_memory - + cuda_profiler[key] = { + 'available_before': available_before.get(device_id, 0), + 'peak': peak_memory + } + + # Record peak memory for future requests (if VRAM checking enabled) + if should_check_vram and total_peak > 0: + app_instance._record_vram_usage(kwargs, total_peak) + return result, cuda_profiler finally: if torch_available and cuda_available: torch.cuda.empty_cache() - + return wrapper @staticmethod diff --git a/clams/restify/__init__.py b/clams/restify/__init__.py index ec321e7..64e1a7f 100644 --- a/clams/restify/__init__.py +++ b/clams/restify/__init__.py @@ -3,7 +3,7 @@ from flask_restful import Resource, Api from mmif import Mmif -from clams.app import ClamsApp +from clams.app import ClamsApp, InsufficientVRAMError class Restifier(object): @@ -178,6 +178,9 @@ def post(self) -> Response: return Response(response="Invalid input data. 
See below for validation error.\n\n" + str(e), status=500, mimetype='text/plain') try: return self.json_to_response(self.cla.annotate(raw_data, **raw_params)) + except InsufficientVRAMError as e: + self.cla.logger.warning(f"Request rejected due to insufficient VRAM: {e}") + return Response(response=str(e), status=503, mimetype='text/plain') except Exception: self.cla.logger.exception("Error in annotation") return self.json_to_response(self.cla.record_error(raw_data, **raw_params).serialize(pretty=True), status=500) From 1576feac92f45c04336af02c1539cc882fe39180 Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Thu, 20 Nov 2025 20:21:56 -0500 Subject: [PATCH 07/10] updated documentation regarding gpu apps --- documentation/clamsapp.md | 12 +- documentation/gpu-apps.md | 257 ++++++++++++++++++++++++++++++++++++++ documentation/index.rst | 1 + 3 files changed, 259 insertions(+), 11 deletions(-) create mode 100644 documentation/gpu-apps.md diff --git a/documentation/clamsapp.md b/documentation/clamsapp.md index 5e3b2b7..33127cf 100644 --- a/documentation/clamsapp.md +++ b/documentation/clamsapp.md @@ -217,17 +217,7 @@ When running in production mode, the following environment variables can be used |----------|-------------|---------| | `CLAMS_WORKERS` | Number of gunicorn worker processes | Auto-calculated based on CPU cores and GPU memory | -By default, the number of workers is calculated as `(CPU cores × 2) + 1`. However, for GPU-based apps that declare their GPU memory requirements in the app metadata, the SDK will automatically limit workers based on available GPU VRAM to prevent out-of-memory errors. - -To override the automatic calculation: - -```bash -# Set a fixed number of workers -$ CLAMS_WORKERS=2 python app.py --production - -# Or in docker -$ docker run -e CLAMS_WORKERS=2 -p 5000:5000 -``` +By default, the number of workers is calculated as `(CPU cores × 2) + 1`. For GPU-based apps, see [GPU Memory Management](gpu-apps.md) for details on automatic worker scaling and VRAM management. #### `metadata.py`: Getting app metadata diff --git a/documentation/gpu-apps.md b/documentation/gpu-apps.md new file mode 100644 index 0000000..54a3d2d --- /dev/null +++ b/documentation/gpu-apps.md @@ -0,0 +1,257 @@ +## GPU Memory Management for CLAMS Apps + +This document covers GPU memory management features in the CLAMS SDK for developers building CUDA-based applications. + +### Overview + +CLAMS apps that use GPU acceleration face memory management challenges when running as HTTP servers with multiple workers. Each gunicorn worker loads models independently into GPU VRAM, which can cause out-of-memory (OOM) errors. + +> **Note**: The VRAM profiling and runtime checking features described in this document are **PyTorch-only**. Apps using TensorFlow or other frameworks will not benefit from automatic VRAM profiling, though worker calculation based on `gpu_mem_min` still works. TensorFlow-based apps should set conservative `gpu_mem_min` values and rely on manual testing to determine appropriate worker counts. + +The CLAMS SDK provides: +1. **Metadata fields** for declaring GPU memory requirements +2. **Automatic worker scaling** based on available VRAM +3. **Runtime VRAM checking** to reject requests when memory is insufficient +4. 
**Memory profiling** to optimize future requests + +### Declaring GPU Memory Requirements + +App developers should declare GPU memory requirements in the app metadata using two fields: + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `gpu_mem_min` | int | 0 | Minimum GPU memory required to run the app (MB) | +| `gpu_mem_typ` | int | 0 | Typical GPU memory usage with default parameters (MB) | + +#### Example + +```python +from clams.app import ClamsApp +from clams.appmetadata import AppMetadata + +class MyGPUApp(ClamsApp): + def _appmetadata(self): + metadata = AppMetadata( + name="My GPU App", + description="An app that uses GPU acceleration", + app_license="MIT", + identifier="my-gpu-app", + url="https://example.com/my-gpu-app", + gpu_mem_min=4000, # 4GB minimum + gpu_mem_typ=6000, # 6GB typical usage + ) + # ... add inputs/outputs/parameters + return metadata +``` + +#### Guidelines for Setting Values + +- **`gpu_mem_min`**: The absolute minimum VRAM needed to load the smallest supported model configuration. 0 (the default) means the app does not use GPU. + +- **`gpu_mem_typ`**: Expected VRAM usage with default parameters. This value is displayed to users and helps them understand resource requirements. Must be >= `gpu_mem_min`. + +If `gpu_mem_typ` is set lower than `gpu_mem_min`, the SDK will automatically correct it and issue a warning. + +### Automatic Worker Calculation + +When running in production mode (gunicorn), the SDK automatically calculates the optimal number of workers based on: + +1. CPU cores: `(cores × 2) + 1` +2. Available VRAM: `total_vram / gpu_mem_min` + +The final worker count is the minimum of these two values, ensuring workers don't exceed available GPU memory. + +#### Example Calculation + +For a system with: +- 8 CPU cores → 17 CPU-based workers +- 24GB VRAM, app requires 4GB → 6 VRAM-based workers + +Result: 6 workers (limited by VRAM) + +#### Overriding Worker Count + +Use the `CLAMS_WORKERS` environment variable to override automatic calculation: + +```bash +# Set fixed number of workers +CLAMS_WORKERS=2 python app.py --production + +# In Docker +docker run -e CLAMS_WORKERS=2 -p 5000:5000 +``` + +### Runtime VRAM Checking + +Beyond worker calculation, the SDK performs runtime VRAM checks before each annotation request. This catches cases where: +- Other processes are using GPU memory +- Previous requests haven't fully released memory +- Memory fragmentation reduces effective available space + +#### How It Works + +1. **Before annotation**: The SDK estimates required VRAM based on: + - Historical measurements from previous runs (with 20% buffer) + - Conservative estimate (80% of total VRAM) for first request + +2. **If insufficient VRAM**: The request is rejected with `InsufficientVRAMError` + +3. **After annotation**: Peak memory usage is recorded for future estimates + +#### HTTP Response + +When VRAM is insufficient, the REST API returns: +- **Status**: 503 Service Unavailable +- **Body**: Error message describing the shortage + +This allows clients to implement retry logic with backoff. + +### Memory Profiling + +The SDK automatically profiles and caches memory usage per parameter combination. + +#### Profile Storage + +Profiles are stored in: +``` +~/.cache/clams/memory_profiles//.txt +``` + +Each profile contains a single integer: the peak memory usage in bytes. 
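
Concretely, each profile lives at `~/.cache/clams/memory_profiles/<app-id>/memory_<param-hash>.txt`, where the app identifier is sanitized for filesystem use. A minimal sketch for inspecting what has been recorded so far, assuming that default cache location:

```python
# List cached peak-memory profiles and the estimate the SDK would derive
# from each one (historical peak x 1.2 buffer, as described below).
import pathlib

profile_root = pathlib.Path.home() / '.cache' / 'clams' / 'memory_profiles'
for profile in sorted(profile_root.glob('*/memory_*.txt')):
    peak_bytes = int(profile.read_text().strip())
    app_id = profile.parent.name
    param_hash = profile.stem.removeprefix('memory_')
    print(f"{app_id} [{param_hash}]: peak {peak_bytes / 1024**3:.2f} GB, "
          f"next estimate {peak_bytes * 1.2 / 1024**3:.2f} GB")
```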
+ +#### Profile Behavior + +- **First request**: Uses conservative estimate (80% of total VRAM) +- **Subsequent requests**: Uses historical measurement × 1.2 buffer +- **Updates**: Only when new peak exceeds stored value + +This approach becomes more accurate over time while maintaining safety margins. + +### Error Handling + +#### InsufficientVRAMError + +A custom exception raised when VRAM is insufficient: + +```python +from clams.app import InsufficientVRAMError + +try: + result = app.annotate(mmif_input) +except InsufficientVRAMError as e: + # Handle insufficient memory + print(f"Not enough GPU memory: {e}") +``` + +This exception inherits from `RuntimeError` for backward compatibility. + +#### Best Practices + +1. **Catch the exception** in custom code that calls `annotate()` directly +2. **Implement retry logic** when receiving HTTP 503 +3. **Monitor memory usage** using the `hwFetch` parameter + +### Monitoring with hwFetch + +Enable hardware information in responses to monitor GPU usage: + +```bash +# Via HTTP query parameter +curl -X POST "http://localhost:5000/?hwFetch=true" -d@input.mmif + +# Via CLI +python cli.py --hwFetch true input.mmif output.mmif +``` + +Response metadata will include: +```json +{ + "app-metadata": { + "hwFetch": "NVIDIA RTX 4090, 20480 MB available, 3584 MB peak used" + } +} +``` + +### Conditions for VRAM Checking + +VRAM checking is only performed when all conditions are met: + +1. PyTorch is installed (`import torch` succeeds) +2. CUDA is available (`torch.cuda.is_available()` returns True) +3. App declares GPU requirements (`gpu_mem_min > 0`) + +Apps without GPU requirements (default `gpu_mem_min=0`) skip all VRAM checks. + +> **Important**: TensorFlow-based apps will not trigger VRAM checking even with `gpu_mem_min > 0`, because the profiling relies on PyTorch's CUDA APIs. For TensorFlow apps, the `gpu_mem_min` value is still used for worker calculation, but runtime VRAM checking and memory profiling are disabled. + +### Memory Optimization Tips + +1. **Clear cache between requests**: The SDK calls `torch.cuda.empty_cache()` after annotation + +2. **Use appropriate batch sizes**: Smaller batches use less memory but may be slower + +3. **Consider model variants**: Offer parameters for different model sizes (e.g., base/large/xl) + +4. **Test on target hardware**: Memory usage varies by GPU architecture + +5. **Set accurate metadata values**: Measure actual usage rather than guessing + +### Migration Guide + +To add GPU memory management to an existing app: + +1. **Measure memory usage**: Run your app and note peak VRAM usage + +2. **Update metadata**: Add `gpu_mem_min` and `gpu_mem_typ` fields + +3. **Test worker scaling**: Run in production mode and verify worker count + +4. **Test rejection logic**: Simulate low VRAM scenarios + +5. 
**Update documentation**: Inform users of GPU requirements + +### Troubleshooting + +#### Workers not scaling correctly + +- Verify `gpu_mem_min` is set in metadata (not 0) +- Check PyTorch is installed and CUDA is available +- Use `CLAMS_WORKERS` to override if needed + +#### Requests being rejected unexpectedly + +- Check available VRAM with `nvidia-smi` +- Clear GPU memory from other processes +- Profile cache may have outdated high values (delete `~/.cache/clams/memory_profiles/`) + +#### OOM errors despite worker limits + +- `gpu_mem_min` may be set too low +- Memory fragmentation; try restarting workers +- Other processes consuming VRAM + +### API Reference + +#### AppMetadata Fields + +```python +gpu_mem_min: int = 0 # Minimum GPU memory (MB) +gpu_mem_typ: int = 0 # Typical GPU memory usage (MB) +``` + +#### Exception Classes + +```python +class InsufficientVRAMError(RuntimeError): + """Raised when insufficient GPU memory is available.""" + pass +``` + +#### Internal Methods + +These methods are used internally but documented for reference: + +- `ClamsApp._get_estimated_vram_usage(**params)` - Get estimated VRAM for parameters +- `ClamsApp._record_vram_usage(params, peak_bytes)` - Record peak usage +- `ClamsApp._check_vram_available(required_bytes)` - Check if VRAM sufficient +- `ClamsApp._get_available_vram()` - Get current available VRAM diff --git a/documentation/index.rst b/documentation/index.rst index 135da15..3f9fdd1 100644 --- a/documentation/index.rst +++ b/documentation/index.rst @@ -10,6 +10,7 @@ Welcome to CLAMS Python SDK documentation! introduction input-output runtime-params + gpu-apps appmetadata appdirectory cli From e1e0d0f9eef238eaa3cb6779e27a24b52c04d4f5 Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Thu, 20 Nov 2025 20:33:08 -0500 Subject: [PATCH 08/10] added test for gpu-related new features, cleaned up planning document --- ISSUE_243_INVESTIGATION.md | 827 ------------------------------------- tests/test_vram.py | 307 ++++++++++++++ 2 files changed, 307 insertions(+), 827 deletions(-) delete mode 100644 ISSUE_243_INVESTIGATION.md create mode 100644 tests/test_vram.py diff --git a/ISSUE_243_INVESTIGATION.md b/ISSUE_243_INVESTIGATION.md deleted file mode 100644 index 1f72103..0000000 --- a/ISSUE_243_INVESTIGATION.md +++ /dev/null @@ -1,827 +0,0 @@ -# Issue #243 Investigation: GPU Memory Management in CLAMS SDK - -**Issue**: https://github.com/clamsproject/clams-python/issues/243 -**Status**: Investigation Complete - SDK-Level Solution Proposed -**Date**: 2025-01-08 - ---- - -## Executive Summary - -When CLAMS applications using PyTorch models run in production mode (gunicorn), each worker process independently loads models into GPU VRAM. This leads to excessive memory consumption that scales with worker count, causing OOM errors. - -**Root Cause**: Gunicorn's multi-process architecture combined with VRAM being a shared, dynamic resource that cannot be allocated statically. - -**Proposed Solution**: SDK-level VRAM management through runtime checking and unified API, rather than fragmented app-level implementations. 
- ---- - -## Problem Analysis - -### Current Architecture - -**Production Configuration** (`clams/restify/__init__.py:42-78`): -- Workers: `(CPU_count × 2) + 1` -- Threads per worker: 2 -- Worker class: sync (default) - -On an 8-core machine: **17 workers** - -**Model Loading Pattern** (from templates and existing apps): -```python -class MyApp(ClamsApp): - def __init__(self): - super().__init__() - # Option A: Load in __init__ (pre-fork) - BAD - # self.model = torch.load('model.pt') - - # Option B: Load on-demand in _annotate() - BETTER but still problematic - self.models = {} - - def _annotate(self, mmif, **parameters): - if model_name not in self.models: - self.models[model_name] = torch.load('model.pt') # Each worker loads independently - # ... -``` - -### Why This Causes Issues - -**Multi-Worker Duplication**: -``` -Worker 1: Loads model on first request → 3GB VRAM -Worker 2: Loads model on first request → 3GB VRAM -... -Worker 17: Loads model on first request → 3GB VRAM - -Total: 17 × 3GB = 51GB VRAM required -``` - -**VRAM is a Shared, Dynamic Resource**: -- Other applications can allocate VRAM at any time -- Cannot assume static VRAM availability at startup -- Must check availability at runtime before loading - -**Process Isolation**: -- Each gunicorn worker is a separate OS process -- CUDA contexts are process-isolated (no shared GPU memory) -- Workers cannot share model instances in VRAM - -### Real-World Impact - -**Example: Whisper Wrapper** (`app-whisper-wrapper/app.py`) - -The app attempts mitigation by: -1. Loading models on-demand (not in `__init__()`) -2. Caching models per worker -3. "Conflict prevention" that loads duplicate models if one is in use - -**Problems**: -- Still loads one model per worker over time (51GB for 17 workers) -- Conflict prevention can load duplicates within same worker (102GB worst case) -- No awareness of VRAM availability from other processes - ---- - -## Proposed SDK-Level Solution - -### Design Principles - -1. **Centralized Management**: SDK handles VRAM checking, not individual apps -2. **Runtime Checking**: Check VRAM availability at request time, not startup -3. **Fail Fast**: Return clear errors when VRAM unavailable -4. **Backward Compatible**: Existing apps continue working without changes -5. **Opt-In Enhancement**: Apps can declare requirements for better behavior - -### Architecture Overview - -``` -Request Flow: - HTTP POST → ClamsHTTPApi.post() - → ClamsApp.annotate() - → _profile_cuda_memory() decorator - → Check VRAM requirements (NEW) - → Call _annotate() if sufficient VRAM - → torch.cuda.empty_cache() cleanup (EXISTING) -``` - -### Component 1: Model Requirements API - -**Apps declare their model memory needs:** - -```python -# clams/app/__init__.py - Add to ClamsApp base class - -class ClamsApp(ABC): - - def _get_model_requirements(self, **parameters) -> Optional[dict]: - """ - Declare model memory requirements based on runtime parameters. - - Apps override this to enable VRAM checking. 
- - :param parameters: Runtime parameters from the request - :return: Dict with 'size_bytes' and optional 'name', or None - - Example: - def _get_model_requirements(self, **parameters): - model_sizes = {'small': 2*1024**3, 'large': 6*1024**3} - model = parameters.get('model', 'small') - return {'size_bytes': model_sizes[model], 'name': model} - """ - return None # Default: no specific requirements -``` - -**App Implementation Example** (whisper-wrapper): -```python -class WhisperWrapper(ClamsApp): - - MODEL_SIZES = { - 'tiny': 500 * 1024**2, - 'base': 1024 * 1024**2, - 'small': 2 * 1024**3, - 'medium': 3 * 1024**3, - 'large': 6 * 1024**3, - 'large-v2': 6 * 1024**3, - 'large-v3': 6 * 1024**3, - 'turbo': 3 * 1024**3, - } - - def _get_model_requirements(self, **parameters): - size = parameters.get('model', 'medium') - if size in self.model_size_alias: - size = self.model_size_alias[size] - - return { - 'size_bytes': self.MODEL_SIZES.get(size, 3 * 1024**3), - 'name': size - } -``` - -### Component 2: Runtime VRAM Checking - -**Enhance existing `_profile_cuda_memory()` decorator:** - -```python -# clams/app/__init__.py:349-392 - Enhanced version - -@staticmethod -def _profile_cuda_memory(func): - """ - Decorator for profiling CUDA memory usage and managing VRAM availability. - """ - def wrapper(self, *args, **kwargs): - cuda_profiler = {} - torch_available = False - cuda_available = False - - try: - import torch - torch_available = True - cuda_available = torch.cuda.is_available() - except ImportError: - pass - - # NEW: Runtime VRAM checking before execution - if torch_available and cuda_available: - # Get model requirements from app - requirements = self._get_model_requirements(**kwargs) - - if requirements: - required_bytes = requirements['size_bytes'] - model_name = requirements.get('name', 'model') - - # Check if sufficient VRAM available RIGHT NOW - if not ClamsApp._check_vram_available(required_bytes): - available_gb = ClamsApp._get_available_vram() / 1024**3 - required_gb = required_bytes / 1024**3 - - error_msg = ( - f"Insufficient GPU memory for {model_name}. " - f"Required: {required_gb:.2f}GB, " - f"Available: {available_gb:.2f}GB. " - f"GPU may be in use by other processes. " - f"Please retry later." - ) - self.logger.error(error_msg) - raise RuntimeError(error_msg) - - self.logger.info( - f"VRAM check passed for {model_name}: " - f"{required_gb:.2f}GB required, " - f"{ClamsApp._get_available_vram() / 1024**3:.2f}GB available" - ) - - # Reset peak memory stats - torch.cuda.reset_peak_memory_stats('cuda') - - try: - result = func(self, *args, **kwargs) - - # Record peak memory usage (EXISTING) - if torch_available and cuda_available: - device_count = torch.cuda.device_count() - for device_id in range(device_count): - device_id_str = f'cuda:{device_id}' - peak_memory = torch.cuda.max_memory_allocated(device_id_str) - gpu_name = torch.cuda.get_device_name(device_id_str) - gpu_total = torch.cuda.get_device_properties(device_id_str).total_memory - key = ClamsApp._cuda_device_name_concat(gpu_name, gpu_total) - cuda_profiler[key] = peak_memory - - return result, cuda_profiler - finally: - # Cleanup (EXISTING) - if torch_available and cuda_available: - torch.cuda.empty_cache() - - return wrapper - -@staticmethod -def _check_vram_available(required_bytes, safety_margin=0.1): - """ - Check if sufficient VRAM is available at this moment. 
- - :param required_bytes: Bytes needed for model - :param safety_margin: Fraction of total VRAM to keep as headroom (default 10%) - :return: True if sufficient VRAM available - """ - try: - import torch - if not torch.cuda.is_available(): - return True # No CUDA, no constraints - - device = torch.cuda.current_device() - props = torch.cuda.get_device_properties(device) - total_vram = props.total_memory - - # Get currently allocated/reserved memory - allocated = torch.cuda.memory_allocated(device) - reserved = torch.cuda.memory_reserved(device) - used = max(allocated, reserved) - - # Calculate available VRAM RIGHT NOW - available = total_vram - used - - # Apply safety margin - required_with_margin = required_bytes + (total_vram * safety_margin) - - return available >= required_with_margin - - except Exception: - # If we can't check, fail open (allow the request) - return True - -@staticmethod -def _get_available_vram(): - """Get currently available VRAM in bytes""" - try: - import torch - if not torch.cuda.is_available(): - return 0 - - device = torch.cuda.current_device() - total = torch.cuda.get_device_properties(device).total_memory - used = max(torch.cuda.memory_allocated(device), - torch.cuda.memory_reserved(device)) - return total - used - except: - return 0 -``` - -### Component 3: Conservative Worker Count - -**Adjust default worker calculation when CUDA detected:** - -```python -# clams/restify/__init__.py:51-52 - Modified - -def number_of_workers(): - """ - Calculate workers considering GPU constraints. - Use conservative count when CUDA available since VRAM is the bottleneck. - """ - import multiprocessing - - cpu_workers = (multiprocessing.cpu_count() * 2) + 1 - - # Check if CUDA available (indicates GPU workload) - try: - import torch - if torch.cuda.is_available(): - # Use conservative worker count for GPU apps - # Runtime VRAM checking will prevent OOM - # Fewer workers = less memory overhead, more predictable behavior - gpu_conservative_workers = min(4, multiprocessing.cpu_count()) - return gpu_conservative_workers - except ImportError: - pass - - return cpu_workers -``` - -### Component 4: Runtime Status API - -**Expose VRAM status through existing metadata endpoint:** - -```python -# clams/app/__init__.py - Add method - -def get_runtime_info(self) -> dict: - """ - Get runtime information including GPU/VRAM status. - Apps can override to add custom runtime info. 
- """ - info = {} - - try: - import torch - if torch.cuda.is_available(): - devices = [] - for i in range(torch.cuda.device_count()): - props = torch.cuda.get_device_properties(i) - total = props.total_memory - used = max(torch.cuda.memory_allocated(i), - torch.cuda.memory_reserved(i)) - - devices.append({ - 'id': i, - 'name': props.name, - 'total_memory_gb': round(total / 1024**3, 2), - 'available_memory_gb': round((total - used) / 1024**3, 2), - }) - - info['gpu'] = {'available': True, 'devices': devices} - except: - info['gpu'] = {'available': False} - - return info -``` - -```python -# clams/restify/__init__.py:121-129 - Modify GET handler - -def get(self) -> Response: - """Maps HTTP GET verb to appmetadata with optional runtime info""" - raw_params = request.args.to_dict(flat=False) - - # Check for runtime info request - if 'includeVRAM' in raw_params or 'includeRuntime' in raw_params: - import json - metadata = json.loads(self.cla.appmetadata(**raw_params)) - metadata['runtime'] = self.cla.get_runtime_info() - return self.json_to_response(json.dumps(metadata)) - - return self.json_to_response(self.cla.appmetadata(**raw_params)) -``` - -**Usage:** -```bash -# Normal metadata -curl http://localhost:5000/ - -# Metadata + current VRAM status -curl http://localhost:5000/?includeVRAM=true -``` - -**Response example:** -```json -{ - "name": "Whisper Wrapper", - "version": "1.0.0", - "parameters": [...], - "runtime": { - "gpu": { - "available": true, - "devices": [ - { - "id": 0, - "name": "NVIDIA RTX 4090", - "total_memory_gb": 24.0, - "available_memory_gb": 18.5 - } - ] - } - } -} -``` - -### Component 5: Automatic Memory Profiling - -**When developers don't provide model requirements**, the SDK uses a conservative approach with historical profiling: - -**Strategy:** -- **First request**: Require 80% of total VRAM to be available (very conservative) -- **Subsequent requests**: Use measured peak memory from previous runs (accurate) - -**Persistence via hash-based filenames (write-once, read-many):** - -```python -# clams/app/__init__.py - Add to ClamsApp - -import hashlib -import json -import pathlib - -class ClamsApp(ABC): - - def _get_param_hash(self, **parameters): - """Create deterministic hash of parameters for filename""" - param_str = json.dumps(parameters, sort_keys=True) - return hashlib.sha256(param_str.encode()).hexdigest()[:16] - - def _get_profile_path(self, param_hash): - """Get path for memory profile file""" - cache_dir = pathlib.Path.home() / '.cache' / 'clams' / 'memory_profiles' - return cache_dir / f"memory_{param_hash}.txt" - - def _get_model_requirements(self, **parameters): - """ - Default implementation with conservative first request + historical profiling. - Apps can override for explicit model size declarations. 
- - :param parameters: Runtime parameters from the request - :return: Dict with 'size_bytes', 'name', and 'source' - """ - param_hash = self._get_param_hash(**parameters) - profile_path = self._get_profile_path(param_hash) - - # Check for historical measurement - if profile_path.exists(): - try: - measured = int(profile_path.read_text().strip()) - return { - 'size_bytes': int(measured * 1.2), # 20% buffer - 'name': param_hash, - 'source': 'historical' - } - except: - pass # Corrupted file, fall through to conservative - - # First request: require 80% of total VRAM - try: - import torch - if torch.cuda.is_available(): - device = torch.cuda.current_device() - total_vram = torch.cuda.get_device_properties(device).total_memory - conservative_requirement = int(total_vram * 0.8) - - self.logger.info( - f"First request for {param_hash}: " - f"requiring 80% of VRAM ({conservative_requirement/1024**3:.2f}GB) " - f"until actual usage is measured" - ) - - return { - 'size_bytes': conservative_requirement, - 'name': param_hash, - 'source': 'conservative-first-request' - } - except: - pass - - return None - - def _record_memory_usage(self, parameters, peak_bytes): - """ - Record peak memory to file using write-once pattern. - Uses atomic write (temp file + rename) to avoid race conditions. - - :param parameters: Request parameters (used for hash) - :param peak_bytes: Measured peak VRAM usage - """ - param_hash = self._get_param_hash(**parameters) - profile_path = self._get_profile_path(param_hash) - - try: - profile_path.parent.mkdir(parents=True, exist_ok=True) - - # Only write if file doesn't exist or new measurement is higher - should_write = True - if profile_path.exists(): - try: - existing = int(profile_path.read_text().strip()) - if peak_bytes <= existing: - should_write = False # Existing value is fine - else: - self.logger.info( - f"Updating peak memory {param_hash}: " - f"{existing/1024**3:.2f}GB → {peak_bytes/1024**3:.2f}GB" - ) - except: - pass # Corrupted, overwrite - - if should_write: - # Atomic write: write to temp, then rename - temp_path = profile_path.with_suffix('.tmp') - temp_path.write_text(str(peak_bytes)) - temp_path.rename(profile_path) # Atomic on POSIX - - self.logger.info( - f"Recorded peak memory for {param_hash}: {peak_bytes/1024**3:.2f}GB" - ) - except Exception as e: - self.logger.warning(f"Failed to record memory profile: {e}") -``` - -**File structure:** -``` -~/.cache/clams/memory_profiles/ -├── memory_3a7f2b9c.txt # {model: "large", language: "en"} → "6442450944" -├── memory_8d2c1e4f.txt # {model: "medium", language: "en"} → "3221225472" -└── memory_f1a9b3e7.txt # {model: "large", language: "es"} → "6442450944" -``` - -**Race condition safety:** - -| Scenario | Behavior | Outcome | -|----------|----------|---------| -| Two workers, same params, first request | Both write similar values | Last write wins, both valid | -| Worker reads while another writes | Atomic rename | Sees old or new file, never partial | -| Two workers update with higher values | Each reads, writes higher | Highest value persists | - -**Benefits:** -- ✅ No developer effort required -- ✅ No file locking needed (write-once pattern) -- ✅ Atomic writes via temp + rename -- ✅ Shared across workers and restarts -- ✅ Self-calibrating over time -- ✅ Conservative first request prevents OOM - ---- - -## How It Works - -### Request Flow - -1. **Client sends POST request** with MMIF data and parameters (e.g., `model=large`) - -2. 
**SDK calls `_get_model_requirements()`** to determine memory needs - - **If app overrides with explicit values**: Uses app-provided size (e.g., `6*1024**3`) - - **If historical measurement exists**: Uses measured peak × 1.2 buffer - - **If first request (no history)**: Requires 80% of total VRAM (conservative) - -3. **SDK checks current VRAM availability** - - Queries CUDA driver for real-time memory state - - Accounts for memory used by other processes - - Compares available vs. required (with 10% safety margin) - -4. **Decision:** - - **Sufficient VRAM**: Proceed to `_annotate()`, app loads model - - **Insufficient VRAM**: Raise `RuntimeError`, return HTTP 500 with clear message - -5. **After annotation completes**: - - SDK records peak memory usage to profile file (for future requests) - - SDK calls `torch.cuda.empty_cache()` to release cached memory - -### Memory Requirement Resolution - -``` -Priority order for _get_model_requirements(): - -1. App override (explicit) → App knows exact model sizes -2. Historical measurement → Measured from previous run -3. Conservative 80% → First request, no data yet -``` - -**Example progression for new parameter combination:** - -| Request | Source | Requirement | Behavior | -|---------|--------|-------------|----------| -| 1st | conservative | 19.2GB (80% of 24GB) | Fails if <19.2GB available | -| 2nd+ | historical | 3.6GB (3GB measured × 1.2) | Accurate, efficient | - -### Error Handling - -**When VRAM is insufficient:** - -``` -HTTP 500 Internal Server Error - -{ - "error": "Insufficient GPU memory for large. Required: 6.00GB, Available: 4.50GB. GPU may be in use by other processes. Please retry later." -} -``` - -**Client retry logic:** -```python -import requests -import time - -def transcribe_with_retry(url, data, max_retries=3): - for attempt in range(max_retries): - response = requests.post(url, data=data) - - if response.ok: - return response.json() - - if "Insufficient GPU memory" in response.text: - wait = 5 * (2 ** attempt) # Exponential backoff - print(f"GPU busy, retrying in {wait}s...") - time.sleep(wait) - continue - - raise Exception(f"Request failed: {response.status_code}") - - raise Exception("Max retries exceeded") -``` - ---- - -## Benefits - -### ✅ Centralized Solution -- All CLAMS apps benefit from VRAM management -- No need for each app to implement separately -- Consistent behavior across ecosystem - -### ✅ Handles Dynamic VRAM -- Checks availability at request time -- Accounts for other processes using GPU -- No static assumptions about available memory - -### ✅ Backward Compatible -- Existing apps continue working without changes -- Apps without `_get_model_requirements()` skip VRAM checking -- No breaking changes to API - -### ✅ Clear Error Messages -- Clients know exactly why request failed -- Can implement retry logic -- Better than cryptic CUDA OOM errors - -### ✅ Observable -- `includeVRAM` parameter exposes current GPU state -- Monitoring systems can track VRAM usage -- Helps with capacity planning - -### ✅ Process-Safe -- `torch.cuda.empty_cache()` only affects current process -- No interference with other workers or applications -- Each worker manages its own CUDA context - ---- - -## App Migration Path - -### Phase 1: SDK Update (No App Changes Required) -1. Update SDK with VRAM checking components -2. Conservative worker count for CUDA-enabled systems -3. All apps automatically get `empty_cache()` cleanup -4. 
Runtime status available via `?includeVRAM=true` - -### Phase 2: App Opt-In (Enhanced Behavior) -Apps implement `_get_model_requirements()`: - -```python -class MyApp(ClamsApp): - def _get_model_requirements(self, **parameters): - # Declare memory needs - return {'size_bytes': 3 * 1024**3, 'name': 'my-model'} -``` - -Now the app gets: -- Runtime VRAM checking before model load -- Clear error messages when insufficient memory -- Automatic fail-fast behavior - -### Phase 3: Optional Enhancements -Apps can add: -- Model size estimates in metadata -- Alternative suggestions when VRAM low -- Idle model unloading after timeout - ---- - -## Verification Plan - -### 1. VRAM Isolation Test -Verify `empty_cache()` doesn't affect other processes: - -```bash -# Terminal 1: Start whisper-wrapper -python app.py --production - -# Terminal 2: Start another GPU app (e.g., another CLAMS app) -python other_app.py --production - -# Terminal 3: Monitor GPU -watch -n 1 nvidia-smi - -# Send requests to both apps simultaneously -# Verify: Each process maintains independent VRAM, no interference -``` - -### 2. Dynamic VRAM Test -Verify runtime checking handles contention: - -```python -# Start app with available VRAM -# Load large model in separate process to consume VRAM -# Send request to app → should fail with clear error -# Unload model in separate process -# Retry request → should succeed -``` - -### 3. Multi-Worker Test -Verify conservative worker count prevents overload: - -```bash -# 8-core machine, CUDA available -# Start app → verify ≤4 workers (not 17) -# Send concurrent requests -# Monitor VRAM → verify total usage stays within limits -``` - -### 4. Backward Compatibility Test -Verify apps without `_get_model_requirements()` still work: - -```python -# Use app that doesn't implement _get_model_requirements() -# Send requests → should process normally -# VRAM checking skipped, but cleanup still happens -``` - ---- - -## Implementation Checklist - -**SDK Changes - Core VRAM Management:** -- [ ] Add `_get_model_requirements()` with default implementation (80% conservative + historical) -- [ ] Add `_get_param_hash()` for deterministic parameter hashing -- [ ] Add `_get_profile_path()` for profile file location -- [ ] Add `_record_memory_usage()` with atomic write pattern -- [ ] Add `_check_vram_available()` static method -- [ ] Add `_get_available_vram()` static method -- [ ] Enhance `_profile_cuda_memory()` decorator with VRAM checking and recording - -**SDK Changes - Configuration:** -- [ ] Modify `number_of_workers()` for conservative GPU count -- [ ] Add `get_runtime_info()` method to `ClamsApp` -- [ ] Modify `ClamsHTTPApi.get()` to support `includeVRAM` parameter - -**Documentation:** -- [ ] Document automatic memory profiling behavior -- [ ] Document `_get_model_requirements()` override for explicit values -- [ ] Document `?includeVRAM=true` parameter for clients -- [ ] Document error handling and retry best practices -- [ ] Document profile file location and cleanup - -**Testing:** -- [ ] Unit tests for VRAM checking logic -- [ ] Unit tests for hash-based file persistence -- [ ] Tests for atomic write behavior -- [ ] Integration tests with mock CUDA -- [ ] Multi-process isolation verification -- [ ] Backward compatibility tests - -**App Updates (Optional - for explicit model sizes):** -- [ ] Update whisper-wrapper to override `_get_model_requirements()` with explicit sizes -- [ ] Update other GPU-based apps as needed - ---- - -## Open Questions - -1. 
**Profile File Location**: Is `~/.cache/clams/memory_profiles/` appropriate for all deployment scenarios? Consider container environments with ephemeral storage. - -2. **Multi-GPU**: Should SDK support GPU selection/load balancing across devices? - -3. **Health Endpoint**: Add dedicated `/health` endpoint in addition to `?includeVRAM`? - -4. **Profile Cleanup**: Should SDK provide mechanism to clear old/stale profile files? - -5. **Conservative Threshold**: Is 80% appropriate for first request, or should it be configurable? - ---- - -## References - -**Related Code:** -- `clams/app/__init__.py:349-392` - CUDA profiling decorator -- `clams/restify/__init__.py:42-78` - Production server setup -- `app-whisper-wrapper/app.py` - Real-world example with attempted mitigation - -**Related Issues:** -- Issue #243: Main issue tracking this problem -- app-doctr-wrapper PR #6: Similar problem in different app - -**External Resources:** -- [PyTorch CUDA Semantics](https://pytorch.org/docs/stable/notes/cuda.html) -- [Gunicorn Settings](https://docs.gunicorn.org/en/stable/settings.html) -- [CUDA Memory Management](https://developer.nvidia.com/blog/unified-memory-cuda-beginners/) - ---- - -## Conclusion - -The proposed SDK-level solution addresses the root cause of issue #243 by: - -1. **Checking VRAM at runtime** - No static assumptions about availability -2. **Automatic memory profiling** - Conservative first request (80%), then uses measured values -3. **Zero developer effort** - Works without app changes; apps can optionally override for explicit values -4. **Race-condition safe persistence** - Hash-based files with atomic writes -5. **Failing fast with clear errors** - Better than OOM crashes -6. **Conservative worker defaults** - Prevents overloading GPU systems -7. **Centralized implementation** - All apps benefit automatically - -This approach provides a robust foundation for GPU resource management in the CLAMS ecosystem while requiring no changes from app developers. Apps that want more precise control can override `_get_model_requirements()` with explicit model sizes. 
diff --git a/tests/test_vram.py b/tests/test_vram.py new file mode 100644 index 0000000..c93d367 --- /dev/null +++ b/tests/test_vram.py @@ -0,0 +1,307 @@ +import pathlib +import shutil +import tempfile +import unittest +import warnings +from typing import Union +from unittest.mock import patch, MagicMock + +import pytest +from mmif import Mmif, DocumentTypes, AnnotationTypes + +import clams.app +import clams.restify +from clams.app import ClamsApp, InsufficientVRAMError +from clams.appmetadata import AppMetadata + +# Skip entire file if nvidia-smi not available +pytestmark = pytest.mark.skipif( + shutil.which('nvidia-smi') is None, + reason="nvidia-smi not available - no CUDA device" +) + + +class GPUExampleClamsApp(clams.app.ClamsApp): + """Example app with GPU memory requirements declared.""" + + def _appmetadata(self) -> Union[dict, AppMetadata]: + metadata = AppMetadata( + name="GPU Example App", + description="Test app with GPU memory requirements", + app_license="MIT", + identifier="gpu-example-app", + url="https://example.com/gpu-app", + gpu_mem_min=2000, # 2GB minimum + gpu_mem_typ=4000, # 4GB typical + ) + metadata.add_input(DocumentTypes.VideoDocument) + metadata.add_output(AnnotationTypes.TimeFrame) + return metadata + + def _annotate(self, mmif, **kwargs): + if not isinstance(mmif, Mmif): + mmif = Mmif(mmif, validate=False) + new_view = mmif.new_view() + self.sign_view(new_view, kwargs) + new_view.new_contain(AnnotationTypes.TimeFrame) + return mmif + + +class NonGPUExampleClamsApp(clams.app.ClamsApp): + """Example app without GPU memory requirements (gpu_mem_min=0).""" + + def _appmetadata(self) -> Union[dict, AppMetadata]: + metadata = AppMetadata( + name="Non-GPU Example App", + description="Test app without GPU requirements", + app_license="MIT", + identifier="non-gpu-example-app", + url="https://example.com/non-gpu-app", + ) + metadata.add_input(DocumentTypes.TextDocument) + metadata.add_output(AnnotationTypes.TimeFrame) + return metadata + + def _annotate(self, mmif, **kwargs): + if not isinstance(mmif, Mmif): + mmif = Mmif(mmif, validate=False) + new_view = mmif.new_view() + self.sign_view(new_view, kwargs) + new_view.new_contain(AnnotationTypes.TimeFrame) + return mmif + + +class TestVRAMManagement(unittest.TestCase): + + def setUp(self): + self.gpu_app = GPUExampleClamsApp() + self.non_gpu_app = NonGPUExampleClamsApp() + + # ===== A. 
Pure Logic Tests ===== + + def test_profile_path_structure(self): + """Profile path includes sanitized app identifier.""" + param_hash = "abc123def456" + path = self.gpu_app._get_profile_path(param_hash) + + self.assertIn('.cache', str(path)) + self.assertIn('clams', str(path)) + self.assertIn('memory_profiles', str(path)) + self.assertIn(param_hash, str(path)) + self.assertTrue(str(path).endswith('.txt')) + + def test_profile_path_sanitization(self): + """URLs with / and : are properly sanitized in path.""" + param_hash = "test123" + path = self.gpu_app._get_profile_path(param_hash) + + # App identifier has slashes and colons that should be replaced + path_str = str(path) + # After sanitization, no / or : should be in the app_id part + app_id_part = path.parent.name + self.assertNotIn('/', app_id_part) + self.assertNotIn(':', app_id_part) + + def test_insufficient_vram_error(self): + """InsufficientVRAMError can be raised and caught.""" + with self.assertRaises(InsufficientVRAMError): + raise InsufficientVRAMError("Test error message") + + # Also inherits from RuntimeError + with self.assertRaises(RuntimeError): + raise InsufficientVRAMError("Test error message") + + def test_http_503_on_vram_error(self): + """RestAPI returns 503 for InsufficientVRAMError.""" + app = clams.restify.Restifier(GPUExampleClamsApp()).test_client() + + # Mock the annotate method to raise InsufficientVRAMError + with patch.object(GPUExampleClamsApp, 'annotate', + side_effect=InsufficientVRAMError("Not enough VRAM")): + mmif = Mmif(validate=False) + from mmif import Document + doc = Document({'@type': DocumentTypes.VideoDocument, + 'properties': {'id': 'v1', 'location': '/test.mp4'}}) + mmif.add_document(doc) + + res = app.post('/', data=mmif.serialize()) + self.assertEqual(res.status_code, 503) + self.assertIn('Not enough VRAM', res.get_data(as_text=True)) + + # ===== B. 
Mocked CUDA Tests ===== + + def test_check_vram_available_sufficient(self): + """Returns True when sufficient VRAM available.""" + mock_props = MagicMock() + mock_props.total_memory = 24 * 1024**3 # 24GB + + with patch('torch.cuda.is_available', return_value=True), \ + patch('torch.cuda.current_device', return_value=0), \ + patch('torch.cuda.get_device_properties', return_value=mock_props), \ + patch('torch.cuda.memory_allocated', return_value=1 * 1024**3), \ + patch('torch.cuda.memory_reserved', return_value=1 * 1024**3): + + # 24GB - 1GB = 23GB available, requesting 6GB + result = ClamsApp._check_vram_available(6 * 1024**3) + self.assertTrue(result) + + def test_check_vram_available_insufficient(self): + """Returns False when insufficient VRAM available.""" + mock_props = MagicMock() + mock_props.total_memory = 8 * 1024**3 # 8GB + + with patch('torch.cuda.is_available', return_value=True), \ + patch('torch.cuda.current_device', return_value=0), \ + patch('torch.cuda.get_device_properties', return_value=mock_props), \ + patch('torch.cuda.memory_allocated', return_value=6 * 1024**3), \ + patch('torch.cuda.memory_reserved', return_value=6 * 1024**3): + + # 8GB - 6GB = 2GB available, requesting 6GB (+ 10% margin) + result = ClamsApp._check_vram_available(6 * 1024**3) + self.assertFalse(result) + + def test_get_available_vram(self): + """Returns correct available VRAM calculation.""" + mock_props = MagicMock() + mock_props.total_memory = 16 * 1024**3 # 16GB + + with patch('torch.cuda.is_available', return_value=True), \ + patch('torch.cuda.current_device', return_value=0), \ + patch('torch.cuda.get_device_properties', return_value=mock_props), \ + patch('torch.cuda.memory_allocated', return_value=4 * 1024**3), \ + patch('torch.cuda.memory_reserved', return_value=5 * 1024**3): + + # Should use max(allocated, reserved) = 5GB + # Available = 16GB - 5GB = 11GB + result = ClamsApp._get_available_vram() + self.assertEqual(result, 11 * 1024**3) + + def test_get_estimated_vram_first_request(self): + """Uses conservative 80% when no historical profile exists.""" + with tempfile.TemporaryDirectory() as tmpdir: + with patch.object(self.gpu_app, '_get_profile_path') as mock_path: + # Profile doesn't exist + profile_file = pathlib.Path(tmpdir) / 'memory_abc123.txt' + mock_path.return_value = profile_file + + mock_props = MagicMock() + mock_props.total_memory = 24 * 1024**3 # 24GB + + with patch('torch.cuda.is_available', return_value=True), \ + patch('torch.cuda.current_device', return_value=0), \ + patch('torch.cuda.get_device_properties', return_value=mock_props): + + result = self.gpu_app._get_estimated_vram_usage(model='large') + + self.assertIsNotNone(result) + self.assertEqual(result['source'], 'conservative-first-request') + # Should be 80% of 24GB + expected = int(24 * 1024**3 * 0.8) + self.assertEqual(result['size_bytes'], expected) + + def test_get_estimated_vram_historical(self): + """Uses historical measurement × 1.2 when profile exists.""" + with tempfile.TemporaryDirectory() as tmpdir: + with patch.object(self.gpu_app, '_get_profile_path') as mock_path: + # Create profile with historical value + profile_file = pathlib.Path(tmpdir) / 'memory_abc123.txt' + profile_file.parent.mkdir(parents=True, exist_ok=True) + historical_peak = 3 * 1024**3 # 3GB + profile_file.write_text(str(historical_peak)) + mock_path.return_value = profile_file + + result = self.gpu_app._get_estimated_vram_usage(model='large') + + self.assertIsNotNone(result) + self.assertEqual(result['source'], 'historical') + # Should 
be historical × 1.2 + expected = int(historical_peak * 1.2) + self.assertEqual(result['size_bytes'], expected) + + def test_record_vram_usage_creates_file(self): + """Profile file is created with peak value.""" + with tempfile.TemporaryDirectory() as tmpdir: + with patch.object(self.gpu_app, '_get_profile_path') as mock_path: + profile_file = pathlib.Path(tmpdir) / 'subdir' / 'memory_abc123.txt' + mock_path.return_value = profile_file + + peak_bytes = 3 * 1024**3 + self.gpu_app._record_vram_usage({'model': 'large'}, peak_bytes) + + self.assertTrue(profile_file.exists()) + self.assertEqual(int(profile_file.read_text()), peak_bytes) + + def test_record_vram_usage_updates_higher(self): + """Only updates profile if new peak is higher.""" + with tempfile.TemporaryDirectory() as tmpdir: + with patch.object(self.gpu_app, '_get_profile_path') as mock_path: + profile_file = pathlib.Path(tmpdir) / 'memory_abc123.txt' + profile_file.parent.mkdir(parents=True, exist_ok=True) + + # Initial value + initial_peak = 5 * 1024**3 + profile_file.write_text(str(initial_peak)) + mock_path.return_value = profile_file + + # Try to record lower value - should not update + self.gpu_app._record_vram_usage({'model': 'large'}, 3 * 1024**3) + self.assertEqual(int(profile_file.read_text()), initial_peak) + + # Record higher value - should update + higher_peak = 7 * 1024**3 + self.gpu_app._record_vram_usage({'model': 'large'}, higher_peak) + self.assertEqual(int(profile_file.read_text()), higher_peak) + + def test_vram_check_skipped_when_no_gpu_mem_min(self): + """VRAM checking is skipped when gpu_mem_min=0.""" + # non_gpu_app has gpu_mem_min=0, so should skip VRAM checking + self.assertEqual(self.non_gpu_app.metadata.gpu_mem_min, 0) + + # _get_estimated_vram_usage should still work but won't be called + # during annotation because the condition check will fail + + # ===== C. AppMetadata Tests ===== + + def test_gpu_mem_fields_default_zero(self): + """GPU memory fields default to 0.""" + metadata = AppMetadata( + name="Test App", + description="Test", + app_license="MIT", + identifier="test-app", + url="https://example.com", + ) + metadata.add_input(DocumentTypes.TextDocument) + metadata.add_output(AnnotationTypes.TimeFrame) + + self.assertEqual(metadata.gpu_mem_min, 0) + self.assertEqual(metadata.gpu_mem_typ, 0) + + def test_gpu_mem_typ_validation(self): + """Warning issued when gpu_mem_typ < gpu_mem_min, auto-corrected.""" + with warnings.catch_warnings(record=True) as w: + warnings.simplefilter("always") + + metadata = AppMetadata( + name="Test App", + description="Test", + app_license="MIT", + identifier="test-app", + url="https://example.com", + gpu_mem_min=4000, # 4GB min + gpu_mem_typ=2000, # 2GB typical (less than min!) 
+ ) + metadata.add_input(DocumentTypes.TextDocument) + metadata.add_output(AnnotationTypes.TimeFrame) + + # Should have issued a warning + self.assertEqual(len(w), 1) + self.assertIn('gpu_mem_typ', str(w[0].message)) + self.assertIn('gpu_mem_min', str(w[0].message)) + + # Should have auto-corrected + self.assertEqual(metadata.gpu_mem_typ, metadata.gpu_mem_min) + + +if __name__ == '__main__': + unittest.main() From 328c4c45ba6403aa4e92b676d16deea79f20385a Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Thu, 20 Nov 2025 22:19:07 -0500 Subject: [PATCH 09/10] disabled type checker for conditional torch imports, updated mmif-python version that provides pamameter hashing --- clams/app/__init__.py | 10 +++++----- clams/restify/__init__.py | 2 +- requirements.txt | 2 +- tests/test_clamsapp.py | 4 ++-- 4 files changed, 9 insertions(+), 9 deletions(-) diff --git a/clams/app/__init__.py b/clams/app/__init__.py index bb50e4e..1406ce4 100644 --- a/clams/app/__init__.py +++ b/clams/app/__init__.py @@ -18,7 +18,7 @@ class InsufficientVRAMError(RuntimeError): from typing import Union, Any, Optional, Dict, List, Tuple from mmif import Mmif, Document, DocumentTypes, View -from mmif.utils.cli.describe import generate_param_hash +from mmif.utils.cli.describe import generate_param_hash # pytype: disable=import-error from clams.appmetadata import AppMetadata, real_valued_primitives, python_type, map_param_kv_delimiter logging.basicConfig( @@ -373,7 +373,7 @@ def _check_vram_available(required_bytes: int, safety_margin: float = 0.1) -> bo :return: True if sufficient VRAM available """ try: - import torch + import torch # pytype: disable=import-error if not torch.cuda.is_available(): return True # No CUDA, no constraints @@ -406,7 +406,7 @@ def _get_available_vram() -> int: :return: Available VRAM in bytes, or 0 if unavailable """ try: - import torch + import torch # pytype: disable=import-error if not torch.cuda.is_available(): return 0 @@ -447,7 +447,7 @@ def _get_estimated_vram_usage(self, **parameters) -> Optional[Dict[str, Any]]: # Priority 2: Conservative first request (80% of total VRAM) try: - import torch + import torch # pytype: disable=import-error if torch.cuda.is_available(): device = torch.cuda.current_device() total_vram = torch.cuda.get_device_properties(device).total_memory @@ -536,7 +536,7 @@ def _profile_cuda_memory(func): """ def wrapper(*args, **kwargs): # Get the ClamsApp instance from the bound method - app_instance = func.__self__ + app_instance = getattr(func, '__self__', None) cuda_profiler = {} torch_available = False diff --git a/clams/restify/__init__.py b/clams/restify/__init__.py index 64e1a7f..6aab26d 100644 --- a/clams/restify/__init__.py +++ b/clams/restify/__init__.py @@ -68,7 +68,7 @@ def number_of_workers(): # Calculate workers based on total VRAM of the first CUDA device (no other GPUs are considered for now) try: - import torch + import torch # pytype: disable=import-error if torch.cuda.is_available(): total_vram_bytes = torch.cuda.get_device_properties(0).total_memory total_vram_mb = total_vram_bytes / (1024 * 1024) diff --git a/requirements.txt b/requirements.txt index 8a44892..12d786c 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,4 +1,4 @@ -mmif-python==1.1.2 +mmif-python==1.2.0 Flask>=2 Flask-RESTful>=0.3.9 diff --git a/tests/test_clamsapp.py b/tests/test_clamsapp.py index 8318f1b..cc74bc5 100644 --- a/tests/test_clamsapp.py +++ b/tests/test_clamsapp.py @@ -297,13 +297,13 @@ def test_annotate_returns_invalid_mmif(self): def test_open_document_location(self): 
mmif = ExampleInputMMIF.get_rawmmif() - with self.app.open_document_location(mmif.documents['t1']) as f: + with self.app.open_document_location(mmif['t1']) as f: self.assertEqual(f.read(), ExampleInputMMIF.EXAMPLE_TEXT) def test_open_document_location_custom_opener(self): from PIL import Image mmif = ExampleInputMMIF.get_rawmmif() - with self.app.open_document_location(mmif.documents['i1'], Image.open) as f: + with self.app.open_document_location(mmif['i1'], Image.open) as f: self.assertEqual(f.size, (200, 71)) def test_refine_parameters(self): From c7b12bdd7eb17cbe88c0fc395550de693dafcb58 Mon Sep 17 00:00:00 2001 From: Keigh Rim Date: Fri, 21 Nov 2025 14:49:59 -0500 Subject: [PATCH 10/10] changed vram usage estimation logic, more documentation on env vars --- clams/app/__init__.py | 58 ++++++++++++--------- clams/restify/__init__.py | 49 +++++++++++++----- documentation/clamsapp.md | 1 + documentation/gpu-apps.md | 104 +++++++++++++++++++++++++++++++++----- 4 files changed, 160 insertions(+), 52 deletions(-) diff --git a/clams/app/__init__.py b/clams/app/__init__.py index 1406ce4..b10af76 100644 --- a/clams/app/__init__.py +++ b/clams/app/__init__.py @@ -22,7 +22,7 @@ class InsufficientVRAMError(RuntimeError): from clams.appmetadata import AppMetadata, real_valued_primitives, python_type, map_param_kv_delimiter logging.basicConfig( - level=logging.WARNING, + level=getattr(logging, os.environ.get('CLAMS_LOGLEVEL', 'WARNING').upper(), logging.WARNING), format="%(asctime)s %(name)s %(levelname)-8s %(thread)d %(message)s", datefmt="%Y-%m-%d %H:%M:%S") @@ -360,37 +360,27 @@ def _get_profile_path(self, param_hash: str) -> pathlib.Path: """ # Sanitize app identifier for filesystem use app_id = self.metadata.identifier.replace('/', '-').replace(':', '-') - cache_dir = pathlib.Path.home() / '.cache' / 'clams' / 'memory_profiles' / app_id + cache_base = pathlib.Path(os.environ.get('XDG_CACHE_HOME', pathlib.Path.home() / '.cache')) + cache_dir = cache_base / 'clams' / 'memory_profiles' / app_id return cache_dir / f"memory_{param_hash}.txt" @staticmethod def _check_vram_available(required_bytes: int, safety_margin: float = 0.1) -> bool: """ - Check if sufficient VRAM is currently available. + Check if sufficient VRAM is currently available (GPU-wide). 
:param required_bytes: Bytes needed for model - :param safety_margin: Fraction of total VRAM to keep as headroom (default 10%) + :param safety_margin: Additional safety buffer as fraction of required (default 10%) :return: True if sufficient VRAM available """ try: - import torch # pytype: disable=import-error - if not torch.cuda.is_available(): - return True # No CUDA, no constraints - - device = torch.cuda.current_device() - props = torch.cuda.get_device_properties(device) - total_vram = props.total_memory - - # Get currently used memory (max of allocated and reserved) - allocated = torch.cuda.memory_allocated(device) - reserved = torch.cuda.memory_reserved(device) - used = max(allocated, reserved) - - # Calculate available VRAM right now - available = total_vram - used + available = ClamsApp._get_available_vram() + if available == 0: + # Can't determine available VRAM, fail open + return True - # Apply safety margin - required_with_margin = required_bytes + (total_vram * safety_margin) + # Apply safety margin to required bytes + required_with_margin = required_bytes * (1 + safety_margin) return available >= required_with_margin @@ -401,10 +391,28 @@ def _check_vram_available(required_bytes: int, safety_margin: float = 0.1) -> bo @staticmethod def _get_available_vram() -> int: """ - Get currently available VRAM in bytes. + Get currently available VRAM in bytes (GPU-wide, across all processes). + + Uses nvidia-smi to get actual available memory, not just current process. :return: Available VRAM in bytes, or 0 if unavailable """ + try: + import subprocess + import shutil + if shutil.which('nvidia-smi'): + # Get free memory from nvidia-smi (reports GPU-wide, not per-process) + result = subprocess.run( + ['nvidia-smi', '--query-gpu=memory.free', '--format=csv,noheader,nounits', '-i', '0'], + capture_output=True, text=True, timeout=5 + ) + if result.returncode == 0 and result.stdout.strip(): + free_mb = float(result.stdout.strip()) + return int(free_mb * 1024 * 1024) # Convert MB to bytes + except Exception: + pass + + # Fallback to torch (only sees current process memory) try: import torch # pytype: disable=import-error if not torch.cuda.is_available(): @@ -572,11 +580,13 @@ def wrapper(*args, **kwargs): if not ClamsApp._check_vram_available(required_bytes): available_gb = ClamsApp._get_available_vram() / 1024**3 required_gb = required_bytes / 1024**3 + required_with_buffer_gb = required_gb * 1.1 # 10% safety margin error_msg = ( f"Insufficient GPU memory for {model_name}. " - f"Requested: {required_gb:.2f}GB, " - f"Available: {available_gb:.2f}GB. " + f"Tried to allocate {required_with_buffer_gb:.2f}GB " + f"(estimated {required_gb:.2f}GB + 10% buffer), " + f"available {available_gb:.2f}GB. 
" ) if source == 'conservative-first-request': error_msg += ( diff --git a/clams/restify/__init__.py b/clams/restify/__init__.py index 6aab26d..76c5696 100644 --- a/clams/restify/__init__.py +++ b/clams/restify/__init__.py @@ -57,9 +57,10 @@ def number_of_workers(): cpu_workers = (multiprocessing.cpu_count() * 2) + 1 # Get GPU memory requirement from app metadata + # Use gpu_mem_typ (typical usage) for worker calculation try: metadata = self.cla.metadata - gpu_mem_mb = metadata.gpu_mem_min # some apps may not have this field, or devs may forget to set it + gpu_mem_mb = metadata.gpu_mem_typ # typical usage determines how many workers fit except Exception: gpu_mem_mb = 0 @@ -67,20 +68,26 @@ def number_of_workers(): return cpu_workers # Calculate workers based on total VRAM of the first CUDA device (no other GPUs are considered for now) + # Use nvidia-smi instead of torch to avoid initializing CUDA in parent process before fork try: - import torch # pytype: disable=import-error - if torch.cuda.is_available(): - total_vram_bytes = torch.cuda.get_device_properties(0).total_memory - total_vram_mb = total_vram_bytes / (1024 * 1024) - vram_workers = max(1, int(total_vram_mb // gpu_mem_mb)) - workers = min(vram_workers, cpu_workers) - self.cla.logger.info( - f"GPU detected: {total_vram_mb:.0f} MB VRAM, " - f"app requires {gpu_mem_mb} MB, " - f"using {workers} workers (max {vram_workers} by VRAM, {cpu_workers} by CPU)" + import subprocess + import shutil + if shutil.which('nvidia-smi'): + result = subprocess.run( + ['nvidia-smi', '--query-gpu=memory.total', '--format=csv,noheader,nounits', '-i', '0'], + capture_output=True, text=True, timeout=5 ) - return workers - except ImportError: + if result.returncode == 0 and result.stdout.strip(): + total_vram_mb = float(result.stdout.strip()) + vram_workers = max(1, int(total_vram_mb // gpu_mem_mb)) + workers = min(vram_workers, cpu_workers) + self.cla.logger.info( + f"GPU detected: {total_vram_mb:.0f} MB VRAM, " + f"app requires {gpu_mem_mb} MB, " + f"using {workers} workers (max {vram_workers} by VRAM, {cpu_workers} by CPU)" + ) + return workers + except Exception: pass return cpu_workers @@ -92,9 +99,16 @@ def __init__(self, app, host, port, **options): 'bind': f'{host}:{port}', 'workers': number_of_workers(), 'threads': 2, + # disable timeout for long-running GPU workloads (default 30s is too short) + 'timeout': 0, # because the default is 'None' 'accesslog': '-', # errorlog, however, is redirected to stderr by default since 19.2, so no need to set + # log level is warning by default + 'loglevel': os.environ.get('CLAMS_LOGLEVEL', 'warning').lower(), + # default to 1 to free GPU memory after each request + # developers can override via serve_production(max_requests=N) for single-model apps + 'max_requests': 1, } self.options.update(options) self.application = app @@ -109,6 +123,13 @@ def load_config(self): def load(self): return self.application + # Log max_requests setting + max_req = options.get('max_requests', 1) # default is 1, meaning workers are killed after each request + if max_req == 0: + self.cla.logger.info("Worker recycling: disabled (workers persist)") + else: + self.cla.logger.info(f"Worker recycling: after {max_req} request(s)") + ProductionApplication(self.flask_app, self.host, self.port, **options).run() def serve_development(self, **options): @@ -180,7 +201,7 @@ def post(self) -> Response: return self.json_to_response(self.cla.annotate(raw_data, **raw_params)) except InsufficientVRAMError as e: self.cla.logger.warning(f"Request rejected 
due to insufficient VRAM: {e}") - return Response(response=str(e), status=503, mimetype='text/plain') + return self.json_to_response(self.cla.record_error(raw_data, **raw_params).serialize(pretty=True), status=503) except Exception: self.cla.logger.exception("Error in annotation") return self.json_to_response(self.cla.record_error(raw_data, **raw_params).serialize(pretty=True), status=500) diff --git a/documentation/clamsapp.md b/documentation/clamsapp.md index 33127cf..0f2443a 100644 --- a/documentation/clamsapp.md +++ b/documentation/clamsapp.md @@ -216,6 +216,7 @@ When running in production mode, the following environment variables can be used | Variable | Description | Default | |----------|-------------|---------| | `CLAMS_WORKERS` | Number of gunicorn worker processes | Auto-calculated based on CPU cores and GPU memory | +| `CLAMS_LOGLEVEL` | Logging verbosity level (`debug`, `info`, `warning`, `error`) | `warning` | By default, the number of workers is calculated as `(CPU cores × 2) + 1`. For GPU-based apps, see [GPU Memory Management](gpu-apps.md) for details on automatic worker scaling and VRAM management. diff --git a/documentation/gpu-apps.md b/documentation/gpu-apps.md index 54a3d2d..d9dcc82 100644 --- a/documentation/gpu-apps.md +++ b/documentation/gpu-apps.md @@ -4,10 +4,17 @@ This document covers GPU memory management features in the CLAMS SDK for develop ### Overview -CLAMS apps that use GPU acceleration face memory management challenges when running as HTTP servers with multiple workers. Each gunicorn worker loads models independently into GPU VRAM, which can cause out-of-memory (OOM) errors. +CLAMS apps that use GPU acceleration face memory management challenges when running as HTTP servers with multiple workers. +Each gunicorn worker loads models independently into GPU VRAM, which can cause out-of-memory (OOM) errors. -> **Note**: The VRAM profiling and runtime checking features described in this document are **PyTorch-only**. Apps using TensorFlow or other frameworks will not benefit from automatic VRAM profiling, though worker calculation based on `gpu_mem_min` still works. TensorFlow-based apps should set conservative `gpu_mem_min` values and rely on manual testing to determine appropriate worker counts. +:::{note} +The memory profiling features (peak usage tracking) require **PyTorch** since they use `torch.cuda` APIs. +Worker calculation and VRAM availability checking use `nvidia-smi` and work with any framework, but the system requires PyTorch to be installed. +TensorFlow-based apps should set conservative (high) VRAM usage values in app metadata since profiling won't track TensorFlow allocations. +All the VRAM-related log messages are set to `info` level. +::: + The CLAMS SDK provides: 1. **Metadata fields** for declaring GPU memory requirements 2. **Automatic worker scaling** based on available VRAM @@ -44,11 +51,11 @@ class MyGPUApp(ClamsApp): return metadata ``` -#### Guidelines for Setting Values +#### General Guidelines for Setting Values - **`gpu_mem_min`**: The absolute minimum VRAM needed to load the smallest supported model configuration. 0 (the default) means the app does not use GPU. -- **`gpu_mem_typ`**: Expected VRAM usage with default parameters. This value is displayed to users and helps them understand resource requirements. Must be >= `gpu_mem_min`. +- **`gpu_mem_typ`**: Expected VRAM usage with default parameters. This value is used for automatic worker calculation and displayed to users to help them understand resource requirements. 
Must be >= `gpu_mem_min`. If `gpu_mem_typ` is set lower than `gpu_mem_min`, the SDK will automatically correct it and issue a warning. @@ -57,17 +64,33 @@ If `gpu_mem_typ` is set lower than `gpu_mem_min`, the SDK will automatically cor When running in production mode (gunicorn), the SDK automatically calculates the optimal number of workers based on: 1. CPU cores: `(cores × 2) + 1` -2. Available VRAM: `total_vram / gpu_mem_min` +2. Available VRAM: `total_vram / gpu_mem_typ` -The final worker count is the minimum of these two values, ensuring workers don't exceed available GPU memory. +The final worker count is the minimum of these two values, ensuring workers don't exceed available GPU memory. Using `gpu_mem_typ` (typical usage) rather than `gpu_mem_min` provides more realistic worker counts for typical workloads. -#### Example Calculation +##### Example Calculation For a system with: - 8 CPU cores → 17 CPU-based workers -- 24GB VRAM, app requires 4GB → 6 VRAM-based workers +- 24GB VRAM, app typically uses 6GB (`gpu_mem_typ=6000`) → 4 VRAM-based workers -Result: 6 workers (limited by VRAM) +Result: 4 workers (limited by VRAM) + +#### Worker Recycling + +By default, workers are recycled after each request (`max_requests=1`). This ensures GPU memory is fully released between requests, which is important for: +- Apps that load different models based on parameters +- Preventing memory fragmentation over time +- Ensuring accurate VRAM availability checks + +For apps with a single persistent model, developers can disable recycling for better performance: + +```python +# In app.py +if __name__ == '__main__': + restifier = Restifier(MyApp()) + restifier.serve_production(max_requests=0) # Workers persist indefinitely +``` #### Overriding Worker Count @@ -81,6 +104,10 @@ CLAMS_WORKERS=2 python app.py --production docker run -e CLAMS_WORKERS=2 -p 5000:5000 ``` +```bash +CLAMS_LOGLEVEL=info python app.py --production +``` + ### Runtime VRAM Checking Beyond worker calculation, the SDK performs runtime VRAM checks before each annotation request. This catches cases where: @@ -114,9 +141,11 @@ The SDK automatically profiles and caches memory usage per parameter combination Profiles are stored in: ``` -~/.cache/clams/memory_profiles//.txt +$XDG_CACHE_HOME/clams/memory_profiles//.txt ``` +If `XDG_CACHE_HOME` is not set, defaults to `~/.cache`. In containers based on `clams-python-*` base images, this is typically `/cache/clams/memory_profiles/`. + Each profile contains a single integer: the peak memory usage in bytes. #### Profile Behavior @@ -182,7 +211,54 @@ VRAM checking is only performed when all conditions are met: Apps without GPU requirements (default `gpu_mem_min=0`) skip all VRAM checks. -> **Important**: TensorFlow-based apps will not trigger VRAM checking even with `gpu_mem_min > 0`, because the profiling relies on PyTorch's CUDA APIs. For TensorFlow apps, the `gpu_mem_min` value is still used for worker calculation, but runtime VRAM checking and memory profiling are disabled. +:::{important} +The VRAM checking system requires PyTorch to be installed. TensorFlow-based apps with PyTorch installed will get worker calculation and VRAM availability checking (via `nvidia-smi`), but memory profiling will only track PyTorch allocations, not TensorFlow allocations. For accurate profiling, TensorFlow apps should set conservative `gpu_mem_typ` values based on manual measurements. 
+::: + +### Model Loading Strategy + +#### Single Model + +Load the model in `__init__` so it's ready when requests arrive: + +```python +class MyGPUApp(ClamsApp): + def __init__(self): + super().__init__() + self.model = load_model() # Load once per worker + + def _annotate(self, mmif, **params): + result = self.model.predict(...) # Model already loaded + return mmif +``` + +Each gunicorn worker calls `__init__` independently, so each worker gets its own model copy. Worker count is limited by `gpu_mem_typ` to prevent OOM. +In this case, it's generally recommended to use a `max_requests` value that's larger than 1 to save model loading time. + +#### Multiple Model Variants + +For apps supporting different model sizes (e.g., tiny/base/large), use lazy loading with caching: + +```python +class WhisperApp(ClamsApp): + def __init__(self): + super().__init__() + self.models = {} # Cache for loaded models + + def _annotate(self, mmif, modelSize='base', **params): + if modelSize not in self.models: + self.models[modelSize] = whisper.load_model(modelSize) + + model = self.models[modelSize] + # use model... + return mmif +``` + +**Considerations for multiple models:** +- Set `gpu_mem_min` for the smallest supported model (absolute minimum to run) +- Set `gpu_mem_typ` for the largest commonly-used model (this determines worker count) +- Historical profiles are keyed by parameter hash, so different model sizes get separate profiles +- Multiple models may accumulate in memory within a single worker (consider enabling worker recycling with `max_requests=1`) ### Memory Optimization Tips @@ -214,7 +290,7 @@ To add GPU memory management to an existing app: #### Workers not scaling correctly -- Verify `gpu_mem_min` is set in metadata (not 0) +- Verify `gpu_mem_typ` is set in metadata (not 0) - this determines worker count - Check PyTorch is installed and CUDA is available - Use `CLAMS_WORKERS` to override if needed @@ -222,11 +298,11 @@ To add GPU memory management to an existing app: - Check available VRAM with `nvidia-smi` - Clear GPU memory from other processes -- Profile cache may have outdated high values (delete `~/.cache/clams/memory_profiles/`) +- Profile cache may have outdated high values (delete `~/.cache/clams/memory_profiles/` or `$XDG_CACHE_HOME/clams/memory_profiles/`) #### OOM errors despite worker limits -- `gpu_mem_min` may be set too low +- `gpu_mem_typ` may be set too low, allowing too many workers - Memory fragmentation; try restarting workers - Other processes consuming VRAM
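+
+#### Inspecting cached profiles
+
+When debugging rejected requests, it can help to see what the profiler has actually recorded. The snippet below is a minimal, illustrative sketch (not part of the SDK) that only assumes the profile layout described above: one `memory_<param-hash>.txt` file per parameter combination, containing a single integer of peak bytes.
+
+```python
+# Illustrative helper, not part of the SDK: list cached VRAM profiles.
+import os
+import pathlib
+
+# Same cache root the SDK uses: $XDG_CACHE_HOME, falling back to ~/.cache
+cache_base = pathlib.Path(os.environ.get('XDG_CACHE_HOME', pathlib.Path.home() / '.cache'))
+profiles_dir = cache_base / 'clams' / 'memory_profiles'
+
+if profiles_dir.is_dir():
+    for profile in sorted(profiles_dir.rglob('memory_*.txt')):
+        peak_bytes = int(profile.read_text().strip())
+        # <app-id>/memory_<param-hash>.txt -> app id, parameter hash, peak usage in GB
+        print(f"{profile.parent.name}  {profile.stem}  {peak_bytes / 1024**3:.2f} GB")
+else:
+    print(f"No profiles found under {profiles_dir}")
+```
+
+If a recorded peak looks implausibly high (for example, left over from a run with a larger model), deleting that file makes the SDK fall back to the conservative first-request estimate the next time that parameter combination is requested.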