Skip to content

c0mpiler/hpc-runtime-prediction

Repository files navigation

HPC Run-Time Predictor Microservices

A production-ready microservices architecture for HPC job runtime prediction, featuring state-of-the-art machine learning models trained on the NREL Eagle dataset.

πŸš€ Quick Start

Standard Setup

Get everything running in one command:

# Clone the repository
git clone <repository-url>
cd rt-predictor/microservices

# Run the automated setup
./quickstart.sh

M2 Max Optimized Setup (Apple Silicon)

For M2 Max with 64GB RAM:

# Optimized setup for Apple Silicon
make fresh-start-m2max

This will automatically:

  • βœ… Check prerequisites (Docker, Docker Compose)
  • βœ… Pull training data (Git LFS)
  • βœ… Build all Docker images
  • βœ… Train ML models (~5-10 minutes)
  • βœ… Start all services

Architecture Overview

The system consists of three main microservices:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚
β”‚  Training       │────▢│  API Service    │◀────│  UI Service     β”‚
β”‚  Service        β”‚     β”‚  (gRPC)         β”‚     β”‚  (Streamlit)    β”‚
β”‚                 β”‚     β”‚                 β”‚     β”‚                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β”‚                       β”‚                       β”‚
         β–Ό                       β–Ό                       β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  Models  β”‚            β”‚ Metrics  β”‚            β”‚  Users   β”‚
    β”‚  Volume  β”‚            β”‚  (Prom)  β”‚            β”‚          β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Services

1. RT Predictor Training Service

  • Trains ensemble ML models (XGBoost, LightGBM, CatBoost)
  • Processes 11M+ Eagle HPC job records
  • Generates optimized feature engineering pipeline
  • Outputs model artifacts to shared volume

2. RT Predictor API Service

  • High-performance gRPC service
  • Serves predictions with <10ms latency
  • Handles single, batch, and streaming requests
  • Exposes Prometheus metrics
  • Health check endpoint
  • Response caching and circuit breaker patterns

3. RT Predictor UI Service

  • Modern Streamlit web interface
  • Single and batch prediction capabilities
  • Real-time analytics dashboard
  • CSV upload for batch processing
  • Model performance visualization

πŸ“‹ Prerequisites

  • Docker and Docker Compose
  • 16GB+ RAM recommended (64GB for M2 Max optimization)
  • 10GB+ disk space
  • Git and Git LFS (for training data)

πŸ› οΈ Setup Instructions

1. Environment Setup

Set the $DEV environment variable to your development directory:

# Temporary (current session)
export DEV="/path/to/your/development/directory"

# Or permanent (add to ~/.bashrc or ~/.zshrc)
echo 'export DEV="/path/to/your/development/directory"' >> ~/.bashrc
source ~/.bashrc

2. Clone and Navigate

cd $DEV/rt-predictor/microservices

3. Prepare Training Data

# Option A: Use provided data (requires Git LFS)
git lfs pull  # Download data files
./scripts/copy_data.sh  # Copy to training directory

# Option B: Generate synthetic data
python rt-predictor-training/scripts/generate_synthetic_data.py

4. Train Models (First Time)

# Standard training
docker-compose --profile training up rt-predictor-training

# Or M2 Max optimized training (2-3x faster)
make train-m2max

5. Start All Services

# Start API, UI, and monitoring
docker-compose up -d

# Or with M2 Max optimization
make start-m2max

6. Access Services

πŸ“Έ Screenshots

RT Predictor UI

Single Prediction Page

Single Prediction Make single job runtime predictions with an intuitive interface

Batch Prediction Page

Batch Prediction Upload CSV files for bulk runtime predictions

Analytics Dashboard

Analytics Dashboard Real-time system health monitoring and performance metrics

πŸ“Š Data Information

Dataset Overview

The RT Predictor is trained on the NREL Eagle HPC System Dataset:

  • Size: 11M+ job records
  • Features: 18 columns including resource requests, runtimes, and metadata
  • Format: Parquet files for efficient storage

Data Schema

Column Description Type
job_id Unique job identifier int
processors_req Number of processors requested int
nodes_req Number of nodes requested int
mem_req Memory requested in MB float
wallclock_req Requested walltime in seconds float
partition Compute partition string
qos Quality of Service level string
run_time Actual runtime in seconds float
... Additional features ...

πŸš€ Performance & Optimization

Standard Configuration

  • Training Time: ~10-15 minutes on 11M records
  • Prediction Latency: <10ms (p95)
  • Throughput: 10K+ predictions/second
  • Model Accuracy: MAE ~1.6 hours

M2 Max Optimized (Apple Silicon)

  • Training Time: ~5-8 minutes (2-3x faster)
  • CPU Usage: 10 cores (83% utilization)
  • Memory: Up to 48GB (75% of 64GB)
  • Improved accuracy: Deeper trees and more iterations

M2 Max Key Optimizations:

  1. CPU Utilization: Uses 10 cores, leaving 2 for system
  2. Memory Allocation: 48GB for training, 8GB for API, 4GB for UI
  3. Model Parameters: Increased tree depth and iterations
  4. Data Processing: 5x larger chunk sizes (500k records)

πŸ”§ Common Commands

# Fresh start (clean + setup + build + train + start)
make fresh-start

# Individual operations
make setup      # Initial setup
make build      # Build Docker images
make train      # Train models
make start      # Start services
make stop       # Stop services
make restart    # Restart services
make logs       # View logs
make status     # Check service status

# M2 Max optimized versions
make fresh-start-m2max
make train-m2max
make start-m2max

# Cleanup
make clean      # Stop and remove containers
make clean-all  # Deep clean including networks and volumes

πŸ” Troubleshooting

Common Issues

1. Models not found

# Check shared volume
docker volume inspect microservices_shared-models

# Retrain models
make train

2. Connection refused

# Check service health
docker-compose ps
docker-compose logs rt-predictor-api

# Restart services
make restart

3. High latency

  • Check resource limits
  • Enable caching in API service
  • Scale API replicas

4. Git LFS issues

# Check LFS status
git lfs status

# Re-download LFS files
git lfs fetch --all
git lfs checkout

5. Memory issues

# Use sample for development
python rt-predictor-training/src/train.py --sample-size 100000

# Or increase Docker memory in Docker Desktop settings

πŸ“ˆ Recent Updates

Version 1.2.1 (2025-05-24)

  • βœ… Fixed UI navigation issue (Streamlit auto-detection)
  • βœ… All UI pages now working correctly
  • βœ… Successful end-to-end predictions verified

Version 1.2.0 (2025-05-24)

  • βœ… M2 Max optimization support
  • βœ… Batch prediction page implementation
  • βœ… Analytics dashboard with live metrics
  • βœ… Enhanced error handling and caching

Version 1.1.0 (2025-05-23)

  • βœ… Fixed all proto message mismatches
  • βœ… Implemented health check endpoint
  • βœ… Added circuit breaker and retry logic
  • βœ… Complete UI implementation

See CHANGELOG.md for complete version history.

πŸ—οΈ Development

Local Development

  1. Training Service:
cd rt-predictor-training
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python src/train.py
  1. API Service:
cd rt-predictor-api
./scripts/generate_proto.sh
python src/service/server.py
  1. UI Service:
cd rt-predictor-ui
streamlit run src/app.py

Testing

# Run setup tests
./scripts/test_setup.sh

# Run service tests (when implemented)
docker-compose run rt-predictor-api pytest
docker-compose run rt-predictor-ui pytest

🚒 Production Deployment

Kubernetes

# Apply manifests
kubectl apply -f k8s/

Scaling

  • API: Horizontal scaling with load balancer
  • UI: Multiple replicas behind reverse proxy
  • Training: Scheduled jobs with resource limits

πŸ”’ Security

  • Input validation on all endpoints
  • gRPC with TLS support (configurable)
  • API authentication ready
  • No PII in training data
  • Secure configuration management

πŸ“ License

See LICENSE file in root directory.

🀝 Contributing

  1. Follow microservice boundaries
  2. Add tests for new features
  3. Update documentation
  4. Use conventional commits

πŸ“š Additional Resources

About

Exploring HPC runtime prediction. Demonstrates how to leverage job submission parameters and historical data to build runtime forecasts. Training a Model, and serving it as a MicroService using gRPC - Demo implementation from data engineering to deployment.

Topics

Resources

Stars

Watchers

Forks

Contributors