A production-ready microservices architecture for HPC job runtime prediction, featuring state-of-the-art machine learning models trained on the NREL Eagle dataset.
Get everything running in one command:

    # Clone the repository
    git clone <repository-url>
    cd rt-predictor/microservices

    # Run the automated setup
    ./quickstart.sh

For M2 Max with 64GB RAM:

    # Optimized setup for Apple Silicon
    make fresh-start-m2max

This will automatically:
- ✅ Check prerequisites (Docker, Docker Compose)
- ✅ Pull training data (Git LFS)
- ✅ Build all Docker images
- ✅ Train ML models (~5-10 minutes)
- ✅ Start all services
The system consists of three main microservices:
    ┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
    │                 │      │                 │      │                 │
    │    Training     │─────▶│   API Service   │◀─────│   UI Service    │
    │    Service      │      │     (gRPC)      │      │   (Streamlit)   │
    │                 │      │                 │      │                 │
    └─────────────────┘      └─────────────────┘      └─────────────────┘
             │                        │                        │
             ▼                        ▼                        ▼
       ┌──────────┐            ┌──────────┐            ┌──────────┐
       │  Models  │            │ Metrics  │            │  Users   │
       │  Volume  │            │  (Prom)  │            │          │
       └──────────┘            └──────────┘            └──────────┘
- Trains ensemble ML models (XGBoost, LightGBM, CatBoost)
- Processes 11M+ Eagle HPC job records
- Generates optimized feature engineering pipeline
- Outputs model artifacts to shared volume
- High-performance gRPC service
- Serves predictions with <10ms latency
- Handles single, batch, and streaming requests
- Exposes Prometheus metrics
- Health check endpoint
- Response caching and circuit breaker patterns
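The circuit-breaker behaviour listed above can be sketched roughly as follows. The README does not show the API service's actual implementation, so the class name, thresholds, and error types here are illustrative only: after a run of consecutive failures the breaker "opens" and rejects calls until a cooldown elapses, then allows a single trial call through.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after `max_failures`
    consecutive errors, rejects calls until `reset_timeout` seconds
    pass, then lets one trial call through (half-open state)."""

    def __init__(self, max_failures=5, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit again
        return result
```

Wrapping each gRPC call in `call()` keeps a failing model backend from being hammered while it recovers.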
- Modern Streamlit web interface
- Single and batch prediction capabilities
- Real-time analytics dashboard
- CSV upload for batch processing
- Model performance visualization
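The CSV batch upload can be sketched as a small parsing/validation step before the rows are sent to the API. The column names below follow the dataset schema documented later in this README; the UI's real parser and any extra columns it accepts are not shown here, so treat this as illustrative:

```python
import csv
import io

# Columns assumed from the dataset schema; the real UI may accept more.
REQUIRED = ["processors_req", "nodes_req", "mem_req",
            "wallclock_req", "partition", "qos"]

def parse_batch_csv(text):
    """Turn an uploaded CSV into a list of prediction-request dicts,
    failing fast if any required column is missing."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [
        {
            "processors_req": int(row["processors_req"]),
            "nodes_req": int(row["nodes_req"]),
            "mem_req": float(row["mem_req"]),
            "wallclock_req": float(row["wallclock_req"]),
            "partition": row["partition"],
            "qos": row["qos"],
        }
        for row in reader
    ]
```

Validating before calling the API gives users an immediate, row-level error instead of a failed batch.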
- Docker and Docker Compose
- 16GB+ RAM recommended (64GB for M2 Max optimization)
- 10GB+ disk space
- Git and Git LFS (for training data)
Set the $DEV environment variable to your development directory:
    # Temporary (current session)
    export DEV="/path/to/your/development/directory"

    # Or permanent (add to ~/.bashrc or ~/.zshrc)
    echo 'export DEV="/path/to/your/development/directory"' >> ~/.bashrc
    source ~/.bashrc

    cd $DEV/rt-predictor/microservices

    # Option A: Use provided data (requires Git LFS)
    git lfs pull              # Download data files
    ./scripts/copy_data.sh    # Copy to training directory

    # Option B: Generate synthetic data
    python rt-predictor-training/scripts/generate_synthetic_data.py

    # Standard training
    docker-compose --profile training up rt-predictor-training

    # Or M2 Max optimized training (2-3x faster)
    make train-m2max

    # Start API, UI, and monitoring
    docker-compose up -d

    # Or with M2 Max optimization
    make start-m2max

- UI: http://localhost:8501
- API: localhost:50051 (gRPC)
- Metrics: http://localhost:8181/metrics
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- Make single job runtime predictions with an intuitive interface
- Upload CSV files for bulk runtime predictions
- Real-time system health monitoring and performance metrics
The RT Predictor is trained on the NREL Eagle HPC System Dataset:
- Size: 11M+ job records
- Features: 18 columns including resource requests, runtimes, and metadata
- Format: Parquet files for efficient storage
| Column | Description | Type |
|---|---|---|
| job_id | Unique job identifier | int |
| processors_req | Number of processors requested | int |
| nodes_req | Number of nodes requested | int |
| mem_req | Memory requested in MB | float |
| wallclock_req | Requested walltime in seconds | float |
| partition | Compute partition | string |
| qos | Quality of Service level | string |
| run_time | Actual runtime in seconds | float |
| ... | Additional features | ... |
- Training Time: ~10-15 minutes on 11M records
- Prediction Latency: <10ms (p95)
- Throughput: 10K+ predictions/second
- Model Accuracy: MAE ~1.6 hours
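To make the MAE figure concrete, here is the metric computed on made-up runtimes (seconds); the quoted ~1.6 hours comes from evaluating the trained ensemble on held-out Eagle jobs, not from this toy data:

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted runtimes."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [3600, 7200, 1800, 14400]      # true runtimes (seconds)
predicted = [4000, 5000, 2000, 16000]   # hypothetical model outputs
mae_seconds = mean_absolute_error(actual, predicted)
print(f"MAE: {mae_seconds / 3600:.2f} hours")  # MAE: 0.31 hours
```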
- Training Time: ~5-8 minutes (2-3x faster)
- CPU Usage: 10 cores (83% utilization)
- Memory: Up to 48GB (75% of 64GB)
- Improved accuracy: Deeper trees and more iterations
- CPU Utilization: Uses 10 cores, leaving 2 for system
- Memory Allocation: 48GB for training, 8GB for API, 4GB for UI
- Model Parameters: Increased tree depth and iterations
- Data Processing: 5x larger chunk sizes (500k records)
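The chunked data processing mentioned above can be sketched as a simple generator; the actual training pipeline's implementation is not shown in this README, but the idea is that feature engineering never holds all 11M records in memory at once:

```python
def iter_chunks(records, chunk_size=500_000):
    """Yield fixed-size slices of the input so downstream feature
    engineering processes one chunk at a time. 500k matches the
    M2 Max profile above; the standard profile uses a smaller size."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]
```

A usage sketch: `for chunk in iter_chunks(jobs): engineer_features(chunk)`, where `engineer_features` stands in for the pipeline's per-chunk work.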
    # Fresh start (clean + setup + build + train + start)
    make fresh-start

    # Individual operations
    make setup      # Initial setup
    make build      # Build Docker images
    make train      # Train models
    make start      # Start services
    make stop       # Stop services
    make restart    # Restart services
    make logs       # View logs
    make status     # Check service status

    # M2 Max optimized versions
    make fresh-start-m2max
    make train-m2max
    make start-m2max

    # Cleanup
    make clean      # Stop and remove containers
    make clean-all  # Deep clean including networks and volumes

    # Check shared volume
    docker volume inspect microservices_shared-models

    # Retrain models
    make train

    # Check service health
    docker-compose ps
    docker-compose logs rt-predictor-api

    # Restart services
    make restart

- Check resource limits
- Enable caching in API service
- Scale API replicas
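One way to realize the "enable caching" suggestion above is a small time-based response cache in front of the model, so identical prediction requests within a short window are served from memory. The API service's real caching layer is not documented here, so the class and parameters below are purely illustrative:

```python
import time

class TTLCache:
    """Tiny TTL cache sketch: repeated lookups for the same key within
    `ttl` seconds return the stored value instead of recomputing."""

    def __init__(self, ttl=60.0):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh cache hit
        value = compute()          # miss or expired: recompute
        self._store[key] = (now + self.ttl, value)
        return value
```

Keying on the request's feature tuple lets repeated identical jobs skip the model entirely.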
    # Check LFS status
    git lfs status

    # Re-download LFS files
    git lfs fetch --all
    git lfs checkout

    # Use sample for development
    python rt-predictor-training/src/train.py --sample-size 100000

    # Or increase Docker memory in Docker Desktop settings

- ✅ Fixed UI navigation issue (Streamlit auto-detection)
- ✅ All UI pages now working correctly
- ✅ Successful end-to-end predictions verified
- ✅ M2 Max optimization support
- ✅ Batch prediction page implementation
- ✅ Analytics dashboard with live metrics
- ✅ Enhanced error handling and caching
- ✅ Fixed all proto message mismatches
- ✅ Implemented health check endpoint
- ✅ Added circuit breaker and retry logic
- ✅ Complete UI implementation
See CHANGELOG.md for complete version history.
- Training Service:

      cd rt-predictor-training
      python -m venv venv
      source venv/bin/activate
      pip install -r requirements.txt
      python src/train.py

- API Service:

      cd rt-predictor-api
      ./scripts/generate_proto.sh
      python src/service/server.py

- UI Service:

      cd rt-predictor-ui
      streamlit run src/app.py

    # Run setup tests
    ./scripts/test_setup.sh

    # Run service tests (when implemented)
    docker-compose run rt-predictor-api pytest
    docker-compose run rt-predictor-ui pytest

    # Apply manifests
    kubectl apply -f k8s/

- API: Horizontal scaling with load balancer
- UI: Multiple replicas behind reverse proxy
- Training: Scheduled jobs with resource limits
- Input validation on all endpoints
- gRPC with TLS support (configurable)
- API authentication ready
- No PII in training data
- Secure configuration management
See LICENSE file in root directory.
- Follow microservice boundaries
- Add tests for new features
- Update documentation
- Use conventional commits