A lightweight, containerized job-execution platform that runs Linux workloads, exposes metrics, logs executions, and is designed to scale from a laptop to the cloud.
Think of it as "Slurm-lite for containers": a minimal distributed compute platform that demonstrates core systems engineering principles.
This platform provides:
- Job Execution: Execute arbitrary Linux commands in isolated containers
- Job Orchestration: A coordinator service dispatches jobs to runner instances
- Observability: Structured logging and Prometheus-style metrics
- Scalability: Designed to scale from a single machine to Kubernetes clusters
Architecture overview:

```
┌────────────────────────────────────────────────────────────────┐
│ Client Layer: curl/CLI · HTTP Client · API Client              │
└─────────────────────────────┬──────────────────────────────────┘
                              │
                              │ HTTP/REST
                              │ Port 8000
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ Coordinator Service (FastAPI / Python 3.11+)                   │
│                                                                │
│ API Endpoints                                                  │
│  • POST /jobs       - Submit new job                           │
│  • GET  /jobs/{id}  - Query job status                         │
│  • GET  /metrics    - Prometheus metrics                       │
│  • GET  /health     - Health check                             │
│                                                                │
│ Job Management Layer                                           │
│  • Job queue (in-memory dict)                                  │
│  • Job state machine: pending → running → completed/failed     │
│  • Async job dispatch (asyncio.create_task)                    │
│  • Job tracking with UUIDs                                     │
│                                                                │
│ Metrics & Observability                                        │
│  • jobs_total, jobs_completed_total, jobs_failed_total         │
│  • job_runtime_seconds (histogram)                             │
│  • Structured logging (stdout)                                 │
└─────────────────────────────┬──────────────────────────────────┘
                              │
                              │ HTTP POST /execute
                              │ (httpx.AsyncClient)
                              │ Port 8080
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ Runner Service (FastAPI / Python 3.11+)                        │
│                                                                │
│ API Endpoints                                                  │
│  • POST /execute - Execute Linux command                       │
│  • GET  /metrics - Execution metrics                           │
│  • GET  /health  - Health check                                │
│                                                                │
│ Execution Engine                                               │
│  • subprocess.Popen (shell=True)                               │
│  • Timeout enforcement (subprocess.communicate)                │
│  • Process isolation (container boundaries)                    │
│  • stdout/stderr capture                                       │
│  • Exit code tracking                                          │
│                                                                │
│ Metrics & Observability                                        │
│  • executions_total, executions_success_total,                 │
│    executions_failed_total                                     │
│  • Structured logging (stdout)                                 │
└─────────────────────────────┬──────────────────────────────────┘
                              │
                              │ subprocess execution
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ Linux Process Layer                                            │
│                                                                │
│ Isolated Process Execution                                     │
│  • Command execution (shell=True)                              │
│  • Resource limits (Docker cgroups)                            │
│  • Timeout handling                                            │
│  • Output capture (stdout/stderr)                              │
└────────────────────────────────────────────────────────────────┘
```
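The Execution Engine shown above can be sketched with the standard library alone. This is a simplified sketch; the actual runner wraps this logic in a FastAPI /execute endpoint, and the `execute` helper name is illustrative:

```python
import subprocess
import time


def execute(command: str, timeout: int = 30) -> dict:
    """Run a shell command, enforce a timeout, and capture output."""
    start = time.monotonic()
    proc = subprocess.Popen(
        command,
        shell=True,                 # jobs are arbitrary shell strings
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()                 # terminate the runaway process
        stdout, stderr = proc.communicate()
        exit_code = -1              # sentinel for "killed by timeout"
    runtime_ms = int((time.monotonic() - start) * 1000)
    return {"exit_code": exit_code, "stdout": stdout,
            "stderr": stderr, "runtime_ms": runtime_ms}
```

Isolation here comes from the container boundary, not the subprocess itself; the Popen/communicate pair only provides timeout enforcement and output capture.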
Job lifecycle (request flow):

```
┌──────────┐
│  Client  │
└────┬─────┘
     │
     │ 1. POST /jobs {"command": "uname -a", "timeout": 30}
     ▼
┌────────────────────────────────────────────────────────────────┐
│ Coordinator                                                    │
│  • Generate UUID (job_id)                                      │
│  • Store job in memory: {job_id, status: "pending", ...}       │
│  • Increment jobs_total metric                                 │
│  • Return JobResponse {job_id, status: "pending"}              │
│                                                                │
│  • Async dispatch: asyncio.create_task(execute_job(...))       │
│      └─> Update status: "pending" → "running"                  │
└────┬───────────────────────────────────────────────────────────┘
     │
     │ 2. POST http://runner:8080/execute
     │    {"command": "uname -a", "timeout": 30}
     ▼
┌────────────────────────────────────────────────────────────────┐
│ Runner                                                         │
│  • Validate request                                            │
│  • Execute: subprocess.Popen(command, shell=True)              │
│  • Monitor with timeout (subprocess.communicate)               │
│  • Capture stdout/stderr                                       │
│  • Track exit_code                                             │
│  • Update metrics (executions_total, etc.)                     │
│  • Return ExecuteResponse {exit_code, stdout, stderr,          │
│    runtime_ms}                                                 │
└────┬───────────────────────────────────────────────────────────┘
     │
     │ 3. HTTP Response {exit_code: 0, stdout: "...", ...}
     ▼
┌────────────────────────────────────────────────────────────────┐
│ Coordinator                                                    │
│  • Update job status: "running" → "completed"/"failed"         │
│  • Store results: {exit_code, stdout, stderr, runtime_ms}      │
│  • Update metrics:                                             │
│     - jobs_completed_total or jobs_failed_total                │
│     - job_runtimes.append(runtime_ms)                          │
│  • Log completion                                              │
└────┬───────────────────────────────────────────────────────────┘
     │
     │ 4. GET /jobs/{job_id}
     ▼
┌──────────┐
│  Client  │  ← Returns full job status with results
└──────────┘
```
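The coordinator side of this flow rests on a small in-memory job store and state machine. A minimal sketch, with field names taken from the flow above (not necessarily the service's exact schema):

```python
import uuid
from datetime import datetime, timezone

# In-memory job store: job_id -> job record. Lost on restart by design.
jobs: dict[str, dict] = {}


def submit_job(command: str, timeout: int = 30) -> dict:
    """Create a job record in the 'pending' state and return it."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {
        "job_id": job_id,
        "command": command,
        "timeout": timeout,
        "status": "pending",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return jobs[job_id]


def complete_job(job_id: str, exit_code: int, stdout: str,
                 stderr: str, runtime_ms: int) -> None:
    """Transition running -> completed/failed and store the result."""
    job = jobs[job_id]
    job["status"] = "completed" if exit_code == 0 else "failed"
    job.update(exit_code=exit_code, stdout=stdout,
               stderr=stderr, runtime_ms=runtime_ms)
```

The GET /jobs/{id} endpoint then reduces to a dictionary lookup, which is why the current design needs no database.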
The Coordinator is responsible for:
- API Gateway: Exposes REST API for job submission and status queries
- Job Orchestration: Manages the job lifecycle (pending → running → completed/failed)
- Service Discovery: Discovers runner instances via the RUNNER_URL environment variable
- State Management: Maintains the in-memory job store
- Metrics Aggregation: Collects and exposes Prometheus-style metrics
- Error Handling: Manages timeouts, network failures, and runner unavailability
The Runner is responsible for:
- Command Execution: Executes Linux commands in isolated subprocesses
- Process Management: Handles timeouts, process termination, and resource limits
- Output Capture: Captures stdout/stderr streams
- Metrics Collection: Tracks execution metrics (success/failure rates, runtimes)
- Health Reporting: Exposes health check endpoint for orchestration
The Docker infrastructure provides:
- Containerization: Docker containers provide isolation and resource limits
- Networking: Docker bridge network enables service-to-service communication
- Resource Management: CPU and memory limits enforced via Docker cgroups
- Health Monitoring: Health checks enable automatic recovery and load balancing
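These concerns map directly onto Compose configuration. A hypothetical docker-compose.yml fragment illustrating resource limits and health checks (service names, limit values, and the health-check command are illustrative, not necessarily the project's actual file):

```yaml
services:
  coordinator:
    build: ./coordinator
    ports: ["8000:8000"]
    environment:
      - RUNNER_URL=http://runner:8080   # service discovery via env var
    depends_on: [runner]
  runner:
    build: ./runner
    mem_limit: 512m                     # memory ceiling enforced via cgroups
    cpus: 0.5                           # fractional CPU share
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
```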
The system has three main components:
- Coordinator Service: FastAPI-based service that accepts job requests, dispatches them to runners, and tracks job status
- Runner Service: Executes Linux commands using subprocess, captures output, and reports results back
- Docker Infrastructure: Both services run in containers with proper isolation and resource limits
This project demonstrates:
- Systems Engineering: Process management, inter-service communication, resource isolation
- Platform Engineering: Containerization, orchestration, observability
- Linux Familiarity: Direct interaction with OS primitives, subprocess execution
- Cloud Readiness: Designed with Kubernetes deployment in mind
Prerequisites:
- Docker and Docker Compose
- Python 3.11+ (for local development)
Quick start:

```shell
# Start all services
docker-compose up --build

# In another terminal, submit a job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "echo hello world"}'

# Check job status (the job_id is returned in the response to the submission)
curl http://localhost:8000/jobs/{job_id}

# View metrics
curl http://localhost:8000/metrics
```

To run the services locally without Docker:

```shell
# Coordinator
cd coordinator
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

# Runner (in another terminal)
cd runner
pip install -r requirements.txt
python main.py
```

POST /jobs
Content-Type: application/json

```json
{
  "command": "uname -a",
  "timeout": 30
}
```

Response:

```json
{
  "job_id": "abc123",
  "status": "pending",
  "created_at": "2024-01-01T00:00:00Z"
}
```

GET /jobs/{job_id}

Response:
```json
{
  "job_id": "abc123",
  "status": "completed",
  "exit_code": 0,
  "stdout": "Linux hostname 5.4.0...",
  "stderr": "",
  "runtime_ms": 45
}
```

GET /metrics

Returns Prometheus-style metrics:

```
# HELP jobs_total Total number of jobs processed
# TYPE jobs_total counter
jobs_total 42
# HELP jobs_failed_total Total number of failed jobs
# TYPE jobs_failed_total counter
jobs_failed_total 2
# HELP job_runtime_seconds Job runtime in seconds
# TYPE job_runtime_seconds histogram
job_runtime_seconds_bucket{le="0.1"} 10
job_runtime_seconds_bucket{le="1.0"} 35
```
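Because the exposition format is plain text, the /metrics endpoint can be rendered without a client library. A minimal sketch for the counters (histogram buckets omitted; the `render_metrics` helper name is illustrative):

```python
def render_metrics(jobs_total: int, jobs_failed_total: int) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP jobs_total Total number of jobs processed",
        "# TYPE jobs_total counter",
        f"jobs_total {jobs_total}",
        "# HELP jobs_failed_total Total number of failed jobs",
        "# TYPE jobs_failed_total counter",
        f"jobs_failed_total {jobs_failed_total}",
    ]
    # The format requires a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```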
Current deployment (Docker Compose):
- Docker Compose orchestrates services
- Direct HTTP communication between coordinator and runners
- In-memory job tracking
Scaling to Kubernetes:
- Service Discovery: Use Kubernetes Services for runner discovery
- Job Queue: Replace in-memory queue with Redis/RabbitMQ
- State Management: Use PostgreSQL for job persistence
- Autoscaling: Kubernetes HPA based on queue depth
- Security: Network policies, pod security policies, RBAC
- Monitoring: Prometheus + Grafana for metrics visualization
- Logging: Centralized logging with ELK stack or Loki
AWS:
- ECS/EKS for orchestration
- SQS for job queue
- CloudWatch for metrics/logs
- IAM roles for service authentication
Azure:
- AKS for orchestration
- Service Bus for job queue
- Azure Monitor for metrics/logs
- Managed Identity for authentication
- Sandboxing: Jobs run in isolated containers with resource limits
- Input Validation: Command sanitization and timeout enforcement
- Network Isolation: Containers run on isolated networks
- Future: Implement user namespaces, seccomp profiles, AppArmor policies
- Autoscaling: Design supports horizontal scaling of runners
- Resource Management: CPU/memory limits per container
- High Availability: Coordinator can be replicated behind load balancer
- Future: Implement health checks, graceful shutdown, circuit breakers
- Job Priority: Current FIFO queue, extensible to priority queues
- Fairness: Round-robin dispatch to runners
- SLA Tracking: Metrics expose job completion times
- Future: Implement job priorities, quotas, rate limiting
```
mini-platform/
├── coordinator/           # Coordinator service
│   ├── main.py            # FastAPI application
│   ├── Dockerfile         # Container definition
│   └── requirements.txt   # Python dependencies
├── runner/                # Runner service
│   ├── main.py            # Job execution logic
│   ├── Dockerfile         # Container definition
│   └── requirements.txt   # Python dependencies
├── docker-compose.yml     # Local orchestration
├── .github/
│   └── workflows/
│       └── ci.yml         # CI/CD pipeline
└── README.md              # This file
```
```shell
# Run tests
docker-compose up -d
pytest tests/

# Manual testing
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "cat /proc/cpuinfo"}'
```