Mini Distributed Compute Platform

A lightweight, containerized job-execution platform that runs Linux workloads, exposes metrics, logs executions, and is designed to scale from a laptop to the cloud.

Think of it as "Slurm-lite for containers": a minimal distributed compute platform that demonstrates core systems engineering principles.

What This System Does

This platform provides:

  • Job Execution: Execute arbitrary Linux commands in isolated containers
  • Job Orchestration: A coordinator service dispatches jobs to runner instances
  • Observability: Structured logging and Prometheus-style metrics
  • Scalability: Designed to scale from a single machine to Kubernetes clusters

Architecture

High-Level Overview

┌──────────────────────────────────────────────────────────────────┐
│                           Client Layer                           │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐            │
│  │   curl/CLI   │  │  HTTP Client │  │  API Client  │            │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘            │
└─────────┼─────────────────┼─────────────────┼────────────────────┘
          │                 │                 │
          │  HTTP/REST      │                 │
          │  Port 8000      │                 │
          │                 │                 │
┌─────────▼─────────────────▼─────────────────▼────────────────────┐
│                       Coordinator Service                        │
│                     (FastAPI / Python 3.11+)                     │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       API Endpoints                        │  │
│  │  • POST /jobs      - Submit new job                        │  │
│  │  • GET  /jobs/{id} - Query job status                      │  │
│  │  • GET  /metrics   - Prometheus metrics                    │  │
│  │  • GET  /health    - Health check                          │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    Job Management Layer                    │  │
│  │  • Job queue (in-memory dict)                              │  │
│  │  • Job state machine: pending → running → completed/failed │  │
│  │  • Async job dispatch (asyncio.create_task)                │  │
│  │  • Job tracking with UUIDs                                 │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  Metrics & Observability                   │  │
│  │  • jobs_total, jobs_completed_total, jobs_failed_total     │  │
│  │  • job_runtime_seconds (histogram)                         │  │
│  │  • Structured logging (stdout)                             │  │
│  └────────────────────────────────────────────────────────────┘  │
└─────────┬────────────────────────────────────────────────────────┘
          │
          │  HTTP POST /execute
          │  (httpx.AsyncClient)
          │  Port 8080
          │
┌─────────▼────────────────────────────────────────────────────────┐
│                          Runner Service                          │
│                     (FastAPI / Python 3.11+)                     │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       API Endpoints                        │  │
│  │  • POST /execute - Execute Linux command                   │  │
│  │  • GET  /metrics - Execution metrics                       │  │
│  │  • GET  /health  - Health check                            │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                      Execution Engine                      │  │
│  │  • subprocess.Popen (shell=True)                           │  │
│  │  • Timeout enforcement (subprocess.communicate)            │  │
│  │  • Process isolation (container boundaries)                │  │
│  │  • stdout/stderr capture                                   │  │
│  │  • Exit code tracking                                      │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  Metrics & Observability                   │  │
│  │  • executions_total, executions_success_total              │  │
│  │  • executions_failed_total                                 │  │
│  │  • Structured logging (stdout)                             │  │
│  └────────────────────────────────────────────────────────────┘  │
└─────────┬────────────────────────────────────────────────────────┘
          │
          │  subprocess execution
          │
┌─────────▼────────────────────────────────────────────────────────┐
│                        Linux Process Layer                       │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                 Isolated Process Execution                 │  │
│  │  • Command execution (shell=True)                          │  │
│  │  • Resource limits (Docker cgroups)                        │  │
│  │  • Timeout handling                                        │  │
│  │  • Output capture (stdout/stderr)                          │  │
│  └────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘

Container Architecture

[Container architecture diagram]

Data Flow

┌─────────┐
│ Client  │
└────┬────┘
     │
     │ 1. POST /jobs {"command": "uname -a", "timeout": 30}
     │
┌────▼──────────────────────────────────────────────────────────────┐
│ Coordinator                                                       │
│  • Generate UUID (job_id)                                         │
│  • Store job in memory: {job_id, status: "pending", ...}          │
│  • Increment jobs_total metric                                    │
│  • Return JobResponse {job_id, status: "pending"}                 │
│                                                                   │
│  • Async dispatch: asyncio.create_task(execute_job(...))          │
│    └─> Update status: "pending" → "running"                       │
└────┬──────────────────────────────────────────────────────────────┘
     │
     │ 2. POST http://runner:8080/execute
     │    {"command": "uname -a", "timeout": 30}
     │
┌────▼──────────────────────────────────────────────────────────────┐
│ Runner                                                            │
│  • Validate request                                               │
│  • Execute: subprocess.Popen(command, shell=True)                 │
│  • Monitor with timeout (subprocess.communicate)                  │
│  • Capture stdout/stderr                                          │
│  • Track exit_code                                                │
│  • Update metrics (executions_total, etc.)                        │
│  • Return ExecuteResponse {exit_code, stdout, stderr, runtime_ms} │
└────┬──────────────────────────────────────────────────────────────┘
     │
     │ 3. HTTP Response {exit_code: 0, stdout: "...", ...}
     │
┌────▼──────────────────────────────────────────────────────────────┐
│ Coordinator                                                       │
│  • Update job status: "running" → "completed"/"failed"            │
│  • Store results: {exit_code, stdout, stderr, runtime_ms}         │
│  • Update metrics:                                                │
│    - jobs_completed_total or jobs_failed_total                    │
│    - job_runtimes.append(runtime_ms)                              │
│  • Log completion                                                 │
└────┬──────────────────────────────────────────────────────────────┘
     │
     │ 4. GET /jobs/{job_id}
     │
┌────▼────┐
│ Client  │  ← Returns full job status with results
└─────────┘

Component Responsibilities

Coordinator Service

  • API Gateway: Exposes REST API for job submission and status queries
  • Job Orchestration: Manages job lifecycle (pending → running → completed/failed)
  • Service Discovery: Discovers runner instances via RUNNER_URL environment variable
  • State Management: Maintains in-memory job store
  • Metrics Aggregation: Collects and exposes Prometheus-style metrics
  • Error Handling: Manages timeouts, network failures, and runner unavailability
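The dispatch path above can be sketched roughly as follows. The jobs dict and the execute_job/submit_job names are illustrative assumptions, not necessarily the repository's actual identifiers; the injected post callable stands in for an httpx.AsyncClient.post call against the RUNNER_URL, so the sketch stays dependency-free.

```python
import asyncio
import uuid
from typing import Awaitable, Callable

# Illustrative in-memory job store: job_id -> job record
jobs: dict[str, dict] = {}

# (url, payload) -> decoded JSON response; in the real service this would
# wrap httpx.AsyncClient.post against the RUNNER_URL environment variable
PostFn = Callable[[str, dict], Awaitable[dict]]


async def execute_job(job_id: str, command: str, timeout: int, post: PostFn) -> None:
    """Dispatch one job to the runner and record the outcome."""
    jobs[job_id]["status"] = "running"
    try:
        result = await post(
            "http://runner:8080/execute",
            {"command": command, "timeout": timeout},
        )
        jobs[job_id].update(result)
        jobs[job_id]["status"] = (
            "completed" if result.get("exit_code") == 0 else "failed"
        )
    except Exception:
        # Runner unreachable, request timed out, or bad response
        jobs[job_id]["status"] = "failed"


def submit_job(command: str, post: PostFn, timeout: int = 30) -> str:
    """Register a job as pending and schedule its dispatch without blocking.

    Must be called from within a running event loop (e.g. a FastAPI handler).
    """
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"command": command, "timeout": timeout, "status": "pending"}
    asyncio.create_task(execute_job(job_id, command, timeout, post))
    return job_id
```

submit_job returns the new job_id immediately, while the created task drives the pending → running → completed/failed transitions in the background.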

Runner Service

  • Command Execution: Executes Linux commands in isolated subprocesses
  • Process Management: Handles timeouts, process termination, and resource limits
  • Output Capture: Captures stdout/stderr streams
  • Metrics Collection: Tracks execution metrics (success/failure rates, runtimes)
  • Health Reporting: Exposes health check endpoint for orchestration
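The execution engine described above can be sketched as a single function. The run_command name and the -1 timeout sentinel are illustrative choices, though the Popen/communicate pattern matches what this README describes.

```python
import subprocess
import time


def run_command(command: str, timeout: int = 30) -> dict:
    """Run a shell command, enforce a timeout, and capture its output."""
    start = time.monotonic()
    proc = subprocess.Popen(
        command,
        shell=True,               # matches the platform's shell=True model
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()               # terminate the runaway process
        stdout, stderr = proc.communicate()  # drain the pipes after the kill
        exit_code = -1            # illustrative sentinel for "timed out"
    runtime_ms = int((time.monotonic() - start) * 1000)
    return {
        "exit_code": exit_code,
        "stdout": stdout,
        "stderr": stderr,
        "runtime_ms": runtime_ms,
    }
```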

Infrastructure Layer

  • Containerization: Docker containers provide isolation and resource limits
  • Networking: Docker bridge network enables service-to-service communication
  • Resource Management: CPU and memory limits enforced via Docker cgroups
  • Health Monitoring: Health checks enable automatic recovery and load balancing

Components

  1. Coordinator Service: FastAPI-based service that accepts job requests, dispatches them to runners, and tracks job status
  2. Runner Service: Executes Linux commands using subprocess, captures output, and reports results back
  3. Docker Infrastructure: Both services run in containers with proper isolation and resource limits

Why This Exists

This project demonstrates:

  • Systems Engineering: Process management, inter-service communication, resource isolation
  • Platform Engineering: Containerization, orchestration, observability
  • Linux Familiarity: Direct interaction with OS primitives, subprocess execution
  • Cloud Readiness: Designed with Kubernetes deployment in mind

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.11+ (for local development)

Running the Platform

# Start all services
docker-compose up --build

# In another terminal, submit a job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "echo hello world"}'

# Check job status
curl http://localhost:8000/jobs/{job_id}  # use the job_id returned by the submit request

# View metrics
curl http://localhost:8000/metrics

Local Development

# Coordinator
cd coordinator
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

# Runner (in another terminal)
cd runner
pip install -r requirements.txt
python main.py

API Reference

Submit a Job

POST /jobs
Content-Type: application/json

{
  "command": "uname -a",
  "timeout": 30
}

Response:

{
  "job_id": "abc123",
  "status": "pending",
  "created_at": "2024-01-01T00:00:00Z"
}

Get Job Status

GET /jobs/{job_id}

Response:

{
  "job_id": "abc123",
  "status": "completed",
  "exit_code": 0,
  "stdout": "Linux hostname 5.4.0...",
  "stderr": "",
  "runtime_ms": 45
}
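Because submission is asynchronous, clients poll GET /jobs/{job_id} until the status leaves pending/running. A minimal polling helper might look like this; it takes a get_status callable (e.g. a small wrapper around an HTTP GET) rather than a hard-coded URL, and the names are illustrative:

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], dict],
    poll_interval: float = 0.5,
    max_wait: float = 60.0,
) -> dict:
    """Poll a job-status source until the job leaves pending/running."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        job = get_status()
        if job["status"] not in ("pending", "running"):
            return job  # completed or failed
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish within max_wait")


# With a live coordinator this could be wired up roughly as:
#   job = wait_for_job(
#       lambda: httpx.get(f"http://localhost:8000/jobs/{job_id}").json()
#   )
```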

Metrics Endpoint

GET /metrics

Returns Prometheus-style metrics:

# HELP jobs_total Total number of jobs processed
# TYPE jobs_total counter
jobs_total 42

# HELP jobs_failed_total Total number of failed jobs
# TYPE jobs_failed_total counter
jobs_failed_total 2

# HELP job_runtime_seconds Job runtime in seconds
# TYPE job_runtime_seconds histogram
job_runtime_seconds_bucket{le="0.1"} 10
job_runtime_seconds_bucket{le="1.0"} 35
job_runtime_seconds_bucket{le="+Inf"} 40
job_runtime_seconds_sum 18.7
job_runtime_seconds_count 40

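Counter metrics in this text format are simple to emit by hand, without a client library. A sketch of how the coordinator's counters could be rendered (the function and dict names are illustrative; the histogram series would need additional _bucket/_sum/_count lines):

```python
def render_counters(counters: dict[str, int], help_text: dict[str, str]) -> str:
    """Render counter metrics in the Prometheus text exposition format."""
    lines: list[str] = []
    for name, value in counters.items():
        # Each metric gets a HELP line, a TYPE line, and a sample line
        lines.append(f"# HELP {name} {help_text.get(name, name)}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Example:
#   render_counters({"jobs_total": 42},
#                   {"jobs_total": "Total number of jobs processed"})
```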
Scaling to Kubernetes / Cloud

Current Design (Local)

  • Docker Compose orchestrates services
  • Direct HTTP communication between coordinator and runners
  • In-memory job tracking

Production Deployment (Kubernetes)

  1. Service Discovery: Use Kubernetes Services for runner discovery
  2. Job Queue: Replace in-memory queue with Redis/RabbitMQ
  3. State Management: Use PostgreSQL for job persistence
  4. Autoscaling: Kubernetes HPA based on queue depth
  5. Security: Network policies, pod security policies, RBAC
  6. Monitoring: Prometheus + Grafana for metrics visualization
  7. Logging: Centralized logging with ELK stack or Loki

Cloud Provider Considerations

AWS:

  • ECS/EKS for orchestration
  • SQS for job queue
  • CloudWatch for metrics/logs
  • IAM roles for service authentication

Azure:

  • AKS for orchestration
  • Service Bus for job queue
  • Azure Monitor for metrics/logs
  • Managed Identity for authentication

Cross-Functional Considerations

Security

  • Sandboxing: Jobs run in isolated containers with resource limits
  • Input Validation: Command sanitization and timeout enforcement
  • Network Isolation: Containers run on isolated networks
  • Future: Implement user namespaces, seccomp profiles, AppArmor policies
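Input validation on the coordinator side might be sketched as below. The deny-list and the limits are purely illustrative assumptions (real isolation comes from the container boundary, not from filtering):

```python
import shlex

MAX_TIMEOUT = 300                          # illustrative upper bound, seconds
DENYLIST = {"shutdown", "reboot", "mkfs"}  # illustrative deny-list only


def validate_request(command: str, timeout: int) -> None:
    """Reject empty, malformed, or obviously dangerous job requests."""
    if not command.strip():
        raise ValueError("command must not be empty")
    if not 0 < timeout <= MAX_TIMEOUT:
        raise ValueError(f"timeout must be in (0, {MAX_TIMEOUT}]")
    first_token = shlex.split(command)[0]   # program name, shell-aware split
    if first_token in DENYLIST:
        raise ValueError(f"command {first_token!r} is not allowed")
```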

Infrastructure

  • Autoscaling: Design supports horizontal scaling of runners
  • Resource Management: CPU/memory limits per container
  • High Availability: Coordinator can be replicated behind load balancer
  • Future: Implement health checks, graceful shutdown, circuit breakers

Product

  • Job Priority: Current FIFO queue, extensible to priority queues
  • Fairness: Round-robin dispatch to runners
  • SLA Tracking: Metrics expose job completion times
  • Future: Implement job priorities, quotas, rate limiting
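The FIFO-to-priority extension mentioned above fits in a few lines with heapq; the tuple layout (priority, sequence number, job_id) is an illustrative convention that preserves FIFO order within each priority level:

```python
import heapq
import itertools

_sequence = itertools.count()  # tie-breaker: keeps FIFO order per priority
_queue: list[tuple[int, int, str]] = []


def enqueue(job_id: str, priority: int = 10) -> None:
    """Add a job; lower priority numbers are dispatched first."""
    heapq.heappush(_queue, (priority, next(_sequence), job_id))


def dequeue() -> str:
    """Pop the highest-priority (then oldest) job."""
    _priority, _seq, job_id = heapq.heappop(_queue)
    return job_id
```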

Development

Project Structure

mini-platform/
├── coordinator/           # Coordinator service
│   ├── main.py            # FastAPI application
│   ├── Dockerfile         # Container definition
│   └── requirements.txt   # Python dependencies
├── runner/                # Runner service
│   ├── main.py            # Job execution logic
│   ├── Dockerfile         # Container definition
│   └── requirements.txt   # Python dependencies
├── docker-compose.yml     # Local orchestration
├── .github/
│   └── workflows/
│       └── ci.yml         # CI/CD pipeline
└── README.md              # This file

Testing

# Run tests
docker-compose up -d
pytest tests/

# Manual testing
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "cat /proc/cpuinfo"}'
