A lightweight, containerized job-execution platform that runs Linux workloads, exposes metrics, logs executions, and is designed to scale from a laptop to the cloud.
Think of it as "Slurm-lite for containers": a minimal distributed compute platform that demonstrates core systems engineering principles.
This platform provides:
- Job Execution: Execute arbitrary Linux commands in isolated containers
- Job Orchestration: A coordinator service dispatches jobs to runner instances
- Observability: Structured logging and Prometheus-style metrics
- Scalability: Designed to scale from a single machine to Kubernetes clusters
Architecture overview:

```
┌────────────────────────────────────────────────────────────────┐
│ Client Layer: curl/CLI · HTTP Client · API Client              │
└─────────────────────────────┬──────────────────────────────────┘
                              │
                              │ HTTP/REST
                              │ Port 8000
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ Coordinator Service (FastAPI / Python 3.11+)                   │
│                                                                │
│ API Endpoints                                                  │
│  • POST /jobs       - Submit new job                           │
│  • GET  /jobs/{id}  - Query job status                         │
│  • GET  /metrics    - Prometheus metrics                       │
│  • GET  /health     - Health check                             │
│                                                                │
│ Job Management Layer                                           │
│  • Job queue (in-memory dict)                                  │
│  • Job state machine: pending → running → completed/failed     │
│  • Async job dispatch (asyncio.create_task)                    │
│  • Job tracking with UUIDs                                     │
│                                                                │
│ Metrics & Observability                                        │
│  • jobs_total, jobs_completed_total, jobs_failed_total         │
│  • job_runtime_seconds (histogram)                             │
│  • Structured logging (stdout)                                 │
└─────────────────────────────┬──────────────────────────────────┘
                              │
                              │ HTTP POST /execute
                              │ (httpx.AsyncClient)
                              │ Port 8080
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ Runner Service (FastAPI / Python 3.11+)                        │
│                                                                │
│ API Endpoints                                                  │
│  • POST /execute - Execute Linux command                       │
│  • GET  /metrics - Execution metrics                           │
│  • GET  /health  - Health check                                │
│                                                                │
│ Execution Engine                                               │
│  • subprocess.Popen (shell=True)                               │
│  • Timeout enforcement (subprocess.communicate)                │
│  • Process isolation (container boundaries)                    │
│  • stdout/stderr capture                                       │
│  • Exit code tracking                                          │
│                                                                │
│ Metrics & Observability                                        │
│  • executions_total, executions_success_total,                 │
│    executions_failed_total                                     │
│  • Structured logging (stdout)                                 │
└─────────────────────────────┬──────────────────────────────────┘
                              │
                              │ subprocess execution
                              ▼
┌────────────────────────────────────────────────────────────────┐
│ Linux Process Layer                                            │
│                                                                │
│ Isolated Process Execution                                     │
│  • Command execution (shell=True)                              │
│  • Resource limits (Docker cgroups)                            │
│  • Timeout handling                                            │
│  • Output capture (stdout/stderr)                              │
└────────────────────────────────────────────────────────────────┘
```
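The Execution Engine shown above can be sketched with the standard library alone. This is a simplified sketch; the actual runner wraps this logic in a FastAPI /execute endpoint, and the `execute` helper name is illustrative:

```python
import subprocess
import time


def execute(command: str, timeout: int = 30) -> dict:
    """Run a shell command, enforce a timeout, and capture output."""
    start = time.monotonic()
    proc = subprocess.Popen(
        command,
        shell=True,                 # jobs are arbitrary shell strings
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        stdout, stderr = proc.communicate(timeout=timeout)
        exit_code = proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()                 # terminate the runaway process
        stdout, stderr = proc.communicate()
        exit_code = -1              # sentinel for "killed by timeout"
    runtime_ms = int((time.monotonic() - start) * 1000)
    return {"exit_code": exit_code, "stdout": stdout,
            "stderr": stderr, "runtime_ms": runtime_ms}
```

Isolation here comes from the container boundary, not the subprocess itself; the Popen/communicate pair only provides timeout enforcement and output capture.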
Job lifecycle (request flow):

```
┌──────────┐
│  Client  │
└────┬─────┘
     │
     │ 1. POST /jobs {"command": "uname -a", "timeout": 30}
     ▼
┌────────────────────────────────────────────────────────────────┐
│ Coordinator                                                    │
│  • Generate UUID (job_id)                                      │
│  • Store job in memory: {job_id, status: "pending", ...}       │
│  • Increment jobs_total metric                                 │
│  • Return JobResponse {job_id, status: "pending"}              │
│                                                                │
│  • Async dispatch: asyncio.create_task(execute_job(...))       │
│      └─> Update status: "pending" → "running"                  │
└────┬───────────────────────────────────────────────────────────┘
     │
     │ 2. POST http://runner:8080/execute
     │    {"command": "uname -a", "timeout": 30}
     ▼
┌────────────────────────────────────────────────────────────────┐
│ Runner                                                         │
│  • Validate request                                            │
│  • Execute: subprocess.Popen(command, shell=True)              │
│  • Monitor with timeout (subprocess.communicate)               │
│  • Capture stdout/stderr                                       │
│  • Track exit_code                                             │
│  • Update metrics (executions_total, etc.)                     │
│  • Return ExecuteResponse {exit_code, stdout, stderr,          │
│    runtime_ms}                                                 │
└────┬───────────────────────────────────────────────────────────┘
     │
     │ 3. HTTP Response {exit_code: 0, stdout: "...", ...}
     ▼
┌────────────────────────────────────────────────────────────────┐
│ Coordinator                                                    │
│  • Update job status: "running" → "completed"/"failed"         │
│  • Store results: {exit_code, stdout, stderr, runtime_ms}      │
│  • Update metrics:                                             │
│     - jobs_completed_total or jobs_failed_total                │
│     - job_runtimes.append(runtime_ms)                          │
│  • Log completion                                              │
└────┬───────────────────────────────────────────────────────────┘
     │
     │ 4. GET /jobs/{job_id}
     ▼
┌──────────┐
│  Client  │  ← Returns full job status with results
└──────────┘
```
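The coordinator side of this flow rests on a small in-memory job store and state machine. A minimal sketch, with field names taken from the flow above (not necessarily the service's exact schema):

```python
import uuid
from datetime import datetime, timezone

# In-memory job store: job_id -> job record. Lost on restart by design.
jobs: dict[str, dict] = {}


def submit_job(command: str, timeout: int = 30) -> dict:
    """Create a job record in the 'pending' state and return it."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {
        "job_id": job_id,
        "command": command,
        "timeout": timeout,
        "status": "pending",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    return jobs[job_id]


def complete_job(job_id: str, exit_code: int, stdout: str,
                 stderr: str, runtime_ms: int) -> None:
    """Transition running -> completed/failed and store the result."""
    job = jobs[job_id]
    job["status"] = "completed" if exit_code == 0 else "failed"
    job.update(exit_code=exit_code, stdout=stdout,
               stderr=stderr, runtime_ms=runtime_ms)
```

The GET /jobs/{id} endpoint then reduces to a dictionary lookup, which is why the current design needs no database.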
The Coordinator is responsible for:
- API Gateway: Exposes REST API for job submission and status queries
- Job Orchestration: Manages the job lifecycle (pending → running → completed/failed)
- Service Discovery: Discovers runner instances via the RUNNER_URL environment variable
- State Management: Maintains the in-memory job store
- Metrics Aggregation: Collects and exposes Prometheus-style metrics
- Error Handling: Manages timeouts, network failures, and runner unavailability
The Runner is responsible for:
- Command Execution: Executes Linux commands in isolated subprocesses
- Process Management: Handles timeouts, process termination, and resource limits
- Output Capture: Captures stdout/stderr streams
- Metrics Collection: Tracks execution metrics (success/failure rates, runtimes)
- Health Reporting: Exposes health check endpoint for orchestration
The Docker infrastructure provides:
- Containerization: Docker containers provide isolation and resource limits
- Networking: Docker bridge network enables service-to-service communication
- Resource Management: CPU and memory limits enforced via Docker cgroups
- Health Monitoring: Health checks enable automatic recovery and load balancing
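These concerns map directly onto Compose configuration. A hypothetical docker-compose.yml fragment illustrating resource limits and health checks (service names, limit values, and the health-check command are illustrative, not necessarily the project's actual file):

```yaml
services:
  coordinator:
    build: ./coordinator
    ports: ["8000:8000"]
    environment:
      - RUNNER_URL=http://runner:8080   # service discovery via env var
    depends_on: [runner]
  runner:
    build: ./runner
    mem_limit: 512m                     # memory ceiling enforced via cgroups
    cpus: 0.5                           # fractional CPU share
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 10s
```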
The system has three main components:
- Coordinator Service: FastAPI-based service that accepts job requests, dispatches them to runners, and tracks job status
- Runner Service: Executes Linux commands using subprocess, captures output, and reports results back
- Docker Infrastructure: Both services run in containers with proper isolation and resource limits
This project demonstrates:
- Systems Engineering: Process management, inter-service communication, resource isolation
- Platform Engineering: Containerization, orchestration, observability
- Linux Familiarity: Direct interaction with OS primitives, subprocess execution
- Cloud Readiness: Designed with Kubernetes deployment in mind
Prerequisites:
- Docker and Docker Compose
- Python 3.11+ (for local development)
Quick start:

```shell
# Start all services
docker-compose up --build

# In another terminal, submit a job
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "echo hello world"}'

# Check job status (the job_id is returned in the response to the submission)
curl http://localhost:8000/jobs/{job_id}

# View metrics
curl http://localhost:8000/metrics
```

To run the services locally without Docker:

```shell
# Coordinator
cd coordinator
pip install -r requirements.txt
uvicorn main:app --reload --port 8000

# Runner (in another terminal)
cd runner
pip install -r requirements.txt
python main.py
```

POST /jobs
Content-Type: application/json

```json
{
  "command": "uname -a",
  "timeout": 30
}
```

Response:

```json
{
  "job_id": "abc123",
  "status": "pending",
  "created_at": "2024-01-01T00:00:00Z"
}
```

GET /jobs/{job_id}

Response:
```json
{
  "job_id": "abc123",
  "status": "completed",
  "exit_code": 0,
  "stdout": "Linux hostname 5.4.0...",
  "stderr": "",
  "runtime_ms": 45
}
```

GET /metrics

Returns Prometheus-style metrics:

```
# HELP jobs_total Total number of jobs processed
# TYPE jobs_total counter
jobs_total 42
# HELP jobs_failed_total Total number of failed jobs
# TYPE jobs_failed_total counter
jobs_failed_total 2
# HELP job_runtime_seconds Job runtime in seconds
# TYPE job_runtime_seconds histogram
job_runtime_seconds_bucket{le="0.1"} 10
job_runtime_seconds_bucket{le="1.0"} 35
```
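Because the exposition format is plain text, the /metrics endpoint can be rendered without a client library. A minimal sketch for the counters (histogram buckets omitted; the `render_metrics` helper name is illustrative):

```python
def render_metrics(jobs_total: int, jobs_failed_total: int) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP jobs_total Total number of jobs processed",
        "# TYPE jobs_total counter",
        f"jobs_total {jobs_total}",
        "# HELP jobs_failed_total Total number of failed jobs",
        "# TYPE jobs_failed_total counter",
        f"jobs_failed_total {jobs_failed_total}",
    ]
    # The format requires a trailing newline after the last sample.
    return "\n".join(lines) + "\n"
```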
Current deployment (Docker Compose):
- Docker Compose orchestrates services
- Direct HTTP communication between coordinator and runners
- In-memory job tracking
Scaling to Kubernetes:
- Service Discovery: Use Kubernetes Services for runner discovery
- Job Queue: Replace in-memory queue with Redis/RabbitMQ
- State Management: Use PostgreSQL for job persistence
- Autoscaling: Kubernetes HPA based on queue depth
- Security: Network policies, pod security policies, RBAC
- Monitoring: Prometheus + Grafana for metrics visualization
- Logging: Centralized logging with ELK stack or Loki
AWS:
- ECS/EKS for orchestration
- SQS for job queue
- CloudWatch for metrics/logs
- IAM roles for service authentication
Azure:
- AKS for orchestration
- Service Bus for job queue
- Azure Monitor for metrics/logs
- Managed Identity for authentication
- Sandboxing: Jobs run in isolated containers with resource limits
- Input Validation: Command sanitization and timeout enforcement
- Network Isolation: Containers run on isolated networks
- Future: Implement user namespaces, seccomp profiles, AppArmor policies
- Autoscaling: Design supports horizontal scaling of runners
- Resource Management: CPU/memory limits per container
- High Availability: Coordinator can be replicated behind load balancer
- Future: Implement health checks, graceful shutdown, circuit breakers
- Job Priority: Current FIFO queue, extensible to priority queues
- Fairness: Round-robin dispatch to runners
- SLA Tracking: Metrics expose job completion times
- Future: Implement job priorities, quotas, rate limiting
```
mini-platform/
├── coordinator/           # Coordinator service
│   ├── main.py            # FastAPI application
│   ├── Dockerfile         # Container definition
│   └── requirements.txt   # Python dependencies
├── runner/                # Runner service
│   ├── main.py            # Job execution logic
│   ├── Dockerfile         # Container definition
│   └── requirements.txt   # Python dependencies
├── docker-compose.yml     # Local orchestration
├── .github/
│   └── workflows/
│       └── ci.yml         # CI/CD pipeline
└── README.md              # This file
```
```shell
# Run tests
docker-compose up -d
pytest tests/

# Manual testing
curl -X POST http://localhost:8000/jobs \
  -H "Content-Type: application/json" \
  -d '{"command": "cat /proc/cpuinfo"}'
```