Open Data Platform

This repository contains the infrastructure, code, and documentation for building a modern, end-to-end data platform using exclusively open-source components.

The entire stack is designed to be cloud-agnostic, capable of running on any Kubernetes cluster—from a local laptop (using kind) to a federated multi-cloud production environment.

🚀 Core Principles

  • 100% Open-Source: No vendor lock-in. Every component is open-source.
  • Cloud-Agnostic: Built on Kubernetes and Cilium, allowing the platform to run on any cloud provider (AWS, GCP, Azure, Hetzner, OVH) or on-premise.
  • Modular: Each component (e.g., orchestration, compute) is independent. You can swap out tools as needed.
  • Scalable: Designed to scale from a single-node setup to a high-availability, multi-cluster mesh.

🛠️ Core Architecture & Tech Stack

The platform is built from the following open-source components, categorized by their function:

Networking

Provides the core connectivity and security for all services.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Cilium | CNI & Cluster Mesh | L3/L4/L7 networking and security. Used for connecting multiple clusters. |
| Headscale | Overlay Network (Alternative) | Self-hosted control server for Tailscale clients. |
| Netbird | Overlay Network (Alternative) | Open-source VPN for simple L3 connectivity. |

Security

Handles identity, authentication, and authorization for users and services.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Keycloak | Identity & Access (IAM) | Provides secure single sign-on (SSO) for all platform UIs. |
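Platform UIs integrate with Keycloak via OpenID Connect. As a hedged illustration (the realm, client ID, and hostnames below are hypothetical, not taken from this repo), this sketch builds the authorization-code URL a UI would redirect users to for SSO:

```python
from urllib.parse import urlencode

def authorize_url(base: str, realm: str, client_id: str, redirect_uri: str) -> str:
    """Build a Keycloak OIDC authorization-code URL for browser-based SSO."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "openid profile email",
    }
    return f"{base}/realms/{realm}/protocol/openid-connect/auth?{urlencode(params)}"

# Hypothetical realm and client, for illustration only.
print(authorize_url("https://keycloak.example.internal", "platform",
                    "superset", "https://superset.example.internal/oauth2/callback"))
```

Each platform UI (Superset, Dagster, GitLab, etc.) would be registered as its own OIDC client in the realm, so access can be granted or revoked per service.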

Code Storage

Manages all code, CI/CD pipelines, and container images.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| GitLab | Git + CI/CD | All-in-one, self-hosted platform for code and container registry. |

Orchestration

Defines, schedules, and monitors all data pipelines and workflows.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Dagster | Data Orchestrator | A modern, asset-based orchestrator. Our primary choice. |
| Prefect | Data Orchestrator | A Python-native workflow engine with a focus on dynamic pipelines. |
| Airflow | Data Orchestrator | The classic, battle-tested tool. Evaluated as an alternative. |
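What "asset-based" means in practice: the orchestrator derives run order from a dependency graph of data assets rather than from hand-wired task sequences. A minimal sketch of that idea in plain Python (the asset names are hypothetical, and Dagster's real API is much richer than this):

```python
from graphlib import TopologicalSorter

# Hypothetical asset graph: each asset lists the upstream assets it depends on.
assets = {
    "raw_events": [],                       # ingested via SFTP into object storage
    "clean_events": ["raw_events"],         # large-scale transformation (Spark)
    "daily_summary": ["clean_events"],      # aggregate for reporting
    "superset_dataset": ["daily_summary"],  # exposed to BI dashboards
}

def materialization_order(graph):
    """Return a valid order to materialize assets, dependencies first."""
    return list(TopologicalSorter(graph).static_order())

print(materialization_order(assets))
# ['raw_events', 'clean_events', 'daily_summary', 'superset_dataset']
```

An asset-based orchestrator uses this graph both for scheduling and for lineage: rerunning `clean_events` automatically marks everything downstream as stale.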

Data Ingestion

Handles moving data from external sources into the platform.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| SFTPGo | SFTP Server | For ingesting files from external partners via SFTP. |
| (TBD) | Database Movers (e.g., Airbyte, Meltano) | For moving data from operational DBs. |

Object Storage

The "data lake" for storing all raw and processed data in an S3-compatible format.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| MinIO | S3-Compatible Storage | High-performance, distributed object storage. |
| Garage | S3-Compatible Storage | Alternative focused on distributed, multi-datacenter setups. |
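Because both MinIO and Garage speak the S3 API, the lake layout is just a key-naming convention. As a sketch (this particular zone/partition scheme is an assumption, not something mandated by this repo), a Hive-style partitioned key for the raw zone could be built like this:

```python
from datetime import date

def raw_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key for the raw zone of the lake."""
    return (f"raw/{source}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}")

print(raw_key("sftp", "partner_a", date(2024, 5, 7), "export.csv"))
# raw/sftp/partner_a/year=2024/month=05/day=07/export.csv
```

Keeping the key scheme in one function means Spark jobs, serverless functions, and orchestrator assets all agree on where data lives, regardless of which S3 backend is deployed.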

Data Compute

The processing engines for running large-scale transformations.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Apache Spark | Distributed Compute | Primary engine for large-scale data transformation (ETL/ELT). |
| Sail | (TBD) | (TBD) |

Serverless Compute

For running event-driven, short-lived functions.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Apache OpenWhisk | FaaS Platform | Ideal for real-time ingestion, ML model inference, etc. |
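OpenWhisk actions in Python follow a simple contract: the runtime invokes a `main(params)` function and expects a JSON-serializable dict back. A minimal sketch (the greeting logic is illustrative only; a real action here would e.g. validate an incoming event and write it to object storage):

```python
import json

def main(params):
    """Entry point invoked by the OpenWhisk Python runtime."""
    name = params.get("name", "world")
    return {"greeting": f"Hello, {name}!"}

# Local smoke test; on the platform the action would be deployed with the wsk CLI.
print(json.dumps(main({"name": "platform"})))
```

Because actions are stateless and short-lived, they pair naturally with the object store and orchestrator above: an event triggers the function, which lands data for a downstream asset to pick up.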

Governance & BI

For visualizing, monitoring, and governing the data.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Apache Superset | Business Intelligence | Data visualization, dashboards, and BI. |
| Apache NiFi | Data Flow | Visual data-flow management, governance, and lineage. |

🏁 Getting Started

Based on the platform's GitOps architecture (`clusters/dev` vs. `clusters/prd`), here is the exact order of commands for both environments.

💻 Scenario 1: Local Development (Kind on Mac)

Goal: create a cluster from scratch, hack networking to work on Mac, and start developing.

1. **Commit your changes** (Flux pulls from Git, not from your local disk):

   ```sh
   git push origin main
   ```

2. **Create the cluster and network.** This creates the Kind cluster, injects the host IP into a ConfigMap, and installs Cilium manually (to fix `NotReady` nodes):

   ```sh
   just up
   ```

3. **Install Flux.** This installs the Flux controllers, adopts the Cilium release, and installs the platform components (RustFS, Sail):

   ```sh
   just bootstrap
   ```

4. **Access services.** This opens the tunnel to `*.localhost`:

   ```sh
   just connect
   ```

☁️ Scenario 2: VPS / Cloud (Production)

Goal: connect to a real remote cluster (e.g., Hetzner, AWS) that was provisioned with no CNI and no kube-proxy.

Prerequisite: you have the KUBECONFIG for your remote cluster.

1. **Connect to the remote cluster:**

   ```sh
   export KUBECONFIG=~/path/to/vps.kubeconfig
   ```

2. **Bootstrap networking (manual step).** You cannot use `just up` (it tries to create a Kind cluster). Instead, run the Helm install for Cilium manually, pointing it at the API server of your VPS:

   ```sh
   # 1. Install the Gateway API CRDs
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml

   # 2. Install Cilium, pointing k8sServiceHost at the API server's internal IP
   #    (replace 10.0.0.1 with your VPS's private IP)
   helm install cilium cilium/cilium \
     --version 1.18.4 \
     --namespace kube-system \
     -f values/cilium.yaml \
     --set k8sServiceHost=10.0.0.1 \
     --set k8sServicePort=6443
   ```

3. **Bootstrap Flux**, overriding the path to point at the production cluster definition:

   ```sh
   just bootstrap path=clusters/prd
   ```

4. **Access.**
   - Do not use `just connect`.
   - Get the external IP of your Gateway: `kubectl get svc -n default`.
   - Configure your DNS (Cloudflare/GoDaddy) to point `*.your-domain.com` to that IP.

Summary Checklist

| Step | Local (Kind) | VPS / Cloud (Strict Mode) |
| --- | --- | --- |
| 1. Provision | `just up` (creates Kind) | Terraform / Ansible / manual |
| 2. Networking | `just up` (auto-runs `bootstrap-cni`) | Manual Helm install (targeting VPS IP) |
| 3. GitOps | `just bootstrap` | `just bootstrap path=clusters/prd` |
| 4. Access | `just connect` (tunnel) | Public DNS / LoadBalancer IP |
