Open Data Platform

This repository contains the infrastructure, code, and documentation for building a modern, end-to-end data platform using exclusively open-source components.

The entire stack is designed to be cloud-agnostic, capable of running on any Kubernetes cluster—from a local laptop (using kind) to a federated multi-cloud production environment.

🚀 Core Principles

  • 100% Open-Source: No vendor lock-in. Every component is open-source.
  • Cloud-Agnostic: Built on Kubernetes and Cilium, allowing the platform to run on any cloud provider (AWS, GCP, Azure, Hetzner, OVH) or on-premise.
  • Modular: Each component (e.g., orchestration, compute) is independent. You can swap out tools as needed.
  • Scalable: Designed to scale from a single-node setup to a high-availability, multi-cluster mesh.

🛠️ Core Architecture & Tech Stack

The platform is built from the following open-source components, categorized by their function:

Networking

Provides the core connectivity and security for all services.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Cilium | CNI & Cluster Mesh | L3/L4/L7 networking and security. Used for connecting multiple clusters. |
| Headscale | Overlay Network (Alternative) | Self-hosted control server for Tailscale clients. |
| Netbird | Overlay Network (Alternative) | Open-source VPN for simple L3 connectivity. |

Security

Handles identity, authentication, and authorization for users and services.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Keycloak | Identity & Access (IAM) | Provides secure single sign-on (SSO) for all platform UIs. |
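Platform UIs integrate with Keycloak via OpenID Connect. As a hedged illustration (the realm, client ID, and hostnames below are hypothetical, not taken from this repo), this sketch builds the authorization-code URL a UI would redirect users to for SSO:

```python
from urllib.parse import urlencode

def authorize_url(base: str, realm: str, client_id: str, redirect_uri: str) -> str:
    """Build a Keycloak OIDC authorization-code URL for browser-based SSO."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "openid profile email",
    }
    return f"{base}/realms/{realm}/protocol/openid-connect/auth?{urlencode(params)}"

# Hypothetical realm and client, for illustration only.
print(authorize_url("https://keycloak.example.internal", "platform",
                    "superset", "https://superset.example.internal/oauth2/callback"))
```

Each platform UI (Superset, Dagster, GitLab, etc.) would be registered as its own OIDC client in the realm, so access can be granted or revoked per service.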

Code Storage

Manages all code, CI/CD pipelines, and container images.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| GitLab | Git + CI/CD | All-in-one, self-hosted platform for code and container registry. |

Orchestration

Defines, schedules, and monitors all data pipelines and workflows.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Dagster | Data Orchestrator | A modern, asset-based orchestrator. Our primary choice. |
| Prefect | Data Orchestrator | A Python-native workflow engine with a focus on dynamic pipelines. |
| Airflow | Data Orchestrator | The classic, battle-tested tool. Evaluated as an alternative. |
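What "asset-based" means in practice: the orchestrator derives run order from a dependency graph of data assets rather than from hand-wired task sequences. A minimal sketch of that idea in plain Python (the asset names are hypothetical, and Dagster's real API is much richer than this):

```python
from graphlib import TopologicalSorter

# Hypothetical asset graph: each asset lists the upstream assets it depends on.
assets = {
    "raw_events": [],                       # ingested via SFTP into object storage
    "clean_events": ["raw_events"],         # large-scale transformation (Spark)
    "daily_summary": ["clean_events"],      # aggregate for reporting
    "superset_dataset": ["daily_summary"],  # exposed to BI dashboards
}

def materialization_order(graph):
    """Return a valid order to materialize assets, dependencies first."""
    return list(TopologicalSorter(graph).static_order())

print(materialization_order(assets))
# ['raw_events', 'clean_events', 'daily_summary', 'superset_dataset']
```

An asset-based orchestrator uses this graph both for scheduling and for lineage: rerunning `clean_events` automatically marks everything downstream as stale.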

Data Ingestion

Handles moving data from external sources into the platform.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| SFTPGo | SFTP Server | For ingesting files from external partners via SFTP. |
| (TBD) | Database Movers (e.g., Airbyte, Meltano) | For moving data from operational DBs. |

Object Storage

The "data lake" for storing all raw and processed data in an S3-compatible format.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| MinIO | S3-Compatible Storage | High-performance, distributed object storage. |
| Garage | S3-Compatible Storage | Alternative focused on distributed, multi-datacenter setups. |
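Because both MinIO and Garage speak the S3 API, the lake layout is just a key-naming convention. As a sketch (this particular zone/partition scheme is an assumption, not something mandated by this repo), a Hive-style partitioned key for the raw zone could be built like this:

```python
from datetime import date

def raw_key(source: str, dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key for the raw zone of the lake."""
    return (f"raw/{source}/{dataset}/"
            f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}")

print(raw_key("sftp", "partner_a", date(2024, 5, 7), "export.csv"))
# raw/sftp/partner_a/year=2024/month=05/day=07/export.csv
```

Keeping the key scheme in one function means Spark jobs, serverless functions, and orchestrator assets all agree on where data lives, regardless of which S3 backend is deployed.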

Data Compute

The processing engines for running large-scale transformations.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Apache Spark | Distributed Compute | Primary engine for large-scale data transformation (ETL/ELT). |
| Sail | (TBD) | (TBD) |

Serverless Compute

For running event-driven, short-lived functions.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Apache OpenWhisk | FaaS Platform | Ideal for real-time ingestion, ML model inference, etc. |
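OpenWhisk actions in Python follow a simple contract: the runtime invokes a `main(params)` function and expects a JSON-serializable dict back. A minimal sketch (the greeting logic is illustrative only; a real action here would e.g. validate an incoming event and write it to object storage):

```python
import json

def main(params):
    """Entry point invoked by the OpenWhisk Python runtime."""
    name = params.get("name", "world")
    return {"greeting": f"Hello, {name}!"}

# Local smoke test; on the platform the action would be deployed with the wsk CLI.
print(json.dumps(main({"name": "platform"})))
```

Because actions are stateless and short-lived, they pair naturally with the object store and orchestrator above: an event triggers the function, which lands data for a downstream asset to pick up.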

Governance & BI

For visualizing, monitoring, and governing the data.

| Tool | Primary Role | Notes |
| --- | --- | --- |
| Apache Superset | Business Intelligence | Data visualization, dashboards, and BI. |
| Apache NiFi | Data Flow | Visual data-flow management, governance, and lineage. |

🏁 Getting Started

Based on the platform's GitOps architecture (`clusters/dev` vs. `clusters/prd`), here is the exact order of commands for both environments.

💻 Scenario 1: Local Development (Kind on Mac)

Goal: create a cluster from scratch, hack networking to work on Mac, and start developing.

1. **Commit your changes** (Flux pulls from Git, not from your local disk):

   ```sh
   git push origin main
   ```

2. **Create the cluster and network.** This creates the Kind cluster, injects the host IP into a ConfigMap, and installs Cilium manually (to fix `NotReady` nodes):

   ```sh
   just up
   ```

3. **Install Flux.** This installs the Flux controllers, adopts the Cilium release, and installs the platform components (RustFS, Sail):

   ```sh
   just bootstrap
   ```

4. **Access services.** This opens the tunnel to `*.localhost`:

   ```sh
   just connect
   ```

☁️ Scenario 2: VPS / Cloud (Production)

Goal: connect to a real remote cluster (e.g., Hetzner, AWS) that was provisioned with no CNI and no kube-proxy.

Prerequisite: you have the KUBECONFIG for your remote cluster.

1. **Connect to the remote cluster:**

   ```sh
   export KUBECONFIG=~/path/to/vps.kubeconfig
   ```

2. **Bootstrap networking (manual step).** You cannot use `just up` (it tries to create a Kind cluster). Instead, run the Helm install for Cilium manually, pointing it at the API server of your VPS:

   ```sh
   # 1. Install the Gateway API CRDs
   kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.4.0/experimental-install.yaml

   # 2. Install Cilium, pointing k8sServiceHost at the API server's internal IP
   #    (replace 10.0.0.1 with your VPS's private IP)
   helm install cilium cilium/cilium \
     --version 1.18.4 \
     --namespace kube-system \
     -f values/cilium.yaml \
     --set k8sServiceHost=10.0.0.1 \
     --set k8sServicePort=6443
   ```

3. **Bootstrap Flux**, overriding the path to point at the production cluster definition:

   ```sh
   just bootstrap path=clusters/prd
   ```

4. **Access.**
   - Do not use `just connect`.
   - Get the external IP of your Gateway: `kubectl get svc -n default`.
   - Configure your DNS (Cloudflare/GoDaddy) to point `*.your-domain.com` to that IP.

Summary Checklist

| Step | Local (Kind) | VPS / Cloud (Strict Mode) |
| --- | --- | --- |
| 1. Provision | `just up` (creates Kind) | Terraform / Ansible / manual |
| 2. Networking | `just up` (auto-runs `bootstrap-cni`) | Manual Helm install (targeting VPS IP) |
| 3. GitOps | `just bootstrap` | `just bootstrap path=clusters/prd` |
| 4. Access | `just connect` (tunnel) | Public DNS / LoadBalancer IP |
