This repository implements a production-grade observability stack for Kubernetes clusters using GitOps principles and Argo CD.
It provides a clean, reproducible setup for metrics, logs, dashboards, and alerting, following patterns commonly used in mid- to large-scale production environments.
It is intended for technical review and as a starting point for real-world adaptation.
The repository focuses only on the Kubernetes and GitOps layers.
Infrastructure provisioning (cluster, networking, IAM, storage) is treated as an external dependency and intentionally kept out of scope.
This repository is designed as:
- A realistic GitHub demo suitable for technical review
- A reference architecture for GitOps-managed observability
- A foundation that can be hardened for production without structural changes
Some configurations are intentionally simplified to ensure:
- fast bootstrap
- minimal external dependencies
- clarity of architecture and intent
This repository assumes a pre-existing Kubernetes cluster with sufficient capacity to run a full observability stack.
The following baseline has been validated to reliably run all components defined in this repository.
- Worker nodes
  - Instance types: t3.medium (primary), t3a.medium (capacity fallback)
  - Node count: 2
- Total capacity
  - vCPU: 4
  - Memory: 8 GiB
This baseline is sufficient for:
- Prometheus (kube-prometheus-stack)
- Alertmanager
- Grafana
- Loki (SingleBinary mode)
- Grafana Alloy (DaemonSet)
- CRDs and controllers reconciled by Argo CD
It supports:
- steady-state metrics and log ingestion
- short-lived reconciliation and compaction bursts
- interactive Grafana usage for review and demos
- Single-node clusters are not supported
  - Prometheus scheduling and resharding may fail
  - Loki compaction and ingestion become unstable
- Smaller instance types (e.g. t3.small) may:
  - cause OOM kills during Prometheus startup
  - throttle Loki under moderate log volume
- This baseline is not intended for high-ingest production workloads
Production sizing depends on:
- log volume
- retention period
- scrape frequency
- alert cardinality
Production capacity planning is intentionally out of scope for this repository.
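For orientation only, the sizing inputs listed above mostly correspond to a few chart values. The fragment below is a sketch that assumes the upstream kube-prometheus-stack values layout; every number in it is an illustrative placeholder, not a setting taken from or validated by this repository.

```yaml
# Illustrative kube-prometheus-stack values; all numbers are placeholders.
prometheus:
  prometheusSpec:
    scrapeInterval: 30s   # scrape frequency: lower values increase sample volume
    retention: 7d         # retention period: drives disk usage
    resources:
      requests:
        cpu: 250m
        memory: 1Gi
      limits:
        memory: 2Gi       # a tight limit risks OOM kills during startup
```

Log volume and log retention are governed by the Loki configuration rather than by these values.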
The architecture prioritizes:
- GitOps as the single source of truth
- Explicit separation between infrastructure and platform configuration
- Deterministic, observable system behavior
- Minimal operational magic
- Clear ownership boundaries between components
flowchart TD
K8S[Kubernetes Cluster]
ARGO[Argo CD<br/>GitOps Control Plane]
PROM[Prometheus]
AM[Alertmanager]
GRAF[Grafana]
LOKI[Loki]
ALLOY[Grafana Alloy]
ARGO --> PROM
ARGO --> AM
ARGO --> GRAF
ARGO --> LOKI
ARGO --> ALLOY
ALLOY --> LOKI
PROM --> AM
PROM --> GRAF
🔎 Detailed Architecture
The full, production-level architecture diagram (including GitOps control flow, CRD lifecycle, and metrics and logging pipelines) is available in docs/architecture-diagram.mmd.
- Prometheus (via kube-prometheus-stack)
  - Node, Kubernetes, and workload metrics
  - Custom PrometheusRule resources for alerting
- Alertmanager
  - Explicit routing logic
  - Slack receivers separated by concern:
    - platform alerts
    - workload alerts
  - Noise reduction via a dedicated "null" receiver
- Grafana
  - Preconfigured Loki datasource
  - Admin access enabled for evaluation and platform review
- Grafana Loki (SingleBinary mode)
  - S3-backed storage
  - Retention and compaction configured
- Grafana Alloy
  - Runs as a DaemonSet
  - Discovers pods automatically
  - Handles Docker and containerd logs
  - Normalizes labels before ingestion
- Argo CD
  - Acts as the reconciliation engine
  - Uses the App-of-Apps pattern
  - Fully automated sync with pruning and self-healing enabled
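As a minimal sketch of the App-of-Apps pattern with automated sync, a root Application could look roughly like the manifest below. The Application name, project, and repository URL are placeholders chosen for illustration; the actual definition lives in gitops/argo/root-application.yaml and may differ.

```yaml
# Hypothetical root Application; name, project, and repoURL are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: observability-root
  namespace: argocd                  # assumes Argo CD runs in the "argocd" namespace
spec:
  project: default                   # placeholder; the repository defines its own projects
  source:
    repoURL: https://example.com/your-org/your-repo.git   # placeholder URL
    targetRevision: HEAD
    path: gitops/argo/applications   # directory holding the child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made in the cluster
```

With this shape, adding a component means committing one more child Application manifest under the watched path; Argo CD picks it up on the next sync.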
In scope
- Platform observability services
- Alerting rules and routing
- Log ingestion and storage
- Dashboard access
Out of scope
- Cluster provisioning
- Cloud networking
- IAM, TLS, DNS
- Terraform state
The underlying AWS infrastructure and EKS cluster are provisioned via Terraform and maintained in a separate, authoritative IaC repository:
https://github.com/LaurisNeimanis/aws-eks-platform
That repository is responsible for:
- EKS cluster lifecycle and versioning
- VPC, subnetting, routing, and network security boundaries
- Platform-level IAM roles and access boundaries
- Core AWS foundation resources required to run and secure the Kubernetes platform
Some infrastructure components required by this observability stack are intentionally not provisioned here, including:
- S3 buckets and access policies for Loki object storage
- Kubernetes StorageClasses (e.g. gp3) used by stateful workloads
- Workload and backend-specific IAM permissions for logs and metrics
These components are treated as explicit external dependencies, not hidden assumptions, ensuring a clear separation between infrastructure provisioning and platform/application concerns.
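To illustrate how these external dependencies surface on the platform side, the fragment below sketches Loki Helm values that only reference them. It assumes the grafana/loki chart values layout and IAM Roles for Service Accounts on EKS; the bucket names, region, and role ARN are placeholders, and the fragment is not a complete values file.

```yaml
# Sketch of grafana/loki Helm values; bucket, region, and role ARN are placeholders.
deploymentMode: SingleBinary
serviceAccount:
  annotations:
    # IAM role created outside this repository (assumes IRSA on EKS)
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-loki-s3
loki:
  storage:
    type: s3
    bucketNames:
      chunks: example-loki-chunks   # bucket must already exist
      ruler: example-loki-ruler
    s3:
      region: eu-west-1
singleBinary:
  replicas: 1
  persistence:
    storageClass: gp3               # StorageClass must already exist in the cluster
```

Nothing in this fragment creates AWS resources; if the bucket, role, or StorageClass is missing, the failure is visible at sync or pod start time rather than hidden.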
├── docs/
│   ├── architecture-diagram.mmd
│   └── installation.md
│
└── gitops/
    ├── apps/
    │   └── platform/
    │       └── observability/
    │           ├── alloy/
    │           ├── kube-prometheus-stack/
    │           ├── loki/
    │           └── prometheus-rules/
    │
    ├── argo/
    │   ├── applications/
    │   ├── projects/
    │   └── root-application.yaml
    │
    ├── crds/
    │   └── prometheus-operator/
    │
    └── bootstrap/
        └── argocd/
- Watchdog alerts are intentionally muted
- Kubernetes control-plane alerts are suppressed
- Workload alerts are routed separately from platform alerts
- Alert noise is controlled via a dedicated "null" receiver
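The routing behavior described above could be expressed roughly as follows in the Alertmanager section of the kube-prometheus-stack values. This is a sketch, not the repository's actual configuration: the receiver names, Slack channels, namespace matchers, and webhook file paths are assumptions, and the paths presume the webhook Secret is mounted into the Alertmanager pod.

```yaml
# Hypothetical routing; receiver names, channels, matchers, and paths are assumptions.
alertmanager:
  config:
    route:
      receiver: "null"                # default: drop anything not matched below
      group_by: ["alertname", "namespace"]
      routes:
        - receiver: "null"            # mute the always-firing Watchdog alert
          matchers:
            - alertname = "Watchdog"
        - receiver: slack-platform
          matchers:
            - namespace =~ "kube-system|monitoring|argocd"
        - receiver: slack-workloads   # any other alert that carries a namespace label
          matchers:
            - namespace != ""
    receivers:
      - name: "null"                  # no notifier configured, so alerts are discarded
      - name: slack-platform
        slack_configs:
          - channel: "#platform-alerts"
            api_url_file: /etc/alertmanager/secrets/slack-webhooks/platform
      - name: slack-workloads
        slack_configs:
          - channel: "#workload-alerts"
            api_url_file: /etc/alertmanager/secrets/slack-webhooks/workloads
```

Routes are evaluated in order and the first match wins, so the Watchdog mute has to come before the broader namespace matchers.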
This repository assumes a pre-existing Kubernetes cluster and Argo CD control plane.
The full bootstrap flow is documented separately in docs/installation.md.
Grafana is enabled to allow interactive exploration of metrics and logs when evaluating or reviewing the observability platform.
Grafana access requires a pre-created Kubernetes Secret (grafana-admin), as described in the installation guide.
Authentication is intentionally externalized to keep the GitOps layer clean and avoid storing credentials in Git.
Grafana is configured to read admin credentials from an existing Kubernetes Secret:
grafana:
  admin:
    existingSecret: grafana-admin

Refer to the installation guide (docs/installation.md) for secret creation details.
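The installation guide remains authoritative. As a rough sketch, the Secret might have the shape below, assuming the Grafana chart's default key names (admin-user and admin-password) and a namespace called monitoring, which is a placeholder. It is meant to be created out-of-band, not committed to Git.

```yaml
# Example shape only; create out-of-band and never commit real values to Git.
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin
  namespace: monitoring        # assumed namespace
type: Opaque
stringData:
  admin-user: admin            # key names follow the Grafana chart defaults
  admin-password: CHANGE_ME    # placeholder
```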
In production environments, Grafana access should be hardened with:
- External secret management
- SSO (OIDC / SAML)
- Read-only dashboards for non-admin users
- Network-level access restrictions
This demo intentionally simplifies certain security aspects.
In production environments, the following must be applied:
- Secrets managed outside Git
- TLS everywhere
- IAM-based access controls
- Restricted RBAC policies
- Architecture
  - High-level overview: this README
  - Detailed diagram: docs/architecture-diagram.mmd
- Installation
  - Installation guide: docs/installation.md
This project provides a clean, extensible, and realistic example of a GitOps-managed observability platform for Kubernetes.
It is intentionally designed to be:
- understandable
- reviewable
- extensible
- production-ready with minimal adjustments