
Kubernetes deployment #41

Merged

edinstance merged 25 commits into dev from kubernetes-deployment-config on Nov 3, 2025

Conversation

edinstance (Owner) commented Nov 3, 2025

Summary by CodeRabbit

  • New Features

    • Added a dedicated /health endpoint for the gateway.
    • Enabled configurable CORS options for the authentication service.
  • Infrastructure & Deployment

    • Large Kubernetes deployment added (namespaces, ingress, services, deployments, jobs) enabling cluster-based rollout.
    • Ingress routing and a new api.aviation.local domain for services.
    • Added monitoring stack (Grafana, Prometheus, Jaeger, Loki, Mimir, OTEL), Kafka and Schema Registry, OpenSearch and dashboards, and search service.
    • Router/gateway and service health/readiness checks improved.

coderabbitai bot commented Nov 3, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Adds extensive Kubernetes manifests and supporting changes to deploy microservices, observability, and infra; renames router service, introduces gateway health endpoint and routing fallback, updates load-test endpoints and auth header handling, and adjusts CORS and OTEL configuration.

Changes

Cohort / File(s) | Summary

Docker Compose & top-level rename
`docker-compose.yml`
Renamed service graphql-router to router-service and updated the corresponding container_name.

Kubernetes namespaces & ingress
`k8s/00-namespace.yml`, `k8s/01-ingress-controller.yml`, `k8s/02-cross-namespace-services.yml`, `k8s/03-ingress.yml`, `k8s/README.md`
Added namespaces and a full ingress-nginx controller stack (RBAC, webhook, Services, Deployment, IngressClass) and ingress routing for api.aviation.local.

Flights stack
`k8s/flights/*`
Added flights DB StatefulSet, Redis cache, migrations Job, Service/Deployment for flights-service, secrets and README.

Aircraft stack
`k8s/aircraft/*`
Added aircraft DB StatefulSet, Redis cache, migrations Job, service/deployment, secrets and README.

Authentication stack
`k8s/authentication/*`
Added auth DB StatefulSet, keygen Job, migrations Job, secrets and authentication Service with init containers.

Search / OpenSearch stack
`k8s/search/*`
OpenSearch StatefulSet, Dashboards, init job for dashboards, and search-service Deployment/Service.

Messaging stack
`k8s/messaging/*`
Kafka StatefulSet (KRaft), Schema Registry Deployment and schema-loader Job.

Monitoring & observability
`k8s/monitoring/*`
Grafana, Jaeger, Loki, Mimir, OpenTelemetry collector and Prometheus manifests with init-container config provisioning.

Router & supergraph orchestration
`k8s/router/*`, `k8s/cross-namespace-services-aviation.yml`
Added router-service Deployment/Service, router-config and supergraph compose Jobs, PVCs and a ConfigMap for the supergraph.

Gateway service
`k8s/gateway/service.yml`, `services/gateway/src/*`
Added k8s gateway Deployment/Service; in the Rust gateway, added a /health handler, exported the handling module and routed remaining paths to the existing route handler as a fallback.

Configuration updates
`router.yml`, `supergraph.yml`, `services/authentication/authentication/settings.py`, `services/authentication/docker-compose.yml`
OTEL endpoints parameterised via env vars; supergraph subgraph URLs parameterised; added CORS settings and adjusted ALLOWED_HOSTS/CORS defaults.

Load tests & auth headers
`load-tests/src/config.ts`, `load-tests/src/helpers/auth_headers.ts`, `load-tests/src/tests/shells/*.ts`
Updated service URLs to api.aviation.local; introduced buildAuthHeaders() and refactored tests to use it (supporting JWT and direct headers).

Monitoring compose tweak
`monitoring/docker-compose.yml`
Removed the Loki --config.file command-line argument.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Ingress as nginx-ingress
    participant Router as router-service
    participant Gateway as gateway-service
    participant ServiceA as flights/aircraft/search...

    Note right of Client: External request to api.aviation.local
    Client->>Ingress: HTTP request (path /health or /service/...)
    alt /health on gateway
        Ingress->>Gateway: GET /health
        Gateway->>Gateway: health_handler -> return {"status":"UP"}
        Gateway-->>Ingress: 200 JSON
        Ingress-->>Client: 200 JSON
    else other paths
        Ingress->>Router: forward request (hot-reload supergraph)
        Router->>ServiceA: route to appropriate backend (flights/aircraft/search)
        ServiceA-->>Router: response
        Router-->>Ingress: response
        Ingress-->>Client: response
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Areas to focus review on:

  • k8s/01-ingress-controller.yml (RBAC, webhook certs, service accounts)
  • k8s/router/supergraph-job.yml and router config PVCs (paths, permissions)
  • k8s/monitoring/* init containers vs app config paths (Prometheus, Mimir, OTEL)
  • services/gateway routing changes and health endpoint wiring
  • load-tests/helpers/auth_headers.ts for header correctness and edge cases
  • services/authentication CORS and ALLOWED_HOSTS security implications

Possibly related PRs

Poem

🐇 I hopped through manifests, lines so neat,

PVCs, namespaces and configs complete.
Health pings and routers now learn to dance,
Headers and tests get their better chance.
A carrot for CI — hooray, take a seat! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

Title check: ❓ Inconclusive. The title 'Kubernetes deployment' is vague and overly generic, failing to capture the substantial scope of changes, which includes namespace setup, ingress configuration, cross-namespace service routing, monitoring stack deployment, and application service orchestration across multiple microservices. Resolution: consider a more descriptive title that reflects the primary focus, such as 'Add complete Kubernetes infrastructure setup for distributed aviation system' or 'Configure Kubernetes deployment with namespaces, ingress, services, and monitoring stack'.

✅ Passed checks (1 passed)

Description check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b60b1b and 64d5e85.

📒 Files selected for processing (53)
  • docker-compose.yml (1 hunks)
  • k8s/00-namespace.yml (1 hunks)
  • k8s/01-ingress-controller.yml (1 hunks)
  • k8s/02-cross-namespace-services.yml (1 hunks)
  • k8s/03-ingress.yml (1 hunks)
  • k8s/README.md (1 hunks)
  • k8s/aircraft/README.md (1 hunks)
  • k8s/aircraft/cache.yml (1 hunks)
  • k8s/aircraft/database.yml (1 hunks)
  • k8s/aircraft/migrations.yml (1 hunks)
  • k8s/aircraft/secrets.yml (1 hunks)
  • k8s/aircraft/service.yml (1 hunks)
  • k8s/authentication/database.yml (1 hunks)
  • k8s/authentication/keygen.yml (1 hunks)
  • k8s/authentication/migrations.yml (1 hunks)
  • k8s/authentication/secrets.yml (1 hunks)
  • k8s/authentication/service.yml (1 hunks)
  • k8s/cross-namespace-services-aviation.yml (1 hunks)
  • k8s/flights/README.md (1 hunks)
  • k8s/flights/cache.yml (1 hunks)
  • k8s/flights/database.yml (1 hunks)
  • k8s/flights/migrations.yml (1 hunks)
  • k8s/flights/secrets.yml (1 hunks)
  • k8s/flights/service.yml (1 hunks)
  • k8s/gateway/service.yml (1 hunks)
  • k8s/messaging/kafka.yml (1 hunks)
  • k8s/messaging/schema-loader.yml (1 hunks)
  • k8s/messaging/schema-registry.yml (1 hunks)
  • k8s/monitoring/grafana.yml (1 hunks)
  • k8s/monitoring/jaeger.yml (1 hunks)
  • k8s/monitoring/loki.yml (1 hunks)
  • k8s/monitoring/mimir.yml (1 hunks)
  • k8s/monitoring/otel.yml (1 hunks)
  • k8s/monitoring/promethus.yml (1 hunks)
  • k8s/router/router-config-job.yml (1 hunks)
  • k8s/router/service.yml (1 hunks)
  • k8s/router/supergraph-job.yml (1 hunks)
  • k8s/search/dashboards.yml (1 hunks)
  • k8s/search/init-job.yml (1 hunks)
  • k8s/search/opensearch.yml (1 hunks)
  • k8s/search/service.yml (1 hunks)
  • load-tests/src/config.ts (1 hunks)
  • load-tests/src/helpers/auth_headers.ts (1 hunks)
  • load-tests/src/tests/shells/aircraft.ts (3 hunks)
  • load-tests/src/tests/shells/flights.ts (4 hunks)
  • monitoring/docker-compose.yml (0 hunks)
  • router.yml (1 hunks)
  • services/authentication/authentication/settings.py (1 hunks)
  • services/authentication/docker-compose.yml (1 hunks)
  • services/gateway/src/handling/health_handler.rs (1 hunks)
  • services/gateway/src/handling/mod.rs (1 hunks)
  • services/gateway/src/main.rs (2 hunks)
  • supergraph.yml (1 hunks)

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Actionable comments posted: 58

🧹 Nitpick comments (31)
k8s/messaging/schema-loader.yml (1)

34-42: Consider using a pre-built image with dependencies included.

Installing curl and jq at runtime on every Job execution adds unnecessary latency and external dependencies (apk repositories). Consider building a custom image with these tools pre-installed for faster, more reliable execution.

Example Dockerfile:

FROM alpine:3.20@sha256:<digest>
RUN apk add --no-cache curl jq
COPY script.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/script.sh
k8s/messaging/kafka.yml (2)

39-39: Pin the Kafka image by digest for reproducible deployments.

Using apache/kafka:4.0.0 without a digest allows the image content to change if the tag is re-pushed, potentially introducing unexpected behaviour or security vulnerabilities.

-          image: apache/kafka:4.0.0
+          image: apache/kafka:4.0.0@sha256:<digest>

Run docker pull apache/kafka:4.0.0 and docker inspect to obtain the current digest, or use kubectl to retrieve it from your registry.


52-53: Document the hardcoded Kafka CLUSTER_ID.

The CLUSTER_ID 7sWhx1QdSqGGl3a57SSZXA appears to be hardcoded. In KRaft mode, this ID must be consistent across all brokers but should be documented or managed through configuration to prevent accidental changes during cluster recreation.

Consider adding a comment explaining the ID's origin and whether it should remain fixed:

            # Cluster ID - must remain consistent for this Kafka cluster
            # Generated with: kafka-storage.sh random-uuid
            - name: CLUSTER_ID
              value: "7sWhx1QdSqGGl3a57SSZXA"
k8s/aircraft/database.yml (2)

44-48: Consider adding a liveness probe for PostgreSQL.

Whilst the readiness probe correctly checks database availability, adding a liveness probe would enable Kubernetes to restart the container if PostgreSQL becomes unresponsive (e.g., deadlocked).

           readinessProbe:
             exec:
               command: [ "pg_isready","-U","postgres","-d","aircraft" ]
             initialDelaySeconds: 10
             periodSeconds: 5
+          livenessProbe:
+            exec:
+              command: [ "pg_isready","-U","postgres","-d","aircraft" ]
+            initialDelaySeconds: 30
+            periodSeconds: 10
+            timeoutSeconds: 5
+            failureThreshold: 3

31-31: Pin the PostgreSQL image by digest.

Using postgres:17.6 without a digest allows image content changes if the tag is updated.

-          image: postgres:17.6
+          image: postgres:17.6@sha256:<digest>
k8s/flights/README.md (1)

8-8: Consider omitting or referencing secrets for credentials.

Whilst this is documentation, explicitly stating credentials as "postgres/postgres" could encourage weak password usage. Consider referencing that credentials are managed via Kubernetes Secrets instead of documenting the actual values.

-**Credentials**: postgres/postgres
+**Credentials**: Managed via Kubernetes Secrets (see `secrets.yml`)
load-tests/src/tests/shells/flights.ts (1)

27-28: Consider more explicit header strategy configuration.

Whilst the URL-based detection works, using !url.includes("/gateway") is fragile and couples the authentication strategy to URL patterns. If URL structures change or additional routing patterns are introduced, this logic could break silently.

Consider passing useDirectHeaders as an explicit parameter to runFlightScenario or deriving it from a configuration object rather than inferring it from the URL string:

export function runFlightScenario(
  url: string,
  aircraftId: string,
  accessToken?: string,
  authContext?: AuthContext,
  useDirectHeaders: boolean = true, // Explicit parameter
) {
  // ...
}
load-tests/src/tests/shells/aircraft.ts (1)

21-22: Consider more explicit header strategy configuration.

This has the same URL-based detection fragility as in flights.ts. Using !url.includes("/gateway") couples the authentication strategy to URL patterns, which could break if URL structures evolve.

Consider passing useDirectHeaders as an explicit parameter:

export function runAircraftScenario(
  url: string,
  accessToken?: string,
  authContext?: AuthContext,
  useDirectHeaders: boolean = true, // Explicit parameter
) {
  // ...
}
k8s/monitoring/jaeger.yml (1)

24-55: Add security context to harden container posture.

Checkov has flagged that this deployment allows privilege escalation and permits root container admission. For production deployments, add a security context to enforce non-root execution and prevent privilege escalation.

Apply this diff to add security hardening:

      spec:
+       securityContext:
+         runAsNonRoot: true
+         runAsUser: 1000
+         fsGroup: 1000
        containers:
          - name: jaeger
            image: jaegertracing/all-in-one:1.65.0
+           securityContext:
+             allowPrivilegeEscalation: false
+             readOnlyRootFilesystem: false
+             capabilities:
+               drop:
+                 - ALL

Note: Verify that the Jaeger image supports running as a non-root user, as some container images may require root privileges.

k8s/aircraft/README.md (1)

1-29: Update documentation to highlight security considerations.

The README documents placeholder credentials (postgres/postgres) and does not mention the use of Kubernetes Secrets for sensitive data. Add a note clarifying that credentials should be managed via Secrets in production.

Consider adding a Security section:

#### Security

- **Database Credentials**: Use Kubernetes Secrets to manage database passwords in production
- **Image Registry**: Ensure container images are sourced from trusted registries
- **Pod Security**: Apply Pod Security Standards appropriate for your environment
k8s/aircraft/service.yml (1)

20-98: Add security context to enforce non-root execution and prevent privilege escalation.

Similar to other deployments in this PR, this deployment lacks a security context. For production workloads, enforce non-root execution.

Apply this diff to add security hardening:

    spec:
+     securityContext:
+       runAsNonRoot: true
+       runAsUser: 1000
+       fsGroup: 1000
      initContainers:
k8s/search/dashboards.yml (2)

16-67: Add security context to enforce non-root execution and prevent privilege escalation.

This deployment lacks a security context. Enforce non-root execution for the dashboards container.

    spec:
+     securityContext:
+       runAsNonRoot: true
+       runAsUser: 1000
+       fsGroup: 1000
      containers:

60-67: Missing newline at end of file.

YAML files should end with a newline character. Add one after the final --- marker.

k8s/aircraft/cache.yml (1)

12-43: Add security context to enforce non-root execution and prevent privilege escalation.

This deployment lacks a security context. Add proper security settings for the Redis container.

    spec:
+     securityContext:
+       fsGroup: 999
      containers:
        - name: redis
          image: redis:8.2
+         securityContext:
+           allowPrivilegeEscalation: false
+           runAsNonRoot: true
+           runAsUser: 999
+           capabilities:
+             drop:
+               - ALL
          ports:
k8s/monitoring/promethus.yml (2)

15-65: Add security context to enforce non-root execution and prevent privilege escalation.

This deployment lacks security hardening. Add a security context to prevent root execution.

    spec:
+     securityContext:
+       runAsNonRoot: true
+       runAsUser: 65534
+       fsGroup: 65534
      initContainers:
        - name: fetch-config
          image: alpine/git:2.49.1
+         securityContext:
+           allowPrivilegeEscalation: false
+           capabilities:
+             drop:
+               - ALL
          command:

36-42: Add error handling to the shell script in initContainer.

The shell script lacks error handling. If any step fails (git clone, mkdir, cp), the container will still exit with success, potentially causing Prometheus to start without a valid configuration.

Apply this diff to add error handling:

          command:
            - sh
            - -c
            - |
+             set -e
              echo "Cloning configuration..."
              git clone --branch main https://github.com/edinstance/distributed-aviation-system.git /repo
              echo "Copying prometheus config files..."
              mkdir -p /config
              cp /repo/monitoring/prometheus/prometheus.yml /config/config.yaml
+             echo "Verifying config file..."
+             test -f /config/config.yaml || exit 1
              echo "Config ready."
k8s/03-ingress.yml (1)

10-103: Consider consolidating duplicate path routing to reduce configuration verbosity.

Each service has both a regex pattern (/service/(.*)) and a prefix pattern (/service). The prefix pattern alone would suffice, as nginx Ingress will route /service/anything via the Prefix type. The regex patterns add configuration overhead without clear benefit.

Consider removing the regex patterns and keeping only Prefix routing:

        paths:
-          - path: /router/(.*)
-            pathType: ImplementationSpecific
-            backend:
-              service:
-                name: router-service
-                port:
-                  number: 4000
          - path: /router
            pathType: Prefix
            backend:
              service:
                name: router-service
                port:
                  number: 4000

Repeat this simplification for all six services (flights, aircraft, auth, gateway, search).

k8s/messaging/schema-registry.yml (1)

14-58: Add health checks and security context to the deployment.

The deployment lacks readiness and liveness probes, which are important for orchestration and reliability. Additionally, no security context is defined.

Apply this diff to add health checks and security hardening:

        - name: schema-registry
          image: confluentinc/cp-schema-registry:8.1.0
+         securityContext:
+           allowPrivilegeEscalation: false
+           runAsNonRoot: true
+           runAsUser: 1000
+           capabilities:
+             drop:
+               - ALL
          ports:
            - containerPort: 8081
          env:
            - name: SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS
              value: "kafka.messaging.svc.cluster.local:9092"
            - name: SCHEMA_REGISTRY_LISTENERS
              value: "http://0.0.0.0:8081"
            - name: SCHEMA_REGISTRY_HOST_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: SCHEMA_REGISTRY_LOG4J_ROOT_LOGLEVEL
              value: "INFO"
            - name: SCHEMA_REGISTRY_KAFKASTORE_TOPIC_REPLICATION_FACTOR
              value: "1"
            - name: SCHEMA_REGISTRY_KAFKASTORE_SECURITY_PROTOCOL
              value: "PLAINTEXT"
            - name: SCHEMA_REGISTRY_SCHEMA_COMPATIBILITY_LEVEL
              value: "BACKWARD"
+         readinessProbe:
+           httpGet:
+             path: /subjects
+             port: 8081
+           initialDelaySeconds: 30
+           periodSeconds: 10
+         livenessProbe:
+           httpGet:
+             path: /subjects
+             port: 8081
+           initialDelaySeconds: 60
+           periodSeconds: 30
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
k8s/search/service.yml (1)

66-71: Consider adding a liveness probe.

Whilst a readiness probe is defined, there's no liveness probe. Adding one enables Kubernetes to automatically restart pods that become unresponsive.

Add a liveness probe after the readiness probe:

           readinessProbe:
             httpGet:
               path: /details/health
               port: 8082
             initialDelaySeconds: 10
             periodSeconds: 10
+          livenessProbe:
+            httpGet:
+              path: /details/health
+              port: 8082
+            initialDelaySeconds: 30
+            periodSeconds: 30
+            timeoutSeconds: 5
           resources:
k8s/monitoring/mimir.yml (1)

63-65: Consider using ConfigMap for configuration.

Whilst the current approach of fetching configuration via an initContainer works, using a Kubernetes ConfigMap would be more idiomatic, eliminate the Git dependency at runtime, and allow faster pod startup. This also provides better configuration versioning through Kubernetes-native mechanisms.
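
As a rough illustration, the same configuration could be shipped as a ConfigMap and mounted into the pod, removing the runtime Git clone entirely; the ConfigMap name, data key and mount path below are placeholders, and the real Mimir configuration would need to be pasted into the data block.

# Sketch only: ConfigMap-backed Mimir configuration instead of an init-container git clone.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mimir-config
  namespace: monitoring
data:
  config.yaml: |
    # paste the existing Mimir configuration here
---
# In the Mimir Deployment's pod spec (fragment):
#   volumes:
#     - name: config
#       configMap:
#         name: mimir-config
#   containers:
#     - name: mimir
#       volumeMounts:
#         - name: config
#           mountPath: /etc/mimir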

k8s/search/opensearch.yml (1)

47-50: Cluster configuration is functional but consider HA improvements.

The current configuration uses a single initial cluster manager node (opensearch-0). For improved resilience in production, consider configuring both nodes as eligible cluster managers.

             - name: cluster.initial_cluster_manager_nodes
-              value: "opensearch-0"
+              value: "opensearch-0,opensearch-1"
k8s/router/service.yml (1)

57-62: Consider adding a liveness probe.

Whilst a readiness probe is configured, adding a liveness probe would enable Kubernetes to automatically restart the router if it becomes unresponsive.

           readinessProbe:
             httpGet:
               path: /health
               port: 8088
             initialDelaySeconds: 10
             periodSeconds: 10
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 8088
+            initialDelaySeconds: 30
+            periodSeconds: 30
+            timeoutSeconds: 5
           resources:
k8s/gateway/service.yml (1)

52-57: Consider adding a liveness probe.

Adding a liveness probe alongside the readiness probe enables Kubernetes to automatically restart the gateway if it becomes unresponsive.

           readinessProbe:
             httpGet:
               path: /health
               port: 4001
             initialDelaySeconds: 10
             periodSeconds: 10
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 4001
+            initialDelaySeconds: 30
+            periodSeconds: 30
+            timeoutSeconds: 5
           resources:
k8s/authentication/database.yml (1)

44-48: Consider adding a liveness probe.

Whilst a readiness probe is configured, adding a liveness probe would enable Kubernetes to automatically restart the database if it becomes unresponsive.

           readinessProbe:
             exec:
               command: [ "pg_isready","-U","postgres","-d","authentication" ]
             initialDelaySeconds: 10
             periodSeconds: 5
+          livenessProbe:
+            exec:
+              command: [ "pg_isready","-U","postgres","-d","authentication" ]
+            initialDelaySeconds: 30
+            periodSeconds: 30
+            timeoutSeconds: 5
           volumeMounts:
k8s/flights/database.yml (1)

44-48: Add livenessProbe to detect and restart hung PostgreSQL instances.

The readinessProbe ensures the pod is ready to serve traffic, but a livenessProbe is needed to restart the pod if PostgreSQL hangs or becomes unresponsive.

Apply this diff to add a livenessProbe:

          readinessProbe:
            exec:
              command: [ "pg_isready","-U","postgres","-d","flights" ]
            initialDelaySeconds: 10
            periodSeconds: 5
+         livenessProbe:
+           exec:
+             command: [ "pg_isready","-U","postgres","-d","flights" ]
+           initialDelaySeconds: 30
+           periodSeconds: 10
+           failureThreshold: 3
          volumeMounts:
k8s/aircraft/migrations.yml (1)

36-51: Consider extracting migration Job into a shared Helm chart or kustomize template.

The aircraft migrations Job closely mirrors k8s/authentication/migrations.yml. Both use identical patterns (wait-for-db, fetch-repo, liquibase update) with only namespace and changelog path varying. This violates the DRY principle and increases maintenance burden.

Consolidate the migration pattern into a reusable Kustomize component or Helm chart with parameterized namespace, database hostname, and changelog path. This reduces duplication across flights, aircraft, authentication, and any future services requiring database migrations.
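
A hedged sketch of the Kustomize direction: keep one shared migrations Job as a base and let each service overlay set its namespace and changelog path. The layout, Job name and env index below are illustrative only and do not correspond to files in this PR.

# Hypothetical per-service overlay, e.g. a k8s/aircraft/kustomization.yaml.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: aircraft
resources:
  - ../base/migrations        # shared wait-for-db + liquibase Job template
patches:
  - target:
      kind: Job
      name: migrations
    patch: |-
      - op: replace
        path: /spec/template/spec/containers/0/env/0/value
        value: aircraft/db/changelog/changelog.yml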

k8s/monitoring/otel.yml (1)

34-34: Single-replica OTEL collector introduces single point of failure for observability.

Monitoring infrastructure should be resilient. A single replica means any pod eviction or restart will interrupt telemetry collection.

Increase replicas to at least 2 and consider adding pod anti-affinity to spread across nodes:

-  replicas: 1
+  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
+     affinity:
+       podAntiAffinity:
+         preferredDuringSchedulingIgnoredDuringExecution:
+           - weight: 100
+             podAffinityTerm:
+               labelSelector:
+                 matchExpressions:
+                   - key: app
+                     operator: In
+                     values:
+                       - otel-collector
+               topologyKey: kubernetes.io/hostname
k8s/flights/service.yml (1)

52-60: Pin busybox image tags and consider more specific/smaller init containers.

The init containers use unversioned busybox images (lines 32, 35), which will pull the latest tag by default. This introduces non-determinism and potential compatibility issues.

Pin busybox to a specific version:

        - name: wait-for-db
-         image: busybox
+         image: busybox:1.36
          command: [ "sh", "-c", "until nc -z flights-database.flights.svc.cluster.local 5432; do sleep 2; done;" ]
        - name: wait-for-cache
-         image: busybox
+         image: busybox:1.36
          command: [ "sh", "-c", "until nc -z flights-cache.flights.svc.cluster.local 6379; do sleep 2; done;" ]
k8s/monitoring/grafana.yml (1)

36-42: Consider pinning the Git commit for reproducible deployments.

The init container clones from the main branch without verifying the commit hash, which could introduce unexpected configuration changes or malicious content between deployments.

Consider pinning to a specific commit:

            - |
              echo "Cloning configuration..."
-              git clone --branch main https://github.com/edinstance/distributed-aviation-system.git /repo
+              git clone https://github.com/edinstance/distributed-aviation-system.git /repo
+              cd /repo
+              git checkout <specific-commit-hash>
+              cd /
              echo "Copying grafana config files..."

Alternatively, consider using a ConfigMap to store provisioning configuration directly in the cluster for better auditability and version control.

k8s/01-ingress-controller.yml (1)

510-618: Consider setting a non-zero TTL for admission Jobs.

Both admission Jobs have ttlSecondsAfterFinished: 0, causing immediate deletion after completion. This makes debugging certificate generation failures difficult, as logs and Job status are lost.

Apply this diff to retain Jobs for 24 hours:

      serviceAccountName: ingress-nginx-admission
-  ttlSecondsAfterFinished: 0
+  ttlSecondsAfterFinished: 86400
---

(Apply to both Job definitions at lines 562 and 618.)

k8s/authentication/service.yml (1)

48-58: Replace sleep-based wait with proper Job status check.

The wait-for-migrations init container uses a fixed 10-second sleep without verifying that migrations actually completed. If migrations take longer or fail, the service may start with an incomplete schema.

Consider using a Kubernetes Job status check via the API or implement a readiness endpoint on the migrations Job that the init container can poll:

        - name: wait-for-migrations
          image: bitnami/kubectl:latest
          securityContext:
            # ... add securityContext as noted above
          command:
            - sh
            - -c
            - |
              echo "Waiting for migrations Job to complete..."
              kubectl wait --for=condition=complete --timeout=300s job/authentication-migrations -n authentication
              echo "Migrations complete!"

This requires granting the ServiceAccount permissions to read Job status in the authentication namespace.
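
For reference, a minimal RBAC sketch that would let the pod's ServiceAccount run the kubectl wait above; the ServiceAccount name is an assumption and should match whatever the authentication Deployment actually uses.

# Hypothetical Role/RoleBinding granting read access to Job status in the authentication namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-reader
  namespace: authentication
rules:
  - apiGroups: [ "batch" ]
    resources: [ "jobs" ]
    verbs: [ "get", "list", "watch" ]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: job-reader-binding
  namespace: authentication
subjects:
  - kind: ServiceAccount
    name: authentication-service   # assumed ServiceAccount name, adjust to the actual one
    namespace: authentication
roleRef:
  kind: Role
  name: job-reader
  apiGroup: rbac.authorization.k8s.io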

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8b60b1b and 9aca94f.

📒 Files selected for processing (53)
  • docker-compose.yml (1 hunks)
  • k8s/00-namespace.yml (1 hunks)
  • k8s/01-ingress-controller.yml (1 hunks)
  • k8s/02-cross-namespace-services.yml (1 hunks)
  • k8s/03-ingress.yml (1 hunks)
  • k8s/README.md (1 hunks)
  • k8s/aircraft/README.md (1 hunks)
  • k8s/aircraft/cache.yml (1 hunks)
  • k8s/aircraft/database.yml (1 hunks)
  • k8s/aircraft/migrations.yml (1 hunks)
  • k8s/aircraft/secrets.yml (1 hunks)
  • k8s/aircraft/service.yml (1 hunks)
  • k8s/authentication/database.yml (1 hunks)
  • k8s/authentication/keygen.yml (1 hunks)
  • k8s/authentication/migrations.yml (1 hunks)
  • k8s/authentication/secrets.yml (1 hunks)
  • k8s/authentication/service.yml (1 hunks)
  • k8s/cross-namespace-services-aviation.yml (1 hunks)
  • k8s/flights/README.md (1 hunks)
  • k8s/flights/cache.yml (1 hunks)
  • k8s/flights/database.yml (1 hunks)
  • k8s/flights/migrations.yml (1 hunks)
  • k8s/flights/secrets.yml (1 hunks)
  • k8s/flights/service.yml (1 hunks)
  • k8s/gateway/service.yml (1 hunks)
  • k8s/messaging/kafka.yml (1 hunks)
  • k8s/messaging/schema-loader.yml (1 hunks)
  • k8s/messaging/schema-registry.yml (1 hunks)
  • k8s/monitoring/grafana.yml (1 hunks)
  • k8s/monitoring/jaeger.yml (1 hunks)
  • k8s/monitoring/loki.yml (1 hunks)
  • k8s/monitoring/mimir.yml (1 hunks)
  • k8s/monitoring/otel.yml (1 hunks)
  • k8s/monitoring/promethus.yml (1 hunks)
  • k8s/router/router-config-job.yml (1 hunks)
  • k8s/router/service.yml (1 hunks)
  • k8s/router/supergraph-job.yml (1 hunks)
  • k8s/search/dashboards.yml (1 hunks)
  • k8s/search/init-job.yml (1 hunks)
  • k8s/search/opensearch.yml (1 hunks)
  • k8s/search/service.yml (1 hunks)
  • load-tests/src/config.ts (1 hunks)
  • load-tests/src/helpers/auth_headers.ts (1 hunks)
  • load-tests/src/tests/shells/aircraft.ts (3 hunks)
  • load-tests/src/tests/shells/flights.ts (4 hunks)
  • monitoring/docker-compose.yml (0 hunks)
  • router.yml (1 hunks)
  • services/authentication/authentication/settings.py (2 hunks)
  • services/authentication/docker-compose.yml (1 hunks)
  • services/gateway/src/handling/health_handler.rs (1 hunks)
  • services/gateway/src/handling/mod.rs (1 hunks)
  • services/gateway/src/main.rs (2 hunks)
  • supergraph.yml (1 hunks)
💤 Files with no reviewable changes (1)
  • monitoring/docker-compose.yml
🧰 Additional context used
🧬 Code graph analysis (5)
services/gateway/src/handling/mod.rs (1)
services/gateway/src/handling/health_handler.rs (1)
  • health_handler (4-6)
load-tests/src/tests/shells/flights.ts (2)
load-tests/src/helpers/graphql_request.ts (1)
  • graphql (6-51)
load-tests/src/helpers/auth_headers.ts (1)
  • buildAuthHeaders (9-29)
load-tests/src/helpers/auth_headers.ts (1)
load-tests/src/types/auth_context.ts (1)
  • AuthContext (1-13)
load-tests/src/tests/shells/aircraft.ts (3)
load-tests/src/types/auth_context.ts (1)
  • AuthContext (1-13)
load-tests/src/helpers/graphql_request.ts (1)
  • graphql (6-51)
load-tests/src/helpers/auth_headers.ts (1)
  • buildAuthHeaders (9-29)
services/gateway/src/main.rs (2)
services/gateway/src/handling/route_handler.rs (1)
  • route_handler (22-150)
services/gateway/src/handling/health_handler.rs (1)
  • health_handler (4-6)
🪛 Checkov (3.2.334)
k8s/messaging/schema-registry.yml

[medium] 14-58: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 14-58: Minimize the admission of root containers

(CKV_K8S_23)

k8s/monitoring/jaeger.yml

[medium] 24-55: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 24-55: Minimize the admission of root containers

(CKV_K8S_23)

k8s/search/dashboards.yml

[medium] 16-67: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 16-67: Minimize the admission of root containers

(CKV_K8S_23)

k8s/aircraft/database.yml

[medium] 13-66: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 13-66: Minimize the admission of root containers

(CKV_K8S_23)

k8s/router/service.yml

[medium] 24-83: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 24-83: Minimize the admission of root containers

(CKV_K8S_23)

k8s/messaging/kafka.yml

[medium] 21-118: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 21-118: Minimize the admission of root containers

(CKV_K8S_23)

k8s/authentication/migrations.yml

[medium] 1-77: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-77: Minimize the admission of root containers

(CKV_K8S_23)

k8s/search/service.yml

[medium] 15-78: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-78: Minimize the admission of root containers

(CKV_K8S_23)

k8s/flights/cache.yml

[medium] 12-43: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 12-43: Minimize the admission of root containers

(CKV_K8S_23)

k8s/authentication/keygen.yml

[medium] 1-51: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-51: Minimize the admission of root containers

(CKV_K8S_23)

k8s/messaging/schema-loader.yml

[medium] 1-54: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-54: Minimize the admission of root containers

(CKV_K8S_23)

k8s/gateway/service.yml

[medium] 15-64: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-64: Minimize the admission of root containers

(CKV_K8S_23)

k8s/router/supergraph-job.yml

[medium] 23-72: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 23-72: Minimize the admission of root containers

(CKV_K8S_23)

k8s/01-ingress-controller.yml

[high] 218-236: Minimize ClusterRoles that grant control over validating or mutating admission webhook configurations

(CKV_K8S_155)


[high] 257-276: No ServiceAccount/Node should be able to read all secrets

(CKV2_K8S_5)

k8s/search/opensearch.yml

[medium] 18-98: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 18-98: Minimize the admission of root containers

(CKV_K8S_23)

k8s/authentication/service.yml

[medium] 15-144: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-144: Minimize the admission of root containers

(CKV_K8S_23)

k8s/aircraft/migrations.yml

[medium] 1-71: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-71: Minimize the admission of root containers

(CKV_K8S_23)

k8s/aircraft/service.yml

[medium] 20-98: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 20-98: Minimize the admission of root containers

(CKV_K8S_23)

k8s/router/router-config-job.yml

[medium] 1-42: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-42: Minimize the admission of root containers

(CKV_K8S_23)

k8s/search/init-job.yml

[medium] 1-63: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-63: Minimize the admission of root containers

(CKV_K8S_23)

k8s/flights/migrations.yml

[medium] 1-51: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 1-51: Minimize the admission of root containers

(CKV_K8S_23)

k8s/authentication/database.yml

[medium] 13-66: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 13-66: Minimize the admission of root containers

(CKV_K8S_23)

k8s/monitoring/otel.yml

[medium] 28-81: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 28-81: Minimize the admission of root containers

(CKV_K8S_23)

k8s/monitoring/loki.yml

[medium] 15-41: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-41: Minimize the admission of root containers

(CKV_K8S_23)

k8s/monitoring/promethus.yml

[medium] 15-65: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-65: Minimize the admission of root containers

(CKV_K8S_23)

k8s/aircraft/cache.yml

[medium] 12-43: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 12-43: Minimize the admission of root containers

(CKV_K8S_23)

k8s/monitoring/mimir.yml

[medium] 15-65: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-65: Minimize the admission of root containers

(CKV_K8S_23)

k8s/monitoring/grafana.yml

[medium] 15-73: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-73: Minimize the admission of root containers

(CKV_K8S_23)

k8s/flights/service.yml

[medium] 15-73: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 15-73: Minimize the admission of root containers

(CKV_K8S_23)


[medium] 44-45: Basic Auth Credentials

(CKV_SECRET_4)

k8s/flights/database.yml

[medium] 13-66: Containers should not run with allowPrivilegeEscalation

(CKV_K8S_20)


[medium] 13-66: Minimize the admission of root containers

(CKV_K8S_23)

🪛 Gitleaks (8.28.0)
k8s/authentication/secrets.yml

[high] 2-8: Possible Kubernetes Secret detected, posing a risk of leaking credentials/tokens from your deployments

(kubernetes-secret-yaml)


[high] 11-17: Possible Kubernetes Secret detected, posing a risk of leaking credentials/tokens from your deployments

(kubernetes-secret-yaml)

k8s/flights/secrets.yml

[high] 2-8: Possible Kubernetes Secret detected, posing a risk of leaking credentials/tokens from your deployments

(kubernetes-secret-yaml)

k8s/aircraft/secrets.yml

[high] 2-8: Possible Kubernetes Secret detected, posing a risk of leaking credentials/tokens from your deployments

(kubernetes-secret-yaml)

🔇 Additional comments (26)
services/gateway/src/handling/mod.rs (1)

1-5: LGTM!

The module structure is clean and correctly exports both handlers.

services/authentication/docker-compose.yml (1)

10-19: LGTM! Improved CORS security configuration.

The changes properly restrict CORS to specific origins whilst allowing credentials, which is a security improvement over the previous CORS_ALLOW_ALL_ORIGINS: True configuration. The addition of host.docker.internal supports Docker Desktop local development.

services/gateway/src/handling/health_handler.rs (1)

1-6: LGTM!

The health check endpoint implementation is clean and follows best practices for Kubernetes readiness/liveness probes.

services/gateway/src/main.rs (1)

8-8: LGTM! Clean separation of health and application routing.

The introduction of a dedicated /health endpoint alongside a fallback handler for all other routes is a good pattern for Kubernetes deployments. The Arc clones in the closure are efficient and appropriate.

Also applies to: 46-55

k8s/authentication/keygen.yml (1)

7-7: Verify: Zero backoff limit prevents retries on transient failures.

Setting backoffLimit: 0 means the Job will not retry if it fails due to transient issues (network errors, temporary Git unavailability, etc.). For critical key generation, this might leave the system in a broken state.

Is this intentional? If key generation must succeed exactly once (idempotency concerns), consider adding retry logic with idempotency checks in the script itself rather than disabling Kubernetes retries entirely.
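
One hedged way to reconcile retries with exactly-once semantics is to allow a few retries but have the script bail out early when its output already exists. The sketch below assumes the generated keys end up in a Secret; the Secret name, image tag and generation step are placeholders rather than the actual contents of keygen.yml, and the guard requires RBAC to read Secrets in the namespace.

# Sketch only: backoffLimit permits transient-failure retries; the guard makes re-runs safe.
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: keygen
          image: bitnami/kubectl:1.31
          command:
            - sh
            - -c
            - |
              if kubectl get secret authentication-keys -n authentication >/dev/null 2>&1; then
                echo "Keys already exist, skipping generation."
                exit 0
              fi
              echo "Generating keys..."
              # ... existing key generation and Secret creation steps go here ...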

k8s/messaging/kafka.yml (1)

109-118: Verify storage class availability and durability requirements.

The StatefulSet uses storageClassName: local-path, which typically provides node-local storage without replication. If a node fails, data on that node is lost. For a 3-replica Kafka cluster with min.insync.replicas=2, this may be acceptable, but ensure this aligns with your durability requirements.

Confirm that:

  1. The local-path storage class exists in your cluster
  2. Node-local storage is acceptable for your use case
  3. Kafka's replication factor (3) provides sufficient durability despite non-replicated storage
services/authentication/authentication/settings.py (2)

212-217: Document the custom CORS headers and verify necessity.

The CORS_ALLOWED_HEADERS includes x-tenant-id, which appears to be custom. Ensure this header is actually required by your frontend clients and that its inclusion doesn't inadvertently expose tenant information in insecure contexts.

Confirm:

  1. The x-tenant-id header is required for your multi-tenant architecture
  2. Clients properly set this header
  3. The backend validates tenant isolation based on this header

212-225: CORS credentials with environment-controlled origins requires careful production configuration.

Enabling CORS_ALLOW_CREDENTIALS=True requires strict origin controls. Ensure the CORS_ALLOWED_ORIGINS environment variable is properly configured in production to prevent credential leakage to unauthorised origins; a hedged example follows the checklist below.

Verify that:

  1. CORS_ALLOW_ALL_ORIGINS is set to False in production (via environment)
  2. CORS_ALLOWED_ORIGINS contains only trusted frontend origins
  3. The combination of credentials + origins is tested to prevent security gaps
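
A minimal sketch of what points 1 and 2 above might look like as environment configuration on the authentication Deployment; the variable names come from the settings discussed in this PR, but the origin value is a placeholder.

# Fragment of a container spec; values are illustrative, not production settings.
env:
  - name: CORS_ALLOW_ALL_ORIGINS
    value: "False"
  - name: CORS_ALLOW_CREDENTIALS
    value: "True"
  - name: CORS_ALLOWED_ORIGINS
    value: "https://app.example.com"   # replace with the trusted frontend origins only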
k8s/cross-namespace-services-aviation.yml (1)

1-32: Note: Port field in ExternalName services is informational only.

For type: ExternalName services, the ports field does not affect traffic routing—clients must connect directly to the port specified in the externalName DNS entry. The port declarations here serve only as documentation.

This is acceptable for documentation purposes, but be aware that changing these port values won't affect actual connectivity. Consider adding comments to clarify this:

spec:
  type: ExternalName
  externalName: flights-service.flights.svc.cluster.local
  ports:
  - port: 8081  # Documentation only - ExternalName services do not route traffic
k8s/00-namespace.yml (1)

1-43: LGTM! Clean namespace structure.

The namespace declarations are well-organised and follow Kubernetes conventions. The ingress-nginx namespace correctly includes the standard labels for ingress controller installations.

supergraph.yml (1)

4-12: Excellent use of environment variables for service URLs.

The migration to environment-variable-backed routing URLs with sensible fallbacks improves deployment flexibility across different environments whilst maintaining backwards compatibility.

k8s/02-cross-namespace-services.yml (1)

1-65: ExternalName services configured correctly.

The ExternalName service definitions properly enable cross-namespace routing by mapping service names in the ingress-nginx namespace to their actual locations in service-specific namespaces. The fully qualified domain names follow the correct pattern: <service>.<namespace>.svc.cluster.local.

Note: ExternalName services use the port field primarily for documentation; the actual port resolution happens at the target service.

k8s/flights/README.md (1)

1-29: Clear and comprehensive documentation.

The README effectively documents the service stack components, their purposes, and interdependencies. This will be valuable for operators and developers working with the deployment.

load-tests/src/tests/shells/flights.ts (1)

13-13: Excellent refactoring to centralize header construction.

The introduction of buildAuthHeaders eliminates code duplication and ensures consistent authentication header handling across all GraphQL requests.

load-tests/src/tests/shells/aircraft.ts (1)

14-14: Excellent refactoring to centralize header construction.

The adoption of buildAuthHeaders maintains consistency with flights.ts and eliminates code duplication across the test suite.

k8s/messaging/schema-registry.yml (1)

46-48: Verify that replication factor of 1 is appropriate for your environment.

The Schema Registry uses SCHEMA_REGISTRY_KAFKASTORE_TOPIC_REPLICATION_FACTOR: 1, which means schema metadata is stored in a Kafka topic with no replicas. This is suitable for development but poses data loss risk in production. For production deployments, increase this to at least 2 or 3.

If this is a production cluster, update the replication factor:

            - name: SCHEMA_REGISTRY_KAFKASTORE_TOPIC_REPLICATION_FACTOR
-             value: "1"
+             value: "3"
k8s/monitoring/promethus.yml (1)

31-45: Repository path and configuration file verified—add error handling and confirm cluster network access.

The repository path and Prometheus configuration file have been verified to exist and contain valid YAML. However:

  • Add error handling to the init script to fail fast if git clone or file copy operations fail—currently, if the clone fails, the script continues and Prometheus starts without a valid configuration
  • Ensure the repository is accessible from your Kubernetes cluster (may require network policies, ingress configuration, or proxy settings to be configured)
  • Consider storing the configuration in a ConfigMap instead of cloning at runtime to avoid network dependency on each pod startup
k8s/aircraft/service.yml (1)

35-41: No action required — standard busybox images include netcat.

The official busybox Docker image includes the nc utility by default, and the init container commands will function correctly. Whilst netcat is technically optional in BusyBox compilation (dependent on CONFIG_NC flags), the standard busybox image used here includes it. This port-checking pattern is widely established in Kubernetes deployments.

load-tests/src/helpers/auth_headers.ts (1)

1-29: LGTM!

The buildAuthHeaders utility function is well-implemented with clear documentation. The conditional logic correctly handles both JWT-based gateway authentication and direct service authentication headers. The function signature with optional parameters and default values provides good flexibility for different test scenarios.

k8s/router/service.yml (1)

77-83: Verify PVC initialisation before deployment.

The router depends on two PVCs (supergraph-schema-pvc and router-config-pvc) being created and populated before the deployment starts. Ensure the jobs that populate these PVCs (referenced in the PR summary) complete successfully before applying this deployment, or consider adding init containers to wait for the required files.
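
One option, sketched below, is an initContainer that blocks until the expected file appears on the PVC; the volume name, mount path and file name are assumptions and must match what the supergraph and router-config Jobs actually write.

# Sketch only: wait until the composed supergraph exists and is non-empty before starting the router.
initContainers:
  - name: wait-for-supergraph
    image: busybox:1.36
    command:
      - sh
      - -c
      - "until [ -s /supergraph/supergraph.graphql ]; do echo 'waiting for supergraph...'; sleep 2; done"
    volumeMounts:
      - name: supergraph-schema
        mountPath: /supergraph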

k8s/flights/database.yml (1)

20-20: Verify single-replica StatefulSet appropriateness for production use.

This manifest deploys only 1 PostgreSQL replica, providing no redundancy. For development, this is acceptable; for production, ensure this is intentional or plan for replication strategy (e.g., PostgreSQL streaming replication or managed database service).

Confirm whether single-replica PostgreSQL is appropriate for your deployment strategy, or plan for a high-availability setup with multiple replicas and failover logic.

k8s/flights/service.yml (1)

51-60: Verify KAFKA_FLIGHTS_TOPIC and other hardcoded service URLs for correctness.

All external service URLs are hardcoded (AIRCRAFT_SERVICE_GRPC_URL, OTLP_GRPC_URL, KAFKA_BROKER_URL, KAFKA_SCHEMA_REGISTRY_URL). Whilst Kubernetes DNS conventions support this, any future namespace reorganisation or service relocation will require manual pod spec updates.

Confirm these hardcoded URLs align with your Kubernetes topology:

  • aircraft-service.aircraft:9090
  • otel-collector.monitoring:4317
  • kafka.messaging.svc.cluster.local:9092
  • schema-registry.messaging.svc.cluster.local:8081

If service namespaces or names change frequently, consider using a ConfigMap or external service discovery to inject these values dynamically.
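
For example, a ConfigMap plus envFrom could centralise these endpoints so that only one object changes when a service moves; the ConfigMap name is a placeholder and the values simply mirror the ones listed above.

# Hypothetical endpoint ConfigMap for the flights service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: flights-endpoints
  namespace: flights
data:
  AIRCRAFT_SERVICE_GRPC_URL: "aircraft-service.aircraft:9090"
  OTLP_GRPC_URL: "otel-collector.monitoring:4317"
  KAFKA_BROKER_URL: "kafka.messaging.svc.cluster.local:9092"
  KAFKA_SCHEMA_REGISTRY_URL: "schema-registry.messaging.svc.cluster.local:8081"
---
# In the flights Deployment's container spec (fragment):
#   envFrom:
#     - configMapRef:
#         name: flights-endpoints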

k8s/monitoring/grafana.yml (1)

1-13: LGTM!

The Service definition is correctly configured for internal cluster access to Grafana on port 3000.

k8s/01-ingress-controller.yml (1)

383-509: LGTM!

The ingress-nginx controller Deployment follows security best practices with a properly configured securityContext, running as non-root, dropping all capabilities except NET_BIND_SERVICE, and defining appropriate resource limits.

k8s/authentication/service.yml (1)

1-13: LGTM!

The Service definition correctly exposes the authentication service on port 8000 within the authentication namespace.

load-tests/src/config.ts (1)

1-7: Verification confirmed—endpoint paths are correctly configured.

The ingress rules in k8s/03-ingress.yml define both /router and /router/(.*) paths routing to router-service:4000, and /gateway and /gateway/(.*) paths routing to gateway-service:4001. The configuration in load-tests/src/config.ts correctly uses these paths: http://api.aviation.local/router and http://api.aviation.local/gateway. The migration to the Kubernetes ingress configuration is consistent.

edinstance merged commit 7967c0a into dev on Nov 3, 2025 (1 of 2 checks passed)
edinstance deleted the kubernetes-deployment-config branch on Nov 3, 2025 at 21:54
coderabbitai bot mentioned this pull request on Nov 3, 2025