
Conversation

@HardMax71 (Owner) commented Dec 27, 2025

docker compose --wait instead of curl polling, plus COVERAGE_CORE=sysmon for faster coverage collection

Summary by CodeRabbit

  • New Features

    • Kubernetes-based end-to-end tests added; CI now supports Kubernetes test runs.
    • Docker image caching to speed up CI.
  • Infrastructure

    • CI reorganized into unit, integration, and e2e phases with improved caching, log collection, and startup sequencing.
    • Deployment/startup ordering refined to ensure seed tasks complete before services start.
  • Bug Fixes

    • Fixed database cursor/aggregation handling.
    • Unified user settings payload field naming and parsing.


@coderabbitai bot commented Dec 27, 2025

📝 Walkthrough

Walkthrough

This PR reorganizes CI into separate unit/integration/e2e pipelines with image caching and enhanced log collection; migrates async Mongo usage from Motor to PyMongo's async client across code and tests; updates test fixtures and scopes (session-scoped app, integration cleanup); and normalizes Mongo aggregation/list-index usage to await cursors.
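For orientation, the cursor handling described above looks roughly like this minimal sketch, assuming PyMongo ≥ 4.9's async API (the database, collection, and pipeline are illustrative, not taken from the repository):

```python
from pymongo import AsyncMongoClient


async def count_by_status(mongo_url: str) -> list[dict]:
    client: AsyncMongoClient = AsyncMongoClient(mongo_url)
    try:
        collection = client["app_db"]["executions"]
        pipeline = [{"$group": {"_id": "$status", "count": {"$sum": 1}}}]
        # With PyMongo's async API the aggregate() call itself is awaited and
        # yields the cursor, which is then iterated (or drained via to_list()).
        cursor = await collection.aggregate(pipeline)
        return [doc async for doc in cursor]
    finally:
        # AsyncMongoClient.close() is a coroutine, unlike Motor's synchronous close().
        await client.close()
```

The same await-the-call-then-consume-the-cursor pattern applies to list_indexes(), as reflected in the repository changes listed below.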

Changes

  • CI workflows & actions: .github/workflows/backend-ci.yml, .github/workflows/frontend-ci.yml, .github/actions/docker-cache/action.yml, .github/actions/setup-ci-compose/action.yml (removed)
    Split backend CI into unit/integration/e2e jobs, added a docker image cache action, removed the setup-ci-compose composite action, added image pinning and Codecov flags, and improved log/artifact collection and Kubernetes steps.
  • Docker Compose & service config: docker-compose.yaml, docker-compose.ci.yaml
    Added a user-seed completion dependency for cert-generator/backend; introduced a CI-optimized compose file with infra vs full profiles, healthchecks, KRaft Kafka, and host networking for cert-generator.
  • Backend runtime & packaging: backend/Dockerfile, backend/pyproject.toml, README.md
    Gunicorn env-driven defaults in the Dockerfile; removed the motor dependency, updated the pytest-asyncio loop scope, added coverage run config; updated the backend Codecov badge link.
  • Mongo client migration: backend/app/core/database_context.py, backend/app/services/coordinator/coordinator.py, backend/app/services/k8s_worker/worker.py, backend/workers/*, backend/scripts/seed_users.py, backend/tests/fixtures/real_services.py, backend/workers/run_*
    Replaced Motor types/usage with PyMongo async equivalents (AsyncMongoClient/AsyncDatabase/AsyncCollection/etc.), updated connects/disconnects to await close(), and adjusted type aliases and imports across services, workers, scripts, and real-service fixtures.
  • Aggregation / cursor semantics: backend/app/db/repositories/*, backend/app/dlq/manager.py, backend/app/events/event_store.py, backend/app/db/repositories/notification_repository.py, backend/tests/integration/db/schema/test_schema_manager.py
    Normalized aggregate() and list_indexes() use to await the cursor and then iterate or call to_list(), adding explicit cursor variables in repository logic.
  • Test fixture & scope changes: backend/tests/conftest.py, backend/tests/unit/conftest.py, backend/tests/unit/.env.unit, backend/tests/integration/conftest.py, backend/tests/integration/app/__init__.py, backend/tests/fixtures/real_services.py
    Session-scoped app/app_container, a new unit .env and isolation conftest (errors for DB/Redis/http), a new integration autouse _cleanup that drops non-system collections and flushes Redis, and renamed shared_* fixtures to test_*.
  • Tests (markers, imports, signature renames): backend/tests/integration/** (many files, e.g., db/repositories/*, events/*, k8s/*, test_*.py)
    Switched many tests from pytest.mark.unit to pytest.mark.integration, replaced AsyncIOMotorDatabase imports/annotations with the Database alias, and globally renamed fixture params shared_user/shared_admin to test_user/test_admin.
  • Unit test edits & test scaffolding: backend/tests/unit/*, backend/tests/unit/events/test_admin_utils.py (removed), backend/tests/unit/infrastructure/mappers/test_replay_api_mapper.py (portions removed), backend/tests/unit/services/pod_monitor/test_monitor.py
    Added the unit isolation conftest, removed or simplified some unit tests, and consolidated k8s mocking helpers and stubs in the pod_monitor tests.
  • Events & user settings: backend/app/infrastructure/kafka/events/user.py, backend/app/services/user_settings_service.py
    Renamed the event field changes to updated; the user settings service now accepts stringified/JSON payload values and applies nested/top-level updates via a new parse helper and extended _apply_event handling (see the sketch after this list).
  • Misc tests added/changed: backend/tests/integration/events/test_admin_utils.py, backend/tests/integration/services/sse/*, other route tests
    Refactored the AdminUtils integration test to use the idempotent ensure_topics_exist, removed manual timeouts in some SSE tests, and updated numerous route tests to use the renamed fixtures.
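To make the settings-payload handling concrete, here is a rough, hypothetical sketch of the idea; the helper names, dotted-key convention, and data shapes are illustrative and are not the repository's actual user_settings_service code:

```python
import json
from typing import Any


def parse_setting_value(raw: Any) -> Any:
    """Hypothetical helper: decode JSON-encoded strings, otherwise return the value unchanged."""
    if isinstance(raw, str):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            return raw
    return raw


def apply_update(settings: dict[str, Any], key: str, raw_value: Any) -> None:
    """Apply a top-level key ("theme") or a dotted nested key ("notifications.email")."""
    value = parse_setting_value(raw_value)
    target = settings
    *parents, leaf = key.split(".")
    for part in parents:
        target = target.setdefault(part, {})
    target[leaf] = value
```

Under these assumptions, apply_update(settings, "notifications.email", "true") decodes the string and stores the boolean True under settings["notifications"]["email"], while a non-JSON string is kept as-is.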

Sequence Diagram(s)

sequenceDiagram
    participant GH as GitHub Actions
    participant Cache as Docker Image Cache Action
    participant Compose as Docker Compose (CI)
    participant K3s as K3s / Kubernetes
    participant Tests as Test Runners
    participant Codecov as Codecov

    GH->>Cache: request/cache images (mongo, redis, kafka, schema-registry)
    Cache-->>GH: images loaded or saved

    GH->>Compose: start infra (integration job)
    Compose->>Tests: run integration tests (with _cleanup)
    Tests->>Compose: on failure collect docker-compose logs
    Tests->>Codecov: upload integration coverage (flag)

    GH->>K3s: bootstrap k3s (e2e job)
    GH->>Cache: ensure images for k8s
    K3s->>Tests: run k8s e2e tests
    Tests->>K3s: on failure collect k8s events/logs
    Tests->>Codecov: upload e2e coverage (flag)

    GH->>Tests: run unit tests (isolated env)
    Tests->>Codecov: upload unit coverage (flag)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • PR #39: Modifies the same database abstraction (Motor → PyMongo type alias changes in backend/app/core/database_context.py) and related test/type updates.
  • PR #27: Overlaps on MongoDB client migration in service/worker files (coordinator, k8s_worker) converting to AsyncMongoClient.
  • PR #3: Related CI and docker-compose changes; overlaps on CI composition and compose-based test infra.

Suggested labels

enhancement

Poem

🐰 a little hop, a tiny tweak

Motor waved goodbye, PyMongo took the stage,
CI learned to cache and sprint across each page,
Fixtures settled in a session's cozy nest,
Tests tidy, logs captured, workers put to rest,
A rabbit's refactor — clean, nimble, and well-pressed.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 56.63%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title 'chore: speed-up of CI tests' is directly related to the primary changes in the PR, which focus on optimizing CI workflows through Docker image caching, environment-based test separation (unit/integration/e2e), and the MongoDB driver migration for performance.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch ci-speedup

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter commented Dec 27, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 75.00000% with 24 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| backend/app/services/user_settings_service.py | 60.46% | 17 Missing ⚠️ |
| backend/app/core/database_context.py | 80.00% | 3 Missing ⚠️ |
| backend/app/db/repositories/event_repository.py | 85.71% | 1 Missing ⚠️ |
| backend/app/events/event_store.py | 66.66% | 1 Missing ⚠️ |
| backend/app/services/coordinator/coordinator.py | 50.00% | 1 Missing ⚠️ |
| backend/app/services/k8s_worker/worker.py | 50.00% | 1 Missing ⚠️ |
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
| Flag | Coverage Δ |
|---|---|
| backend-e2e | 49.77% <16.66%> (?) |
| backend-integration | 70.72% <73.95%> (?) |
| backend-unit | 54.84% <14.58%> (?) |
| frontend | 56.84% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...p/db/repositories/admin/admin_events_repository.py | 69.27% <100.00%> (+28.93%) ⬆️ |
| backend/app/db/repositories/dlq_repository.py | 87.91% <100.00%> (+40.13%) ⬆️ |
| ...ackend/app/db/repositories/execution_repository.py | 86.48% <100.00%> (+19.36%) ⬆️ |
| ...end/app/db/repositories/notification_repository.py | 93.47% <100.00%> (+25.09%) ⬆️ |
| backend/app/db/repositories/saga_repository.py | 83.78% <100.00%> (+36.48%) ⬆️ |
| backend/app/dlq/manager.py | 84.10% <100.00%> (+5.64%) ⬆️ |
| backend/app/infrastructure/kafka/events/user.py | 100.00% <100.00%> (ø) |
| backend/app/db/repositories/event_repository.py | 70.04% <85.71%> (+22.48%) ⬆️ |
| backend/app/events/event_store.py | 74.52% <66.66%> (+8.17%) ⬆️ |
| backend/app/services/coordinator/coordinator.py | 44.02% <50.00%> (+2.23%) ⬆️ |

... and 3 more

... and 100 files with indirect coverage changes


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
docker-compose.ci.yaml (2)

117-124: Consider adding a healthcheck to shared-ca service.

The shared-ca service is a dependency for cert-generator with condition: service_completed_successfully. However, the sleep 1 command means it completes almost immediately. While this works, adding an explicit healthcheck or using a more deterministic completion signal would be more robust.


56-88: Consider using a specific Kafka image tag for reproducibility.

The tag bitnami/kafka:3.6 is valid and points to the latest patch in the 3.6 series. However, for consistent CI builds, use an explicit full tag like bitnami/kafka:3.6.2-debian-12-r5 to avoid potential manifest issues across different architectures or future tag changes.

   kafka:
-    image: bitnami/kafka:3.6
+    image: bitnami/kafka:3.6.2-debian-12-r5
     container_name: kafka
.github/workflows/frontend-ci.yml (1)

117-120: Consider consolidating redundant log capture.

Line 117 captures all compose logs, while lines 118-120 capture individual service logs. The individual service logs are subsets of the full log. Consider if you need both, or if filtering could be applied during analysis instead.

🔎 Alternative: Keep only full logs or add filtering
          mkdir -p logs
          docker compose -f docker-compose.ci.yaml logs > logs/docker-compose.log 2>&1
-          docker compose -f docker-compose.ci.yaml logs backend > logs/backend.log 2>&1
-          docker compose -f docker-compose.ci.yaml logs frontend > logs/frontend.log 2>&1
-          docker compose -f docker-compose.ci.yaml logs kafka > logs/kafka.log 2>&1
+          # Individual logs can be extracted from docker-compose.log if needed
          kubectl get events --sort-by='.metadata.creationTimestamp' -A > logs/k8s-events.log 2>&1 || true

Alternatively, keep individual logs if you prefer separate files for easier debugging but note the duplication.

.github/workflows/backend-ci.yml (1)

159-160: Consider adding Kafka and Schema Registry logs for k8s job failures.

The non-k8s integration job collects Kafka and Schema Registry logs on failure (lines 79-81), but the k8s-integration job only collects the general compose log. For consistency and easier debugging, consider adding service-specific logs.

🔎 Proposed enhancement
        run: |
          mkdir -p logs
          docker compose -f docker-compose.ci.yaml logs > logs/docker-compose.log 2>&1
+          docker compose -f docker-compose.ci.yaml logs kafka > logs/kafka.log 2>&1
+          docker compose -f docker-compose.ci.yaml logs schema-registry > logs/schema-registry.log 2>&1
          kubectl get events --sort-by='.metadata.creationTimestamp' -A > logs/k8s-events.log 2>&1 || true
          kubectl describe pods -A > logs/k8s-describe-pods.log 2>&1 || true
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between da86acc and c86129f.

📒 Files selected for processing (4)
  • .github/actions/setup-ci-compose/action.yml
  • .github/workflows/backend-ci.yml
  • .github/workflows/frontend-ci.yml
  • docker-compose.ci.yaml
💤 Files with no reviewable changes (1)
  • .github/actions/setup-ci-compose/action.yml
🧰 Additional context used
🪛 Checkov (3.2.334)
docker-compose.ci.yaml

[medium] 160-161: Basic Auth Credentials

(CKV_SECRET_4)

.github/workflows/backend-ci.yml

[medium] 50-51: Basic Auth Credentials

(CKV_SECRET_4)

🪛 GitHub Actions: Backend CI
docker-compose.ci.yaml

[error] 1-1: Command 'docker compose -f docker-compose.ci.yaml up -d --wait --wait-timeout 120' failed: manifest for bitnami/kafka:3.6 not found (manifest unknown).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: E2E Tests
🔇 Additional comments (10)
docker-compose.ci.yaml (3)

126-147: Network configuration conflict in cert-generator.

The service declares network_mode: host (line 144) while also having a depends_on relationship with shared-ca which uses ci-network. When using network_mode: host, the container cannot communicate with containers on the custom bridge network by their service names. This may cause issues if cert-generator needs to reach services on ci-network.

Additionally, network_mode: host and networks: are mutually exclusive in Docker Compose - the service correctly omits networks: but the dependency on shared-ca (which is on ci-network) might not work as expected.

Verify that the cert-generator can successfully complete its task without needing to communicate with other services on the ci-network. The volume mount approach appears to be the intended communication method via /shared_ca.


21-39: Good CI optimizations for infrastructure services.

The configuration demonstrates thoughtful CI optimizations:

  • tmpfs for MongoDB (/data/db) eliminates disk I/O
  • Redis memory limits and disabled persistence (--save "")
  • KRaft mode Kafka with reduced heap and single-node replication factors
  • Appropriate health check intervals for CI speed

Also applies to: 56-88, 90-110


27-29: Hardcoded credentials are acceptable for CI-only configuration.

The static analysis tool flagged hardcoded credentials. These are intentionally hardcoded for a CI test environment and are not used in production. The file header and naming (docker-compose.ci.yaml) clearly indicate this is CI-specific. No action needed.

Also applies to: 160-161

.github/workflows/frontend-ci.yml (3)

84-89: Simplified kubeconfig creation approach looks good.

The sed-based approach to replace the server URL is cleaner than maintaining a separate heredoc template. The chmod 644 is appropriate for read access by Docker containers.


91-94: Appropriate use of --wait flag with generous timeout.

The --build --wait --wait-timeout 300 combination ensures services are built and healthy before proceeding. The 5-minute timeout is reasonable given the full stack build (cert-generator, backend, frontend).


113-121: Comprehensive log collection on failure.

Good addition of Kafka logs and Kubernetes events. The || true on kubectl ensures log collection doesn't fail if k8s setup was the issue.

.github/workflows/backend-ci.yml (4)

18-89: Well-structured non-k8s integration job.

The job correctly:

  • Sets up UV with caching for faster dependency installation
  • Starts only infrastructure services (no --profile full)
  • Explicitly ignores k8s tests with --ignore=tests/integration/k8s
  • Uses appropriate environment variables for local service access
  • Sets COVERAGE_CORE: sysmon for faster coverage collection per PR objectives

90-168: K8s integration job properly isolated.

Good separation of k8s-specific tests into a parallel job with:

  • Dedicated k8s setup with namespace creation
  • Appropriate KUBECONFIG and K8S_NAMESPACE environment variables
  • Distinct coverage flags (backend-k8s) for Codecov

46-55: Hardcoded test credentials are acceptable for CI.

The static analysis flagged credentials at lines 50-51. These are intentionally hardcoded for CI test environment matching the docker-compose.ci.yaml configuration. This is standard practice for CI pipelines.

Also applies to: 127-137


114-122: This concern is not valid.

The kubeconfig file is already owned by the runner user because the shell redirection (>) is processed by the runner's shell before sudo executes. This means sudo k3s kubectl config view --raw > /home/runner/.kube/config creates a runner-owned file, not a root-owned one. With chmod 600, the runner user has read and write access, which is correct. The tests can access the file without issues.

Likely an incorrect or invalid review comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
.github/workflows/backend-ci.yml (1)

18-24: Fix the invalid Kafka image tag in the environment variable.

The KAFKA_IMAGE environment variable uses bitnami/kafka:3.6 which doesn't exist. This needs to match the fix in docker-compose.ci.yaml.

🔎 Proposed fix
 env:
   MONGO_IMAGE: mongo:8.0
   REDIS_IMAGE: redis:7-alpine
-  KAFKA_IMAGE: bitnami/kafka:3.6
+  KAFKA_IMAGE: bitnami/kafka:3.6.2
   SCHEMA_REGISTRY_IMAGE: confluentinc/cp-schema-registry:7.5.0
🧹 Nitpick comments (2)
docker-compose.ci.yaml (1)

118-125: Consider a more restrictive chmod for shared CA directory.

chmod 777 grants full permissions to everyone. While acceptable for CI, chmod 755 would be slightly more secure while still allowing the necessary access.

🔎 Proposed fix
   shared-ca:
     image: alpine:latest
     profiles: ["full"]
     volumes:
       - shared_ca:/shared_ca
-    command: sh -c "mkdir -p /shared_ca && chmod 777 /shared_ca && sleep 1"
+    command: sh -c "mkdir -p /shared_ca && chmod 755 /shared_ca && sleep 1"
     networks:
       - ci-network
.github/workflows/backend-ci.yml (1)

145-184: Caching logic is duplicated across jobs.

The Docker image caching steps are identical in both integration and k8s-integration jobs. Consider extracting this to a reusable workflow or composite action to reduce duplication.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c86129f and 094ddbf.

📒 Files selected for processing (2)
  • .github/workflows/backend-ci.yml
  • docker-compose.ci.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 98-99: Basic Auth Credentials

(CKV_SECRET_4)

docker-compose.ci.yaml

[medium] 161-162: Basic Auth Credentials

(CKV_SECRET_4)

🪛 GitHub Actions: Backend CI
docker-compose.ci.yaml

[error] 1-1: Docker image not found: bitnami/kafka:3.6 (manifest unknown). Docker could not pull the image required by docker-compose. Command failed: 'docker compose -f docker-compose.ci.yaml up -d --wait --wait-timeout 120'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: E2E Tests
🔇 Additional comments (11)
docker-compose.ci.yaml (6)

21-39: Good CI optimization with tmpfs; credentials are acceptable for CI-only use.

The tmpfs mount for MongoDB data is a smart choice for faster CI cycles. The hardcoded credentials flagged by static analysis (CKV_SECRET_4) are acceptable here since this configuration is explicitly for CI testing and not production.


41-55: LGTM!

Redis configuration is appropriately tuned for CI with memory limits and LRU eviction policy.


91-111: LGTM!

Schema-registry is properly configured with health checks and correct dependency on Kafka.


127-148: LGTM!

The cert-generator is properly configured as a one-shot container with correct dependency ordering on shared-ca.


150-194: LGTM!

Backend service has proper health checks and dependency chain (cert-generator → infra services). The explicitly named ci-test-secret-key makes it clear this is for testing only.


195-228: LGTM!

Frontend service is correctly configured with dependency on backend health and proper TLS certificate mounts.

.github/workflows/backend-ci.yml (5)

33-72: Good caching strategy with parallel operations and zstd compression.

The Docker image caching approach is well-designed:

  • Parallel pulls/saves reduce wall-clock time
  • zstd compression balances speed vs size
  • Cache key includes all image versions for proper invalidation

86-121: LGTM!

Good use of --wait flag with timeout for infrastructure startup. The COVERAGE_CORE: sysmon setting aligns with the PR objective for faster CI tests.


123-136: LGTM!

Comprehensive log collection on failure will aid debugging CI issues.


203-231: LGTM!

The k3s setup is well-designed:

  • Disables traefik (not needed for tests)
  • Uses idempotent namespace creation (dry-run | apply)
  • Proper timeout for cluster readiness
  • Environment variables correctly configured for k8s tests

244-257: LGTM!

Good defensive logging with || true to ensure log collection doesn't fail the step, maximizing debug information availability.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
docker-compose.ci.yaml (1)

56-89: Kafka configuration successfully addresses the past image issue.

The switch from bitnami/kafka:3.6 to apache/kafka:3.9.0 resolves the invalid image tag issue flagged in previous reviews. The KRaft mode configuration is appropriate for CI, eliminating the Zookeeper dependency and speeding up startup.

The health check uses 30 retries (60 seconds total with 2s intervals). While this provides a safety margin for Kafka's startup, you might optionally reduce it to 20-25 retries if you find Kafka consistently starts faster in your CI environment.

.github/workflows/backend-ci.yml (1)

145-184: Consider extracting Docker caching into a reusable composite action.

The Docker image caching logic (lines 145-184) is identical to the integration job (lines 33-72). This duplication creates maintenance overhead and increases the risk of inconsistencies.

Consider creating a reusable composite action (e.g., .github/actions/cache-docker-images/action.yml) that both jobs can consume. This would centralize the caching logic and make future updates easier.

Example structure for a composite action

Create .github/actions/cache-docker-images/action.yml:

name: Cache Docker Images
description: Cache and load Docker images for CI
runs:
  using: composite
  steps:
    - name: Cache Docker images
      uses: actions/cache@v5
      id: docker-cache
      with:
        path: /tmp/docker-cache
        key: docker-${{ runner.os }}-${{ env.MONGO_IMAGE }}-${{ env.REDIS_IMAGE }}-${{ env.KAFKA_IMAGE }}-${{ env.SCHEMA_REGISTRY_IMAGE }}
    
    # ... rest of the caching logic

Then in both jobs:

- uses: ./.github/actions/cache-docker-images
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6fc929b and 3613d03.

📒 Files selected for processing (2)
  • .github/workflows/backend-ci.yml
  • docker-compose.ci.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 98-99: Basic Auth Credentials

(CKV_SECRET_4)

docker-compose.ci.yaml

[medium] 161-162: Basic Auth Credentials

(CKV_SECRET_4)

🔇 Additional comments (11)
docker-compose.ci.yaml (5)

1-15: LGTM! Clear documentation of CI compose usage.

The header comments provide helpful context on the two usage modes (infra-only vs full stack) and clearly explain the key differences from the main docker-compose.yaml file.


21-39: LGTM! MongoDB optimized for CI with tmpfs.

Using tmpfs for /data/db is an excellent optimization for CI speed, avoiding disk I/O overhead. The health check configuration is appropriate for the CI environment.


117-149: LGTM! Certificate generation flow correctly sequenced.

The dependency chain ensures proper startup order: shared-ca → cert-generator → backend/frontend. Using host network mode for cert-generator is appropriate for certificate propagation in CI.


150-193: LGTM! Backend service properly configured with all dependencies.

The backend service correctly depends on:

  • cert-generator (completion) for TLS setup
  • All infrastructure services (health) for runtime dependencies

The health check configuration with a 30s start period and 20 retries is appropriate given the backend's initialization requirements.


195-221: LGTM! Frontend service correctly depends on backend health.

The frontend service properly waits for backend health before starting, ensuring the API is available for E2E tests. The health check configuration is appropriate.

.github/workflows/backend-ci.yml (6)

18-23: LGTM! Image version pinning ensures cache stability.

Pinning image versions in environment variables provides consistent cache keys across CI runs, improving cache hit rates and reducing CI time.


33-72: LGTM! Efficient Docker image caching with parallel operations.

The caching strategy is well-implemented:

  • Parallel image pulls and saves with background jobs (& and wait)
  • zstd compression with multi-threading (-T0) and reasonable compression level (3)
  • Cache hit logic properly loads images in parallel

This should significantly speed up CI runs after the initial cache population.


86-110: LGTM! Test execution aligns with PR objectives.

The changes successfully implement the PR objectives:

  • ✅ Uses --wait instead of curl polling for service startup (line 88)
  • ✅ Sets COVERAGE_CORE: sysmon for improved coverage collection (line 104)
  • Correctly excludes k8s tests with --ignore=tests/integration/k8s

123-136: LGTM! Comprehensive log collection on failure.

The failure path collects relevant logs (docker-compose, kafka, schema-registry) with proper stderr redirection (2>&1). The artifact name backend-logs clearly distinguishes it from the k8s job's logs.


203-232: LGTM! k3s setup and test execution properly configured.

The k8s-integration job is well-structured:

  • k3s installation with traefik disabled (appropriate for test environment)
  • Idempotent namespace creation with --dry-run=client
  • Correct test filtering for k8s-specific tests
  • COVERAGE_CORE: sysmon set per PR objectives (line 226)
  • Separate coverage flag (backend-k8s) distinguishes k8s coverage from non-k8s

244-257: LGTM! k8s-specific log collection on failure.

The failure path appropriately collects Kubernetes events and pod descriptions alongside docker-compose logs. The artifact name k8s-logs clearly distinguishes it from the non-k8s job's logs.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
docker-compose.ci.yaml (1)

125-133: Consider using a more robust completion signal.

The sleep 1 pattern ensures the shared directory exists but is fragile. If the directory setup takes longer on slow CI runners, dependent services might fail.

🔎 Alternative approach using a touch file
   shared-ca:
     image: alpine:latest
     profiles: ["full"]
     volumes:
       - shared_ca:/shared_ca
-    command: sh -c "mkdir -p /shared_ca && chmod 777 /shared_ca && sleep 1"
+    command: sh -c "mkdir -p /shared_ca && chmod 777 /shared_ca && touch /shared_ca/.ready && tail -f /dev/null"
     networks:
       - ci-network
+    healthcheck:
+      test: ["CMD", "test", "-f", "/shared_ca/.ready"]
+      interval: 1s
+      timeout: 1s
+      retries: 5

Then update cert-generator to depend on health instead:

     depends_on:
       shared-ca:
-        condition: service_completed_successfully
+        condition: service_healthy
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3613d03 and 0bc30dc.

📒 Files selected for processing (1)
  • docker-compose.ci.yaml
🧰 Additional context used
🪛 Checkov (3.2.334)
docker-compose.ci.yaml

[medium] 171-172: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
  • GitHub Check: K8s Integration Tests
🔇 Additional comments (9)
docker-compose.ci.yaml (9)

1-15: LGTM! Clear CI documentation.

The header comments provide excellent context on usage patterns and key differences from the production compose file.


27-29: Hardcoded credentials are acceptable for CI.

The MongoDB credentials are hardcoded, which is flagged by static analysis. For CI environments, this is acceptable since they're ephemeral and not exposed to production.


41-54: LGTM! Redis optimized for CI.

The memory limits, disabled persistence (--save ""), and tight healthcheck intervals are appropriate for ephemeral CI workloads.


91-111: LGTM! Schema Registry properly configured.

The service correctly depends on Kafka health, uses internal listener (kafka:29092), and has appropriate resource limits for CI.


135-156: Review the necessity of host networking for cert-generator.

Using network_mode: host bypasses Docker network isolation. This might be required for certificate generation that needs to access the host, but it increases the attack surface in CI.

Verify whether host networking is strictly necessary or if the same functionality can be achieved with the standard bridge network and volume mounts.


158-205: Hardcoded secrets are acceptable for CI environments.

Lines 171 and 178 contain hardcoded MongoDB credentials and SECRET_KEY, which static analysis correctly flags. For ephemeral CI environments, this is acceptable since:

  • The environment is temporary and isolated
  • No production data is involved
  • Credentials are not exposed outside the CI runner

The explicit TESTING=true flag (line 170) clearly marks this as a test configuration.


207-233: LGTM! Frontend properly configured for E2E testing.

The service correctly:

  • Depends on backend health before starting
  • Uses NODE_EXTRA_CA_CERTS to trust the self-signed CA
  • Has appropriate healthcheck with -k flag for self-signed certificates

235-239: LGTM! Standard volume and network configuration.

The shared_ca volume enables certificate sharing across services, and the bridge network provides appropriate isolation for CI.


56-89: Apache Kafka image apache/kafka:3.9.0 exists and is stable.

The official Apache Kafka image version 3.9.0 is available on Docker Hub and confirmed as a stable release.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/backend-ci.yml (1)

178-224: Docker caching logic is duplicated across jobs.

The Docker caching steps (lines 185-224) are nearly identical to the integration job (lines 73-112). While this duplication is necessary since jobs run on separate runners, consider extracting this into a composite action for maintainability if the caching logic needs to evolve.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cf4eb0 and df27af4.

📒 Files selected for processing (2)
  • .github/workflows/backend-ci.yml
  • README.md
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 138-139: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: E2E Tests
  • GitHub Check: K8s Integration Tests
  • GitHub Check: Unit Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (7)
.github/workflows/backend-ci.yml (7)

18-23: LGTM!

Pinning image versions at the workflow level ensures cache key consistency and reproducible builds across all jobs.


26-64: LGTM!

The unit test job is well-structured with proper UV caching, coverage collection using sysmon core for speedup, and Codecov upload with distinct backend-unit flag.


73-112: Well-designed Docker image caching with parallel operations.

The caching strategy is efficient: parallel pulls/loads with & and wait, and zstd compression for optimal speed-to-size ratio. The cache key correctly includes all image versions for proper invalidation.

One consideration: if an individual docker load fails silently in the background, it won't be caught until compose starts. The failure would still be detected, but the error message might be less clear.


131-151: LGTM!

The test configuration correctly excludes k8s tests for the non-k8s integration path. The MongoDB credentials flagged by static analysis (CKV_SECRET_4) are CI-only test fixtures for ephemeral containers, not production secrets—this is a false positive.


163-176: LGTM!

Targeted log collection for Kafka and Schema Registry alongside the general docker-compose logs will aid debugging of integration test failures.


243-251: LGTM!

The k3s setup is well-structured with proper kubeconfig permissions, timeout for cluster readiness, and idempotent namespace creation using the dry-run pattern.


253-297: LGTM!

The k8s test configuration is well-structured with proper environment setup, distinct Codecov flag (backend-k8s), and comprehensive failure logging including Kubernetes events and pod descriptions.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/backend-ci.yml (1)

73-112: Consider extracting Docker caching logic to a reusable workflow.

The Docker image caching implementation is excellent and will significantly speed up CI runs. However, this 40-line block is duplicated identically in the k8s-integration job (lines 185-224).

Consider extracting it into a composite action or reusable workflow to improve maintainability and ensure consistency across both jobs.

💡 Example: Create a composite action

Create .github/actions/docker-cache/action.yml:

name: 'Cache Docker Images'
description: 'Cache and restore Docker images with zstd compression'
runs:
  using: 'composite'
  steps:
    - name: Cache Docker images
      uses: actions/cache@v5
      id: docker-cache
      with:
        path: /tmp/docker-cache
        key: docker-${{ runner.os }}-${{ env.MONGO_IMAGE }}-${{ env.REDIS_IMAGE }}-${{ env.KAFKA_IMAGE }}-${{ env.SCHEMA_REGISTRY_IMAGE }}

    - name: Load cached Docker images
      if: steps.docker-cache.outputs.cache-hit == 'true'
      shell: bash
      run: |
        echo "Loading cached images..."
        for f in /tmp/docker-cache/*.tar.zst; do
          zstd -d -c "$f" | docker load &
        done
        wait
        docker images

    - name: Pull and save Docker images
      if: steps.docker-cache.outputs.cache-hit != 'true'
      shell: bash
      run: |
        mkdir -p /tmp/docker-cache
        echo "Pulling images in parallel..."
        docker pull $MONGO_IMAGE &
        docker pull $REDIS_IMAGE &
        docker pull $KAFKA_IMAGE &
        docker pull $SCHEMA_REGISTRY_IMAGE &
        wait
        echo "Saving images with zstd compression..."
        docker save $MONGO_IMAGE | zstd -T0 -3 > /tmp/docker-cache/mongo.tar.zst &
        docker save $REDIS_IMAGE | zstd -T0 -3 > /tmp/docker-cache/redis.tar.zst &
        docker save $KAFKA_IMAGE | zstd -T0 -3 > /tmp/docker-cache/kafka.tar.zst &
        docker save $SCHEMA_REGISTRY_IMAGE | zstd -T0 -3 > /tmp/docker-cache/schema-registry.tar.zst &
        wait
        echo "Cache size:"
        du -sh /tmp/docker-cache/

Then replace both caching sections with:

- uses: ./.github/actions/docker-cache
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between df27af4 and d8988d6.

📒 Files selected for processing (1)
  • .github/workflows/backend-ci.yml
🧰 Additional context used
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 138-139: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: E2E Tests
  • GitHub Check: K8s Integration Tests
  • GitHub Check: Unit Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (8)
.github/workflows/backend-ci.yml (8)

26-64: Well-implemented unit test job with performance optimizations.

The unit test job is cleanly separated from integration tests, includes proper UV caching, and leverages COVERAGE_CORE=sysmon for faster coverage collection. The 5-minute timeout is appropriate for unit tests.


126-129: Excellent improvement: replaced polling with --wait.

The use of docker compose --wait --wait-timeout 120 is a clean replacement for the previous curl polling approach mentioned in the PR objectives. This makes the startup process more reliable and easier to maintain.


134-142: Note: Hardcoded credentials are acceptable for CI.

The static analysis tool flagged the hardcoded MongoDB credentials. While hardcoded credentials are generally a security concern, these are specifically for ephemeral CI environments and pose no security risk. The credentials are:

  • Only used in isolated CI containers
  • Never exposed to production
  • Reset with each test run

131-161: Well-structured integration test execution with proper coverage separation.

The non-k8s integration tests are properly isolated using --ignore=tests/integration/k8s, and the COVERAGE_CORE=sysmon setting aligns with the PR's speed-up objectives. The coverage upload with the backend-integration flag enables proper tracking in Codecov.


243-251: Verify that the 90-second timeout for k3s cluster readiness is sufficient.

The k3s setup looks well-structured with proper kubeconfig handling and namespace creation. However, the 90-second timeout for kubectl cluster-info might be tight on slower runners or during resource contention.

Consider monitoring CI run times to ensure this timeout is consistently adequate. If you observe intermittent failures, consider increasing to 120 seconds to match the docker-compose wait timeout.


253-282: K8s integration tests properly configured with distinct coverage tracking.

The k8s integration tests are cleanly separated with their own job, use the backend-e2e coverage flag (matching the Codecov comment structure in the PR), and properly configure the Kubernetes environment variables. The COVERAGE_CORE=sysmon setting maintains consistency with the other test jobs for performance.


284-297: Comprehensive failure logging for k8s troubleshooting.

The log collection includes both docker-compose logs and Kubernetes-specific artifacts (events and pod descriptions), which will be invaluable for debugging k8s-related test failures.


18-23: All pinned Docker image versions are valid and actively maintained.

The images resolve correctly: mongo:8.0, redis:7-alpine, apache/kafka:3.9.0, and confluentinc/cp-schema-registry:7.5.0 are all official images from their respective registries. The Kafka version is current (released November 2024), addressing the prior concern about invalid image tags.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
backend/tests/unit/schemas_pydantic/test_events_schemas.py (1)

77-77: Consider moving import to module level for consistency.

The import of EventMetadataResponse is local to this function, while all other imports are at the module level. Moving it to the top would improve consistency and make the dependency more discoverable.

🔎 Proposed refactor

Move the import to line 9-22 where other event schema imports are located:

 from app.schemas_pydantic.events import (
     EventAggregationRequest,
     EventBase,
     EventFilterRequest,
     EventInDB,
     EventListResponse,
+    EventMetadataResponse,
     EventProjection,
     EventQuery,
     EventResponse,
     EventStatistics,
     PublishEventRequest,
     PublishEventResponse,
     ResourceUsage,
 )

And remove the local import:

     # Minimal list response compose/decompose
-    from app.schemas_pydantic.events import EventMetadataResponse
     er = EventResponse(
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8988d6 and 98a8347.

📒 Files selected for processing (28)
  • backend/tests/conftest.py
  • backend/tests/integration/app/__init__.py
  • backend/tests/integration/app/test_main_app.py
  • backend/tests/integration/conftest.py
  • backend/tests/integration/core/test_dishka_lifespan.py
  • backend/tests/integration/db/repositories/__init__.py
  • backend/tests/integration/db/repositories/test_admin_events_repository.py
  • backend/tests/integration/db/repositories/test_admin_settings_repository.py
  • backend/tests/integration/db/repositories/test_admin_user_repository.py
  • backend/tests/integration/db/repositories/test_dlq_repository.py
  • backend/tests/integration/db/repositories/test_event_repository.py
  • backend/tests/integration/db/repositories/test_execution_repository.py
  • backend/tests/integration/db/repositories/test_notification_repository.py
  • backend/tests/integration/db/repositories/test_replay_repository.py
  • backend/tests/integration/db/repositories/test_saga_repository.py
  • backend/tests/integration/db/repositories/test_saved_script_repository.py
  • backend/tests/integration/db/repositories/test_sse_repository.py
  • backend/tests/integration/db/repositories/test_user_repository.py
  • backend/tests/integration/db/repositories/test_user_settings_repository.py
  • backend/tests/integration/events/test_admin_utils.py
  • backend/tests/integration/services/sse/__init__.py
  • backend/tests/integration/services/sse/test_redis_bus.py
  • backend/tests/unit/.env.unit
  • backend/tests/unit/conftest.py
  • backend/tests/unit/db/repositories/__init__.py
  • backend/tests/unit/events/test_admin_utils.py
  • backend/tests/unit/infrastructure/mappers/test_replay_api_mapper.py
  • backend/tests/unit/schemas_pydantic/test_events_schemas.py
💤 Files with no reviewable changes (2)
  • backend/tests/unit/infrastructure/mappers/test_replay_api_mapper.py
  • backend/tests/unit/events/test_admin_utils.py
✅ Files skipped from review due to trivial changes (2)
  • backend/tests/integration/services/sse/__init__.py
  • backend/tests/integration/app/__init__.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/tests/integration/events/test_admin_utils.py (1)
backend/app/events/admin_utils.py (3)
  • AdminUtils (10-62)
  • ensure_topics_exist (50-58)
  • check_topic_exists (27-34)
backend/tests/unit/schemas_pydantic/test_events_schemas.py (2)
backend/app/schemas_pydantic/events.py (2)
  • EventMetadataResponse (26-37)
  • EventResponse (40-52)
backend/app/domain/events/event_models.py (1)
  • correlation_id (93-94)
🪛 dotenv-linter (4.0.0)
backend/tests/unit/.env.unit

[warning] 2-2: [UnorderedKey] The SECRET_KEY key should go before the TESTING key

(UnorderedKey)


[warning] 3-3: [UnorderedKey] The ENABLE_TRACING key should go before the SECRET_KEY key

(UnorderedKey)


[warning] 4-4: [UnorderedKey] The OTEL_SDK_DISABLED key should go before the SECRET_KEY key

(UnorderedKey)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
  • GitHub Check: K8s Integration Tests
🔇 Additional comments (9)
backend/tests/unit/schemas_pydantic/test_events_schemas.py (1)

78-87: Excellent improvement to test data accuracy.

The switch from an empty dict to a properly structured EventMetadataResponse instance makes the test more accurate and aligns with the actual schema definition. All required fields (service_name, service_version, correlation_id) are correctly provided, making the test both realistic and maintainable.

backend/tests/conftest.py (1)

189-191: LGTM! Clear documentation of test fixture reorganization.

The comments effectively explain the separation of concerns: integration tests handle cleanup via their own conftest, while unit tests use guards to prevent infrastructure access.

backend/tests/integration/conftest.py (1)

1-24: LGTM! Solid integration test isolation strategy.

The autouse cleanup fixture ensures each integration test runs with a clean MongoDB and Redis state, filtering out system collections appropriately.

backend/tests/integration/events/test_admin_utils.py (2)

19-20: LGTM!

The topic existence check properly validates that the topic was created/exists.


15-17: Weak assertion doesn't validate anything meaningful.

The assertion assert res.get(topic) in (True, False) will pass even if res.get(topic) returns None. This doesn't verify that the topic was actually processed by ensure_topics_exist.

🔎 Proposed fix
-    # Ensure topic exists (idempotent)
-    res = await au.ensure_topics_exist([(topic, 1)])
-    assert res.get(topic) in (True, False)  # Some clusters may report exists
+    # Ensure topic exists (idempotent)
+    res = await au.ensure_topics_exist([(topic, 1)])
+    assert topic in res, f"Topic {topic} should be in results"

Likely an incorrect or invalid review comment.

backend/tests/unit/.env.unit (1)

1-4: LGTM! Appropriate unit test environment configuration.

The environment variables correctly configure unit test mode and disable tracing/telemetry.

backend/tests/unit/conftest.py (3)

1-9: LGTM! Proper unit test environment setup.

Loading .env.unit with override=True ensures unit test configuration takes precedence.
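A minimal sketch of what that setup amounts to (the path is illustrative):

```python
from pathlib import Path

from dotenv import load_dotenv

# override=True lets the unit-test values win over anything already present in the environment.
load_dotenv(Path(__file__).parent / ".env.unit", override=True)
```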


12-15: LGTM! No-op cleanup provides fixture compatibility.

The autouse=False no-op _cleanup fixture allows tests that explicitly depend on _cleanup to work in unit test context without performing unnecessary cleanup operations.


18-35: LGTM! Excellent enforcement of unit test boundaries.

The guard fixtures correctly prevent unit tests from accessing infrastructure dependencies, with clear error messages directing developers to use mocks or integration tests.
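A generic sketch of this isolation idea, not the repository's actual conftest (fixture names and messages are illustrative): override the infrastructure fixtures in the unit scope so accidental use fails loudly, and keep a no-op _cleanup for fixture-name compatibility with the integration suite.

```python
import pytest


@pytest.fixture
def _cleanup() -> None:
    """No-op stand-in so tests that explicitly request _cleanup still run in the unit suite."""
    return None


@pytest.fixture
def db() -> None:
    pytest.fail("Unit tests must not touch MongoDB; mock the repository or move this test to tests/integration.")


@pytest.fixture
def redis_client() -> None:
    pytest.fail("Unit tests must not touch Redis; mock the client or move this test to tests/integration.")
```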

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
backend/tests/unit/services/pod_monitor/test_monitor.py (2)

202-202: Consider whether substring check is intentional.

The assertion changed from an exact match to a case-insensitive substring check. While this makes the test more flexible, it could pass for unexpected state values like "RUNNING_IDLE" or "PENDING_IDLE".

If the state value format is stable, an exact comparison might catch regressions more reliably:

assert status["state"] == "IDLE"  # or whatever the exact expected value is

560-560: Tests verify graceful handling without assertions.

These tests validate that methods handle edge cases (already running/stopped, malformed input) without raising exceptions. The lack of explicit assertions is intentional—the test passes if no exception occurs.

For clarity, you could optionally add brief comments:

await pm.start()  # Should not raise when already running

Also applies to: 569-569, 605-605

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 98a8347 and 2ec7474.

📒 Files selected for processing (1)
  • backend/tests/unit/services/pod_monitor/test_monitor.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/unit/services/pod_monitor/test_monitor.py (1)
backend/app/services/pod_monitor/monitor.py (3)
  • PodMonitor (132-558)
  • create_pod_monitor (562-574)
  • _update_resource_version (322-328)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: K8s Integration Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (3)
backend/tests/unit/services/pod_monitor/test_monitor.py (3)

13-43: Good centralization of K8s stubs.

The shared stub classes effectively reduce duplication across tests. The minimal implementations are appropriate for unit testing where the focus is on the monitor's logic rather than K8s client behavior.


74-81: Excellent refactoring of K8s patching logic.

The _patch_k8s helper centralizes monkeypatching with sensible defaults while allowing test-specific overrides. This significantly reduces boilerplate and makes test setup more maintainable.


109-109: Consistent use of the new patching helper.

The tests correctly leverage _patch_k8s with appropriate overrides where needed, maintaining clear test intent while reducing setup boilerplate.

Also applies to: 541-541, 650-650, 675-675

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (9)
backend/tests/integration/db/repositories/test_user_repository.py (1)

12-12: Fix the MyPy type checking failure for the untyped db parameter.

The pipeline failure indicates MyPy's strict mode is rejecting the untyped db parameter. The # type: ignore[valid-type] comment is likely using the wrong error code for this scenario.

Given the commit message mentions "motor -> pymongo", ensure you're using the correct type annotation. Consider one of these solutions:

  1. Add proper type annotation (preferred):

    from pymongo.database import Database
    
    async def test_create_get_update_delete_user(db: Database) -> None:
  2. If the type is complex or fixture-dependent, use the correct mypy error code:

    async def test_create_get_update_delete_user(db) -> None:  # type: ignore[no-untyped-def]
backend/workers/run_saga_orchestrator.py (1)

114-114: Critical: close() must be awaited.

MyPy correctly identifies that db_client.close() returns a coroutine that is not being awaited, which means the database connection may not close properly.

🔎 Proposed fix
-        db_client.close()
+        await db_client.close()
backend/app/services/k8s_worker/worker.py (1)

580-580: Use stack.push_async_callback() instead of stack.callback() for async MongoDB client cleanup.

AsyncMongoClient.close() is an async coroutine in PyMongo 4.9.2 and must be awaited. The code at line 580 uses stack.callback() (for synchronous functions), but should use stack.push_async_callback() to properly handle the async cleanup, consistent with the other async callbacks on lines 578–579.
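A minimal sketch of that pattern, with an illustrative worker setup rather than the repository's actual code:

```python
from contextlib import AsyncExitStack

from pymongo import AsyncMongoClient


async def run_worker(mongo_url: str) -> None:
    async with AsyncExitStack() as stack:
        db_client: AsyncMongoClient = AsyncMongoClient(mongo_url)
        # stack.callback(db_client.close) would register the bound method as a
        # synchronous callback and leave the returned coroutine un-awaited;
        # push_async_callback() awaits close() when the stack unwinds.
        stack.push_async_callback(db_client.close)
        ...  # worker logic using db_client
```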

backend/workers/run_event_replay.py (1)

71-71: Critical: Use push_async_callback instead of callback for async close().

PyMongo's AsyncMongoClient.close() is now a coroutine that must be awaited. Using stack.callback() will not await the coroutine, causing a resource leak and MyPy error.

🔎 Proposed fix
-        stack.callback(db_client.close)
+        stack.push_async_callback(db_client.close)
backend/tests/fixtures/real_services.py (2)

96-101: Critical: Missing await on client.close().

PyMongo's AsyncMongoClient.close() is a coroutine that must be awaited. The synchronous call will cause a resource leak and MyPy error.

🔎 Proposed fix
         # Drop test MongoDB database
         if self.mongo_client:
             await self.mongo_client.drop_database(self.db_name)
-            self.mongo_client.close()
+            await self.mongo_client.close()

314-320: Critical: Missing await on client.close().

PyMongo's AsyncMongoClient.close() is a coroutine that must be awaited.

🔎 Proposed fix
         client = AsyncIOMotorClient(
             "mongodb://root:rootpassword@localhost:27017",
             serverSelectionTimeoutMS=5000
         )
         await client.admin.command("ping")
-        client.close()
+        await client.close()
backend/scripts/seed_users.py (1)

107-107: Critical: Missing await on client.close().

PyMongo's AsyncMongoClient.close() returns a coroutine that must be awaited. The pipeline failure explicitly flags this: "Value of type 'Coroutine[Any, Any, None]' must be used. Are you missing an await?"

🔎 Proposed fix
-    client.close()
+    await client.close()
backend/app/core/database_context.py (1)

158-172: Critical: Remove incorrect await on start_session().

PyMongo's client.start_session() returns an AsyncClientSession context manager directly, not a coroutine. The pipeline failure explicitly flags this: "Incompatible types in 'await' (actual type 'AsyncClientSession', expected type 'Awaitable[Any]')".

🔎 Proposed fix
-        async with await self.client.start_session() as session:
+        async with self.client.start_session() as session:
             async with session.start_transaction():
                 yield session
backend/app/services/coordinator/coordinator.py (1)

543-548: Critical: Use push_async_callback instead of callback for async close().

PyMongo's AsyncMongoClient.close() is a coroutine that must be awaited. Using stack.callback() will not await the coroutine, causing a resource leak and contributing to the MyPy type checking failure.

🔎 Proposed fix
         await stack.enter_async_context(coordinator)
         stack.push_async_callback(idem_manager.close)
         stack.push_async_callback(r.aclose)
-        stack.callback(db_client.close)
+        stack.push_async_callback(db_client.close)
🧹 Nitpick comments (4)
backend/tests/integration/events/test_event_store.py (1)

5-10: Import ordering: third-party import after local imports.

The pymongo import at line 10 is placed after the app.* imports. Standard Python convention (PEP 8) groups imports as: stdlib → third-party → local. Consider moving the pymongo import before the app imports for consistency.

🔎 Suggested reordering
 from datetime import datetime, timezone, timedelta
 
 import pytest
+from pymongo.asynchronous.database import AsyncDatabase as AsyncIOMotorDatabase
 
 from app.events.event_store import EventStore
 from app.events.schema.schema_registry import SchemaRegistryManager
 from app.infrastructure.kafka.events.metadata import AvroEventMetadata
 from app.infrastructure.kafka.events.pod import PodCreatedEvent
 from app.infrastructure.kafka.events.user import UserLoggedInEvent
-from pymongo.asynchronous.database import AsyncDatabase as AsyncIOMotorDatabase
backend/tests/integration/conftest.py (1)

10-24: Consider extracting duplicate cleanup logic.

The pre-test (lines 11-15) and post-test (lines 20-24) cleanup logic is identical. Extracting to a helper improves maintainability.

🔎 Suggested refactor
+async def _do_cleanup(db: AsyncIOMotorDatabase, redis_client: redis.Redis) -> None:
+    collections = await db.list_collection_names()
+    for name in collections:
+        if not name.startswith("system."):
+            await db.drop_collection(name)
+    await redis_client.flushdb()
+
+
 @pytest_asyncio.fixture(scope="function", autouse=True)
 async def _cleanup(db: AsyncIOMotorDatabase, redis_client: redis.Redis):
     """Clean DB and Redis before/after each integration test."""
-    # Pre-test cleanup
-    collections = await db.list_collection_names()
-    for name in collections:
-        if not name.startswith("system."):
-            await db.drop_collection(name)
-    await redis_client.flushdb()
+    await _do_cleanup(db, redis_client)
 
     yield
 
-    # Post-test cleanup
-    collections = await db.list_collection_names()
-    for name in collections:
-        if not name.startswith("system."):
-            await db.drop_collection(name)
-    await redis_client.flushdb()
+    await _do_cleanup(db, redis_client)
.github/workflows/backend-ci.yml (2)

73-112: Consider extracting Docker caching to a composite action.

The Docker image caching logic (cache restore, parallel pull, zstd save) is duplicated between the integration and e2e jobs. Extracting this to a composite action would reduce duplication and simplify maintenance.

Also applies to: 185-224


85-88: Background docker load errors may be silently ignored.

When loading images in parallel with &, if any docker load command fails, the error is not captured before wait. Consider adding error handling or using set -e at the script start to fail on first error.

🔎 Suggested fix
       - name: Load cached Docker images
         if: steps.docker-cache.outputs.cache-hit == 'true'
         run: |
+          set -e
           echo "Loading cached images..."
+          pids=()
           for f in /tmp/docker-cache/*.tar.zst; do
            zstd -d -c "$f" | docker load &
+            pids+=($!)
           done
-          wait
+          for pid in "${pids[@]}"; do
+            wait "$pid" || exit 1
+          done
           docker images
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ec7474 and a5949f5.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (36)
  • .github/workflows/backend-ci.yml
  • backend/app/core/database_context.py
  • backend/app/services/coordinator/coordinator.py
  • backend/app/services/k8s_worker/worker.py
  • backend/pyproject.toml
  • backend/scripts/seed_users.py
  • backend/tests/conftest.py
  • backend/tests/fixtures/real_services.py
  • backend/tests/integration/app/test_main_app.py
  • backend/tests/integration/conftest.py
  • backend/tests/integration/core/test_container.py
  • backend/tests/integration/core/test_database_context.py
  • backend/tests/integration/db/repositories/test_admin_events_repository.py
  • backend/tests/integration/db/repositories/test_admin_settings_repository.py
  • backend/tests/integration/db/repositories/test_admin_user_repository.py
  • backend/tests/integration/db/repositories/test_dlq_repository.py
  • backend/tests/integration/db/repositories/test_event_repository.py
  • backend/tests/integration/db/repositories/test_notification_repository.py
  • backend/tests/integration/db/repositories/test_replay_repository.py
  • backend/tests/integration/db/repositories/test_saga_repository.py
  • backend/tests/integration/db/repositories/test_saved_script_repository.py
  • backend/tests/integration/db/repositories/test_sse_repository.py
  • backend/tests/integration/db/repositories/test_user_repository.py
  • backend/tests/integration/db/repositories/test_user_settings_repository.py
  • backend/tests/integration/events/test_event_store.py
  • backend/tests/integration/events/test_event_store_consumer.py
  • backend/tests/integration/events/test_event_store_consumer_flush_e2e.py
  • backend/tests/integration/events/test_event_store_e2e.py
  • backend/tests/integration/k8s/test_k8s_worker_create_pod.py
  • backend/tests/integration/result_processor/test_result_processor.py
  • backend/tests/integration/services/admin/test_admin_user_service.py
  • backend/tests/integration/services/saved_script/test_saved_script_service.py
  • backend/tests/integration/services/sse/test_redis_bus.py
  • backend/workers/dlq_processor.py
  • backend/workers/run_event_replay.py
  • backend/workers/run_saga_orchestrator.py
💤 Files with no reviewable changes (1)
  • backend/pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • backend/tests/integration/db/repositories/test_admin_user_repository.py
🧰 Additional context used
🧬 Code graph analysis (13)
backend/tests/integration/events/test_event_store_consumer.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/events/test_event_store.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/services/sse/test_redis_bus.py (3)
backend/app/schemas_pydantic/sse.py (2)
  • RedisSSEMessage (63-68)
  • RedisNotificationMessage (102-112)
backend/app/dlq/models.py (1)
  • event_type (93-95)
backend/app/services/coordinator/queue_manager.py (1)
  • execution_id (30-31)
backend/tests/integration/events/test_event_store_consumer_flush_e2e.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/events/test_event_store_e2e.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/k8s/test_k8s_worker_create_pod.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/conftest.py (3)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/conftest.py (3)
  • scope (172-174)
  • db (178-180)
  • redis_client (184-186)
backend/tests/unit/conftest.py (3)
  • _cleanup (13-15)
  • db (19-20)
  • redis_client (24-25)
backend/tests/integration/services/admin/test_admin_user_service.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/result_processor/test_result_processor.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/conftest.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/core/test_database_context.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/core/test_container.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/fixtures/real_services.py (1)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 138-139: Basic Auth Credentials

(CKV_SECRET_4)

🪛 GitHub Actions: MyPy Type Checking
backend/tests/integration/db/repositories/test_sse_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/events/test_event_store_consumer.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_user_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_saved_script_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/events/test_event_store.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

.github/workflows/backend-ci.yml

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/services/sse/test_redis_bus.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_admin_events_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/events/test_event_store_consumer_flush_e2e.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_event_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_admin_settings_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/workers/run_saga_orchestrator.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'


[error] 114-114: Value of type "Coroutine[Any, Any, None]" must be used [unused-coroutine]. Are you missing an await?

backend/tests/integration/app/test_main_app.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/workers/run_event_replay.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/app/core/database_context.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'


[error] 170-170: Incompatible types in "await" (actual type "AsyncClientSession", expected type "Awaitable[Any]") [misc]

backend/app/services/coordinator/coordinator.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_replay_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/events/test_event_store_e2e.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/k8s/test_k8s_worker_create_pod.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/conftest.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/services/admin/test_admin_user_service.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_user_settings_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/result_processor/test_result_processor.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/scripts/seed_users.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'


[error] 107-107: Value of type "Coroutine[Any, Any, None]" must be used [unused-coroutine]. Are you missing an await?

backend/tests/conftest.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_dlq_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/services/saved_script/test_saved_script_service.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/core/test_database_context.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/app/services/k8s_worker/worker.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/core/test_container.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_notification_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/workers/dlq_processor.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/fixtures/real_services.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

backend/tests/integration/db/repositories/test_saga_repository.py

[error] 1-1: mypy check failed in CI step: 'uv run mypy --config-file pyproject.toml --strict .'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Scan Backend
  • GitHub Check: Scan Frontend
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (46)
backend/tests/integration/db/repositories/test_user_repository.py (1)

8-8: LGTM! Correct reclassification from unit to integration test.

This change appropriately reflects that the test performs actual database operations, making it an integration test rather than a unit test.

backend/tests/integration/db/repositories/test_saga_repository.py (2)

9-9: LGTM! Test marker updated appropriately.

The change from pytest.mark.unit to pytest.mark.integration correctly reflects that this test requires database fixtures and tests repository integration logic.


27-28: No changes needed. The return type change from list to SagaListResult is properly implemented with a correct return type annotation, and all usages across the codebase have been consistently updated to access .sagas. The assertion change at line 28 is correct.

backend/tests/integration/db/repositories/test_notification_repository.py (2)

15-15: LGTM! Correct test categorization.

The marker change from unit to integration is appropriate. These tests interact with a database through the db fixture and perform actual I/O operations on MongoDB collections, which clearly qualifies them as integration tests rather than unit tests.


1-1: The datetime.UTC import is valid for the target Python version.

The import from datetime import datetime, UTC, timedelta is syntactically correct for Python 3.12+ (UTC was added to the datetime module in Python 3.11). The usage of UTC in the code is also correct (e.g., datetime.now(UTC) on lines 50, 67, 90). If mypy is reporting a failure in this file, the issue is likely not related to this import statement. Please share the specific mypy error message from the CI pipeline to investigate further.

backend/tests/integration/k8s/test_k8s_worker_create_pod.py (2)

36-43: No action needed. The KubernetesWorker constructor already correctly accepts Database (which is AsyncDatabase[MongoDocument] from pymongo's async API). The type is properly defined and matches the test fixture being passed.

Likely an incorrect or invalid review comment.


29-29: No action required. The test file correctly retrieves the database from the dependency injection container. All integration tests consistently use this pattern (scope.get(AsyncIOMotorDatabase)), including the existing db fixture in conftest.py, which confirms that the DI container properly handles both the type alias and generic type resolution. The retrieval is functional and consistent with the codebase.

backend/tests/integration/db/repositories/test_admin_settings_repository.py (1)

6-6: LGTM! Test categorization is correct.

Reclassifying this as an integration test is appropriate since it interacts with a real database instance via the db fixture.

backend/tests/integration/db/repositories/test_saved_script_repository.py (1)

6-6: LGTM! Correct test categorization.

The integration marker properly reflects that these tests require database connectivity.

backend/tests/integration/events/test_event_store_e2e.py (1)

4-4: LGTM! Clean migration to PyMongo's async API.

The import alias maintains backward compatibility while migrating to PyMongo's asynchronous database type.

backend/tests/integration/services/sse/test_redis_bus.py (2)

7-7: LGTM! Appropriate test categorization.

Integration marker correctly reflects the Redis-based testing scope.


80-80: LGTM! Cleanup of redundant timeout arguments.

Removing the explicit timeout parameter is fine since _FakePubSub.get_message() defaults to timeout=0.5 on line 33, maintaining the same behavior.

Also applies to: 86-86, 114-114

backend/workers/run_saga_orchestrator.py (1)

20-20: LGTM! Consistent migration to PyMongo's async API.

The import and instantiation of AsyncMongoClient align with the repository-wide migration from Motor to PyMongo's asynchronous API.

Also applies to: 30-30

backend/tests/integration/db/repositories/test_replay_repository.py (1)

11-11: LGTM! Correct test categorization.

Reclassifying as integration is appropriate for tests that interact with the database and event collections.

backend/app/services/k8s_worker/worker.py (1)

12-12: LGTM! Consistent PyMongo async client migration.

The import and instantiation properly migrate to PyMongo's AsyncMongoClient.

Also applies to: 524-524

backend/workers/dlq_processor.py (2)

11-11: LGTM! Proper PyMongo async client migration.

The import and instantiation correctly use PyMongo's AsyncMongoClient.

Also applies to: 103-107


136-136: Verify async close() handling in AsyncExitStack.

Similar to worker.py and the critical issue in run_saga_orchestrator.py, stack.callback() will not await db_client.close() if close() is a coroutine function, so the client would never actually be closed. If it is, use stack.push_async_callback() instead.
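
A minimal, standalone sketch (not the project's code) of why the distinction matters when close() is a coroutine function, as it is on PyMongo's AsyncMongoClient:

import asyncio
from contextlib import AsyncExitStack


class FakeAsyncClient:
    async def close(self) -> None:  # async close, like AsyncMongoClient.close()
        print("closed")


async def main() -> None:
    client = FakeAsyncClient()
    async with AsyncExitStack() as stack:
        # stack.callback(client.close) would only create a coroutine object and
        # discard it; push_async_callback awaits it when the stack unwinds.
        stack.push_async_callback(client.close)


asyncio.run(main())  # prints "closed" during stack unwind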

backend/workers/run_event_replay.py (2)

15-15: LGTM! Migration to PyMongo async API.

The import change from Motor to PyMongo's asynchronous API is correct.


39-39: LGTM! Client instantiation is correct.

The AsyncMongoClient instantiation preserves the correct connection parameters.

backend/tests/integration/db/repositories/test_user_settings_repository.py (1)

9-9: LGTM! Correct test marker.

The pytest marker correctly reflects that this is an integration test, aligning with the file's location in tests/integration/.

backend/tests/integration/services/admin/test_admin_user_service.py (1)

4-4: LGTM! Clean migration with aliasing.

The import change migrates to PyMongo's AsyncDatabase while preserving the AsyncIOMotorDatabase symbol for backward compatibility in tests.

backend/tests/fixtures/real_services.py (1)

12-13: LGTM! Clean migration with aliasing.

The import changes migrate to PyMongo's asynchronous API while preserving Motor-compatible symbol names through aliasing.

backend/tests/integration/events/test_event_store_consumer_flush_e2e.py (1)

5-5: LGTM! Clean migration with aliasing.

The import change migrates to PyMongo's AsyncDatabase while preserving the AsyncIOMotorDatabase symbol for backward compatibility.

backend/scripts/seed_users.py (3)

22-23: LGTM! Migration to PyMongo async API.

The import changes correctly migrate to PyMongo's asynchronous API.


29-29: LGTM! Type signature updated correctly.

The function signature correctly uses PyMongo's AsyncDatabase type.


77-77: LGTM! Client instantiation is correct.

The AsyncMongoClient instantiation is correct.

backend/app/core/database_context.py (5)

7-11: LGTM! Migration to PyMongo async API.

The import changes correctly migrate to PyMongo's asynchronous API modules.


17-23: LGTM! Type aliases updated correctly.

The type aliases are correctly updated to use PyMongo's asynchronous types while maintaining clear naming.


107-118: LGTM! Client initialization is correct.

The comment accurately reflects PyMongo's automatic event loop usage, and the client instantiation preserves all necessary connection parameters.


120-127: LGTM! Error handling correctly awaits close().

The error path correctly awaits client.close() since PyMongo's close method is a coroutine.


132-137: LGTM! Disconnect correctly awaits close().

The disconnect method correctly awaits self._client.close().

backend/app/services/coordinator/coordinator.py (2)

9-9: LGTM! Migration to PyMongo async API.

The import change correctly migrates to PyMongo's asynchronous AsyncMongoClient.


503-503: LGTM! Client instantiation is correct.

The AsyncMongoClient instantiation preserves the correct connection parameters.

backend/tests/integration/db/repositories/test_event_repository.py (1)

9-9: LGTM!

The marker reclassification from unit to integration aligns with the test reorganization objectives.

backend/tests/integration/db/repositories/test_dlq_repository.py (1)

9-9: LGTM!

The marker reclassification from unit to integration is consistent with the broader test suite reorganization.

backend/tests/integration/db/repositories/test_sse_repository.py (1)

6-6: LGTM!

The marker change appropriately reclassifies this test as integration-level.

backend/tests/integration/app/test_main_app.py (1)

15-15: LGTM!

The marker update correctly categorizes this test as integration-level, consistent with the test suite reorganization.

backend/tests/integration/result_processor/test_result_processor.py (1)

6-6: Remove or verify with actual MyPy output.

The AsyncIOMotorDatabase alias is consistently used across 12 test files (conftest.py, multiple integration test files, and fixtures). Without concrete MyPy error output showing that this alias causes type checking failures, the claim that it "correlates with reported MyPy failures" cannot be verified. If MyPy does report errors specifically on this alias, provide those error messages; otherwise, this pattern appears to be intentional and working.

backend/tests/integration/core/test_database_context.py (1)

4-4: The import alias AsyncDatabase as AsyncIOMotorDatabase is correct and does not cause type-checking issues. The mypy configuration in pyproject.toml explicitly excludes all test files (exclude = '(^tests/|/tests/')), so test files are not type-checked by the pipeline. The alias is straightforward and used consistently across the test suite without complications.

Likely an incorrect or invalid review comment.

backend/tests/integration/events/test_event_store_consumer.py (1)

6-6: The import alias is not causing any MyPy failures. MyPy's configuration explicitly excludes all test files (exclude = '(^tests/|/tests/)'), so this file is not analyzed by MyPy at all. The import from pymongo.asynchronous.database import AsyncDatabase as AsyncIOMotorDatabase is valid, and the alias is used correctly and consistently throughout the file. No changes are needed.

Likely an incorrect or invalid review comment.

backend/tests/integration/db/repositories/test_admin_events_repository.py (1)

12-12: LGTM!

The test marker correctly reflects that this file is an integration test (located under tests/integration/). This aligns with the PR's goal of properly categorizing tests.

backend/tests/conftest.py (2)

14-14: LGTM!

The import migration from motor to pymongo's async API is consistent with the broader codebase migration.


189-191: Good documentation of the cleanup migration.

Clear comments explaining that autouse cleanup moved to tests/integration/conftest.py and unit tests use tests/unit/conftest.py. This helps future maintainers understand the test architecture.

.github/workflows/backend-ci.yml (3)

134-144: Hardcoded credentials are acceptable for CI test infrastructure.

The static analysis tool flagged the MongoDB credentials (CKV_SECRET_4). These are local test infrastructure credentials used only in CI, which is a common and acceptable pattern. No action needed.


243-251: k3s setup looks solid.

The Kubernetes setup with timeout, namespace creation using dry-run with apply (idempotent), and proper KUBECONFIG handling is well-designed for CI reliability.


284-296: Good failure diagnostics collection.

Collecting docker-compose logs, k8s events, and pod descriptions on failure provides excellent debugging context. The || true prevents log collection failures from masking the actual test failure.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/tests/fixtures/real_services.py (2)

101-101: Missing await on AsyncMongoClient.close().

Line 101 calls self.mongo_client.close() without await. Since the type changed from Motor's AsyncIOMotorClient to PyMongo's AsyncMongoClient (lines 26 and 34), the close() method now returns a coroutine that must be awaited.

🔎 Proposed fix
-            self.mongo_client.close()
+            await self.mongo_client.close()

168-169: Fix database_context.py to follow PyMongo async API conventions.

The patterns differ because backend/app/core/database_context.py (lines 170-171) uses an incorrect async pattern:

async with self.client.start_session() as session:
    async with await session.start_transaction():

This violates PyMongo async API requirements: start_session() must be awaited, but start_transaction() is not awaitable. The correct pattern (already used in backend/tests/fixtures/real_services.py lines 168-169) is:

async with await client.start_session() as session:
    async with session.start_transaction():

Update database_context.py to match this correct pattern.

♻️ Duplicate comments (1)
backend/tests/integration/k8s/test_k8s_worker_create_pod.py (1)

12-12: LGTM! Test fixture correctly updated to Database type.

The migration from AsyncIOMotorDatabase to Database aligns with the DI container's type registrations and the broader codebase migration to PyMongo async types.

Also applies to: 29-29

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a5949f5 and e5f43ef.

📒 Files selected for processing (21)
  • backend/app/core/database_context.py
  • backend/app/db/repositories/admin/admin_events_repository.py
  • backend/app/db/repositories/dlq_repository.py
  • backend/app/db/repositories/event_repository.py
  • backend/app/db/repositories/execution_repository.py
  • backend/app/db/repositories/notification_repository.py
  • backend/app/db/repositories/saga_repository.py
  • backend/app/dlq/manager.py
  • backend/app/events/event_store.py
  • backend/tests/conftest.py
  • backend/tests/fixtures/real_services.py
  • backend/tests/integration/conftest.py
  • backend/tests/integration/core/test_container.py
  • backend/tests/integration/core/test_database_context.py
  • backend/tests/integration/events/test_event_store.py
  • backend/tests/integration/events/test_event_store_consumer.py
  • backend/tests/integration/events/test_event_store_consumer_flush_e2e.py
  • backend/tests/integration/events/test_event_store_e2e.py
  • backend/tests/integration/k8s/test_k8s_worker_create_pod.py
  • backend/tests/integration/result_processor/test_result_processor.py
  • backend/tests/integration/services/admin/test_admin_user_service.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/db/repositories/notification_repository.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • backend/tests/integration/events/test_event_store.py
  • backend/tests/integration/conftest.py
  • backend/tests/integration/events/test_event_store_consumer_flush_e2e.py
  • backend/tests/integration/core/test_container.py
🧰 Additional context used
🧬 Code graph analysis (9)
backend/tests/integration/services/admin/test_admin_user_service.py (1)
backend/tests/conftest.py (3)
  • app (130-138)
  • db (178-180)
  • scope (172-174)
backend/app/db/repositories/admin/admin_events_repository.py (2)
backend/app/domain/events/query_builders.py (3)
  • EventStatsAggregation (59-125)
  • build_hourly_events_pipeline (94-101)
  • build_top_users_pipeline (104-110)
backend/app/domain/events/event_models.py (1)
  • HourlyEventCount (168-170)
backend/tests/integration/events/test_event_store_consumer.py (1)
backend/tests/conftest.py (3)
  • app (130-138)
  • db (178-180)
  • scope (172-174)
backend/tests/conftest.py (2)
backend/tests/unit/conftest.py (2)
  • app (34-35)
  • db (19-20)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/tests/integration/events/test_event_store_e2e.py (1)
backend/tests/conftest.py (3)
  • app (130-138)
  • db (178-180)
  • scope (172-174)
backend/tests/integration/result_processor/test_result_processor.py (1)
backend/tests/conftest.py (3)
  • app (130-138)
  • db (178-180)
  • scope (172-174)
backend/tests/fixtures/real_services.py (2)
backend/tests/conftest.py (4)
  • app (130-138)
  • redis_client (184-186)
  • db (178-180)
  • client (150-161)
backend/app/core/database_context.py (6)
  • db_name (72-74)
  • db_name (152-153)
  • db_name (205-206)
  • client (62-64)
  • client (140-143)
  • client (197-198)
backend/app/db/repositories/execution_repository.py (2)
backend/app/domain/execution/models.py (2)
  • DomainExecution (13-26)
  • ResourceUsageDomain (43-64)
backend/app/services/coordinator/queue_manager.py (1)
  • execution_id (30-31)
backend/tests/integration/core/test_database_context.py (1)
backend/tests/conftest.py (3)
  • app (130-138)
  • db (178-180)
  • scope (172-174)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (27)
backend/app/db/repositories/execution_repository.py (2)

133-133: LGTM: Variable extraction improves readability.

Extracting resource_usage_data to a local variable is a good refactoring that improves code clarity and matches the pattern used in get_execution (line 58).


136-136: Defensive str() coercion is acceptable.

The explicit str() coercion ensures type consistency for execution_id, which is appropriate given the Motor → PyMongo migration context. While the fallback "" is already a string, this defensive approach guards against unexpected types from the database.

backend/tests/integration/result_processor/test_result_processor.py (1)

6-6: LGTM! Clean migration to the Database abstraction.

The import and type annotation changes correctly align with the project-wide migration from Motor's AsyncIOMotorDatabase to the internal Database abstraction. The DI resolution is updated appropriately.

Also applies to: 33-33

backend/app/dlq/manager.py (1)

337-339: LGTM! Correct PyMongo async aggregation pattern.

The change correctly awaits the aggregate() call before iterating over the cursor. This aligns with PyMongo's async API where aggregate() returns a coroutine that resolves to a cursor, unlike Motor's direct cursor return.
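
A minimal sketch of the two shapes, assuming an async collection coll and an aggregation pipeline already in hand (illustrative, not the project's code):

# Motor: aggregate() returned a cursor synchronously, so iteration could start directly.
# PyMongo async: aggregate() is a coroutine that resolves to a command cursor.
async def run_pipeline(coll, pipeline):
    cursor = await coll.aggregate(pipeline)   # await first to obtain the cursor
    return [doc async for doc in cursor]      # then iterate, or: await cursor.to_list()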

backend/tests/integration/services/admin/test_admin_user_service.py (1)

4-4: LGTM! Type migration correctly applied.

The import and DI resolution changes properly migrate to the Database abstraction. The test logic remains intact.

Also applies to: 15-15

backend/tests/integration/core/test_database_context.py (1)

4-4: LGTM! Database abstraction correctly integrated.

The test properly validates the Database type resolution from the DI container, aligning with the project-wide migration.

Also applies to: 13-13

backend/tests/integration/events/test_event_store_consumer.py (1)

6-6: LGTM! Type migration consistent with the broader changes.

The import and DI resolution correctly migrate to the Database abstraction.

Also applies to: 28-28

backend/app/events/event_store.py (1)

304-304: LGTM! Correct async aggregation pattern.

The change to await aggregate() before iterating aligns with PyMongo's async API where aggregate() returns a coroutine that resolves to a cursor.

backend/app/db/repositories/saga_repository.py (3)

96-97: LGTM! Consistent async aggregation pattern.

The change correctly awaits aggregate() before iterating, matching the PyMongo async API pattern applied throughout this PR.


126-127: LGTM! Async aggregation pattern correctly applied.


137-139: LGTM! Async aggregation pattern correctly applied.

backend/app/db/repositories/dlq_repository.py (5)

38-39: LGTM! Async aggregation pattern correctly applied.

The change awaits aggregate() before iterating over the cursor, following PyMongo's async API pattern.


61-64: LGTM! Async aggregation pattern correctly applied.


74-76: LGTM! Async aggregation pattern correctly applied.


97-99: LGTM! Two-step async aggregation pattern correctly applied.

The change splits the operation into awaiting aggregate() to get the cursor, then awaiting to_list() to convert results. This is the correct pattern for PyMongo's async API.


152-167: LGTM! Async aggregation pattern correctly applied.

The change awaits aggregate() before iterating, consistent with the PyMongo async migration throughout this PR.

backend/app/db/repositories/event_repository.py (3)

237-238: LGTM! Aggregation cursor handling correctly updated for PyMongo async.

The explicit cursor handling pattern (cursor = await self._collection.aggregate(pipeline) followed by await cursor.to_list(length=1)) correctly adapts to PyMongo's async API where aggregate() must be awaited to obtain a cursor.

Also applies to: 300-301


326-328: LGTM! Change stream usage correctly updated.

The pattern async with await self._collection.watch(...) correctly reflects PyMongo's async API, where watch() returns an awaitable that resolves to an async context manager.
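
A short sketch of that shape, with an illustrative match filter (not taken from the repository):

async def tail_inserts(coll):
    # watch() must be awaited first; the result is then used as an async context manager.
    async with await coll.watch([{"$match": {"operationType": "insert"}}]) as stream:
        async for change in stream:
            yield change["fullDocument"]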


445-445: LGTM! Async iteration over aggregation cursors correctly implemented.

The pattern async for doc in await self._collection.aggregate(pipeline) properly awaits the aggregation to get the cursor before iterating.

Also applies to: 458-458

backend/app/db/repositories/admin/admin_events_repository.py (2)

123-124: LGTM! Aggregation patterns correctly standardized.

The explicit cursor handling (cursor = await self.events_collection.aggregate(pipeline) followed by await cursor.to_list(...)) correctly implements PyMongo's async API and aligns with patterns used in other repository files.

Also applies to: 144-145, 180-181


150-162: LGTM! Async iteration and defensive filtering.

The async iteration over aggregation cursors is correctly implemented, and the filter on line 161 to exclude None user_ids is good defensive coding.

backend/tests/integration/events/test_event_store_e2e.py (1)

4-4: LGTM! Test correctly migrated to Database abstraction.

The import and type annotation updates align with the Motor → PyMongo migration, using the new Database type alias from app.core.database_context.

Also applies to: 18-18

backend/tests/conftest.py (2)

14-14: LGTM! Test fixture correctly migrated to Database abstraction.

The db fixture now returns Database instead of AsyncIOMotorDatabase, aligning with the DI container's provider and the PyMongo migration.

Also applies to: 178-179


189-191: LGTM! Cleanup responsibility clarified.

Moving the autouse cleanup to tests/integration/conftest.py appropriately restricts database cleanup to integration tests only, avoiding unnecessary DB requirements for unit tests.

backend/app/core/database_context.py (2)

7-23: LGTM! Type aliases correctly migrated to PyMongo async types.

The migration from Motor types to PyMongo async equivalents is comprehensive and consistent:

  • DBClient: AsyncMongoClient[MongoDocument]
  • Database: AsyncDatabase[MongoDocument]
  • Collection: AsyncCollection[MongoDocument]
  • Cursor: AsyncCursor[MongoDocument]
  • DBSession: AsyncClientSession

107-108: LGTM! Connection lifecycle correctly updated for PyMongo async.

The removal of explicit io_loop binding (PyMongo async uses the current event loop implicitly) and the addition of await for client.close() calls correctly adapt to PyMongo's async API.

Also applies to: 126-126, 135-135
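
A compact, self-contained sketch of that lifecycle (connection string and timeout are illustrative):

from pymongo import AsyncMongoClient


async def ping_and_close(url: str = "mongodb://localhost:27017") -> None:
    client = AsyncMongoClient(url, serverSelectionTimeoutMS=5000)  # binds to the running loop implicitly
    try:
        await client.admin.command("ping")
    finally:
        await client.close()  # close() is a coroutine in the async API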

backend/tests/fixtures/real_services.py (1)

15-15: LGTM! Test fixtures correctly migrated to PyMongo async types.

The migration from Motor to PyMongo's AsyncMongoClient and the use of Database/DBClient type aliases align with the broader codebase migration.

Also applies to: 17-17, 26-26, 32-42

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/tests/fixtures/real_services.py (1)

337-340: Resource leak: Redis client created but never closed in retry lambda.

Similar to the MongoDB lambda above, this creates a new redis.Redis instance on each retry without closing it, causing a resource leak.

🔎 Proposed fix

Create a helper function that properly closes the Redis client:

+async def _check_redis_async(host: str, port: int):
+    """Helper to check Redis connectivity."""
+    client = redis.Redis(host=host, port=port, socket_connect_timeout=5)
+    try:
+        await client.execute_command("PING")
+    finally:
+        await client.aclose()
+
     # Check Redis
     try:
         r = redis.Redis(host="localhost", port=6379, socket_connect_timeout=5)
         await r.execute_command("PING")
         await r.aclose()
     except Exception:
         print("Starting Redis...")
         subprocess.run(["docker-compose", "up", "-d", "redis"], check=False)
         await wait_for_service(
-            lambda: redis.Redis(host="localhost", port=6379).execute_command("PING"),
+            lambda: _check_redis_async("localhost", 6379),
             service_name="Redis"
         )
♻️ Duplicate comments (1)
backend/tests/fixtures/real_services.py (1)

324-327: Resource leak: AsyncMongoClient created but never closed in retry lambda.

The lambda creates a new AsyncMongoClient instance on each retry but never closes it. While the wait loop is in a session-scoped fixture (limited impact), it's still a resource leak.

🔎 Proposed fix

Create a helper function that properly closes the client:

+async def _check_mongo_async(url: str):
+    """Helper to check MongoDB connectivity."""
+    client = AsyncMongoClient(url, serverSelectionTimeoutMS=5000)
+    try:
+        await client.admin.command("ping")
+    finally:
+        await client.close()
+
     # Check MongoDB
     try:
         client = AsyncMongoClient(
             "mongodb://root:rootpassword@localhost:27017",
             serverSelectionTimeoutMS=5000
         )
         await client.admin.command("ping")
         await client.close()
     except Exception:
         print("Starting MongoDB...")
         subprocess.run(["docker-compose", "up", "-d", "mongo"], check=False)
         await wait_for_service(
-            lambda: AsyncMongoClient("mongodb://root:rootpassword@localhost:27017").admin.command("ping"),
+            lambda: _check_mongo_async("mongodb://root:rootpassword@localhost:27017"),
             service_name="MongoDB"
         )
🧹 Nitpick comments (2)
.github/workflows/backend-ci.yml (2)

73-112: Consider extracting Docker caching logic to reduce duplication.

This Docker image caching block (lines 73-112) is duplicated in the e2e job (lines 185-224). Since both jobs use identical caching logic, consider extracting it into a composite action or reusable workflow to improve maintainability.

💡 Example: composite action structure

Create .github/actions/cache-docker-images/action.yml:

name: Cache Docker Images
description: Cache and load Docker images for CI
runs:
  using: composite
  steps:
    - name: Cache Docker images
      uses: actions/cache@v5
      id: docker-cache
      with:
        path: /tmp/docker-cache
        key: docker-${{ runner.os }}-${{ env.MONGO_IMAGE }}-${{ env.REDIS_IMAGE }}-${{ env.KAFKA_IMAGE }}-${{ env.SCHEMA_REGISTRY_IMAGE }}
    
    - name: Load cached Docker images
      if: steps.docker-cache.outputs.cache-hit == 'true'
      shell: bash
      run: |
        echo "Loading cached images..."
        for f in /tmp/docker-cache/*.tar.zst; do
          zstd -d -c "$f" | docker load &
        done
        wait
        docker images
    
    - name: Pull and save Docker images
      if: steps.docker-cache.outputs.cache-hit != 'true'
      shell: bash
      run: |
        mkdir -p /tmp/docker-cache
        echo "Pulling images in parallel..."
        docker pull $MONGO_IMAGE &
        docker pull $REDIS_IMAGE &
        docker pull $KAFKA_IMAGE &
        docker pull $SCHEMA_REGISTRY_IMAGE &
        wait
        echo "Saving images with zstd compression..."
        docker save $MONGO_IMAGE | zstd -T0 -3 > /tmp/docker-cache/mongo.tar.zst &
        docker save $REDIS_IMAGE | zstd -T0 -3 > /tmp/docker-cache/redis.tar.zst &
        docker save $KAFKA_IMAGE | zstd -T0 -3 > /tmp/docker-cache/kafka.tar.zst &
        docker save $SCHEMA_REGISTRY_IMAGE | zstd -T0 -3 > /tmp/docker-cache/schema-registry.tar.zst &
        wait
        echo "Cache size:"
        du -sh /tmp/docker-cache/

Then use it in both jobs:

- uses: ./.github/actions/cache-docker-images

243-251: Consider pinning the k3s version for consistency.

The k3s installation uses the latest version from get.k3s.io, which could lead to unexpected behavior changes or flakiness if k3s releases a breaking update. Consider pinning to a specific version for consistency with the Docker image version pinning strategy.

🔎 Example: pin k3s version
-curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--disable=traefik" sh -
+curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="v1.28.5+k3s1" INSTALL_K3S_EXEC="--disable=traefik" sh -

To find the latest stable version:

curl -s https://api.github.com/repos/k3s-io/k3s/releases/latest | jq -r .tag_name
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5f43ef and 5271c64.

📒 Files selected for processing (4)
  • .github/workflows/backend-ci.yml
  • backend/scripts/seed_users.py
  • backend/tests/fixtures/real_services.py
  • backend/workers/run_saga_orchestrator.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/scripts/seed_users.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/workers/run_saga_orchestrator.py (1)
backend/app/services/event_bus.py (1)
  • close (349-354)
backend/tests/fixtures/real_services.py (2)
backend/tests/conftest.py (4)
  • app (130-138)
  • redis_client (184-186)
  • db (178-180)
  • client (150-161)
backend/app/core/database_context.py (6)
  • db_name (72-74)
  • db_name (152-153)
  • db_name (205-206)
  • client (62-64)
  • client (140-143)
  • client (197-198)
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 138-139: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Scan Backend
  • GitHub Check: Scan Frontend
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (14)
backend/workers/run_saga_orchestrator.py (2)

114-114: Critical: Properly await async close.

Adding await to db_client.close() is essential for PyMongo's async client. Without it, the database connection may not close properly, leading to resource leaks. This change correctly aligns with async cleanup patterns used elsewhere in the finally block.


20-20: The migration from Motor's AsyncIOMotorClient to PyMongo's native AsyncMongoClient is properly implemented. PyMongo AsyncMongoClient supports both tz_aware and serverSelectionTimeoutMS constructor arguments, and the usage patterns in this file (subscript access for database selection, admin.command() calls, and close() cleanup with await) are all compatible.

backend/tests/fixtures/real_services.py (7)

15-17: LGTM! Import changes align with PyMongo migration.

The imports correctly reference the new PyMongo async client and the Database/DBClient abstractions from the updated database context.


26-26: Type annotation correctly updated to DBClient.

The change from AsyncIOMotorClient to DBClient aligns with the new database abstractions.


32-42: Method correctly migrated to PyMongo async API.

The signature returns Database, uses AsyncMongoClient with appropriate timeouts and connection limits, and properly awaits the ping command.


96-102: Cleanup correctly handles async client closure.

The addition of await self.mongo_client.close() on line 101 properly releases MongoDB client resources.


133-139: Fixture return type correctly updated.

The real_mongodb fixture now returns Database, consistent with the updated connect_mongodb method signature.


161-175: Transaction context manager correctly uses PyMongo async session API.

The parameter type is updated to Database, and the pattern async with await client.start_session() correctly handles PyMongo's async session lifecycle.
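
A compact sketch of that session lifecycle, assuming an AsyncMongoClient client and a database db with an executions collection (names illustrative, not the repository's code):

async def mark_completed(client, db, execution_id: str) -> None:
    async with await client.start_session() as session:   # start_session() is awaited
        async with session.start_transaction():           # start_transaction() is not
            await db.executions.update_one(
                {"execution_id": execution_id},
                {"$set": {"status": "completed"}},
                session=session,
            )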


194-250: TestDataFactory correctly migrated to Database abstraction.

All methods now accept Database parameters and use the correct PyMongo async collection access pattern (e.g., db.users.insert_one).

.github/workflows/backend-ci.yml (5)

18-23: LGTM! Image version pinning ensures cache consistency.

Pinning the Docker image versions at the workflow level ensures that the cache keys remain stable across runs, which directly supports the CI speed-up objectives.


45-63: PR objective achieved: COVERAGE_CORE set to sysmon.

The unit test configuration correctly implements the PR objective by setting COVERAGE_CORE: sysmon on line 48, which should improve coverage performance. The test isolation (no infrastructure dependencies) and 5-minute timeout are appropriate for unit tests.


126-129: PR objective achieved: curl polling replaced with --wait flag.

Line 128 correctly uses docker compose up -d --wait --wait-timeout 120 to replace manual curl polling, which should improve CI reliability and speed. The 120-second timeout is reasonable for the infrastructure services (MongoDB, Redis, Kafka, Schema Registry).


131-150: LGTM! Integration test configuration is correct.

The test execution correctly:

  • Excludes k8s tests with --ignore=tests/integration/k8s for this non-k8s job
  • Sets COVERAGE_CORE: sysmon (PR objective)
  • Provides appropriate environment variables for all services
  • Uses a reasonable 10-minute timeout

253-282: LGTM! E2E test execution properly configured.

The E2E test execution correctly:

  • Targets only k8s tests (tests/integration/k8s)
  • Sets COVERAGE_CORE: sysmon consistently with other jobs (PR objective)
  • Provides k8s-specific environment variables (KUBECONFIG, K8S_NAMESPACE)
  • Uses distinct coverage flag backend-e2e for proper Codecov segmentation

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
backend/tests/fixtures/real_services.py (1)

314-329: Past issue resolved: async helper properly implemented.

The check_mongo helper correctly addresses the issue flagged in the previous review. The function is now properly async, awaits the ping command, and closes the client in a finally block.

Optional: Add connectTimeoutMS for consistency

For consistency with connect_mongodb (line 37), consider adding connectTimeoutMS=5000:

     async def check_mongo() -> None:
         client = AsyncMongoClient(
             "mongodb://root:rootpassword@localhost:27017",
-            serverSelectionTimeoutMS=5000
+            serverSelectionTimeoutMS=5000,
+            connectTimeoutMS=5000
         )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5271c64 and 4c9ee1a.

📒 Files selected for processing (5)
  • backend/app/db/repositories/execution_repository.py
  • backend/tests/conftest.py
  • backend/tests/fixtures/real_services.py
  • backend/tests/integration/db/schema/test_schema_manager.py
  • backend/tests/integration/services/saved_script/test_saved_script_service.py
💤 Files with no reviewable changes (1)
  • backend/tests/integration/services/saved_script/test_saved_script_service.py
🧰 Additional context used
🧬 Code graph analysis (3)
backend/tests/conftest.py (2)
backend/tests/unit/conftest.py (2)
  • app (34-35)
  • db (19-20)
backend/app/core/database_context.py (3)
  • database (67-69)
  • database (146-149)
  • database (201-202)
backend/app/db/repositories/execution_repository.py (1)
backend/app/domain/execution/models.py (2)
  • DomainExecution (13-26)
  • ResourceUsageDomain (43-64)
backend/tests/integration/db/schema/test_schema_manager.py (1)
backend/tests/conftest.py (1)
  • db (178-180)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (12)
backend/tests/integration/db/schema/test_schema_manager.py (1)

31-32: LGTM! Explicit cursor handling aligns with PyMongo async migration.

The two-step pattern (await list_indexes() to obtain cursor, then await cursor.to_list()) correctly follows PyMongo async conventions and is consistent with the repo-wide migration from Motor to PyMongo async types.
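
The same two-step shape as a sketch over an arbitrary async collection coll (illustrative, not the test's code):

async def index_names(coll) -> list[str]:
    cursor = await coll.list_indexes()      # await to obtain the cursor
    indexes = await cursor.to_list()        # then materialize it
    return [ix["name"] for ix in indexes]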

backend/app/db/repositories/execution_repository.py (1)

133-133: LGTM! Consistent resource_usage handling.

The local variable extraction at line 133 improves readability and eliminates duplicate doc.get("resource_usage") calls. Lines 146-147 now match the pattern at line 70 in get_execution, creating consistency across both methods.

Also applies to: 146-147

backend/tests/fixtures/real_services.py (5)

32-42: LGTM: MongoDB connection properly implemented.

The async connection logic with timeout settings and ping verification is correct. The maxPoolSize=10 is reasonable for integration tests, though a lower value like 5 could reduce resource usage if multiple test processes run concurrently.


101-101: LGTM: Proper async cleanup.

Correctly awaits the async close() method for PyMongo's async client.


134-134: LGTM: Type annotations consistently updated.

All function signatures and return types have been correctly updated to use Database from the new PyMongo async client instead of Motor's AsyncIOMotorDatabase.

Also applies to: 162-162, 195-195, 215-215, 234-234


332-344: LGTM: Redis readiness check properly implemented.

The check_redis helper correctly uses the async Redis client from redis.asyncio, properly awaits the ping command, and closes the connection in a finally block. This follows the same pattern as check_mongo.


15-17: The PyMongo async migration is correctly implemented. The type aliases DBClient and Database are properly defined in app.core.database_context as AsyncMongoClient[MongoDocument] and AsyncDatabase[MongoDocument] respectively, and the fixture's usage is consistent with the codebase patterns. No compatibility issues exist.

backend/tests/conftest.py (5)

14-14: LGTM: Database abstraction import.

The migration from motor's AsyncIOMotorDatabase to the custom Database abstraction is appropriate for the test infrastructure refactoring.


141-145: LGTM: Consistent session scope for app_container.

The session-scoped lifecycle correctly matches the app fixture dependency.


129-138: Session-scoped app fixture is appropriate and safe.

The fixture eliminates repeated app creation and resolves Pydantic schema issues. Tests do not mutate app state (dependency overrides, middleware, or router configuration), so the session scope introduces no test isolation risk. The container is properly cleaned up on teardown.


189-191: Good separation: cleanup moved to integration tests.

Removing the autouse cleanup from the base conftest.py is appropriate since unit tests should not access DB/Redis (enforced by unit/conftest.py raising RuntimeError). The integration/conftest.py properly implements the per-test cleanup fixture with autouse=True and function scope, handling both pre-test and post-test cleanup of MongoDB collections and Redis.
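
A hypothetical sketch of that guard pattern (fixture names mirror the ones mentioned above; the actual unit conftest may differ):

import pytest


@pytest.fixture()
def db():
    raise RuntimeError("unit tests must not touch MongoDB; move the test to tests/integration/")


@pytest.fixture()
def redis_client():
    raise RuntimeError("unit tests must not touch Redis; move the test to tests/integration/")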


178-180: Database type migration to PyMongo verified; test compatibility confirmed.

The db fixture now resolves PyMongo's AsyncDatabase[MongoDocument] type (aliased as Database) instead of Motor's AsyncIOMotorDatabase. Integration tests already use only standard PyMongo async methods (find().to_list(), list_collection_names(), drop_collection()), which are fully compatible with the new type.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/tests/conftest.py (1)

225-259: Consider the performance impact of function-scoped user fixtures.

These fixtures changed from scope="session" to scope="function", meaning each test that uses them will perform user registration and login HTTP requests. Given the comment on line 228 ("DB is wiped after each test"), this re-creation is necessary for correctness.

However, if many tests depend on these fixtures, consider:

  • Grouping tests that share user state into test classes with class-scoped users
  • Using session-scoped users for read-only tests that don't modify user state
  • Implementing a user fixture cache with lazy DB cleanup only when needed

The current approach prioritizes test isolation over speed, which is reasonable as long as overall test runtime stays acceptable.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 702b60e and 2797283.

📒 Files selected for processing (4)
  • backend/tests/conftest.py
  • backend/tests/fixtures/real_services.py
  • backend/tests/integration/conftest.py
  • backend/tests/integration/test_user_settings_routes.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/integration/conftest.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/tests/conftest.py (2)
backend/tests/unit/conftest.py (4)
  • app (34-35)
  • db (19-20)
  • redis_client (24-25)
  • client (29-30)
backend/app/core/database_context.py (6)
  • database (67-69)
  • database (146-149)
  • database (201-202)
  • client (62-64)
  • client (140-143)
  • client (197-198)
backend/tests/fixtures/real_services.py (1)
backend/app/core/database_context.py (6)
  • db_name (72-74)
  • db_name (152-153)
  • db_name (205-206)
  • client (62-64)
  • client (140-143)
  • client (197-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (14)
backend/tests/fixtures/real_services.py (6)

15-17: LGTM: Migration to PyMongo async types.

The migration from Motor to PyMongo's AsyncMongoClient and the use of Database/DBClient types from the database context aligns with the broader refactoring mentioned in the PR summary and relevant code snippets.

Also applies to: 26-26


32-42: LGTM: MongoDB connection setup is correct.

The connection parameters, ping verification, and database access pattern are appropriate. The return type aligns with the new Database abstraction.


96-112: LGTM: Proper async cleanup.

Line 101 correctly awaits mongo_client.close(), which is required for AsyncMongoClient.


161-175: LGTM: Transaction management correctly migrated to Database type.

The db.client access is valid (as confirmed by the database_context.py snippets), and the transaction logic properly handles commits and rollbacks.


195-195: LGTM: TestDataFactory methods correctly updated to Database type.

The signature changes from AsyncIOMotorDatabase to Database align with the migration, and the database access patterns remain correct.

Also applies to: 215-215, 234-234


314-345: LGTM: Previous async operation issue resolved.

The service readiness checks now use proper async functions (check_mongo at lines 314-322 and check_redis at lines 332-337) that correctly await operations and close connections. This addresses the concern raised in the previous review about synchronous lambdas not awaiting operations.

backend/tests/conftest.py (6)

129-134: Good: Session-scoped app fixture prevents schema crashes.

The explicit scope="session" combined with loop_scope="session" correctly ensures the FastAPI app is created once per test worker, avoiding Pydantic schema registration conflicts while using the session-level event loop.


141-145: LGTM: Container fixture scope matches app fixture.

The session scope for app_container is consistent with the app fixture and appropriate for the Dishka DI container.


149-161: Good: Function-scoped client ensures test isolation.

The scope="function" ensures each test gets a fresh HTTP client with isolated cookies, which is essential for auth testing. The loop_scope="session" correctly reuses the event loop for performance.


189-191: Cleanup fixture correctly relocated to integration/conftest.py.

The autouse _cleanup fixture exists in tests/integration/conftest.py and properly handles per-test DB and Redis cleanup, confirming the architecture described in the comment is accurate.


14-14: Type migration verified as complete.

The Database type has been properly migrated from AsyncIOMotorDatabase to a type alias Database = AsyncDatabase[MongoDocument] defined in app.core.database_context. All test files consistently use the new type with the correct dependency injection pattern (await scope.get(Database)), and no remaining Motor-specific type references exist in the codebase. All usage patterns (collection access, query methods) are compatible with the new type.
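
In sketch form, with the alias and DI call as described above (the MongoDocument definition is an assumption):

from typing import Any

import pytest_asyncio
from pymongo.asynchronous.database import AsyncDatabase

MongoDocument = dict[str, Any]            # assumption: plain-dict document type
Database = AsyncDatabase[MongoDocument]   # the alias the tests resolve from the container

@pytest_asyncio.fixture(loop_scope="session")
async def db(scope) -> Database:
    # `scope` is the per-test Dishka scope; the container resolves the alias.
    return await scope.get(Database)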


177-180: The Database type is properly registered in the Dishka container via DatabaseProvider.get_database() method, which provides AsyncDatabase[MongoDocument] (aliased as Database). The fixture correctly calls scope.get(Database) and will receive the proper database instance. This change is valid.

backend/tests/integration/test_user_settings_routes.py (2)

20-49: The original review misunderstands pytest-asyncio's loop_scope parameter.

In pytest-asyncio 1.3.0, loop_scope controls which event loop instance is used, not the fixture scope. The test_user fixture only specifies loop_scope="session" without an explicit scope parameter, which means it defaults to scope="function". This creates a fresh fixture instance for each test, not a session-wide singleton.

Additionally, the integration conftest.py includes an autouse _cleanup fixture with scope="function" that clears all database collections and flushes Redis before and after each test. This prevents any state pollution between tests.

The comment "Create a fresh user for each test" is accurate, and the CSRF token concerns are unfounded since a new token is fetched for each test.

Likely an incorrect or invalid review comment.


52-81: The fixture is function-scoped and works correctly with automatic database cleanup.

The loop_scope="session" parameter only controls the async event loop scope, not the fixture lifecycle. Without an explicit scope= parameter, pytest defaults to scope="function", meaning the fixture runs fresh for each test. The autouse _cleanup fixture (in integration/conftest.py) wipes the database before and after each test, ensuring user state is not carried over. The docstring "Create a second fresh user for isolation tests" is accurate, and no state pollution occurs.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/tests/fixtures/real_services.py (1)

308-308: Add loop_scope="session" to match scope="session".

The ensure_services_running fixture specifies scope="session" but lacks loop_scope="session". This mismatch can cause the fixture to use a function-scoped event loop with session-scoped resources, potentially leading to connection problems or event loop closure issues.

🔎 Proposed fix
-@pytest_asyncio.fixture(scope="session")
+@pytest_asyncio.fixture(scope="session", loop_scope="session")
 async def ensure_services_running():
backend/tests/conftest.py (1)

24-24: Fix typo in comment.

There's a trailing "y" character at the end of the comment.

🔎 Proposed fix
-# We import lazily inside the fixture after test env vars are set.y
+# We import lazily inside the fixture after test env vars are set.
♻️ Duplicate comments (2)
.github/workflows/backend-ci.yml (2)

81-89: Parallel Docker load operations may mask failures.

This concern was previously raised - the bare wait command only captures the exit status of the last background job. If any intermediate zstd | docker load operation fails, the step will still succeed.


91-112: Parallel Docker pull/save operations share the same error handling concern.

As noted in the previous review, the wait commands at lines 101 and 108 may mask failures from parallel operations.

🧹 Nitpick comments (4)
.github/workflows/backend-ci.yml (1)

193-224: Consider extracting Docker caching into a reusable workflow or composite action.

The Docker image caching logic (lines 186-224) is duplicated verbatim from the integration job (lines 73-112). For maintainability, consider extracting this into a reusable workflow or composite action.

This is a nice-to-have improvement and can be deferred.

backend/tests/conftest.py (3)

67-114: Consider consolidating environment setup logic.

The environment setup appears in two places: module-level (lines 28-60) and in the _test_env fixture (lines 67-114). While the duplication provides defensive guarantees, the different approaches (e.g., line 77-80 using direct assignment vs. module-level checking "MONGODB_URL" not in os.environ) could lead to confusion during maintenance.

Consider documenting why both levels are needed or consolidating to a single authoritative setup point.
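
One possible consolidation, sketched with os.environ.setdefault; the keys follow the ones mentioned in this file, the default values are placeholders:

import os

_TEST_ENV_DEFAULTS = {
    "MONGODB_URL": "mongodb://localhost:27017",  # placeholder value
    "REDIS_DB": "1",                             # placeholder value
    "DATABASE_NAME": "integr8scode_test",        # placeholder value
}

def apply_test_env_defaults() -> None:
    # setdefault keeps anything already exported (e.g. by CI) and fills the rest,
    # giving a single authoritative setup point instead of two divergent ones.
    for key, value in _TEST_ENV_DEFAULTS.items():
        os.environ.setdefault(key, value)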


225-244: Fixture naming is misleading given function scope.

The fixtures shared_user and shared_admin have names suggesting shared/session scope, but they default to function scope and recreate users via HTTP for each test. This ensures isolation after per-test DB cleanup but incurs the cost of registration + login HTTP calls on every test invocation.

Consider:

  1. Renaming to test_user and test_admin to reflect their function scope
  2. Documenting the performance trade-off in comments

129-139: Clarify the rationale for session-scoped app fixture.

The app fixture is session-scoped with a comment stating "to avoid Pydantic schema crashes," but this reason is unclear and not documented. While the function-scoped client fixture mitigates isolation concerns by providing a fresh HTTP client per test, the shared app instance should have its usage constraints documented.

Recommend:

  1. Clarify what "Pydantic schema crashes" refers to and link to any related issue if this was a workaround for a known problem
  2. Document that tests must not modify app.state or registered routes after creation
  3. Consider adding an assertion in fixture cleanup to detect unintended state mutations if this becomes a concern
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2797283 and 58ef8d7.

📒 Files selected for processing (5)
  • .github/workflows/backend-ci.yml
  • backend/pyproject.toml
  • backend/tests/conftest.py
  • backend/tests/fixtures/real_services.py
  • backend/tests/integration/conftest.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/pyproject.toml
  • backend/tests/integration/conftest.py
🧰 Additional context used
🧬 Code graph analysis (1)
backend/tests/fixtures/real_services.py (1)
backend/app/core/database_context.py (6)
  • db_name (72-74)
  • db_name (152-153)
  • db_name (205-206)
  • client (62-64)
  • client (140-143)
  • client (197-198)
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 138-139: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Scan Backend
  • GitHub Check: Scan Frontend
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (11)
.github/workflows/backend-ci.yml (5)

18-24: LGTM! Good practice pinning image versions for cache key consistency.

The centralized environment variables for Docker images ensure consistent caching across jobs and make version updates easier to manage.


26-65: LGTM! Clean unit test job setup.

The COVERAGE_CORE: sysmon setting aligns with the PR objective to speed up CI tests, and the 5-minute timeout is reasonable for unit tests.


131-151: CI test credentials are acceptable for ephemeral local services.

The static analysis tool flagged the MongoDB credentials at line 138. These are intentional test credentials for CI-only ephemeral containers and not production secrets. The approach is standard for CI integration tests.

The --wait and --wait-timeout 120 flags on docker compose are a good improvement over polling, addressing the PR objective.


243-272: K8s setup and E2E test execution look solid.

The k3s installation with timeout-based readiness check and the idempotent namespace creation pattern (--dry-run=client -o yaml | kubectl apply -f -) are well designed. The KUBECONFIG environment variable is properly set for subsequent test execution.


284-297: Good log collection strategy for debugging failures.

The defensive || true on kubectl commands ensures log collection doesn't fail even if the cluster is in a bad state. Separating into k8s-logs artifact distinguishes these from the integration job's backend-logs.

backend/tests/fixtures/real_services.py (3)

314-329: Previous async operation concern has been resolved.

The earlier issue about the lambda not properly awaiting MongoDB operations has been fixed. The check_mongo function is now a proper async function that awaits the ping command and ensures the client is closed in a finally block.


15-17: LGTM: Clean migration from Motor to PyMongo AsyncMongoClient.

The migration from AsyncIOMotorClient to AsyncMongoClient with the new Database and DBClient type aliases is implemented consistently throughout the file. The connection setup includes appropriate timeouts and pool settings, proper ping verification, and correct cleanup with await client.close().

Also applies to: 26-26, 32-42, 101-101, 134-134, 162-162, 167-167, 195-195, 215-215, 234-234


332-344: LGTM: Redis readiness check properly implemented.

The check_redis function correctly implements async operations with proper resource cleanup in the finally block, matching the pattern used for MongoDB.
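
The shape of such a check, sketched against redis-py 5's asyncio client (an assumption about what the fixture uses; the timeout is illustrative):

import redis.asyncio as aioredis

async def check_redis(redis_url: str) -> bool:
    client = aioredis.from_url(redis_url, socket_connect_timeout=2)
    try:
        return bool(await client.ping())
    except Exception:
        return False
    finally:
        await client.aclose()  # always release the connection, even on failure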

backend/tests/conftest.py (3)

177-181: LGTM!

The migration from Motor's AsyncIOMotorDatabase to the custom Database type is implemented correctly, and function-scoping ensures proper test isolation.


183-187: Inconsistency between code and AI summary.

The AI summary indicates loop_scope="session" was added to the Redis fixture, but this parameter is not present in the actual code. The fixture is function-scoped, which is appropriate for test isolation.


189-191: Cleanup implementation verified and correct. The tests/integration/conftest.py file contains a properly implemented _cleanup fixture that drops all non-system database collections and flushes Redis before each integration test. Unit tests are appropriately isolated with a no-op fixture and explicit RuntimeError raises for infrastructure fixtures, preventing inadvertent DB/Redis access.
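
The guard pattern described there looks roughly like this (a sketch, not the exact conftest contents):

import pytest

@pytest.fixture
def db():
    raise RuntimeError("unit tests must not access MongoDB")

@pytest.fixture
def redis_client():
    raise RuntimeError("unit tests must not access Redis")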

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/pyproject.toml (1)

188-210: Missing coverage_core=sysmon configuration mentioned in PR objectives.

The PR description states "Set coverage_core=sysmon" but this setting is absent from the configuration. It should be added to the [tool.coverage.run] section in backend/pyproject.toml to enable faster coverage measurement via Python's sys.monitoring hooks (core = "sysmon") instead of the default trace-function core.

Add:

[tool.coverage.run]
core = "sysmon"
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 58ef8d7 and 144bbe5.

📒 Files selected for processing (1)
  • backend/pyproject.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
  • GitHub Check: E2E Tests
🔇 Additional comments (2)
backend/pyproject.toml (2)

6-125: Motor-to-PyMongo migration is complete.

Verification confirms the removal of motor==3.6.0 from dependencies and no remaining Motor imports in the codebase. The migration to PyMongo's native async API is successful with no runtime errors expected.


The session-scoped event loop configuration is acceptable as designed. The codebase already implements robust test isolation through unique MongoDB database names per session/worker (e.g., integr8scode_test_{session_id}_{worker_id}) and separate Redis databases per worker, so tests do not share mutable state despite sharing the event loop.

Additionally, the integration test cleanup strategy in integration/conftest.py demonstrates that your team is already aware of potential event loop lifecycle issues and has intentionally designed around them by using pre-test cleanup only—a sound approach that avoids "Event loop is closed" errors when streaming/SSE tests hold connections across loop boundaries.

All async fixtures properly use context managers or yield patterns, ensuring resource cleanup at appropriate lifecycle points. No further action needed.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
backend/tests/integration/test_replay_routes.py (1)

394-438: Potential issue: test_admin fixture may not contain 'user_id' field.

Line 419 references test_admin.get("user_id"), but the test_admin fixture (defined in backend/tests/conftest.py:244-252) returns a dict with keys: username, email, password, csrf_token, and headers. It doesn't include user_id.

This will likely result in None being passed to the user_id filter, which may cause the test to behave unexpectedly or fail.

🔎 Verify if user_id should be retrieved separately

If user_id is needed, you may need to fetch it from the user profile or authentication response after login, rather than from the fixture dict.

backend/tests/integration/test_execution_routes.py (1)

526-528: Remove duplicate assertion.

Line 527 duplicates the assertion from line 526.

🔎 Proposed fix
         # Use idempotency header on both requests to guarantee keying
         r1 = await client.post("/api/v1/execute", json=execution_request, headers=headers)
         assert r1.status_code == 200
-        assert r1.status_code == 200
         e1 = r1.json()["execution_id"]
🧹 Nitpick comments (8)
backend/tests/integration/test_events_routes.py (6)

112-118: Suggest removing redundant login call.

The test_user fixture already handles authentication and returns a dictionary with csrf_token and headers. The manual login call here is redundant and adds unnecessary overhead to the test execution time, which contradicts the PR's goal of speeding up CI tests.

🔎 Proposed refactor to remove redundant login
     async def test_get_execution_events(self, client: AsyncClient, test_user: Dict[str, str]) -> None:
         """Test getting events for a specific execution."""
-        # Login first
-        login_data = {
-            "username": test_user["username"],
-            "password": test_user["password"]
-        }
-        login_response = await client.post("/api/v1/auth/login", data=login_data)
-        assert login_response.status_code == 200
-
         # Create an execution
         execution_request = {

152-158: Suggest removing redundant login call.

The test_user fixture already authenticates the client. This manual re-login is unnecessary and slows down test execution.

🔎 Proposed refactor
     async def test_query_events_advanced(self, client: AsyncClient, test_user: Dict[str, str]) -> None:
         """Test advanced event querying with filters."""
-        # Login first
-        login_data = {
-            "username": test_user["username"],
-            "password": test_user["password"]
-        }
-        login_response = await client.post("/api/v1/auth/login", data=login_data)
-        assert login_response.status_code == 200
-
         # Query events with multiple filters

196-202: Remove redundant login calls across multiple test methods.

These test methods manually re-authenticate even though the test_user fixture already handles authentication. Removing these redundant calls would improve test execution speed, aligning with the PR's CI speed-up objective.

Affected methods:

  • test_get_events_by_correlation_id (lines 196-202)
  • test_get_current_request_events (lines 236-242)
  • test_get_event_statistics (lines 258-264)
  • test_get_single_event (lines 294-300)
  • test_get_nonexistent_event (lines 325-331)
  • test_list_event_types (lines 345-351)
  • test_publish_custom_event_requires_admin (lines 375-382)
  • test_aggregate_events (lines 439-445)
  • test_delete_event_requires_admin (lines 474-480)
  • test_replay_aggregate_events_requires_admin (lines 491-497)
  • test_event_pagination (lines 548-554)

Also applies to: 236-242, 258-264, 294-300, 325-331, 345-351, 375-382, 439-445, 474-480, 491-497, 548-554


403-408: Remove redundant login call in admin test.

The test_admin fixture already handles authentication. This manual re-login is unnecessary.

🔎 Proposed refactor
     async def test_publish_custom_event_as_admin(self, client: AsyncClient, test_admin: Dict[str, str]) -> None:
         """Test publishing custom events as admin."""
-        # Login as admin
-        login_data = {
-            "username": test_admin["username"],
-            "password": test_admin["password"]
-        }
-        login_response = await client.post("/api/v1/auth/login", data=login_data)
-        assert login_response.status_code == 200
-
         # Publish custom event (requires Kafka); skip if not available

507-513: Remove redundant admin login call.

The test_admin fixture already authenticates the client.

🔎 Proposed refactor
     async def test_replay_aggregate_events_dry_run(self, client: AsyncClient, test_admin: Dict[str, str]) -> None:
         """Test replaying events in dry-run mode."""
-        # Login as admin
-        login_data = {
-            "username": test_admin["username"],
-            "password": test_admin["password"]
-        }
-        login_response = await client.post("/api/v1/auth/login", data=login_data)
-        assert login_response.status_code == 200
-
         # Get an existing aggregate ID from events

589-608: Remove redundant login calls for both users.

Both test_user and test_admin fixtures are already authenticated. These manual logins are unnecessary.

🔎 Proposed refactor
     async def test_events_isolation_between_users(self, client: AsyncClient,
                                                   test_user: Dict[str, str],
                                                   test_admin: Dict[str, str]) -> None:
         """Test that events are properly isolated between users."""
         # Get events as regular user
-        user_login_data = {
-            "username": test_user["username"],
-            "password": test_user["password"]
-        }
-        user_login_response = await client.post("/api/v1/auth/login", data=user_login_data)
-        assert user_login_response.status_code == 200
-
         user_events_response = await client.get("/api/v1/events/user?limit=10")
         assert user_events_response.status_code == 200
 
         user_events = user_events_response.json()
         user_event_ids = [e["event_id"] for e in user_events["events"]]
 
         # Get events as admin (without include_all_users flag)
-        admin_login_data = {
-            "username": test_admin["username"],
-            "password": test_admin["password"]
-        }
-        admin_login_response = await client.post("/api/v1/auth/login", data=admin_login_data)
-        assert admin_login_response.status_code == 200
-
         admin_events_response = await client.get("/api/v1/events/user?limit=10")
backend/tests/integration/test_execution_routes.py (2)

58-64: Remove redundant login call.

The test_user fixture already handles authentication. This manual re-login is unnecessary and adds overhead to test execution, which contradicts the PR's CI speed-up objective.

🔎 Proposed refactor
     async def test_execute_simple_python_script(self, client: AsyncClient, test_user: Dict[str, str]) -> None:
         """Test executing a simple Python script."""
-        # Login first
-        login_data = {
-            "username": test_user["username"],
-            "password": test_user["password"]
-        }
-        login_response = await client.post("/api/v1/auth/login", data=login_data)
-        assert login_response.status_code == 200
-
         # Execute script

102-108: Remove redundant login calls across multiple test methods.

These test methods manually re-authenticate even though the test_user fixture already handles authentication. Removing these redundant calls would significantly improve test execution speed.

Affected methods:

  • test_get_execution_result (lines 102-108)
  • test_execute_with_error (lines 143-149)
  • test_execute_with_resource_tracking (lines 169-175)
  • test_execute_with_different_language_versions (lines 212-218)
  • test_execute_with_large_output (lines 246-252)
  • test_cancel_running_execution (lines 285-291)
  • test_execution_with_timeout (lines 337-343)
  • test_sandbox_restrictions (lines 370-376)
  • test_concurrent_executions_by_same_user (lines 427-433)

Also applies to: 143-149, 169-175, 212-218, 246-252, 285-291, 337-343, 370-376, 427-433

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 144bbe5 and 25338cf.

📒 Files selected for processing (14)
  • .github/actions/docker-cache/action.yml
  • .github/workflows/backend-ci.yml
  • backend/pyproject.toml
  • backend/tests/conftest.py
  • backend/tests/integration/test_admin_routes.py
  • backend/tests/integration/test_dlq_routes.py
  • backend/tests/integration/test_events_routes.py
  • backend/tests/integration/test_execution_routes.py
  • backend/tests/integration/test_health_routes.py
  • backend/tests/integration/test_notifications_routes.py
  • backend/tests/integration/test_replay_routes.py
  • backend/tests/integration/test_saga_routes.py
  • backend/tests/integration/test_saved_scripts_routes.py
  • backend/tests/integration/test_sse_routes.py
✅ Files skipped from review due to trivial changes (1)
  • backend/tests/integration/test_admin_routes.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/pyproject.toml
🧰 Additional context used
🧬 Code graph analysis (10)
backend/tests/integration/test_saga_routes.py (1)
backend/tests/conftest.py (3)
  • client (158-169)
  • test_user (234-241)
  • another_user (256-268)
backend/tests/integration/test_sse_routes.py (3)
backend/tests/conftest.py (2)
  • client (158-169)
  • test_user (234-241)
backend/tests/integration/test_user_settings_routes.py (1)
  • test_user (21-49)
backend/app/schemas_pydantic/sse.py (1)
  • SSEHealthResponse (126-136)
backend/tests/integration/test_health_routes.py (1)
backend/tests/conftest.py (2)
  • client (158-169)
  • test_user (234-241)
backend/tests/integration/test_notifications_routes.py (1)
backend/tests/conftest.py (2)
  • client (158-169)
  • test_admin (245-252)
backend/tests/integration/test_saved_scripts_routes.py (1)
backend/tests/conftest.py (3)
  • client (158-169)
  • test_user (234-241)
  • test_admin (245-252)
backend/tests/integration/test_events_routes.py (1)
backend/tests/conftest.py (3)
  • client (158-169)
  • test_user (234-241)
  • test_admin (245-252)
backend/tests/integration/test_replay_routes.py (1)
backend/tests/conftest.py (3)
  • client (158-169)
  • test_user (234-241)
  • test_admin (245-252)
backend/tests/conftest.py (2)
backend/tests/unit/conftest.py (4)
  • app (34-35)
  • db (19-20)
  • redis_client (24-25)
  • client (29-30)
backend/app/core/database_context.py (6)
  • database (67-69)
  • database (146-149)
  • database (201-202)
  • client (62-64)
  • client (140-143)
  • client (197-198)
backend/tests/integration/test_dlq_routes.py (1)
backend/tests/conftest.py (2)
  • client (158-169)
  • test_user (234-241)
backend/tests/integration/test_execution_routes.py (1)
backend/tests/conftest.py (2)
  • client (158-169)
  • test_user (234-241)
🪛 Checkov (3.2.334)
.github/workflows/backend-ci.yml

[medium] 100-101: Basic Auth Credentials

(CKV_SECRET_4)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Scan Backend
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (18)
.github/actions/docker-cache/action.yml (1)

12-18: LGTM on cache key generation.

The cache key derivation using sorted images and md5 hashing ensures consistent cache hits regardless of input ordering.
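
The same idea rendered in Python for clarity (the action itself computes this in shell; the key prefix here is illustrative):

import hashlib

def docker_cache_key(images: list[str]) -> str:
    # Sorting first makes the key independent of the order the images are listed in.
    digest = hashlib.md5("\n".join(sorted(images)).encode()).hexdigest()
    return f"docker-images-{digest}"

# docker_cache_key(["redis:7", "mongo:8"]) == docker_cache_key(["mongo:8", "redis:7"])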

.github/workflows/backend-ci.yml (6)

26-62: LGTM on Unit Tests job.

The job is well-structured with proper UV caching, timeout, coverage collection, and Codecov upload with if: always() to ensure coverage is reported even on test failures.


88-91: Good use of --wait flag for service readiness.

Using docker compose up --wait --wait-timeout 120 ensures services are healthy before tests run, which is more reliable than curl polling.


96-105: Test credentials are appropriately scoped.

The static analysis hint (CKV_SECRET_4) about Basic Auth credentials is a false positive—these are ephemeral CI test credentials with no production impact. The environment variables are properly scoped to the test step only.


168-176: Verify kubectl access for log collection step.

The k3s setup configures kubeconfig at /home/runner/.kube/config, but line 174's export KUBECONFIG=... won't persist beyond this step. The test step (lines 189-190) correctly sets KUBECONFIG in its env block, but the log collection step (lines 213-214) doesn't set KUBECONFIG, which may cause kubectl commands to fail.

🔎 Proposed fix to ensure kubectl works in log collection
       - name: Collect logs
         if: failure()
+        env:
+          KUBECONFIG: /home/runner/.kube/config
         run: |
           mkdir -p logs
           docker compose -f docker-compose.ci.yaml logs > logs/docker-compose.log 2>&1
           kubectl get events --sort-by='.metadata.creationTimestamp' -A > logs/k8s-events.log 2>&1 || true
           kubectl describe pods -A > logs/k8s-describe-pods.log 2>&1 || true

The || true guards prevent step failure, but logs won't be captured if kubectl can't find the kubeconfig.


107-111: Good separation of test scopes.

Integration tests correctly ignore tests/integration/k8s while E2E tests specifically target that directory. This ensures proper coverage separation without overlap.


71-74: Docker cache action integration is clean.

The action properly receives pinned image versions from workflow-level env vars, ensuring cache key consistency.

Note: Error handling for parallel operations should be addressed in the docker-cache action itself (see comments on that file).

backend/tests/integration/test_saga_routes.py (1)

28-427: LGTM: Fixture renamings are consistent.

All fixture references have been correctly updated from shared_user to test_user, and the login flows properly use the new fixture's credentials. The changes are mechanical and maintain test functionality.

Note: Some tests comment "Already authenticated via test_user fixture" but then explicitly call /api/v1/auth/login again (e.g., lines 111-117, 137-143). This is redundant but harmless—the explicit login refreshes the session.

backend/tests/integration/test_saved_scripts_routes.py (1)

36-482: LGTM: Fixture updates are correct.

All fixture references have been systematically updated from shared_user to test_user and shared_admin to test_admin. The test logic remains unchanged, and authentication flows correctly reference the new fixtures.

backend/tests/conftest.py (3)

233-252: Clarify DB cleanup scope in fixture docstrings.

The docstrings state "Recreated each test (DB wiped between tests)" but the cleanup is implemented in backend/tests/integration/conftest.py (per lines 197-199). Unit tests using these fixtures won't have DB cleanup.

However, this is correct by design: backend/tests/unit/conftest.py blocks DB access entirely, so unit tests can't use these fixtures.


129-153: Session-scoped app fixture design is sound and properly safeguarded.

Verification confirms:

  • No tests modify app.state or registered routes
  • No background tasks or event listeners are registered in tests
  • Client fixture is function-scoped, providing clean cookies and isolation per test
  • App cleanup executes properly in fixture teardown

The session scope is justified to avoid Pydantic schema validator memory issues with pytest-xdist, and tests follow the documented constraints that prevent state pollution.


186-188: Database fixture migration is compatible with test code.

The db fixture yields AsyncDatabase[MongoDocument] from PyMongo (via app.core.database_context), which provides the same interface as Motor's AsyncIOMotorDatabase. All test operations—including .find(), .find_one(), .insert_one(), .get_collection(), indexing syntax, and cursor methods—are natively supported by PyMongo's async driver. No compatibility issues found.

backend/tests/integration/test_dlq_routes.py (1)

35-479: LGTM: Consistent fixture updates.

All occurrences of shared_user have been replaced with test_user, and login data correctly references the new fixture's credentials.

backend/tests/integration/test_health_routes.py (1)

51-86: LGTM: Fixture update is correct.

The fixture parameter and login data have been correctly updated to use test_user.

backend/tests/integration/test_sse_routes.py (1)

41-206: LGTM: Fixture parameters updated correctly.

All test signatures have been updated to use test_user instead of shared_user.

backend/tests/integration/test_replay_routes.py (1)

26-485: LGTM: Fixture renamings are consistent (aside from user_id issue).

All fixture references have been updated from shared_user/shared_admin to test_user/test_admin, and login flows correctly use the new fixture credentials.

backend/tests/integration/test_notifications_routes.py (1)

34-503: LGTM: Fixture updates are correct and consistent.

All fixture references have been systematically updated to use test_user and test_admin, with corresponding updates to login data throughout the file.

backend/tests/integration/test_execution_routes.py (1)

495-498: LGTM: Concise inline dictionary construction.

Lines 496 and 512 use inline dictionary construction for login data, which is more concise and readable compared to the multi-line format used in other tests.

Also applies to: 511-514

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/tests/integration/k8s/test_execution_routes.py (1)

497-500: Remove duplicate assertion.

Line 499 contains a duplicate assertion assert r1.status_code == 200 that appears immediately after the same check on line 498. This is likely a copy-paste error.

🔎 Proposed fix
 r1 = await client.post("/api/v1/execute", json=execution_request, headers=headers)
 assert r1.status_code == 200
-assert r1.status_code == 200
 e1 = r1.json()["execution_id"]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 25338cf and 6280365.

📒 Files selected for processing (4)
  • .github/actions/docker-cache/action.yml
  • backend/app/infrastructure/kafka/events/user.py
  • backend/app/services/user_settings_service.py
  • backend/tests/integration/k8s/test_execution_routes.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/actions/docker-cache/action.yml
🧰 Additional context used
🧬 Code graph analysis (1)
backend/app/services/user_settings_service.py (1)
backend/app/domain/user/settings_models.py (2)
  • DomainNotificationSettings (13-18)
  • DomainEditorSettings (22-28)
🪛 GitHub Actions: MyPy Type Checking
backend/app/infrastructure/kafka/events/user.py

[error] 1-1: mypy check failed for the backend: 1 error in 1 file (checked 242 source files). Command to reproduce: uv run mypy --config-file pyproject.toml --strict .

backend/app/services/user_settings_service.py

[error] 396-396: Mypy error: Name "channels" already defined on line 346 [no-redef]. Step: 'mypy --config-file pyproject.toml --strict .'

backend/tests/integration/k8s/test_execution_routes.py

[error] 1-1: mypy check failed for the backend: 1 error in 1 file (checked 242 source files). Command to reproduce: uv run mypy --config-file pyproject.toml --strict .

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (8)
backend/app/services/user_settings_service.py (4)

186-189: Avro compatibility stringification looks correct.

The approach of JSON-encoding nested dicts while stringifying primitive values ensures Avro schema compatibility. The type annotation dict[str, str] matches the updated field in UserSettingsUpdatedEvent.
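
The stringification pattern, sketched independently of the service code:

import json

def stringify_for_avro(changes: dict[str, object]) -> dict[str, str]:
    # Nested dicts become JSON strings, everything else a plain string,
    # so the payload always satisfies a dict[str, str] Avro field.
    return {k: json.dumps(v) if isinstance(v, dict) else str(v) for k, v in changes.items()}

# stringify_for_avro({"theme": "dark", "editor": {"tab_size": 4}})
# -> {"theme": "dark", "editor": '{"tab_size": 4}'}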


196-200: Payload key rename is consistent with schema.

The updated key aligns with the renamed field in UserSettingsUpdatedEvent.


343-355: Handler for notification settings events is well-structured.

The logic correctly extracts settings and channels from the payload, builds the domain model, and updates the timestamp.


374-381: parse_value helper handles JSON-string payloads gracefully.

The helper correctly parses stringified JSON or returns the value as-is, supporting both legacy and new payload formats. Defining it inside _apply_event keeps it scoped to where it's needed.
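
A sketch of what such a helper boils down to (not necessarily the exact implementation at lines 374-381):

import json

def parse_value(value: object) -> object:
    # JSON-encoded strings are decoded; everything else passes through,
    # so both stringified and already-structured payloads are accepted.
    if isinstance(value, str):
        try:
            return json.loads(value)
        except json.JSONDecodeError:
            return value
    return value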

backend/app/infrastructure/kafka/events/user.py (1)

49-55: Field rename is correctly implemented and aligned with service layer changes.

The `changes` → `updated` field rename on line 54 is properly reflected throughout the event publishing flow. Event payloads correctly pass the `updated` field, and no broken references to the old field name remain in the codebase.

backend/tests/integration/k8s/test_execution_routes.py (3)

17-17: LGTM: Clean test organization.

The class-level @pytest.mark.k8s marker is a good approach to designate all tests in this class as requiring Kubernetes infrastructure, replacing the previous per-test skip logic.


39-44: LGTM: Consistent fixture migration.

All test methods have been consistently updated to use the test_user fixture instead of shared_user, and login data construction correctly derives credentials from the new fixture.

Also applies to: 82-87, 122-127, 147-152, 188-194, 222-227, 260-265, 306-316, 343-348, 399-404, 466-469, 481-484


1-507: Test file is excluded from mypy checking.

The file backend/tests/integration/k8s/test_execution_routes.py is explicitly excluded from mypy type checking via the configuration exclude = '(^tests/|/tests/)' in backend/pyproject.toml. Mypy does not analyze test files, so no type checking errors are reported for this file. The type annotation Dict[str, str] for the test_user parameter is incomplete (the fixture returns additional keys like "csrf_token" and "headers"), but this cannot be caught by mypy since the file is excluded. The config comment notes # TODO: REMOVE NEXT LINE, indicating the exclusion is temporary.

@sonarqubecloud

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
backend/tests/conftest.py (2)

236-244: Clarify status 400 handling and verify state consistency.

Two concerns:

  1. Status 400 treated as success: Line 241 skips the test only if status is not in (200, 201, 400). Including 400 as acceptable might be intentional (user already exists from previous test), but could also mask validation errors. Consider checking the response body to distinguish "user exists" from other 400 errors.

  2. App state vs DB state: The docstring mentions "DB wiped between tests", but with the session-scoped app, any app-level caches or in-memory state won't be reset. Verify that user registration and authentication don't rely on stale app state after DB cleanup.

🔎 Proposed improvement to distinguish 400 error types
 async def test_user(client: httpx.AsyncClient, test_user_credentials):
     """Function-scoped authenticated user. Recreated each test (DB wiped between tests)."""
     creds = test_user_credentials
     r = await client.post("/api/v1/auth/register", json=creds)
-    if r.status_code not in (200, 201, 400):
-        pytest.skip(f"Cannot create test user (status {r.status_code}).")
+    if r.status_code not in (200, 201):
+        if r.status_code == 400 and "already exists" in r.text.lower():
+            pass  # User exists, continue with login
+        else:
+            pytest.skip(f"Cannot create test user (status {r.status_code}): {r.text}")
     csrf = await _http_login(client, creds["username"], creds["password"])
     return {**creds, "csrf_token": csrf, "headers": {"X-CSRF-Token": csrf}}

247-255: Same status 400 handling concern as test_user.

The test_admin fixture has the same status code handling pattern as test_user (line 252). Consider applying the same refinement to distinguish "user exists" from validation errors, and verify app state consistency with DB cleanup.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6280365 and 6b26179.

📒 Files selected for processing (3)
  • backend/app/services/user_settings_service.py
  • backend/tests/conftest.py
  • backend/tests/integration/conftest.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/tests/integration/conftest.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/services/user_settings_service.py (1)
backend/app/domain/user/settings_models.py (2)
  • DomainNotificationSettings (13-18)
  • DomainEditorSettings (22-28)
backend/tests/conftest.py (2)
backend/tests/unit/conftest.py (4)
  • app (34-35)
  • db (19-20)
  • redis_client (24-25)
  • client (29-30)
backend/app/core/database_context.py (6)
  • database (67-69)
  • database (146-149)
  • database (201-202)
  • client (62-64)
  • client (140-143)
  • client (197-198)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Scan Backend
  • GitHub Check: E2E Tests
  • GitHub Check: E2E Tests
  • GitHub Check: Integration Tests
🔇 Additional comments (9)
backend/app/services/user_settings_service.py (3)

186-189: Good approach for Avro compatibility.

The stringification logic correctly handles nested dictionaries by converting them to JSON strings while preserving simple values as strings. This ensures compatibility with Avro schema requirements.


343-368: LGTM! Clean refactoring of specific event type handling.

The code correctly handles USER_NOTIFICATION_SETTINGS_UPDATED and USER_EDITOR_SETTINGS_UPDATED events with proper fallback logic and early returns. The settings are properly reconstructed from event payloads with sensible defaults.


370-421: The event field rename from "changes" to "updated" is correctly implemented throughout the codebase. The event schema defines the field as updated: dict[str, str], and the handler code properly accesses it via event.payload.get("updated"). No references to the old "changes" field remain. The parse_value helper is appropriate for handling both string and dict formats in nested fields, providing defensive handling for payload deserialization edge cases.

backend/tests/conftest.py (6)

14-14: LGTM: Database import aligns with the Motor → PyMongo migration.

The import correctly references the new Database type from app.core.database_context, consistent with the broader migration described in the PR objectives.


152-156: LGTM: Container scope aligns with app fixture.

Session-scoping the app_container is consistent with the session-scoped app fixture, since the container is retrieved from app.state.dishka_container.


160-172: LGTM: Function-scoped fixtures provide test isolation.

The function-scoped client, scope, db, and redis_client fixtures correctly provide per-test isolation despite the session-scoped app. The Database type change aligns with the Motor → PyMongo migration.

Also applies to: 182-191, 194-197


215-222: LGTM: Credentials fixtures renamed and properly scoped.

The renaming from shared_*_credentials to test_*_credentials improves clarity. Session scope is appropriate since credentials don't change during the test session, and using PYTEST_SESSION_ID ensures uniqueness across workers.

Also applies to: 226-233


258-270: LGTM: another_user fixture correctly creates unique users.

The fixture properly generates unique users per test using UUID, avoiding conflicts. Implementation is straightforward and correct.


200-202: No further action needed. The tests/integration/conftest.py file contains the _cleanup fixture marked with @pytest_asyncio.fixture(autouse=True) that properly cleans both the database (dropping all non-system collections) and Redis before each integration test. The separation of cleanup logic between unit and integration test configurations is correctly implemented.
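
In outline, such a fixture can be as small as this (a sketch; the db and redis_client fixture names follow the ones used elsewhere in the suite):

import pytest_asyncio

@pytest_asyncio.fixture(autouse=True)
async def _cleanup(db, redis_client):
    # Pre-test cleanup: drop every non-system collection, then flush Redis.
    for name in await db.list_collection_names():
        if not name.startswith("system."):
            await db.drop_collection(name)
    await redis_client.flushdb()
    yield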

"user_id": user_id,
"settings_type": SettingsType.PREFERENCES,
"changes": {"restored_to": timestamp.isoformat()},
"updated": {"restored_to": timestamp.isoformat()},

⚠️ Potential issue | 🟠 Major

Inconsistent payload format for the same event type.

This line publishes USER_SETTINGS_UPDATED with a dict in the "updated" field, while lines 186-196 stringify all values for the same event type. Although the parse_value helper (lines 375-381) can handle both formats, this inconsistency reduces the predictability of the event contract.

🔎 Proposed fix: Apply the same stringification logic
-                "updated": {"restored_to": timestamp.isoformat()},
+                "updated": {"restored_to": str(timestamp.isoformat())},

Or better yet, apply the same stringification pattern used in lines 186-189:

+            restoration_data = {"restored_to": timestamp.isoformat()}
+            updated_stringified: dict[str, str] = {
+                k: json.dumps(v) if isinstance(v, dict) else str(v) for k, v in restoration_data.items()
+            }
             await self.event_service.publish_event(
                 event_type=EventType.USER_SETTINGS_UPDATED,
                 aggregate_id=f"user_settings_{user_id}",
                 payload={
                     "user_id": user_id,
                     "settings_type": SettingsType.PREFERENCES,
-                    "updated": {"restored_to": timestamp.isoformat()},
+                    "updated": updated_stringified,
                 },
                 metadata=None,
             )

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In backend/app/services/user_settings_service.py around line 301, the published
USER_SETTINGS_UPDATED event uses {"restored_to": timestamp.isoformat()} (a raw
datetime string) which is inconsistent with the earlier USER_SETTINGS_UPDATED
messages at lines 186-196 that stringify all values; change this payload to
apply the same stringification pattern used on lines 186-189 (i.e., convert the
value to a string in the same way the other updated keys are stringified) so the
"updated" field always contains string values matching the event contract.

Comment on lines +129 to +142
@pytest_asyncio.fixture(scope="session")
async def app(_test_env):  # type: ignore[valid-type]
    """Create FastAPI app once per session/worker.

    Session-scoped to avoid Pydantic schema validator memory issues when
    FastAPI recreates OpenAPI schemas hundreds of times with pytest-xdist.
    See: https://github.com/pydantic/pydantic/issues/1864

    Depends on _test_env to ensure env vars (REDIS_DB, DATABASE_NAME, etc.)
    are set before the app/Settings are created.

    Note: Tests must not modify app.state or registered routes.
    Use function-scoped `client` fixture for test isolation.
    """

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for potential stateful patterns that could leak between tests

echo "=== Checking for app.state modifications in tests ==="
rg -n "app\.state\." --type=py -g 'tests/**/*.py' -C2

echo -e "\n=== Checking for module-level caches or singletons in app code ==="
rg -n "@lru_cache|@cache|@functools" --type=py -g 'backend/app/**/*.py' -C2

echo -e "\n=== Checking for persistent caches in dependencies ==="
rg -n "cache|Cache" --type=py -g 'backend/app/**/*.py' -l | head -20

Repository: HardMax71/Integr8sCode

Length of output: 4257


🏁 Script executed:

#!/bin/bash
# Examine the conftest.py to understand the fixture setup and cleanup
echo "=== Full conftest.py around app fixture ==="
sed -n '120,160p' backend/tests/conftest.py

echo -e "\n=== Checking how client fixture relates to app ==="
rg -n "def client" backend/tests/conftest.py -A10

echo -e "\n=== Checking _test_env fixture ==="
rg -n "def _test_env" backend/tests/conftest.py -A20

Repository: HardMax71/Integr8sCode

Length of output: 3224


🏁 Script executed:

#!/bin/bash
# Look for app factory or app creation logic
echo "=== Checking app creation logic ==="
rg -n "FastAPI\(\)" backend --type=py -C2 | head -50

echo -e "\n=== Checking for background tasks or async state ==="
rg -n "@app\.(on_event|lifespan)" backend/app --type=py -C5

Repository: HardMax71/Integr8sCode

Length of output: 467


🏁 Script executed:

#!/bin/bash
# Check if tests modify environment variables expecting Settings to change
echo "=== Checking for environment variable modifications in tests ==="
rg -n "os\.environ\[" backend/tests --type=py -C2 | head -50

echo -e "\n=== Check if get_settings is used/called in tests ==="
rg -n "get_settings" backend/tests --type=py

echo -e "\n=== Verify create_test_app() source ==="
grep -n "def create_test_app" backend/tests/conftest.py -A15

Repository: HardMax71/Integr8sCode

Length of output: 4471


🏁 Script executed:

#!/bin/bash
# Check app.main to understand app creation
echo "=== Check app.main and create_app function ==="
head -100 backend/app/main.py

echo -e "\n=== Look for Settings usage in app initialization ==="
rg -n "Settings\|get_settings" backend/app/main.py -C3

Repository: HardMax71/Integr8sCode

Length of output: 3124


Verify other @lru_cache usages don't cause cross-test pollution.

The session-scoped app is safe from most state leaks. The Settings cache is correctly cleared when the app is created (line 121 in create_test_app()), and the function-scoped client fixture properly isolates request state.

However, verify that cached functions in the app don't cause test interdependencies:

  • schema_registry.py: _get_event_class_mapping(), _get_all_event_classes(), _get_event_type_to_class_mapping() all use @lru_cache(maxsize=1) and are never cleared
  • kafka/mappings.py: get_event_class_for_type() and get_topic_for_event() use @lru_cache(maxsize=128) and are never cleared

If tests modify event schemas or Kafka mappings, these caches won't reflect the changes. Confirm tests don't rely on dynamic schema/mapping updates.
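
If that ever became necessary, the cached lookups could be reset explicitly between tests; a hedged sketch (the import path is an assumption based on the file layout above):

from app.infrastructure.kafka import mappings  # assumed module path for kafka/mappings.py

def reset_kafka_lookup_caches() -> None:
    # lru_cache-wrapped functions expose cache_clear() for exactly this purpose.
    mappings.get_event_class_for_type.cache_clear()
    mappings.get_topic_for_event.cache_clear()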
