feat: integrate PeerDB native maintenance mode into upgrade lifecycle #12

Merged
Neurostep merged 6 commits into main from
Neurostep/implement-peerdb-native-maintenance-mode
Mar 8, 2026

@Neurostep
Owner

Description

Integrates PeerDB's native maintenance mode (PeerDB-io/peerdb#2211) into the operator's upgrade state machine. When spec.maintenance is configured on a PeerDBCluster, the operator gracefully pauses all running mirrors before upgrading components and resumes them after.

Upgrade flow with maintenance mode

Waiting → StartMaintenance → Config → InitJobs → FlowAPI → Server → UI → EndMaintenance → Complete

When spec.maintenance is not set, the upgrade flow remains unchanged (fully backwards compatible).
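The phase ordering above can be sketched as a small transition function. This is an illustrative sketch, not the operator's code: the Go identifiers (`UpgradePhase`, `nextPhase`, the `Phase*` constants) are hypothetical, and the flow without maintenance is assumed to be the same sequence minus the two maintenance phases.

```go
package main

import "fmt"

// UpgradePhase values mirror the phase names in the PR description;
// the Go identifiers themselves are hypothetical.
type UpgradePhase string

const (
	PhaseWaiting          UpgradePhase = "Waiting"
	PhaseStartMaintenance UpgradePhase = "StartMaintenance"
	PhaseConfig           UpgradePhase = "Config"
	PhaseInitJobs         UpgradePhase = "InitJobs"
	PhaseFlowAPI          UpgradePhase = "FlowAPI"
	PhaseServer           UpgradePhase = "Server"
	PhaseUI               UpgradePhase = "UI"
	PhaseEndMaintenance   UpgradePhase = "EndMaintenance"
	PhaseComplete         UpgradePhase = "Complete"
)

// nextPhase returns the phase that follows current. The maintenance
// flag stands in for "spec.maintenance is set"; only the Waiting and
// UI transitions depend on it.
func nextPhase(current UpgradePhase, maintenance bool) UpgradePhase {
	switch current {
	case PhaseWaiting:
		if maintenance {
			return PhaseStartMaintenance
		}
		return PhaseConfig
	case PhaseStartMaintenance:
		return PhaseConfig
	case PhaseConfig:
		return PhaseInitJobs
	case PhaseInitJobs:
		return PhaseFlowAPI
	case PhaseFlowAPI:
		return PhaseServer
	case PhaseServer:
		return PhaseUI
	case PhaseUI:
		if maintenance {
			return PhaseEndMaintenance
		}
		return PhaseComplete
	default:
		// EndMaintenance, Complete, and unknown phases all end at Complete.
		return PhaseComplete
	}
}

func main() {
	for _, maintenance := range []bool{true, false} {
		phase := PhaseWaiting
		fmt.Printf("maintenance=%v: %s", maintenance, phase)
		for phase != PhaseComplete {
			phase = nextPhase(phase, maintenance)
			fmt.Printf(" -> %s", phase)
		}
		fmt.Println()
	}
}
```

Keeping the branch points in the Waiting and UI transitions means the intermediate phases never need to know whether maintenance is configured.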

Changes

API (api/v1alpha1/)

  • MaintenanceSpec type with image, backoffLimit, and resources fields
  • spec.maintenance field on PeerDBClusterSpec
  • StartMaintenance and EndMaintenance upgrade phases
  • MaintenanceMode condition type and associated reason constants
  • Regenerated deepcopy and CRD manifests
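A rough sketch of how the API additions might look, assuming `spec.maintenance` is an optional pointer so that an empty object (`maintenance: {}`) enables the feature with defaults. The field names come from the list above; the field types, JSON tags, and the simplified `Resources` stand-in (the real type is presumably a Kubernetes resource requirements struct) are assumptions made to keep the example dependency-free.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Resources is a simplified, hypothetical stand-in for a Kubernetes
// resource requirements type, so the sketch needs only the stdlib.
type Resources struct {
	Requests map[string]string `json:"requests,omitempty"`
	Limits   map[string]string `json:"limits,omitempty"`
}

// MaintenanceSpec carries the three fields named in the PR; the
// types chosen here are assumptions.
type MaintenanceSpec struct {
	Image        string     `json:"image,omitempty"`
	BackoffLimit *int32     `json:"backoffLimit,omitempty"`
	Resources    *Resources `json:"resources,omitempty"`
}

// PeerDBClusterSpec is reduced to the two fields this sketch needs.
type PeerDBClusterSpec struct {
	Version     string           `json:"version"`
	Maintenance *MaintenanceSpec `json:"maintenance,omitempty"`
}

// maintenanceEnabled reports whether spec.maintenance is set at all;
// an empty object still counts as set.
func maintenanceEnabled(spec PeerDBClusterSpec) bool {
	return spec.Maintenance != nil
}

// parseSpec decodes a JSON-encoded spec.
func parseSpec(raw string) (PeerDBClusterSpec, error) {
	var spec PeerDBClusterSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	for _, raw := range []string{
		`{"version":"v0.36.7"}`,
		`{"version":"v0.36.7","maintenance":{}}`,
	} {
		spec, err := parseSpec(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> maintenance enabled: %v\n", raw, maintenanceEnabled(spec))
	}
}
```

A pointer field distinguishes "not configured" (nil) from "configured with defaults" (empty struct), which matches the backwards-compatible behavior described in the PR.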

Resource builders (internal/resources/)

  • maintenance_jobs.go: BuildStartMaintenanceJob / BuildEndMaintenanceJob — creates K8s Jobs using ghcr.io/peerdb-io/flow-maintenance:stable-{version} with start/end subcommands
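For illustration, the image reference and subcommand selection could be derived as below. This is a sketch under assumptions: `maintenanceImage` and `maintenanceArgs` are hypothetical helper names, and treating `spec.maintenance.image` as a full override of the default image is a guess at how that field behaves. Only the repository, the `stable-{version}` tag pattern, and the `start`/`end` subcommands come from the PR.

```go
package main

import "fmt"

// defaultMaintenanceRepo is the image repository named in the PR.
const defaultMaintenanceRepo = "ghcr.io/peerdb-io/flow-maintenance"

// maintenanceImage returns the image for a maintenance Job. A
// non-empty override wins (assumed behavior of the image field);
// otherwise the tag follows the stable-{version} pattern.
func maintenanceImage(override, version string) string {
	if override != "" {
		return override
	}
	return fmt.Sprintf("%s:stable-%s", defaultMaintenanceRepo, version)
}

// maintenanceArgs returns the container args for the given action.
// Only the two subcommands mentioned in the PR are accepted.
func maintenanceArgs(action string) ([]string, error) {
	switch action {
	case "start", "end":
		return []string{action}, nil
	default:
		return nil, fmt.Errorf("unknown maintenance action %q", action)
	}
}

func main() {
	for _, action := range []string{"start", "end"} {
		args, err := maintenanceArgs(action)
		if err != nil {
			panic(err)
		}
		fmt.Println(maintenanceImage("", "v0.36.8"), args)
	}
}
```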

Controller (internal/controller/)

  • upgradePhaseMaintenanceJob — generic handler for both maintenance phases: creates Job, polls completion, deletes and retries on failure
  • upgradePhaseWaiting routes to StartMaintenance when maintenance is configured
  • upgradePhaseUI routes to EndMaintenance (instead of Complete) when maintenance is configured
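The create, poll, and delete-then-retry behavior described for the maintenance handler can be sketched against an in-memory store. Everything here is hypothetical scaffolding (`JobStore`, `Result`, `reconcileMaintenanceJob`, `memStore`); the real controller works against the Kubernetes API and its own status types, so treat this as a model of the control flow only.

```go
package main

import "fmt"

// JobState is a simplified view of a Kubernetes Job's status.
type JobState int

const (
	JobMissing JobState = iota // no Job exists yet
	JobRunning
	JobSucceeded
	JobFailed
)

// JobStore abstracts the cluster calls the handler needs.
type JobStore interface {
	State(name string) JobState
	Create(name string)
	Delete(name string)
}

// Result tells the caller what to do after one reconcile pass.
type Result struct {
	Advance  bool // move to the next upgrade phase
	Degraded bool // surface a Degraded condition
	Requeue  bool // check again later
}

// reconcileMaintenanceJob performs one pass of the handler logic.
func reconcileMaintenanceJob(store JobStore, name string) Result {
	switch store.State(name) {
	case JobMissing:
		store.Create(name)
		return Result{Requeue: true}
	case JobSucceeded:
		return Result{Advance: true}
	case JobFailed:
		// Delete so that the next pass recreates the Job.
		store.Delete(name)
		return Result{Degraded: true, Requeue: true}
	default:
		// Still running: keep polling.
		return Result{Requeue: true}
	}
}

// memStore is an in-memory JobStore used for the demonstration.
type memStore map[string]JobState

func (m memStore) State(name string) JobState { return m[name] }
func (m memStore) Create(name string)         { m[name] = JobRunning }
func (m memStore) Delete(name string)         { delete(m, name) }

func main() {
	store := memStore{}
	name := "start-maintenance"
	fmt.Printf("%+v\n", reconcileMaintenanceJob(store, name)) // creates the Job
	store[name] = JobFailed
	fmt.Printf("%+v\n", reconcileMaintenanceJob(store, name)) // deletes it for retry
	fmt.Printf("%+v\n", reconcileMaintenanceJob(store, name)) // recreates it
	store[name] = JobSucceeded
	fmt.Printf("%+v\n", reconcileMaintenanceJob(store, name)) // advances
}
```

Because each pass inspects current state and takes at most one action, the same function can serve both StartMaintenance and EndMaintenance by varying only the Job name.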

Documentation

  • API reference: MaintenanceSpec, MaintenanceMode condition, new upgrade phases
  • Architecture: updated diagram and reconciliation strategy
  • Safe upgrade runbook: new Maintenance Mode section with YAML examples
  • README: feature bullet

Testing

Unit tests (run locally)

go test ./...

Four new test cases verify:

  1. Maintenance Job creation during upgrade
  2. Advancing past StartMaintenance when Job completes
  3. No maintenance phases when spec.maintenance is not set
  4. Failed Job deletion/retry with Degraded condition

E2E tests (requires Kind cluster)

make test-e2e

The new test creates a cluster with spec.maintenance: {}, patches the version, and verifies:

  • Start maintenance Job is created with correct command and ownerReference
  • Upgrade status shows StartMaintenance phase

Manual verification

  1. Deploy the operator to a cluster with PeerDB and external dependencies
  2. Create a PeerDBCluster with spec.maintenance: {}:
    spec:
      version: "v0.36.7"
      maintenance: {}
  3. Patch the version: kubectl patch peerdbcluster <name> --type merge -p '{"spec":{"version":"v0.36.8"}}'
  4. Observe:
    • kubectl get job — maintenance start job created
    • kubectl get peerdbcluster <name> -o jsonpath="{.status.upgrade}" — phases progress through StartMaintenance → ... → EndMaintenance → Complete
    • kubectl get peerdbcluster <name> -o jsonpath="{.status.conditions}": MaintenanceMode condition toggles

Related

  • PeerDB maintenance mode PR: PeerDB-io/peerdb#2211 — introduces StartMaintenance / EndMaintenance Temporal workflows and the flow-maintenance container image
  • PeerDB maintenance mode issue: PeerDB-io/peerdb#2174
  • Maintenance image: ghcr.io/peerdb-io/flow-maintenance — runs via a dedicated Temporal task queue, ensuring compatibility across version upgrades

Neurostep and others added 2 commits March 8, 2026 11:35
Add MaintenanceSpec to PeerDBClusterSpec for configuring PeerDB's native
maintenance mode during upgrades. This includes:

- MaintenanceSpec type with image, backoffLimit, and resources fields
- Two new UpgradePhase constants: StartMaintenance and EndMaintenance
- ConditionMaintenanceMode condition type for tracking maintenance state
- Reason constants: MaintenanceStarting, Active, Ending, Complete, Failed
- Regenerated deepcopy methods and CRD manifests

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
Add BuildStartMaintenanceJob and BuildEndMaintenanceJob that create
Kubernetes Jobs using the ghcr.io/peerdb-io/flow-maintenance image.

The Jobs run PeerDB's maintenance entrypoint with 'start' or 'end'
subcommands to trigger the StartMaintenance/EndMaintenance Temporal
workflows. Jobs inherit catalog connection config via the shared
ConfigMap and password secret, following the same pattern as init jobs.

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
Neurostep force-pushed the Neurostep/implement-peerdb-native-maintenance-mode branch from f9060f2 to b0334dd March 8, 2026 06:06
Insert StartMaintenance and EndMaintenance phases into the upgrade
lifecycle. When spec.maintenance is configured:

  Waiting → StartMaintenance → Config → InitJobs → FlowAPI →
  Server → UI → EndMaintenance → Complete

The new upgradePhaseMaintenanceJob method handles Job creation,
completion polling, and failure retry (delete + recreate) for both
phases. When spec.maintenance is nil, the upgrade flow is unchanged.

Key behaviors:
- StartMaintenance Job pauses mirrors before any component restarts
- EndMaintenance Job resumes mirrors after all components are upgraded
- Failed maintenance Jobs are auto-deleted for retry with Degraded condition
- MaintenanceMode condition tracks whether mirrors are paused

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
Neurostep force-pushed the Neurostep/implement-peerdb-native-maintenance-mode branch from b0334dd to 0f9906f March 8, 2026 06:33
Neurostep and others added 3 commits March 8, 2026 12:11
Add four test cases covering the maintenance mode upgrade lifecycle:

- Job creation during upgrade when maintenance is configured
- Advancing past StartMaintenance when job completes successfully
- Skipping maintenance phases when spec.maintenance is not set
- Failed maintenance job deletion, retry, and Degraded condition

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
Add an e2e test that verifies the maintenance mode integration in a
real cluster:

- Creates a PeerDBCluster with spec.maintenance configured
- Patches the version to trigger an upgrade
- Verifies the start maintenance Job is created with correct command
- Verifies the upgrade status shows StartMaintenance phase
- Verifies ownerReferences point to PeerDBCluster for GC

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
Update documentation across all relevant files:

- README: add Maintenance Mode Integration to features list
- API reference: add MaintenanceSpec type, MaintenanceMode condition,
  StartMaintenance/EndMaintenance upgrade phases
- Architecture: add Maintenance Jobs to diagram and reconciliation
  strategy, add maintenance_jobs.go to project structure
- Safe upgrade runbook: add Maintenance Mode section with YAML examples,
  update upgrade order and phases table

Amp-Thread-ID: https://ampcode.com/threads/T-019ccbea-b6d3-7583-8ac6-4f8a88c21dbd
Co-authored-by: Amp <amp@ampcode.com>
Neurostep force-pushed the Neurostep/implement-peerdb-native-maintenance-mode branch from 0f9906f to bbedb60 March 8, 2026 06:41
Neurostep added the enhancement (New feature or request) label Mar 8, 2026
Neurostep merged commit a1b3d15 into main Mar 8, 2026
9 checks passed