fix: rotate pods on pod-config change #2299

Open
rugggger wants to merge 1 commit into 02-26-fix_add_podconfigversion_and_save_a_calculated_struct_as_annotation_on_a_pod from 02-26-fix_rotate_pods_on_pod-config_change

Contributor

@rugggger rugggger commented Feb 26, 2026

TL;DR

Added automatic pod rotation when the configuration version changes, so that pods always run with the latest configuration.

What changed?

  • Added a new step deletePodOnConfigVersionMismatch to the active state flow that checks for configuration version mismatches between running pods and the current configuration
  • Implemented priority-based rotation logic that rotates pods in order: drive → compute → frontend containers
  • Added stability checks to ensure only one pod rotates at a time and waits for sibling containers to be stable before proceeding
  • Skipped configuration checks for discovery and driver containers as they don't require rotation
  • Added event recording when pods are deleted due to configuration mismatches
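The ordering and one-at-a-time rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Container` struct, the `Mode` constants, the `Stable` flag, and `nextToRotate` are hypothetical names, and the real step runs per container inside the reconciler rather than over a central list.

```go
package main

import (
	"fmt"
	"sort"
)

// Mode is a hypothetical stand-in for the operator's container mode type.
type Mode string

const (
	ModeDrive    Mode = "drive"
	ModeCompute  Mode = "compute"
	ModeFrontend Mode = "frontend"
)

// rotationPriority: a lower value rotates earlier (drive, then compute, then frontend).
var rotationPriority = map[Mode]int{ModeDrive: 0, ModeCompute: 1, ModeFrontend: 2}

// Container is a simplified, hypothetical view of a container and its pod.
type Container struct {
	Name    string
	Mode    Mode
	PodHash string // config hash recorded on the running pod
	Stable  bool   // false while the pod is terminating, not running, or initializing
}

// nextToRotate returns the name of the single container whose pod should be
// deleted now, or "" when nothing should rotate yet.
func nextToRotate(containers []Container, currentHash string) string {
	var stale []Container
	for _, c := range containers {
		if !c.Stable {
			return "" // a sibling is unstable: defer so only one pod rotates at a time
		}
		if _, known := rotationPriority[c.Mode]; !known {
			continue // modes outside the priority table are never rotated here
		}
		if c.PodHash != currentHash {
			stale = append(stale, c)
		}
	}
	if len(stale) == 0 {
		return ""
	}
	sort.SliceStable(stale, func(i, j int) bool {
		return rotationPriority[stale[i].Mode] < rotationPriority[stale[j].Mode]
	})
	return stale[0].Name
}

func main() {
	containers := []Container{
		{Name: "frontend-0", Mode: ModeFrontend, PodHash: "v1", Stable: true},
		{Name: "drive-0", Mode: ModeDrive, PodHash: "v1", Stable: true},
		{Name: "compute-0", Mode: ModeCompute, PodHash: "v2", Stable: true},
	}
	fmt.Println(nextToRotate(containers, "v2"))
}
```

In this sketch a single unstable sibling blocks every rotation, which mirrors the "wait for siblings to be stable" rule in the description.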

How to test?

  1. Deploy a WekaContainer with a specific configuration
  2. Update the configuration to trigger a hash change
  3. Verify that pods are rotated in the correct priority order (drive first, then compute, then frontend)
  4. Confirm that only one pod rotates at a time and waits for siblings to be stable
  5. Check that discovery and driver containers are not affected by configuration changes

Why make this change?

This ensures that running pods automatically pick up configuration changes without manual intervention. The priority-based rotation system maintains cluster stability by rotating critical components (drives) first and preventing multiple simultaneous rotations that could impact service availability.

Contributor Author

rugggger commented Feb 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

How to use the Graphite Merge Queue

Add the label main-merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has required the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.


graphite-app bot commented Feb 26, 2026

Graphite Automations

"Add anton/matt/sergey/kristina as reviwers on operator PRs" took an action on this PR • (02/26/26)

3 reviewers were added to this PR based on Anton Bykov's automation.

@rugggger rugggger force-pushed the 02-26-fix_rotate_pods_on_pod-config_change branch from 9fc542d to d3da503 Compare March 4, 2026 09:32
@rugggger rugggger force-pushed the 02-26-fix_add_podconfigversion_and_save_a_calculated_struct_as_annotation_on_a_pod branch from 5bf5f3d to cf93221 Compare March 4, 2026 09:32
}

func (r *containerReconcilerLoop) deletePodOnConfigVersionMismatch(ctx context.Context) error {
	ctx, logger, end := instrumentation.GetLogSpan(ctx, "deletePodOnConfigVersionMismatch")
Collaborator

nit: you can use an empty string for the span name to avoid creating a duplicate span (a deletePodOnConfigVersionMismatch span is already created by the step engine, since this is a SimpleStep)

Suggested change
- ctx, logger, end := instrumentation.GetLogSpan(ctx, "deletePodOnConfigVersionMismatch")
+ ctx, logger, end := instrumentation.GetLogSpan(ctx, "")

Comment on lines +263 to +269
	mode := r.container.Spec.Mode
	if mode == weka.WekaContainerModeDiscovery ||
		mode == weka.WekaContainerModeDriversDist ||
		mode == weka.WekaContainerModeDriversLoader ||
		mode == weka.WekaContainerModeDriversBuilder {
		return nil
	}
Collaborator

@kristina-solovyova kristina-solovyova Mar 5, 2026

Maybe use r.container.IsServiceContainer() here instead?
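For illustration, the idea of replacing the chained comparison with a single predicate could look like the sketch below. It does not reproduce the real IsServiceContainer(), whose exact set of modes is not shown in this thread; the `Mode` type, the string values of the constants, and `skipsConfigRotation` are all assumptions made up for the example.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the operator's container mode type.
type Mode string

// The string values below are assumptions, not the operator's real constants.
const (
	ModeDiscovery      Mode = "discovery"
	ModeDriversDist    Mode = "drivers-dist"
	ModeDriversLoader  Mode = "drivers-loader"
	ModeDriversBuilder Mode = "drivers-builder"
	ModeDrive          Mode = "drive"
)

// noRotationModes lists the modes that are excluded from config-version rotation.
var noRotationModes = map[Mode]struct{}{
	ModeDiscovery:      {},
	ModeDriversDist:    {},
	ModeDriversLoader:  {},
	ModeDriversBuilder: {},
}

// skipsConfigRotation reports whether a container in the given mode should
// bypass the config-version check entirely.
func skipsConfigRotation(m Mode) bool {
	_, skip := noRotationModes[m]
	return skip
}

func main() {
	for _, m := range []Mode{ModeDiscovery, ModeDriversLoader, ModeDrive} {
		fmt.Printf("%s skips rotation: %v\n", m, skipsConfigRotation(m))
	}
}
```

Keeping the mode set in one place means the step body reduces to a single early return, and a new service-style mode only needs to be added once.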

Comment on lines +280 to +283
	siblings, err := r.getClusterContainers(ctx)
	if err != nil {
		return fmt.Errorf("failed to get cluster containers: %w", err)
	}
Collaborator

We cannot always get cluster containers.

This works only if the current container has an owner that is a WekaCluster.
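One way to guard against this, sketched under assumptions: check for a cluster owner first and skip the sibling lookup when there is none. `OwnerRef` here is a simplified stand-in for a Kubernetes owner reference (which does carry Kind and Name fields), while the kind string "WekaCluster" and the helper `owningCluster` are hypothetical.

```go
package main

import "fmt"

// OwnerRef is a simplified stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	Kind string
	Name string
}

// clusterKind is an assumed kind name for the owning cluster resource.
const clusterKind = "WekaCluster"

// owningCluster returns the name of the owning cluster, if the container has one.
func owningCluster(refs []OwnerRef) (string, bool) {
	for _, ref := range refs {
		if ref.Kind == clusterKind {
			return ref.Name, true
		}
	}
	return "", false
}

func main() {
	owned := []OwnerRef{{Kind: "WekaCluster", Name: "cluster-a"}}
	standalone := []OwnerRef{}

	if name, ok := owningCluster(owned); ok {
		fmt.Println("check siblings in", name)
	}
	if _, ok := owningCluster(standalone); !ok {
		fmt.Println("no cluster owner: skip the sibling stability check")
	}
}
```

Whether a container without a cluster owner should rotate immediately or never rotate is a design decision this sketch leaves open.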

Comment on lines +292 to +300
	// If any sibling is unstable, defer rotation
	status := sibling.Status.Status
	if (status == weka.PodTerminating || status == weka.PodNotRunning || status == weka.Init) && sibling.DeletionTimestamp == nil {
		logger.Info("Deferring pod config rotation: sibling container is unstable",
			"sibling", sibling.Name,
			"siblingStatus", status,
		)
		return lifecycle.NewWaitError(fmt.Errorf("deferring config rotation: sibling %s is in %s state", sibling.Name, status))
	}
Collaborator

We might have cases where some WekaContainers are stuck in PodTerminating, PodNotRunning, or Deleting (because a node is not ready, or because there are not enough resources on the nodes to create the pod).

I'm not sure it's possible to get all pods stable in a real environment.

Comment on lines +274 to +276
	if exists && podHash == currentHash {
		return nil
	}
Collaborator

We could check this earlier, in the step predicates (along with the simple checks that come before this one).
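A rough sketch of what moving the hash comparison into a predicate could look like. The `Step` and `Predicate` types below are hypothetical and do not reflect the operator's actual step engine API; the annotation key is also made up for the example.

```go
package main

import "fmt"

// Predicate decides whether a step should run at all (hypothetical type).
type Predicate func() bool

// Step is a hypothetical, simplified step: it runs only if all predicates pass.
type Step struct {
	Name       string
	Predicates []Predicate
	Run        func() error
}

// Execute reports whether the step body ran, plus any error from it.
func (s Step) Execute() (bool, error) {
	for _, p := range s.Predicates {
		if !p() {
			return false, nil // skipped: a predicate did not hold
		}
	}
	return true, s.Run()
}

// configOutdated is true when the pod has no recorded hash or a different one.
// It is the inverse of the early return `exists && podHash == currentHash`.
func configOutdated(annotations map[string]string, key, currentHash string) Predicate {
	return func() bool {
		podHash, exists := annotations[key]
		return !exists || podHash != currentHash
	}
}

func main() {
	const key = "example.com/pod-config-hash" // assumed annotation key
	annotations := map[string]string{key: "abc"}

	step := Step{
		Name:       "deletePodOnConfigVersionMismatch",
		Predicates: []Predicate{configOutdated(annotations, key, "abc")},
		Run:        func() error { fmt.Println("deleting pod"); return nil },
	}
	ran, err := step.Execute()
	fmt.Println("ran:", ran, "err:", err)
}
```

With the cheap checks expressed as predicates, the step body only runs when a rotation is actually a candidate, which also avoids the sibling lookup in the common up-to-date case.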

@rugggger rugggger force-pushed the 02-26-fix_add_podconfigversion_and_save_a_calculated_struct_as_annotation_on_a_pod branch from cf93221 to 8576c16 Compare March 8, 2026 09:06
@rugggger rugggger force-pushed the 02-26-fix_rotate_pods_on_pod-config_change branch from d3da503 to fda9371 Compare March 8, 2026 09:06
@rugggger rugggger force-pushed the 02-26-fix_rotate_pods_on_pod-config_change branch from fda9371 to 5ae7f5f Compare March 8, 2026 09:15