OCPBUGS-38662: guard controller: Handle conflict gracefully #2093
tchap wants to merge 1 commit into openshift:master
Conversation
Currently the controller degrades on any error, including an apply conflict. This can happen during upgrades and should not be the case. The cause is probably the fact that we delete and create the guard pod in the very same sync call. But instead of handling this specifically, the apply call now simply ignores conflict errors and sync is automatically invoked again as the changes propagate.
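To make the described behavior concrete, here is a minimal, hedged sketch of the pattern; the function and helper names are hypothetical, not the controller's actual code. Conflict errors are still logged but are not aggregated into the errors that drive the Degraded condition, and the next sync reconciles against the propagated state.

```go
// Sketch only, assuming hypothetical names; illustrates ignoring apply conflicts
// so a transient race does not mark the operator Degraded.
package guard

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/klog/v2"
)

// applyGuardPods applies each desired guard pod and collects only the errors
// that should degrade the operator.
func applyGuardPods(pods []*corev1.Pod, apply func(*corev1.Pod) error) []error {
	var errs []error
	for _, pod := range pods {
		err := apply(pod)
		if err == nil {
			continue
		}
		klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
		// A conflict only means our copy of the object was stale (for example, the
		// Kubelet updated the status in the meantime); the next sync sees the
		// propagated state and retries, so it is not reported as a degrading error.
		if !apierrors.IsConflict(err) {
			errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
		}
	}
	return errs
}
```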
@tchap: This pull request references Jira Issue OCPBUGS-38662, which is valid. 3 validation(s) were run on this bug. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of the listed files.
@tchap: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```diff
-			errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+			if !apierrors.IsConflict(err) {
+				errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+			}
```
I am not sure logging such an error is useful, but I left the log statement there.
```diff
 		if err != nil {
 			klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
-			errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+			if !apierrors.IsConflict(err) {
```
How is this PR handling conflicts gracefully? 🙂
In my opinion, we should not hide errors.
A conflict error is still an error.
For example, it might indicate a conflict with a different actor in the system.
What we could do instead (if we aren’t already) is consider retrying certain types of requests when a conflict error occurs.
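For reference, retrying on conflict is usually done with client-go's retry.RetryOnConflict helper. The sketch below assumes hypothetical getPod/updatePod callbacks and is not what this PR implements:

```go
// Sketch of the retry-on-conflict alternative; getPod and updatePod are
// hypothetical placeholders for the real client calls.
package guard

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/retry"
)

func updateGuardPodWithRetry(getPod func() (*corev1.Pod, error), updatePod func(*corev1.Pod) error) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := getPod() // re-fetch the latest object on every attempt
		if err != nil {
			return err
		}
		// ...mutate pod here...
		return updatePod(pod) // a Conflict error here triggers another attempt
	})
}
```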
I think that ignoring is very graceful 🙂
Or we can simply return when we delete the pod and wait for another sync round; I think that's the cause. Because the deployment uses the Recreate strategy, there should not be multiple pods at once.
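As an aside, the Recreate strategy mentioned above is what guarantees the old pod is torn down before a new one is created. A minimal illustration (not this repository's code):

```go
// Illustration only: a Deployment strategy that recreates pods instead of
// rolling them, so all existing pods are killed before new ones are created.
package guard

import appsv1 "k8s.io/api/apps/v1"

func recreateStrategy() appsv1.DeploymentStrategy {
	return appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
}
```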
I will investigate further how to handle this in a better way and what the root cause actually is...
It's actually racing with the Kubelet, which updates the pod status right after pod creation, from what I can tell. This seems to be normal.
What do you mean by retrying requests? This is effectively retrying the request, but only once an up-to-date object has been fetched.