OCPBUGS-38662: guard controller: Handle conflict gracefully #2093
tchap wants to merge 1 commit into openshift:master
Conversation
Currently the controller degrades on any error, including an apply conflict. This can happen during upgrades and should not be the case. The cause is probably the fact that we delete and create the guard pod in the very same sync call. But instead of handling this specifically, the apply call now simply ignores conflict errors and sync is automatically invoked again as the changes propagate.
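To make the described behavior concrete, here is a minimal, hedged sketch of the pattern; the function and helper names are hypothetical, not the controller's actual code. Conflict errors are still logged but are not aggregated into the errors that drive the Degraded condition, and the next sync reconciles against the propagated state.

```go
// Sketch only, assuming hypothetical names; illustrates ignoring apply conflicts
// so a transient race does not mark the operator Degraded.
package guard

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/klog/v2"
)

// applyGuardPods applies each desired guard pod and collects only the errors
// that should degrade the operator.
func applyGuardPods(pods []*corev1.Pod, apply func(*corev1.Pod) error) []error {
	var errs []error
	for _, pod := range pods {
		err := apply(pod)
		if err == nil {
			continue
		}
		klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
		// A conflict only means our copy of the object was stale (for example, the
		// Kubelet updated the status in the meantime); the next sync sees the
		// propagated state and retries, so it is not reported as a degrading error.
		if !apierrors.IsConflict(err) {
			errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
		}
	}
	return errs
}
```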
@tchap: This pull request references Jira Issue OCPBUGS-38662, which is valid. 3 validation(s) were run on this bug. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of the listed files.
@tchap: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```diff
-			errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+			if !apierrors.IsConflict(err) {
+				errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+			}
```
I am not sure logging such an error is useful, but I left the log statement there.
```diff
 		if err != nil {
 			klog.Errorf("Unable to apply pod %v changes: %v", pod.Name, err)
-			errs = append(errs, fmt.Errorf("Unable to apply pod %v changes: %v", pod.Name, err))
+			if !apierrors.IsConflict(err) {
```
How is this PR handling conflicts gracefully? 🙂
In my opinion, we should not hide errors.
A conflict error is still an error.
For example, it might indicate a conflict with a different actor in the system.
What we could do instead (if we aren’t already) is consider retrying certain types of requests when a conflict error occurs.
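For reference, retrying on conflict is usually done with client-go's retry.RetryOnConflict helper. The sketch below assumes hypothetical getPod/updatePod callbacks and is not what this PR implements:

```go
// Sketch of the retry-on-conflict alternative; getPod and updatePod are
// hypothetical placeholders for the real client calls.
package guard

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/retry"
)

func updateGuardPodWithRetry(getPod func() (*corev1.Pod, error), updatePod func(*corev1.Pod) error) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := getPod() // re-fetch the latest object on every attempt
		if err != nil {
			return err
		}
		// ...mutate pod here...
		return updatePod(pod) // a Conflict error here triggers another attempt
	})
}
```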
I think that ignoring is very graceful 🙂
Or we can simply return when we delete the pod and wait for another sync round; I think that's the cause. Because the deployment uses the Recreate strategy, there should not be multiple pods at once.
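As an aside, the Recreate strategy mentioned above is what guarantees the old pod is torn down before a new one is created. A minimal illustration (not this repository's code):

```go
// Illustration only: a Deployment strategy that recreates pods instead of
// rolling them, so all existing pods are killed before new ones are created.
package guard

import appsv1 "k8s.io/api/apps/v1"

func recreateStrategy() appsv1.DeploymentStrategy {
	return appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
}
```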
I will investigate further how to handle this in a better way and what the root cause actually is...
It's actually racing with the Kubelet, which updates the pod status right after pod creation, from what I can tell. This seems to be normal.
What do you mean by retrying requests? This is effectively retrying the request, but only once an up-to-date object has been fetched.