OCPBUGS-65896: controllers: Prevent Progressing=True when scaling only (#836)

tchap wants to merge 2 commits into openshift:master
Conversation
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Adds scaling guardrail logic that annotates Deployments on replica changes and emits OperatorConditionApplyConfiguration overwrites; wires those overwrites through workload and controller call chains by extending several function signatures; adds a go.mod replace.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. The full list of commands accepted by this bot can be found here.
/retitle OCPBUGS-65896: controllers: Prevent Progressing=True when scaling only

We can only merge this after the library-go changes are merged.

/hold
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid. 3 validation(s) were run on this bug. No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.
Actionable comments posted: 1
In `@pkg/controllers/deployment/deployment_controller.go`:
- Around lines 269-283: setRollingUpdateParameters mutates RollingUpdate fields, so scaling.ProcessDeployment sees non-scaling diffs. To fix, normalize the RollingUpdate parameters before the scaling-only comparison: make a copy of expectedDeployment (or a temp deployment) and set its Spec.Strategy.RollingUpdate to match existingDeployment.Spec.Strategy.RollingUpdate (or nil), so that only Replicas differ, then call scaling.ProcessDeployment(existingDeployment, normalizedExpected, ...). Still use the original expectedDeployment for applying changes, but use the normalized copy for the conditionOverwrites call involving setRollingUpdateParameters and scaling.ProcessDeployment.
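A minimal sketch of the suggested normalization. The struct types here are simplified stand-ins for the real `*appsv1.Deployment`, and the helpers `normalizeForScalingCheck` and `scalingOnlyDiff` are illustrative names, not code from the PR:

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes Deployment types; the real code
// operates on *appsv1.Deployment with Spec.Strategy.RollingUpdate.
type RollingUpdate struct{ MaxSurge, MaxUnavailable string }

type DeploymentSpec struct {
	Replicas      int32
	RollingUpdate *RollingUpdate
}

type Deployment struct{ Spec DeploymentSpec }

// normalizeForScalingCheck returns a copy of expected whose RollingUpdate
// parameters are aligned with existing, so a subsequent comparison sees
// only the replica-count difference.
func normalizeForScalingCheck(existing, expected *Deployment) *Deployment {
	out := *expected
	if existing.Spec.RollingUpdate != nil {
		ru := *existing.Spec.RollingUpdate
		out.Spec.RollingUpdate = &ru
	} else {
		out.Spec.RollingUpdate = nil
	}
	return &out
}

// scalingOnlyDiff reports whether the two specs differ only in Replicas.
func scalingOnlyDiff(existing, expected *Deployment) bool {
	if (existing.Spec.RollingUpdate == nil) != (expected.Spec.RollingUpdate == nil) {
		return false
	}
	if existing.Spec.RollingUpdate != nil && *existing.Spec.RollingUpdate != *expected.Spec.RollingUpdate {
		return false
	}
	return existing.Spec.Replicas != expected.Spec.Replicas
}

func main() {
	existing := &Deployment{Spec: DeploymentSpec{Replicas: 2}}
	expected := &Deployment{Spec: DeploymentSpec{
		Replicas:      3,
		RollingUpdate: &RollingUpdate{MaxSurge: "25%", MaxUnavailable: "25%"},
	}}

	// Without normalization, the mutated RollingUpdate masks the scaling-only diff.
	fmt.Println(scalingOnlyDiff(existing, expected)) // → false
	// With normalization, only Replicas differ.
	fmt.Println(scalingOnlyDiff(existing, normalizeForScalingCheck(existing, expected))) // → true
}
```

The original `expected` is left untouched, so it can still be used for the actual apply; only the comparison sees the normalized copy.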
```diff
-actualDeployment, err := target.syncDeployment(context.TODO(), &scenario.operator.Spec.OperatorSpec, &scenario.operator.Status.OperatorStatus, eventRecorder)
+actualDeployment, _, err := target.syncDeployment(context.TODO(), &scenario.operator.Spec.OperatorSpec, &scenario.operator.Status.OperatorStatus, eventRecorder)
```
I should add some tests here to also check the condition overwrites, I guess?
```go
conditionOverwrites, err := scaling.ProcessDeployment(existingDeployment, expectedDeployment, clock.RealClock{}, "OAuthServer")
if err != nil {
	return nil, false, nil, append(errs, err)
}
```
There are really no tests for this package, so I can't extend them to make sure this works. For now, only the scaling package is properly tested.
Actionable comments posted: 1
In `@pkg/operator/workload/sync_openshift_oauth_apiserver.go`:
- Around lines 287-297: Add a short comment above the call to encryptionkms.AddKMSPluginVolumeAndMountToPodSpec explaining why the KMS volume/mount is applied before calling scaling.ProcessDeployment (e.g., "KMS volumes must be added before scaling detection so ProcessDeployment sees the final podSpec and does not misattribute changes", or note that the ordering is intentional so the KMS config is present when comparing specs), mirroring the explanatory style used in deployment_controller.go. Reference AddKMSPluginVolumeAndMountToPodSpec and scaling.ProcessDeployment in the comment so future readers understand the rationale.
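The intended ordering could be sketched roughly as below. Everything here is a simplified stand-in: `addKMSPluginVolumeAndMount` and `processDeployment` approximate the roles of the real encryptionkms.AddKMSPluginVolumeAndMountToPodSpec and scaling.ProcessDeployment, and the `PodSpec` struct is made up for illustration:

```go
package main

import "fmt"

// PodSpec is a toy stand-in for corev1.PodSpec, reduced to volume names.
type PodSpec struct{ Volumes []string }

// addKMSPluginVolumeAndMount mimics adding the KMS plugin volume to the
// expected pod spec, as the real helper does.
func addKMSPluginVolumeAndMount(spec *PodSpec) {
	spec.Volumes = append(spec.Volumes, "kms-plugin")
}

// processDeployment mimics the scaling check: any pod-spec difference
// (here, a volume-count mismatch) means the change is not scaling-only.
func processDeployment(existing, expected *PodSpec) (scalingOnly bool) {
	return len(existing.Volumes) == len(expected.Volumes)
}

func main() {
	existing := &PodSpec{Volumes: []string{"kms-plugin"}}
	expected := &PodSpec{}

	// KMS volumes must be added before scaling detection so that
	// processDeployment compares the *final* pod spec and does not
	// misattribute the missing KMS volume as a template change.
	addKMSPluginVolumeAndMount(expected)

	fmt.Println(processDeployment(existing, expected)) // → true
}
```

If the KMS volume were added after the comparison, the diff would look like a template change and Progressing would flip even for a pure scale.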
@tchap: The following tests failed. Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
// When the timestamp annotation is present, we should overwrite Progressing to be false.
//
// So, ProcessDeployment amends the expected deployment in place, also returning any conditions to set on the operator.
func ProcessDeployment(existing, expected *appsv1.Deployment, clock clock.Clock, conditionPrefix string) ([]*applyoperatorv1.OperatorConditionApplyConfiguration, error) {
```
I understand what this is trying to solve, but the place is not correct. We cannot expect each deployment instance to have such a specific logic dedicated to progressing condition. This responsibility has to lie in the library-go or even better in the deployment controller.
The deployment controller should correctly report the progressing state via the Progressing condition. I think the best course of action should be to remove the Generation+ObservedGeneration check from library-go. As we can see in OCPBUGS-65896, not every spec update (e.g. scaling) should result in a progressing state, so generation tracking is not well equipped to do that.
There will now be a small risk of reporting the progressing state late, before the controller reacts to the real template change. This should be okay unless the controller is down (which causes much bigger issues) or there is a long queue in the controller. Normally the reaction time should be similar to what we would have when processing the annotation changes, so I think it should be okay to live with this small delay.
Please also note openshift/cluster-image-registry-operator#1293.
Also, I think it might be useful to surface ProgressDeadlineExceeded reason in a degraded condition to report that the deployment controller timed out rolling out the new changes.
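That last suggestion could look roughly like the sketch below. `DeploymentCondition` is a simplified stand-in for appsv1.DeploymentCondition, and `degradedFromConditions` is a hypothetical mapping function; the Progressing/ProgressDeadlineExceeded type and reason strings, however, are the ones the real Deployment controller sets:

```go
package main

import "fmt"

// DeploymentCondition is a toy stand-in for appsv1.DeploymentCondition.
type DeploymentCondition struct {
	Type, Status, Reason string
}

// degradedFromConditions scans the deployment conditions and reports
// whether the controller timed out rolling out changes, which the
// operator could then surface as a Degraded condition.
func degradedFromConditions(conds []DeploymentCondition) (degraded bool, reason string) {
	for _, c := range conds {
		// The Deployment controller sets Progressing=False with reason
		// ProgressDeadlineExceeded once spec.progressDeadlineSeconds elapses.
		if c.Type == "Progressing" && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return true, c.Reason
		}
	}
	return false, ""
}

func main() {
	conds := []DeploymentCondition{
		{Type: "Available", Status: "True", Reason: "MinimumReplicasAvailable"},
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	}
	degraded, reason := degradedFromConditions(conds)
	fmt.Println(degraded, reason) // → true ProgressDeadlineExceeded
}
```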
I am closing this in favor of openshift/library-go#2128.
@tchap: This pull request references Jira Issue OCPBUGS-65896. The bug has been updated to no longer refer to the pull request using the external bug tracker.
Before we start: the idea basically is to extend Sync to return conditions that would overwrite the conditions generated by the library-go machinery. The interface change is not yet implemented in library-go, but it should be OK as this operator is the only operator using the package.

To handle scaling properly, we need to keep some context regarding the state of the Deployment, so I decided to propose a few Deployment annotations. The names are not final; I'm not sure what the proper prefix actually is. Not sure whether there is a better way to store state, feel free to propose another way.

The main idea is summarized in the doc comment for ProcessDeployment. The annotation machinery is there to account for eventual consistency between creating/updating the deployment and the deployment controller picking the change up and actually updating the conditions there.