OCPBUGS-65896: controllers: Prevent Progressing=True when scaling only (#836)

tchap wants to merge 2 commits into openshift:master
Conversation
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Adds scaling guardrail logic that annotates Deployments on replica changes and emits OperatorConditionApplyConfiguration overwrites; wires those overwrites through workload and controller call chains by extending several function signatures; adds a go.mod replace.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. The full list of commands accepted by this bot can be found here.
/retitle OCPBUGS-65896: controllers: Prevent Progressing=True when scaling only

We can only merge this after the library-go changes are merged.

/hold
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid. 3 validation(s) were run on this bug. No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.
Actionable comments posted: 1
In `@pkg/controllers/deployment/deployment_controller.go`:
- Around lines 269-283: setRollingUpdateParameters mutates RollingUpdate fields, so scaling.ProcessDeployment sees non-scaling diffs. To fix, normalize the RollingUpdate parameters before the scaling-only comparison: make a copy of expectedDeployment (or a temp deployment) and set its Spec.Strategy.RollingUpdate to match existingDeployment.Spec.Strategy.RollingUpdate (or nil), so that only Replicas differ, then call scaling.ProcessDeployment(existingDeployment, normalizedExpected, ...). Still use the original expectedDeployment for applying changes, but use the normalized copy for the conditionOverwrites call involving setRollingUpdateParameters and scaling.ProcessDeployment.
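A minimal sketch of the suggested normalization. The struct types here are simplified stand-ins for the real `*appsv1.Deployment`, and the helpers `normalizeForScalingCheck` and `scalingOnlyDiff` are illustrative names, not code from the PR:

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes Deployment types; the real code
// operates on *appsv1.Deployment with Spec.Strategy.RollingUpdate.
type RollingUpdate struct{ MaxSurge, MaxUnavailable string }

type DeploymentSpec struct {
	Replicas      int32
	RollingUpdate *RollingUpdate
}

type Deployment struct{ Spec DeploymentSpec }

// normalizeForScalingCheck returns a copy of expected whose RollingUpdate
// parameters are aligned with existing, so a subsequent comparison sees
// only the replica-count difference.
func normalizeForScalingCheck(existing, expected *Deployment) *Deployment {
	out := *expected
	if existing.Spec.RollingUpdate != nil {
		ru := *existing.Spec.RollingUpdate
		out.Spec.RollingUpdate = &ru
	} else {
		out.Spec.RollingUpdate = nil
	}
	return &out
}

// scalingOnlyDiff reports whether the two specs differ only in Replicas.
func scalingOnlyDiff(existing, expected *Deployment) bool {
	if (existing.Spec.RollingUpdate == nil) != (expected.Spec.RollingUpdate == nil) {
		return false
	}
	if existing.Spec.RollingUpdate != nil && *existing.Spec.RollingUpdate != *expected.Spec.RollingUpdate {
		return false
	}
	return existing.Spec.Replicas != expected.Spec.Replicas
}

func main() {
	existing := &Deployment{Spec: DeploymentSpec{Replicas: 2}}
	expected := &Deployment{Spec: DeploymentSpec{
		Replicas:      3,
		RollingUpdate: &RollingUpdate{MaxSurge: "25%", MaxUnavailable: "25%"},
	}}

	// Without normalization, the mutated RollingUpdate masks the scaling-only diff.
	fmt.Println(scalingOnlyDiff(existing, expected)) // → false
	// With normalization, only Replicas differ.
	fmt.Println(scalingOnlyDiff(existing, normalizeForScalingCheck(existing, expected))) // → true
}
```

The original `expected` is left untouched, so it can still be used for the actual apply; only the comparison sees the normalized copy.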
```diff
-actualDeployment, err := target.syncDeployment(context.TODO(), &scenario.operator.Spec.OperatorSpec, &scenario.operator.Status.OperatorStatus, eventRecorder)
+actualDeployment, _, err := target.syncDeployment(context.TODO(), &scenario.operator.Spec.OperatorSpec, &scenario.operator.Status.OperatorStatus, eventRecorder)
```
I should add some tests here to also check the condition overwrites, I guess?
```go
conditionOverwrites, err := scaling.ProcessDeployment(existingDeployment, expectedDeployment, clock.RealClock{}, "OAuthServer")
if err != nil {
	return nil, false, nil, append(errs, err)
}
```
There are really no tests for this package, so I can't extend them to make sure this works. For now, only the scaling package is properly tested.
Actionable comments posted: 1
In `@pkg/operator/workload/sync_openshift_oauth_apiserver.go`:
- Around lines 287-297: Add a short comment above the call to encryptionkms.AddKMSPluginVolumeAndMountToPodSpec explaining why the KMS volume/mount is applied before calling scaling.ProcessDeployment (e.g., "KMS volumes must be added before scaling detection so ProcessDeployment sees the final podSpec and does not misattribute changes", or note that the ordering is intentional so the KMS config is present when comparing specs), mirroring the explanatory style used in deployment_controller.go. Reference AddKMSPluginVolumeAndMountToPodSpec and scaling.ProcessDeployment in the comment so future readers understand the rationale.
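The intended ordering could be sketched roughly as below. Everything here is a simplified stand-in: `addKMSPluginVolumeAndMount` and `processDeployment` approximate the roles of the real encryptionkms.AddKMSPluginVolumeAndMountToPodSpec and scaling.ProcessDeployment, and the `PodSpec` struct is made up for illustration:

```go
package main

import "fmt"

// PodSpec is a toy stand-in for corev1.PodSpec, reduced to volume names.
type PodSpec struct{ Volumes []string }

// addKMSPluginVolumeAndMount mimics adding the KMS plugin volume to the
// expected pod spec, as the real helper does.
func addKMSPluginVolumeAndMount(spec *PodSpec) {
	spec.Volumes = append(spec.Volumes, "kms-plugin")
}

// processDeployment mimics the scaling check: any pod-spec difference
// (here, a volume-count mismatch) means the change is not scaling-only.
func processDeployment(existing, expected *PodSpec) (scalingOnly bool) {
	return len(existing.Volumes) == len(expected.Volumes)
}

func main() {
	existing := &PodSpec{Volumes: []string{"kms-plugin"}}
	expected := &PodSpec{}

	// KMS volumes must be added before scaling detection so that
	// processDeployment compares the *final* pod spec and does not
	// misattribute the missing KMS volume as a template change.
	addKMSPluginVolumeAndMount(expected)

	fmt.Println(processDeployment(existing, expected)) // → true
}
```

If the KMS volume were added after the comparison, the diff would look like a template change and Progressing would flip even for a pure scale.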
@tchap: The following tests failed. Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
```go
// When the timestamp annotation is present, we should overwrite Progressing to be false.
//
// So, ProcessDeployment amends the expected deployment in place, also returning any conditions to set on the operator.
func ProcessDeployment(existing, expected *appsv1.Deployment, clock clock.Clock, conditionPrefix string) ([]*applyoperatorv1.OperatorConditionApplyConfiguration, error) {
```
I understand what this is trying to solve, but the place is not correct. We cannot expect each deployment instance to have such a specific logic dedicated to progressing condition. This responsibility has to lie in the library-go or even better in the deployment controller.
The deployment controller should correctly report the progressing state via the Progressing condition. I think the best course of action should be to remove the Generation+ObservedGeneration check from library-go. As we can see in OCPBUGS-65896, not every spec update (e.g. scaling) should result in a progressing state, so generation tracking is not well equipped to do that.
There will now be a small risk of reporting the progressing state late, before the controller reacts to the real template change. This should be okay unless the controller is down (which causes much bigger issues) or there is a long queue in the controller. Normally the reaction time should be similar to what we would have when processing the annotation changes, so I think it should be okay to live with this small delay.
Please also note openshift/cluster-image-registry-operator#1293.
Also, I think it might be useful to surface ProgressDeadlineExceeded reason in a degraded condition to report that the deployment controller timed out rolling out the new changes.
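That last suggestion could look roughly like the sketch below. `DeploymentCondition` is a simplified stand-in for appsv1.DeploymentCondition, and `degradedFromConditions` is a hypothetical mapping function; the Progressing/ProgressDeadlineExceeded type and reason strings, however, are the ones the real Deployment controller sets:

```go
package main

import "fmt"

// DeploymentCondition is a toy stand-in for appsv1.DeploymentCondition.
type DeploymentCondition struct {
	Type, Status, Reason string
}

// degradedFromConditions scans the deployment conditions and reports
// whether the controller timed out rolling out changes, which the
// operator could then surface as a Degraded condition.
func degradedFromConditions(conds []DeploymentCondition) (degraded bool, reason string) {
	for _, c := range conds {
		// The Deployment controller sets Progressing=False with reason
		// ProgressDeadlineExceeded once spec.progressDeadlineSeconds elapses.
		if c.Type == "Progressing" && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return true, c.Reason
		}
	}
	return false, ""
}

func main() {
	conds := []DeploymentCondition{
		{Type: "Available", Status: "True", Reason: "MinimumReplicasAvailable"},
		{Type: "Progressing", Status: "False", Reason: "ProgressDeadlineExceeded"},
	}
	degraded, reason := degradedFromConditions(conds)
	fmt.Println(degraded, reason) // → true ProgressDeadlineExceeded
}
```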
I am closing this in favor of openshift/library-go#2128.
@tchap: This pull request references Jira Issue OCPBUGS-65896. The bug has been updated to no longer refer to the pull request using the external bug tracker.
Before we start: the idea basically is to extend Sync to return conditions that would overwrite the conditions generated by the library-go machinery. The interface change is not yet implemented in library-go, but it should be OK as this operator is the only operator using the package.

To handle scaling properly, we need to keep some context regarding the state of the Deployment, so I decided to propose a few Deployment annotations. The names are not final; I'm not sure what the proper prefix actually is. Not sure whether there is a better way to store state, feel free to propose another way.

The main idea is summarized in the doc comment for ProcessDeployment. The annotation machinery is there to account for eventual consistency between creating/updating the deployment and the deployment controller picking the change up and actually updating the conditions there.