
CNTRLPLANE-2204: Add Progressing condition for Hypershift NodePool status tracking#1937

Open
sdminonne wants to merge 2 commits into openshift:master from sdminonne:CNTRLPLANE-2204

Conversation

@sdminonne

This enhancement introduces a Progressing condition to the Hypershift NodePool API.

The new condition provides a clear signal when a NodePool is progressing or has reached a terminal state. It clearly indicates when the controller cannot make forward progress due to persistent failures requiring manual intervention. The same condition indicates when the NodePool reaches the desired state: Progressing=False, Reason=AsExpected.

The condition follows the standard Kubernetes pattern (similar to Deployments) and helps users and automation distinguish between transient issues that may self-heal and persistent failures like cloud provider quota exhaustion, missing infrastructure resources, or critical controller errors.

When Progressing=False with an error Reason supplied, the condition may include specific recovery guidance to help operators remediate the issue.
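For illustration only, here is a minimal sketch of what the two states described above could look like as `metav1.Condition` values; the message strings and the exact reason names are hypothetical examples, not part of the proposed API.

```go
// Hypothetical illustration: reason/message strings are examples, not the
// constants proposed by this enhancement.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// NodePool has reached the desired state (as described above).
	healthy := metav1.Condition{
		Type:    "Progressing",
		Status:  metav1.ConditionFalse,
		Reason:  "AsExpected",
		Message: "NodePool has reached the desired state",
	}

	// NodePool is blocked by a persistent failure and needs manual intervention.
	blocked := metav1.Condition{
		Type:    "Progressing",
		Status:  metav1.ConditionFalse,
		Reason:  "CloudQuotaExceeded", // example of an error Reason carrying recovery guidance
		Message: "cloud provider quota exhausted; request a quota increase and scale again",
	}

	fmt.Printf("%+v\n%+v\n", healthy, blocked)
}
```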

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 4, 2026
@openshift-ci-robot

openshift-ci-robot commented Feb 4, 2026

@sdminonne: This pull request references CNTRLPLANE-2204 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead.



Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from csrwng and sjenning February 4, 2026 18:03
@openshift-ci
Contributor

openshift-ci bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign csrwng for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@joshbranham

/assign @joshbranham

Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
@openshift-ci
Contributor

openshift-ci bot commented Feb 5, 2026

@sdminonne: all tests passed!

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sdminonne
Author

cc @andreadecorte

@joshbranham

I will review by EOD Tuesday!


@joshbranham joshbranham left a comment


This should help ROSA lots, thank you for putting this together.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 10, 2026

1. Should `Progressing=False` prevent modifications to the NodePool spec?
- Current answer: No. Users should be able to fix the spec to recover.
- This may be revisited during implementation if use cases emerge.
Member


I don't see how any condition should ever prevent changes to the spec. You can always choose to short-circuit reconciliation.

Author


Agree, I'll remove this.

condition.

This was rejected because:
- `AllMachinesReady=False` is already overloaded and represents many
Member


AllMachinesReady is an unequivocal statement and I think a good place to capture these failures.

Progressing is ambiguous and opinionated, which is a double-edged sword.

Author


The original idea was to introduce a Failed condition, which I agree could be seen as AllMachinesReady=false. Then, after discussion with @csrwng, we thought following the Deployment Progressing condition could be better. Ah, about that: I agree the proposal is erroneously setting Progressing=false when the NodePools are up and running.
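For context on the Deployment analogy referenced above: a healthy, fully rolled-out Deployment keeps `Progressing=True` with reason `NewReplicaSetAvailable`. A minimal sketch of that analogous condition, using the apps/v1 types (the message string is only typical, not exact):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The condition a healthy, settled Deployment typically carries.
	cond := appsv1.DeploymentCondition{
		Type:    appsv1.DeploymentProgressing,
		Status:  corev1.ConditionTrue,
		Reason:  "NewReplicaSetAvailable",
		Message: "ReplicaSet has successfully progressed.",
	}
	fmt.Printf("%+v\n", cond)
}
```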


**5. Threshold Configuration**

Define threshold constants for each failure type:
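A rough sketch of what such constants might look like; the names are hypothetical, and pairing the 30- and 15-minute values (quoted elsewhere in this review) with these specific failure types is an assumption, not the proposal's actual mapping.

```go
// Hypothetical per-failure-type thresholds; names and value pairings are
// assumptions, not the enhancement's actual constants.
package nodepool

import "time"

const (
	// How long quota-related Machine failures must persist before the
	// controller sets Progressing=False (30-minute value quoted in review).
	cloudQuotaFailureThreshold = 30 * time.Minute

	// How long "missing infrastructure resources" errors (deleted IAM roles,
	// security groups, ...) must persist before Progressing=False
	// (15-minute value quoted in review).
	missingResourcesThreshold = 15 * time.Minute
)
```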
Member

@enxebre enxebre Feb 12, 2026


How are these thresholds meaningful? e.g. if an instance backed by a Machine failed to be created in AWS because of quota, that Machine CR would stay failed forever.
Even if quota limits are fixed in AWS, that particular Machine won't recover, so the heuristic seems flawed.

Maybe a good signal would be a percentage of expected vs. status replicas in the heuristic, e.g. a NodePool only has <= 40% of its desired replicas during a period of time. Then you inspect the cause from the Machine CRs. That still wouldn't mean you are suffering quota issues at that particular time.
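A minimal sketch of the replica-ratio heuristic described in this comment; the helper name is hypothetical, and the 40% figure comes from the comment itself.

```go
// Sketch of the suggested heuristic: flag a NodePool running at or below
// 40% of its desired replicas, to be combined with a time window and
// inspection of the Machine CRs. The helper name is hypothetical.
package nodepool

func belowReplicaRatio(desiredReplicas, readyReplicas int32) bool {
	if desiredReplicas <= 0 {
		return false
	}
	return float64(readyReplicas)/float64(desiredReplicas) <= 0.40
}
```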

Member

@enxebre enxebre Feb 12, 2026


a NodePool only has <= 40% of its desired replicas during a period of time

Is there anything preventing SREs from monitoring this today? cc @joshbranham

30-minute threshold.

- **Missing Resources Detection**: Detect when required infrastructure
resources are deleted after initial setup (IAM roles, security groups,
Member


The above data is inferred from capxMachine resources; how would this Missing Resources detection actually be implemented? Would this come from CAPI as well?

Member


Also can you include details on how this will work with v1beta2 CAPI CRs?

etc.). Set `Progressing=False` if errors persist beyond 15-minute
threshold.

- **Capacity Detection**: Check if all machines consistently fail with
Member


Why all machines here and a single machine above?

(`hypershift-operator/controllers/nodepool/conditions.go`) for each
failure scenario:

- **Cloud Quota Detection**: Inspect CAPI Machine `FailureReason` and
Member


Is the intent of the logic here that a single machine failure dictates the Progressing state for the NodePool?
See https://github.com/openshift/enhancements/pull/1937/changes#r2797728248


#### Standalone Clusters

This enhancement is not applicable to standalone OpenShift clusters, as
Member


Can we please mention here any prior art for this existing in MAPI and Hive nodepools?

**New Reason Constants:**

When `Progressing=False`:
- `CloudQuotaExceeded`: Cloud provider quota limits reached
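The excerpt above is truncated; as a sketch of how such reason constants might be declared, only `CloudQuotaExceeded` comes from the excerpt, while the constant names and the second entry below are purely hypothetical placeholders.

```go
// Hypothetical declaration style; CloudQuotaExceeded comes from the excerpt
// above, InfrastructureResourceMissing is an invented placeholder.
package nodepool

const (
	// Cloud provider quota limits reached.
	NodePoolProgressingCloudQuotaExceededReason = "CloudQuotaExceeded"

	// Required infrastructure resources (IAM roles, security groups, ...)
	// were deleted after initial setup.
	NodePoolProgressingInfrastructureResourceMissingReason = "InfrastructureResourceMissing"
)
```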
Member


What's the heuristic to decide any of these reasons?
e.g. what happens when there's 1 machine that hits quota, but there are another 10 in the same NodePool that are running successfully? See https://github.com/openshift/enhancements/pull/1937/changes#r2797728248


The `Progressing` condition follows the standard Kubernetes pattern
established by Deployments and other core resources, where
`Progressing=False` indicates that a resource is not progressing anymore. The `reason` for the condtion value will give indication if the NodePool is blocked, for example for bad configuration, or an 'AsExpected' reason will indicate a succes. This is more semantically accurate than a `Failed`
Member


typo condtion, succes

Member


Progressing=False with reason AsExpected mixes two different semantics for true/false. Since you mention the Deployments analogy, I believe a healthy Deployment is always Progressing=True.

Author


Right, my bad


### User Stories

* As a cluster administrator, I want to know when my NodePool has
Member


These user stories are generic around "persistent failure". Can we include specific SRE scenarios where this condition would be valuable for specific "persistent failures"?

for users and automation to detect and respond to these failures.
Currently, failures like cloud provider quota exhaustion, IAM role
deletion, or persistent CAPI provider errors are reported only through
generic conditions such as `AllMachinesReady=False`, which cannot
Member


If AllMachinesReady reported the root failure clearly, I see the decision on how to proceed as an exercise for the consumer.
Because failed machines remain as such forever, it is very ambiguous to let a semantic decide what requires intervention.

@enxebre
Member

enxebre commented Feb 12, 2026

thanks!! dropped some questions

This enhancement does not modify the behavior of existing NodePool
resources beyond adding the new condition. The condition is purely
informational and does not affect the controller's reconciliation logic,
except that the controller may use longer requeue delays when
Member


Why wouldn't the controller's built-in rate limiter be enough?

