CNTRLPLANE-2204: Add Progressing condition for Hypershift NodePool status tracking #1937
sdminonne wants to merge 2 commits into openshift:master
Conversation
Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
@sdminonne: This pull request references CNTRLPLANE-2204 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target either version "4.22." or "openshift-4.22.", but it targets "openshift-4.21" instead.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/assign @joshbranham
Signed-off-by: Salvatore Dario Minonne <sminonne@redhat.com>
@sdminonne: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
I will review by EOD Tuesday!
joshbranham left a comment
This should help ROSA lots, thank you for putting this together.
/lgtm
> 1. Should `Progressing=False` prevent modifications to the NodePool spec?
>    - Current answer: No. Users should be able to fix the spec to recover.
>    - This may be revisited during implementation if use cases emerge.
I don't see how any condition should ever prevent changes to the spec. You can always choose to short-circuit reconciliation.
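For illustration, a minimal sketch (not the actual hypershift code) of what short-circuiting reconciliation on a stuck NodePool could look like; the `NodePool` shape and field names here are hypothetical stand-ins:

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical stand-in for the NodePool object; the real hypershift types differ.
type NodePool struct {
	Generation         int64
	ObservedGeneration int64
	Conditions         []metav1.Condition
}

// The condition never blocks spec edits: the API server always accepts them.
// The controller simply skips expensive work until a new generation shows
// the user changed something worth retrying.
func reconcile(ctx context.Context, np *NodePool) error {
	stuck := meta.IsStatusConditionFalse(np.Conditions, "Progressing")
	if stuck && np.ObservedGeneration == np.Generation {
		return nil // nothing new to act on; wait for a spec change
	}
	// ... normal machine reconciliation would go here ...
	return nil
}
```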
> condition.
>
> This was rejected because:
> - `AllMachinesReady=False` is already overloaded and represents many
AllMachinesReady is an unequivocal statement and I think a good place to capture these failures.
Progressing is ambiguous and opinionated, which is a double-edged sword.
The original idea was to introduce a Failed condition, which I agree could be seen as AllMachinesReady=false. Then, after discussion with @csrwng, we thought that following the Deployment Progressing condition could be better. About that, I agree the proposal is erroneously setting Progressing=False when the NodePools are up and running.
> **5. Threshold Configuration**
>
> Define threshold constants for each failure type:
how are these thresholds meaningful? e.g. if an instance backed by a Machine failed to be created in AWS because of quota, that Machine CR would be failed forever.
Even if quota limits are fixed in AWS, that particular Machine won't recover, so the heuristic seems flawed.
Maybe a good signal would be a percentage of the expected vs status replicas in the heuristic, e.g. a NodePool only has <= 40% of its desired replicas during a period of time. Then you inspect the cause from the Machine CRs. That still wouldn't mean you are suffering quota issues at that particular time.
> a NodePool only has <= 40% of its desired replicas during a period of time

Is there anything preventing SREs from monitoring this today? cc @joshbranham
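For illustration, a rough sketch of the percentage-based signal suggested above; the constant names, the 40% figure, and the grace period are assumptions, not part of the proposal:

```go
package example

import "time"

// Illustrative thresholds for the ready-replica-fraction heuristic.
const (
	minReadyFraction    = 0.4              // "<= 40% of desired replicas"
	degradedGracePeriod = 30 * time.Minute // how long the shortfall must persist
)

// progressingStalled reports whether the NodePool has stayed below the
// ready-replica fraction for longer than the grace period. Root-cause
// inspection of the Machine CRs would still happen separately.
func progressingStalled(desired, ready int32, belowSince, now time.Time) bool {
	if desired == 0 {
		return false
	}
	fraction := float64(ready) / float64(desired)
	return fraction <= minReadyFraction && now.Sub(belowSince) >= degradedGracePeriod
}
```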
> 30-minute threshold.
>
> - **Missing Resources Detection**: Detect when required infrastructure
>   resources are deleted after initial setup (IAM roles, security groups,
The above data is inferred from CAPx Machine resources; how would this Missing Resources detection actually be implemented? Would this come from CAPI as well?
Also can you include details on how this will work with v1beta2 CAPI CRs?
> etc.). Set `Progressing=False` if errors persist beyond 15-minute
> threshold.
>
> - **Capacity Detection**: Check if all machines consistently fail with
Why all machines here and a single machine above?
> (`hypershift-operator/controllers/nodepool/conditions.go`) for each
> failure scenario:
>
> - **Cloud Quota Detection**: Inspect CAPI Machine `FailureReason` and
Is the intent here that a single Machine failure dictates the Progressing state for the NodePool?
See https://github.com/openshift/enhancements/pull/1937/changes#r2797728248
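For illustration only, a sketch of what inspecting CAPI Machine failure fields for quota errors might look like; the substring matching and the idea of counting failures (rather than reacting to a single Machine) are assumptions addressing the question above, not the proposal's stated design. Note that these failure fields are being phased out in newer CAPI API versions, which ties into the v1beta2 question raised earlier.

```go
package example

import (
	"strings"

	capiv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// quotaFailedMachines counts Machines whose failure message looks quota
// related, so the caller can decide on a ratio instead of letting a single
// failed Machine dictate the NodePool's Progressing state.
func quotaFailedMachines(machines []capiv1.Machine) int {
	count := 0
	for i := range machines {
		if machines[i].Status.FailureMessage == nil {
			continue
		}
		// Provider error strings differ; matching well-known substrings is
		// an assumption, not an established hypershift convention.
		msg := strings.ToLower(*machines[i].Status.FailureMessage)
		if strings.Contains(msg, "quota") || strings.Contains(msg, "limit exceeded") {
			count++
		}
	}
	return count
}
```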
> #### Standalone Clusters
>
> This enhancement is not applicable to standalone OpenShift clusters, as
Can we please mention here any prior art for this that exists in MAPI and Hive NodePools?
> **New Reason Constants:**
>
> When `Progressing=False`:
> - `CloudQuotaExceeded`: Cloud provider quota limits reached
What's the heuristic to decide any of these reasons?
e.g. what happens when there's 1 Machine that hits quota, but there are 10 others in the same NodePool that are running successfully? See https://github.com/openshift/enhancements/pull/1937/changes#r2797728248
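As a purely illustrative answer to that question, one could imagine a majority rule over failed Machines; the helper name, the second reason constant, and the rule itself are assumptions, not something the proposal specifies (only `CloudQuotaExceeded` and `AsExpected` appear in the quoted text):

```go
package example

// Reason constants modeled on the proposal text.
const (
	ReasonAsExpected         = "AsExpected"
	ReasonCloudQuotaExceeded = "CloudQuotaExceeded"
)

// pickReason sketches one possible rule: a failure reason wins only when a
// majority of the NodePool's failed Machines point at the same cause and the
// failures themselves are a majority of all Machines, so a single quota-hit
// Machine among ten healthy ones would not flip the condition.
func pickReason(quotaFailures, totalFailed, totalMachines int) string {
	if totalFailed > 0 && quotaFailures*2 > totalFailed && totalFailed*2 > totalMachines {
		return ReasonCloudQuotaExceeded
	}
	return ReasonAsExpected
}
```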
> The `Progressing` condition follows the standard Kubernetes pattern
> established by Deployments and other core resources, where
> `Progressing=False` indicates that a resource is not progressing anymore. The `reason` for the condition value will indicate whether the NodePool is blocked, for example by bad configuration, while an `AsExpected` reason will indicate success. This is more semantically accurate than a `Failed`
Progressing=False with reason AsExpected mixes two different semantics for true/false. Since you mention the Deployments analogy, I believe a healthy Deployment is always Progressing=True.
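For reference, the Deployment analogy the comment points at: a healthy, fully rolled-out Deployment keeps Progressing=True with reason NewReplicaSetAvailable, and only flips to Progressing=False (ProgressDeadlineExceeded) when the controller gives up. A sketch of both shapes, using the generic metav1.Condition type for illustration rather than the actual Deployment or NodePool condition types:

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// How a Deployment reports "done and healthy": Progressing stays True.
var healthyDeploymentStyle = metav1.Condition{
	Type:   "Progressing",
	Status: metav1.ConditionTrue,
	Reason: "NewReplicaSetAvailable",
}

// How a Deployment reports "stuck": Progressing goes False with a failure reason.
var stalledDeploymentStyle = metav1.Condition{
	Type:   "Progressing",
	Status: metav1.ConditionFalse,
	Reason: "ProgressDeadlineExceeded",
}
```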
> ### User Stories
>
> * As a cluster administrator, I want to know when my NodePool has
These user stories are generic around "persistent failure". Can we include specific SRE scenarios where this condition would be valuable for specific "persistent failures"?
> for users and automation to detect and respond to these failures.
> Currently, failures like cloud provider quota exhaustion, IAM role
> deletion, or persistent CAPI provider errors are reported only through
> generic conditions such as `AllMachinesReady=False`, which cannot
If AllMachinesReady clearly reported the root failure, I see the decision on how to proceed as an exercise for the consumer.
Because failed Machines remain as such forever, it is very ambiguous to let a semantic decide what requires intervention.
thanks!! dropped some questions
> This enhancement does not modify the behavior of existing NodePool
> resources beyond adding the new condition. The condition is purely
> informational and does not affect the controller's reconciliation logic,
> except that the controller may use longer requeue delays when
Why wouldn't the controller's built-in rate limiter be enough?
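A small sketch of the distinction behind that question, under the assumption (not stated in the proposal) that the controller would return no error for a recognised persistent failure: the workqueue's built-in rate limiter only backs off on returned errors, so a slow periodic recheck would have to be requested explicitly. The helper name and the 10-minute value are illustrative:

```go
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// reconcileResult shows the two retry paths. Returning an error lets the
// default exponential backoff (the built-in rate limiter) drive retries;
// returning nil with RequeueAfter asks for a slow recheck instead, e.g. in
// case a cloud quota is eventually raised.
func reconcileResult(persistentFailure bool, err error) (ctrl.Result, error) {
	if err != nil {
		return ctrl.Result{}, err // built-in rate limiter handles backoff
	}
	if persistentFailure {
		return ctrl.Result{RequeueAfter: 10 * time.Minute}, nil
	}
	return ctrl.Result{}, nil
}
```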
This enhancement introduces a `Progressing` condition to the Hypershift NodePool API. The new condition provides a clear signal about whether a NodePool is progressing or has reached a terminal state. It clearly indicates when the controller cannot make forward progress due to persistent failures requiring manual intervention. The same condition indicates when the NodePool has reached the desired state: Progressing=False, Reason=AsExpected.

The condition follows the standard Kubernetes pattern (similar to Deployments) and helps users and automation distinguish between transient issues that may self-heal and persistent failures like cloud provider quota exhaustion, missing infrastructure resources, or critical controller errors.

When `Progressing=False` with an error (Reason supplied), the condition may include specific recovery guidance to help operators remediate the issue.
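To make the summary concrete, a hedged example of how the two states described here might read on a NodePool status; the reason strings, the message wording, and the use of the generic metav1.Condition shape are assumptions, not the final API:

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Stuck: Progressing=False with a failure reason and recovery guidance.
var progressingStuck = metav1.Condition{
	Type:    "Progressing",
	Status:  metav1.ConditionFalse,
	Reason:  "CloudQuotaExceeded",
	Message: "vCPU quota exhausted; request a quota increase and the NodePool will resume scaling",
}

// Desired state reached, per the summary's semantics: Progressing=False, Reason=AsExpected.
var progressingAsExpected = metav1.Condition{
	Type:   "Progressing",
	Status: metav1.ConditionFalse,
	Reason: "AsExpected",
}
```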