NE-2489: Gateway API without OLM followup #1938
NE-2489: Gateway API without OLM followup #1938rikatz wants to merge 5 commits intoopenshift:masterfrom
Conversation
|
Skipping CI for Draft Pull Request. |
|
@rikatz: This pull request references ne-2470 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| * If all OLM subscriptions are removed: The cluster-ingress-operator owns them. More below. | ||
| Open Question: How do you handle fields that are extra but withouit a default. | ||
| */ | ||
| /* Todo (by Candace) |
There was a problem hiding this comment.
commenting here:
We are not just installing Gateway API. We are installing Istio, and Istio may need its own APIs. The way I see, it is comparable to something like "we have an operator XYZ that we rely just on part of the solution. Installing a partial manifest is worst then installing it fully". I understand the concern here, but as pointed during the meeting:
- we are NOT providing Istio for our customers, but Gateway API
- but we are providing Gateway API with Istio for layered products that are authorized to rely on Istio CRs to modify the Gateway API behavior on a supported way
- we need a compatibility matrix, that means "works with RHOAI, RHCL, etc" which we can simply state "this is the minimum version that we support" (based on the Istioversion we are installing) and then OSSM can take over it. What we need is a signal of who installed that CRD
Also, OSSM team stated that Istio CRDs are retrocompatible and as opposite to Gateway API CRDs, they are well controller
This PR is a followup on the EP for Gateway API without OLM. It updates how Istio CRDs should be installed by CIO, how to deal with the existence of OSSM installed on the cluster, what conditions should be considered and also provides some other followup fixes on the EP
74a559e to
243c586
Compare
|
@rikatz: This pull request references NE-2489 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@rikatz: This pull request references NE-2489 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
| set from the beginning prevents this version fragmentation. | ||
|
|
||
| Additionally, since Istio will support resource filtering in a future release, this | ||
| approach allows layered products to simply request new resources to be added to the filter configuration |
There was a problem hiding this comment.
doesn't this invalidate the point above about layered products not requiring changes in CIO to use new Istio CRDs?
There was a problem hiding this comment.
right, the point is: we do not want to have to bump the CRDs (this may mean even bumping the library, etc) but instead "we can simply change the PILOT_RESOURCE_FILTER_INCLUDE" to add this resource that we were not reconciling on pilot before.
There was a problem hiding this comment.
@rikatz In this scenario, isn't it better to have the FILTER match the available CRDs if we can? It's a lot easier to detect a missing CRD for a layered product than some possible obscure bug due to a config change that was never picked up.
There was a problem hiding this comment.
actually, I wouldn't like to have pilot using resources that it shouldn't for N/S traffic. Eg.: VirtualServices.
Otherwise a user that installs OSSM will end up having the N/S controller having the same capabilities, and be able to do unsupported actions
| For example, if `CIO` installed only a subset of CRDs, and a user subsequently | ||
| installs OSSM and later removes the OSSM subscription, the cluster could end up | ||
| with a mix of CRD versions (some from OSSM, some from CIO). Installing the complete | ||
| set from the beginning prevents this version fragmentation. |
There was a problem hiding this comment.
This assumes a stable set. OSSM might install a subset of what CIO could install(a later version could introduce a new type for instance). If CIO only install a complete set or nothing, we have to accept that CIO might not provide 100% of the features given the CRDs available from another source.
There was a problem hiding this comment.
agreed! but is any other situation better, from a perspective of consistency? I am ok if we say "sail operator will install whatever sail operator thinks it is best for the profile", but we need to be sure to provide a status for users on how are Istio CRDs from a Gateway API perspective, and checking the consistency of istio.io group is easier than a per-CRD state.
As an example:
- CIO installs 3 Istio CRDs - status is ManagedByCIO
- OLM installs 4 Istio CRDs - status is ManagedByOLM
- OLM is uninstalled - sail-operator library marks the 3 provided CRDs back with CIO label - we have an orphan CRD - I can use this information to provide it on the status saying "one CRD is unknown by CIO"
The point of the status is letting layered products and users know the state of Istio CRD and take the decision like "should RHCL be installed on this environment?"
There was a problem hiding this comment.
I think the idea of providing status of Istio CRDs is beyond the scope. Managing Istio CRDs shouldn't become a majority of the work required for a non-OLM installation, and we should be doing the minimum required for now.
There was a problem hiding this comment.
I like your cases, and yes, we only need to worry about the CRDs we can install (so if OLM installs more, or a 3rd party does, we don't know or care).
But we need to handle the mixed case... what if our label is on one CRD, but is missing on others? I assume we have to take the "worst" management for the overall status? But what do we do with the one labelled CRD? Do we update it since we own it? Even if we can't update the full set?
There was a problem hiding this comment.
I will actually defer this question to @aslakknutsen as we are considering this to be part of the sail-operator library management.
There was a problem hiding this comment.
I'd say;
- Try to own the complete set we know about/want
- If we can't because someone else does; attempt to detect who is/what is happening and report that to "some" status/user. If that's OSSM/OLM, or unknown or some mixture is some what irrelevant, we can't do anything more or less in either case then report.
- Install "istio" regardless and hope it works (the crds are pretty stable, and not needed at all by the core functionality of gateway api)
There was a problem hiding this comment.
echoing my proposal on Slack:
- CIO passes to sail-operator the label we expect to exist on CRD
- sail-operator is the one that takes the decision about installing/updating the CRDs. This process must always use REPLACE/UPDATE (never patch) so the label map is not merged
- If there is no CRD, or the CRDs has the label provided by CIO, sail-operator installs the CRDs or updates based on its version
- If there are CRDs and they have the OLM label, sail-operator does nothing with the CRDs
- If there are CRDs and they do not have OLM label or CIO label, sail-operator does nothing with the CRDs
- If there are CRDs and they are on partial mismatch state (eg.: some has OLM labels, some has CIO labels, some has no labels) sail-operator does nothing with CRDs
- CIO watches a list of CRDs that sail-operator provides (this is a static list based on what sail-operator will always be installing) to know what it cares and add the conditions to gateway class
Given CIO watches the CRDs, we can call a new reconciliation and call sail-operator library in case we detect some change on the CRDs
This way we have some sort of "idempotent" reconciliation here: sail-operator cares about the whole installation but CRD installation is not added to its informer. Instead it expects to be called specifically to reconcile the CRDs
|
|
||
| An Istio CRD can exist in one of three management states: | ||
|
|
||
| 1. **Managed by CIO**: Contains the label `ingress.operator.openshift.io/owned` |
There was a problem hiding this comment.
Question - is the "owned" label protected by RBAC? Can someone introduce a security issue by adding this label, accidentally or maliciously?
There was a problem hiding this comment.
there is no such RBAC protection for fields, it would be the whole resource. But no, we are intentionally not protecting these CRDs.
- In case we have CIO and users install a 3rd party Istio - CIO will add a condition to GatewayClass status letting users and layered product know that the installed Istio CRDs are not the original
- In case we have OSSM and users install 3rd party Istio - same situation as above, saying it is not managed by a known service (CIO or OSSM)
A user can rollback this by simply adding the label mentioned here, and the sail-operator library will reconcile the CRDs back
| contain the annotation `"helm.sh/resource-policy": keep` to prevent deletion | ||
| during Helm operations. | ||
|
|
||
| When `CIO` installs Istio CRDs, it installs the complete set of CRDs provided by |
There was a problem hiding this comment.
How many CRDs are there? What are the drawbacks to maintaining so many new but unused CRDs and how do we handle them?
There was a problem hiding this comment.
I am not sure tbh how this impacts the current proposal. We don't know the CRDs, they will be installed by sail-operator library. We know we will reconcile them and see if all of them are on the "managed by Openshift" state.
The drawback is similar to any other CRD on the cluster (like ClusterAPI CRDs, Gateway API CRDs that are there but never used by users), they are just installed CRDs.
|
|
||
| ### API Extensions | ||
|
|
||
| This enhancement does not introduce new CRDs or API extensions. |
There was a problem hiding this comment.
I suggest we add an API knob to allow the cluster admin to configure whether their cluster ingress solution was "RHCL-enabled" or "RHOAI-enabled", and not be forced to install any unneeded Istio CRDs.
There was a problem hiding this comment.
I have discussed this with @Miciah yesterday. The proposed approach here is the simpler as possible, adding a knob adds unnecessary complexity that we want to avoid.
| which includes a set of Custom Resource Definitions required for Istio and its | ||
| integrations to work properly. | ||
|
|
||
| An Istio CRD can exist in one of three management states: |
There was a problem hiding this comment.
"An Istio CRD" -- This is a good way to phrase it... I assume we have to do this on a CRD by CRD basis for all Istio CRDs.
There was a problem hiding this comment.
yes, for the validation. But I am still unsure if I want a single condition for every Istio CRD (not possible with the current GatewayClass API) or if I want one condition on the class that represents the CRD condition with Reason like "ManagedByCIO", "ManagedByOLM", etc and add a 4th condition as "VersionMismatch" and on the message explaining that some CRDs are managed by one component, and others by another component
| For example, if `CIO` installed only a subset of CRDs, and a user subsequently | ||
| installs OSSM and later removes the OSSM subscription, the cluster could end up | ||
| with a mix of CRD versions (some from OSSM, some from CIO). Installing the complete | ||
| set from the beginning prevents this version fragmentation. |
There was a problem hiding this comment.
I like your cases, and yes, we only need to worry about the CRDs we can install (so if OLM installs more, or a 3rd party does, we don't know or care).
But we need to handle the mixed case... what if our label is on one CRD, but is missing on others? I assume we have to take the "worst" management for the overall status? But what do we do with the one labelled CRD? Do we update it since we own it? Even if we can't update the full set?
| When CRDs exist but do not contain CIO or OSSM management labels: | ||
| 1. `CIO` sets the `CompatibleIstioCRDs` condition on the `GatewayClass` status to `False` with reason `UnknownManagement` to indicate the CRD management state | ||
| 2. If the user removes the third-party CRDs, `CIO` will install its own CRDs and update the condition to reflect CIO management | ||
| 3. Alternatively, if the user adds the `ingress.operator.openshift.io/owned` label to the existing CRDs, `CIO` will take ownership and manage them |
There was a problem hiding this comment.
Is there any concern about incompatible CRDs? Are field removals acceptable?
There was a problem hiding this comment.
I have answered it and Github didn't persisted, so let me try again:
On Gateway API side, I don't have many concerns on field removals. We are kind of stating "Our Gateway API ships with this version" and usually field removals will happen on downgrade.
A downgrade can happen on 2 moments:
- OLM installs a lower version - I expect this never happens (@dgn can confirm)
- User installs a lower version - Condition for CRDs gets into unmanaged
Based on that, layered products take the decision if they keep working or not. From our instance of istiod, if an upgrade happens with the addition of a new field nothing should happen because our version does not know that field yet (and our own products are the ones creating Istio CRs, so they should never be using fields that are not supported on that version).
If user downgrades,user is intentionally removing a field and this should be represented on layered product status.
|
|
||
| #### GatewayClass Condition | ||
|
|
||
| **Note**: Adding conditions is a soft requirement for this Enhancement Proposal. While |
There was a problem hiding this comment.
I'm a little nervous about this statement. It will make support and debugging harder if it is not implemented.
There was a problem hiding this comment.
I will remove it, during my discussion with @Miciah we felt this would be scope creep, but actually it does make sense for us (and it is not a huge effort) to add to GatewayClass status controller
knobunc
left a comment
There was a problem hiding this comment.
I like the revisions. Seems clear and complete. Nice work.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: knobunc The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| - The CRDs were installed by a third party | ||
| - Message: Indicates the reason for the status and that the user can either remove the CRDs to allow CIO to install its own version, or add the `ingress.operator.openshift.io/owned` label to allow CIO to take ownership and manage them | ||
|
|
||
| * **Status: `False`, Reason: `ManagementMismatch`** |
There was a problem hiding this comment.
We need to clarify what we do when we are in a mismatch state. Do we upgrade the CRDs we own, even if there are ones that we know we do not own?
Or do we say "it is dangerous to touch anything" and not update the CRDs?
There was a problem hiding this comment.
Nope, IMO we only touch the CRDs if they are on some known state:
- Managed by CIO
- No CRD present
- Managed by OLM, but no subscription (and even this one I think it is arguable cc @aslakknutsen )
There was a problem hiding this comment.
I mean, for the later one I was willing to have a seamless migration (eg.: user upgraded to OCP 4.22, they were on OLM, they are not anymore)
The fact is: we do not remove the existing subscription. So would it make sense to tell them "hey, if you were on 4.21 and now are on 4.22, you can remove the subscription but must run this oc label crd ..." or do we expect the platform to do everything?
|
@rikatz: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR is a followup on the EP for Gateway API without OLM.
It updates how Istio CRDs should be installed by CIO, how to deal
with the existence of OSSM installed on the cluster, what conditions
should be considered and also provides some other followup fixes on the EP