CNTRLPLANE-2640: Add HyperShift private CAPI types enhancement #1927
csrwng wants to merge 1 commit into openshift:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED.
We should call out all the CAPI platforms we support: CAPA, CAPZ, CAPV, CAP-Agent, CAPG, etc. |
@csrwng: This pull request references OCPSTRAT-2789 which is a valid Jira issue.

@csrwng: This pull request references CNTRLPLANE-2640 which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.
This enhancement proposes isolating HyperShift's Cluster API (CAPI) CRDs from those installed by the OpenShift platform on management clusters. As OpenShift evolves toward using CAPI for standalone cluster machine management, a conflict emerges: both the platform and HyperShift need to install CAPI CRDs on the same management cluster, potentially with incompatible versions. The proposal introduces two major components: 1. Private CAPI Types and API Proxy: HyperShift-specific CAPI CRDs using the cluster.hypershift.openshift.io group, with an API proxy sidecar to transparently translate between standard and private CAPI types. 2. Automatic Migration: A migration controller that automatically converts existing hosted clusters from standard to private CAPI types without disrupting operations. This enables independent version management for both HyperShift and platform CAPI dependencies while maintaining transparent operation for hosted cluster administrators and workloads.
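To make the first component concrete, a resource under the private group would mirror its standard CAPI counterpart, differing only in the API group. A hypothetical manifest fragment (names illustrative, not taken from the enhancement text):

```yaml
# Hypothetical Machine mirrored under the HyperShift-private API group.
# Everything except apiVersion matches the standard cluster.x-k8s.io shape.
apiVersion: cluster.hypershift.openshift.io/v1beta1
kind: Machine
metadata:
  name: example-machine
  namespace: clusters-example
spec:
  clusterName: example
  bootstrap:
    dataSecretName: example-bootstrap-data
```

The API proxy sidecar would let controllers keep reading and writing the standard group while a resource of this shape is what is actually stored.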
@csrwng: all tests passed! Full PR test history. Your PR dashboard.
| d. **Workload Update and Resume**: | ||
| - Update CAPI-dependent workload deployments to include the API proxy sidecar | ||
| - For deployments managed by the Control Plane Operator, update the CPO deployment to signal it should add proxy sidecars |
Does this imply a backport to the minimum supported HC version, so autoscaler/machine-approver get their spec updated with the proxy?
| Historically, HyperShift management clusters did not use Cluster API (CAPI) for their own machine management, relying instead on the OpenShift Machine API. This allowed HyperShift to install and manage its own version of CAPI CRDs, effectively owning the CAPI types on the management cluster. | ||
| With OpenShift's evolution toward using CAPI for standalone cluster machine management, a critical conflict emerges: both the platform and HyperShift will need to install CAPI CRDs on the same management cluster. If these CRD versions are incompatible, neither the platform nor HyperShift can function correctly. | ||
Might be worth mentioning that MCE, which is the delivery mechanism for self-hosted HCP, also has a desire to handle its own cluster.x-k8s.io CRDs.
Also, standalone has a toggle to not clobber these CRDs.
| This enhancement introduces new CRDs that mirror the standard CAPI CRDs but use the `cluster.hypershift.openshift.io` API group: | ||
| - `Cluster.cluster.hypershift.openshift.io` |
Also MHC and the other CRDs that the controllers require to work.
| **dual operator architecture** consists of two HyperShift operator instances running simultaneously: one supporting private CAPI types (new) and one using standard CAPI types (legacy). | ||
| 1. The platform administrator upgrades their HyperShift operator to a version that supports the `--private-capi-types` flag and runs `hypershift install --private-capi-types`. |
Having this as a flag would require updating the delivery mechanisms for managed and self-hosted. Does it really need to be a flag? When would you opt out?
| | `hypershift.openshift.io/private-capi-types: "true"` | New Operator | Successfully migrated clusters | | ||
| | `hypershift.openshift.io/scope: "legacy"` | Legacy Operator | Existing clusters awaiting migration | | ||
| | `hypershift.openshift.io/migration-in-progress: "true"` | Migration Controller | Clusters actively being migrated (neither operator reconciles) | | ||
| | `hypershift.openshift.io/migration-failed` | Legacy Operator | Previous migration failed; requires SRE remediation before retry | |
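Read together, the table describes an annotation-driven handoff between the two operators; for example, a hosted cluster mid-migration might carry (hypothetical manifest fragment, annotation name as in the table):

```yaml
# HostedCluster during migration: neither operator reconciles it
# while this annotation is present.
apiVersion: hypershift.openshift.io/v1beta1
kind: HostedCluster
metadata:
  name: example
  namespace: clusters
  annotations:
    hypershift.openshift.io/migration-in-progress: "true"
```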
I think this whole process is more sensitive for self-hosted, in which case all of this would need to be documented and would impact users directly with additional burden.
Also, even though it is not recommended, there are users who might be consuming Machine CRs directly, especially in bare-metal scenarios, to e.g. annotate the next one for deletion on scale-down. It would be good to collect some feedback.
| * Isolate HyperShift's CAPI CRD dependencies from the platform's CAPI CRDs by using a distinct API group (`cluster.hypershift.openshift.io`). | ||
| * Enable HyperShift components to continue using standard CAPI client libraries without modification through a transparent API proxy. | ||
| * Automatically migrate existing HyperShift installations to use the private CAPI types without user intervention or hosted cluster downtime. | ||
| * Ensure zero user-facing impact - hosted cluster administrators and workloads should experience no behavioral changes. |
Should probably call out that there's a period where the ability to operate would be degraded, e.g. the ability to scale the data plane while the controllers are scaled down.
We don't expect anyone ever runs `oc get machines.cluster.x-k8s.io`?
| ### Alternative 1: Coordinate CAPI Versions Between Platform and HyperShift | ||
| Instead of isolating CAPI types, ensure that the platform and HyperShift always use compatible CAPI versions through tight coordination. |
Coupling from standalone management clusters would be solved by the flag they provide to not clobber the CRDs, leaving the only possible conflict with MCE. Each MCE version bundles a pinned version of HyperShift. What would prevent MCE and the HO from running with the same latest CAPI APIs release for each downstream cycle?
The mechanism we've designed was also intended to extend to MCE/others in the future so that different components could be configured to ignore CRD management while using our CompatibilityRequirement system to ensure the CRD manager doesn't install something that breaks them
The requirement from a hypershift side once standalone implements this is:
- Configure our config object to tell us which CRDs you are installing
- Don't try and remove an API group we still rely on
The latter creates a coupling between HyperShift and OpenShift, but our support contracts mean we have to support different API versions for some period already, and I fear those periods will extend over time to the point where this doesn't actually impact this conversation at all.
Adding on to what Alberto mentioned, MCE has already aligned its CAPI version with downstream CAPI starting in OCP v4.19. It's unclear why HyperShift cannot follow the same approach.
Regarding platform support, my understanding is that HyperShift, similar to MCE, targets an N-2 alignment with OCP. If that's the case, the API support lifetime should already be covered within the existing support matrix.
Overall, the proposal provides a solution that addresses the CRD conflict issue, but the operational cost of an apiGroup proxy conversion (e.g., webhook handling, RBAC extensions, etc.) seems significant compared to simply aligning HyperShift with the downstream CAPI version; the trade-off is questionable.
One additional point related to the Integration Test section (and potentially a risk):
The test scenarios should include running standard CAPI and CAPI providers (e.g., CAPA) side by side with HyperShift in the same management cluster. This is already an MCE use case, and we do not have a solution for it in MCE releases v2.10/v2.11.
MCE version bundles a pinned version of hypershift. What would prevent MCE and HO from running with the same latest capi APIs release for each downstream cycle?
This seems like the easy-fix (but maybe hard-to-maintain) approach. My understanding is that MCE's CAPI is aligned with OCP's integration in the latest release, but the HyperShift that is bundled in MCE is a version behind (for its CAPI), which is causing issues.
| Historically, HyperShift management clusters did not use Cluster API (CAPI) for their own machine management, relying instead on the OpenShift Machine API. This allowed HyperShift to install and manage its own version of CAPI CRDs, effectively owning the CAPI types on the management cluster. | ||
| With OpenShift's evolution toward using CAPI for standalone cluster machine management, a critical conflict emerges: both the platform and HyperShift will need to install CAPI CRDs on the same management cluster. If these CRD versions are incompatible, neither the platform nor HyperShift can function correctly. |
How often do we expect this to really happen? What does compatible really mean?
What is the version skew that HyperShift operator supports between itself and its management cluster?
My read of this situation is that, from a compatibility perspective, if the HyperShift operator is managing the CRD lifecycle, then the cluster CAPI system is compatible as long as HyperShift is still installing the same API version and no fields that we care about have been removed from the spec. Additional fields won't matter; validation that is tightened will be considered upstream and ratcheted. And we can run in-cluster validation that checks that resources in the openshift-cluster-api namespace validate against both the original schema for that cluster and the HyperShift schema.
The only major problem I see would be if the HyperShift operator wanted to install CRDs that did not have an API version that the cluster CAPI was relying on. I'm expecting at least some amount of carrying downstream as CAPI APIs evolve to support skip-level upgrades in the future; HyperShift may end up in that same predicament independent of this requirement.
| * As a HyperShift platform engineer, I want HyperShift to use isolated CAPI CRDs, so that I can ensure compatibility between platform and HyperShift CAPI versions without coordination overhead. | ||
| * As a management cluster administrator, I want to upgrade my standalone OpenShift cluster's CAPI implementation independently from HyperShift, so that I can adopt new platform features without risking HyperShift stability. |
The mechanism we are building presently (which I think we have to build regardless of HyperShift adopting it or not) would be usable bi-directionally for HyperShift and OpenShift to protect their own concerns.
Realistically, we expect the HyperShift operator is likely ahead of the management cluster at all times, no?
| ### Goals | ||
| * Isolate HyperShift's CAPI CRD dependencies from the platform's CAPI CRDs by using a distinct API group (`cluster.hypershift.openshift.io`). |
What are the implications of this from an end user perspective?
For SRE, does this mean they have to rework their tooling to look at the different API groups? Do console/UIs now need to support multiple different types of machine for the long term?
| * Supporting both standard and private CAPI types simultaneously in production deployments long-term. This is a one-way migration for the entire deployment. | ||
| * Backporting this functionality to HyperShift Operator releases prior to 4.22. | ||
| * Making the API proxy sidecar a general-purpose, reusable component for other use cases beyond CAPI type translation. |
How much knowledge about CAPI is baked into this proxy? Surely all it needs to do is translate the group of the request? It knows nothing about the actual data structures?
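If the answer is yes — the proxy only rewrites the group segment of request URLs and passes bodies through — its core could be a path rewrite along these lines (a hypothetical sketch, not code from the proposal; `rewriteAPIPath` is an illustrative helper):

```go
package main

import (
	"fmt"
	"strings"
)

// rewriteAPIPath translates a request path addressed to the standard
// CAPI group into the equivalent path under the private HyperShift
// group. It touches only the group segment of the URL and knows
// nothing about the CAPI data structures themselves.
func rewriteAPIPath(path string) string {
	const (
		standardGroup = "cluster.x-k8s.io"
		privateGroup  = "cluster.hypershift.openshift.io"
	)
	if strings.HasPrefix(path, "/apis/"+standardGroup+"/") {
		return "/apis/" + privateGroup + strings.TrimPrefix(path, "/apis/"+standardGroup)
	}
	return path // non-CAPI requests pass through untouched
}

func main() {
	fmt.Println(rewriteAPIPath("/apis/cluster.x-k8s.io/v1beta1/namespaces/clusters-foo/machines"))
	// → /apis/cluster.hypershift.openshift.io/v1beta1/namespaces/clusters-foo/machines
}
```

A real sidecar would presumably also rewrite the `apiVersion` field in request/response bodies and watch events, but no knowledge of the individual CAPI schemas would be required.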
| - Validation and defaulting webhooks follow the same pattern as standard CAPI CRDs | ||
| - Conversion webhooks require a shim layer to translate between private and standard CAPI groups during version conversion (see Conversion Webhook Shim section) | ||
| - The CRDs are owned and lifecycled by the HyperShift operator | ||
| - Deleting a HostedCluster will clean up all associated private CAPI resources through standard owner reference garbage collection |
The CRDs are owned by the HostedCluster? Does this imply that each cluster has its own copy of the CRD?
| - The cluster continues operating normally with standard CAPI types | ||
| - No manual intervention is required for failed migrations | ||
| - Alerts notify operators of migration failures for awareness and potential retry | ||
| - The migration controller will retry failed migrations on subsequent reconciliation loops |
What happens if it hits the same issue repeatedly?
| **Why not chosen**: | ||
| - CAPI is designed around cluster-scoped resources, and changing this would be a fundamental architecture change requiring upstream buy-in. | ||
| - Kubernetes does not support multiple versions of the same CRD with different scopes. |
I'm told that folks are keen to investigate this for the future (NamespacedCRD), but that's of no help to us right now.
| - Proxy failures result in CAPI operation failures but do not expose new attack vectors | ||
| - The proxy does not process or store sensitive data beyond what is necessary for API group translation | ||
| ## Alternatives (Not Implemented) |
You don't call out leveraging the existing mechanisms that the cluster infra team is building (designed explicitly to allow HyperShift to not worry about these issues) as an alternative, and therefore don't document why that plan is insufficient.