CNTRLPLANE-2640: Add HyperShift private CAPI types enhancement #1927

Open
csrwng wants to merge 1 commit into openshift:master from csrwng:capi-proxy

@csrwng (Contributor) commented Jan 21, 2026

This enhancement proposes isolating HyperShift's Cluster API (CAPI)
CRDs from those installed by the OpenShift platform on management
clusters. As OpenShift evolves toward using CAPI for standalone
cluster machine management, a conflict emerges: both the platform
and HyperShift need to install CAPI CRDs on the same management
cluster, potentially with incompatible versions.

The proposal introduces two major components:

  1. Private CAPI Types and API Proxy: HyperShift-specific CAPI CRDs
    using the cluster.hypershift.openshift.io group, with an API proxy
    sidecar to transparently translate between standard and private
    CAPI types.

  2. Automatic Migration: A migration controller that automatically
    converts existing hosted clusters from standard to private CAPI
    types without disrupting operations.

This enables independent version management for both HyperShift and
platform CAPI dependencies while maintaining transparent operation
for hosted cluster administrators and workloads.
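
As a rough illustration of the group translation described in component 1, the following is a minimal sketch and not the proposed implementation. It assumes the proxy only needs to swap the API group in request paths and in the top-level `apiVersion` of objects; the function names and the `v1beta1` version used in the usage note are hypothetical. A real proxy would also have to handle nested references such as `ownerReferences`, which this sketch ignores.

```python
STANDARD_GROUP = "cluster.x-k8s.io"
PRIVATE_GROUP = "cluster.hypershift.openshift.io"


def _swap_path_group(path: str, src: str, dst: str) -> str:
    """Rewrite /apis/<src>/... to /apis/<dst>/...; leave every other path alone."""
    prefix = f"/apis/{src}/"
    if path.startswith(prefix):
        return f"/apis/{dst}/" + path[len(prefix):]
    return path


def _swap_api_version(api_version: str, src: str, dst: str) -> str:
    """Swap the group part of '<group>/<version>'; core-group values like 'v1' pass through."""
    group, sep, version = api_version.partition("/")
    if sep and group == src:
        return f"{dst}/{version}"
    return api_version


def to_private_path(path: str) -> str:
    """Hypothetical request-side rewrite: standard group -> private group."""
    return _swap_path_group(path, STANDARD_GROUP, PRIVATE_GROUP)


def to_private_object(obj: dict) -> dict:
    """Hypothetical request-body rewrite (top-level apiVersion only)."""
    out = dict(obj)
    if "apiVersion" in out:
        out["apiVersion"] = _swap_api_version(out["apiVersion"], STANDARD_GROUP, PRIVATE_GROUP)
    return out


def to_standard_object(obj: dict) -> dict:
    """Hypothetical response-body rewrite: private group -> standard group."""
    out = dict(obj)
    if "apiVersion" in out:
        out["apiVersion"] = _swap_api_version(out["apiVersion"], PRIVATE_GROUP, STANDARD_GROUP)
    return out
```

Under these assumptions, a client request for `/apis/cluster.x-k8s.io/v1beta1/namespaces/ns/machines` would be forwarded to the same path under `cluster.hypershift.openshift.io`, and the response object would be rewritten back before the client sees it. Paths for other groups are left untouched because the match is on the exact group segment.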

openshift-ci bot added the do-not-merge/work-in-progress label Jan 21, 2026

openshift-ci bot (Contributor) commented Jan 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot (Contributor) commented Jan 21, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

@bryan-cox (Member)

We should call out all the CAPI platforms we support: CAPA, CAPZ, CAPV, CAP-Agent, CAPG, etc.

csrwng changed the title from "WIP: proposal for isolating CAPI types in hypershift" to "OCPSTRAT-2789: Add HyperShift private CAPI types enhancement" Jan 26, 2026
openshift-ci-robot added the jira/valid-reference label Jan 26, 2026

openshift-ci-robot commented Jan 26, 2026

@csrwng: This pull request references OCPSTRAT-2789 which is a valid jira issue.

csrwng changed the title from "OCPSTRAT-2789: Add HyperShift private CAPI types enhancement" to "CNTRLPLANE-2640: Add HyperShift private CAPI types enhancement" Jan 26, 2026

openshift-ci-robot commented Jan 26, 2026

@csrwng: This pull request references CNTRLPLANE-2640 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.22.0" version, but no target version was set.

csrwng marked this pull request as ready for review January 26, 2026 20:54
openshift-ci bot removed the do-not-merge/work-in-progress label Jan 26, 2026
openshift-ci bot requested review from enxebre and sjenning January 26, 2026 20:55

openshift-ci bot (Contributor) commented Jan 26, 2026

@csrwng: all tests passed!


d. **Workload Update and Resume**:
- Update CAPI-dependent workload deployments to include the API proxy sidecar
- For deployments managed by the Control Plane Operator, update the CPO deployment to signal it should add proxy sidecars

Member

Does this imply a backport to the minimum supported HC version, so that the autoscaler and machine-approver get their specs updated with the proxy?

Member

I saw that's clarified below

Historically, HyperShift management clusters did not use Cluster API (CAPI) for their own machine management, relying instead on the OpenShift Machine API. This allowed HyperShift to install and manage its own version of CAPI CRDs, effectively owning the CAPI types on the management cluster.

With OpenShift's evolution toward using CAPI for standalone cluster machine management, a critical conflict emerges: both the platform and HyperShift will need to install CAPI CRDs on the same management cluster. If these CRD versions are incompatible, neither the platform nor HyperShift can function correctly.

Member

It might be worth mentioning that MCE, which is the delivery mechanism for self-hosted HCP, also wants to manage its own cluster.x-k8s.io CRDs.

Member

Also, standalone has a toggle to not clobber these CRDs.


This enhancement introduces new CRDs that mirror the standard CAPI CRDs but use the `cluster.hypershift.openshift.io` API group:

- `Cluster.cluster.hypershift.openshift.io`

Member

Also MachineHealthCheck (MHC) and any other CRDs the controllers require to work.


**dual operator architecture** consists of two HyperShift operator instances running simultaneously: one supporting private CAPI types (new) and one using standard CAPI types (legacy).

1. The platform administrator upgrades their HyperShift operator to a version that supports the `--private-capi-types` flag and runs `hypershift install --private-capi-types`.

Member

Having this as a flag would require updating the delivery mechanisms for both managed and self-hosted. Does it really need to be a flag? When would you opt out?

| `hypershift.openshift.io/private-capi-types: "true"` | New Operator | Successfully migrated clusters |
| `hypershift.openshift.io/scope: "legacy"` | Legacy Operator | Existing clusters awaiting migration |
| `hypershift.openshift.io/migration-in-progress: "true"` | Migration Controller | Clusters actively being migrated (neither operator reconciles) |
| `hypershift.openshift.io/migration-failed` | Legacy Operator | Previous migration failed; requires SRE remediation before retry |
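
To make the ownership split in the table rows above concrete, here is a minimal sketch of how a reconciler might decide which component acts on a cluster. It is an illustration only: the function name, the returned strings, the precedence between markers, and the handling of clusters with no marker are all assumptions that the quoted table does not specify.

```python
from typing import Mapping, Optional

PREFIX = "hypershift.openshift.io/"


def owner_for(markers: Mapping[str, str]) -> Optional[str]:
    """Return the component assumed to own a cluster, given its markers.

    Precedence (in-progress, then failed, then migrated, then legacy) is an
    assumption made for this sketch, not something stated in the proposal.
    """
    if markers.get(PREFIX + "migration-in-progress") == "true":
        # Neither operator reconciles while a migration is running.
        return "migration-controller"
    if PREFIX + "migration-failed" in markers:
        # A failed migration stays with the legacy operator until remediated.
        return "legacy-operator"
    if markers.get(PREFIX + "private-capi-types") == "true":
        return "new-operator"
    if markers.get(PREFIX + "scope") == "legacy":
        return "legacy-operator"
    # Unmarked clusters: behavior is not defined in the quoted table.
    return None
```

Each operator instance would then skip any cluster for which `owner_for` returns a different component, which is what keeps the two operators from reconciling the same cluster at once.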

Member

I think this whole process is more sensitive for self-hosted deployments, where all of it would need to be documented and would put an additional burden directly on users.

Member

Also, even though it is not recommended, there are users who might be consuming Machine CRs directly, especially in bare-metal scenarios, e.g. to annotate the next Machine for deletion on scale-down. It would be good to collect some feedback.

* Isolate HyperShift's CAPI CRD dependencies from the platform's CAPI CRDs by using a distinct API group (`cluster.hypershift.openshift.io`).
* Enable HyperShift components to continue using standard CAPI client libraries without modification through a transparent API proxy.
* Automatically migrate existing HyperShift installations to use the private CAPI types without user intervention or hosted cluster downtime.
* Ensure zero user-facing impact - hosted cluster administrators and workloads should experience no behavioral changes.

Member

We should probably call out that there is a period during which the ability to operate is degraded, e.g. the data plane cannot be scaled while the controllers are scaled down.

Contributor

Are we assuming that nobody ever runs `oc get machines.cluster.x-k8s.io`?


### Alternative 1: Coordinate CAPI Versions Between Platform and HyperShift

Instead of isolating CAPI types, ensure that the platform and HyperShift always use compatible CAPI versions through tight coordination.

Member

Coupling with standalone management clusters would be solved by the flag they provide to not clobber the CRDs, leaving MCE as the only possible conflict. Each MCE version bundles a pinned version of HyperShift. What would prevent MCE and the HyperShift Operator (HO) from running with the same latest CAPI API release in each downstream cycle?

Contributor

The mechanism we've designed was also intended to extend to MCE/others in the future so that different components could be configured to ignore CRD management while using our CompatibilityRequirement system to ensure the CRD manager doesn't install something that breaks them

The requirement from a hypershift side once standalone implements this is:

  • Configure our config object to tell us which CRDs you are installing
  • Don't try and remove an API group we still rely on

The latter creates a coupling between HyperShift and OpenShift, but our support contracts mean we already have to support different API versions for some period, and I fear those periods will extend over time to the point where this doesn't actually impact this conversation at all.


Adding on what Alberto mentioned, MCE has already aligned its CAPI version with downstream CAPI starting in OCP v4.19. It’s unclear why HyperShift cannot follow the same approach.

Regarding platform support, my understanding is that HyperShift, similar to MCE, targets an N-2 alignment with OCP. If that's the case, the API support lifetime should already be covered within the existing support matrix.

Overall, the proposal provides a solution that addresses the CRD conflict issue, but the operational cost of an apiGroup proxy conversion (e.g., webhook handling, RBAC extensions, etc.) seems significant compared to simply aligning HyperShift with the downstream CAPI version, so the trade-off is questionable.

One additional point related to the Integration Test section (and potentially a risk):
The test scenarios should include running standard CAPI and CAPI providers (e.g., CAPA) side-by-side with HyperShift in the same management cluster. This is already an MCE use case for which we have no solution in MCE releases v2.10/v2.11.


> MCE version bundles a pinned version of hypershift. What would prevent MCE and HO from running with the same latest capi APIs release for each downstream cycle?

This seems like the easy fix (though maybe hard to maintain). My understanding is that MCE's CAPI is aligned with OCP's integration in the latest release, but the HyperShift bundled in MCE is a version behind (for its CAPI), which is causing issues.


Historically, HyperShift management clusters did not use Cluster API (CAPI) for their own machine management, relying instead on the OpenShift Machine API. This allowed HyperShift to install and manage its own version of CAPI CRDs, effectively owning the CAPI types on the management cluster.

With OpenShift's evolution toward using CAPI for standalone cluster machine management, a critical conflict emerges: both the platform and HyperShift will need to install CAPI CRDs on the same management cluster. If these CRD versions are incompatible, neither the platform nor HyperShift can function correctly.

Contributor

How often do we expect this to really happen? What does compatible really mean?

What is the version skew that HyperShift operator supports between itself and its management cluster?

My read of this situation is that, from a compatibility perspective, if the HyperShift operator is managing the CRD lifecycle, then the cluster CAPI system is compatible as long as HyperShift is still installing the same API version and no fields that we care about have been removed from the spec. Additional fields won't matter, and validation that is tightened will be considered upstream and ratcheted. We can also run in-cluster validation that checks that resources in the openshift-cluster-api namespace validate against both the original schema for that cluster and the one HyperShift installs.

The only major problem I see would be if the HyperShift operator wanted to install CRDs that did not have an API version that the cluster CAPI was relying on. I'm expecting at least some amount of downstream carrying as CAPI APIs evolve to support skip-level upgrades in the future, so HyperShift may end up in that same predicament independent of this requirement.
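
The notion of "compatible" sketched in this comment can be written down as a small check. This is only one reading of the comment, under the assumption that a CRD can be summarized as a map from served API version to the set of spec field paths it defines; the function name and data shape are hypothetical and are not part of any existing compatibility mechanism.

```python
from typing import Dict, Set


def is_compatible(
    installed: Dict[str, Set[str]],
    required_version: str,
    required_fields: Set[str],
) -> bool:
    """Check an installed CRD against what a consumer relies on.

    installed maps each served API version to the spec field paths it defines.
    The CRD is treated as compatible when the required version is still served
    and none of the required fields have been removed. Extra fields are ignored.
    """
    fields = installed.get(required_version)
    if fields is None:
        # The API version the consumer relies on is no longer installed.
        return False
    # Subset test: every field the consumer cares about must still exist.
    return required_fields <= fields
```

For example, a consumer that relies on `spec.replicas` in one version remains compatible when the installed CRD adds new fields or new versions, and becomes incompatible only if that version or that field disappears.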


* As a HyperShift platform engineer, I want HyperShift to use isolated CAPI CRDs, so that I can ensure compatibility between platform and HyperShift CAPI versions without coordination overhead.

* As a management cluster administrator, I want to upgrade my standalone OpenShift cluster's CAPI implementation independently from HyperShift, so that I can adopt new platform features without risking HyperShift stability.

Contributor

The mechanism we are building presently (which I think we have to build regardless of HyperShift adopting it or not) would be usable bi-directionally for HyperShift and OpenShift to protect their own concerns.

Realistically, we expect the HyperShift operator to be ahead of the management cluster at all times, no?


### Goals

* Isolate HyperShift's CAPI CRD dependencies from the platform's CAPI CRDs by using a distinct API group (`cluster.hypershift.openshift.io`).

Contributor

What are the implications of this from an end user perspective?

For SRE, does this mean they have to rework their tooling to look at the different API groups? Do console/UIs now need to support multiple different types of machine for the long term?


* Supporting both standard and private CAPI types simultaneously in production deployments long-term. This is a one-way migration for the entire deployment.
* Backporting this functionality to HyperShift Operator releases prior to 4.22.
* Making the API proxy sidecar a general-purpose, reusable component for other use cases beyond CAPI type translation.

Contributor

How much knowledge about CAPI is baked into this proxy? Surely all it needs to do is translate the group of the request? It knows nothing about the actual data structures?

- Validation and defaulting webhooks follow the same pattern as standard CAPI CRDs
- Conversion webhooks require a shim layer to translate between private and standard CAPI groups during version conversion (see Conversion Webhook Shim section)
- The CRDs are owned and lifecycled by the HyperShift operator
- Deleting a HostedCluster will clean up all associated private CAPI resources through standard owner reference garbage collection

Contributor

The CRDs are owned by the HostedCluster? Does this imply that each cluster has its own copy of the CRD?

- The cluster continues operating normally with standard CAPI types
- No manual intervention is required for failed migrations
- Alerts notify operators of migration failures for awareness and potential retry
- The migration controller will retry failed migrations on subsequent reconciliation loops

Contributor

What happens if it hits the same issue repeatedly?



**Why not chosen**:
- CAPI is designed around cluster-scoped resources, and changing this would be a fundamental architecture change requiring upstream buy-in.
- Kubernetes does not support multiple versions of the same CRD with different scopes.

Contributor

I'm told that folks are keen to investigate this for the future (NamespacedCRD), but that's of no help to us right now.

- Proxy failures result in CAPI operation failures but do not expose new attack vectors
- The proxy does not process or store sensitive data beyond what is necessary for API group translation

## Alternatives (Not Implemented)

Contributor

You don't call out leveraging the existing mechanisms that the cluster infra team are building (designed explicitly to allow HyperShift to not worry about these issues) as an alternative, and therefore don't document why the plan is insufficient


7 participants