discontinue support of OCI hook #92
enp0s3 wants to merge 1 commit into openshift-virtualization:main from
Conversation
Maintaining the OCI hook has become expensive: the hook is tightly coupled to the specific OCI runtime chosen for the node. Moreover, the hook cannot be adjusted on a per-pod basis when the pod uses a custom runtime class.

In addition, the OCI hook is detached from the wasp-agent lifecycle, so extra effort is needed to clean up the hook when the wasp-agent is unresponsive or when it is deleted from the cluster.

If we consider removing the hook, we should compare two scenarios, with and without the hook. The difference is as follows:
(1) With the hook, the transition is unlimited -> limited swap usage.
(2) Without the hook, the transition is zero -> limited swap usage.

Setting the limited swap is done by the limited-swap controller, which runs inside the wasp-agent daemonset. The time it takes to set the limited swap depends on the API latency (this design can itself be improved).

By switching from (1) to (2) we do not introduce a regression from the workload-stability perspective, because in both scenarios a workload that exceeds its allowed limited swap will be OOM-killed. From the node-stability perspective, switching to (2) is even safer: scenario (2) puts at risk only the container itself, which could be OOM-killed in the worst case, while in scenario (1) unlimited swap consumption can put the whole node at risk.

Regarding API latency, the following steps can be taken:
(*) We do not actually need the API server; we can work directly with the kubelet server.
(**) We can utilize NRI, and thus opt in to limited swap from inside the CRI lifecycle.

Signed-off-by: Igor Bezukh <ibezukh@redhat.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/cc @fabiand @Barakmor1
I guess this is worth pushing forward. WDYT?
PR needs rebase.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@fabiand Hi, I would like to try to replace the hook with an NRI plugin. Hopefully I can do a PoC in the next month.