
Conversation

@shajmakh
Contributor

Add basic e2e tests that check the default behavior of a performance profile with `ExecCPUAffinity: first` enabled by default.

@openshift-ci openshift-ci bot requested review from MarSik and swatisehgal November 13, 2025 13:46
@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shajmakh
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shajmakh
Contributor Author

Depends on #1426

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 3 times, most recently from 6ef4f1a to beeea3d on November 18, 2025 10:34
@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 4 times, most recently from 565820a to 0af067c on January 20, 2026 11:14
@shajmakh
Contributor Author

Regarding ci/prow/e2e-gcp-pao-updating-profile: the newly added test in this PR is failing because the exec process was pinned to the first CPU of the set on every one of the 20 retries, even though the execCPUAffinity feature is disabled.
The test was run locally several times and passed. Looking deeper into the must-gather, the PP cpu config is as follows:
cpu:
  isolated: 1-3
  reserved: "0"
while the test logs show that the exclusive CPUs assigned to the running (guaranteed) container were:
first exclusive CPU: 1, all exclusive CPUs: 1,4
which means CPU 4 is likely offline, leaving only CPU 1 for the process to be pinned to.
Looking at the node's allocatable CPUs:
allocatable:
  cpu: "5"
the PP did not distribute the rest of the non-reserved CPUs, which caused misalignment when scheduling a workload.
Investigation is ongoing to solve this.
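For illustration only (not part of the PR): a minimal sketch, assuming the `k8s.io/utils/cpuset` helper package, that makes the coverage gap between the profile's cpu section and the node's allocatable CPUs explicit.

```go
package main

import (
	"fmt"

	"k8s.io/utils/cpuset"
)

func main() {
	// Values taken from the must-gather and node status quoted above.
	// Parse errors are ignored for brevity in this sketch.
	reserved, _ := cpuset.Parse("0")
	isolated, _ := cpuset.Parse("1-3")
	const allocatable = 5 // allocatable: cpu: "5"

	// CPUs covered by the performance profile cpu section.
	covered := reserved.Union(isolated)
	fmt.Printf("profile covers %d of %d allocatable CPUs: %s\n",
		covered.Size(), allocatable, covered.String())
	// Prints: profile covers 4 of 5 allocatable CPUs: 0-3
	// CPU 4 sits outside the profile, so an exclusive allocation such as
	// "1,4" lands partly on a CPU the performance profile never configured.
}
```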

/test e2e-gcp-pao-workloadhints

shajmakh added a commit to shajmakh/release that referenced this pull request Jan 22, 2026
The GCP cluster profile uses the ipi-gcp flow, which by default provisions compute machines with 6 vCPUs (see `step-registry/ipi/conf/gcp/ipi-conf-ref.yaml`). The performance profile suite configures a profile with `reserved: "0"` and `isolated: "1-3"` (see openshift/cluster-node-tuning-operator#909), unless environment variables specify otherwise.
Including all of the node's CPUs in the PP cpu section is good practice in general, but the reason we need it now is that some new tests require most of the CPUs to be distributed via the PP (see openshift/cluster-node-tuning-operator#1432 (comment)).

In this commit we update only the affected job on which the test runs; later we will need to add this setting to all other jobs that consume the ipi-gcp cluster configuration.
Note: this is subject to change should the CPU specifications on GCP be modified.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
@shajmakh
Contributor Author

/retest

shajmakh added a commit to shajmakh/release that referenced this pull request Jan 22, 2026
@shajmakh
Contributor Author

When the test failing due to the misalignment between the node topology and the PP cpu section was temporarily removed,
the ci/prow/e2e-gcp-pao-updating-profile lane passed. A fix for the infra issue is proposed here: openshift/release#73835

@shajmakh
Contributor Author

/hold
waiting for the prow change to be merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2026
openshift-merge-bot bot pushed a commit to openshift/release that referenced this pull request Jan 23, 2026
@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 2 times, most recently from 0af067c to 41afeca on January 26, 2026 06:56
@shajmakh
Contributor Author

/test e2e-aws-ovn
/test e2e-aws-operator

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 2 times, most recently from acdb51a to a8b7158 on January 27, 2026 20:50
Add main e2e tests that check the behavior of a
performance profile with `ExecCPUAffinity: first` and without it
(legacy).

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add unit tests for the functions in the tests' resources helper package.

Assisted-by: Cursor v1.2.2
AI-Attribution: AIA Entirely AI, Human-initiated, Reviewed, Cursor v1.2.2 v1.0

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
@shajmakh
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Jan 28, 2026

@shajmakh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-hypershift-pao | a8b7158 | link | true | /test e2e-hypershift-pao
ci/prow/e2e-gcp-pao-workloadhints | a8b7158 | link | true | /test e2e-gcp-pao-workloadhints
ci/prow/e2e-gcp-pao-updating-profile | a8b7158 | link | true | /test e2e-gcp-pao-updating-profile

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@shajmakh
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
Contributor

@SargunNarula SargunNarula left a comment


Thanks for the tests. IMO some tests are redundant and can be removed.

}

var err error
testPod := pods.MakePodWithResources(ctx, workerRTNode, qos, containersResources)

Can we use the MakePod util function instead, to keep the pattern the same in this suite?
func: link; it can also be configured with resources: link
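For illustration only, a rough sketch of the suggested pattern; the `pods.MakePod` name and signature here are assumptions made for the example, not the actual helper linked above.

```go
// Hypothetical: reuse the suite's existing pod factory (signature assumed) and
// then set per-container resources, instead of a dedicated MakePodWithResources.
testPod := pods.MakePod(workerRTNode.Name) // assumed to return a *corev1.Pod targeting the node
for i, res := range containersResources {
	testPod.Spec.Containers[i].Resources = corev1.ResourceRequirements{
		Requests: res,
		Limits:   res, // equal requests/limits for the guaranteed-QoS entries
	}
}
```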

Expect(err).ToNot(HaveOccurred())
if isExclusiveCPURequest {
testlog.Infof("exec process CPU: %d, first shared CPU: %d", execProcessCPUInt, firstCPU)
Expect(execProcessCPUs).To(Equal(firstCPU), "Exec process CPU is not the first shared CPU; retry %d", i)

Mismatched comparison: execProcessCPUInt should be used instead of execProcessCPUs.
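With that fix, the assertion from the snippet above would read:

```go
// Compare the parsed integer CPU id rather than execProcessCPUs.
Expect(execProcessCPUInt).To(Equal(firstCPU), "Exec process CPU is not the first shared CPU; retry %d", i)
```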

sharedCpusResource: resource.MustParse("1"),
},
}),
Entry("guaranteed pod with multiple containers with shared CPU request",

It tests the same condition as a guaranteed pod with a single container requesting shared CPUs.

Although the CPU manager takes a different allocation path by splitting CPUs across containers, the exec-cpu-affinity feature behaves identically on a per-container basis, which is what we are validating here.

corev1.ResourceMemory: resource.MustParse("100Mi"),
},
}),
Entry("guaranteed pod with fractional CPU requests",

Same check as in: guaranteed pod with a single container with shared CPU request and fractional CPU requests.

sharedCpusResource: resource.MustParse("1"),
},
}),
Entry("best-effort pod with shared CPU request",

I think it would be better to keep the best-effort scenario under cpu_management only, since it does not depend on shared CPUs.

//cnt1 resources
{},
}),
Entry("burstable pod with shared CPU request",

Same case with burstable

// should we expect cpu affinity to the first CPU
isExclusiveCPURequest := false
if qos == corev1.PodQOSGuaranteed {
cpuRequestFloat := container.Resources.Requests.Name(corev1.ResourceCPU, resource.DecimalSI).AsFloat64Slow()

Perhaps we can use MilliValue() instead:

milliCPU := container.Resources.Requests.Cpu().MilliValue()
isExclusiveCPURequest = (milliCPU % 1000) == 0
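Folding that into the surrounding snippet, a sketch of how the check could read (assuming `container` is the corev1.Container under test, as in the hunk above):

```go
// Should we expect the exec process to be pinned to the container's first CPU?
// Only a guaranteed container requesting whole CPUs gets exclusive CPUs.
isExclusiveCPURequest := false
if qos == corev1.PodQOSGuaranteed {
	milliCPU := container.Resources.Requests.Cpu().MilliValue()
	isExclusiveCPURequest = milliCPU%1000 == 0
}
```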

corev1.ResourceMemory: resource.MustParse("200Mi"),
},
}),
Entry("guaranteed pod with two containers with exclusive CPUs,exec process should be binned to first CPU",

Same check as in: guaranteed pod with a single container with exclusive CPUs, exec process should be pinned to first CPU.

corev1.ResourceMemory: resource.MustParse("200Mi"),
},
}),
Entry("guaranteed pod with two container with fraction CPU request, exec process can be binned to any CPU for containerrewuesting fractional CPUs, and to the first CPU for container requesting exclusive CPUs",

Duplicate of this test: guaranteed pod with multiple containers with fractional CPU request, exec process can be pinned to any CPU for the container requesting fractional CPUs, and to the first CPU for the container requesting exclusive CPUs.

}
}

func MakePodWithResources(ctx context.Context, workerRTNode *corev1.Node, qos corev1.PodQOSClass, containersResources []corev1.ResourceList) *corev1.Pod {

unused argument - ctx context.Context
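If the context really is not needed, one option is simply dropping the parameter (a sketch; callers would be updated accordingly):

```go
func MakePodWithResources(workerRTNode *corev1.Node, qos corev1.PodQOSClass, containersResources []corev1.ResourceList) *corev1.Pod {
	// ... body unchanged ...
}
```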
