Conversation

@SargunNarula (Contributor) commented Jan 15, 2026

Added e2e tests for IRQ load balancing with housekeeping pods:

  • [86346] Verify housekeeping works correctly with single hyperthread allocation
  • [86348] Verify irqbalance does not overwrite on TuneD restart
    (housekeeping annotation)
  • [86347] Verify housekeeping selects single CPU when SMT is disabled
  • Added baseload calculation functions for determining available pod capacity on nodes (a rough sketch of the idea follows this list)
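For context, a minimal sketch of what such a baseload helper can look like. The package layout and names here are illustrative assumptions, not necessarily the helpers actually added by this PR:

// Illustrative baseload helper: sum the CPU and memory requests of the pods
// already running on a node so a test can tell how much capacity is left for
// its own pod. Names here are hypothetical.
package baseload

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Load holds the aggregate resource requests of the pods on a node.
type Load struct {
	Resources corev1.ResourceList
}

// ForPods sums the container CPU and memory requests of the given pods.
func ForPods(pods []corev1.Pod) Load {
	cpu := resource.NewMilliQuantity(0, resource.DecimalSI)
	mem := resource.NewQuantity(0, resource.BinarySI)
	for i := range pods {
		for _, ctr := range pods[i].Spec.Containers {
			if c, ok := ctr.Resources.Requests[corev1.ResourceCPU]; ok {
				cpu.Add(c)
			}
			if m, ok := ctr.Resources.Requests[corev1.ResourceMemory]; ok {
				mem.Add(m)
			}
		}
	}
	return Load{Resources: corev1.ResourceList{
		corev1.ResourceCPU:    *cpu,
		corev1.ResourceMemory: *mem,
	}}
}

// CPURequestedCores returns the total requested CPU rounded up to whole cores.
func (l Load) CPURequestedCores() int {
	millis := l.Resources.Cpu().MilliValue()
	return int((millis + 999) / 1000)
}

A test can then compare its own CPU request against the node's allocatable CPUs minus this baseload and skip itself if not enough capacity remains.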

openshift-ci bot requested review from jmencak and yanirq on January 15, 2026 13:25
@SargunNarula (Contributor Author)

/retest

@yanirq (Contributor) commented Jan 25, 2026

/retest

Expect(smtActive).To(Equal("0"), "SMT should be disabled (smt/active should be 0)")

cpuRequest := 2
if cpuRequest > newIsolatedCPUs.Size() {
Contributor

This will most likely fail to schedule if you have exactly 2 isolated CPUs, since there will be some burstable pods as well. You need to check the currently available resources, or "assume" some burstable load and compare against newIsolatedCPUs - 1.

Contributor

+1


Contributor Author

@shajmakh @MarSik Thanks for pointing this out; please have a look at the baseload helper implementation introduced with the latest commit.

Comment on lines 611 to 614
By("Restoring original CPU configuration")
currentProfile, err = profiles.GetByNodeLabels(testutils.NodeSelectorLabels)
Expect(err).ToNot(HaveOccurred())
currentReserved := string(*currentProfile.Spec.CPU.Reserved)
Contributor

Why is the PP updated twice?

Contributor Author

While restoring, we update the profile twice to handle the CPU topology transition safely.

We cannot update the PP in a single phase because removing nosmt brings the SMT siblings back online only after a reboot, while kubelet immediately validates reservedSystemCPUs/isolatedCPUs against the currently online CPU set during CPU Manager initialization. A single update would apply cpusets that reference still-offline SMT siblings, causing kubelet config validation to fail.

See kubelet CPU manager validation:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/cpumanager/policy_static.go#L251-L270
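To make the ordering concrete, here is a rough sketch of the two-phase restore. This is not the actual test code: applyAndWaitForMCPUpdate is a placeholder for the real profile-update/MCP-wait utilities, and the import path is an assumption.

package restore

import (
	performancev2 "github.com/openshift/cluster-node-tuning-operator/pkg/apis/performanceprofile/v2"
)

// applyAndWaitForMCPUpdate stands in for the test utilities that push the
// updated PerformanceProfile and wait for the MachineConfigPool rollout
// (including the node reboot) to finish. Body omitted.
func applyAndWaitForMCPUpdate(profile *performancev2.PerformanceProfile) {}

// restoreProfile restores the original cpusets in two phases so that kubelet
// never sees a cpuset referencing offline SMT siblings.
func restoreProfile(profile *performancev2.PerformanceProfile, originalReserved, originalIsolated performancev2.CPUSet) {
	// Phase 1: drop "nosmt" but keep the SMT-off cpusets, which only
	// reference CPUs that are currently online. The node reboots and the
	// SMT siblings come back online.
	kargs := make([]string, 0, len(profile.Spec.AdditionalKernelArgs))
	for _, arg := range profile.Spec.AdditionalKernelArgs {
		if arg != "nosmt" {
			kargs = append(kargs, arg)
		}
	}
	profile.Spec.AdditionalKernelArgs = kargs
	applyAndWaitForMCPUpdate(profile)

	// Phase 2: with the full topology online again, the original cpusets
	// pass kubelet's CPU manager validation, so restore them now.
	profile.Spec.CPU.Reserved = &originalReserved
	profile.Spec.CPU.Isolated = &originalIsolated
	applyAndWaitForMCPUpdate(profile)
}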

Contributor

Interesting, I'd expect kubelet to reconcile gracefully

// CPURequestedCores returns the total CPU requested in whole cores (rounded up)
func (l Load) CPURequestedCores() int {
	millis := l.Resources.Cpu().MilliValue()
	return int((millis + 999) / 1000)
}
Contributor

Perhaps use a roundUp:
retCpu := *resource.NewQuantity(roundUp(cpu.Value(), 2), resource.DecimalSI)

@SargunNarula (Contributor Author) commented Jan 30, 2026

Thanks for the suggestion! I considered the roundUp approach used in NROP, but after our offline discussion I believe the ceiling-division approach is more appropriate here.

The two have different use cases:

  • In NROP, rounding up to even numbers is used to calculate the remaining resources for node padding, which needs to be SMT-aligned.

  • Here, the baseload calculation is only used to verify that the node has sufficient capacity for the test pod, so we need an accurate representation of the actual load, not an SMT-aligned value (see the sketch below).
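A small illustration of the difference, with made-up numbers (roundUpToEven stands in for NROP's roundUp; both helper names here are illustrative):

package main

import "fmt"

// ceilCores rounds a millicore value up to whole cores (the baseload behaviour above).
func ceilCores(millis int64) int64 {
	return (millis + 999) / 1000
}

// roundUpToEven rounds a core count up to the next multiple of 2, i.e. the
// SMT-aligned rounding used for node padding.
func roundUpToEven(cores int64) int64 {
	return (cores + 1) / 2 * 2
}

func main() {
	millis := int64(2300) // a measured baseload of 2.3 CPUs
	fmt.Println(ceilCores(millis))                // 3: the actual load, rounded up
	fmt.Println(roundUpToEven(ceilCores(millis))) // 4: SMT-aligned, over-reports by a core
}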

@SargunNarula force-pushed the irq_housekeeping_2 branch 3 times, most recently from f036f2c to 88dc045, on January 30, 2026 08:59
@SargunNarula (Contributor Author)

/retest

Added e2e tests for IRQ load balancing with housekeeping pods:
- [86346] Verify housekeeping works correctly with single hyperthread allocation
- [86348] Verify irqbalance does not overwrite on TuneD restart
  (housekeeping annotation)
- [86347] Verify housekeeping selects single CPU when SMT is disabled

Added baseload calculation functions for determining available pod capacity on nodes

Signed-off-by: Sargun Narula <snarula@redhat.com>
@shajmakh (Contributor)

/approve
@SargunNarula can you please confirm that these tests have been run enough times and are stable?

openshift-ci bot commented Jan 30, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: SargunNarula, shajmakh
Once this PR has been reviewed and has the lgtm label, please assign jmencak for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented Jan 30, 2026

@SargunNarula: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                   Commit   Details  Required  Rerun command
ci/prow/e2e-gcp-pao         fb10b58  link     true      /test e2e-gcp-pao
ci/prow/e2e-hypershift-pao  fb10b58  link     true      /test e2e-hypershift-pao

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
