Description
How to categorize this issue?
/area control-plane
/area robustness
/kind bug
/priority 3
What happened:
MCM has a safety mechanism that prevents too many Unknown machines from being marked as Failed at the same time. It is implemented in the canMarkMachineFailed function.
However, this rate limit hardcodes a label selector that expects the key "name" to match the MachineDeployment name:
var (
	list = []string{machineDeployName}
	selector = labels.NewSelector()
	req, _ = labels.NewRequirement("name", selection.Equals, list)
)
(Reference: pkg/util/provider/machinecontroller/machine_util.go#L2138-L2142)
We use MCM in a few "non-gardener" clusters and therefore create the MachineDeployment objects on our own.
If a MachineDeployment omits this "name" label in spec.template.metadata.labels, the selector matches zero machines. The inProgress counter then stays at 0, so the maxReplacements check is bypassed entirely (0 < maxReplacements holds for any positive limit). Consequently, if nodes stay Unknown for more than 10 minutes, MCM marks all of them as Failed at once, ignoring the safety limit.
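To illustrate the bypass, here is a minimal, stdlib-only sketch. It does not reproduce the real MCM code: the machine type, countInProgress, and canMarkFailed are hypothetical stand-ins, and treating only the Failed phase as "in progress" is a simplifying assumption. It only shows why a label-based lookup that matches nothing lets every machine through.

```go
package main

import "fmt"

// machine is a simplified, hypothetical stand-in for the MCM Machine object.
type machine struct {
	name   string
	labels map[string]string
	phase  string
}

// countInProgress mimics the label-based lookup: only machines whose "name"
// label equals the MachineDeployment name are considered at all.
// Assumption: a machine counts as "in progress" when its phase is Failed.
func countInProgress(machines []machine, deployName string) int {
	n := 0
	for _, m := range machines {
		if m.labels["name"] != deployName {
			continue // not selected, so never counted
		}
		if m.phase == "Failed" {
			n++
		}
	}
	return n
}

// canMarkFailed is the simplified rate-limit check described above.
func canMarkFailed(machines []machine, deployName string, maxReplacements int) bool {
	return countInProgress(machines, deployName) < maxReplacements
}

func main() {
	// Two machines are already Failed, and the limit is 1.
	labeled := []machine{
		{name: "m1", labels: map[string]string{"name": "workers"}, phase: "Failed"},
		{name: "m2", labels: map[string]string{"name": "workers"}, phase: "Failed"},
	}
	unlabeled := []machine{
		{name: "m1", labels: map[string]string{"app": "workers"}, phase: "Failed"},
		{name: "m2", labels: map[string]string{"app": "workers"}, phase: "Failed"},
	}

	fmt.Println(canMarkFailed(labeled, "workers", 1))   // false: limit enforced
	fmt.Println(canMarkFailed(unlabeled, "workers", 1)) // true: selector matched nothing
}
```

With the "name" label present the check blocks further replacements; without it the counter never leaves 0 and the limit has no effect.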
What you expected to happen:
The function should accurately identify machines regardless of the "name" label. Better approaches include:
- Looking up machines via OwnerReferences (Machine -> MachineSet -> MachineDeployment).
- Alternatively: Enforce the "name" label via validation webhook upon creation if it is strictly required for internal logic.
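The OwnerReferences approach could look roughly like the sketch below. It is stdlib-only and uses hypothetical types (ownerRef, object) and a hypothetical helper (machinesOfDeployment) in place of the real Kubernetes and MCM types. It also matches owners by name for brevity; a real implementation would more likely compare UIDs.

```go
package main

import "fmt"

// ownerRef and object are simplified, hypothetical stand-ins for
// metav1.OwnerReference and the Machine / MachineSet objects.
type ownerRef struct {
	kind string
	name string
}

type object struct {
	name   string
	owners []ownerRef
}

// ownerOfKind returns the name of the first owner with the given kind.
func ownerOfKind(o object, kind string) (string, bool) {
	for _, ref := range o.owners {
		if ref.kind == kind {
			return ref.name, true
		}
	}
	return "", false
}

// machinesOfDeployment walks Machine -> MachineSet -> MachineDeployment and
// returns the names of the machines that belong to deployName.
// Labels are never consulted.
func machinesOfDeployment(machines []object, machineSets map[string]object, deployName string) []string {
	var result []string
	for _, m := range machines {
		setName, ok := ownerOfKind(m, "MachineSet")
		if !ok {
			continue // orphaned machine
		}
		set, ok := machineSets[setName]
		if !ok {
			continue // owning MachineSet not found
		}
		if owner, ok := ownerOfKind(set, "MachineDeployment"); ok && owner == deployName {
			result = append(result, m.name)
		}
	}
	return result
}

func main() {
	machineSets := map[string]object{
		"workers-abc": {name: "workers-abc", owners: []ownerRef{{kind: "MachineDeployment", name: "workers"}}},
		"other-xyz":   {name: "other-xyz", owners: []ownerRef{{kind: "MachineDeployment", name: "other"}}},
	}
	machines := []object{
		{name: "m1", owners: []ownerRef{{kind: "MachineSet", name: "workers-abc"}}},
		{name: "m2", owners: []ownerRef{{kind: "MachineSet", name: "other-xyz"}}},
		{name: "m3"}, // no owner at all
	}
	fmt.Println(machinesOfDeployment(machines, machineSets, "workers")) // prints [m1]
}
```

Because the lookup follows ownership instead of labels, it would keep working for MachineDeployments created outside Gardener that do not set the "name" label.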
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
MCM: v0.61.1