
Rapid scale up and scale down of a MCD result in a cordoned machine #1085

@aaronfern


How to categorize this issue?

/area auto-scaling
/kind bug
/priority 1

What happened:
We recently found a bug where a rapid scale up and scale down of an MCD can leave a cordoned machine in the cluster.
This is caused by the scale-down logic we have with CA, combined with the recent changes to manageReplicas() in the machineSet controller.

The case occurs when CA scales down an MCD and then quickly scales the same MCD back up, before the MCD controller notices the replica change and starts reconciliation. The MCD ends up with the same number of replicas it started with, so the MCD controller has no work to do: from its point of view, the MCD replicas have not changed.
As part of the scale-down operation, CA cordons an underutilised node. This node then remains part of the cluster, still cordoned, because MCM makes no further attempt to remove it.

Problem example

T0:
	We have an MCD with 6 replicas and machines M1…M6
	MCD rep=6

T1:
	CA finds M1 unneeded. It reduces the MCD replicas to 5, adds the machine (M1) to the trigger-for-deletion annotation on the MCD, and cordons this machine
	MCD rep=5; trigger-for-deletion-annotation=M1
	M1 cordoned by CA
	(MCD controller does not run - for whatever reason)

T2:
	CA increases the MCD replicas back to 6
	MCD rep=6
	(MCD controller still does not run - for whatever reason)

T3:
	MCD controller starts here
	Sees that rep=6 (no change). No work done, so M1 stays cordoned
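
The timeline above can be replayed with a small simulation. This is only a sketch under stated assumptions, not MCM or CA code: `MachineDeployment`, `ca_scale_down`, `ca_scale_up`, and `reconcile` are hypothetical names, and `reconcile` models a controller that acts purely on the difference between the desired replica count and the number of existing machines. Scale-up machine creation is not modelled, and the sketch assumes the trigger-for-deletion entry is left in place at T2.

```python
from dataclasses import dataclass, field


@dataclass
class MachineDeployment:
    """Hypothetical stand-in for an MCD: desired replicas plus its machines."""
    replicas: int
    cordoned: dict = field(default_factory=dict)        # machine name -> cordoned?
    trigger_for_deletion: set = field(default_factory=set)


def ca_scale_down(mcd, victim):
    """T1: CA lowers replicas, annotates the victim, and cordons it."""
    mcd.replicas -= 1
    mcd.trigger_for_deletion.add(victim)
    mcd.cordoned[victim] = True


def ca_scale_up(mcd):
    """T2: CA raises replicas again; it does not touch the cordoned machine."""
    mcd.replicas += 1


def reconcile(mcd):
    """Assumed controller logic: only acts if there are surplus machines."""
    surplus = len(mcd.cordoned) - mcd.replicas
    if surplus <= 0:
        return []  # replica count matches (or exceeds) machines: no deletions
    candidates = [n for n in sorted(mcd.trigger_for_deletion) if n in mcd.cordoned]
    victims = candidates[:surplus]
    for name in victims:
        del mcd.cordoned[name]
        mcd.trigger_for_deletion.discard(name)
    return victims


def run_timeline(reconcile_between=False):
    """Replay T0..T3; optionally let the controller run between T1 and T2."""
    mcd = MachineDeployment(replicas=6, cordoned={f"M{i}": False for i in range(1, 7)})
    removed = []
    ca_scale_down(mcd, "M1")              # T1
    if reconcile_between:
        removed += reconcile(mcd)         # controller notices rep=5 in time
    ca_scale_up(mcd)                      # T2
    removed += reconcile(mcd)             # T3
    stranded = [name for name, is_cordoned in mcd.cordoned.items() if is_cordoned]
    return mcd.replicas, removed, stranded


if __name__ == "__main__":
    print(run_timeline())                        # controller only runs at T3
    print(run_timeline(reconcile_between=True))  # controller also runs after T1
```

In this model, when the controller runs only at T3 it sees six replicas and six machines, removes nothing, and M1 is left cordoned. When it also runs between T1 and T2, M1 is removed before the scale up and no cordoned machine remains.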

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
