Open
79 commits
5adaf89
[WIP] Add support for machine preservation through annotations
thiyyakat Oct 29, 2025
5fc6abb
Add MachinePreserveTimeout to SafetyOptions.
thiyyakat Nov 5, 2025
d8a0764
Add PreserveExpiryTime to `machine.Status.CurrentStatus`.
thiyyakat Nov 5, 2025
bba1108
Remove `AutoPreserveFailedMachineCount` from machine set
thiyyakat Nov 5, 2025
7e86537
Fix linting error
thiyyakat Nov 5, 2025
f058674
Add generated files
thiyyakat Nov 5, 2025
7f1861a
Add support for preserve=now on node and machine objects
thiyyakat Nov 5, 2025
bd90ed1
Update TODOs
thiyyakat Nov 5, 2025
a2082bb
[WIP] Implement add/remove/update of node and machine annotations
thiyyakat Nov 10, 2025
27d1807
Update preserve logic to honour node annotations over machine
thiyyakat Nov 13, 2025
1e87d0b
Add preservation logic in machineset controller. TODO: remove debug logs
thiyyakat Nov 19, 2025
36f6c4f
Add drain logic post preservation of failed machine
thiyyakat Nov 19, 2025
11010d6
Fix return for reconcileMachineHealth. Unit tests passing
thiyyakat Nov 19, 2025
5dadf85
Update CRDs
thiyyakat Nov 19, 2025
3980331
Fix bug causing repeated requeuing
thiyyakat Nov 24, 2025
98d1b21
Fix drain logic in machine preservation for Unknown->Failed case:
thiyyakat Nov 26, 2025
c0a5647
Fix toggle between now and when-failed when machine has not failed.
thiyyakat Nov 27, 2025
fcbca23
Refactor changes to support auto-preservation of failed machines
thiyyakat Dec 4, 2025
bf520e1
Fix bugs that prevented MCS update, and auto-preservation of machines
thiyyakat Dec 5, 2025
37ef7fa
Add support for uncordoning preserved node that is healthy
thiyyakat Dec 8, 2025
37caeb7
Refactor code:
thiyyakat Dec 10, 2025
e47153e
Fix bug so that recovered preserved nodes are uncordoned
thiyyakat Dec 10, 2025
de3f92f
Minor changes
thiyyakat Dec 10, 2025
2a51c1e
Change verb used in log statements for machine/node name
thiyyakat Dec 10, 2025
95438c1
Fix mistake made during rebasing
thiyyakat Dec 10, 2025
6fc6317
Change return types of preservation util functions such that only cal…
thiyyakat Dec 11, 2025
2c74ef8
Address review comments
thiyyakat Dec 12, 2025
1c73121
Remove incorrect json tag and regenerate CRDs.
thiyyakat Dec 18, 2025
6dc35fe
Apply suggestions from code review - part 1
thiyyakat Dec 19, 2025
a38223c
Delete invalid gitlink
thiyyakat Dec 19, 2025
165af13
Address review comments- part 2:
thiyyakat Dec 22, 2025
6720e4d
Address review comments- part 3:
thiyyakat Dec 23, 2025
14e4af3
Address review comments- part 4:
thiyyakat Dec 23, 2025
c94e391
Add unit tests for preservation logic in machine.go
thiyyakat Dec 24, 2025
4d3482a
Refactor tests to reduce redundancy in code.
thiyyakat Dec 26, 2025
5033fdf
Add tests for preservation logic in machine_util.go
thiyyakat Dec 29, 2025
58ac6cd
Refactor test code to reduce redundant code
thiyyakat Dec 31, 2025
68b9ed1
Fix bugs after merging
thiyyakat Dec 31, 2025
fd1c51e
Remove testing code
thiyyakat Jan 6, 2026
f58d703
Address review comments - part 5: Change api fields to pointers
thiyyakat Jan 8, 2026
1d9e0ba
Fix Makefile
thiyyakat Jan 8, 2026
8790b32
Add crds
thiyyakat Jan 8, 2026
2f4fa29
Address review comments - part 6: Replace function preserveExpiryTime…
thiyyakat Jan 8, 2026
652f094
Address review comments - part 7:
thiyyakat Jan 13, 2026
a6faa51
Address review comments - part 7:
thiyyakat Jan 13, 2026
3242d4a
Fix apis.md
thiyyakat Jan 14, 2026
9cab0bc
Fix apis.md and address review comments
thiyyakat Jan 14, 2026
e3b5dd6
Modify nodeops.AddOrUpdateConditionsOnNode() to return updated node
thiyyakat Jan 16, 2026
2ff8ec9
Address review comments - part 8:
thiyyakat Jan 16, 2026
e796a93
Handle auto-preserved case similar to when-failed case
thiyyakat Jan 16, 2026
ff5b90a
Fix bugs, incorporate design change for when-failed, and add tests
thiyyakat Jan 16, 2026
6c17ccd
Revert Makefile changes
thiyyakat Jan 16, 2026
48d8b7d
Add preservation tests for machineSet controller
thiyyakat Jan 19, 2026
d538707
Update comments and fix minor bugs
thiyyakat Jan 19, 2026
91f2f1f
Address review comments - part 9
thiyyakat Jan 20, 2026
c472897
Address review comments - part 10: Remove unnecessary nil checks whil…
thiyyakat Jan 21, 2026
910cef5
Add machine-preserve-timeout flag
thiyyakat Jan 21, 2026
160dd75
Address review comments - part 11:
thiyyakat Jan 22, 2026
9880185
Address review comments - part 12:
thiyyakat Jan 23, 2026
a56c49c
Handle edge cases:
thiyyakat Jan 23, 2026
62d4348
Ensure reconcileClusterMachineSafetyAPIServer does not overwrite Pres…
thiyyakat Feb 3, 2026
d3e2fa8
Remove PreserveMachineAnnotationValuePreserveStoppedByMCM annotation …
thiyyakat Feb 10, 2026
7b274b1
Add usage doc for preservation feature
thiyyakat Feb 10, 2026
a6743ed
Make changes to simplify design:
thiyyakat Feb 10, 2026
084835b
Add code to reconcile auto preservation, and reduce number of auto-pr…
thiyyakat Feb 12, 2026
6871667
Modify annotation handling to improve determinism: Introduced lastApp…
thiyyakat Feb 12, 2026
cc4404a
Update tests and handle edge cases
thiyyakat Feb 16, 2026
bcfc56b
Fix bugs introduced by latest changes
thiyyakat Feb 18, 2026
87e1482
Sync MCD's value of AutoPreserveFailedMachineMax to MCS on change
thiyyakat Feb 18, 2026
0d385e7
Make changes to machineset controller's preservation logic to sync wi…
thiyyakat Feb 20, 2026
a9dbc21
Change proposal to reflect changes in design
thiyyakat Feb 20, 2026
b7415e7
Update usage doc.
thiyyakat Feb 20, 2026
e7fef73
Clean up comments
thiyyakat Feb 24, 2026
4d5e455
Clean up manageMachinePreservation
thiyyakat Feb 26, 2026
dcf0573
Address review comments
thiyyakat Mar 2, 2026
84a42a4
Address review comments given by
thiyyakat Mar 3, 2026
b00ca03
Address review comments given by
thiyyakat Mar 6, 2026
1538501
Address review comments by
thiyyakat Mar 11, 2026
9abff82
Address review comments by
thiyyakat Mar 11, 2026
101 changes: 101 additions & 0 deletions docs/documents/apis.md
@@ -513,6 +513,21 @@ not be estimated during the time a MachineDeployment is paused. This is not set
by default, which is treated as infinite deadline.</p>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
<p>The maximum number of failed machines in the machine deployment that can be auto-preserved.
In the Gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level and distributed among the worker&rsquo;s machine deployments.</p>
</td>
</tr>
</table>
</td>
</tr>
@@ -678,6 +693,19 @@ int32
<em>(Optional)</em>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
</td>
</tr>
</table>
</td>
</tr>
@@ -833,6 +861,21 @@ Kubernetes meta/v1.Time
<p>Last update time of current status</p>
</td>
</tr>
<tr>
<td>
<code>preserveExpiryTime</code>
</td>
<td>
<em>
<a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#time-v1-meta">
Kubernetes meta/v1.Time
</a>
</em>
</td>
<td>
<p>PreserveExpiryTime is the time at which MCM will stop preserving the machine</p>
</td>
</tr>
</tbody>
</table>
<br>
@@ -1071,6 +1114,22 @@ Kubernetes meta/v1.Duration
</tr>
<tr>
<td>
<code>machinePreserveTimeout</code>
</td>
<td>
<em>
<a href="https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#Duration">
Kubernetes meta/v1.Duration
</a>
</em>
</td>
<td>
<em>(Optional)</em>
<p>MachinePreserveTimeout is the timeout after which the machine preservation is stopped</p>
</td>
</tr>
<tr>
<td>
<code>disableHealthTimeout</code>
</td>
<td>
@@ -1398,6 +1457,21 @@ not be estimated during the time a MachineDeployment is paused. This is not set
by default, which is treated as infinite deadline.</p>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
<p>The maximum number of failed machines in the machine deployment that can be auto-preserved.
In the Gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level and distributed among the worker&rsquo;s machine deployments.</p>
</td>
</tr>
</tbody>
</table>
<br>
@@ -1860,6 +1934,19 @@ int32
<em>(Optional)</em>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineMax</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
</td>
</tr>
</tbody>
</table>
<br>
@@ -1998,6 +2085,20 @@ LastOperation
<p>FailedMachines has summary of machines on which lastOperation Failed</p>
</td>
</tr>
<tr>
<td>
<code>autoPreserveFailedMachineCount</code>
</td>
<td>
<em>
int32
</em>
</td>
<td>
<em>(Optional)</em>
<p>AutoPreserveFailedMachineCount has a count of the number of failed machines in the machineset that are currently auto-preserved</p>
</td>
</tr>
</tbody>
</table>
<br>
79 changes: 39 additions & 40 deletions docs/proposals/machine-preservation.md
@@ -29,17 +29,18 @@ Related Issue: https://github.com/gardener/machine-controller-manager/issues/100
## Proposal

In order to achieve the objectives mentioned, the following are proposed:
1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be auto-preserved,
and the time duration for which these machines will be preserved.
```
machineControllerManager:
autoPreserveFailedMax: 0
machinePreserveTimeout: 72h
```
* This configuration will be set per worker pool.
* Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `autoPreserveFailedMax` will be distributed across N machine deployments.
* `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
* Example: if `autoPreserveFailedMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
1. Enhance `worker` configuration in the `ShootSpec`, to specify the maximum number of failed machines that will be auto-preserved and the time duration for which machines will be preserved.
```
workers:
- name: example-worker
autoPreserveFailedMachineMax: 2
machineControllerManager:
machinePreserveTimeout: 72h
```
* This configuration will be set per worker pool.
* Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMachineMax` will be distributed across the N MachineDeployments.
* `autoPreserveFailedMachineMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
* Example: if `autoPreserveFailedMachineMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
3. Allow a user/operator to request preservation of a specific machine/node using the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
@@ -49,29 +50,28 @@ and the time duration for which these machines will be preserved.
- After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The machine phase is changed to `Running` and the CA may delete the node.
- If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
- The machine is drained of pods except for Daemonset pods.
- Pods (other than DaemonSet pods) are drained.
- The machine phase is changed to `Failed:Preserved`.
- `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$.
- After timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMax` is not breached:
6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMachineMax` is not breached:
- Pods (other than DaemonSet pods) are drained.
- The machine's phase is changed to `Failed:Preserved`.
- `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
- `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$.
- After timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
- Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
7. If a failed machine is currently in `Failed:Preserved` and before timeout its VM/node is found to be Healthy, the machine will be moved to `Running:Preserved`. After the timeout, it will be moved to `Running`.
The rationale behind moving the machine to `Running:Preserved` rather than `Running`, is to allow pods to get scheduled on to the healthy node again without the autoscaler scaling it down due to under-utilization.
8. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase using the annotation: `node.machine.sapcloud.io/preserve=false`.
- Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMachineMax`.
7. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase by deleting the annotation: `node.machine.sapcloud.io/preserve`.
* MCM will move a machine thus annotated either to `Running` phase or `Terminating` depending on the phase of the machine before it was preserved.
9. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
10. MCM will be modified to perform drain in `Failed` phase rather than `Terminating`.
8. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
9. MCM will be modified to perform drain in `Failed` phase for preserved machines.

## State Diagrams:

1. State Diagram for when a machine or its node is explicitly annotated for preservation:
```mermaid
```mermaid
stateDiagram-v2
state "Running" as R
state "Running + Requested" as RR
@@ -86,27 +86,26 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
RR --> F: on failure
F --> FP
FP --> T: on timeout or preserve=false
FP --> RP: if node Healthy before timeout
FP --> R: if node Healthy before timeout
T --> [*]
R-->RP: annotated with preserve=now
RP-->F: if node/VM not healthy
```
```

2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
```mermaid
stateDiagram-v2
state "Running" as R
state "Running:Preserved" as RP
state "Failed
(node drained)" as F
state "Failed:Preserved" as FP
state "Terminating" as T
[*] --> R
R-->F: on failure
F --> FP: if autoPreserveFailedMax not breached
F --> T: if autoPreserveFailedMax breached
F --> FP: if autoPreserveFailedMachineMax not breached
F --> T: if autoPreserveFailedMachineMax breached
FP --> T: on timeout or value=false
FP --> RP : if node Healthy before timeout
RP --> R: on timeout or preserve=false
FP --> R : if node Healthy before timeout
T --> [*]
```
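
For readers who prefer code to diagrams, the auto-preservation diagram can be read as a small transition table. The sketch below follows the edges that use the `autoPreserveFailedMachineMax` name and the `FP --> R` recovery edge. The `Phase`, `Event`, and `Next` names are hypothetical and do not come from the MCM code base.

```go
package main

import "fmt"

// Phase is a machine phase, optionally with the Preserved sub-phase.
type Phase string

const (
	Running         Phase = "Running"
	Failed          Phase = "Failed"
	FailedPreserved Phase = "Failed:Preserved"
	Terminating     Phase = "Terminating"
)

// Event is a hypothetical label for an edge condition in the diagram.
type Event string

const (
	NodeFailed       Event = "on failure"
	MaxNotBreached   Event = "autoPreserveFailedMachineMax not breached"
	MaxBreached      Event = "autoPreserveFailedMachineMax breached"
	TimeoutOrRelease Event = "on timeout or preservation stopped"
	NodeHealthy      Event = "node Healthy before timeout"
)

// edge identifies one arrow in the diagram by its source phase and condition.
type edge struct {
	from Phase
	on   Event
}

// transitions lists the arrows of the auto-preservation diagram.
var transitions = map[edge]Phase{
	{Running, NodeFailed}:               Failed,
	{Failed, MaxNotBreached}:            FailedPreserved,
	{Failed, MaxBreached}:               Terminating,
	{FailedPreserved, TimeoutOrRelease}: Terminating,
	{FailedPreserved, NodeHealthy}:      Running,
}

// Next returns the phase reached from p on event e.
// The boolean is false when the diagram has no such edge.
func Next(p Phase, e Event) (Phase, bool) {
	to, ok := transitions[edge{p, e}]
	return to, ok
}

func main() {
	// Walk one path: the machine fails, is auto-preserved, then recovers.
	p := Running
	for _, e := range []Event{NodeFailed, MaxNotBreached, NodeHealthy} {
		next, ok := Next(p, e)
		if !ok {
			fmt.Printf("no edge from %s on %q\n", p, e)
			return
		}
		fmt.Printf("%s -> %s (%s)\n", p, next, e)
		p = next
	}
}
```

Keeping the edges in a map makes it easy to compare the table against the diagram line by line when the design changes.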

@@ -128,21 +127,22 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
4. Operator analyzes the VM.


### Use Case 3: Auto-Preservation
### Use Case 3: Auto-Preservation of Failed Machine aiding in Failure Analysis and Recovery
**Scenario:** Machine fails unexpectedly, no prior annotation.
#### Steps:
1. Machine transitions to `Failed` phase.
2. Machine is drained.
3. If `autoPreserveFailedMax` is not breached, machine moved to `Failed:Preserved` phase by MCM.
3. If `autoPreserveFailedMachineMax` is not breached, machine moved to `Failed:Preserved` phase by MCM.
4. After `machinePreserveTimeout`, machine is terminated by MCM.
5. If machine is brought back to `Running` phase before timeout, pods can be scheduled on it again.

### Use Case 4: Early Release
**Scenario:** The operator has completed their analysis and no longer requires the machine to be preserved.
#### Steps:
1. Machine is in `Running:Preserved` or `Failed:Preserved` phase.
2. Operator adds: `node.machine.sapcloud.io/preserve=false` to node.
2. Operator removes `node.machine.sapcloud.io/preserve` from node.
3. MCM transitions machine to `Running` or `Terminating` for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired.
4. If machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
4. If machine was auto-preserved, capacity becomes available for auto-preservation.

## Points to Note

@@ -151,13 +151,12 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
3. Consumers (with access to shoot cluster) can annotate Nodes they would like to preserve.
4. Operators (with access to control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
5. If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value, and be synced to the Machine object.
7. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value.
7. If `autoPreserveFailedMachineMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
8. In case of a scale down of an MCD's replica count, `Preserved` machines will be the last to be scaled down. Replica count will always be honoured.
9. If the value for annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
10. On increase/decrease of timeout, the new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
11. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
- harmonise machine flow
- shield from CA's internals
- make it generic and no longer CA specific
- allow a timeout to be specified
9. On increase/decrease of `machinePreserveTimeout`, the new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
10. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
- harmonise machine flow
- shield from CA's internals
- make it generic and no longer CA specific
- allow a timeout to be specified