gardener · thiyyakat · Oct 29, 2025 · Nov 5, 2025 · Nov 5, 2025 · Nov 5, 2025
@@ -513,6 +513,21 @@ not be estimated during the time a MachineDeployment is paused. This is not set
 by default, which is treated as infinite deadline.</p>
 </td>
 </tr>
+<tr>
+<td>
+<code>autoPreserveFailedMachineMax</code>
+</td>
+<td>
+<em>
+int32
+</em>
+</td>
+<td>
+<em>(Optional)</em>
+<p>The maximum number of failed machines in the machine deployment that can be auto-preserved.
+In the Gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level, distributed amongst the worker&rsquo;s machine deployments.</p>
+</td>
+</tr>
 </table>
 </td>
 </tr>
@@ -678,6 +693,19 @@ int32
 <em>(Optional)</em>
 </td>
 </tr>
+<tr>
+<td>
+<code>autoPreserveFailedMachineMax</code>
+</td>
+<td>
+<em>
+int32
+</em>
+</td>
+<td>
+<em>(Optional)</em>
+</td>
+</tr>
 </table>
 </td>
 </tr>
@@ -833,6 +861,21 @@ Kubernetes meta/v1.Time
 <p>Last update time of current status</p>
 </td>
 </tr>
+<tr>
+<td>
+<code>preserveExpiryTime</code>
+</td>
+<td>
+<em>
+<a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#time-v1-meta">
+Kubernetes meta/v1.Time
+</a>
+</em>
+</td>
+<td>
+<p>PreserveExpiryTime is the time at which MCM will stop preserving the machine.</p>
+</td>
+</tr>
 </tbody>
 </table>
 <br>
@@ -1071,6 +1114,22 @@ Kubernetes meta/v1.Duration
 </tr>
 <tr>
 <td>
+<code>machinePreserveTimeout</code>
+</td>
+<td>
+<em>
+<a href="https://godoc.org/k8s.io/apimachinery/pkg/apis/meta/v1#Duration">
+Kubernetes meta/v1.Duration
+</a>
+</em>
+</td>
+<td>
+<em>(Optional)</em>
+<p>MachinePreserveTimeout is the timeout after which machine preservation is stopped.</p>
+</td>
+</tr>
+<tr>
+<td>
 <code>disableHealthTimeout</code>
 </td>
 <td>
@@ -1398,6 +1457,21 @@ not be estimated during the time a MachineDeployment is paused. This is not set
 by default, which is treated as infinite deadline.</p>
 </td>
 </tr>
+<tr>
+<td>
+<code>autoPreserveFailedMachineMax</code>
+</td>
+<td>
+<em>
+int32
+</em>
+</td>
+<td>
+<em>(Optional)</em>
+<p>The maximum number of failed machines in the machine deployment that can be auto-preserved.
+In the Gardener context, this number is derived from the AutoPreserveFailedMachineMax set at the worker level, distributed amongst the worker&rsquo;s machine deployments.</p>
+</td>
+</tr>
 </tbody>
 </table>
 <br>
@@ -1860,6 +1934,19 @@ int32
 <em>(Optional)</em>
 </td>
 </tr>
+<tr>
+<td>
+<code>autoPreserveFailedMachineMax</code>
+</td>
+<td>
+<em>
+int32
+</em>
+</td>
+<td>
+<em>(Optional)</em>
+</td>
+</tr>
 </tbody>
 </table>
 <br>
@@ -1998,6 +2085,20 @@ LastOperation
 <p>FailedMachines has summary of machines on which lastOperation Failed</p>
 </td>
 </tr>
+<tr>
+<td>
+<code>autoPreserveFailedMachineCount</code>
+</td>
+<td>
+<em>
+int32
+</em>
+</td>
+<td>
+<em>(Optional)</em>
+<p>AutoPreserveFailedMachineCount is the number of failed machines in the MachineSet that are currently auto-preserved.</p>
+</td>
+</tr>
 </tbody>
 </table>
 <br>

@@ -29,17 +29,18 @@ Related Issue: https://github.com/gardener/machine-controller-manager/issues/100
 ## Proposal
 
 In order to achieve the objectives mentioned, the following are proposed:
-1. Enhance `machineControllerManager` configuration in the `ShootSpec`, to specify the max number of machines to be auto-preserved,
-and the time duration for which these machines will be preserved.
-    ```
-    machineControllerManager:
-       autoPreserveFailedMax: 0
-       machinePreserveTimeout: 72h
-    ```
-    * This configuration will be set per worker pool.
-    * Since gardener worker pool can correspond to `1..N` MachineDeployments depending on number of zones, `autoPreserveFailedMax` will be distributed across N machine deployments.
-    * `autoPreserveFailedMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
-    * Example: if `autoPreserveFailedMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
+1. Enhance the `worker` configuration in the `ShootSpec` to specify the maximum number of failed machines that will be auto-preserved and the duration for which machines will be preserved.
+```
+   workers:
+   - name: example-worker 
+     autoPreserveFailedMachineMax: 2
+     machineControllerManager:
+          machinePreserveTimeout: 72h
+```
+  * This configuration will be set per worker pool.
+  * Since a Gardener worker pool can correspond to `1..N` MachineDeployments depending on the number of zones, `autoPreserveFailedMachineMax` will be distributed across the N MachineDeployments.
+  * `autoPreserveFailedMachineMax` must be chosen such that it can be appropriately distributed across the MachineDeployments.
+  * Example: if `autoPreserveFailedMachineMax` is set to 2, and the worker pool has 2 zones, then the maximum number of machines that will be preserved per zone is 1.
 2. MCM will be modified to include a new sub-phase `Preserved` to indicate that the machine has been preserved by MCM.
 3. Allow a user/operator to request preservation of a specific machine/node using the annotations `node.machine.sapcloud.io/preserve=now` and `node.machine.sapcloud.io/preserve=when-failed`.
 4. When annotation `node.machine.sapcloud.io/preserve=now` is added to a `Running` machine, the following will take place:
@@ -49,29 +50,28 @@ and the time duration for which these machines will be preserved.
    - After timeout, the `node.machine.sapcloud.io/preserve=now` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted. The `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The machine phase is changed to `Running` and the CA may delete the node.
      - If a machine in `Running:Preserved` fails, it is moved to `Failed:Preserved`.
 5. When annotation `node.machine.sapcloud.io/preserve=when-failed` is added to a `Running` machine and the machine goes to `Failed`, the following will take place:
-    - The machine is drained of pods except for Daemonset pods.
+    - Pods (other than DaemonSet pods) are drained.
     - The machine phase is changed to `Failed:Preserved`.
     - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
     - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$.
     - After timeout, the annotations `node.machine.sapcloud.io/preserve=when-failed` and `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` are deleted. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
-6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMax` is not breached:
+6. When an un-annotated machine goes to `Failed` phase and `autoPreserveFailedMachineMax` is not breached:
    - Pods (other than DaemonSet pods) are drained.
    - The machine's phase is changed to `Failed:Preserved`.
    - `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is added to the node to prevent CA from scaling it down.
    - `machine.CurrentStatus.PreserveExpiryTime` is updated by MCM as $machine.CurrentStatus.PreserveExpiryTime = currentTime+machinePreserveTimeout$.
    - After timeout, the annotation `cluster-autoscaler.kubernetes.io/scale-down-disabled: "true"` is deleted. `machine.CurrentStatus.PreserveExpiryTime` is set to `nil`. The phase is changed to `Terminating`.
-   - Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMax`.
-7. If a failed machine is currently in `Failed:Preserved` and before timeout its VM/node is found to be Healthy, the machine will be moved to `Running:Preserved`. After the timeout, it will be moved to `Running`. 
-The rationale behind moving the machine to `Running:Preserved` rather than `Running`, is to allow pods to get scheduled on to the healthy node again without the autoscaler scaling it down due to under-utilization. 
-8. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase using the annotation: `node.machine.sapcloud.io/preserve=false`. 
+   - Number of machines in `Failed:Preserved` phase count towards enforcing `autoPreserveFailedMachineMax`.
+7. A user/operator can request MCM to stop preserving a machine/node in `Running:Preserved` or `Failed:Preserved` phase by deleting the annotation: `node.machine.sapcloud.io/preserve`. 
    * MCM will move such a machine to either the `Running` or the `Terminating` phase, depending on the phase of the machine before it was preserved.
-9. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment.
-10. MCM will be modified to perform drain in `Failed` phase rather than `Terminating`.
+8. Machines of a MachineDeployment in `Preserved` sub-phase will also be counted towards the replica count and in the enforcement of maximum machines allowed for the MachineDeployment. 
+9. MCM will be modified to perform drain in `Failed` phase for preserved machines.
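
The distribution of `autoPreserveFailedMachineMax` across a worker pool's MachineDeployments (item 1) can be sketched as follows. This is only an illustration: the function name `distributeAutoPreserveMax` is hypothetical, and the handling of a remainder (handing it out one at a time to the first MachineDeployments) is an assumption, since the proposal only states that the value must be chosen so that it distributes appropriately.

```go
package main

import "fmt"

// distributeAutoPreserveMax splits a worker-level autoPreserveFailedMachineMax
// across the given number of MachineDeployments (one per zone).
// Assumption: workerMax is non-negative, and any remainder is handed out one
// at a time to the first MachineDeployments. The proposal does not specify
// how a remainder is treated.
func distributeAutoPreserveMax(workerMax int32, deployments int) []int32 {
	if deployments <= 0 {
		return nil
	}
	shares := make([]int32, deployments)
	base := workerMax / int32(deployments)
	remainder := workerMax % int32(deployments)
	for i := range shares {
		shares[i] = base
		if int32(i) < remainder {
			shares[i]++
		}
	}
	return shares
}

func main() {
	// The proposal's example: a max of 2 over 2 zones gives 1 per zone.
	fmt.Println(distributeAutoPreserveMax(2, 2))
	// An uneven case, resolved by the remainder assumption stated above.
	fmt.Println(distributeAutoPreserveMax(3, 2))
}
```

Under this assumption the per-deployment shares always sum to the worker-level value, so the pool as a whole never preserves more failed machines than configured.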
 
 ## State Diagrams:
 
 1. State Diagram for when a machine or its node is explicitly annotated for preservation:
-    ```mermaid
+```mermaid
     stateDiagram-v2
         state "Running" as R
         state "Running + Requested" as RR
@@ -86,27 +86,26 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
         RR --> F: on failure
         F --> FP
         FP --> T: on timeout or preserve annotation removed
-        FP --> RP: if node Healthy before timeout
+        FP --> R: if node Healthy before timeout
         T --> [*]
         R-->RP: annotated with preserve=now
         RP-->F: if node/VM not healthy
-    ```
+```
+
 2. State Diagram for when an un-annotated `Running` machine fails (Auto-preservation):
     ```mermaid
     stateDiagram-v2
         state "Running" as R
-        state "Running:Preserved" as RP
         state "Failed
         (node drained)" as F
         state "Failed:Preserved" as FP
         state "Terminating" as T
         [*] --> R
         R-->F: on failure
-        F --> FP: if autoPreserveFailedMax not breached
-        F --> T: if autoPreserveFailedMax breached
+        F --> FP: if autoPreserveFailedMachineMax not breached
+        F --> T: if autoPreserveFailedMachineMax breached
         FP --> T: on timeout or preserve annotation removed
-        FP --> RP : if node Healthy before timeout
-        RP --> R: on timeout or preserve=false
+        FP --> R : if node Healthy before timeout
         T --> [*]
     ```
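
The auto-preservation diagram can also be read as a small transition table. The sketch below encodes its edges in Go; the type and event names are hypothetical and chosen only for illustration, and this is not MCM's actual controller code.

```go
package main

import "fmt"

type phase string
type event string

const (
	running         phase = "Running"
	failed          phase = "Failed"
	failedPreserved phase = "Failed:Preserved"
	terminating     phase = "Terminating"
)

// Event names are illustrative labels for the edges of the diagram.
const (
	onFailure        event = "failure"
	maxNotBreached   event = "autoPreserveFailedMachineMax not breached"
	maxBreached      event = "autoPreserveFailedMachineMax breached"
	timeoutOrRelease event = "timeout or early release"
	nodeHealthy      event = "node healthy before timeout"
)

type edge struct {
	from phase
	on   event
}

// transitions mirrors the edges of the auto-preservation state diagram.
var transitions = map[edge]phase{
	{running, onFailure}:                 failed,
	{failed, maxNotBreached}:             failedPreserved,
	{failed, maxBreached}:                terminating,
	{failedPreserved, timeoutOrRelease}:  terminating,
	{failedPreserved, nodeHealthy}:       running,
}

// next returns the phase reached from p on event e, and false if the
// diagram has no such edge.
func next(p phase, e event) (phase, bool) {
	n, ok := transitions[edge{p, e}]
	return n, ok
}

func main() {
	// Walk one path: the machine fails, is auto-preserved, then recovers.
	p := running
	for _, e := range []event{onFailure, maxNotBreached, nodeHealthy} {
		n, ok := next(p, e)
		if !ok {
			fmt.Printf("no transition from %s on %q\n", p, e)
			return
		}
		fmt.Printf("%s -> %s (%s)\n", p, n, e)
		p = n
	}
}
```

Keeping the edges in a single table makes it easy to check that a recovered machine returns directly to `Running`, with no intermediate `Running:Preserved` step in the auto-preservation flow.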
 
@@ -128,21 +127,22 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
 4. Operator analyzes the VM.
 
 
-### Use Case 3: Auto-Preservation
+### Use Case 3: Auto-Preservation of Failed Machine aiding in Failure Analysis and Recovery
 **Scenario:** Machine fails unexpectedly, no prior annotation.
 #### Steps:
 1. Machine transitions to `Failed` phase.
 2. Machine is drained.
-3. If `autoPreserveFailedMax` is not breached, machine moved to `Failed:Preserved` phase by MCM.
+3. If `autoPreserveFailedMachineMax` is not breached, the machine is moved to the `Failed:Preserved` phase by MCM.
 4. After `machinePreserveTimeout`, machine is terminated by MCM.
+5. If the machine is brought back to the `Running` phase before the timeout, pods can be scheduled on it again.
 
 ### Use Case 4: Early Release
 **Scenario:** The operator has completed their analysis and no longer requires the machine to be preserved.
 #### Steps:
 1. Machine is in `Running:Preserved` or `Failed:Preserved` phase.
-2. Operator adds: `node.machine.sapcloud.io/preserve=false` to node.
+2. Operator removes `node.machine.sapcloud.io/preserve` from node.
 3. MCM transitions machine to `Running` or `Terminating` for `Running:Preserved` or `Failed:Preserved` respectively, even though `machinePreserveTimeout` has not expired.
-4. If machine was in `Failed:Preserved`, capacity becomes available for auto-preservation.
+4. If the machine was auto-preserved, capacity for auto-preservation becomes available again.
 
 ## Points to Note
 
@@ -151,13 +151,12 @@ The rationale behind moving the machine to `Running:Preserved` rather than `Runn
 3. Consumers (with access to shoot cluster) can annotate Nodes they would like to preserve.
 4. Operators (with access to control plane) can additionally annotate Machines that they would like to preserve. This feature can be used when a Machine does not have a backing Node and the operator wishes to preserve the backing VM.
 5. If the backing Node object exists but does not have the preservation annotation, preservation annotations added on the Machine will be honoured.
-6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value, and be synced to the Machine object.
-7. If `autoPreserveFailedMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
+6. However, if a backing Node exists for a Machine and has the preservation annotation, the Node's annotation value will override the Machine annotation value.
+7. If `autoPreserveFailedMachineMax` is reduced in the Shoot Spec, older machines are moved to `Terminating` phase before newer ones.
 8. In case of a scale down of an MCD's replica count, `Preserved` machines will be the last to be scaled down. Replica count will always be honoured.
-9. If the value for annotation key `cluster-autoscaler.kubernetes.io/scale-down-disabled` for a machine in `Running:Preserved` is changed to `false` by a user, the value will be overwritten to `true` by MCM.
-10. On increase/decrease of timeout, the new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
-11. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
-   - harmonise machine flow
-   - shield from CA's internals
-   - make it generic and no longer CA specific
-   - allow a timeout to be specified
+9. On increase/decrease of `machinePreserveTimeout`, the new value will only apply to machines that go into `Preserved` phase after the change. Operators can always edit `machine.CurrentStatus.PreserveExpiryTime` to prolong the expiry time of existing `Preserved` machines.
+10. [Modify CA FAQ](https://github.com/gardener/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node) once feature is developed to use `node.machine.sapcloud.io/preserve=now` instead of the `cluster-autoscaler.kubernetes.io/scale-down-disabled=true` currently suggested. This would:
+    - harmonise machine flow
+    - shield from CA's internals
+    - make it generic and no longer CA specific
+    - allow a timeout to be specified