Machine is stuck at Terminating #1079

@yinkun111

Description

How to categorize this issue?

/area TODO
/kind bug
/priority 3

What happened:
While rolling out the worker pools, we found that the machines get stuck in the Terminating status and are forcibly deleted after 10 minutes.

k get machine
NAME                                                              STATUS        AGE     NODE
shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffd7jmw6   Terminating   7m9s    izuf67xrc46z92m1rsyyekz
shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffdfk5vn   Terminating   7m9s    izuf6hvbrxclomykn2sjsyz
shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffdpzb55   Terminating   6m22s   izuf6j6riqwr923w4j8oz0z
shoot--hc-cn40--prod-hdl-iq-stckd-paid-b-cn-shanghai-f-dd996sxp   Terminating   6m22s   izuf6fs8jxfp5fstoxarhuz
shoot--hc-cn40--prod-hdl-iq-stckd-paid-b-cn-shanghai-g-6bc22p68   Terminating   6m22s   izuf6fs8jxfp5fstoxarhtz
shoot--hc-cn40--prod-hdl-iq-stckd-paid-b-cn-shanghai-g-6bcvtbxs   Terminating   6m17s   izuf6cvv8fagx1wb0jtmavz

Error message:

Worker extension (shoot--hc-cn40--prod-hdl/prod-hdl) reports failing health check: machine "shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffd7jmw6" failed: VM deletion failed due to - machine codes error: code = [Internal] message = [SDKError:
   StatusCode: 403
   Code: IncorrectInstanceStatus.Initializing
   Message: code: 403, The specified instance status does not support this operation. request id: CB1FD9BF-AE7D-566A-BC7B-CD3C91E850DF
   Data: {"Code":"IncorrectInstanceStatus.Initializing","HostId":"ecs.cn-shanghai.aliyuncs.com","Message":"The specified instance status does not support this operation.","Recommend":"https://api.aliyun.com/troubleshoot?q=IncorrectInstanceStatus.Initializing&product=Ecs&requestId=CB1FD9BF-AE7D-566A-BC7B-CD3C91E850DF","RequestId":"CB1FD9BF-AE7D-566A-BC7B-CD3C91E850DF"}
]. Aborting operation. Initiate VM deletion.

We are using Gardener, and the problematic cluster is located on an Ali-Cloud seed. We consulted an Ali-Cloud engineer, who said:

"The ECS instance is performing operations such as creating snapshots, starting, stopping, restarting, replacing the system disk, or there are other API requests making changes to this ECS instance. At this time, the instance cannot be deleted immediately."

What you expected to happen:
We expect the machines to be deleted smoothly, ideally without having to wait 10 minutes.

How to reproduce it (as minimally and precisely as possible):
Roll out the worker pools.

Anything else we need to know?:
Gardener version is v1.134.2

Environment:

  • Kubernetes version (use kubectl version): 1.33.5
  • Cloud provider or hardware configuration: shoot cluster is located on Ali-Cloud, seed is located on Ali-Cloud
  • Others: N/A

Metadata

Assignees

No one assigned

    Labels

    kind/bug, priority/3 (lower number equals higher priority)
