Description
How to categorize this issue?
/area TODO
/kind bug
/priority 3
What happened:
When we roll out the worker pools, the machines get stuck in the Terminating status and are forcefully deleted after 10 minutes.
k get machine
NAME STATUS AGE NODE
shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffd7jmw6 Terminating 7m9s izuf67xrc46z92m1rsyyekz
shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffdfk5vn Terminating 7m9s izuf6hvbrxclomykn2sjsyz
shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffdpzb55 Terminating 6m22s izuf6j6riqwr923w4j8oz0z
shoot--hc-cn40--prod-hdl-iq-stckd-paid-b-cn-shanghai-f-dd996sxp Terminating 6m22s izuf6fs8jxfp5fstoxarhuz
shoot--hc-cn40--prod-hdl-iq-stckd-paid-b-cn-shanghai-g-6bc22p68 Terminating 6m22s izuf6fs8jxfp5fstoxarhtz
shoot--hc-cn40--prod-hdl-iq-stckd-paid-b-cn-shanghai-g-6bcvtbxs Terminating 6m17s izuf6cvv8fagx1wb0jtmavz
Error message:
Worker extension (shoot--hc-cn40--prod-hdl/prod-hdl) reports failing health check: machine "shoot--hc-cn40--prod-hdl-iq-large-v2-b-cn-shanghai-f-59ffd7jmw6" failed: VM deletion failed due to - machine codes error: code = [Internal] message = [SDKError:
StatusCode: 403
Code: IncorrectInstanceStatus.Initializing
Message: code: 403, The specified instance status does not support this operation. request id: CB1FD9BF-AE7D-566A-BC7B-CD3C91E850DF
Data: {"Code":"IncorrectInstanceStatus.Initializing","HostId":"ecs.cn-shanghai.aliyuncs.com","Message":"The specified instance status does not support this operation.","Recommend":"https://api.aliyun.com/troubleshoot?q=IncorrectInstanceStatus.Initializing&product=Ecs&requestId=CB1FD9BF-AE7D-566A-BC7B-CD3C91E850DF","RequestId":"CB1FD9BF-AE7D-566A-BC7B-CD3C91E850DF"}
]. Aborting operation. Initiate VM deletion.
We are using Gardener; the problematic cluster is located on an Ali-Cloud seed. We consulted an Ali-Cloud engineer, who said:
"The ECS instance is performing operations such as creating snapshots, starting, stopping, restarting, replacing the system disk, or there are other API requests making changes to this ECS instance. At this time, the instance cannot be deleted immediately."
What you expected to happen:
We expect the machines to be deleted smoothly, ideally without having to wait 10 minutes.
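To illustrate the behavior we have in mind, here is a minimal sketch in Python. It is not the actual machine-controller-manager or provider code. `CloudAPIError`, `is_transient`, and `delete_with_retry` are hypothetical names, and treating every `IncorrectInstanceStatus.*` code as a transient state is our assumption based on the Ali-Cloud engineer's explanation above. The idea is to retry the delete call with a short backoff while the instance is still in a state such as Initializing, instead of failing immediately.

```python
import time
from typing import Callable

# Assumption: error codes with this prefix describe a transient instance state
# (for example "IncorrectInstanceStatus.Initializing" from the error above).
TRANSIENT_PREFIX = "IncorrectInstanceStatus."


class CloudAPIError(Exception):
    """Hypothetical stand-in for an SDK error that carries the provider's error code."""

    def __init__(self, code: str, message: str = "") -> None:
        super().__init__(f"{code}: {message}")
        self.code = code


def is_transient(err: CloudAPIError) -> bool:
    """Return True if the error code suggests the instance is only temporarily busy."""
    return err.code.startswith(TRANSIENT_PREFIX)


def delete_with_retry(
    delete_vm: Callable[[], None],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call delete_vm until it succeeds and return the number of attempts used.

    Transient errors are retried with exponential backoff; any other error,
    or a transient error on the last attempt, is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            delete_vm()
            return attempt
        except CloudAPIError as err:
            if not is_transient(err) or attempt == attempts:
                raise
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

With the defaults above, the waits between attempts add up to 30 seconds (2 + 4 + 8 + 16), which is well below the 10 minutes we currently observe before the forced deletion.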
How to reproduce it (as minimally and precisely as possible):
Roll out the worker pools.
Anything else we need to know?:
Gardener version is v1.134.2.
Environment:
- Kubernetes version (use kubectl version): 1.33.5
- Cloud provider or hardware configuration: shoot cluster is located on Ali-Cloud, seed is located on Ali-Cloud
- Others: N/a