unhealthyMachineTimeout not working when VM is powered off (VM not deleted from disk) #8785

saiteja313 · 2024-09-17T14:49:13Z

What happened:

I created an EKS Anywhere (EKS-A) cluster with the following configuration:

unhealthyMachineTimeout set to 30 seconds (the minimum value) in the worker node section of the cluster config file [1]
Autoscaling configuration enabled for worker nodes in the cluster config file
Cluster Autoscaler curated package installed on the cluster

After the cluster was created, I tested two scenarios:

Scenario 1: In the VMware vSphere console, select one of the worker nodes, then Right Click > Power Off
Scenario 2: Select one of the worker nodes, Right Click > Power Off, then Right Click again > Delete from Disk

Scenario 1 fails every time. No new node is created within the timeout. The capv pod logs show no event marking the node as unhealthy for 4-5 minutes. After that, either the node is deleted and a new node is provisioned, or the node is powered back on.

Scenario 2 works every time. After the node is deleted, a new node is provisioned within 30 seconds.

[1] https://anywhere.eks.amazonaws.com/docs/getting-started/optional/healthchecks/#__machinehealthcheckunhealthymachinetimeout__-optional

What you expected to happen:

For scenario 1, capv should respect the unhealthyMachineTimeout value of 30 seconds. When unhealthyMachineTimeout is set to 5 minutes, capv takes around 20-40 minutes to detect that the node is powered off or not ready.

I am not sure whether we need something like the node termination handler that Amazon EKS in the cloud has.

How to reproduce it (as minimally and precisely as possible):

Configure the worker node section of the cluster config file as follows:

  workerNodeGroupConfigurations:
  - count: 1
    machineGroupRef:
      kind: VSphereMachineConfig
      name: demo-mgmt
    name: md-0
    autoscalingConfiguration:
      minCount: 1
      maxCount: 5
    machineHealthCheck:
      unhealthyMachineTimeout: 30s
      maxUnhealthy: 100%
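
For context, a setting like this is ultimately enforced by a Cluster API MachineHealthCheck object. The following is only a sketch, assuming that EKS Anywhere maps unhealthyMachineTimeout to the timeout on the Ready=Unknown and Ready=False node conditions. The object name, namespace, cluster name, and selector label value are hypothetical and were not taken from this cluster.

```yaml
# Sketch of a Cluster API MachineHealthCheck matching the settings above.
# Name, namespace, clusterName, and the label value are assumptions for
# illustration; check the objects actually generated on your cluster.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: demo-mgmt-md-0-worker-unhealthy   # hypothetical name
  namespace: eksa-system                  # assumed namespace
spec:
  clusterName: demo-mgmt                  # assumed cluster name
  maxUnhealthy: 100%
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: demo-mgmt-md-0   # hypothetical value
  unhealthyConditions:
  # A machine is remediated once its Node has held one of these
  # conditions for longer than the timeout.
  - type: Ready
    status: Unknown
    timeout: 30s
  - type: Ready
    status: "False"
    timeout: 30s
```

In a MachineHealthCheck, the timeout is measured from the moment the Node condition last changed, so any time the control plane needs to mark a powered-off node as Ready=Unknown comes on top of the 30 seconds. Whether that accounts for the delay reported in scenario 1 has not been established here.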

Anything else we need to know?:

Environment: EKSA with vSphere

EKS Anywhere Release: 0.20

Version: v0.20.4
Release Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/eks-a/manifest.yaml
Bundle Manifest URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/74/manifest.yaml

EKS Distro Release: not sure

saiteja313 changed the title ~~unhealthyMachineTimeout not working when VM is powered off and VM not deleted from the disk~~ unhealthyMachineTimeout not working when VM is powered off (VM not deleted from disk) Sep 17, 2024

vivek-koppuru added this to the v0.21.0 milestone Sep 18, 2024
