Post

Failed tasks in SDDC 5.2

Notes on failed tasks in SDDC 5.2

Failed tasks in SDDC 5.2

Failed tasks in SDDC

Recenlty I faced a weird case with VMware, out of nowhere failed tasks started to show on my SDDC, intially I thought it was something related to password expired, but no, not this time.

Failed tasks in SDDC

The logs don’t lie

People lie, but the logs don’t. After a couple days working with VMware support and we didn’t seem to find any clue of what was happening, by accident I saw something on the SOS healthcheck that caught my attention, it was complaining that the SDDC had problems to connect to my host through API (port 443).

How to run a health check on SDDC

1
/opt/vmware/sddc-support/sos --health-check --domain-name ALL --skip-cert-check
1
2
3
4
5
6
7
|  21 |              ESXi : testhost1.9tec.local               |        Ping status         | GREEN |
|     |                                                                    |  API Connectivity status   |  RED  |

|  21 |              ESXI : testhost1.9tec.local               | svc-vcf-testhost1 |    Jul 20, 2025   |    Never     |      Never      |         GREEN         |
|     |                                                                    |           root          |         -         |      -       |        -        | Failed to get details |
|  21 |              ESXi : testhost1.9tec.local              | NTP Status | YELLOW |
|     |                                                                   |  ESX Time  | YELLOW |

Dang, it reminded me of a situation I found not long ago, where my vSphere replication was causing the reverse proxy on ESXi to exhaust the sessions (the limit is 128). Reference: ESXi hosts go unresponsive on the vCenter Server with maximum connection exceeded for the envoy proxy.

Well well, let’s take a look on the envoy.log on this particular host and see what’s going on

1
2
3
less /var/log/envoy.log
## or
less /var/log/envoy.log

Search by the word “exceed” and you’ll see if there are any evidences of exceeding max connections

1
2
3
4
2025-08-13T22:14:06.853Z In(166) envoy[2100238]: "2025-08-13T22:13:57.261Z warning envoy[2100869] [Originator@6876 sub=filter] [Tags: "ConnectionId":"696461"] remote https connections exceed max allowed: 128"
2025-08-13T22:14:06.853Z In(166) envoy[2100238]: "2025-08-13T22:13:57.261Z warning envoy[2100869] [Originator@6876 sub=filter] [Tags: "ConnectionId":"696461"] closing connection TCP<my_vcenter_IP:59956, my_esxi_IP:443>"
2025-08-13T22:14:06.853Z In(166) envoy[2100238]: "2025-08-13T22:13:57.261Z info envoy[2100869] [Originator@6876 sub=connection] [Tags: "ConnectionId":"696461"] remote address:my_vcenter_IP:59956,TLS_error:|268435588:SSL routines:OPENSSL_internal:CLIENTHELLO_TLSEXT|268435646:SSL routines:OPENSSL_internal:PARSE_TLSEXT"

The workaround

As the KB mentioned above says, there’s no fix for that, but a workaround that consists in disabling the healthcheck on the vSphere replication (that’s what is using the sessions on the ESXi hosts due to enhanced replication fancy checks)

1 - Open a SSH session to the VRMS server on both the sites.

2 - Open file /opt/vmware/hms/conf/hms-configuration.xml with a text editor`

1
vi /opt/vmware/hms/conf/hms-configuration.xml

3 - Set schedule-health-checks to false

4 - Restart HMS service on both sites

1
systemctl restart hms

5 - While configuring enhanced replication, skip the health check. Clicking the “Next” button will allow you to proceed with the replication configuration without performing the health check.

This post is licensed under CC BY 4.0 by the author.