For a long time I have wanted to write down, for my own reference, the different states that the HA agent can report when the HA agent on a host fails. If you have seen this elsewhere, that is quite possible: there are many VMware resources out there, written by many competent folks. For my own study purposes I want to have it on my website. When one of the hosts inside an HA cluster fails, the alert message you see depends on the failure scenario. The alerts concerning the vSphere HA agent include: Network Partitioned, Network Isolated, Host Failed, Agent Unreachable, Uninitialized and Initialization Error. While some of them are quite self-explanatory, I will try to focus on those which aren’t so clear.
In my recent article How to configure VMware High Availability (HA) cluster I walked you through the components that are part of the HA cluster, the requirements, and the steps necessary to configure HA. That article also covers the architecture of HA, which changed in vSphere 5.
HA agent on the host failed - Network partitioned state
Each FDM agent should normally be able to reach the FDM agents installed on the other hosts and vCenter, and at the same time be able to reach the default isolation address (usually the default gateway). In a network partition, the master host is unable to communicate with some of the hosts over the network. When the network connection fails and heartbeats no longer arrive over the network, a second communication channel, datastore heartbeats, is used to find out whether the host is dead or just partitioned.
A common case of network partition is a cluster stretched across two remote sites. If the WAN connection fails, the site where the master host is not present holds an election to choose its own master host. The master host in that site will mark the hosts from the other site as partitioned. After the connection between the two sites comes back, one of the two masters becomes a slave again and the host states change.
A special case is network isolation. It’s a kind of network partition: a partition with only one host in it.
One of the resolution paths for the Network Partitioned error is to ensure that certificate checking is enabled in vCenter. See more details on problem resolution in this VMware KB: Host shows the vSphere HA status as Network Partitioned. To check that, go to:
- Administration > vCenter Server Settings
But before changing that, make sure that you deactivate HA on your cluster first (deselect the HA checkbox in the cluster settings).
Host Failed State
- The host has failed, although it can sometimes also mean that the host itself is running but its network has failed.
- The second communication channel (datastore heartbeats) has failed as well.
VMs on local storage can still be running while the host has lost communication with its shared datastores and the network.
Network Isolated State
- A host is in the network isolated state if its HA agent cannot reach any of the HA agents running on the other hosts in the cluster.
- The host is also unable to ping its isolation address.
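The partitioned, failed and isolated states described above can be summarized as a small decision function. This is only a sketch of the rules as written in this article, not the actual FDM logic; the class, field and function names are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    RUNNING = "running"
    PARTITIONED = "network partitioned"
    ISOLATED = "network isolated"
    FAILED = "host failed"


@dataclass
class Observation:
    """What is known about one host (all field names are hypothetical)."""
    network_heartbeat: bool        # master receives network heartbeats from the host
    datastore_heartbeat: bool      # host still updates its datastore heartbeat
    reaches_other_agents: bool     # host can reach at least one other HA agent
    pings_isolation_address: bool  # host can ping its isolation address


def classify(obs: Observation) -> HostState:
    """Apply the article's rules in order; a simplification, not FDM itself."""
    if obs.network_heartbeat:
        return HostState.RUNNING
    if not obs.datastore_heartbeat:
        # Both communication channels are silent.
        return HostState.FAILED
    if not obs.reaches_other_agents and not obs.pings_isolation_address:
        # A partition with only one host in it.
        return HostState.ISOLATED
    return HostState.PARTITIONED
```

For example, a host that has lost network heartbeats but still updates its datastore heartbeat and can reach other agents is classified as partitioned, not failed.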
Agent Unreachable State
- vCenter is unable to contact the master host and the HA agent.
- All servers in the cluster are in a failed state.
- Unlikely, but it can happen: the watchdog process was unable to restart the HA agent on the host after vCenter Server could not communicate with that agent while HA was being disabled and re-enabled.
Uninitialized Error State
3 possible reasons:
- Closed firewall ports (8182, TCP and UDP, used for vSphere High Availability traffic between hosts)
- All datastores failed (the host can’t access them)
- The host does not have access to the datastore where the HA state information is stored.
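The first reason in the list above, a closed firewall port, can be checked roughly from another machine. Here is a minimal Python sketch, assuming plain TCP reachability is a good enough first test; it does not cover UDP, and the host names in the usage part are placeholders, not real hosts.

```python
import socket

FDM_PORT = 8182  # vSphere HA traffic between hosts, per the list above


def tcp_port_open(host: str, port: int = FDM_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Placeholder host names; replace them with the hosts in your cluster.
    for name in ["esxi01.example.local", "esxi02.example.local"]:
        status = "open" if tcp_port_open(name) else "closed or filtered"
        print(f"{name}: TCP {FDM_PORT} {status}")
```

A "closed or filtered" result only tells you that the TCP connection did not succeed from where the script runs; the firewall rules on the host itself still need to be checked.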
Initialization Error State
This error shows that the last attempt to configure HA on the host failed. vSphere HA does not monitor the state of the VMs on that host.
This small article is not meant to be exhaustive and is provided as is. If you spot any obvious errors, just ping me on Twitter: www.esx-virtualization.biz/twitter
Literature and sources:
- vSphere Troubleshooting (VMware PDF)
- VMware Community Forums
- vSphere Availability Guide – section “How vSphere HA works”
- Deep Dive on Duncan’s blog
- Troubleshooting VMware High Availability (HA) (1001596).
Just a reminder, the FDM log can be found here: /var/log/fdm.log
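To skim that log for lines related to the states discussed in this article, a small filter can help. This is a sketch only: the keywords are my own assumption based on the state names, not a documented fdm.log message format, and the function name is hypothetical.

```python
from typing import Iterable, Iterator

# Assumed keywords, chosen to match the state names used in this article.
KEYWORDS = ("isolat", "partition", "unreachable", "election")


def matching_lines(lines: Iterable[str], keywords=KEYWORDS) -> Iterator[str]:
    """Yield lines that contain any keyword, case-insensitively."""
    lowered = tuple(k.lower() for k in keywords)
    for line in lines:
        text = line.lower()
        if any(k in text for k in lowered):
            yield line.rstrip("\n")


if __name__ == "__main__":
    # Path from the article; point this at a copy of the log if you prefer.
    with open("/var/log/fdm.log", errors="replace") as log:
        for hit in matching_lines(log):
            print(hit)
```

Adjust the keyword list to whatever you are actually hunting for; the filter does not interpret the log entries in any way.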