vCenter Server Heartbeat 6.3 — Experiences and Recommendations
I had the opportunity to work with vCenter Server Heartbeat earlier this year, and I wanted to share my experiences with the product, covering why it might be needed and what it can provide.
vCenter Server Availability
Is vCenter Server a mission-critical application for which you cannot afford downtime? It depends on the environment, but for many organizations the availability of vCenter Server is becoming more and more critical. For starters, you lose your single point of management for all your ESX hosts, DRS, and potentially quite a bit more. Let’s take a look at some specific impacts of vCenter Server being unavailable.
- VM and Host Management – Virtual machines (as well as ESX hosts) would need to be managed directly from each individual ESX host, which can be time-consuming if you don’t know which VM is on which host. In addition, you would be unable to provision new VMs from a template.
- Performance and Monitoring – vCenter Server is constantly collecting performance metrics from VMs and hosts, as well as evaluating alarm criteria. Without vCenter Server, no metrics are captured for analysis. In addition, several third-party applications, such as Quest vFoglight, also rely on vCenter Server for data collection.
- vMotion – vMotion – including Storage vMotion – is not possible without an active vCenter server.
- VMware HA – The host agents still provide HA failover without vCenter Server; however, admission control is no longer enforced, so a cluster could become over-populated while vCenter Server is unavailable.
- VMware DRS – Unavailable – workload imbalances will not be corrected, which could impact performance.
- Backups – Several backup products rely on vCenter Server for their functionality.
- VMware View – Unable to provision new desktops
- vCloud Director – Unable to allocate resources or provision new VMs
When you sum up the above list, it’s pretty clear that a vCenter Server outage will affect operations, and could potentially affect application performance and availability as well.
Enter vCenter Heartbeat
VMware chose to partner with Neverfail to provide a high-availability solution for vCenter Server. I had experience with Neverfail in the past supporting BES (BlackBerry) servers and was familiar with its benefits and challenges, so I was eager to see how these would be addressed in vCenter Server Heartbeat 6.3.
The Neverfail engine is replication-based. Basically, you build a second server and install the application you are protecting (vCenter Server in this case) on it. Then you configure a dedicated NIC on each server to be used as the “Heartbeat Channel,” which carries both monitoring and replication traffic, as illustrated below.
vCenter Server Heartbeat monitors the health of both vCenter Server and a local SQL Server database while constantly replicating relevant files and registry keys between the hosts. A packet filter driver is installed on each server that blocks all traffic to the application’s IP address while that server is passive; on the active server, traffic is unblocked. This allows both servers to be configured with the same identity (including the same IP address) at the same time, with vCenter Server Heartbeat managing which one is reachable during failover.
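To make the mechanism concrete, here is a minimal Python sketch of the packet-filter behavior described above. The names (`Node`, `SERVICE_IP`, `packet_filter_allows`) are my own illustration, not part of the product:

```python
# Illustrative model of the Heartbeat packet filter; all names are hypothetical.

SERVICE_IP = "192.168.1.10"  # the shared vCenter Server identity


class Node:
    """One member of a Heartbeat pair; both carry the same service IP."""

    def __init__(self, name, role):
        self.name = name
        self.role = role  # "active" or "passive"


def packet_filter_allows(node, dst_ip):
    """The filter driver drops traffic to the shared IP on the passive node;
    any other address on the node (e.g. a management IP) always passes."""
    if dst_ip != SERVICE_IP:
        return True
    return node.role == "active"


primary = Node("vc-primary", "active")
secondary = Node("vc-secondary", "passive")
```

On failover, the pair simply swaps roles; the filter is what lets two identically configured servers coexist on the same network without an address conflict.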
(NOTE: In a WAN configuration, different IP addresses can be used on the vCenter Servers)
Installing vCenter Server Heartbeat
When choosing your servers, you can choose between two physical servers, two virtual servers, or one of each. If both servers are virtual, you have the option of using vCenter’s clone-VM function: simply clone your working vCenter Server VM and use the clone as the second server in the pair. While some are more comfortable with having a physical server in the mix, we chose to make both servers virtual and employed some strategies to improve manageability (more on this later).
Of course, when you create the second server and configure it with the same IP address, you should disable that IP interface until vCenter Server Heartbeat is installed and functioning. This process is well explained in the documentation. Once vCenter Server Heartbeat is installed on both nodes, it will begin monitoring vCenter Server (and optionally SQL) and start replicating data from the active (primary) server to the passive (secondary) server.
One thing you’ll want to do to make management easier is to create a second IP address on the primary/public NIC for management. Ideally the heartbeat NICs are on a private, non-routable VLAN, and when a server is in the secondary role the packet filter blocks traffic to the shared IP address, so how would you remotely manage it? By adding a secondary IP address to the public NIC on each server in the pair, you provide a permanent address that can be used for anything from management agents and antivirus updates to remote desktop. Just be careful about DNS registration for this additional IP so that you aren’t registering a new address in DNS for an existing name.
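To make the addressing scheme concrete, here is a small sketch of what each node ends up with and what stays reachable when it is passive. The addresses and node names are made up for illustration:

```python
# Hypothetical addressing for a Heartbeat pair; not taken from a real deployment.

SERVICE_IP = "192.168.1.10"  # shared identity, blocked on the passive node

NODES = {
    "vc-primary": {
        "public_nic": [SERVICE_IP, "192.168.1.11"],  # .11 = permanent management IP
        "heartbeat_nic": ["10.0.0.1"],               # private, non-routable VLAN
    },
    "vc-secondary": {
        "public_nic": [SERVICE_IP, "192.168.1.12"],
        "heartbeat_nic": ["10.0.0.2"],
    },
}


def reachable_ips(node_name, role):
    """Public-NIC addresses that answer from the routed network: the management
    IP always, the shared service IP only while the node is active."""
    ips = [ip for ip in NODES[node_name]["public_nic"] if ip != SERVICE_IP]
    if role == "active":
        ips.append(SERVICE_IP)
    return ips
```

Whatever a node’s role, its management IP stays reachable, which is exactly what patching, antivirus, and remote-desktop access need.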
Testing vCenter Server Heartbeat
Once both VMs were running with vCenter Server Heartbeat, I proceeded to run a battery of tests to evaluate its response to an array of failure conditions, ranging from service failure to host failure to network error. In all of my tests the failover process worked flawlessly – except for one.
First I need to explain what “split-brain” is. In earlier versions of Neverfail it was possible for both servers to believe they were active (and therefore have their IPs unblocked) at the same time. In later versions – including vCenter Server Heartbeat 6.3 – a split-brain avoidance feature is enabled (which leverages the secondary IP address I mentioned above). This works well in most scenarios, but I ran into one specific scenario that posed some challenges.
One of the scenarios I tested was disabling all vNICs on the active server. Within seconds (consistent with a configured threshold), vCenter Server Heartbeat successfully failed over to the secondary, and service was restored within 90 seconds. But what happens when the failed network link is restored? After all, the disconnected server still “believes” it is the primary. What I observed is that the reconnected server was put into the secondary role (blocked), which is correct, but the active server was then shut down, presumably as a split-brain avoidance precaution after detecting another recently active server, resulting in BOTH servers being offline. Without manual intervention, vCenter Server would remain offline.
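My own simplified model of that sequence (not product code; node structure and function names are mine) makes the problem easy to see: the returning node is correctly demoted, but the split-brain precaution then takes the active node down as well, leaving no node serving the shared IP:

```python
# Simplified model of the link-restoration behavior observed in 6.3;
# node dictionaries and function names are my own illustration.

def restore_link(returning, active):
    """Model what happened when the disconnected ex-active node reappeared."""
    returning["role"] = "passive"   # correct: the returning node is demoted
    returning["state"] = "online"   # up, but its service IP is now blocked
    active["state"] = "offline"     # split-brain precaution shuts it down

    # vCenter Server is reachable only if some node is both online and active
    return any(n["state"] == "online" and n["role"] == "active"
               for n in (returning, active))


returning = {"name": "vc-primary", "role": "active", "state": "online"}
active = {"name": "vc-secondary", "role": "active", "state": "online"}
service_available = restore_link(returning, active)
```

The end state has one node powered off and one node up but blocked, so from a client’s perspective vCenter Server is down on both.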
I repeated this test with several different settings and continued to get the same results. I called VMware, explained the scenario I was testing, and asked if there was any setting I was missing that might prevent this behavior. Was there a way to prevent the active server from being shut down when a disconnected primary node reappears on the network? I was told that this was a known limitation of the current release, and that future releases would have more intelligence and awareness for dealing with such situations.
Granted, a network disconnect followed by a successful failover and then network restoration is fairly rare in most environments, but it can and does happen. It’s good to be aware that vCenter Server Heartbeat 6.3 may not automatically intervene correctly, and that manual intervention (a quick and simple fix) may be needed.
vCenter Server Heartbeat Strategies
There are many different approaches that can be tailored to your unique environment. A strategy our management was comfortable with was to disable automatic failover. This gave us the benefits of application monitoring (emails on warning conditions for either vCenter Server or SQL) as well as redundancy for vCenter Server, including a console that could be used to quickly initiate a failover manually, as the operations center was staffed 24×7.
But if both servers were virtual, how would we know which hosts they were on so that the console could be accessed if necessary? We addressed this by placing the two servers on two specific hosts and excluding those VMs from DRS (and also adding an anti-affinity rule in case things got moved around for any reason).
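The invariant the anti-affinity rule enforces is simple to state; here is a tiny sketch with hypothetical VM and host names. A real implementation would query actual placement through the vSphere API rather than a dictionary:

```python
# Hypothetical names; illustrates the anti-affinity invariant, not the vSphere API.

def violates_anti_affinity(placements, rule_vms):
    """True if any two VMs covered by the rule share a host."""
    hosts = [placements[vm] for vm in rule_vms]
    return len(set(hosts)) < len(hosts)


rule = ("vc-hb-primary", "vc-hb-secondary")
good = {"vc-hb-primary": "esx01", "vc-hb-secondary": "esx02"}
bad = {"vc-hb-primary": "esx01", "vc-hb-secondary": "esx01"}
```

Keeping the pair on separate, known hosts means a single host failure can take out at most one member, and the operations center always knows where to find the surviving console.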
By the way, the application monitoring is a really nice feature, as it can alert you to conditions and configurations within either vCenter or SQL (if local) that may deserve attention. In other words, not only do you gain redundancy and failover protection for your vCenter Server, but you also gain proactive monitoring insight into the health of the application.
The bottom line is that there are very good reasons to consider vCenter Server a “mission critical” application, and vCenter Server Heartbeat can meaningfully improve vCenter Server availability. Just make sure you explore the solution thoroughly to understand the options, so you can configure it to your environment’s needs and requirements.