For the past two months my life has been a little crazy, which is why I haven't been actively blogging. Not that the storm is over; I just finally made a commitment to blog at least once a week, so here is my attempt to keep that promise to myself.
Over the past few months a lot of good, useful, and interesting information has been emerging from the various blogs, and the vRAM fever is finally over. I gave a presentation a few days ago that discussed the HA differences between vSphere 4 and 5. Besides the architectural differences between the two versions, I also covered how they could impact our designs and what that really means. I figured I should share some of that information here, as it could be useful to most VMware admins.
A quick recap: in vSphere 4, as we all know, an HA cluster needs 5 primary nodes to function. The first 5 hosts in an HA cluster become the primary nodes, and a primary is replaced by a secondary only under the following conditions:
- When placing a primary host in maintenance mode
- When removing a primary host from the cluster
- When HA is reconfigured
In vSphere 5, because the primary and secondary concepts no longer exist and because a master is guaranteed to be available in a cluster in almost all scenarios, we can perhaps improve on our 32-node cluster shown above. Duncan and a few other bloggers have been blogging heavily on the details of HA in vSphere 5, and I recommend you take a look at their posts for more detail on what's under the hood. In a nutshell, HA has been gutted and rewritten: one host is the master, and every other host in the HA cluster is a slave. Apart from the network heartbeats, datastore heartbeats have been introduced that allow the state of a host to be correctly identified (failed, isolated, or network partitioned). Keep in mind these capabilities were not present in previous versions; with more accurate information about the other hosts in the cluster, HA can be more efficient than it ever was.
It takes a total of 25 seconds (from the time the original master fails) for a new master to be elected, and at 35 seconds the newly elected master assumes all master responsibilities and begins to restart VMs. So now let's look at what could cause a new master to be elected. A new master is elected if any of the following conditions is met:
- The master fails
- HA is reconfigured
- The master becomes network partitioned or isolated
- The master is disconnected from vCenter or removed from the cluster
- The master is put into maintenance or standby mode
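To make the timing above concrete, here is a rough sketch of the failover timeline as a small Python snippet. The 25-second and 35-second marks come from the description above; the function and event wording are my own illustration, not anything from a VMware API.

```python
# Rough sketch of the vSphere 5 master-failover timeline described above.
# Timings (T+25s election, T+35s restarts begin) are from the post; the
# event descriptions are illustrative, not official VMware terminology.

FAILOVER_TIMELINE = [
    (0, "master host fails (heartbeats stop)"),
    (25, "surviving slaves elect a new master"),
    (35, "new master assumes responsibilities and begins restarting VMs"),
]

def event_at(seconds_since_failure):
    """Return the most recent timeline event at a given offset from the failure."""
    current = None
    for t, description in FAILOVER_TIMELINE:
        if seconds_since_failure >= t:
            current = description
    return current
```

So a query at T+30s, for example, lands in the election window, before any VM restarts have started.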
Well, we certainly don't need to limit the number of hosts per chassis to 4 anymore. Keep in mind we only did that because of how HA worked in vSphere 4 and because we couldn't afford to lose all 5 primaries. If we did, HA would not work.
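The old rule of thumb can be boiled down to a one-line check. This is just an illustration of the worst-case reasoning behind the 4-host limit, not a VMware tool; the chassis layout assumptions are my own.

```python
# Why we capped hosts per chassis at 4 in vSphere 4: HA keeps exactly 5
# primary nodes, and in the worst case all of them could land on the same
# chassis. Illustrative sketch only; not based on any VMware API.

PRIMARY_COUNT = 5  # fixed number of HA primary nodes in a vSphere 4 cluster

def single_chassis_can_break_ha(hosts_per_chassis):
    """True if one chassis could, worst case, hold all 5 primaries."""
    return hosts_per_chassis >= PRIMARY_COUNT

# With at most 4 hosts per chassis, at most 4 primaries share a failure
# domain, so at least one primary survives any single chassis failure.
```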
Since vSphere 5 relies on a whole different mechanism, with a single master host that is replaced by a slave (via election of a new master) under all the conditions listed above, I think it is safe to assume we could shrink our 32-node cluster from 8 chassis down to two. Now, is that a good design? I don't know. If you lose a chassis in that setup, your cluster loses 50% of its resources, so perhaps spanning the cluster across 3 or 4 chassis would be a better option? It really depends on what is available, and that is somewhat out of scope for what I want to discuss here.
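The trade-off is easy to see with some back-of-the-envelope math. Assuming identical hosts spread evenly across the chassis (my assumption, for simplicity), the capacity you lose to a single chassis failure is just one chassis's share of the cluster:

```python
# Capacity lost to a single chassis failure for a 32-node cluster,
# assuming identical hosts distributed evenly across the chassis.

def resources_lost_pct(total_hosts, chassis_count):
    """Percent of cluster capacity lost when one chassis fails."""
    hosts_per_chassis = total_hosts / chassis_count
    return 100 * hosts_per_chassis / total_hosts

for n in (2, 3, 4, 8):
    print(f"{n} chassis: lose {resources_lost_pct(32, n):.1f}% on a chassis failure")
```

Two chassis means a 50% hit, three brings it down to about 33%, and four to 25%, which is why 3 or 4 chassis may be the more comfortable middle ground.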
I believe large clusters were avoided due to the 4-hosts-per-failure-domain limit, along with other legacy issues that don't exist anymore. A good use case for a large cluster would be the need to create one large vDC for a customer instead of multiple small vDCs. Again, that is a little out of scope for what I want to discuss. The point of this post is that HA has been redesigned and rewritten, so it's important for us to understand how the new architecture impacts what we have believed to be good practice in the past. I still hear stuff about SCSI reservations from 4-5 years ago that doesn't apply anymore. So read up on the new HA and understand how it would impact the designs you have done in the past. Don't get caught with your pants down.