Design Implications with Admission Control & HA

To start, there is no issue with Admission Control itself; it's really our lack of understanding that makes it an issue. Last week I posted about Admission Control, and Duncan has a lot of excellent information on this subject. I just wanted to touch on a few things that I think are extremely important to understand in order to come up with appropriate designs for your environments.

So what is Admission Control (AC)? To state it in simple terms, it's the policy that will save you from yourself. Basically, it's a check that enables vCenter to reserve certain computing resources in your cluster so that an HA event can be accommodated. There are three different ways this can be done:

  • Host failures cluster tolerates (this is where slot sizes are used)
  • Percentage of cluster resources reserved as failover spare capacity (this is where you specify a percentage of resources you want reserved)
  • Specify a failover host (this is self-explanatory so I will not be going over this)

Host failures cluster tolerates

We already know that slots can become an issue in a heterogeneous setup where you may have a couple of really large VMs and a bunch of small ones. Let's imagine you have 100 VMs, out of which 4 have 8 vCPUs and 24GB of memory reserved, and everything else is 1 vCPU with no memory reservation. Unfortunately those 4 VMs will affect your slot size, and your slot size will be huge because of them. This basically means you will have fewer slots available to power on machines in the cluster. Of course you can tweak the advanced settings like das.slotCpuInMHz and/or das.slotMemInMB (credit: Duncan) to limit the size of your slot. You will then have more slots available in the cluster, but keep in mind your large VMs may now occupy more than one slot. So in order for them to power on, all of the required slots must be available on a single host, not spread across the cluster. Just something to keep in mind.
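To make the slot math concrete, here is a rough Python sketch of that scenario. The host size (64GB/10GHz), the 300MB per-VM overhead and the 8GB slot cap are my own assumptions for illustration; this is a simplification of how the slot policy behaves, not VMware's actual code.

    import math

    # Rough sketch of the "Host failures cluster tolerates" slot math for the
    # example above: 96 small VMs with no reservations, 4 VMs with 24GB reserved.
    CPU_DEFAULT_MHZ = 256        # vSphere 4 default when no CPU reservation exists
    MEM_OVERHEAD_MB = 300        # assumed per-VM memory overhead
    HOST_CPU_MHZ, HOST_MEM_MB = 10_000, 64 * 1024   # assumed host size

    vm_cpu_res = [0] * 100                      # no CPU reservations in this example
    vm_mem_res = [0] * 96 + [24 * 1024] * 4     # the 4 large VMs reserve 24GB each

    # Slot size = worst-case reservation across all powered-on VMs
    slot_cpu = max(max(vm_cpu_res), CPU_DEFAULT_MHZ)
    slot_mem = max(vm_mem_res) + MEM_OVERHEAD_MB        # 24876MB -- a huge slot
    slots_per_host = min(HOST_CPU_MHZ // slot_cpu, HOST_MEM_MB // slot_mem)
    print(f"slot: {slot_cpu}MHz / {slot_mem}MB -> {slots_per_host} slots per host")  # 2

    # Capping the slot size (a la das.slotMemInMB) yields more slots, but the large
    # VMs then consume several slots each, and those must all fit on ONE host.
    capped_slot_mem = 8192                      # hypothetical cap
    capped_slots_per_host = min(HOST_CPU_MHZ // slot_cpu, HOST_MEM_MB // capped_slot_mem)
    slots_for_large_vm = math.ceil((24 * 1024 + MEM_OVERHEAD_MB) / capped_slot_mem)
    print(f"capped: {capped_slots_per_host} slots per host, "
          f"each large VM needs {slots_for_large_vm} slots on a single host")  # 8 and 4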

The advantage of using this method is that it's pretty dynamic, and your available slots will increase and decrease automatically as hosts are added/removed or placed in maintenance mode. The one big issue with this method is how it handles unbalanced clusters. By that I mean if you have a humongous host in the cluster, HA AC will take the worst case into consideration and be prepared for the time when this huge host goes down. Good, right? Well, that also means that if all your other hosts only contribute 50 slots to the cluster and this huge host owns 100 by itself, HA AC will only present the number of slots available for power-on that accounts for the worst-case scenario. So your 100-slot host will really not buy you much. Again, there is a whole lot already discussed regarding this by Duncan on his blog and in his books.

Percentage of cluster resources

This is really the area that I wanted to cover, and it also happens to be the one I like. For a quick intro, please read this. I want to talk about how this method can affect your design. We already know that we specify a percentage of resources that are reserved for an HA event. By default the pre-populated number is 25%. That sounds safe, right?

And here is the answer that you have probably heard a million times before: it depends! Seriously, it depends on your cluster. For the purpose of this post and to keep things simple, we will assume that we are only dealing with balanced clusters. So let's say you only have a 3 node cluster with each host having 64GB of memory and 10GHz of CPU. Your total is 192GB and 30GHz for the cluster. Taking the default 25%, you are really reserving 48GB and 7.5GHz. This means that when your cluster only has that many resources left, it will not allow you to power on any more VMs. Is that good or bad? I don't know; only you can determine that.

Now let's try to make our cluster a little bigger, and considering we are still talking about vSphere 4, we will keep our cluster size to 8 (across two enclosures, 4 on each one) to make our primary nodes happy. So what happens in a cluster that's 8 nodes strong? 25% means computing resources that equate to 2 of your nodes are just chilling and relaxing until something goes wrong. Of course this does NOT mean that 2 hosts are unused; it means computing resources that equate to what 2 nodes would provide are unused and reserved. Let's take our example from before and extend it.

We now have 8 nodes, each with 64GB of RAM and 10GHz of CPU. This means our cluster has 512GB of RAM and 80GHz of CPU. But because our AC setting is set to 25%, we only have 384GB and 60GHz of CPU available for powering on VMs (please note that during an HA event, HA will ignore all AC settings as stated here). That means we have reserved 128GB of RAM and 20GHz of CPU for HA. Is that cost-effective? I don't know, but it should get someone's attention.

Let's take this even further and look at an example where we have 4 clusters spread across 2 enclosures. Let's use the same example from above and apply the numbers here. Each blade in the enclosure has 64GB of RAM and 10GHz of CPU. Again, let's assume we have AC set to 25%. As we discovered earlier, we should have 128GB and 20GHz reserved for an HA event per cluster. Across the four clusters in these two enclosures we will have 128*4 = 512GB of RAM and 20*4 = 80GHz of CPU reserved for an HA event. That is equivalent to the resources of 8 blades in this setup. Interestingly, there are a total of 32 blades in this setup, so 8 blades is 25% of the resources. AC isn't doing anything wrong; it's really doing what you asked it to do. You said you wanted 25% reserved, well there you go.

Now if you are not pulling your hair out already, imagine you have 4 such enclosures. If we use the same setup, that means we have 8 clusters, and resources that equate to 16 blades are reserved for an HA event. Wait a sec, that's equal to 1 whole enclosure. Yes, and if you have 4 enclosures, 25% of that is 1 enclosure, so there should be no surprise :).
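If it helps, here is the arithmetic from the last few paragraphs in one small Python sketch (plain arithmetic under the same assumed host size, nothing vSphere-specific):

    # Cost of the default 25% in the 8-node example, then scaled out.
    hosts_per_cluster = 8
    host_mem_gb, host_cpu_ghz = 64, 10
    reserve_pct = 0.25

    total_mem = hosts_per_cluster * host_mem_gb      # 512 GB per cluster
    total_cpu = hosts_per_cluster * host_cpu_ghz     # 80 GHz per cluster
    reserved_mem = total_mem * reserve_pct           # 128 GB held back for HA
    reserved_cpu = total_cpu * reserve_pct           # 20 GHz held back for HA
    print(f"per cluster: {total_mem - reserved_mem:.0f}GB / {total_cpu - reserved_cpu:.0f}GHz "
          f"usable, {reserved_mem:.0f}GB / {reserved_cpu:.0f}GHz reserved")

    # 4 clusters across 2 enclosures, then 8 clusters across 4 enclosures.
    for clusters in (4, 8):
        blades_worth = clusters * reserved_mem / host_mem_gb
        print(f"{clusters} clusters -> {clusters * reserved_mem:.0f}GB / "
              f"{clusters * reserved_cpu:.0f}GHz reserved (~{blades_worth:.0f} blades' worth)")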

I want to clarify that your reserved resources can be fragmented across multiple hosts, so all your blades are still probably handling some load. It's just that the capacity you paid for to provision new VMs may not materialize as you expected unless this was all factored in to begin with. So what if you have 10 of these enclosures set up the same way? That means two things:

  • You have resources reserved that equate to 2.5 enclosures out of the total of 10 enclosures for an HA event
  • And that you have way more money than I can ever imagine so kudos to you for that

So please think for some time before you go with the default of 25%. I have seen it set as low as 1% (which really bothers me, because why even have AC in that case), and I have also seen 50%, which is the highest you can go. So pick the number that keeps sanity in your world, but please don't take 25% just because it's there by default. If your argument is that you really don't need those hundreds of GB of RAM and GHz anyway, well that's fair too, but keep in mind that somebody actually paid for them. Isn't utilizing your hardware to the fullest one of the advantages of virtualization? I know we need some headroom for HA events, but do we really need an insane amount reserved?

One recommendation would be to set your percentage to equate to what you would have picked with the "Host failures cluster tolerates" method. If you would have used 1 there, then set your percentage so that you are only reserving the computing resources of one host. So in our 8 node cluster example, 12.5% would be equal to 1 host. Since you can only enter integers here, let's go with 13%. You started using percentages because you didn't like the slots; that doesn't mean you must have more resources reserved. That would defeat the purpose. Lastly, unlike "Host failures cluster tolerates", the percentage method will not dynamically adjust if you add or remove hosts. So if you add or remove hosts, revisit the percentage.
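A tiny sketch of that recommendation, assuming a balanced cluster; the rounding up to a whole percent is only because the UI takes integers:

    import math

    def ha_reserve_percent(num_hosts: int, host_failures_to_tolerate: int = 1) -> int:
        """Percentage of cluster resources that equates to tolerating N host
        failures in a balanced cluster, rounded up to a whole percent."""
        return math.ceil(100 * host_failures_to_tolerate / num_hosts)

    print(ha_reserve_percent(8))       # 13 -> one host's worth in an 8 node cluster
    print(ha_reserve_percent(16))      # 7  -> one host's worth in a 16 node cluster
    print(ha_reserve_percent(16, 2))   # 13 -> two hosts' worth in a 16 node cluster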

Also note, when you have large VMs in your cluster, and because the reserved resources can be fragmented, it may not be a bad idea to set a higher restart priority for those large VMs. In simple terms, if your VM needs 24GB to start and your cluster has 24GB available but it's spread across more than one host, guess what, it won't start. Since the 4.1 release HA will request DRS to make room, but that's not guaranteed.

vSphere 5

The obvious question is: does this get better with vSphere 5? Yes, it does, but you will still have to figure out what works best for you. The one enhancement that's visible in the GUI is how you specify the resource reservation for HA.

You can specify CPU and memory individually. I think this is great if you are trying to make sure your dollars are not wasted and that you are not forced to reserve more than you really have to. This way you can reserve a certain amount of memory and a certain amount of CPU for your cluster. Prior to vSphere 5, you didn't have a way to differentiate the two; you could only set one percentage that applied to both. But again, the defaults are 25%, and one will have to figure out the sweet spot for their environment.

In vSphere 4, for virtual machines that did not have a CPU reservation larger than 256MHz, a default of 256MHz was used for CPU, and if no memory reservation was set either, a default of 0MB plus memory overhead was used when determining the failover capacity of your cluster. In vSphere 5, a default of only 32MHz is used if no CPU reservation is defined, and with no memory reservation, a default of 0MB plus memory overhead is used as before to compute the failover capacity.

How is the failover capacity computed? I have written a detailed post on this subject here.

The Current CPU Failover Capacity is computed by subtracting the total CPU resource requirements from the total host CPU resources and dividing the result by the total host CPU resources. The Current Memory Failover Capacity is calculated similarly.
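To illustrate that formula, here is a hedged sketch of how the current failover capacity could be computed under the percentage policy, using the 8-node cluster from earlier and the vSphere 5 default of 32MHz for VMs with no CPU reservation. The two reserved VMs and their overhead values are made-up numbers for illustration; this is not the actual HA code.

    # Sketch of the "Percentage of cluster resources" check with vSphere 5 defaults.
    CPU_DEFAULT_MHZ = 32      # used for VMs with no CPU reservation (256MHz in vSphere 4)
    configured_cpu_pct = configured_mem_pct = 25   # the default 25% for both

    # (cpu_reservation_mhz, mem_reservation_mb, mem_overhead_mb) per powered-on VM;
    # the two VMs with reservations and the overhead figures are hypothetical.
    vms = [(0, 0, 300)] * 60 + [(2000, 4096, 500)] * 2

    total_host_cpu_mhz = 8 * 10_000        # 8 hosts x 10GHz
    total_host_mem_mb = 8 * 64 * 1024      # 8 hosts x 64GB

    vm_cpu = sum(max(cpu, CPU_DEFAULT_MHZ) for cpu, _, _ in vms)
    vm_mem = sum(mem + overhead for _, mem, overhead in vms)

    cpu_failover_capacity = 100 * (total_host_cpu_mhz - vm_cpu) / total_host_cpu_mhz
    mem_failover_capacity = 100 * (total_host_mem_mb - vm_mem) / total_host_mem_mb
    print(f"current failover capacity: CPU {cpu_failover_capacity:.1f}%, "
          f"memory {mem_failover_capacity:.1f}%")

    # A power-on is disallowed if either value would fall below the configured percentage.
    allowed = (cpu_failover_capacity >= configured_cpu_pct
               and mem_failover_capacity >= configured_mem_pct)
    print("power-on allowed" if allowed else "power-on blocked by admission control")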

Of course, because of the way HA works in vSphere 5, you are not limited to a 4 node per failure domain setup anymore, as there are no primary or secondary nodes in vSphere 5. In other words, if you wanted a 10 node cluster and you have two enclosures, you could do 5 per enclosure and not have to worry about HA not functioning during an enclosure failure. I have discussed that in one of my earlier posts, but it's outside the scope of what's being discussed here.

Going back to the original discussion: because you can now have more than 4 hosts in a failure domain in vSphere 5, let's say you have a 16 node cluster with 8 nodes in each enclosure (thank you vSphere 5 and FDM), and you have your percentage set to 13% (for both CPU and memory), which is just a little over the computing resources of two hosts (IMO this is still pretty liberal, but it seems more practical). This means you are only reserving computing resources that equate to 2 hosts per cluster. Wait, isn't that what happened before? Yes, but that was a smaller cluster and this one happens to be twice as big. If we have two 16 node clusters, we are reserving total computing resources that equate to about 4 hosts across those two 16 node clusters, which is better than before. Of course, going with larger clusters is another discussion outside the scope of this topic, but I will say that DRS will be happier in a larger cluster. Keep in mind, if you take the default of 25% even in this large cluster of 16 nodes, you will still be screwed, as that would mean resources that equate to 4 hosts are reserved per cluster, so you will have the same old issue discussed above. So be mindful of what percentage you place here. vSphere 5 gives you more flexibility as you can now set different values for CPU and memory.

Conclusion:

Admission control is an awesome thing. You should absolutely turn it on so that it can do what needs to be done. However, it's important for us to understand how it works. I know the few gigs of reserved capacity in my lab annoy me from time to time, but I know what it's there for and how it would benefit me. If I had an enormous amount of computing resources reserved for HA like in the examples above, I would be a little alarmed. Of course there might be a good reason for one to run that kind of setup, who knows. But if that person is you, please consider this post a request to donate your hardware to me when it comes time for you to upgrade :).

HA Cluster design considerations in vSphere 5

For the past two months my life has been a little crazy, which is why I haven't been actively blogging. Not that the storm is over; I just finally made a commitment to blog at least once a week, so here is my attempt to keep my promise to myself.

Over the past few months a lot of good, useful and interesting information has been emerging from the various blogs, and finally the vRAM fever is over. I was giving a presentation a few days ago that discussed the HA differences between vSphere 4 and 5. Besides the architectural differences between the two versions, I also covered how they could impact our designs and what that really means. I figured I should share some of that information here as it could be useful to most VMware admins.

Some recap: in vSphere 4, as we all know, there are 5 primary nodes that are needed for an HA cluster to function. The first 5 hosts in an HA cluster are the primary nodes, and a primary is replaced by a secondary only under the following conditions:

  • When placing a primary host in maintenance mode 
  • When removing a primary host from the cluster
  • When HA is reconfigured

Which means that during a primary host failure, there is no secondary host taking its place. So if you lose all 5 primary nodes, your HA cluster will not work. To get around that issue we limit the number of hosts in a failure domain to 4: basically no more than 4 hosts of the same cluster in a blade chassis or in a rack. So what if you create a 32 node cluster? This is what your setup would look like if we want to make sure a primary host is always available.

vSphere 4 32 node HA Cluster

In vSphere 5, because the primary and secondary concepts don't exist anymore and because a master is guaranteed to be available in a cluster in almost all possible scenarios, we can perhaps improve on our 32 node cluster shown above. Duncan and a few other bloggers have been blogging heavily on the details of HA in vSphere 5, and I would recommend you take a look at their posts to get more details on what's under the hood. But in a nutshell, HA has been gutted and rewritten so that you have a master host and every other host is a slave in the HA cluster. Apart from the network heartbeats, datastore heartbeats have been introduced that allow the state of hosts to be correctly identified (failed, isolated or network partitioned). Keep in mind these capabilities were not present in previous versions, and now, with more accurate information about the other hosts in the cluster, HA can be more efficient than it ever was.

It takes a total of 25 seconds for a new master to be elected (from the time the original master fails) in a cluster, and at 35 seconds the newly elected master begins to restart VMs and assumes all master responsibilities. So now let's look at what could cause a new master to be elected. A new master will be elected if any of the conditions below are satisfied:

  • The master fails
  • HA is reconfigured
  • The master becomes network partitioned or isolated
  • The master is disconnected from vCenter or removed from the cluster
  • The master is put into maintenance or standby mode

As you can tell, a new master is elected in almost all scenarios that could go wrong, and more importantly we are also covered if the master fails. Keep in mind we were not covered for primary host failures in vSphere 4. So what can we do with our design?

vSphere5 HA Cluster

Well, we certainly don't have a need to limit the number of hosts per chassis to 4 anymore. Keep in mind we only did that because of how HA worked in vSphere 4 and because we couldn't afford to lose all 5 primaries; if we did, HA would not work.

As vSphere 5 relies on a whole different mechanism and really relies on a single master host, which is replaced by a slave host (via election of a new master) in all the conditions mentioned above, I think it is safe to assume we could shrink our 32 node cluster to two chassis instead of 8. Now is that a good design? I don't know. If you lose a chassis in this setup, your cluster would lose 50% of its resources, so perhaps spanning the cluster across 3 or 4 chassis would be a better option? It really depends on what is available, and that is sort of outside the scope of what I want to discuss here.

I believe large clusters were avoided due to the 4 host limit per failure domain along with other legacy issues that don't exist anymore. A good use case for a large cluster would be the need to create a large vDC for a customer instead of multiple small vDCs. Again, that is a little outside the scope of what I want to discuss. The point of this post is that HA has been redesigned and redone, so it's important for us to understand how the new architecture impacts what we have believed to be good practice in the past. I still hear stuff about SCSI reservations from 4-5 years ago that doesn't apply anymore. So read up on the new HA and understand how it would impact the designs you have done in the past. Don't get caught with your pants down 😀

HA and Admission Control

I have seen admission control being used without people really understanding how it impacts the cluster and the available resources. While configuring admission control on a cluster the other day, I started thinking about how this really works. The concept is pretty simple. According to VMware:

Slot size is comprised of two components, CPU and memory. VMware HA calculates these values.

The CPU component by obtaining the CPU reservation of each powered-on virtual machine and selecting the largest value. If you have not specified a CPU reservation for a virtual machine, it is assigned a default value of 256 MHz (this value can be changed using the das.vmCpuMinMHz advanced attribute.)

The memory component by obtaining the memory reservation (plus memory overhead) of each powered-on virtual machine and selecting the largest value.

HA relies on slot sizes, and in the current version of ESX/i, if no reservations are used, the default slot size is 256 MHz for CPU and the memory overhead for memory. Now keep in mind, if you happen to have a VM with a reservation of 4GB, all of a sudden your slot size has become 256 MHz and 4GB of memory. Basically you now have fewer slots to place your VMs in, and admission control will make it so you can't power on more VMs than can be accommodated according to your host failures cluster tolerates setting. Basically HA will look at your worst-case CPU and memory reservations to come up with the slot size. All that I just mentioned should be common knowledge.

Let's assume you have a cluster of 3 hosts and VMs with no reservations; HA is turned on, host failures cluster tolerates is 1, admission control is enabled, and your isolation response is set to shut down. To simplify things, let's assume your cluster is balanced, where each host has 10GHz of CPU and 24GB of memory. Your cluster has a total of 30GHz of CPU and 72GB of memory. The total number of VMs running is 60 and none of them have any reservations. Let's also assume your slot size is 256 MHz and 300MB (overhead). So how many slots do you have? You have 30000/256 = 117 in CPU and 72000/300 = 240 in memory. You always pick the lower number, so according to what we calculated above, you have 117 slots available on this cluster.

Let's assume a host fails and now we only have 20GHz and 48GB left in our cluster. We now have 20000/256 = 78 and 48000/300 = 160, which means we only have 78 slots available now. So you have 78 slots and 60 VMs (1 VM/slot); should all your VMs power on? No, because your cluster still has Host Failures Cluster Tolerates set to 1 and admission control is enabled. It's important to understand how admission control really works. According to VMware:

With the Host Failures Cluster Tolerates policy, VMware HA performs admission control in the following way:

1. Calculates the slot size. A slot is a logical representation of the memory and CPU resources that satisfy the requirements for any powered-on virtual machine in the cluster.

2. Determines how many slots each host in the cluster can hold.

3. Determines the Current Failover Capacity of the cluster. This is the number of hosts that can fail and still leave enough slots to satisfy all of the powered-on virtual machines.

4. Determines whether the Current Failover Capacity is less than the Configured Failover Capacity (provided by the user). If it is, admission control disallows the operation.

So according to that, even though your cluster has enough slots to run all your VMs, because your host failures cluster tolerates is set to 1, admission control has to make sure it only runs the load it can afford to run in case of another host failure. Basically, admission control knows there are 78 slots available, but it has to keep in mind that in case of another host failure it will only have 39. Because host failures cluster tolerates is set to 1, admission control will only allow 39 slots to be occupied. So once HA realizes that 39 slots have been taken, it will not allow any more power-ons. It's saving you from yourself.
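Pulling the four documented steps and the 3-host example together, here is a small Python sketch of the slot-based check, under the same assumptions (256MHz/300MB slot, balanced 10GHz/24GB hosts). Again, this is a simplification for illustration, not the actual HA implementation.

    # Slot-based admission control following the four documented steps.
    SLOT_CPU_MHZ, SLOT_MEM_MB = 256, 300        # worst-case slot size in this example
    HOST_CPU_MHZ, HOST_MEM_MB = 10_000, 24_000  # balanced hosts, 10GHz / 24GB
    configured_failover_capacity = 1            # host failures cluster tolerates

    def admission_allows(num_hosts: int, powered_on_vms: int) -> bool:
        # Step 2: slots each host can hold (both CPU and memory have to fit)
        slots_per_host = min(HOST_CPU_MHZ // SLOT_CPU_MHZ, HOST_MEM_MB // SLOT_MEM_MB)
        host_slots = sorted([slots_per_host] * num_hosts)

        # Step 3: current failover capacity = how many of the largest hosts can fail
        # while the remaining slots still cover every powered-on VM
        failures = 0
        while host_slots and sum(host_slots[:-1]) >= powered_on_vms:
            host_slots.pop()                    # tolerate losing the biggest host
            failures += 1

        # Step 4: disallow if current capacity drops below the configured capacity
        return failures >= configured_failover_capacity

    # Two surviving hosts, 39 slots each, host failures cluster tolerates = 1:
    print(admission_allows(num_hosts=2, powered_on_vms=39))  # True  -> 39 VMs still leave room for one more failure
    print(admission_allows(num_hosts=2, powered_on_vms=40))  # False -> a 40th power-on gets blocked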

I will not throw in other complications like memory reservations or an unbalanced cluster (hosts with different resources) and how to handle those just yet, to keep it simple. I do plan to post about why reservations at the VM level can be a bad idea and about ways to get around the conservative slot sizes. HA and admission control are awesome tools to have, but if you don't plan intelligently, you will soon begin to hate them.

HA for MSCS VMs in vSphere

A few days ago, I was complaining about not knowing why HA has to be disabled on an MSCS setup in vSphere. It turns out only DRS needs to be disabled, as HA is still supported according to KB article 1037959. If I read it correctly, even in a cluster across boxes (CAB) type of setup, where you will have to use physical compatibility mode, HA is still supported. DRS is not supported for any vSphere and MSCS setup due to the reasons I discussed in one of my previous blogs. Although the MSCS user guide for 4.1 suggests that you can set DRS to partially automated for MSCS machines, the PDF also mentions that migration of these VMs is not recommended. And as the table below suggests, DRS is not supported either.

kb article 1037959

So, what does support for HA really mean? If you only have a two node cluster and have an MSCS CAB setup, HA support will not affect you because of the anti-affinity rules. However, if your ESX/i cluster is bigger than two nodes, then HA can be leveraged and the dead MSCS VM can be restarted on a different host while still being in compliance with the anti-affinity rule that has been set. For an MSCS CIB setup, HA can be leveraged on even a two node ESX/i cluster: when host one dies, host two finds itself spinning up the two partners in crime. One thing to note here is that all of this is only possible if the storage (both the boot VMDK and the RDM/shared disk) is presented to all the hosts in the cluster. I can't imagine why anyone would not do that to begin with.

Again, only a two node MSCS cluster is supported so far. With HA being supported for MSCS VMs, I guess one can certainly benefit from the added redundancy. If you think this is being too redundant, just don't use the feature and disable HA for the MSCS VMs in your environment. I would highly recommend disabling HA for the two VMs if they are part of an MSCS CAB setup in a two node ESX/i cluster.

vSphere client for iPad (Review)

I was really excited about getting the iPad 2 this year, and one of the first things I started looking for was the vSphere client that VMware was supposed to make for the iPad. After standing in line, and with the help of my friend, I was finally able to get my hands on Apple's new tablet. For the next two days I religiously searched for the vSphere client for the iPad but was disappointed not to find it. Just this past Sunday, I was talking to a friend who asked me if I had tried out the iPad app for vSphere. So I started searching again, and it turns out I gave up searching 3-4 days before it was finally released (March 17th, 2011). After feeling left out, I finally downloaded it and took it for a spin.

You will need to download the vCMA and the vSphere Client for iPad, and of course a vSphere environment and an iPad will be needed. Once you have fired up your vCMA, be sure to change the password for the vCMA appliance. This is not a requirement, but if you plan on allowing remote access to your vCMA appliance, you may not want to leave it with the default password that is known by the masses. You can manage your vCMA appliance at http://YourIP:5480. I would also assign the vCMA a static IP.

Once you have assigned the IP to the vCMA, go to Settings on your iPad, tap on "vSphere Client", and enter the IP of your vCMA in the "Web Server" field.