CloudPhysics and Admission Control tuning

A while back I made a little demo video showcasing one of the CloudPhysics cards, which is still my personal favorite. I figured it would be a good idea to share it in case there is anyone out there who hasn’t taken CloudPhysics for a spin yet.


Admission Control and VUM

When I started blogging, my goals were pretty simple. I wanted a place to keep notes, and at the same time try and help out someone else who may be looking for the same type of information. In the process I managed to learn so many new things, managed to remember and hopefully helped someone else out as well. Now this post will probably go under the category of I have no clue what on earth is going on and perhaps someone out there does.

I will admit, I haven’t used VUM for some time, or as often as I used to. I don’t work on the operations side of the house anymore. So the other day, when I started to remediate a host in an HA/DRS-enabled cluster with admission control turned on, the remediation failed.

I figured that must be because the host could not be placed in maintenance mode due to the admission control settings. However, when I looked at what the cluster was doing, I was a bit surprised. There was no way anything should have gotten in the way of this host being placed in maintenance mode. The cluster was so underutilized.

Obviously, I tried to place the host into maintenance mode and had no issues there. So I am not exactly sure why VUM wasn’t able to do the same. Finally, I figured the host is now in maintenance mode and I might as well go ahead and get this guy remediated. No luck there, I was still slapped with the same exact error on the status bar and in the host events I saw the following.

Now I was totally confused. Why does VUM care about this at all? The host is already in maintenance mode, meaning it should be out of the equation for HA, or even admission control for that matter. All that needs to happen here is for the patches to be installed, but VUM keeps complaining about admission control being enabled in the cluster this host resides in, even though neither admission control nor HA will be considering the resources from this host until it comes out of maintenance mode. I would like to point out that I was trying to remediate the host, not the cluster itself.

Obviously the next thing was to disable admission control, and the remediation went fine. I also tried taking a host out of the cluster and that remediation went fine as well. But I am still not sure why VUM refused to patch a host that was already in maintenance mode. Perhaps someone in the community can shed some light on this. Maybe I am missing the obvious here and simply overanalyzing. But this has bugged me for a few days, so I decided to post this and ask the question rather than make assumptions that may not be true.

By the way, this happened to me on both vCenter 4.1 and vCenter 5. According to the 4.1 admin guide:

When you update vSphere objects in a cluster with DRS, VMware High Availability (HA), and VMware Fault Tolerance (FT) enabled, you should temporarily disable VMware Distributed Power Management (DPM), HA admission control, and FT for the entire cluster.

Certain features might cause remediation failure. If you have VMware DPM, HA admission control, or Fault Tolerance enabled, you should temporarily disable these features to make sure that the remediation is successful.

Update Manager does not remediate clusters with active HA admission control.

So according to the documentation admission control must be disabled. And below is the reason for that from the same source:

If HA admission control is enabled during remediation, the virtual machines within a cluster might not migrate with vMotion.

Admission control is a policy used by VMware HA to ensure failover capacity within a cluster. If HA admission control is enabled during remediation, the virtual machines within a cluster might not migrate with vMotion.

Moreover, it also states the following, which I thought was important to capture. Below is why enabling admission control during remediation could be troublesome.

It’s clear the issue is Admission Control being turned on. And according to much of the documentation, admission control must be disabled. However, the rationale behind that requirement is so that the host can be placed in maintenance mode. In my case the host was already in maintenance mode, which leads me to believe that VUM will still check the Admission Control setting on the cluster and fail the remediation if it’s enabled.

One more thing to add to the mix. If you are running the 1000v, upgrading the VEM will fail unless admission control is disabled on the cluster, according to this KB article. Also note that this issue only occurs for VEM-related updates. It is worth pointing out that my tests included both types of hosts, one running a 1000v and one running a standard switch. In my tests, both hosts were managed by the same vCenter.

Again, the purpose of this post is to share what I experienced and hopefully someone will be able to either explain why this happens or point towards a possible misconfiguration or fix. My explanation for why Admission Control needs to be disabled would be that even though the host may already be in maintenance mode, VUM would still check the Admission Control setting and simply fail the remediation if it is enabled (unless you check the box to disable it when remediating). If this is the case, perhaps future releases will make this check more efficient and not simply fail. If the purpose of this check is to make sure the host can be placed into maintenance mode without violating the admission control setting, then VUM should be looking for that piece of information rather than simply failing if Admission Control is turned on.

UPDATE: Please read this follow up post as well.

Design Implications with Admission Control & HA

To start, there is no issue with the Admission control really, it’s really our lack of understanding that makes it an issue. Last week I posted about Admission control and Duncan has a lot of excellent information on this subject. I just wanted to touch on a few things that I think are extremely important to understand in order to come up with appropriate designs for your environments.

So what is Admission Control (AC)? To state it in simple terms, it’s the policy that will save you from yourself. Basically it’s a check that enables vCenter to reserve certain computing resources in your cluster so that an HA event can be accommodated. There are three different ways this is done:

  • Host failures cluster tolerates (this is where slot sizes are used)
  • Percentage of cluster resources reserved as failover spare capacity (this is where you specify a percentage of resources you want reserved)
  • Specify a failover host (this is self-explanatory so I will not be going over this)

Host failures cluster tolerates

We already know that slots can become an issue in a heterogeneous setup where you may have a couple of really large VMs and a bunch of small ones. Let’s imagine you have 100 VMs, out of which 4 have 8 vCPUs and 24GB of memory reserved; everything else is 1 vCPU with no memory reservation. Unfortunately your 4 large VMs will affect your slot size, and it will be huge because of them. This basically means you will have fewer slots available to power on machines in the cluster. Of course you can tweak advanced settings like das.slotCpuInMHz and/or das.slotMemInMB (credit: Duncan) to limit the size of your slot. You will then have more slots available in the cluster, but keep in mind your large VMs may now occupy more than one slot. So in order for them to power on, the required number of slots must all be available on a single host and not spread across the cluster. Just something to keep in mind.

The advantage of using this method is that it’s pretty dynamic: your available slots will increase and decrease automatically as hosts are added, removed or placed in maintenance mode. The one big issue with this method is how it handles unbalanced clusters. By that I mean if you have a humongous host in the cluster, HA AC will take the worst case into consideration and be prepared for the time when this huge host goes down. Good, right? Well, that also means that if each of your other hosts only contributes 50 slots to the cluster and this huge host owns 100 by itself, HA AC will only present the number of slots available for power on that accounts for the worst-case scenario. So your 100-slot host will really not buy you much. Again, there is a whole lot already discussed regarding this by Duncan on his blog and in his books.
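To make the worst-case behaviour a bit more concrete, here is a rough Python sketch. It is only a simplified model that I am assuming for illustration (the function name and the "set aside the slots of the largest hosts" logic are mine, not the actual HA implementation), and the host sizes are the 50 and 100 slot figures from the paragraph above.

```python
def usable_slots(slots_per_host, host_failures_tolerated=1):
    """Slots left for powering on VMs after setting aside the worst case.

    Simplified, assumed model: reserve the slots of the largest hosts,
    because losing them is the worst failure the cluster must survive.
    """
    ordered = sorted(slots_per_host, reverse=True)
    reserved = sum(ordered[:host_failures_tolerated])
    return sum(slots_per_host) - reserved


# Three 50-slot hosts plus one 100-slot host, tolerating one host failure.
print(usable_slots([50, 50, 50, 100]))  # 150

# The same cluster with a fourth 50-slot host instead of the big one.
print(usable_slots([50, 50, 50, 50]))   # 150
```

Under this model both clusters end up with the same number of usable slots, which is the point: the extra 50 slots on the big host are exactly what gets held back.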

Percentage of cluster resources

This is really the area that I wanted to cover, and this also happens to be the one I like. For a quick intro, please read this. I want to talk about how this method can affect your design. We already know that we specify a percentage of resources that are reserved for an HA event. By default the pre-populated number is 25%. That sounds safe, right?

And here is the answer that you have probably heard a million times before. It depends! Seriously, it depends on your cluster. For the purpose of this post and to keep things simple, we will assume that we are only dealing with balanced clusters. So let’s say you only have a 3 node cluster with each host having 64GB of memory and 10GHz of CPU. Your total is 192GB and 30GHz for the cluster. Taking the default 25%, you are really reserving 48GB and 7.5GHz. This means when your cluster only has that many resources left, it will not allow you to power on any more VMs. Is that good or bad? I don’t know; only you can determine that.

Now let’s try to make our cluster a little bigger, and considering we are still talking about vSphere 4, we will keep our cluster size to 8 (across two enclosures, 4 on each one) to make our primary nodes happy. So what happens in a cluster that’s 8 nodes strong? 25% means computing resources that equate to 2 of your nodes are just chilling and relaxing until something goes wrong. Of course this does NOT mean that 2 hosts are unused; it means computing resources that equate to what 2 nodes provide are unused and reserved. Let’s take our example from before and extend that.

We now have 8 nodes each with 64GB ram and 10 GHz of CPU. This means our cluster has 512GB of ram and 80 GHz of cpu. But because our AC setting is set to 25%, we only have 384GB and 60GHz of cpu available for powering on VMs (please note during an HA event, HA will ignore all AC settings as stated here). That means we have reserved 128GB of ram and 20 GHz of CPU for HA. Is that cost-effective? I don’t know, but that should get someone’s attention.
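The arithmetic above is easy to script if you want to try other cluster sizes. This is a minimal sketch, assuming a balanced cluster of identical hosts; the function name `ha_reservation` is just something I made up for illustration.

```python
def ha_reservation(hosts, mem_gb_per_host, cpu_ghz_per_host, percent):
    """Return (reserved, available), each as a (memory GB, CPU GHz) tuple.

    Assumes a balanced cluster where every host is identical.
    """
    total_mem = hosts * mem_gb_per_host
    total_cpu = hosts * cpu_ghz_per_host
    reserved = (total_mem * percent / 100, total_cpu * percent / 100)
    available = (total_mem - reserved[0], total_cpu - reserved[1])
    return reserved, available


# The 3 node and 8 node examples, both at the default 25%.
print(ha_reservation(3, 64, 10, 25))  # ((48.0, 7.5), (144.0, 22.5))
print(ha_reservation(8, 64, 10, 25))  # ((128.0, 20.0), (384.0, 60.0))
```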

Let’s take this even further and look at an example where we have 4 clusters spread across 2 enclosures. Let’s use our same example from above and apply the numbers here. Each blade in the enclosure has 64GB of RAM and 10GHz of CPU. Again, let’s assume we have AC set to 25%. As we discovered earlier, we should have 128GB and 20GHz reserved for an HA event per cluster. Across the four clusters in these two enclosures we will have 128*4 = 512GB of RAM and 20*4 = 80GHz of CPU reserved for an HA event. That is equivalent to the resources of 8 blades across these two enclosures. Interestingly, there are a total of 32 blades in this setup, so 8 blades is 25% of the resources. AC isn’t doing anything wrong; it’s really doing what you asked it to do. You said you wanted 25% reserved, well there you go.

Now if you are not pulling your hair out already, imagine you have 4 such enclosures. If we use the same setup, that means we have 8 clusters, and resources that equate to 16 blades are reserved for an HA event. Wait a sec, that’s equal to 1 whole enclosure. Yes, and if you have 4 enclosures, 25% of that is 1 enclosure, so there should be no surprise :).

I want to clarify that your reserved resources can be fragmented across multiple hosts so all your blades are still probably handling some load. It’s just that the money you spent to provision new VMs may not materialize as you expected unless this was all factored in to begin with. So what if you have 10 of these enclosures setup the same way? That means two things:

  • You have resources reserved that equate to 2.5 enclosures out of the total of 10 enclosures for an HA event
  • And that you have way more money than I can ever imagine so kudos to you for that

So please think for some time before you go with the default of 25%. I have seen as low as 1% (which really bothers me, because why even have AC in that case). I have also seen 50%, and that is the highest that you can go. So pick the number that keeps sanity in your world, but please don’t take 25% just because it’s there by default. If your argument is that you really don’t need those 100s of GB of RAM and GHz anyway, well, that’s fair too, but keep in mind that somebody actually paid for them. Isn’t utilizing your hardware to the fullest one of the advantages of virtualization? I know we need some headroom for HA events, but do we really need an insane amount reserved?

One recommendation would be to set your percentage to what equates to the number you would have picked in the “Host failures cluster tolerates” method. If you wanted to use 1 there, then set your percentage so that you are only reserving the computing resources of one host. So in our 8 node cluster example, 12.5% would be equal to 1 host. Since you can only enter integers here, let’s go with 13%. You started using percentages because you didn’t like the slots; that doesn’t mean you must have more resources reserved. That defeats the purpose. Lastly, unlike “Host failures cluster tolerates”, the percentage method will not dynamically adjust your percentage if you add or remove hosts. So if you add or remove hosts, revisit the percentage again.
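Here is a small sketch of that rule of thumb: work out the smallest whole percentage that covers N hosts’ worth of resources. The function name is hypothetical and it assumes a balanced cluster, since with mixed host sizes a single percentage does not map cleanly onto a number of hosts.

```python
def percent_for_host_failures(total_hosts, host_failures=1):
    """Smallest whole percentage covering `host_failures` hosts' worth of resources.

    Assumes a balanced cluster (identical hosts).
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-100 * host_failures // total_hosts)


print(percent_for_host_failures(8))      # 13  (12.5% rounded up)
print(percent_for_host_failures(16, 2))  # 13
print(percent_for_host_failures(4))      # 25
```

Since the setting does not adjust itself, this is the number to recompute whenever hosts are added to or removed from the cluster.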

Also note, when you have large VMs in your cluster, and because the reserved resources can be fragmented, it may not be a bad idea to give those large VMs a higher restart priority. In simple terms, if your VM needs 24GB to start and your cluster has 24GB available but it’s spread across more than one host, guess what, it won’t start. Since the 4.1 release HA will ask DRS to make room, but that’s not guaranteed.

vSphere 5

The obvious question is: does this get better with vSphere 5? Yes, it does, but you will still have to figure out what works best for you. The one enhancement that’s visible in the GUI is how you specify the resource reservation for HA.

You can specify CPU and memory individually. I think this is great if you are trying to make sure your dollars are not wasted and that you are not forced to reserve more than you really have to. This way you can reserve a certain amount of memory and a certain amount of CPU for your cluster. Prior to vSphere 5, you didn’t have a way to differentiate the two; you could only enter one percentage that applied to both. But again, the defaults are 25%, and one will have to figure out what the sweet spot is in their environment.

In vSphere 4, a default of 256MHz was used for CPU for any virtual machine that did not have a reservation larger than 256MHz, and if no memory reservation was set either, a default of 0MB + memory overhead was used; these values contributed to determining the failover capacity of your cluster. In vSphere 5, a default of only 32MHz is used if no CPU reservation is defined, and for no memory reservation a default of 0MB + memory overhead is used, like before, to compute the failover capacity.
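As a quick illustration of those defaults, here is a sketch of the per-VM values that would feed into the failover capacity calculation. The function names are my own, and the logic is just my reading of the paragraph above, not something taken from the product.

```python
def cpu_counted_mhz(reservation_mhz, vsphere_major):
    """CPU (MHz) counted for one VM, per the defaults described above."""
    if vsphere_major == 4:
        # vSphere 4: anything at or below 256MHz is counted as 256MHz.
        return max(reservation_mhz, 256)
    # vSphere 5: only a missing (zero) reservation falls back to 32MHz.
    return reservation_mhz if reservation_mhz > 0 else 32


def memory_counted_mb(reservation_mb, overhead_mb):
    """Memory (MB) counted for one VM: reservation (0 if none) plus overhead."""
    return reservation_mb + overhead_mb


print(cpu_counted_mhz(0, 4))         # 256
print(cpu_counted_mhz(0, 5))         # 32
print(memory_counted_mb(2048, 100))  # 2148
```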

How is the failover capacity computed? I have written a detailed post on this subject here.

The Current CPU Failover Capacity is computed by subtracting the total CPU resource requirements from the total host CPU resources and dividing the result by the total host CPU resources. The Current Memory Failover Capacity is calculated similarly.

Of course, because of the way HA works in vSphere 5, you are not limited to a 4 nodes per failure domain setup anymore, as there are no primary or secondary nodes in vSphere 5. In other words, if you wanted a 10 node cluster and you have two enclosures, you could do 5 per enclosure and not have to worry about HA not functioning during an enclosure failure. I have discussed that in one of my earlier posts. But that’s out of the scope of what’s being discussed here.

Going back to the original discussion: because you can now have more than 4 hosts in a failure domain in vSphere 5, let’s say you have a 16 node cluster with 8 nodes on each enclosure (thank you vSphere 5 and FDM), and have your percentage set to 13% (for both CPU and memory), which is just a little over the computing resources of two hosts (IMO this is still pretty liberal but it seems more practical). This means you are only reserving computing resources that equate to 2 hosts per cluster. Wait, isn’t that what happened before? Yes, but that was a smaller cluster and this one happens to be twice as big. If we have two 16 node clusters, we are reserving total computing resources that equate to about 4 hosts across those two clusters, which is better than before. Of course going with larger clusters is another discussion out of the scope of this topic, but I will say that DRS will be happier in a larger cluster. Keep in mind, if you take the default of 25% even in this large cluster of 16 nodes, you will still be screwed, as that would mean resources that equate to 4 hosts will be reserved per cluster, so you will have the same old issue discussed above. So be mindful of what percentage you place here. vSphere 5 gives you more flexibility, as you can now put in different values for CPU and memory.


Admission control is an awesome thing. You should absolutely turn it on so that it can do what needs to be done. However, it’s important for us to understand how it works. I know a few gigs of reserved capacity in my lab annoys me from time to time, but I know what it’s there for and how it would benefit me. If I had an enormous amount of computing resources reserved for HA like in the example above, I would be a little alarmed. Of course there might be a good reason for one to run that kind of setup, who knows. But if that person is you, please consider this post a request to donate your hardware to me when it comes time for you to upgrade :).

HA Admission control – Percentage of cluster resources

I am sure we are all aware of why HA is all important and awesome to have. It helps you to finish your coffee, smoke your cigarette before rushing towards a server that just went down. Ok maybe not that but you get the idea right. Another thing to keep in mind regarding HA is the admission control policy. I like to call this the policy that saves you from yourself. Basically it keeps check of how many resources are available and how many will be needed for a failover to happen. It keeps you honest and ensures that the HA’s promise is not broken.

As we already know there are three types of Admission Control Policies to choose from:

  • Host failures cluster tolerates
  • Percentage of cluster resources reserved as failover spare capacity
  • Specify a failover host

“Host failures cluster tolerates” creates slots, which at times could create issues, especially if you only have a few VMs with high CPU counts and memory reservations. Of course you can look at advanced settings that could address this, and Duncan can tell you all there is to know about this. The second option, which is selecting a percentage of resources, is my personal favorite, especially due to the flexibility that it provides. We will go over that in a little bit. The last option, which lets you specify a failover host, is the one that’s rarely used, and rightly so. After all, why would you want a host to just sit there and wait until something goes wrong?

As you may have already noticed, vSphere 5 gives you the option to specify a percentage of failover resources for both CPU and memory. Prior to vSphere 5, this was not the case. I think this is an excellent addition and our clusters will now be more flexible than ever.

25% is what’s placed in there by default, and what this really means is that 25% of your total CPU and total memory resources across the entire cluster is reserved for failover. So in other words, if you have an 8 node cluster, 25% of your resources, or resources equal to two hosts (assuming it’s a balanced cluster), are reserved for an HA incident. If this happens to be a balanced 32 node cluster, resources that equate to 8 nodes will be reserved, as 8 is 25% of 32. So keep that in mind before deciding what number to put there. You can’t reserve more than 50% of your resources.

Below is how the resources are calculated for the hosts:

The total host resources available for virtual machines is calculated by adding the hosts’ CPU and memory resources. These amounts are those contained in the host’s root resource pool, not the total physical resources of the host. Resources being used for virtualization purposes are not included. Only hosts that are connected, not in maintenance mode, and have no vSphere HA errors are considered.

So how do you know how much headroom you have left in the cluster? On your cluster summary tab, you will notice there is no longer a place for you to look at slot size, as this method does not use slot sizes. It basically gives you a simple view of how much room you have left.

The Current CPU Failover Capacity is computed by subtracting the total CPU resource requirements from the total host CPU resources and dividing the result by the total host CPU resources. The Current Memory Failover Capacity is calculated similarly.

In vSphere 5, vSphere HA uses the actual reservations of the virtual machines. If a virtual machine does not have reservations, meaning that the reservation is 0, a default of 0MB memory and 32MHz CPU is applied.

So assuming you went with the default of 25% for each resource, 0% as current failover capacity is something you should hope never to see. You are seeing that in my screenshot (above) because my cluster happens to be empty and has no hosts. Let’s say you went ahead and turned on a few VMs and your cluster shows something like below (98% CPU and 95% memory); this is something to be happy about. This basically means you have 98% of CPU and 95% of memory available in your cluster.

There is one thing to keep in mind: though 98% of my CPU and 95% of my memory appear under my current failover capacity, this does not account for the 25% that’s reserved for an HA incident. At least that’s what I was able to see from the few tests that I ran. What this means is that I can only power on VMs that account for no more than 98-25 = 73% of CPU and 95-25 = 70% of memory that’s free in the cluster. For everything else HA should try to save me from myself.

Let’s look at a quick example to see how these numbers are calculated:

  • The Configured Failover Capacity is set to 25% for both CPU and memory.
  • Cluster is comprised of three hosts, each with 9GHz and 24GB of memory.
  • There are 4 powered-on virtual machines in the cluster with the following configs (assume overhead is 100MB for all VMs in this case):
    • VM1 needs 2GHz and 1GB (no reservation)
    • VM2 needs 2GHz and 2GB (2GB reserved)
    • VM3 needs 1GHz and 2GB (2GB reserved)
    • VM4 needs 3GHz and 6GB (1GHz and 2GB reserved)

So what does our cluster have? Our cluster has 9GHz+9GHz+9GHz = 27GHz of CPU and 24GB+24GB+24GB = 72GB of memory. (These amounts are those contained in the host’s root resource pool, not the total physical resources of the host.)

How much resources are we using with our four VMs that are powered on?

Memory = VM reservation + overhead = 0+100+2048+100+2048+100+2048+100 = 6544MB ≈ 6.4GB

Note we only used 2048 for VM4 even though it had 6GB configured. That’s because it only had 2GB reserved. Also, VM1 had no reservation, so only overhead was used.

CPU = VM reservation, or 32MHz in vSphere 5 if there is no reservation = 32MHz+32MHz+32MHz+1GHz = 1.096GHz

So what is our current failover capacity?

Memory = (72GB – 6.4GB)/72GB = 91%

CPU = (27GHz – 1.096GHz)/27GHz = 95.94% ≈ 96%

Wow, that is a lot of cluster resources left. Now let’s take 25% off our numbers to come up with exactly how much we can power on before HA starts screaming back with an error.

Memory = 91- 25 = 66%

CPU = 96-25 = 71%
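If you want to play with the numbers, the whole example can be reproduced in a few lines of Python. This is only a sketch of the arithmetic above, using the 100MB overhead and 32MHz default from this example; the variable and function names are mine.

```python
DEFAULT_CPU_MHZ = 32   # vSphere 5 default when a VM has no CPU reservation
OVERHEAD_MB = 100      # per-VM memory overhead assumed in this example
CONFIGURED_PERCENT = 25

# (CPU reservation in MHz, memory reservation in MB) for VM1 through VM4
vms = [(0, 0), (0, 2048), (0, 2048), (1000, 2048)]

total_cpu_mhz = 3 * 9000      # three hosts with 9GHz each
total_mem_mb = 3 * 24 * 1024  # three hosts with 24GB each

cpu_required = sum(cpu if cpu > 0 else DEFAULT_CPU_MHZ for cpu, _ in vms)
mem_required = sum(mem + OVERHEAD_MB for _, mem in vms)


def failover_capacity(total, required):
    """(total - required) / total, expressed as a percentage."""
    return (total - required) / total * 100


cpu_capacity = failover_capacity(total_cpu_mhz, cpu_required)
mem_capacity = failover_capacity(total_mem_mb, mem_required)

print(cpu_required, mem_required)                     # 1096 6544
print(round(cpu_capacity), round(mem_capacity))       # 96 91
print(round(cpu_capacity) - CONFIGURED_PERCENT,
      round(mem_capacity) - CONFIGURED_PERCENT)       # 71 66
```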

Now keep in mind, selecting the percentage for the admission control policy isn’t going to solve all your problems. But I do think that this setting is far better than complex slot sizes and whatnot. It gives you a simple view of how much room you have in your cluster without messing around with slot sizes. However, unlike the “Host failures cluster tolerates” setting, where you can simply add hosts like crazy, using the percentage method may require you to revisit your percentages as you add or remove hosts. At the same time it also gives you more flexibility. So next time you are setting up a cluster, think about what’s important to you.


When to disable HA?

I was talking to a friend a few days ago who is the owner of a pretty big sized vSphere 4 environment. His network team is in the middle of upgrading switches which requires some outage (30 sec+) on his management network for the ESXi hosts.  He wanted to discuss the best possible way to approach this without disrupting the environment and causing an outage for the VMs. Keep in mind the VMs network will continue to run and is not expected to go down.

My immediate question was whether HA is enabled and which isolation response has been set on the clusters. In his case the isolation response is to shut down. Obviously this means that as soon as the management network has the outage,

  • the hosts will be unreachable
  • the hosts will try to reach each other with no luck and
  • ultimately once they realize they can’t even ping the gateway (isolation addr)
  • all hosts will declare themselves isolated

The result will be an environment that automatically shuts down all the VMs, exactly the opposite of what we want. So what’s a simple workaround? Disable HA, and re-enable it once the network maintenance is completed. Of course, keep your fingers crossed that your hosts don’t experience a hardware failure during this time.

It’s interesting to explore whether changing the isolation response to leave VMs powered on would be helpful in this case. If the isolation response is set to leave powered on and the management network experiences the outage:

  • the hosts will be unreachable
  • the hosts will try to reach each other with no luck and
  • ultimately once they realize they can’t even ping the gateway (isolation addr)
  • all hosts will declare themselves isolated

At this time, though the VMs will continue to run, your primary hosts in the cluster will now try to turn on the VMs that are already running on the other hosts (assuming isolated primaries are hardworking hosts that continue to do their jobs even in an isolated state). Of course they wouldn’t be able to, but this will be unnecessary stress on the hosts that you probably don’t want. Would it impact your VMs? Maybe; remember we want the VMs to have no outage or impact. It seems like leaving VMs powered on and HA enabled for the network maintenance period is also not the best option, as it could cause performance issues for some VMs. I still think disabling HA during this outage might be the most seamless option at this time.

What about vSphere 5?

Naturally, the next thing that came to mind after a couple of days was: does this change with vSphere 5? It does. As we already know, in vSphere 5 we also have datastore heartbeats that correctly identify the state of the hosts in the cluster. For those who are not aware, vSphere 5 by default uses 2 datastores that are shared across all or most hosts in a cluster as a heartbeat in case the management network experiences an outage. So what will happen when the mgmt network goes down in a vSphere 5 cluster and the isolation response is to leave VMs powered on?

  • Each host in the cluster will enter an election (except the master, the master will declare itself isolated in 5 sec)
  • As each host will be isolated, each host will elect itself as a master
  • Each host will ping the isolation address and declare itself isolated
  • Trigger the isolation response which is to leave VMs powered on

In this case, within 30 sec of the management network outage, each host will have declared itself isolated and won’t attempt to restart any VMs like the primaries would in vSphere 4. The result is that no host will be under unnecessary stress to start VMs that are already running somewhere else, unlike the previous version. This also means the VMs will not experience any performance issues, as the hosts will not have any additional stress. A file called poweron (on the shared datastore) that each host owns will also reflect its updated status as being isolated.

Now what happens when the network outage is over and the hosts are in a position to talk to each other? I have not been able to find documentation on whether an isolated host (a host that has already declared itself isolated) will enter an election (vSphere 4 or 5) once the communication channel is open and bring the cluster back to life. From what I have read so far, the host will remain in an isolated state unless some manual intervention is introduced, like resetting HA. Perhaps Duncan or one of the HA experts can sanity-check this and provide a concrete answer.

If resetting HA is the quickest way to get the hosts to elect a master (vSphere 5) or primary hosts (vSphere 4) so that HA can function after the network outage, then disabling HA for the network outage period is probably a better option. That is, unless they can autofix and enter an election once the outage ends, i.e. hosts that have declared themselves isolated return for an election. I have not been able to find any documentation on that. Based on that, disabling HA for the maintenance window would probably be the best option in this case (assuming the isolated hosts will not autofix themselves once the network comes back).

One thing to note: vSphere 5 offers a brand new HA under the hood. However, in this scenario we noticed very little difference between the two versions. Keep in mind that the architectural difference in HA from vSphere 4 to 5 enables the isolated hosts (in vSphere 5) to be correctly identified, which eliminates the unnecessary stress some hosts would go through due to the leave powered on isolation response. There is more to it than meets the eye.


Duncan wrote a post in reply to this one where he recommended unchecking “host monitoring” instead of disabling HA. This will be quicker and means your cluster won’t have to start over. Moreover, the host monitoring option is really meant for outages like this. So don’t disable HA, just uncheck “host monitoring” during the expected outage window.