HA for MSCS VMs in vSphere

A few days ago, I was complaining about not knowing why HA has to be disabled on an MSCS setup in vSphere. Turns out only DRS needs to be disabled, as HA is still supported according to KB article 1037959. If I read it correctly, HA is still supported even in a cluster across boxes (CAB) setup, where you have to use physical compatibility mode. DRS is not supported in any vSphere and MSCS setup, for the reasons I discussed in one of the previous blogs. Although the MSCS user guide for 4.1 suggests that you can set DRS to partially automated for MSCS machines, the PDF also mentions that migrating these VMs is not recommended. And as the table below suggests, DRS is not supported either.

KB article 1037959

So, what does support for HA really mean? If you only have a two-node ESX/i cluster and an MSCS CAB setup, HA support will not affect you because of the anti-affinity rules: the only surviving host already runs the partner VM, so the dead VM has nowhere to restart. However, if your ESX/i cluster is bigger than two nodes, HA can be leveraged and the dead MSCS VM can be restarted on a different host while still complying with the anti-affinity rule that has been set. For an MSCS cluster in a box (CIB) setup, HA can be leveraged even on a two-node ESX/i cluster. When host one dies, host two finds itself spinning up the two partners in crime. One thing to note here is that all of this is only possible if the storage (both the boot vmdk and the RDM/shared disk) is presented to all the hosts in the cluster. I can’t imagine why anyone would not do that to begin with.
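
The placement reasoning above can be sketched as a small Python model. This is only an illustration of the argument, not how HA actually picks a host; the function name `restart_candidates`, the rule strings and the host names are all made up for the example.

```python
def restart_candidates(hosts, failed_host, partner_host, rule):
    """Return the hosts on which the VM from failed_host could be restarted.

    hosts        -- all ESX/i hosts in the cluster (hypothetical names)
    failed_host  -- the host that just died
    partner_host -- the host that was running the other MSCS node
    rule         -- "anti-affinity" for CAB, "affinity" for CIB
    """
    survivors = [h for h in hosts if h != failed_host]
    if rule == "anti-affinity":
        # CAB: the restarted node must not land next to its partner.
        return [h for h in survivors if h != partner_host]
    if rule == "affinity":
        # CIB: both nodes lived on the failed host and restart together,
        # so any surviving host keeps them on the same box.
        return survivors
    raise ValueError(f"unknown rule: {rule}")


# CAB on a two-host cluster: nowhere to go.
print(restart_candidates(["esx1", "esx2"], "esx1", "esx2", "anti-affinity"))
# CAB on a three-host cluster: the third host qualifies.
print(restart_candidates(["esx1", "esx2", "esx3"], "esx1", "esx2", "anti-affinity"))
# CIB on a two-host cluster: the surviving host takes both nodes.
print(restart_candidates(["esx1", "esx2"], "esx1", "esx1", "affinity"))
```

The model assumes every host sees the same storage, which is the condition called out at the end of the paragraph above.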

Again, only a two-node MSCS cluster is supported so far. With HA being supported for MSCS VMs, I guess one can certainly benefit from the added redundancy. If you think this is being too redundant, just don’t use the feature and disable HA for the MSCS VMs in your environment. I would highly recommend disabling HA for the two VMs if they are part of an MSCS CAB setup in a two-node ESX/i cluster.

MSCS VMs and Snapshots

When using VMs in an MSCS cluster across boxes (CAB), you will need to set up the RDMs in physical compatibility mode and enable bus sharing. Please note that VMs with RDMs in physical mode cannot be snapshotted. Basically, you will find that your MSCS VMs have their snapshot option greyed out. What does this mean?

You can’t use VCB to back up your VMs, as that relies on snapshots.
You can’t use vDR for backups, as that relies on snapshots.
And lastly, you can’t leverage snapshots for tasks like patching your VM, if you have been practicing that in the past.

When running the disk in independent/persistent mode, you would think that snapshots would still work and only snapshot the vmdk holding the OS partition and not your RDM. However, with the bus sharing that MSCS needs in place, the snapshot criteria are not met, so the option remains greyed out. Also, let’s assume you are able to snapshot the VM somehow: you take a snapshot of VM1 and, after keeping it for a day, suddenly decide to revert. I am not sure how VM2 or even the cluster itself will behave when all of a sudden one of its nodes has forgotten what happened in the last 24 hours. So just don’t snapshot it, even if you come up with a way of doing it.
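
The two blockers described in this post (bus sharing and physical-mode RDMs) can be written down as a quick pre-check. The sketch below is plain Python with made-up class and field names (`Disk`, `VM`, `bus_sharing`, `kind`); it does not talk to vCenter and only encodes the rules as stated above.

```python
from dataclasses import dataclass, field


@dataclass
class Disk:
    label: str
    kind: str = "vmdk"  # hypothetical values: "vmdk", "rdm-virtual", "rdm-physical"


@dataclass
class VM:
    name: str
    disks: list = field(default_factory=list)
    bus_sharing: str = "none"  # hypothetical values: "none", "virtual", "physical"


def snapshot_blockers(vm):
    """List the reasons (per the post above) why the snapshot option is greyed out."""
    reasons = []
    if vm.bus_sharing != "none":
        reasons.append(f"SCSI bus sharing is set to {vm.bus_sharing}")
    for disk in vm.disks:
        if disk.kind == "rdm-physical":
            reasons.append(f"disk {disk.label} is an RDM in physical compatibility mode")
    return reasons


node1 = VM(
    "mscs-node1",
    disks=[Disk("os"), Disk("quorum", "rdm-physical")],
    bus_sharing="physical",
)
for reason in snapshot_blockers(node1):
    print(reason)
```

An empty list means none of the blockers from this post apply; it does not prove that a snapshot will succeed.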

One thing I haven’t tried yet is to see what happens when I turn a node off, snapshot it, do what I need to do and, when I need to revert, turn it off and then revert. Not sure if this is even possible or supported. Not to mention, bus sharing will have to be disabled. But in all seriousness, I would never do this in a production environment. If I ever decide to test this, it will only be for my own curiosity.

Not being able to snapshot should not really be the reason for you to scrap MSCS in VMware. This is simply a limitation to understand prior to designing a solution. If snapshotting is of paramount importance, then this may not be your cup of tea. However, will the alternative solution give you that ability? If not being able to back up the VM using VCB or vDR is your issue, then please be informed that you can still back up the VM by installing an agent inside the guest and using a traditional backup mechanism. Leverage the same infrastructure that has been backing up your physical world. MSCS in VMware is not perfect, but it will get there.

MSCS and vSphere Conflicts

As already addressed in the vSphere 4 u1 release notes, MSCS VMs are supported in an HA/DRS cluster. It’s amazing how few have noticed the change. With all the functionality that VMware has introduced over the years, it’s easy to miss a few things every now and then. Some consider MSCS a primitive form of clustering as opposed to HA/DRS clusters within ESX/i. However, it must be noted that an HA/DRS cluster does not protect you from application failure or OS corruption. Neither does FT in vSphere. With an FT-enabled VM, when the primary VM blue screens, so does the secondary VM, and you are left with two identical servers, both not functioning.

To sum it up, HA/DRS and even FT protect you from hardware failure only. According to VMware, MSCS must be leveraged to maintain 100% uptime for Windows guests. So what can and can’t you do with MSCS and VMware?

You can cluster two VMs on the same host, two VMs on separate hosts, and you can also cluster a physical and a virtual machine. There are detailed guides published by VMware on how this can be achieved. (Click Here)

A 50K-foot view of what you can and cannot do (this will also differ based on the version of ESX/i you are running):
Only two nodes in an MSCS cluster
An MSCS VM cannot be FT-enabled
Though MSCS VMs can be in an HA/DRS cluster, both HA and DRS should be disabled for all the VMs that are part of MSCS
Quorum and shared disks should not have a VMFS signature and should be presented to all the hosts in the cluster where the MSCS VMs reside (think about it, it makes sense)
Don’t overcommit memory, and try to create a reservation for your VM equal to the memory assigned
The VMware doc will have more details
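
Most of the checklist above can be turned into a simple sanity check. The following is a minimal Python sketch under my own assumptions: the `MscsNode` fields and the `check_mscs_cluster` helper are invented for illustration and are not part of any VMware tool. The shared disk items are left out because they depend on the storage layout.

```python
from dataclasses import dataclass


@dataclass
class MscsNode:
    # All field names are hypothetical and exist only for this example.
    name: str
    memory_mb: int
    memory_reservation_mb: int
    ft_enabled: bool = False
    ha_enabled: bool = False
    drs_enabled: bool = False


def check_mscs_cluster(nodes):
    """Return a list of checklist violations for the given MSCS nodes."""
    problems = []
    if len(nodes) > 2:
        problems.append(f"{len(nodes)} nodes in the MSCS cluster, only two are supported")
    for node in nodes:
        if node.ft_enabled:
            problems.append(f"{node.name}: FT is enabled")
        if node.ha_enabled:
            problems.append(f"{node.name}: HA is not disabled for this VM")
        if node.drs_enabled:
            problems.append(f"{node.name}: DRS is not disabled for this VM")
        if node.memory_reservation_mb < node.memory_mb:
            problems.append(
                f"{node.name}: memory reservation {node.memory_reservation_mb} MB "
                f"is below the assigned {node.memory_mb} MB"
            )
    return problems


nodes = [
    MscsNode("sql-a", memory_mb=8192, memory_reservation_mb=8192),
    MscsNode("sql-b", memory_mb=8192, memory_reservation_mb=4096, drs_enabled=True),
]
for problem in check_mscs_cluster(nodes):
    print(problem)
```

Treat the output as a reminder list only. The VMware doc remains the authority on what is supported for your version.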

Now the last part. DRS is disabled because, under the hood, DRS uses vMotion. Though vMotion is rapid and causes no outage for the users, the MSCS heartbeat is very sensitive and may treat the few seconds of the stun period as a node failure and consider that node to be down. This is certainly not what you want. Hence it’s best not to vMotion, which is why DRS is disabled.
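
To make the heartbeat argument concrete, here is a toy Python calculation. The interval and threshold defaults are illustrative assumptions, not values taken from this post, so check the actual settings of your own cluster before drawing conclusions.

```python
def node_declared_down(stun_seconds, heartbeat_interval_s=1.0, missed_threshold=5):
    """Return True if a stun of this length would miss enough heartbeats.

    heartbeat_interval_s and missed_threshold are assumed example values.
    """
    missed = int(stun_seconds // heartbeat_interval_s)
    return missed >= missed_threshold


for stun in (0.5, 3.0, 6.0):
    verdict = "node considered down" if node_declared_down(stun) else "node survives"
    print(f"stun of {stun} s: {verdict}")
```

The tighter the heartbeat settings, the shorter the stun that gets mistaken for a dead node.
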
Why is HA disabled? No one has been able to give a straight answer on that, and it basically comes down to it not being supported.

As of now I really don’t know why you can’t have HA enabled for a VM that is part of an MSCS cluster.
The good news is that, with 4 u1 and onwards, you can utilize the same hosts that are in an HA/DRS cluster to run your MSCS VMs. Just don’t forget to disable these features for the VMs that are part of the MSCS cluster, or else VMware and MS support may stiff you in your time of need.


RDM Tutorial

At times, as a VMware engineer, you may run into situations that take away some of the flexibility you have in your environment. For example, a request to attach a raw LUN to a VM may prevent you from taking snapshots, depending on how the disk is configured in VMware. Though raw LUNs may seem like a hurdle to some, they have their place in the virtual world. One of the advantages of an RDM is that it enables your storage team to run their fancy management tools on the presented LUN. At the same time, you may see better performance in a high-I/O VM with the intensive I/O being executed on the RDM versus a virtual disk. A database server with heavy transactional reads and writes may seem like a good candidate for RDMs. MSCS requirements will also lead one to look at RDMs.

I found a good tutorial on RDM at showmedo.com that I am sharing with you guys.