PSOD (ESXi 4.1) on a Dell R710

The PSOD that has been talked about for years finally made its entrance in our datacenter this morning. The R710, which had been running for a couple of days with nothing on it but a plain ESXi install, stopped responding to my pings. Unaware of what had happened, I called our network gurus to see if anything had changed on their end. Once they confirmed that the network hadn't been touched, I logged on to the slow KVM to get to the server's console. It got really silent, and right before me was the PSOD on ESXi 4.1. I have been working with VMware for some time now and I knew how big this really was. So I took a snapshot as a souvenir and hoped this would get me into the elite class of VMware engineers that I have so far only envied.

From the PSOD it only made sense to look at the hardware side first. Running diagnostics turned up PCIe errors. I figured it would be best to call Dell rather than reinvent the wheel. Sure enough, once I was on the phone with Dell and had sent them the system logs from the DRAC, it turned out that the PCIe card in slot 2 (a quad-port NIC) was the culprit. We moved the card to slot 3, and then slot 3 started reporting issues, which pointed at the card rather than the slot. Luckily we had a few spare NICs, so we put a replacement in this box. It has been up for hours now: no errors logged, no orange LCD on the server, and no purple screen of death.

4.0-4.1 and Cisco 1000v

The 1000v has brought me some new challenges recently. The latest was the upgrade from 4.0 to 4.1. When I ran the upgrade from VUM, it kept failing with notifications about incompatible software installed on the host. It turns out the Cisco VEM was not seen as a friend, even with VSM version 3a running. I ended up doing the following to make it work. A little dirty, but it worked!

Remove the Cisco VEM from the host: --server hostname -r -B VEM400-XXXXXXXX

Install the upgrade on the host: --server ip address -i --bundle "location of upgrade"

I would remove the host from the 1000v while it is rebooting. Once the host comes back up, add it to the 1000v again and the VEM will be deployed. Now enjoy the fruits of 4.1.
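
The two command-line steps above can be sketched as a small dry-run script. It only prints the commands, so nothing touches a host until you run them yourself. Note the assumptions: the post does not name the tool, and I am assuming it is the vSphere CLI's vihostupdate, whose flags match the ones shown (-r/-B to remove a bulletin, -i/--bundle to install a bundle). The function names, the host name and the bundle path are made up for illustration, and VEM400-XXXXXXXX is left as the placeholder it is in the steps above.

```shell
#!/bin/sh
# Dry-run sketch: prints the two commands instead of executing them.
# ASSUMPTION: the tool is vCLI's vihostupdate; the post itself does not name it.
TOOL="${TOOL:-vihostupdate}"

# Build the command that removes the Cisco VEM bulletin from a host.
# $1 = ESXi host name or IP, $2 = VEM bulletin ID
remove_vem_cmd() {
    printf '%s --server %s -r -B %s\n' "$TOOL" "$1" "$2"
}

# Build the command that installs the 4.1 upgrade bundle on a host.
# $1 = ESXi host name or IP, $2 = location of the upgrade bundle
install_upgrade_cmd() {
    printf '%s --server %s -i --bundle "%s"\n' "$TOOL" "$1" "$2"
}

# Hypothetical host and bundle path, for illustration only.
remove_vem_cmd esx01.example.com VEM400-XXXXXXXX
install_upgrade_cmd esx01.example.com /path/to/upgrade-bundle.zip
```

Printing first gives you a chance to check the bulletin ID and bundle path before anything is removed from the host.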