The PSOD, as we all know, is probably not the best way to start your day. In most cases the cause is a hardware issue. I have come across a few over the years, and the culprits have been a faulty motherboard, a bad NIC, a bad HBA and whatnot. At times it can also come down to a combination of things, which is annoying in the moment but rewarding in hindsight.
Whatever the case might be, it’s generally not a lot of fun. Unless it’s your first one, in which case I can understand the excitement and the anxiety. I came across this awesome post that I must share. It has a list of some very useful KB articles related to the PSOD. Yes, this is also my way of bookmarking it. I would recommend you bookmark that link as well for a rainy day when your screen goes purple. I am hoping they will keep the post updated.
Btw, I really like how they prefer calling it the “Purple screen of diagnostics” rather than the “Purple screen of death”. Pretty creative, aye?
The PSOD that has been talked about for years finally made its entry into our datacenter this morning. The R710, which had been running for a couple of days with nothing on it but a plain ESXi install, stopped responding to my pings. Oblivious to what had happened, I called our network gurus to see if anything had changed on their end. Once they confirmed that the network hadn’t been tweaked, I logged on to the slow KVM to get to the server’s console. It got really silent, and right before me was the PSOD on ESXi 4.1. I have been working with VMware for some time now and I knew how big this really was. So I took a snapshot of the screen as a souvenir and hoped this would get me into the elite class of VMware engineers that I have so far only envied.
From the PSOD it only made sense to look at the hardware side first. Running diagnostics turned up PCIe errors. I figured it would be best to call Dell and not reinvent the wheel. Sure enough, once I was on the phone with Dell and had sent them the system logs from the DRAC, it turned out that the PCIe card in slot 2 (a quad-port NIC) was the culprit. We moved the card to slot 3, and slot 3 started reporting issues instead, which pointed to the card rather than the slot. Luckily for us we had a few spare NICs, and one of them went into this box. It has been up for hours now. No errors logged yet, no orange LCD on the server and no purple screen of death.