1000v Blues

As we all know, new things happen in the virtualization world every day, but that doesn't necessarily mean we should rush a technology that was introduced yesterday into our production environment. Every technology has its place and should only be implemented where it benefits the business. Labs are the perfect place to satisfy your appetite for cool new tools; production is not.

A standard vSwitch has its advantages, but the 1000v certainly gives you more capabilities. The question is whether those capabilities will actually be used. We implemented the 1000v in our environment and not one person really knows exactly what's going on; it has become a learn-as-you-go sort of game. That's exciting, but only until something really goes wrong. Over time the team has developed a certain level of comfort with the 1000v, but we are still not 100% sure of it. Besides, the idea behind a distributed switch is to make changes from a central location. That doesn't help much if you have multiple vCenters where there should only have been multiple clusters in a single vCenter instance: now you have multiple 1000Vs to manage and more things that could go wrong. Granted, the number of switches would have been far higher if these isolated vCenters didn't have distributed switches. Then again, if these were standard switches, they wouldn't have to be patched the way the VSMs in the 1000v do, and a new patch can put a lot at stake. Yes, you can test it out in a lab, but we all know a lab is never quite the same.

My point is, the 1000v is great to have, but it comes back to: why do we need it? If you have compelling reasons for how it will improve your environment, go ahead and accept the headache. If those reasons don't hold much weight, the 1000v can easily become a hassle. And as always, do proper testing in a lab; don't simply rely on white papers and he-said-she-said. I have also seen the 1000v implemented to give switching back to the network team. If that is the case, I am sure your network team will appreciate being part of the planning. Let's get one thing straight: just as we don't want multiple vCenters when multiple clusters within the same vCenter would do the job, I am sure the networking team has issues with multiple vSwitches they have no control over. Let them take over the switching to improve the end result. But let's not introduce this monster until you have a proper understanding of how it actually works. Guessing your way through may get it set up, but that is not an efficient method for troubleshooting when an issue comes up.


e1000 packet loss

While going through the network stats in esxtop, I noticed a huge percentage of received packets being dropped. For a second I thought I was looking at the wrong screen, but I wasn't. The numbers were all over the place and didn't make any sense: they jumped from 20% to 90% to 3% to 48%, all within 5-6 seconds. Sadly this was happening to almost all of the VMs. I didn't know if the drops were really happening or if the host simply didn't know what was going on, since we are using the 1000v.
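
If you want to track these drops outside of the live esxtop screen, one option is to capture a batch-mode run (esxtop -b -d 5 -n 60 > capture.csv) and scan the output afterwards. Below is a minimal Python sketch of that idea; it is not something from this post, and the exact counter names in the CSV header vary by ESX version, so the "Network Port"/"Dropped" pattern matching is an assumption rather than a documented name.

```python
# Illustrative sketch: find the worst receive-drop values in an esxtop/resxtop
# batch-mode capture. Header matching is a guess -- adjust to your version.
import csv
import sys

def worst_rx_drops(path):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Pick columns that look like "...\Network Port(...)\...Dropped...Rx/Received..."
        drop_cols = [i for i, name in enumerate(header)
                     if "Network Port" in name and "Dropped" in name
                     and ("Rx" in name or "Received" in name)]
        worst = {}
        for row in reader:
            for i in drop_cols:
                try:
                    value = float(row[i])
                except (ValueError, IndexError):
                    continue
                worst[header[i]] = max(worst.get(header[i], 0.0), value)
    return worst

if __name__ == "__main__":
    # Usage: python rx_drops.py capture.csv
    for col, peak in sorted(worst_rx_drops(sys.argv[1]).items(),
                            key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{peak:8.1f}  {col}")
```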


I asked our network team to use their tools to see if they were noticing the same thing on their end; they are still researching the issue. To restore some sanity, I increased the receive buffer on one VM to see if it would make any difference. It didn't! I then started looking at the few VMs that weren't experiencing the issue, and to my surprise they were all XP machines, which got me thinking the problem was guest-related in some way. Upon further investigation, I noticed the XP VMs were running the flexible NIC versus the e1000 running on the other 2003/2008 VMs. My next step was to replace the NIC on one of the affected VMs with the vmxnet NIC to see what happens. Voilà! The dropped packets went down to 0 and stayed there.
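
Rather than clicking through each VM's settings to see which adapter it has, you can pull an adapter-type report programmatically. Here is a rough pyVmomi sketch of that idea; the vCenter hostname and credentials are placeholders, and this is only an illustration of checking adapter types, not a procedure from this environment.

```python
# Hypothetical sketch: list the virtual NIC type of every VM in vCenter,
# so e1000 and flexible (PCNet32) adapters are easy to spot.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def nic_report(si):
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.config is None:  # skip inaccessible VMs
            continue
        for dev in vm.config.hardware.device:
            # VirtualE1000, VirtualPCNet32 ("flexible") and VirtualVmxnet*
            # are all subclasses of VirtualEthernetCard.
            if isinstance(dev, vim.vm.device.VirtualEthernetCard):
                print(f"{vm.name}: {type(dev).__name__}")
    view.DestroyView()

if __name__ == "__main__":
    ctx = ssl._create_unverified_context()  # lab use only
    si = SmartConnect(host="vcenter.example.com", user="administrator",
                      pwd="password", sslContext=ctx)
    try:
        nic_report(si)
    finally:
        Disconnect(si)
```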

I am waiting on the network team to confirm that the e1000 NICs were in fact losing actual packets and that esxtop wasn't simply being nutty. Once they confirm it, I plan on replacing the NICs on the other VMs as well. The vmxnet gives you better throughput, it is less CPU intensive, and according to esxtop it isn't losing packets the way the e1000 was. The e1000 was great until the vmxnet came around; I think it's about time we start implementing it. One thing I don't like is that the vmxnet NIC appears as a removable object in your system tray, like a USB drive would. I am sure there must be a way to fix that, I just haven't figured it out yet.
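
For the bulk replacement, here is a rough pyVmomi sketch of the reconfigure step, assuming a vmxnet3 target adapter (an assumption on my part; pick whichever vmxnet generation your guests support). Changing the adapter type is really a remove-and-add, so the guest sees a brand-new NIC and any static IP settings have to be reapplied; plan for a maintenance window.

```python
# Hypothetical sketch: replace a VM's e1000 adapter(s) with vmxnet3 on the
# same port group, carrying the MAC address over. Not a tested procedure.
from pyVmomi import vim

def swap_e1000_for_vmxnet3(vm):
    changes = []
    for dev in vm.config.hardware.device:
        if isinstance(dev, vim.vm.device.VirtualE1000):
            # Remove the existing e1000
            remove = vim.vm.device.VirtualDeviceSpec()
            remove.operation = vim.vm.device.VirtualDeviceSpec.Operation.remove
            remove.device = dev
            changes.append(remove)

            # Add a vmxnet3 with the same backing (network) and MAC address
            add = vim.vm.device.VirtualDeviceSpec()
            add.operation = vim.vm.device.VirtualDeviceSpec.Operation.add
            new_nic = vim.vm.device.VirtualVmxnet3()
            new_nic.backing = dev.backing
            new_nic.addressType = "manual"
            new_nic.macAddress = dev.macAddress
            new_nic.connectable = vim.vm.device.VirtualDevice.ConnectInfo(
                startConnected=True, connected=True)
            add.device = new_nic
            changes.append(add)
    if not changes:
        return None
    spec = vim.vm.ConfigSpec(deviceChange=changes)
    return vm.ReconfigVM_Task(spec=spec)
```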

EDIT: 11/24

VMware released a patch in September to address this issue. It turns out the packet loss esxtop was reporting for e1000 NICs was not accurate.


vCPU worlds

While going through a series of esxtop, or should I say resxtop, stats, I noticed that certain VMs had a higher count of worlds associated with them. Since I was researching a performance issue with some of the VMs in the environment, it only made sense to look into this further. Typically a vSphere VM with a single vCPU shows three worlds (VMX, MKS, and the vCPU), but a couple of my single-vCPU VMs showed four; the additional world was "Worker" in one instance and "psharescan" in the other. Google didn't help much, and neither did the endless PDFs I went through trying to figure out their purpose. However, after vMotioning these troublemakers to a different host, the world count dropped back to three. It never went back up, but not knowing what those extra worlds were is driving me crazy. One day I will figure it out.

Moral of the story: vCenter may have improved over the years, and with 4.1 we now have more stats than ever, with storage and networking I/O and whatnot. Even so, esxtop/resxtop remains a very nifty tool in every VMware admin's toolkit. If you don't use it, get used to using it, because at some point you will find yourself reaching for it.