Performance Issues with Networking on ESXi and UCS B200 / 6100
I recently worked on an environment that had a series of issues with its virtual infrastructure, and I thought I’d share the outcome in case someone else finds it helpful.
The environment was running UCS B200 servers and UCS 6100 Fabric Interconnects. NFS storage was used for some volumes, and at times vCenter (5.0) recorded very high levels of storage latency. The vmkernel log confirmed this, showing intermittent loss of NFS mount points and path failures.
One thing we came across is this Cisco document, which explains that some UCS servers have issues with Interrupt Remapping. Interrupt Remapping can be disabled in the BIOS and in vSphere, but in this case the UCS BIOS was upgraded to a current release, which noticeably improved the environment.
The other item we found is the following VMware KB article, which explains that the network load balancing policy “Route based on IP hash” is NOT supported on UCS B200 servers with UCS 6100 Fabric Interconnects. As NFS-based storage uses IP as its transport, this could explain some of the latency that was observed. From the KB article:
When enabled, the NIC teaming policy Route based on IP hash involves a team of at least two NICs that selects an uplink based on a hash of the source and destination IP addresses of each packet. Host network performance might degrade if Route based on IP hash is enabled on ESX or ESXi because cross-stack link aggregation, or grouping of multiple physical ports, on UCS 6100 Series Fabric Interconnects deployed as a redundant pair is not supported. As a result of the network performance degradation, you may see intermittent packet loss and the vSphere Client or vCenter Server might lose connection to the ESX or ESXi host.
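To make the uplink selection described in the KB excerpt a bit more concrete, here is a minimal Python sketch of an IP-hash style choice. It is a simplified model rather than VMware's actual implementation: the function name select_uplink, the XOR-then-modulo formula, and the sample addresses are all assumptions made for illustration.

```python
import ipaddress


def select_uplink(src_ip: str, dst_ip: str, uplink_count: int) -> int:
    """Pick an uplink index from a hash of the source and destination IPs.

    Simplified model (an assumption, not the exact ESX/ESXi algorithm):
    XOR the two IPv4 addresses as integers, then take the result modulo
    the number of uplinks in the NIC team.
    """
    if uplink_count < 1:
        raise ValueError("uplink_count must be at least 1")
    src = int(ipaddress.IPv4Address(src_ip))
    dst = int(ipaddress.IPv4Address(dst_ip))
    return (src ^ dst) % uplink_count


if __name__ == "__main__":
    # Hypothetical addresses: one VMkernel port talking to two NFS targets.
    vmk = "10.0.0.11"
    for nfs_target in ("10.0.0.50", "10.0.0.51"):
        uplink = select_uplink(vmk, nfs_target, uplink_count=2)
        print(f"{vmk} -> {nfs_target}: uplink {uplink}")
```

In this model the same host can send traffic out of either uplink depending on the destination address. That is fine when both uplinks land on a single logical port channel, but per the KB article the two UCS 6100 Fabric Interconnects in a redundant pair cannot provide that kind of cross-stack aggregation.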
The net result of both changes is that the storage performance issues were eliminated and the environment began functioning as intended.
On a quick side note, this also makes a case for the value of converged infrastructure and reference architectures. Systems are increasingly complex, and when you build something on your own it is very easy to run into issues like this one. Converged infrastructure and reference architectures can help by providing a blueprint of which components work together and under what conditions. When you use one of these solutions, you can be confident that you’re not the only one operating from your blueprint and that a significant level of engineering and testing was invested in your architecture. Additionally, if an issue is discovered, you can be proactively notified of the risks and of the changes recommended to mitigate them.