HBA Best Practices with vSphere 4.1 (updated)
KB article 1030265 got my attention — it describes a potential issue in vSphere 4.1 where HBAs and PCI devices can stop responding. The article currently doesn’t detail the exact circumstances that are known to cause this problem but the workaround does reveal what I think are some good practices for HBAs in general.
To work around this issue, ensure that you have at least two HBAs in each host and that those HBAs are on different IRQs. You can check this by reviewing /proc/vmware/interrupts.
In my ESX host designs I have always preferred to use at least two HBAs where possible, to eliminate the HBA card as a potential single point of failure.
The other issue that is sometimes overlooked is hardware interrupts. To ensure the best performance and availability, always make sure different IRQs are used for each HBA. For example:
cat interrupts |grep qla2xxx|cut -c270-330
0 <COS irq 19 (PCI level)>, VMK qla2xxx
0 <COS irq 20 (PCI level)>, VMK qla2xxx
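If you have more than a couple of HBAs, eyeballing the output gets tedious. Here is a rough sketch of a helper that flags any IRQ serving more than one qla2xxx HBA. The function name shared_hba_irqs is my own invention, and it assumes every relevant line contains "irq <number>" plus one "qla2xxx" per HBA on that IRQ, as in the sample output above. I have not confirmed that layout on every ESX build, so treat it as a starting point.

```shell
# Sketch only: print each IRQ that is used by more than one qla2xxx HBA.
# Assumption: input lines resemble "0 <COS irq 19 (PCI level)>, VMK qla2xxx".
# Hypothetical usage on classic ESX: shared_hba_irqs < /proc/vmware/interrupts
shared_hba_irqs() {
    awk '
        /qla2xxx/ {
            # Find the number that follows the word "irq" on this line.
            irq = ""
            for (i = 1; i < NF; i++)
                if ($i == "irq") { irq = $(i + 1); break }
            if (irq == "") next
            # gsub returns how many times the driver name occurs,
            # i.e. how many HBAs are listed on this line.
            line = $0
            count[irq] += gsub(/qla2xxx/, "", line)
        }
        END {
            for (irq in count)
                if (count[irq] > 1) print irq
        }
    ' | sort -n
}
```

No output means each HBA has its own IRQ. Any number printed is an IRQ shared by two or more HBAs and worth a closer look.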
UPDATE: If you have ESXi you will not be able to run the above command. Setom reported in a comment to this post the following:
It seems that the following command can be used for finding the IRQ in ESXi:
vmkvsitools hwinfo -p
Look at the 4th column of the output, which shows ISA/irq/Vec. The middle number should be the IRQ of the device.
Also I found this post at Malaysia Hypervisor which gives more background on the vmkvsitools command.
Another best practice mentioned in the article is to “ensure that you have alarms set to alert you if path redundancy is lost”. I know of a case where a VMware customer had issues on their SAN. Long story short, after the initial failure on controller A, no one noticed that the path to this controller was still down. The next weekend they went to do maintenance on controller B (which held the only active path) and brought the house down. To repeat the words from the KB article: ensure that you have alarms set to alert you if path redundancy is lost. The “cannot connect to storage” rule in vCenter 4 should by default include triggers for “degraded storage path redundancy” and more. Make sure that you are proactively monitoring for these events!