HBA Best Practices with vSphere 4.1 (updated)

KB article 1030265 got my attention: it describes a potential issue in vSphere 4.1 where HBAs and PCI devices can stop responding. The article currently doesn’t detail the exact circumstances known to cause this problem, but the workaround does reveal what I think are some good practices for HBAs in general.

To work around this issue, ensure that you have a minimum of 2 HBAs in each host and that those HBAs are on different IRQs. This can be determined by reviewing /proc/vmware/interrupts.

In my ESX host designs I always preferred to use at least 2 HBAs when possible to eliminate an HBA card as a potential single point of failure.

The other issue that is sometimes overlooked is hardware interrupts. For the best performance and availability, make sure each HBA uses a different IRQ. For example:

grep qla2xxx /proc/vmware/interrupts | cut -c270-330
0 <COS irq 19 (PCI level)>, VMK qla2xxx
0  <COS irq 20 (PCI level)>, VMK qla2xxx
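With more than a couple of HBAs, eyeballing that output gets tedious. Here is a rough sketch of how the comparison could be scripted on the ESX service console. The function name shared_irqs is my own invention, and it assumes the interrupt lines contain the text "irq <number>" as in the output above. It only compares HBAs that use the same driver with each other, so it will not notice an HBA sharing an IRQ with some other kind of device.

```shell
# Reads interrupt lines on stdin and prints every IRQ number that is
# used by more than one line mentioning the given driver.
# $1 is the driver name to look for (defaults to qla2xxx, as in the example above).
shared_irqs() {
    grep "${1:-qla2xxx}" \
        | sed -n 's/.*irq \([0-9][0-9]*\).*/\1/p' \
        | sort -n \
        | uniq -d
}
```

Something like `shared_irqs < /proc/vmware/interrupts` should print nothing when every HBA has its own IRQ, and print the offending IRQ number when two of them share one.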

UPDATE:  If you have ESXi you will not be able to run the above command.  Setom reported in a comment to this post the following:

It seems that the following command can be used for finding the IRQ in ESXi:

vmkvsitools hwinfo -p

Look at the 4th column of the output; it shows ISA/irq/Vec. The middle number should be the IRQ of the device.

Also, I found a post at Malaysia Hypervisor that gives more background on the vmkvsitools command.

Another best practice mentioned in the article is to “ensure that you have alarms set to alert you if path redundancy is lost”. I know of a case where a VMware customer had issues on their SAN. Long story short, after the initial failure on controller A, no one noticed that the path to this controller was still down. The next weekend they went to do maintenance on controller B (the only remaining active path) and brought the house down. I’ll repeat the words from the KB article: ensure that you have alarms set to alert you if path redundancy is lost. The “cannot connect to storage” rule in vCenter 4 should by default include triggers for “degraded storage path redundancy” and more. Make sure that you are proactively monitoring for these events!
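The vCenter alarm is the right tool here, but the underlying check is simple enough to sketch. The function below is only an illustration: the name devices_without_redundancy and the two-column input format ("<device> <state>", one path per line) are assumptions of mine, not the output of any VMware command, so you would need to massage your path listing into that shape first. It also counts only paths reported as "active"; on an active/passive array you would probably want to count standby paths as well.

```shell
# Reads "<device> <state>" lines on stdin (a hypothetical, simplified format)
# and prints every device that has fewer than two paths in the "active" state.
devices_without_redundancy() {
    awk '
        NF < 2 { next }                      # skip blank or malformed lines
        { seen[$1] = 1 }                     # remember every device we saw
        $2 == "active" { active[$1]++ }      # count its active paths
        END {
            for (dev in seen)
                if (active[dev] + 0 < 2)
                    print dev
        }
    ' | sort
}
```

In the story above, the device behind controller A would have shown up in this list a full week before the maintenance on controller B.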

8 Responses to HBA Best Practices with vSphere 4.1 (updated)

  1. Methone says:

    Silly question: how do you establish this alarm in Virtual Center?

    • Kevin says:

      Hi Methone and thanks for posting. It’s not a silly question at all actually.

      First of all, the default status for this alert is “unset” rather than “alert”. It needs to be “alert” if you want the host to change color in vCenter.

      The other issue is external monitoring systems. The default action is to send an SNMP trap. This needs to be changed so that the proper notification action takes place. More detail here: http://communities.vmware.com/docs/DOC-12145

      If I can find the time I’ll try to work through some of this in a more detailed post. Thanks!

  2. We have a few ESXi 4.1 hosts and, unfortunately, there is no way to get to the /proc/vmware/interrupts file, even in support mode.

    The KB article on VMware http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030265 does indicate that it pertains to ESX and ESXi, but does not discuss how to view which IRQ numbers are assigned to which HBA.

    Using resxtop on a vMA and selecting “i” allows you to view interrupts (as does vmkvsitools hwinfo -i within tech support mode), but it does not tell you what IRQs they are assigned to. Am I missing something?

    I have created an SR with VMware to ask their opinion.

    • Kevin says:

      Hi Matt — Thanks for posting!

      That’s a great question and you’re correct that resxtop doesn’t reveal this either. Neither does the CIM data (Health Status).

      Of course HW vendors should have something here (e.g. DRAC for Dell) but a universal VMware or PowerCLI command would be nice. I’ll look around, and please let us know what you hear from VMware! Thanks!

  3. douglas carson says:

    Nice article.
    Do you know of a good way through ESXi to find the IRQ settings?

  4. Kevin says:

    Hi Douglas…

    No, not yet. Obviously hardware vendors may be able to help here (e.g. DRAC from Dell) but at this time I’m not aware of a universal vSphere method that works in ESXi. A PowerCLI script may even work but it will be some time before I can dig into that.

    If I come across anything I will definitely update this post. Thanks!

  5. setom says:

    Unfortunately, one of my customers hit the problem in this KB. He is also running ESXi, so we cannot find the IRQ value by following the instructions in that article. It seems that the following command can be used for finding the IRQ in ESXi:

    vmkvsitools hwinfo -p

    Look at the 4th column of the output; it shows ISA/irq/Vec. The middle number should be the IRQ of the device.
