When CPU metrics with Hyperthreading, Monster VMs and VMware Make No Sense
CPU seems like such a simple thing, but in the age of virtualization, hyper-threading and vNUMA, it can get quite complicated. In fact, looking at some metrics can make you lose your mind until you realize what’s really going on. Let’s jump right into the original problem.
I encountered a VM with 16 vCPUs on a server with 16 physical cores (2 sockets x 8 cores). The VM was frequently triggering alarms in vROPS (vRealize Operations Manager), as at times it would sit at 90% or more for sustained periods while the host server showed just under 50% utilization. This is illustrated in the graph below.
Why would a VM with 16 vCPUs be at 80% when the 16-core host is only at 41%? What’s going on here?
The first things we will need to explore are hyper-threading, NUMA and how different CPU metrics in VMware are calculated.
Hyper-threading was intended to address wasted potential in the CPU. It does not change the physical resources (the CPU cores), but it allows more of those resources to be tapped by letting two threads be processed by the same execution resource simultaneously — with each physical core being an execution resource.
When hyper-threading is enabled, it doubles the number of logical processors presented by the BIOS. In this example, our 2-socket, 8-core-per-socket system with 16 cores total now presents 32 logical processors to VMware. Chris Wahl has an excellent post on this topic which I strongly encourage you to read, but for now I’m just going to “borrow” one of his graphics.
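The doubling is simple to express. Here’s a minimal sketch (the function name is mine, not a VMware API) showing how the logical processor count relates to the physical topology:

```python
def logical_processors(sockets, cores_per_socket, hyperthreading=True):
    """Logical processors the BIOS presents to the hypervisor.
    Hyper-threading doubles the count; physical cores are unchanged."""
    physical_cores = sockets * cores_per_socket
    return physical_cores * 2 if hyperthreading else physical_cores

print(logical_processors(2, 8))         # 32 with HT enabled
print(logical_processors(2, 8, False))  # 16 without
```

For our example host, that’s how 16 physical cores show up as 32 logical processors in VMware.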
VMware’s CPU scheduler can make efficient use of hyper-threading and generally it should be enabled. The number of logical processors now doubles, providing a performance benefit in the range of 10-15% in most vSphere environments (depending on workload/applications, etc.).
But what about our scenario of a host which has only 1 “monster VM”?
Sizing a Monster VM with Hyper-Threading
The general rule here is that you should not provision more vCPUs than the number of PHYSICAL cores the server has. In our scenario there are 32 logical processors presented due to hyper-threading, but only 16 physical cores. If we provision more than 16 vCPUs to the VM, it means that execution resources will now be shared within the VM. Now there are some exceptions here (test your workloads!), but it is generally recommended not to exceed the number of physical cores for this reason.
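That sizing rule can be sketched as a one-line check (hypothetical helper, not a VMware tool):

```python
def monster_vm_sizing_ok(vcpus, sockets, cores_per_socket):
    """True if a single 'Monster VM' stays within the host's physical
    core count, so its vCPUs never have to share an execution resource."""
    physical_cores = sockets * cores_per_socket
    return vcpus <= physical_cores

print(monster_vm_sizing_ok(16, 2, 8))  # True: 16 vCPUs on 16 physical cores
print(monster_vm_sizing_ok(24, 2, 8))  # False: vCPUs would share HT siblings
```

Again, there are exceptions depending on workload, but this is the conservative starting point.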
VMware has a blog post on this topic. What is their guidance?
VMware’s conservative guidance about over-committing your pCPU:vCPU ratio for Monster virtual machines is simple – don’t do it.
NUMA and vNUMA
In the interest of time I’m not going to go too deep here, but let’s just say NUMA is a technology designed to assign affinity between CPUs and memory banks in order to optimize memory access times.
vNUMA was introduced with vSphere 5.0, allowing this technology to be extended down to guest virtual machines.
The bottom line here is that the mix of virtual sockets and virtual cores assigned matters. As this article shows, processing latency can be increased if these settings are not optimal.
First, you’ll want to make sure that CPU hot add is disabled, as enabling it disables vNUMA in any virtual machine. Then you’ll want to make sure that your allocation of virtual sockets and virtual cores matches the underlying physical architecture, or you could be adding processing latency to your VM, as noted in this VMware blog post.
One more point here. There’s a setting in VMware called PreferHT. You can read about it here, but it basically changes the preferences in vNUMA. There’s no universal answer, as it will vary from application to application, but this setting is a trade-off between additional compute cycles and more efficient access to processor cache and memory via vNUMA. If your application needs faster memory access more than it needs compute cycles, you may want to experiment with this setting.
BACK TO OUR PROBLEM…
As it turned out, all of our settings here were optimal. We had one virtual socket with 16 cores – matching the 16 physical cores on the server – and vNUMA enabled. If you are using a Windows guest you can download Coreinfo.exe from Sysinternals and get more detail on how vNUMA is configured within your VM.
But we still don’t have an answer to our question – why is VM CPU at 80% when the host is at 41% given 16 physical cores (host) and 16 virtual cores (VM)?
Is it possible that not all the cores are being used? Let’s check — here is a graph from vCenter showing that all 32 logical cores are being used (and tracking within 10% of each other) but average CPU is 19% peaking at 27% over the past hour:
But then we look at the VM for the same time period and we see the same pattern except that CPU peaks at over 90% and averages 45%:
How can this be? The VM is triggering high utilization alarms when the host is at less than 50% utilization.
Let’s go to ESXTOP to get some additional metrics, but first we need to understand the difference between PCPU and “CORE” in ESXTOP:
So PCPU refers to all 32 logical processors while “CORE” refers to only the 16 physical cores.
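The difference between the two metrics is just the denominator. Here’s a sketch using hypothetical numbers chosen to mirror the esxtop output below — the same total CPU demand, averaged over 16 physical cores versus 32 logical processors:

```python
# Total demand expressed in "core-percent": 16 cores each ~78% busy.
total_demand_pct = 78 * 16

physical_cores = 16      # CORE denominator
logical_cpus = 32        # PCPU denominator (hyper-threading doubles it)

core_util = total_demand_pct / physical_cores  # CORE UTIL%
pcpu_util = total_demand_pct / logical_cpus    # PCPU UTIL%

print(core_util)  # 78.0
print(pcpu_util)  # 39.0 -- roughly the ~40% PCPU figure
```

Halving the denominator doubles the reported percentage, which is exactly the gap we’re chasing.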
Now let’s look at ESXTOP for this host.
Notice how CORE UTIL% is reported at 78% while PCPU UTIL% is only 40%. That’s a big difference! Which one is right?
If we look at the Windows OS, we see that CPU at that same instant was aligned with the CORE UTIL% metric:
It seems there are a couple of things going on here. First, the CORE UTIL% metric more accurately reflects utilization for THIS Monster VM scenario, as it averages across the 16 physical cores and not 32 logical processors. Second, the CPU utilization metrics we tend to rely on in vCenter and other tools seem to follow the PCPU (hyper-threaded) statistics and not the “core” utilization.
A few graphs to quickly illustrate this. First once more here’s CPU for both the host and the 1 VM that is on it as reported by vROPS 6:
Same pattern in both but the host is averaged across 32 logical processors while the VM is averaged across 16 vCPUs, which results in the numbers being almost double for the VM.
We can also see this by looking at MHz rather than percent utilization:
Without breaking down the math, the number of MHz consumed by the VM divided by the capacity of the host does align with the CORE UTIL% metric.
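To make that math concrete, here’s a sketch with assumed figures — the 2.6 GHz clock speed and the VM’s consumed MHz are my illustrative numbers, not values from the charts:

```python
core_mhz = 2600                 # assumed per-core clock speed
physical_cores = 16

host_capacity_mhz = core_mhz * physical_cores   # 41,600 MHz total capacity
vm_consumed_mhz = 32_000                        # assumed VM consumption

util_vs_physical = vm_consumed_mhz / host_capacity_mhz * 100
print(round(util_vs_physical, 1))  # 76.9 -- tracks CORE UTIL%, not PCPU%
```

Note that host capacity is computed from physical cores only; hyper-threading adds no MHz, which is why the percentage lines up with CORE UTIL%.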
One thing I could not figure out about this chart is why the host shows LESS MHz utilized. There should be no averaging here – just raw MHz consumed – so it’s escaping me why the host would show less consumed than the VM (not possible in raw MHz). If anyone has an answer for this, I’ll gladly update this post with attribution.
So if I’m using vROPS 6, what metric do I use to see actual core utilization without factoring for hyper threading? The documentation I must confess lost me a bit. Allegedly this metric exists but I couldn’t find it anywhere:
After some trial and error I did find a CPU Workload % metric which does appear to focus on the cores (no hyper-threading):
Again the pattern is identical except “Usage” (top) is averaged over 32 cores – not accurate for our scenario – and “Workload” (bottom) is averaged over the 16 physical cores. Here the Workload metric (bottom) gives a far more accurate picture which aligns with the VM level metrics. If we look at just the default Usage % metric we are left with the impression that the host has far more resources to give and that our vCPU allocation (or something else) may not be efficient, but that does not seem to be the case here.
So what would these metrics look like on a host with many workloads and no Monster VMs (more common)?
Different scales here which makes the bottom chart appear more volatile, but the gap between the two is not a doubling like we saw before. The numbers are much closer.
Now here’s a question that troubles me. The default CPU metrics in vSphere count all the logical cores but look at the peak above. If I looked at the default CPU graph, I’d think I was at 74% when the physical cores were actually at 88%. I can see how averaging across all logical cores can provide a better view of utilization, but it seems to me that the Workload metric (physical cores only) provides a better watermark for detecting bottlenecks.
We’ve jumped about a bit here but if you’re still with me let’s try to nail down some conclusions from all this:
- Hyper-threading does not increase execution resources, but in many cases it allows them to be used more efficiently depending on the workload (this benefit is often 10-15% in VMware environments).
- The default CPU metrics in vSphere are averaged across all logical cores, including those added by hyper-threading. This can produce confusing results when a single “Monster VM” is running on a host.
- VMware’s guidance is to NOT exceed the number of physical cores on a host when provisioning vCPUs to a Monster VM.
- In vROPS 6 the Workload % metric appears to only look at physical cores and thus may be a better indicator for CPU bottlenecks in some cases.
- vNUMA considerations including virtual to physical core allocation can impact performance.
As for our VM which was triggering CPU alarms, it appears that it is using an appropriate amount of resources on the host after all. Now there’s a possibility that we could experiment with more cores and possibly get better results, but the key is we can throw out the 50% disparity between host CPU utilization and VM utilization as bad data in this scenario.
And last but not least –
- Measuring CPU utilization is not nearly as simple as we had thought.
One final note — this is my interpretation of what I am seeing. If anyone can offer better guidance (especially corrections) to anything I’ve posted here, please do so and I will be glad to update the post with attribution.