The Risks of Thin Provisioning — and Solutions
Recently I was interviewing a candidate who mentioned his vSphere experience. I asked him what his favorite feature of vSphere 4.1 was and he said thin provisioning (technically this was possible in ESX 3.5 but was really introduced in 4.0). I then asked what some of the concerns and risks with thin provisioning might be.
I’ve found that not everyone is aware of these risks and concerns, or of the potential solutions. Virtualization Review also recently published a post on this topic, so I wanted to briefly cover two concerns and approaches for dealing with them.
The first concern is metadata updates. Every VMFS volume reserves a portion of itself for metadata: information about the files, blocks, and locks on the volume. The problem is that when metadata changes are made, a lock is placed on the entire LUN using a SCSI-2 reservation, which can significantly delay disk access for other ESX hosts supporting VMs on that volume. This is one reason it has long been a best practice to limit the number of active VMs on a VMFS volume.
Metadata updates occur during activities such as powering on a VM, vMotion, snapshot operations, and growing a thin provisioned disk (see VMware KB 1005009 for more details). When a VM needs to write to previously unallocated blocks, its thin disk must grow, which requires a metadata update and therefore a lock on the LUN. This can be especially bad if a number of VMs need to expand their thin disks at around the same time.
The good news is that this problem has been largely solved with vSphere 4.1 and a VAAI-enabled SAN. VAAI (vStorage APIs for Array Integration) enables hardware acceleration for several functions, including hardware-assisted locking (also known as Atomic Test and Set, or ATS). With ATS, the host updates the metadata using a single atomic SCSI command that locks only the affected blocks rather than the entire LUN, so other hosts are no longer locked out of the LUN during metadata updates. The bottom line is that the negative aspects of metadata updates, specifically LUN locking, are eliminated when VAAI is used.
How do you enable VAAI? Three simple requirements:
1) Run vSphere 4.1
2) Use a SAN whose firmware supports VAAI
3) Make sure that the VMFS3.HardwareAcceleratedLocking advanced parameter is enabled (set to 1). This can be changed without a reboot.
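As a sketch of step 3, assuming the classic ESX service console and the esxcfg-advcfg utility (the exact path and parameter name may vary by build, so verify against VMware's documentation; the same setting can also be changed in the vSphere Client under Configuration > Advanced Settings):

```shell
# Check the current value: 1 = enabled, 0 = disabled
esxcfg-advcfg -g /VMFS3/HardwareAcceleratedLocking

# Enable hardware-assisted locking; takes effect without a reboot
esxcfg-advcfg -s 1 /VMFS3/HardwareAcceleratedLocking
```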
It’s really that simple. Use VAAI and all those concerns surrounding metadata updates and LUN locking are pretty much gone. Of course you’ll also want VAAI for hardware-accelerated zeroing and copying (which speed up template deployment and Storage vMotion).
STORAGE CAPACITY MANAGEMENT
Generally it has been a best practice to fill VMFS volumes to only around 70 to 80% of capacity. You need to maintain some free space for snapshot delta files, vMotion, and several other reasons. Does using thin provisioning change the considerations here? Yes!
Thin disks will expand whenever a VM needs to write to previously unallocated blocks. You could even encounter a “perfect storm” in which a number of VMs need to grow their thin disks at the same time (e.g., a service pack or application update being pushed out). You need to prepare for the possibility that these disks may grow quickly with little notice, and you may need to react fast to avoid capacity issues.
In other words, a 20% buffer may no longer be adequate. How much more should you buffer? That really depends on the expected growth rate. Most web servers are fairly static from a storage perspective, while databases and other servers can be quite dynamic.
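One rough way to reason about the buffer, sketched below with hypothetical numbers: the free-space cushion should at least cover the thin-disk growth you could see in the time it takes your team to notice an alarm and react.

```python
# Back-of-envelope buffer sizing (illustrative numbers, not a formula
# from VMware): keep enough free space to absorb expected growth over
# your reaction window, with a safety factor for burst events such as
# patch rollouts.

def required_buffer_pct(daily_growth_pct, reaction_days, safety_factor=2.0):
    """Extra percentage of the volume to keep free, given an expected
    growth rate (percent of volume per day), the days needed to react,
    and a multiplier for bursty growth."""
    return daily_growth_pct * reaction_days * safety_factor

# A volume growing ~1% per day, with a 5-business-day reaction window,
# doubled for safety, suggests roughly 10% of extra headroom.
extra = required_buffer_pct(1.0, 5)
print(extra)
```

The point of the sketch is simply that the right cushion is a function of growth rate and reaction time, not a fixed percentage.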
You can create alarms in vCenter for various capacity levels on VMFS volumes, but what is really helpful is to trend your storage growth over time. Products like VMware vCenter CapacityIQ or Quest vFoglight can do just this, giving you trending for VM and VMFS consumption over time (and even alarms based on that trending).
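To make the idea of trending concrete, here is a minimal sketch of the kind of projection such tools automate: fit a line to recent usage samples and estimate how many days remain until the volume crosses a capacity threshold. The function name and sample numbers are illustrative, not taken from any product.

```python
# Hypothetical trending sketch: least-squares fit of daily datastore
# usage samples, then project when usage crosses a threshold.

def days_until_threshold(used_gb_samples, capacity_gb, threshold_pct=80.0):
    """Fit a line to one-per-day usage samples (GB) and return the
    estimated days until usage reaches threshold_pct of capacity.
    Returns None if usage is flat or shrinking."""
    n = len(used_gb_samples)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(used_gb_samples) / n
    denom = sum((x - x_mean) ** 2 for x in xs)
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(xs, used_gb_samples)) / denom  # GB per day
    if slope <= 0:
        return None
    target_gb = capacity_gb * threshold_pct / 100.0
    return max(0.0, (target_gb - used_gb_samples[-1]) / slope)

# Seven daily samples on a 500 GB volume growing ~5 GB/day:
samples = [330, 335, 341, 345, 351, 355, 360]
print(days_until_threshold(samples, 500))  # days until 80% (400 GB) is hit
```

An alarm built on a projection like this ("fewer than N days of headroom left") gives you reaction time that a simple percent-full alarm cannot.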
The bottom line is that you need to take several things into consideration before deploying thin provisioning, and I would recommend the following be considered at a minimum:
- Increase the VMFS buffer (free space cushion) to at least 30%, and perhaps more based on your understanding of the data growth rates in your environment (of course this chips away slightly at the storage savings).
- Make sure you have capacity alarms on your volumes (whether vCenter or something else) that give you ample time to react to changes.
- Use capacity trending to gain more insight into your growth rates, and perhaps build alarms around them.
Thin provisioning can still be done effectively and with significant benefits, but it does require an organization to change its monitoring processes and to put operational teams in a better position to react to the more dynamic expansion of thin disks.
HOW MUCH CAN I SAVE FROM THIN PROVISIONING?
If you haven’t already, you should download and experiment with vEcoShell, which is something of a PowerShell GUI for vSphere (similar to Quest PowerGUI, but focused on virtualization). It’s a great tool, and one of the built-in queries, called “Wastefinder,” will quickly give you an estimate of how much white space could be reclaimed with a 20% buffer built in. The math here is designed to work with Quest’s vOptimizerPro product, but it can also give you a quick idea of how much space you could gain from thin provisioning.
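The underlying arithmetic is easy to sketch yourself. The following is a back-of-envelope version of that kind of estimate (my own hypothetical sketch, not Wastefinder's actual query): for each thick disk, the reclaimable space is the provisioned size minus the used space, keeping a 20% buffer of headroom on top of what is used.

```python
# Hypothetical white-space estimate: how much space thin provisioning
# could free per disk, keeping a growth buffer above current usage.

def reclaimable_gb(provisioned_gb, used_gb, buffer_pct=20.0):
    """Estimated GB freed by thin provisioning one disk, after
    reserving buffer_pct of the used space as headroom."""
    kept = used_gb * (1 + buffer_pct / 100.0)
    return max(0.0, provisioned_gb - kept)

# (provisioned, used) in GB for three example VMs:
vms = [(100, 30), (200, 150), (80, 75)]
total = sum(reclaimable_gb(p, u) for p, u in vms)
print(total)
```

Note the third VM contributes nothing: its used space plus buffer already exceeds the provisioned size, which is exactly the case where thin provisioning buys you little.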