I’ll say it at the beginning of this post and again at the end: stop getting caught up in the vendor hype that their storage solves boot storms, and start worrying about the write IO created by virtual desktops, login storms, and application streaming. Disk is the main reason VDI projects fail or never get off the ground. If you have more than a couple hundred hosted virtual desktops, you’re going to need to understand disk IO: the IO profile of your virtual desktops, the applications and events that cause it, the peaks, and how your storage system handles it.
I’m quite surprised by the number of vendors talking about how they solve boot storms or have really high read IOPS capability in a hosted virtual desktop/VDI environment. In the real world most people don’t worry about boot storms because we boot up VMs well before users arrive. Not to mention that storage manufacturers put cache on their controllers, which holds the blocks of frequently accessed data. Does this IO still show up when you look at the hypervisor and the VMs? Yes, but that doesn’t mean it results in a read IO from physical disk.

The other point that conveniently goes unmentioned, presumably because it’s not something all storage vendors can address, is the percentage of IO in a VDI environment that is write IO. In many cases the write IO is as much as or more than the read IO, and as you’ll read below, write IO is more expensive to handle with RAID. On top of that, the cache that works so well when the same physical blocks are read over and over is left barely hanging on for writes: the write IO coming into cache is unique and must be destaged to disk quickly, or we risk filling the cache, forcing a flush, and sending new incoming IO straight to disk (BAD).
There are great reads out there on VDI, IOPS, and the importance of addressing write IO in VDI. Read the articles below, then let’s move on.
In order for me to explain this fully I think it’s first necessary to revisit IOPS and disk drives.
Let’s start with the math of IOPS. Disk drives, whether solid state or of the spinning platter variety, all have a limited number of I/Os per second that they are capable of. The generally accepted numbers are for random read or write performance. I’m referencing numbers listed on Wikipedia, but you’ll see figures that vary from this depending on the vendor. http://en.wikipedia.org/wiki/IOPS
| Drive | Type | Approx. IOPS |
|---|---|---|
| 7,200 rpm SATA drives | HDD | ~75-100 IOPS |
| 10,000 rpm SATA drives | HDD | ~125-150 IOPS |
| 10,000 rpm SAS drives | HDD | ~140 IOPS |
| 15,000 rpm SAS drives | HDD | ~175-210 IOPS |
Now that we know the estimated available IOPS per drive, we need to factor in the RAID penalty. For many this may be something you were not aware of. The RAID penalty is the reason behind the advice you’ve heard to put your log files on RAID 1 or RAID 10: as you can see in the table below, the write penalty is lower on RAID 1 or 10, and log files are typically heavy write IO. So to get the most IOPS out of your drives, you want to optimize the RAID type for the level of redundancy and the type of IO involved. The total IOPS your storage system can handle is a function of the speed of the drives, the RAID level being used, and the split of read vs. write IO (the table and the short sketch after it walk through the math). Let’s say you have 10 drives averaging 150 IOPS each for a total of 1,500 IOPS. Using RAID 5, those drives can handle 1,500 read IOPS but only about 375 write IOPS, because each write costs four back-end IOs. Notice how much less write IO those drives can handle? Are you beginning to understand the scope of the problem with VDI and disk IO?
| RAID level | Read penalty | Write penalty |
|---|---|---|
| RAID 1 and 10 | 1 | 2 |
| RAID 5 | 1 | 4 |
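To make that math concrete, here’s a minimal sketch in Python. The drive count, per-drive IOPS, and read/write mixes are just illustrative assumptions, not measurements from any particular array; it simply applies the write penalties from the table above.

```python
# Rough estimate of usable IOPS once the RAID write penalty is applied.
# The drive count, per-drive IOPS, and read/write mix below are
# illustrative assumptions, not measurements from any particular array.

RAID_WRITE_PENALTY = {
    "RAID 1/10": 2,  # two copies of every write
    "RAID 5": 4,     # read data + read parity + write data + write parity
}

def functional_iops(drives, iops_per_drive, raid_level, write_pct):
    """Estimate the front-end IOPS a RAID group can absorb.

    Each front-end write costs `penalty` back-end IOs, so the raw IOPS
    of the spindles are shared between reads (penalty 1) and writes.
    """
    raw = drives * iops_per_drive
    penalty = RAID_WRITE_PENALTY[raid_level]
    read_pct = 1.0 - write_pct
    return raw / (read_pct + write_pct * penalty)

# The 10-drive, 150 IOPS-per-drive example from the text:
print(functional_iops(10, 150, "RAID 5", write_pct=0.0))  # 1500.0 -- all reads
print(functional_iops(10, 150, "RAID 5", write_pct=1.0))  # 375.0  -- all writes
print(functional_iops(10, 150, "RAID 5", write_pct=0.5))  # 600.0  -- 50/50 mix
```

Run against the 10-drive example above, it reproduces the 1,500 read / ~375 write IOPS figures and shows how quickly a heavier write mix eats into the usable IOPS.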
So what are we looking for in a disk solution for VDI? At scale we need a solution that can handle read and write IO in a tier of SSD. There are many solutions on the market that use SSD as a cache, some for read only, some for both read and write. An SSD cache that handles both read and write IO is fine so long as the underlying disk can keep up during busy times (typically periods of heavy logins or application streaming). If the cache can’t keep up, we end up going straight to physical disks that are probably already under heavy IO, which is exactly why the cache couldn’t destage to them fast enough in the first place. The quick check below illustrates the math.
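Here’s a rough back-of-envelope version of that scenario, again in Python with made-up numbers: if the write IOPS arriving during a login storm exceed what the disk tier can destage, the backlog either fits in cache or spills to disk.

```python
# Back-of-envelope check: can the spinning-disk tier drain the write
# cache faster than a login storm fills it?  All inputs are illustrative
# assumptions; substitute your own measured numbers.

def cache_survives_burst(burst_write_iops, burst_minutes,
                         destage_iops, cache_capacity_ios):
    """Return True if the cache absorbs the burst without filling up.

    burst_write_iops   -- front-end write IOPS arriving during the storm
    burst_minutes      -- how long the storm lasts
    destage_iops       -- back-end write IOPS the disk tier can sustain
                          (already adjusted for the RAID write penalty)
    cache_capacity_ios -- how many outstanding writes the cache can hold
    """
    backlog_per_sec = burst_write_iops - destage_iops
    if backlog_per_sec <= 0:
        return True  # the disks keep up and the cache never grows
    backlog = backlog_per_sec * burst_minutes * 60
    return backlog <= cache_capacity_ios

# e.g. a 30-minute login storm pushing 8,000 write IOPS against a disk
# tier that can destage 2,500 IOPS, with room for ~2M writes in cache:
print(cache_survives_burst(8000, 30, 2500, 2_000_000))  # False -> IO spills to disk
```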
In my opinion the easiest thing for IT is a disk solution that can keep all of the heavy read and write IO in a tier of SSD in the array. These solutions don’t require cache warming, since the frequently accessed read blocks are, by definition, probably already sitting in SSD. They are also more easily understood and implemented. They should have enough SSD to sustain the day’s write IO in that tier and then tier the data down to slower disk if desired (a rough sizing sketch follows below). The downside to this approach? It’s usually expensive, hence the reason the VDI storage marketplace is so fragmented in its approach to solving the VDI IOPS issue.
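As a rough illustration of what “enough SSD for the day’s write IO” might mean, here’s a hedged sizing sketch. Every input (desktop count, write IOPS per desktop, IO size, working hours, headroom) is an assumption you’d replace with your own assessment data, and since it counts total write traffic rather than unique blocks, it’s a conservative upper bound.

```python
# Rough sizing for "hold the day's writes in SSD".  Every value here is
# an assumption for illustration; plug in data from your own assessment.

def ssd_tier_gb(desktops, write_iops_per_desktop,
                io_size_kb=4, hours=10, headroom=1.3):
    """Gigabytes of SSD needed to absorb a working day of write IO."""
    writes_per_day = desktops * write_iops_per_desktop * hours * 3600
    gb = writes_per_day * io_size_kb / (1024 * 1024)
    return gb * headroom  # leave room for tiering and garbage collection

# 1,000 desktops averaging 8 write IOPS each, 4 KB IOs, a 10-hour day:
print(round(ssd_tier_gb(1000, 8)))  # ~1428 GB as a starting point
```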
There are other solutions on the market which use SSD for read and lay down their write IO sequentially on slower spinning disk (which results in higher sustained IOPS than random writes would). I haven’t seen any performance testing for these types of solutions that establishes the write IO they can sustain.
There are also some newer solutions that use RAM as a storage tier and then optimize the stream of IO headed to your existing storage.
I’m not going to make a recommendation on the storage you should choose; just make sure your storage solution handles write IO, and make sure you fully understand and test it prior to production deployment.
Regardless of the solution you choose, the biggest takeaway should be this: stop getting caught up in the vendor hype that their storage solves boot storms, and start worrying about the write IO created by virtual desktops, login storms, and application streaming.