If you haven't met me, my name is Felipe. I work for Citrix on the XenServer engineering performance team, with a focus on storage performance. I've been doing a lot of work on storage performance, and there's only so much you can cover in a 20-25 minute talk, so what I wanted to talk about today is what I've recently been encountering while working with very low latency storage devices — newer SSDs and things like that. What happens in terms of virtualization overhead when you're using this kind of storage? I'll then review very quickly how blkfront, blkback, blktap2, blktap3, qdisk and so on work, in case you get confused by the terminology, and share some of my recent measurements, explaining how I've been measuring things and what we're doing to improve on that. This is from both a Xen and a XenServer perspective.

One of the first things that comes up a lot — probably everyone has done this — is that when people first install any kind of Xen environment and get a VM running, they go into dom0 (if you're familiar with XenCenter, this is XenCenter on a dom0 console) and run some sort of dd to see what kind of throughput they get. It's a very simplistic test. Here you can see I'm reading at about 118 megabytes a second; then you do the same thing in your VM, and in this case I get 117 megabytes a second. So it's all good, right? I can quit. Well, it turns out it's not always like that. Other people have different types of disks: in this case you go into dom0 and measure almost 700 megabytes a second, and then you go to a VM and measure 300 megabytes a second. Of course people will argue that with aggregate workloads it doesn't work like that — it depends, of course — but as I said, this is one use case.

So let me go over very quickly how storage requests traverse the virtualization stack, and then discuss some of my measurements. As you know, there are many different ways a user application can do I/O. A very simple one is through read() and write(). If you have a user process with a buffer and a file descriptor — in this example opened with O_DIRECT, so it bypasses any caching — and you do a read, that goes into a syscall handler, bounces around in the kernel for a bit, and ends up in the block layer, where a BIO is placed in the request queue for the block device. You kick the driver, the driver passes your request to the device, and eventually a hardware interrupt completes your request; in the end your buffer has the data.

The simplest case with Xen — let me work some of my animation magic here; when my manager saw this he said, "oh, that's why I'm paying you: for the PowerPoint animations" — is what we call the upstream Xen case, what you get if you just install it. We replace that device driver in the VM with something called blkfront, and we run something called blkback in dom0. blkback picks up the requests from blkfront, passes them to the block layer, and on to your device driver. So it's basically just that. In XenServer we do things a little differently, and if you use QEMU with qdisk you might see something similar. We have certain features, like thin provisioning, cloning and snapshotting, which are supported in the software stack.
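To make that plain read path concrete, here is a minimal sketch of an O_DIRECT read — my illustration, not from the talk; the device path /dev/xvdb is a placeholder. Opening with O_DIRECT bypasses the page cache, so each read() becomes a BIO queued for the device driver, which is exactly the path described above:

```c
/* Minimal O_DIRECT read sketch (illustrative; /dev/xvdb is a placeholder). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 1 << 20;   /* one 1 MiB request, like the dd test */
    void *buf;

    /* O_DIRECT requires the buffer (and the size/offset) to be suitably
     * aligned; 4096 bytes is a safe alignment for most devices. */
    if (posix_memalign(&buf, 4096, blksz))
        return 1;

    int fd = open("/dev/xvdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* This read bypasses the page cache: it becomes a BIO in the block
     * layer, the driver is kicked, and a hardware interrupt completes it. */
    ssize_t n = read(fd, buf, blksz);
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes straight from the device\n", n);

    close(fd);
    free(buf);
    return 0;
}
```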
So what we do is store the data in VHD, but blkback doesn't talk VHD. So we introduced a user-space software component called tapdisk, and it got a little tricky, because we cannot get the block layer to talk to a user-space component like that. What we did at the time — and this is just how it was done historically — was to introduce a new component here called blktap2. blktap2 provides a block device that blkback can talk to through the block layer; blktap2 talks to tapdisk, and tapdisk completes the request through libaio. So the data path is a little longer, and of course there's a little added latency, but we can then work with VHD and do storage migration and things like that using this model. This was done that way long before people started thinking about SSDs and the concerns that come with them.

There is a slightly different use case, which is the one with QEMU qdisk and blktap3. What we do there is get rid of blktap2 and replace tapdisk with QEMU. The only real difference is that QEMU doesn't use libaio; it uses POSIX AIO, which is a little different — more portable, with other differences — and we get this user process to talk straight to blkfront through the gntdev and evtchn devices, in order to do the grant mapping, get interrupted, and interrupt blkfront back. So that's about it.

To measure these things, what I did was grab three different types of disks. The first one you see there is from Dell: a Seagate SAS disk, a 15K RPM enterprise-class hard drive. The second is a Dell SSD, and the third is an Intel SSD, which is pretty fast. I grabbed two of each, created three RAID 0 sets, and put them into the same box, so I had three different volumes connected through a PERC H700. Then I measured reads while varying my block size — from one sector all the way to four megabytes, on a logarithmic scale — and observed what throughput I would get in megabytes a second; here, bigger is better.

What I could see is that as I grew the block size per request up to 32K, the throughput kept going up, after which point I had reached the maximum throughput this SAS disk could offer, and the throughput doesn't change any more no matter how much I increase my request size. That's the blue line. The green line is the Dell SSDs, the ones in the middle of the previous picture. You see they get faster up to 128K, and then they drop. What happens here is that the device driver for that controller advertises the maximum request size as 128K. At that point the block layer starts sending one request of 128K and then another one of — depending on what I was using here — another 4K; it's splitting the requests up. That configuration was sensitive to this, and that's the green line. The red line, the Intel SSDs, just carries on getting faster. There is a little bit of a bump at 128K again, but it's less sensitive, and then it just carries on. This was measured on top of Xen 4.3, on CentOS 6.4 with a 3.11 kernel.
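As an aside on that 128K split: the limit the driver advertises is visible in sysfs, so you can check it yourself. A quick sketch — "sda" is a placeholder device name; the attribute is the standard max_sectors_kb under the device's queue directory:

```c
/* Read the maximum request size the block layer will issue to a device.
 * Requests larger than this get split before reaching the driver, which
 * is what caused the dip in the green line at 128K. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/sys/block/sda/queue/max_sectors_kb", "r");
    if (!f) { perror("fopen"); return 1; }

    unsigned kb;
    if (fscanf(f, "%u", &kb) == 1)
        printf("max request size: %u KiB\n", kb);

    fclose(f);
    return 0;
}
```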
Now I'm going to pick two of these — the blue line with the SAS disks and the red line with the fast Intel disks — and compare them from a virtualization perspective. So take the blue line, and then I run a VM. This was a Debian Wheezy VM with all the vCPUs pinned — it's a very big box, but I kept everything in the same NUMA node — and I ran my measurements from the guest. What I see is the purple line here: everything is slower up to somewhere just over 64K, maybe just over 100K, at which point the throughput I get in the guest is the same. This matches what you saw on those first two slides, when I was doing dd from XenCenter: we measure with one-megabyte block sizes in dom0, measure again in the VM, and we don't see any overhead. This is the plot that tells you exactly why.

Then I add measurements from a disk that has been plugged using blkfront from a 3.2.0 kernel. The difference here is that I didn't use 3.11, because 3.11 has persistent grants implemented — I'll talk about that a little later; I just wanted to show what we would normally see — and blkback was plugged directly into the block layer, which is what I was describing as the upstream Xen case. If I plug more disks into this VM through tapdisk and blktap2, and through qdisk, what we see is that the throughput doesn't really get that high. We never really match the throughput we were observing with blkback, or with bare metal, or even from dom0. There are many reasons for this: there's some CPU overhead, there's the longer data path, there's some added latency. I'll talk about why this happens later as well. So this is what happens in this configuration using those SAS disks.

Now I'll show you what happens if this is measured on top of the Intel SSDs, which is the red line now. You see that blkback was sensitive to that bump at 128K as well. But even blkback is unable to be fast enough to match the SSDs. So if someone sets up this configuration, or a similar class of very fast enterprise storage, and tries blkback directly or some user-space alternative, they're going to see a massive performance hit. I can even flick between these slides, and you can see how blkback and the user-space paths don't really get any faster at that point, while the SSDs can just go faster.

So the problem now is to understand what is actually going on. Why is it not faster? Where are those bottlenecks, and what can we do about them? What I decided to do was stop looking at this in terms of throughput. Throughput, as you know, is just the amount of data you can transfer in a fixed amount of time. If we invert that, what we get is the amount of time it takes to transfer a certain amount of data. And if we pass just a single request of a given size to the guest and back — not sending anything in parallel, not breaking up requests — we can actually measure the time it takes for the data to travel around, and try to work out where things are getting slow.

So I plot a portion of that same graph as before — the colours have changed and the scales have changed. I'm varying the request size up to 128K, measuring at every 4K, and on the y-axis I now have the latency it takes to serve those requests, so now lower is better. I can see that from dom0 I can still serve requests pretty fast: maybe 50 microseconds at the beginning with very small requests, going up to just over 100 microseconds. And then you see what happens with blkback — you can even see that glitch at 128K at the end — and what happens with my user-space alternatives.
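Here's a minimal sketch of that depth-1 latency measurement — again my own reconstruction, with /dev/xvdb as a placeholder and clock_gettime instead of the TSC for simplicity. One O_DIRECT read is issued at a time, so nothing queues up and each number is the full round trip for a single request:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const size_t sz = 4096;          /* one 4 KiB request at a time */
    void *buf;
    if (posix_memalign(&buf, 4096, sz))
        return 1;

    int fd = open("/dev/xvdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Sequential reads at I/O depth 1: the next request is only issued
     * after the previous one completes, so latencies never overlap. */
    for (int i = 0; i < 10; i++) {
        double t0 = now_us();
        if (pread(fd, buf, sz, (off_t)i * sz) != (ssize_t)sz) {
            perror("pread");
            break;
        }
        printf("request %d: %.1f us\n", i, now_us() - t0);
    }

    close(fd);
    free(buf);
    return 0;
}
```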
So now we can see: oh, this is how much slower it actually is to serve requests of that particular size. What we need to work out is how much time we're losing at each of these stages. What am I doing during those 50 microseconds? And what am I doing during those 300-400 microseconds that take so much longer? What I did was insert trace points along the code path that serves those requests, using rdtsc. The timestamp counter ticks at a fixed, constant rate and is consistent across cores, so I can read it from different domains and have a solid reading of how much time passed between two particular pieces of code being executed.

And I observed a few things. If you're passing lots of requests through, the first ones you pass are usually slow, so you need some technique where you can trace lots of requests and see what happens after the hot path has warmed up. To do that in the kernel, I used trace_printk, so I could put everything into a trace buffer and then look in debugfs for the actual timestamps I'd read. In user space, within the application I was measuring — if I was using something like tapdisk or qdisk in user space — I just made a big static buffer and put all my TSC readings in there; when I was done, I sent it a signal to print them out. I didn't want to be printing them while on the hot path, otherwise I would add overhead. And I confirmed that the runs with my timestamps in were pretty much equivalent to other measurements that didn't have any timestamps inside, so I knew I was not adding any sort of overhead by measuring this way.

I would run 100 requests through, reading all of them sequentially, with an I/O depth of one, so requests were not piling up behind each other. I looked at all the times, sorted them, eliminated the fastest and the slowest ones, and averaged the actual times for the middle 80 requests that got served. And I repeated this ten times. By repeating it ten times I saw some interesting things: sometimes you start a burst of 100 requests and notice that blkback has been scheduled on a particular CPU, or a particular vCPU of the range that was pinned, and you observe some differences depending on the affinity of the path the requests were going through and coming back on. Of course, whenever anything looked a little weird, I could go back to the raw data, before anything was cut out, and try to investigate what was going on.

Let me show you where I was putting those traces. If you remember this path for blkback, I started putting traces everywhere along the path the request actually travels, so I can measure things like the latency of the context switch, the latency of the event channel, the latency to schedule the blkback kernel thread, and so on, up to the point where I send the request to the block layer. And then I could measure everything else on the way back, when the request was completing, all the way back to my application again. You can look at the slides later and examine those 12 data points. And I could plot that: all of these lines correspond to the points you saw before, and what we see here is that this big orange line is the actual time we spent in the device, and the other lines are the time spent doing everything else — getting the request there and then back.
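The user-space half of that tracing looks roughly like this — a sketch of my own, not the speaker's actual harness: read the TSC into a static buffer on the hot path (no printing until afterwards), then sort, drop the ten fastest and ten slowest of the 100 samples, and average the middle 80:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc() */

#define NSAMPLES 100

/* Static buffer: recording a timestamp is just a store, so the hot path
 * is barely perturbed while measurements are being taken. */
static uint64_t delta[NSAMPLES];

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    for (int i = 0; i < NSAMPLES; i++) {
        uint64_t t0 = __rdtsc();
        /* ... issue one request and wait for its completion here ... */
        delta[i] = __rdtsc() - t0;
    }

    /* Trimmed mean: sort, drop the 10 fastest and 10 slowest samples,
     * and average the middle 80 -- this discards warm-up outliers. */
    qsort(delta, NSAMPLES, sizeof(delta[0]), cmp_u64);
    uint64_t sum = 0;
    for (int i = 10; i < NSAMPLES - 10; i++)
        sum += delta[i];

    printf("trimmed mean: %llu cycles\n",
           (unsigned long long)(sum / (NSAMPLES - 20)));
    return 0;
}
```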
Now, if you want to have a look at the raw data later, I can share it with you; you can go in and examine whether there's any minimal optimization that can be done. I think the most important points to raise are the actual times we spend in the grant mapping and the grant unmapping, which are the two lines with the brown triangle and the cross — second and third from the top. This was measured on 3.11, but we disabled persistent grants in both blkback and blkfront; in the conclusion of these slides I'll talk a little more about why we did that. So this shows the overhead of grant mapping and unmapping. We know — or we believe; as Boris was saying, you don't know anything until you've properly measured it — that there is some sort of contention on the grant table. And if we were doing concurrent I/O, with many domains hammering dom0 with I/O requests, this would probably not grow linearly; it would probably get a lot worse than this because of locking issues. But we haven't measured that just yet.

So what happens when you have persistent grants? This is what I noticed. The two lines from before — the lines for the grant map and the grant unmap — dropped right down; they started to execute really, really fast, even for bigger block sizes. But the time I spent copying the data from the persistently granted pages was actually pretty high. Now, I know it would be better if I had a total line to show this, but the sum of these bottom lines is actually not lower than before — it's higher — which means that for this use case persistent grants were actually worse. For aggregate throughput, persistent grants are good, because we eliminate the contention on the grant table — or we believe we do; Roger measured this when he implemented it. For the user-space alternatives, persistent grants also help, because the current way we do grant mapping and unmapping through the grant device is actually slow — again, something we believe but that I haven't effectively measured just yet. So for user-space alternatives, persistent grants are very good. Now, there might be other things we can do there; I'll get to that later.

Just to summarize — I did this measurement because persistent grants had very recently been upstreamed in 3.11 and I wanted to examine them in this fashion — what I observed is that with persistent grants we do less TLB flushing, because we're not doing as much unmapping, and we believe there is less contention on the grant tables, so concurrently it is probably better. But the problem I see is that currently blkfront will always copy through the persistently granted pages, even if it hasn't negotiated persistent grants with blkback. So we would always copy. Ideally it should support both data paths. I've already had a chance to talk to Roger, and he agrees with me, so this is probably something we're going to change soon. We believe it would be better if the administrator could decide which one to use in a given deployment — am I going to use persistent grants or not? It depends on the particular case.

So there are a number of things we can do to improve here. This is the point where we have to start thinking and experimenting with things, and people who have ideas should step forward and get involved.
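On the copy cost that made persistent grants slower at depth 1: you can get a feel for it with a microbenchmark of my own (purely illustrative, not from the talk) that times the per-request memcpy a persistently granted page imposes, across the same block sizes as the plots:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>

int main(void)
{
    for (size_t sz = 4096; sz <= 128 * 1024; sz *= 2) {
        char *src = aligned_alloc(4096, sz);  /* stands in for the granted page */
        char *dst = aligned_alloc(4096, sz);  /* stands in for the guest buffer */
        if (!src || !dst) return 1;
        memset(src, 0xab, sz);

        uint64_t t0 = __rdtsc();
        for (int i = 0; i < 1000; i++)
            memcpy(dst, src, sz);             /* the per-request copy cost */
        uint64_t cycles = (__rdtsc() - t0) / 1000;

        volatile char sink = dst[0];  /* keep the copies from being optimized out */
        (void)sink;

        printf("%6zu bytes: ~%llu cycles per copy\n",
               sz, (unsigned long long)cycles);
        free(src);
        free(dst);
    }
    return 0;
}
```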
Now, all of these SSDs and NVMe devices and so on — it's pretty new technology. And as I showed you, the blktap stuff we're using probably still predates SSDs. Many aspects of storage technology are changing very, very fast, and over the next few years we anticipate doing maybe a million IOPS on a single stream, with devices responding in one microsecond. We cannot afford to be spending several dozens of microseconds on grant mapping and unmapping; the cores are not really getting that much faster, and we need to find new ways of doing this.

So, some of the ideas that have been discussed — I'm not going to go into detail on all of them, I'll just mention some, and we can discuss them later. There are the persistent grants we were just talking about: we know that grant mapping and unmapping is expensive, we know there's contention on the grant table, and this is one way of helping with that. Another thing that got implemented in 3.11, which I believe is very, very helpful, is indirect I/O — indirect descriptors. Up to 3.10 or so, each request that was put on the ring between blkfront and blkback could carry at most 11 pages, because the request structure only had 11 segments and could not address more than that. If you issued a larger request, it would get split up. That also limited the total amount of data you could have in flight, because the ring only has 32 slots; unless you're using a multi-page ring or doing something like that, you're very limited in the number of requests you can put through. Indirect I/O fixes that by having a segment point to a page full of segments, so you can have very large requests per ring entry.

Malcolm, in our XenServer engineering team, also proposed something where we can avoid doing TLB flushes if we haven't touched the pages. We realized that in the storage data path we don't really look at the data; we still do the unmapping, but if we never accessed the data, we don't need to force a TLB flush. David here has been talking about some other ideas for the user-space alternatives, where we introduce a page fault handler so that we don't need to keep the grant mapped in both user space and the kernel, which is what happens now.

There are other ideas, such as using separate rings for requests and responses. Imagine the guest had one ring for putting the requests in, and the back end could put the responses in a separate ring. That would make it easier, for example, to control how many outstanding requests you have: right now you can only have at most 32 requests, because you consume the whole ring and then you need the slots to put the responses in. If you had separate rings, you could have different threads in your control domain or driver domain coordinating — saying, we're not going to pick up more than a thousand requests altogether — and they could just consume the ring slowly. Or, if there is only one VM and you have enough memory, you could just consume the request ring many times and keep queueing requests, since you have separate space to put the responses in. This could also possibly help with some caching issues.
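Going back to the ring arithmetic behind indirect I/O, here's a back-of-the-envelope sketch. The constants are as described above (32 slots, 11 segments, 4 KiB pages); the 512 segment entries per indirect page is my assumption based on an 8-byte segment entry, so treat the exact numbers as illustrative:

```c
#include <stdio.h>

int main(void)
{
    const unsigned ring_slots  = 32;   /* slots in the blkfront/blkback ring */
    const unsigned direct_segs = 11;   /* segments in a direct request       */
    const unsigned page_kib    = 4;    /* one segment covers one 4 KiB page  */

    unsigned per_req   = direct_segs * page_kib;   /* 44 KiB per request   */
    unsigned in_flight = ring_slots * per_req;     /* 1408 KiB on the ring */

    /* With indirect descriptors, a segment points at a page *full of*
     * segment entries (~512 per 4 KiB page, assuming 8-byte entries),
     * so even a single indirect page lifts the per-request limit a lot. */
    unsigned per_req_indirect = 512 * page_kib;    /* ~2048 KiB per request */

    printf("direct:   %u KiB/request, %u KiB in flight max\n", per_req, in_flight);
    printf("indirect: up to ~%u KiB per request with one indirect page\n",
           per_req_indirect);
    return 0;
}
```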
If you're following what's happening at the block layer, there are some people working on a multi-queue approach. They realized that on big modern machines you still have one single request queue per device driver: you put the request in there, and if the request is coming from a CPU on a different NUMA node, there is a very high cost for the remote access — again, a very big penalty for very fast storage devices. So the concept of having per-CPU software request queues is coming in, and the need for algorithms like CFQ is being rethought: how are we going to adapt to these new changes? This is probably coming in 3.12, I think.

That's it — I think we're just about on time. So thank you, and I'll take some questions. Questions? Have I answered everything up front? I think people just want to go to the pub now. Yeah — well, they're waiting for the whisky raffle. So, no questions? Okay. Thank you.