Thank you for the introduction. I'm Jonathan Davies, from the XenServer team at Citrix, and I'm going to talk about the work we've done on VM density for the XenServer 6.2 release. I'm going to start by talking about expectations from our users. We tend to call our users customers, so their expectations are important, because they're giving us money if we can meet their expectations. I'm going to explain how we met those expectations by removing some of the hard limits and some of the soft limits to achieve higher VM densities. Finally, if we've got time, we'll look at some benchmarks that quantify some of these improvements.

Some people just want to run one or two, or a handful of, VMs, but we've got a lot of customers who really care about getting a lot of VMs onto their servers. If you can run N VMs on, say, a one-socket machine, then if you've got a two-socket machine, people expect that you can run 2N VMs. You've got to have a lot of sympathy for that view. People tend to get a bit upset if you can only run N, or N plus a little bit, when you've just gone and spent double the amount of money on a bit of hardware that's double the size. The rather unfortunate and perhaps embarrassing situation that XenServer found itself in is that we really couldn't do that. We weren't scaling with hardware, so people were buying bigger boxes and saying: well, why can't you run twice the number of VMs?

I've got the same picture here, just expressed a bit differently. Basically, with previous versions of XenServer, we had hard limits on our VM density that meant we were not able to fill the theoretical capacity of modern enterprise-grade server hardware. This is what I'm going to talk to you about today: in XenServer 6.2, we've solved that problem. You can see that there are two things we needed to do. The first was to move those hard density limits to the right, to push the hard limits way beyond what people are actually going to experience on reasonable hardware. The second was to then make sure there are no other soft limits you're going to run into, so that, again on reasonable hardware, you can actually fill it up with VMs and get reasonable performance out of those VMs.

Hard limits. I've got a list of the hard limits here that we've overcome in XenServer 6.2. Many of them have already been alluded to in some way today. You've already had a presentation this morning on event channels, so I don't want to duplicate what David has already said, but one thing that he didn't mention is that in a 32-bit dom0 you actually have a limit of only 1,024 event channels. With a 64-bit dom0 there are four times as many, and that's because of this #define: you can see that there's a squaring going on there. We know that various VM functions require event channels, obviously things like paravirtualised IO, and so it's quite common for a typical VM to require, say, between five and ten event channels, just for a normal, everyday VM.
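As an aside, the #define being referred to is, as best I can remember it, the classic two-level event channel limit in the Xen headers; treat the exact spelling as an approximation rather than a verbatim quote:

    /* Event channels live in a two-level bitmap: a selector word whose bits
     * each refer to a word-sized bitmap of channels.  Both levels scale with
     * the guest's word size, hence the squaring mentioned above. */
    #define NR_EVENT_CHANNELS (sizeof(unsigned long) * sizeof(unsigned long) * 64)
    /* 32-bit dom0: 4 * 4 * 64 = 1,024 channels
     * 64-bit dom0: 8 * 8 * 64 = 4,096 channels */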
If you do the maths there and divide your limit of 1,024 event channels by that number, then, well, I've got a couple of examples here. If you've got a paravirtualised VM with a single VCPU, one network interface and one VBD, you can only run 225 of those VMs on your host, and if you then try to start the 226th, you're going to get a pretty horrible error message saying you've run out of event channels, if you can decipher the error message. Even worse, consider another scenario with an HVM guest, fully virtualised but with PV drivers, with one VCPU, one VIF, which is a network interface, and three VBDs, virtual block devices, which are the disks. That's a pretty common scenario for desktop virtualisation, with the XenDesktop product that Citrix produces. This is a typical scenario where each VM has three virtual disks, and in that scenario you can only run 150 VMs per host. So no matter what the size of your hardware is, you can't go beyond 150 VMs in that configuration.

So what did we do about this? In XenServer 6.2 we did a bit of a hack. It's a bit embarrassing, so I won't go into the details of what the hack is, but basically it allowed us to enjoy four times the number of event channels in dom0. There was a special case we put in that treats dom0 differently to other domains. Obviously that gave us a much more comfortable margin for how many VMs we could run on a host, and that was enough to tide us over. In the future, and I'll refer back to David's presentation from this morning, there are some ideas floating around about changing the ABI to provide an essentially unlimited number of event channels, so that's very promising as a way to completely remove that hard limit.

Okay, if you didn't run into that limit, you might have run into this one. This is a limit with blktap2. Blktap2 is a kernel module that XenServer uses in its storage data path for reading VHD files and things like that, and we found that blktap2 only supported 1,024 devices. There was this #define in the code, but each virtual block device you want requires one of these devices in dom0, and so using the same scenarios as before, if you've got three disks per VM, then you can only have 341 VMs per host before you run out of these devices. Well, so what can we do? It was a #define of 1,024; what's the harm in doubling that? So we doubled it and nothing seemed to break, so we've now got 682 before we run out of them. In the future we might be moving away from blktap2, so the limit would then be avoided in a different way, but in the meantime it doesn't look like there's going to be any problem with just bumping that #define up if we need to.

So if you didn't run out of event channels, and you didn't run out of blktap2 devices, you might have run out of AIO requests, because again with blktap2, when it's setting up a new virtual block device, it creates an AIO context that can receive 402 events. I haven't got the rationale behind that here, but there's a set of #defines that end up with the number 402, and there's a system-wide sysctl that defines the total number of AIO requests you can have at any one time, and that's just less than half a million. Again, if you do the division, that means you're going to run out if you have 368 of these VMs with three disks each.
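To make that arithmetic concrete, here is the divide-the-pool-by-the-per-VM-usage sum written out. The per-VBD and per-VM figures are the ones quoted above; the AIO pool size is an assumption (the talk only says it is just under half a million), chosen so that the division reproduces the 368 quoted:

    #include <stdio.h>

    int main(void)
    {
        long blktap2_minors     = 1024;   /* blktap2 devices available before the bump */
        long vbds_per_vm        = 3;      /* the three-disk, XenDesktop-style VM */
        long aio_events_per_vbd = 402;    /* AIO context size per virtual block device */
        long aio_max_nr         = 444000; /* assumed system-wide AIO request limit */

        printf("blktap2 ceiling: %ld VMs\n", blktap2_minors / vbds_per_vm);                    /* 341 */
        printf("AIO ceiling:     %ld VMs\n", aio_max_nr / (aio_events_per_vbd * vbds_per_vm)); /* 368 */
        return 0;
    }

Bumping either pool, whether by doubling the #define or raising the sysctl, simply moves the corresponding quotient, which is all that these workarounds do.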
Again, though, this is something that can be changed, so we worked around it pretty simply. It's just a sysctl that you can change, and we bumped it up to a million, so we've pushed that limit back a bit, and it looks like you can keep on pushing it back. There are, of course, implications to that, but it looks like running at a million is a fairly safe number. In the future, we could think about disaggregation and perhaps splitting dom0 up into separate storage driver domains, and therefore multiplying out this number across domains, if we really find that this is a problem.

Okay, hard limit number four. I should have warned you that there are seven, so if you're taking notes, this is number four. This is a limitation with Windows VMs, because the default before XenServer 6.2 came along was to use receive-side copy, which is protocol one in the netfront/netback protocol. What that does, if you look at the netback code, is start off with at least 22 grant table entries for each virtual interface that you have, because it's granting this memory from dom0 to the guest. Because you've got a total of 8,000 grant table entries, you're going to run out pretty quickly when all of this granting is coming out of dom0. It's not too bad when it's happening the other way around, when it's just 22 grant table entries per guest domain, but if all your 22-per-domain are coming out of dom0, then that's a problem. Here, the limit comes out as 472 VMs if each of them has just one interface. In XenServer 6.2 we toyed with the idea of bumping up the number of grant table entries so we could support more VMs, but actually, for other reasons that I won't go into here, we're not using receive-side copy as the default any more, so that problem has gone away.

Number five: xenstored. Xenstored uses select, and select is pretty limited because it can only listen on 1,024 file descriptors, and that's not something that can be easily changed. Don't ask me why that limit is there, but it exists. The problem is that QEMU opens three file descriptors to xenstored per VM, so each HVM guest that you're running requires three file descriptors. After you've allowed for a bit of overhead, 333 VMs per host is what you can then run. What we identified in XenServer 6.2 was that two of those file descriptors were being used for watches. It's a pretty heavyweight thing to have a whole connection open just to watch one particular thing. There was a patch, which I think Freddy Arno wrote, I don't know if he's in the room, to combine two of those watches to share a connection. I can't remember what the other one was for, but I think in principle they could all have shared a connection; I don't know what the reason for that was. Anyway, in the future, I'm led to believe, and maybe some of you know better than I do, that upstream QEMU doesn't connect to xenstored at all and gets this information in other ways, and if that's the case then we don't have to worry about this limit any more, because xenstored won't have anywhere near as many connections to it.

There's a very similar problem with xenconsoled that we came across, because xenconsoled uses select as well. Because each PV domain opens three file descriptors, I guess those are standard input, output and error, you again get a limit of about 341 VMs. This one was worked around: there was a patch, I can't remember who wrote it, that converted xenconsoled to use poll rather than select, and poll doesn't have this limit of 1,024.
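For anyone wondering why select tops out at 1,024 file descriptors while poll does not, here is a minimal sketch. This is plain POSIX, nothing XenServer-specific:

    #include <poll.h>
    #include <stddef.h>
    #include <sys/select.h>

    /* select() works on fixed-size fd_set bitmaps, so any descriptor numbered
     * FD_SETSIZE (1,024 with glibc) or higher cannot even be registered.
     * A daemon holding a few descriptors per VM therefore hits a wall. */
    void wait_with_select(int fd)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        if (fd < FD_SETSIZE)              /* fd 1,024 or above: out of luck */
            FD_SET(fd, &rfds);
        select(fd + 1, &rfds, NULL, NULL, NULL);
    }

    /* poll() takes an array of descriptors of any value, so the practical
     * limits are just memory and RLIMIT_NOFILE.  This is the change the
     * xenconsoled patch made. */
    void wait_with_poll(int fd)
    {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        poll(&pfd, 1, -1);
    }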
On to number seven, the final one in my list: dom0 low memory. We found empirically that each running VM consumes about one megabyte of dom0 low memory, so when you subtract the amount of low memory that other things are using, you're left with about 650 VMs that you can run on your host. Obviously, if you run out of low memory then bad things are going to happen: the OOM killer kicks in and almost certainly kills completely the wrong thing, and that's not going to be pleasant. Talking about low memory is perhaps quite an antiquated thing these days, because using a 64-bit dom0 would avoid this problem: you've got a nice homogeneous dom0 memory address space, so that problem will go away in the future.

Right, let's take a step back. I've listed seven things there. This table is an attempt to summarise these limits, because I've thrown out all these numbers, but what I cared about was making sure we were able to push all of these numbers back far enough that various reasonable configurations of VMs would allow you to run a large number of them. This slide shows an example where you've got HVM guests with one VCPU, one VBD, which is a disk, and one virtual network interface, with PV drivers for those IO devices. The table shows the number of VMs you can run on these different XenServer versions before you run out of each of these seven different resources. You can see that the bottom line in the XenServer 6.1 column is 225 VMs before you run out; in this case, the thing you're going to run out of first is dom0's event channels. But you can see that the story is a lot happier in XenServer 6.2. The limit is up at 500, and that's the point at which you start running out of xenstored connections, because of that issue I mentioned earlier. In the future, the picture is looking pretty rosy: none of those seven things will apply any more, so the limit is going to be much higher. I'm certain there will be an eighth thing to add to this list, so when you come back to the Xen Summit next year maybe I can tell you about number eight, but it's hopefully going to be quite a lot in excess of 500.

A second scenario: if we had three disks on our VM, the number was even worse on XenServer 6.1. You could only run 150 of those VMs on your host before you ran out. Now we're back up at 500, so we've been able to state a supported limit of 500 VMs. For paravirtualised VMs the numbers look a little bit different. Again, the problem was event channels; that was the first thing you were going to run into, and now that problem has gone away. The other problems here have been mitigated. The first thing you're going to run into there is the low memory problem. Again, that problem and the other remaining ones are also being mitigated in various ways in the future, so I'm expecting a subsequent XenServer release to have an even higher number of supported paravirtualised VMs.
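All seven of these limits follow the same pattern, so the whole table boils down to one line: the number of VMs you can support is the minimum, over each resource, of floor(pool size / per-VM usage), and whichever resource gives the smallest quotient is the bottom line for that release.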
I mentioned that we've got 500 as our supported limit, and that's actually practically achievable. Here's a screenshot from XenCenter, XenServer's graphical interface, showing 500 VMs, in this case Windows XP VMs, but any VMs would have done. They're all running there, and they were pretty responsive.

Talking about the responsiveness of those VMs brings me on to my next topic, which is the soft limits. I've shown that it's now possible, in theory, to run this number of VMs, but how well do they perform? Can we actually use them in practice? If you were doing real workloads in the VMs in that screenshot, in a practical scenario, would they be usable? The answer was no in XenServer 6.1, but we've been able to overcome a few of the soft limits in 6.2. Let me tell you about them now.

I guess if you're a XenServer user, this will be a very familiar picture to you, if you're the kind of person who runs top: xenstored is using a huge amount of CPU there. What's it doing? And firstly, is it actually a problem? Let me show you this graph, which is a screenshot from our GUI showing CPU utilisation on the vertical axis against time on the horizontal axis. What we're doing here is booting a large number of VMs sequentially. The lines I'm pointing to around the middle of that graph are the dom0 vCPUs. There are eight of them in this test, and they were each taking up around 50%, give or take. What I did then was separate out xenstored onto its own dom0 vCPU, using taskset to give xenstored a CPU dedicated to itself. What we see is that the graph changes slightly, because that one dom0 vCPU, the one that's running xenstored, goes to 100%. Because it's running at 100%, that's a very clear indicator that this is a bottleneck, and that's going to be slowing things down: things that are trying to poke xenstored for various reasons are going to be delayed, because it's running at maximum capacity.

What were those things, and could we remove them? The answer is yes, we could. Basically, there was a variety of things. We had a good look at all the things that were interacting with xenstored in XenServer. There were quite a few spurious writes going on, things that perhaps didn't need to be written. We had a load of things being written by the VMs, reporting back various monitoring information, and that was causing quite a lot of xenstore activity; it was arguable that it was of limited value relative to the cost it was imposing on the system. The second class of things we were able to do was to replace polling with watching. In the past, we had some processes that were naively asking xenstored: is this key here? Is this key here? Is this key here? That was just imposing a lot of load on the daemon, and there was a perfectly good solution sitting there waiting to be used, which is xenstore watches. We've been able to replace quite a lot of the toolstack code that was doing this naive polling with watches, so it gets notified when the thing it's interested in changes. Those things combined really reduced the CPU utilisation of xenstored, and that made a big difference.
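To make the polling-versus-watches distinction concrete, here is a small sketch against the libxenstore C API. The real toolstack code isn't C and the key below is made up, so treat this purely as an illustration of the two patterns:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <xenstore.h>

    /* The old pattern: keep asking xenstored the same question.  Hundreds of
     * VMs' worth of this is what drove the xenstored vCPU to 100%. */
    static void poll_for_key(struct xs_handle *xs, const char *path)
    {
        for (int i = 0; i < 50; i++) {          /* try for ~5 seconds */
            unsigned int len;
            void *val = xs_read(xs, XBT_NULL, path, &len);
            if (val) { free(val); return; }
            usleep(100 * 1000);                 /* a wasted round trip every 100 ms */
        }
    }

    /* The replacement: register a watch once, then block until xenstored
     * notifies us, so there is no load while nothing is changing. */
    static void watch_for_key(struct xs_handle *xs, const char *path)
    {
        unsigned int num;
        char **event;
        xs_watch(xs, path, "density-demo");
        event = xs_read_watch(xs, &num);        /* blocks until the watch fires */
        if (event) { printf("changed: %s\n", event[XS_WATCH_PATH]); free(event); }
        xs_unwatch(xs, path, "density-demo");
    }

    int main(void)
    {
        struct xs_handle *xs = xs_open(0);
        if (!xs) return 1;
        poll_for_key(xs, "/local/domain/0/data/example");   /* hypothetical key */
        watch_for_key(xs, "/local/domain/0/data/example");
        xs_close(xs);
        return 0;
    }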
The second thing: QEMU. When you're running HVM guests, again, this is a pretty common sight. At first glance this might not look too bad, but the eagle-eyed amongst you will probably spot the load average there, which is completely through the roof, which is perhaps a sign that something not so good is going on. But each of the QEMU processes there is only using 3%; that's not too bad, is it? Except that the VMs that were running here, the Windows VMs, were basically idle. They weren't running anything. So what was QEMU doing?

If you think about it, if you're trying to run, say, 200 VMs, and they're all idle, and each of those QEMU processes is using 3%, then 3% times 200 VMs is 600%. You're completely wasting six whole dom0 vCPUs just to do basically nothing, because these VMs are idle; they don't need anything doing by QEMU. This is clearly a complete waste of hardware. So we wanted to know what it was that QEMU was doing, and whether we could fix it. We did some analysis and looked at the different events. A simple strace of QEMU would have shown you that it was busy processing events: it was constantly waking up to process another event. This table shows a summary of what those events were. There's a variety of different emulated devices there, and quite clearly from that table the big hitter was USB. This is just one set of measurements, but it's typical that you'd be processing around 220 QEMU events per second to do with the USB bus, and the sheer weight of processing those events, coupled with all the other ones here, was what was causing that ambient 3% of CPU in each QEMU.

Taking the things in that list: there's the buffered IO. There was already a patch for that, I think from Stefano, but I'm not sure if I'm right to say that, to convert it to use an event channel rather than going via QEMU. That seemed like a much cleaner solution anyway, and it got rid of 13 QEMU events per VM per second. The other things are all emulated devices that could be disabled, and actually a lot of them are completely pointless: who needs a parallel port and a serial port on their VM, especially when it's not connected to anything? We saved one event per second there. The QEMU monitor: in XenServer we're not using that for anything, so who needs that? A CD-ROM: just the presence of a CD-ROM drive, even with no ISO inserted into it, meant these Windows VMs were polling it, asking: is there a CD in the drive? So not having a drive there at all saved us 38 events per second. But the big one was USB.

So we've provided an option in XenServer 6.2 to not emulate a USB controller, and actually, for a lot of our customers, that's completely fine. The only thing it would really cause a problem for is the USB tablet device used for absolute pointer positioning in our graphical user interface when you're viewing the VM's console, but a large number of people don't use that console: if you're providing, say, a virtual desktop environment, your users are going to be connecting over some protocol like Remote Desktop or perhaps ICA or something like that.

That's actually been dramatically improved upstream in QEMU, by the way; you just need to be on a newer QEMU version, because now we detect when the device is idle and we throttle down the HID polling intervals. So if you go to a newer version of QEMU and the device is inactive, not being used, those events will go away completely. Okay, so that was a claim that in future, in modern versions of QEMU, this problem's been solved, which is great to hear. Thank you very much. There was also the point that we don't have a Linux front-end for that, but we have a Windows front-end for that. Yup, so that's clearly a way... desire to upstream that, I think we just haven't gotten... The USB polling there is really down to the device.
So Windows is polling the device based on the HID timeout set by the device. So there's not a... USB sucks, like there's no... Yeah, it's incredible what the guest must be doing. It's the hardware, right? That's how a real, actual, honest-to-goodness USB controller works: you have to constantly go to it and say, hello, hello, hello, hello, hello. So that's what Windows does. So you've got a lot of VMs saying hello to their USB controllers, indeed. Something modern, I don't know, 1.5 or 1.6, something relatively recent. It's not that.

So that was the main way in which we reduced the CPU load in dom0. Let me move on then to talking about benchmarks, so I can hopefully quantify some of these improvements. You'll recall that we've pushed the hard limits right out, and now it's just a case of eking out as many VMs as we can before the performance starts to degrade. So I've got a couple of benchmarks. The first one here is a boot storm. Boot storm is the word we use to refer to starting a lot of VMs at the same time, booting them all in as short a time as possible and measuring how long it took to boot them. This is actually quite a good benchmark when you're talking about measuring VM density and how well lots of VMs perform, because there's a lot of stuff going on in dom0 during a boot: you've got the toolstack churning away, you've got a lot of xenstore activity, and your QEMU processes will be doing a lot of emulation before the PV drivers have kicked in, perhaps. So it's actually quite a good workout for the system.

You can see from this graph that we're starting 25 VMs at a time and doing 90 in total; these are Windows 7 VMs. The vertical axis on this graph shows the time at which each VM had finished fully booting, so it's completely ready to use. The red line there is XenServer 6.1, the blue line XenServer 6.2. Primarily due to these improvements that I've been describing, it actually made quite a considerable difference: a 60% improvement on the total time to boot 90 VMs. How about 120 VMs? Well, the story's even better; the improvement there is now 75%. You can see that it was starting to get pretty slow on the red line there after you got past 100 VMs. Things were really starting to choke up in dom0: the load average would have been like on that slide earlier, going completely through the roof, and there was no sign of recovery there. But it's a lot more linear now in 6.2 with 120 VMs.

What about starting 200 VMs? Well, there are two things to say here. We couldn't do the measurement on XenServer 6.1, because we were running into those hard limits that I referred to earlier, so all we get is that red line tailing off horribly and then suddenly stopping. The blue line, however, does carry on, because we can now run 200 VMs quite comfortably, and on this hardware it took around 30 minutes to boot them all, as you can see. Maybe it's going up slightly, but it's pretty linear, certainly a lot more linear than the red line.

What's the step function here? George is asking why these lines are stepped. It's because I'm doing 25 VMs at a time, which means I'm asking the toolstack to have 25 outstanding VM starts at any one time. From the word go, I say: here are 25 VMs to start, please. Gradually they'll all start, and eventually the toolstack will say, okay, the first one's done, the second one's done, and actually all those 25 will finish at about the same time, so then my next 25 will start.
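Written out as a sketch, the dispatch pattern behind those steps looks something like this; start_vm and wait_for_a_boot are hypothetical stand-ins for the toolstack calls, not a real API:

    #include <stdio.h>

    enum { TOTAL_VMS = 200, MAX_OUTSTANDING = 25 };

    /* Hypothetical stand-ins: kick off an asynchronous VM start, and block
     * until one of the previously started VMs finishes booting. */
    static void start_vm(int id)      { printf("starting VM %d\n", id); }
    static void wait_for_a_boot(void) { /* would block on the toolstack here */ }

    int main(void)
    {
        int started = 0, booted = 0, outstanding = 0;

        while (booted < TOTAL_VMS) {
            /* Keep up to 25 starts in flight at any one time... */
            while (started < TOTAL_VMS && outstanding < MAX_OUTSTANDING) {
                start_vm(started++);
                outstanding++;
            }
            /* ...then wait for one to finish before issuing the next. */
            wait_for_a_boot();
            booted++;
            outstanding--;
        }
        return 0;
    }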
So that's why you get this kind of step effect: they tend to go in batches when you're using this approach. And this is actually the approach that XenCenter, the GUI, takes when you're starting VMs: if you select all your 500 VMs and click start, it'll do them in batches of 25. What was the storage? That's a good question. I think it was local SSDs. It probably wouldn't have been too different if we'd had it on shared storage.

Okay, I think we've got time for one final benchmark. This is a benchmark called Login VSI. Let me talk about what the benchmark is before I talk about the graph. Login VSI is a benchmark that's very commonly used in desktop virtualisation. It's a pretty basic simulation of a kind of office worker: in each of the Windows VMs that you have, it'll run an essentially identical workload, although there are some elements of randomness. It'll run things like Microsoft Word and Excel and PowerPoint and Outlook, the kind of common office user applications, trying to simulate what a user of a virtual desktop environment would do. The way the benchmark works is that you have an idea in mind of how many VMs you want to run, and then it'll tell you how close you came to that, or rather how many VMs were able to run at an acceptable level of performance up to that number you wanted.

What this graph here does is summarise a load of measurements using this benchmark, and you can see that on XenServer 6.1, if I wanted to run 80 VMs, all 80 ran fine; it goes pretty linearly there. But at the point at which I try to run more than 105 VMs, suddenly I find that if I'm trying to run 106, none of them live up to the benchmark's expectations of what an acceptable level of performance is, and that's why it completely dives down there to zero. In 6.2, it's a much healthier story. We're able to get way over 300, in fact, before you even start finding that one or two VMs aren't performing to an acceptable level. The second thing to notice is that you don't have this big dive down any more, because dom0 load is no longer going completely through the roof and causing the whole system to collapse; it now scales a bit more gracefully. This graph goes as far as showing 400 VMs, and 300 of them were still running well, and the gap there from 300 to 400 could even just be due to exhausting the CPU capacity of this machine, so maybe it wasn't really possible to get much more than that even in theory.

You say the number of VMs performing acceptably. Are we to interpret this graph to mean that you gave it 400 VMs, 300 of them performed within acceptable limits, and 100 of them were useless? I wouldn't say useless, but you're right to interpret it that way: the benchmark has a notion of what acceptable performance is, so it'll do things like measuring how quickly a window opened, and so on. And this is expected: even if there were zero virtualisation overhead, you're still eventually going to find that all your CPUs are saturated if you try to run too many. So this is what you'd expect.

Any more questions? I'm sort of wondering why xenstored is not switched to using poll; I mean, it's still using select, why? Yeah, I don't know. I think it could be. The question was why we can't use poll for xenstored, like we did with the console daemon.
I think we could, but it turns out that problem is going away anyway. If QEMU is no longer connecting to xenstored, then we don't have to worry about that being a limit, so 1,024 file descriptors on xenstored would be fine. Okay, I asked because I did both patches, for xenconsoled and for xenstored; you seem to have picked up only one, I don't know. Of course, yeah, we're using a different version of xenstored in XenServer.

This last graph, do you know how it compares to KVM in a similar scenario? No. The question was, do I know how this compares to KVM? No, I don't have good-quality analysis of other hypervisors. I'd be surprised if they were much better than this, but I don't want to speculate because I don't know. It is something that we'd like to know, and I think it is actually something that could be measured using our measurement infrastructure, so it might be interesting to try and find that out.

Any more questions? Yeah, run, Forrest, run. In your previous slide about boot time, can you talk about what you did, mainly, to decrease the boot time? Yeah, it's really just the things that I've mentioned in the presentation here. Primarily, it's about reducing dom0 load: the reason that the red line is shooting up in the air is because dom0 was too busy, because too many VMs were trying to cause too many things to happen in dom0. Dom0 was maxing out its CPUs, the load average was going sky high, and so all we needed to do was quieten things down in dom0. Actually, we've now got to the point, certainly with the QEMU optimisations that I mentioned, not emulating all those devices, where the load average is basically zero even when you're running hundreds of VMs, so that's clearly where you want to be. There's no reason for dom0 to be doing anything when your VMs are just sitting there idle, and that's now the position that we're at.

For example, you said that the main reason one QEMU process uses 3% of the CPU of dom0 is because of USB, so in 6.2, did you just disable USB? No, we haven't disabled USB, but we've given people the option to disable it. But you could also disable USB in 6.1, right? So, in 6.1 there was no option in the toolstack to let you disable USB, so if you wanted to do that, you would have needed to know which files to go and edit in dom0 to cause USB not to be emulated. Now we've actually made that a much more public interface, and we're writing documentation that explains how to do this, so it's now a normal thing to do. You could also have disabled the USB controller from within your VMs by going to Device Manager, if you're in Windows.

We're running out of time, but there's time for one more question if there is one. So in this testing here, I think we had eight, I didn't mention it there, I think we had eight vCPUs in dom0, but actually it wouldn't have been too bad with fewer than that, I think, and certainly you don't need eight when all the VMs are running; you only need a lot of vCPUs if you're trying to do a lot of booting in a short period of time. Okay, okay, thank you. Thank you.