Hi, I'm David Vrabel. I'm part of the XenServer team at Citrix, and I'm going to talk about some of the work I've done on removing one of the scalability limits inside Xen: the limit on the number of event channels that a guest can support. This basically limits the number of VMs you can support, because it limits the number of frontend devices that can connect to the backends.

So what are event channels? Event channels are Xen's mechanism for paravirtualized interrupts. They're edge triggered; they're bi-directional, in that if you set up an event channel between, say, a frontend and a backend, the frontend can signal the backend and vice versa; and they're always directed at a single vCPU. With these properties, event channels can be used to virtualize all types of interrupts: the inter-domain interrupts for frontends and backends, inter-processor interrupts (IPIs), virtual IRQs for things like the PV timer and the PV console, and physical IRQs, MSI and MSI-X. Despite being edge triggered, they're also used for level-triggered physical IRQs, with a bit of extra machinery.

So how do they work? In this example, domain A wants to send an inter-domain interrupt to domain B. First it makes a notify hypercall into Xen. Xen then writes into a piece of memory shared between the target domain and Xen, setting a bit that says there's an event pending. It then triggers an upcall into the guest, and the guest looks at its shared memory, sees that there's an event pending, and calls the appropriate handler for it.

The thing that limits the number of event channels you can have is the layout and structure of this shared memory. In the current layout, we have an array of bits to say that an event is pending: 4,096 bits for a 64-bit guest, so we can have at most 4,096 events. There's also an equivalent array of mask bits that a guest uses to prevent delivery of specific events. Obviously it would be inefficient to continually scan all 4,096 bits whenever an event turns up, so on a per-vCPU basis there's a selector word: a 64-bit word where each bit corresponds to a word within the shared pending array. When you handle events, you go through the selector word, find the first set bit, look in the corresponding pending word, find all the set bits in there, and handle those interrupts.

So what are the problems with this design? The first is that there are just too few event channels. 4,096 might sound like plenty, but with that limit XenServer can only handle up to something like 300 VMs; if you want any more, you run out of event channels. There's also no way of having any sort of priority between the event channels: they're all handled at the same priority. And because it's just a flat array of bits and you don't know when an event is going to turn up, it's also unfair. I'll show a graph later that illustrates this, but basically the algorithm used to scan the bits is only fair given a uniform distribution of events, which turns out not to be the case.
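To make the two-level layout concrete, here is a minimal sketch of the guest-side scan, assuming simplified structure and helper names; the real definitions live in Xen's public headers (shared_info, vcpu_info) and the guest kernels, and differ in detail.

```c
/* Hypothetical, simplified model of the two-level ABI scan; the real
 * layout and the real guest code differ in detail. */
#include <stdint.h>
#include <stdio.h>
#include <stdatomic.h>

#define WORDS 64                       /* 64 words x 64 bits = 4096 ports */

struct two_level {
    _Atomic uint64_t selector;         /* per-vCPU: bit w => pending[w] has events */
    _Atomic uint64_t pending[WORDS];   /* one pending bit per event channel */
    uint64_t mask[WORDS];              /* guest-set mask bits */
};

static void handle_event(unsigned port)
{
    printf("event on port %u\n", port);
}

static void scan_events(struct two_level *s)
{
    /* Atomically take the selector; Xen re-sets bits for new events. */
    uint64_t sel = atomic_exchange(&s->selector, 0);

    while (sel) {
        unsigned w = __builtin_ctzll(sel);   /* lowest set selector bit */
        sel &= sel - 1;

        /* The scan always restarts from bit 0 of the word, which is
         * why this design is unfair under non-uniform event rates. */
        uint64_t bits = atomic_load(&s->pending[w]) & ~s->mask[w];
        while (bits) {
            unsigned b = __builtin_ctzll(bits);
            bits &= bits - 1;
            atomic_fetch_and(&s->pending[w], ~(1ull << b)); /* clear pending */
            handle_event(w * 64 + b);
        }
    }
}

int main(void)
{
    static struct two_level s;
    atomic_store(&s.pending[1], 1ull << 9);  /* pretend port 73 fired */
    atomic_store(&s.selector, 1ull << 1);
    scan_events(&s);                         /* prints "event on port 73" */
    return 0;
}
```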
We also have some additional requirements for a new ABI. We would like the ABI to be the same for 32-bit and 64-bit guests, because different ABIs create extra complexity in both the hypervisor and the guests. We also want to make sure we don't have excessive memory usage, because each piece of memory shared between a guest and Xen uses up a limited amount of global mapping space within Xen, and we don't want to completely blow that away. And we would like an ABI that can be extended in the future, for even more event channels or potentially other use cases.

So this is the new FIFO-based design, on the right-hand side. We have the shared memory between the guest and Xen, but instead of just a bit array of pending bits, we have an array of event words. Each word has a pending bit (P) as before and a mask bit (M) as before, plus two additional fields: the linked bit (L) and the link field. These are used, as you can see, to construct a singly-linked list, which forms a FIFO of pending events. You can see here that the event on port 1 was the first to turn up, then port 6, then port 3. In the per-vCPU shared area we have the head pointers, telling the guest where each head is. Because an event can only be in one list at a time, we can interleave many lists in the same event array, so we can have multiple heads. In the actual implementation we have up to 16 heads, which gives us 16 different event priorities. And to let the guest see which queues have pending events, we also have a ready field, similar to the selector in the two-level design: a set bit in it says there are pending events on a particular queue.

One thing about using a shared memory interface is that, while you'd typically synchronize a list with spinlocks, with memory shared between Xen and a guest you obviously can't do that. So this is the core algorithm we use to link an event, as part of raising it. First, we set the event pending, and we check if it's masked; if it's masked, we don't do anything, because the guest isn't interested at this time. Then we check if it's already in the list; if it is, we also don't need to do anything. Then we add it to the list: we atomically set the link field of the tail of the list to point to our new event, or, if the tail is not currently in the list, i.e. the list is empty, we set the head field so the guest can find the new head. Finally, we advance the tail to point to our new event.

Conversely, when we handle events, we initially read a local copy of the head. The guest keeps its own copy of the head pointer so it never has to write back to the shared head; Xen alone writes the head, which removes some potential race conditions. If the last link field we read was zero, the current local head will be zero, and we know that we previously emptied the list, so we read the new value of the head out of the per-vCPU control block. Then we clear the link field and the linked bit, which removes that event from the list, and we advance our local copy of the head to the new head. If the event was the tail of the list, this sets the local head to zero, so next time around we know to pick up a new head. Then, if the event is still pending (something else might have cleared it, for example if the event channel was disconnected), and if it's not been masked, we just call the interrupt handler.
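As a rough sketch of the raising side, assuming a 32-bit event word with P, M and L flag bits and a 17-bit link field (matching the layout described above), the lock-free link step might look like this. The helper and parameter names are illustrative, not Xen's actual internal code.

```c
/* Sketch of a FIFO event word and the lock-free link step.  Bit
 * positions follow the layout described in the talk (P, M, L flags
 * plus a 17-bit link); the helpers are illustrative only. */
#include <stdint.h>
#include <stdatomic.h>

typedef uint32_t event_word_t;

#define EVTCHN_FIFO_PENDING   (1u << 31)   /* P: event is pending */
#define EVTCHN_FIFO_MASKED    (1u << 30)   /* M: guest masked this event */
#define EVTCHN_FIFO_LINKED    (1u << 29)   /* L: event is on a queue */
#define EVTCHN_FIFO_LINK_MASK 0x0001ffffu  /* 17-bit link => 2^17 ports */

/* Point the current tail's link at 'port', but only while the tail is
 * still linked.  If the guest unlinks the tail concurrently, the
 * compare-exchange fails, we see L cleared, and report the queue empty. */
static int try_link_after_tail(_Atomic event_word_t *tail, uint32_t port)
{
    event_word_t old = atomic_load(tail);

    while (old & EVTCHN_FIFO_LINKED) {
        event_word_t new = (old & ~EVTCHN_FIFO_LINK_MASK) | port;
        if (atomic_compare_exchange_weak(tail, &old, new))
            return 1;    /* linked after the old tail */
    }
    return 0;            /* queue is empty: caller must set HEAD instead */
}

static void raise_event(_Atomic event_word_t *word,      /* event's word */
                        _Atomic event_word_t *tail,      /* tail's word */
                        uint32_t port, unsigned queue,
                        _Atomic uint32_t *head,          /* per-vCPU HEAD */
                        _Atomic uint32_t *ready)         /* per-vCPU READY */
{
    /* 1. Mark pending; if masked, the guest isn't interested right now. */
    event_word_t w = atomic_fetch_or(word, EVTCHN_FIFO_PENDING);
    if (w & EVTCHN_FIFO_MASKED)
        return;

    /* 2. Claim the L bit; if it was already set, the event is queued. */
    w = atomic_fetch_or(word, EVTCHN_FIFO_LINKED);
    if (w & EVTCHN_FIFO_LINKED)
        return;

    /* 3. Append after the tail, or start a new queue via HEAD and flag
     *    the queue in READY so the guest picks up the new head. */
    if (!try_link_after_tail(tail, port)) {
        atomic_store(head, port);
        atomic_fetch_or(ready, 1u << queue);
    }
    /* 4. Finally, the (Xen-private) tail pointer advances to 'port'. */
}
```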
So this graph here illustrates one of the key benefits other than just having lots more event channels: fairness. By fairness, we mean the average latency of an event should be independent of its port number and independent of any other events occurring elsewhere. This graph is from a very contrived test, deliberately constructed to show worst-case behaviour, but the setup it uses is not dissimilar to what you'd actually see in a real dom0. In a real dom0, you typically have a whole bunch of physical IRQs at low event channel numbers, and then, spread out through the rest of the event channel space, the event channels going to frontends. Typically the physical IRQs have a much higher rate on average than all the other, spread-out events, so you don't end up with a uniform distribution of event rates.

In this example we have, I think, port 73 with an event that occurs about five or so times more often than the events on the other nine ports. Effectively what that means is that the scanning algorithm the two-level design uses always ends up looking for bits starting from the same place, immediately after the highest-rate port. So the port immediately after the highest-rate one is always serviced first and has low latency, the latency gradually climbs as you move away from it, and the highest-rate port itself has the worst latency, because it's always serviced last. As you can see, with the FIFO-based design, because events are serviced in the order they're raised, we get completely uniform average latency.

The current status of this work: the Xen support was merged as of last week, I think, so it should be in Xen 4.4. The Linux support is done; it may make it into 3.13, but it will definitely be in 3.14.

There's still some further work in this area, not specifically to do with the extended limit. Although we've added support for 16 different priorities, we don't actually make heavy use of this feature yet. We use effectively one extra priority for the timer interrupt in Linux, so the timer interrupt always gets serviced first, and all the other events are serviced at the default priority. That's just because I didn't have time to investigate how priorities could be used effectively. For example, it might be useful to put the event channels used to signal QEMU to do MMIO at a high priority, because that's likely to require low latency, and you might want to put the bulk data events for things like blkback or netback at a lower priority. But someone would have to actually try that out and see if it helps.

One of the things that got added: as a guest uses more event channels, it consumes more resources inside Xen, and we want to make sure that guests can't consume too many of those resources. So there's a new hypercall, a new domctl, to limit the number of event channels that a guest may bind. There's support for that in xl: there's a new configuration file option, max_event_channels, that you can set to limit it, and the default is set such that a typical domU won't hit the limit. The default is 1023, which is fine for almost all domUs except maybe driver domains, where you might need to change it.
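For instance, a driver domain's xl configuration might raise the limit like this (the option name is the one mentioned above; the value is purely illustrative):

```
# Raise the per-domain event channel limit above the 1023 default,
# e.g. for a driver domain serving many frontends.
max_event_channels = 4095
```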
Other toolstacks might want to implement that as well, so you might want to plumb it through to the XAPI toolstack or libvirt. On the guest side, we've only done Linux, so if you run a dom0 on something else (I think NetBSD or FreeBSD can run as dom0) then you might want to add support to one of the BSDs.

Related work: as you start increasing the number of event channels, we're increasingly going to hit spinlock contention inside Xen, because there's a single event channel lock per domain, and every time you send a notification to a domain you have to take that domain's event lock. That means that for a driver domain or dom0, all the other domains will be contending on acquiring that event channel lock. And there's something else I'm working on, which is just a bit of refactoring of how Linux sets up its physical IRQs, which is not that interesting.

There's a design document if you want to know more; you can get to it from that link. It goes into more detail and covers the complete algorithm used, so if you're implementing this in a guest in another OS, you'll want to read that. Any questions?

So, on your graph of latency fairness, I noticed that it looks like the average latency is fairer but higher in the new scheme. Do you know where that latency comes from, and have you got a plan to try and reduce it?

I believe most of it is effectively just the extra cost of the link and unlink: it's more expensive to do the compare-exchange than a simple bunch of test-and-clear-bit operations, I think, but I've not profiled anything to tell. We also have this thing where, to support different priorities, after we handle an event we go back and do another exchange on the ready field to see if there's a higher-priority queue pending. That might add extra cost.

So you did say that the one on the left happens five times as often as the other ones, right?

I forget exactly. The test setup is that we service three events at once, one of which is always on port 73 and the other two are randomly distributed throughout the other nine ports, so it comes out as four or five times on average.

Is it possible that the reason it's higher is because now you're frequently servicing number 73 before the other ones? Maybe not, never mind.

I'm not quite sure I followed that. Never mind.

So in the earlier charts, it mentioned that the domU kernel is what puts the event back on the free list. I'm curious, how does the hypervisor allocate from the list without having a race with the guest freeing something on it?

In this pseudocode, for the link you basically do a compare-exchange, and as part of that you atomically test whether the tail is still on the list by checking the linked bit. If it gets cleared during that compare-exchange, the compare-exchange will fail; you go back around again, see that the linked bit is clear, conclude that the list is empty, set the head, and tell the guest that there are new events pending on this queue. And similarly, when you remove an event, the unlink is another compare-exchange that atomically clears both the link field and the linked bit simultaneously, and then returns the value it just cleared.
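That unlink might look roughly like the following sketch, reusing the illustrative event-word definitions from the earlier fragment (this mirrors the shape of the algorithm described, not the exact guest source):

```c
/* Atomically clear L and the link field together, returning the old
 * link value.  That value becomes the guest's new local head; zero
 * means the queue drained, so the next pass re-reads HEAD from the
 * per-vCPU control block. */
static uint32_t clear_linked(_Atomic event_word_t *word)
{
    event_word_t old = atomic_load(word);
    event_word_t new;

    do {
        new = old & ~(EVTCHN_FIFO_LINKED | EVTCHN_FIFO_LINK_MASK);
    } while (!atomic_compare_exchange_weak(word, &old, new));

    return old & EVTCHN_FIFO_LINK_MASK;
}
```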
The design document has more pseudocode explanation of this, if you're interested.

I have a question. Can you turn to the future work or to-do list? Which one? I mean the future work or to-do list, the last thing. Future work, yeah. You mentioned the Xen event lock scalability, so what does that actually mean? Is there any serious contention on the existing Xen event lock that you've seen?

Yeah, I've done synthetic tests. I was stress testing the evtchn device, actually, to remove a similar bottleneck in the evtchn device, where there was effectively a single global lock regardless of how many users had opened it. So all the QEMUs, despite having opened the evtchn device independently, would all contend on the same event channel lock. As part of testing that (I'd need to pull up the graph; I don't know if I have Wi-Fi, probably not), if you just spam an event channel, you see this really easily. Whether it's actually an issue in practice, I don't know, but I can certainly see it mattering as we tackle some of the other contention points. Once you've tackled, say, the grant table locking contention, then the event channel lock contention is probably important. But I don't have any data to show that. Any other questions?

So have you tried running large numbers of VMs with this? And another question: are there other impediments to doing so, other than event channels?

Yes to the second part: there are other limits. I think Jonathan Davies has a talk later on that will go more into the general scalability of Xen. I have not actually run that many VMs, partly because my test box doesn't have enough memory, but you can basically create as many effectively-loopback event channels as you like, and so I've run with 100,000-plus event channels, up to the limit. But I've not run lots of VMs. I think Waste has some work as well on that; I remember he's given a... But that's a different design. I don't think my experiment counts for this. So maybe we can repeat that at some point. Yeah, sure. Yeah, this actually varies from one box to the next, not only the event channels, I think.

More questions? No? All right. In that case, thank you.