Yeah, so welcome to my talk. My topic is improving the scalability of Xen: the 3,000 domains experiment. It's going to be a mixture of things. The first thing is that I'd like to introduce the Xen architecture to you, but thanks to Lars, he did a very good job just now, so I don't need to explain it any more and I'm just going to skip that part. I guess I can save everybody a few minutes or even ten minutes; my slides are almost identical, but his pictures are much more beautiful than mine, of course. The second part is the current scalability status. The third part is some work going on in our community to address the scalability problems at the moment. The fourth part is about the experiment and the demo. And finally I will also share some interesting findings from this experiment. So let's get started.

Because Lars has covered this before, I'm going to skip the marketing part as well; I'm not really a very good marketing person, not really good at presenting this kind of thing. Xen has a very large user base, estimated at more than 10 million individual users, and it runs the largest clouds in the world. It is also the foundation for several client-side projects such as Qubes OS and the like. And the most important bit for the open source community is that it is completely open source: it is licensed under GPLv2 with DCO, just like Linux. So if anybody wants to hack on Xen, you just need to clone the tree and send in patches. That's also cool. We also have a very diverse community, just as Lars showed us, so I'm going to skip this part too. And the Xen architecture: as you can see, my picture is a lot uglier than Lars's. (Thank you. Thank you for coming. Sorry.)

And this PV protocol, which Lars didn't cover, I'm going to explain. The PV protocol is a generic protocol used between a front-end and a back-end device. We have a shared ring in memory which is divided into several slots, and each slot is a union of request and response. In general, when the front-end needs to send something to the back-end, it puts the request in a slot and then notifies the back-end via the event channel. The back-end gets the request from the slot, processes it, puts the response back in the ring, and notifies the front-end via the event channel as well. The event channel is a primitive provided by Xen which is used to do notifications, and it's a very important primitive in Xen; many things are mapped onto event channels. The things mapped onto event channels are: physical IRQs, that is, the interrupt lines from real hardware devices; virtual IRQs, which are the interrupts for virtual devices; IPIs, which are inter-processor interrupts; and last but not least, inter-domain notifications. The event channel shown here is the inter-domain event channel, which is used to notify both ends, so it's a very important component in the Xen system and very relevant to the scalability problem. This is the driver domain, which was also covered by Lars just now, and this is the HVM guest, and this is the PVHVM guest. I really don't have much to say about these guests, but I just need to remind you that these arrow lines here and here are basically all event channels.
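To make the split-driver flow a bit more concrete, here is a minimal C sketch of the idea: a shared ring whose slots are a union of request and response, with the front-end kicking the back-end through an event channel after producing a request. This is only an illustration of the concept described above, not the actual Xen code (the real implementation lives in Xen's public ring headers), and all the type and function names here are made up for the example.

```c
#include <stdint.h>

typedef uint32_t evtchn_port_t;
/* Assumed to be provided by the event channel layer (hypothetical here). */
void notify_remote_via_evtchn(evtchn_port_t port);

struct demo_request  { uint64_t id; uint64_t sector; };   /* what we ask for  */
struct demo_response { uint64_t id; int16_t  status; };   /* what we get back */

/* Each slot in the shared ring is a union of request and response. */
union demo_slot {
    struct demo_request  req;
    struct demo_response rsp;
};

#define DEMO_RING_SLOTS 32

/* One page of memory shared between front-end and back-end. */
struct demo_ring {
    uint32_t req_prod, req_cons;     /* request producer / consumer indices  */
    uint32_t rsp_prod, rsp_cons;     /* response producer / consumer indices */
    union demo_slot slots[DEMO_RING_SLOTS];
};

/* Front-end side: place a request in the next free slot, then notify the
 * back-end over the inter-domain event channel. */
static void demo_frontend_send(struct demo_ring *ring, evtchn_port_t port,
                               const struct demo_request *req)
{
    ring->slots[ring->req_prod % DEMO_RING_SLOTS].req = *req;
    __sync_synchronize();              /* make the slot visible before the index */
    ring->req_prod++;
    notify_remote_via_evtchn(port);    /* "kick" the back-end */
}
```

The back-end then does the mirror image: it consumes the request, writes a response into the ring, and notifies the front-end over the same event channel.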
So the scalability of Xen is somewhat limited by the number of event channels supported by the whole system. Now, the current scalability status. The latest release of Xen is Xen 4.2. In that release we support up to 5 terabytes of host memory.

[Audience] You mentioned event channels, the limitation of event channels. Is it limited by hardware?

No, no, it's actually a software thing. I'm going to go into detail a bit later.

So, the current status of Xen scalability. With the Xen 4.2 release you can have up to 5 terabytes of host memory (that's for 64-bit builds) and up to 4095 host CPUs (also for 64-bit builds). If you are running a PV VM, you can have up to 512 vCPUs, and if you are running an HVM VM, you can have 256 vCPUs for that VM.

[Audience] Sorry, what does PV stand for? I missed it in the last talk.

PV is short for para-virtualized, and HVM is hardware virtual machine, that is, a VM using the hardware virtualization extensions. Thanks, Lars.

But the number of event channels is relatively small. In the current design we only have 1K event channels for 32-bit domains and 4K for 64-bit domains. So let's do a simple calculation. A typical PV or PVHVM VM has 256 megabytes to around 240 gigabytes of RAM. I didn't make this up; I got this data from the internet, and Amazon does have some crazy big instances with 240 gigs of RAM. But that is not really a bottleneck, because on a 64-bit architecture you can easily handle that amount of RAM. Then for CPUs, typically you have 1 to 16 virtual CPUs per VM. Of course you can have more, but that's probably enough for your daily usage, so it's also not really a bottleneck for the Xen system.

Then you need at least 4 inter-domain event channels per VM. The first one is for Xenstore. Xenstore is the central database used to store domain configurations and exchange configuration information, and it consumes one event channel. The second one is the console; you can think of it as the counterpart of the serial port on a physical machine, and it consumes one event channel as well. Then, if you want to do something practical, like communicating with the external world, you need a virtual network interface, which also needs an event channel for notifications. And the last one: if you want to store data in an image, you need a disk, a virtual block device. So for a typical domU you need at least 4 event channels, and you would need more if you want more devices, or if we get something like multi-queue vifs in the future, where one vif will consume more event channels as well.

So the calculation is as follows. Look at it from a backend domain's point of view, that is, dom0 or a driver domain; they are called backend domains in general. A backend domain has its own PIRQs, VIRQs and IPIs, and the number of event channels consumed by these three things is related to the number of CPUs and devices; a typical dom0 uses something like 20 to 200. The remainder divided by 4 yields fewer than 1,024 guests supported per 64-bit backend domain, and even fewer for 32-bit backend domains. Then people might ask: one thousand guests per backend domain, that still sounds big, right? Yes, that's actually a very big number if you are using Xen in a normal use case.
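To make that budget concrete, here is a tiny back-of-the-envelope calculation of the limit just described. The 4,096-channel ceiling and the 4 channels per guest come from the talk; taking the upper end of the 20-to-200 range for the backend domain's own PIRQs, VIRQs and IPIs is my assumption for the example.

```c
#include <stdio.h>

int main(void)
{
    int total_channels = 4096;  /* 2-level ABI limit for a 64-bit domain       */
    int backend_own    = 200;   /* PIRQs + VIRQs + IPIs of the backend domain  */
    int per_guest      = 4;     /* Xenstore + console + vif + vbd              */

    /* (4096 - 200) / 4 = 974, i.e. fewer than 1,024 guests per backend domain */
    printf("guests per backend domain: %d\n",
           (total_channels - backend_own) / per_guest);
    return 0;
}
```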
If you only run dozens or even hundreds of domains with typical workloads, like a fully-fledged Linux running Apache, a mail server, or whatever service you see fit, then 1,000 domains is actually quite enough for now. But we certainly have some use cases that can easily hit this limit. For example, the OpenMirage project, which was also mentioned by Lars and is an incubated project of the Xen Project, tries to spawn as many VMs as possible to serve the external world, so it can easily hit this limit. So now we are trying to look into the future and see whether we are prepared for the next step of the Xen world. But rest assured: if you are just running normal workloads, Xen can handle them well at this stage.

I joined Citrix last December and was asked to look at the scalability issues of Xen at that time. Then I saw an email on the xen-users list from a user who tried to run 1,000 domUs on a single host. Each domU was actually a modified Mini-OS. But he had problems: the domains seemed to be running, but he couldn't access some of them. I took a look at that. Actually, 1,000 domains is not really a big deal if you remember the calculation we just did; running 1,000 domains more or less only runs into toolstack limits. The main problem the user had was with the Xen console: he couldn't access the console of the 338th domain onwards. I looked into it, found the genuine cause, and fixed it.

[Audience] Could this be a restriction of the hardware platform they were running on? Maybe they ran out of some resource on the platform?

No, no. You mean the 1,000-domain case, where they couldn't access the console? That's actually just a software problem, and it's already fixed.

Well, I saw this email and I thought: 1,000 domains, there's no fun in doing that, because Xen can already handle it. Then I thought, how about 3,000? Let's get to that, because with that number of domains we can definitely hit the event channel limits. Even if we equip each domain with only two event channels, one for Xenstore and one for the Xen console, that already comes to 6,000 event channels, which exceeds the 4K limit of the 64-bit build. We could also possibly discover further toolstack limits as well as backend limits. And a more open-ended question: is it really practical to run this huge number of domains at this stage, what should we care about, and what should we look into and improve in the near future? So that's an open-ended question. Let's get to it.

The first thing is the toolstack limit. It started with the 1,000-domain case: why couldn't the user access the 338th domain onwards? That's because xenconsoled uses the select system call, and that system call has a limit in Linux: it can only handle 1,024 file descriptors at a time. In the xenconsoled case, it opens about 9 file descriptors when it starts up and then 3 file descriptors for each domain, so 338 multiplied by 3 plus 9 runs into the 1,024 limit. This one was easy: I just wrote a patch to switch to the poll system call, which in theory can support tens of thousands of file descriptors, and that should be enough. These patches to xenconsoled and xenstored have been upstreamed.
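For what it's worth, the difference between the two system calls is easy to sketch. This is not the actual xenconsoled patch, just an illustration of why select() tops out where it does and why poll() doesn't; the helper name is made up.

```c
#include <poll.h>
#include <sys/select.h>

/* select() works on fd_set, a fixed bitmap of FD_SETSIZE (1024 on Linux)
 * file descriptors, so any fd numbered 1024 or higher simply cannot be
 * watched.  With ~9 fds at startup and 3 per domain, 9 + 338 * 3 = 1023,
 * right at the edge of the bitmap, so consoles from around domain 338
 * onwards fall off the end. */

/* poll() takes an array of arbitrary length, so the number of watched fds
 * is only bounded by the process's open-file limit. */
static int wait_for_consoles(struct pollfd *fds, nfds_t nfds)
{
    return poll(fds, nfds, -1 /* block until at least one fd is ready */);
}
```

An event-driven library could scale further still, which is what the remark about event-driven libraries later in the demo alludes to.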
So this toolstack limit is taken care of now. Then there's the bigger problem: the event channel limit. It was actually identified as a key feature for the release, and two designs have come up so far. The first one is a 3-level event channel ABI, and the second one is the FIFO event channel ABI. I'm going to go into detail; it could be a bit boring, but here we go.

The 3-level ABI was designed and aimed at the 4.3 timeframe, so it needed to be straightforward and simple. In fact it's just an extension of the default 2-level ABI, hence the name. Work started in December 2012, the version 5 drop has been posted by now, and it's almost ready.

Let's first look at the default ABI. The default ABI has been around for a long time, a decade or so. In this ABI each event is represented by two bits, one for pending and one for mask, in global shared bitmaps of events. If an event is raised, the corresponding bit in the pending bitmap is set. Furthermore, if that event is not masked, a bit in a per-vCPU upper-level selector is set, which is used to speed up the pick-up path: each bit in the selector maps to a single word in the shared pending bitmap. Finally, the upcall-pending flag in the vCPU structure is set as well, so the kernel knows about it: on every return to guest context, the kernel looks at the upcall-pending flag, and if it is set, it knows there is an event pending. It then looks at the selector, picks the pending bit in the selector, picks up the corresponding word in the shared bitmap, picks the actual bit that represents the event, finally calculates the event port and handles it. So that is the set-pending and pick-up path.

The 3-level ABI is designed as follows: we simply add another level of bitmap, so now we have two selectors, a first-level selector and a second-level selector. The set-pending path is: we first set the bit in the pending bitmap; then, if the event is not masked, we set the bit in the second-level selector, then the bit in the first-level selector, and then the upcall-pending flag. The kernel-side pick-up path is just the other way around: it sees the upcall-pending flag, reads the first-level selector, then the second-level selector, and finally the actual event itself.

Do you mean the memory footprint or the time? The time is very hard to measure and has not been measured yet, but this approach is simply several bit operations, which should be fast. The number of event channels supported by this design is 32K for 32-bit guests and 256K for 64-bit guests. As for the memory footprint, we have two bits per event (pending and masked), so we need 2 or 16 pages for 32-bit or 64-bit guests respectively, and we also need to map a number of pages proportional to the number of vCPUs into Xen for the control structures. Furthermore, this ABI is only envisioned for dom0 and driver domains; other domUs keep using the default ABI, because a normal domU can never use that many event channels.
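Here is a simplified C sketch of the pick-up path just described, using the 2-level layout; the 3-level ABI just inserts one more selector level in front of it. Word counts, masking of the upcall flag and locking are all glossed over, and the structure names are only loosely modelled on the real shared-info layout, so treat this as an illustration rather than the real guest code.

```c
#include <stdint.h>

#define BITS_PER_WORD 64

struct shared_bitmaps {
    uint64_t pending[64];           /* global pending bitmap                  */
    uint64_t mask[64];              /* global mask bitmap                     */
};

struct vcpu_state {
    uint8_t  upcall_pending;        /* "something is pending for this vCPU"   */
    uint64_t pending_sel;           /* selector: one bit per pending word     */
};

/* Guest-kernel side: on return to guest context, walk selector -> word ->
 * bit and compute the event channel port to handle.  Returns -1 if nothing
 * is pending. */
static int pick_up_one_event(struct vcpu_state *v, struct shared_bitmaps *s)
{
    if (!v->upcall_pending)
        return -1;

    int word = __builtin_ffsll(v->pending_sel) - 1;      /* which word?    */
    if (word < 0)
        return -1;

    uint64_t ready = s->pending[word] & ~s->mask[word];  /* unmasked only  */
    int bit = __builtin_ffsll(ready) - 1;                /* which bit?     */
    if (bit < 0)
        return -1;

    return word * BITS_PER_WORD + bit;                   /* the event port */
}

/* In the 3-level ABI the walk becomes: first-level selector ->
 * second-level selector -> pending word -> bit, which is what pushes the
 * limit to 32K / 256K ports for 32-bit / 64-bit guests. */
```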
The pros of this ABI are that the general concepts and race conditions are well understood and tested, and since it is only envisioned for dom0 and driver domains, the memory footprint is not very large. The downside is that the bitmap has no event priority; that's a downside inherited from the 2-level design.

Then there's the FIFO ABI. The motivation is that, since we are designing a new ABI anyway, why not start from the ground up and get more great features? The design draft was posted in February and the first prototype in March. It's still under development, but it should be ready by the 4.4 release. In this ABI each event is represented by a 32-bit word and is placed in a queue. Each event word is divided into different sections: the highest three bits are used as the pending bit, the mask bit and the linked bit. The pending bit and mask bit are easy to understand; the linked bit means that this event is chained into a queue. The lower 17 bits are used as the link field, which points to the next event in the queue. We also have a per-vCPU control structure which holds several queues (this one represents a queue, and this one as well), so with this design we now have event priority: we can assign events to different queues with different priorities.

Here we have a picture showing an empty queue and a non-empty queue. On the left-hand side is the empty queue; only the queue structure is shown and there is nothing in it. On the right-hand side is a non-empty queue: for example, in queue zero the tail is 1 and the head is 5, which means the first event in the queue is number 5. We pick that up, look at its link field and find event number 7, the next event in the queue; then we pick up event number 1, whose link is 0, so there is nothing more in the queue. But as this design turns every event into a 32-bit word, the state machine is a bit more complex compared to the 3-level design. The number of event channels supported is 128K by design, because the link field has 17 bits, but it is extendable. As for the memory footprint, with one 32-bit word per event we need up to 128 pages per guest to map all the event words into Xen, plus a number of pages proportional to the number of vCPUs for the control structures. So with this design we should definitely use the toolstack to limit the maximum number of event channels a domU can have; otherwise a normal domU may consume up to 128 pages inside Xen, which is really not desirable. The pros are that with this design we have event priority, because we have several queues and can handle them with different priorities, and secondly FIFO ordering: you get a first-in-first-out guarantee because it is a FIFO queue. The downside is that it has a relatively large memory footprint, because by design it needs to be enabled for all domains, otherwise it has no clear advantage over the 3-level design.

So here's the community decision. This scalability issue is actually not as urgent as we thought; only the OpenMirage project has expressed interest in more event channels. So the decision is to delay this until the 4.4 release, because from a maintenance point of view it is better to maintain one extra ABI than two, and by that time we should be able to measure both solutions and pick the better one. Also, because event handling is complex by nature, we would rather take the time to properly test both designs.
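Before moving on, here is a small sketch of the FIFO event word and queue traversal described above, following the talk's description (pending, mask and linked bits at the top, a 17-bit link field pointing at the next event). Exact bit positions, the per-priority queues and the atomic state machine are simplified, so this is an illustration of the idea rather than the final ABI.

```c
#include <stdint.h>

#define EVT_PENDING   (1u << 31)        /* event has been raised            */
#define EVT_MASKED    (1u << 30)        /* delivery is masked               */
#define EVT_LINKED    (1u << 29)        /* event is chained into a queue    */
#define EVT_LINK_MASK 0x0001ffffu       /* 17-bit link -> 128K events       */

/* One queue per priority in the per-vCPU control structure. */
struct fifo_queue {
    uint32_t head;                      /* first event in queue, 0 = empty  */
    uint32_t tail;                      /* last event in queue              */
};

/* Guest side: consume the event at the head of the queue and follow the
 * link field to the next one, e.g. 5 -> 7 -> 1 -> 0 in the talk's example. */
static int consume_head(struct fifo_queue *q, uint32_t *event_words)
{
    uint32_t port = q->head;
    if (port == 0)
        return -1;                                      /* nothing queued   */

    uint32_t word = event_words[port];
    q->head = word & EVT_LINK_MASK;                     /* follow the link  */
    event_words[port] &= ~(EVT_LINKED | EVT_PENDING);   /* consume it       */
    return (int)port;                                   /* handle this port */
}
```

The real ABI also has to handle races between the guest and the hypervisor when setting and clearing these bits, which is the "more complex state machine" mentioned above.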
Enough pros and cons, theories and calculations; let's get to the demo.

[Audience] Are there any attacks that the two approaches, or maybe three if you include the 2-level ABI, might be more susceptible to, whether denial-of-service attacks or other types of attacks? Does it make a difference from the security side, or did you not discuss this?

Can you repeat that?

[Audience] I'm just thinking: are there any security issues around event handling? If I can create a denial of service based on my knowledge of the event handling, it might be that one design is more susceptible to being attacked than another, if I can flood the event queue in such a way that it can't be handled in a timely manner.

You mean the FIFO design?

[Audience] Well, either design; is either design better in that respect than the other? I mean flooding the event queue, or whatever, just by producing lots and lots of events.

Our team did have a discussion about the fairness of these two designs.

[Audience] The point about it being secure, or robust, is: if you are flooding it with excessive use of a particular resource and it is running out of that resource, how well does it fail? Does it have a failsafe mechanism? Is one design failing more gracefully than the other? That was the question. I chaired the WebCL Working Group, and in that regard we have been looking at denial of service. One definition of denial of service for an OpenCL or WebCL kernel is that a kernel takes such a long time that it overuses its share of resources and can bring a GPU to a halt. So depending on your scenario, your denial of service may have different definitions; in the example of an Amazon server, you could send so many requests to it that it is brought to its knees. Denial of service means somebody is using up so many resources that there are none left for anyone else; the system doesn't necessarily slow down or come to a complete halt. So in that scenario, in the event queue, if too many events were getting generated, how would it handle that? First, can it detect that there is a potential attack? The detection comes first, and the second step would be: if and when it detects that, does it have a failsafe mechanism so that it fails in a graceful manner, where the whole system doesn't get brought down, so your whole VM is not brought down as a result? Denial of service is a catch-all phrase that people tend to use any time somebody is using more resources relative to everyone else.
[Audience] If we don't know, we can think about it in the future; I just wondered whether we had thought about it already.

I'm not quite sure I get it right, but as for flooding the queue, there's actually no way of flooding the queue, only of affecting the processing. In the 2-level or 3-level ABI design, raising an event is just setting a bit; if it is already set, it's already set, so there is no overflowing the bitmap. And in the FIFO ABI design, if you try to re-queue an already pending event, you can't, because it's already in the queue; the state machine prevents you from doing that. So in both designs it is guaranteed that even if an event is raised multiple times, you only see one event from the backend domain's point of view, or whatever domain's point of view. So there's not really a way to flood this thing. Let's carry on.

[Audience] We can think of some ways to be evil. Especially in a public talk.

So, where was I? Yes, the experiment itself. Even though neither of these designs was taken for 4.3, I still did the experiment just to see what interesting findings would come out of it. I took advantage of the 3-level design because it's almost ready and it's usable. The first run is the 3,000 Mini-OS one. The hardware spec is 2 sockets, 4 cores each, 16 threads in total, with plenty of RAM; it's a normal server. The software configuration is: dom0 has 16 vCPUs, which is more than enough, and 4 gigs of RAM; each Mini-OS domain has 1 vCPU and 4 megabytes of RAM, as well as 2 event channels, one for Xenstore and one for the Xen console.

It's going to be a quick demo. So they are all up; I've already spent too long on this. They are actually running, and you can connect to, say, number 2996. OK, as you wish, 2996. It's just a bit laggy, because even though I switched xenconsoled to poll, handling so many file descriptors is still rather slow; we could probably try something else, like an event-driven library, but it's just too much work at this stage. If you want that, we can do it in the future. It's really not doing anything useful, but at least we can see that it's receiving and delivering events, so it's working; that's all I wanted to prove. There are other things, like domain creation time, but I don't know how to show you that here, so let's forget about this demo and go to another one.

3,000 Mini-OS domains is easy as pie; what about 3,000 Linux domains? For that we need a much more powerful machine. There's one machine called Hydramonster in our lab which has 8 sockets, 80 cores and 160 threads, and half a terabyte of RAM. The software configuration is that dom0 has 4 vCPUs, each pinned to a physical CPU, and 32 gigs of RAM, which is more than enough for it. Each domU has 1 vCPU, 64 megabytes of RAM, and 3 event channels: the normal two for Xenstore and console, plus one for the vif. But I don't have a virtual disk configured for these domUs, because it would just be too painful; you can imagine several thousand processes accessing the hard disk at the same time, that would be dog slow.

So let's switch to this Hydramonster. I'm going to run xl list once again. They have actually been running for about 10 days, and they seem to be running well, because the CPU time keeps growing; that means their events are not lost, they get timers, they get events, they get scheduled, they get CPU time to run. And then let's
connect to one of these 3,000 domains; just pick one. Number 3,000? Oh no, not this one, I might have broken it... no, it's not broken. So this is just a normal BusyBox shell. I also have a network interface, but with no IP address assigned, because it would be rather rude to ask for several thousand IP addresses on our network; the network admins would blame me for that. We can also have some fun here, although it's just too slow. That's the log, nothing special, just normal. But I installed Tetris in this one; I'm going to play Tetris for the rest of my talk. Just joking.

Then let's look at the bridge; this is also going to flood the screen. There is actually a limitation in the Linux bridge which only allows you to attach something like 1,024 interfaces to the same bridge, so if you really want to connect several thousand domains you may need to fiddle with iptables or other things. It's just a bit painful, so I didn't do that. We can also have a look at how many event channels dom0 is up to: that's 9,000 or more. So that's the end of the demo; it's quite simple, but it's working.

Then there are some observations I got from this experiment. The first is domain creation time; by domain creation time I mean the time from when you type xl create until you get back to the prompt. If you have created fewer than 500 domains, that time is acceptable, less than 5 seconds to get back to the prompt. But once you have created more than 800, it's rather slow, taking 10 or 20 seconds or so, and in the end it took hours to create all 3,000 domains. I didn't do that by hand, I wrote a script, but I did run the creation once or twice by hand just to see how much time it takes to create the 3,000th domain, and that was something like 40 seconds, which is completely unacceptable. The toolstack gets the config file, extracts the kernel, writes the entries to Xenstore, brings up the device model and then returns, and that takes 40 seconds, which is not acceptable.

The second observation is the backend bottleneck. One thing is the network bridge limit in Linux which I just talked about: you can only attach 1,024 interfaces to a single bridge. Then there are the PV backend driver buffers, which are of fixed size at this time, because the PV backend drivers were not designed to support this huge number of domains; so if you run iperf against dom0 simultaneously from several domains, basically nobody gets decent throughput. Also the I/O speed is not acceptable, especially the disk I/O speed; that's why I didn't configure a disk for every domU. There is also a rough estimation that Linux with 4 gigabytes of RAM can only allocate about 45,000 of the relevant internal structures due to memory limitations, which is probably enough for daily usage.

Another observation, which is quite obvious, is CPU starvation, because with thousands of domains running on a single host we have a density of something like one physical CPU to 20 virtual CPUs. So if we don't dedicate several physical CPUs to the per-host service domains, there is a very high chance of stalling the whole system. I did make that mistake: several domains were spinning and consuming more CPU than dom0 could get, dom0 starved, and the whole system became unusable. So we really need to dedicate several physical CPUs to the per-host service domains.

So the conclusion is that running thousands of domains is doable, it's achievable, but it's not very practical yet, because we still have many more bottlenecks to address. The first is the hypervisor and the toolstack: we need to speed up domain creation; we might have bottlenecks
in the toolstack or in the hypervisor or in both, but that still needs more investigation. There are also the hardware bottlenecks, where the only thing we can do is wait; there's really nothing more we can do there. Then the PV backend drivers: the buffer sizes could be addressed by making them configurable by the host admin, but we also need to investigate the processing model of the backend drivers a bit. And a possible, credible way to run thousands of domains at this stage is to offload the services to dedicated domains and trust the Xen scheduler to do the right thing, but I won't go further than that, because I'm not an expert in that field. So thank you, that's all. Thank you.