OK. Hello, good morning everyone. I hope you're all caffeinated this morning. I've been very impressed with the quality of some of the local coffee shops; I've become a bit of a coffee aficionado recently, and it's been really nice to try some different filter coffees.

So let me begin. My name is Robert Bradford and I work at Intel. I've been working in open source professionally for around 15 years now, most recently in the virtualisation space. A few years ago I presented here at KVM Forum, back in Edinburgh, about Rust Hypervisor Firmware, which was one of the projects I started on my journey of learning Rust and looking at different ways we can boot virtual machines. Most recently I've been working on Cloud Hypervisor as one of the maintainers; it's a VMM written in Rust using the rust-vmm crates. Today I'm going to talk about my most recent work, which was looking at how quickly we can boot a VM into user space, with the goal of doing it in 100 milliseconds. And the answer was: yes, I could do it in 64 milliseconds. So thank you everybody!

No, in all seriousness, you have to take every number you see with a pinch of salt. That was the number I achieved on my nice new Alder Lake desktop, a big upgrade from my Haswell system, but your numbers will vary. I actually found my numbers were pretty stable, but there was a lot of variance depending on what other workloads were running at the same time, so I would repeat each test several times. What I want to say is that the 100 milliseconds I set out to achieve is definitely doable, and I'm going to present some of the steps along the way to reaching it and some of the challenges I had. But don't obsess about the exact numbers too much.

So why do we want to boot quickly? There's increasing use of functions-as-a-service and containers-as-a-service, if you look at functionality like AWS Lambda and the equivalents on GCP and Azure. Providers want to encapsulate those workloads in a VM for the protection, but they still need to serve them quickly, and that can mean an improved user experience if those sites respond very quickly. Also, in the current climate we want to be focused on minimising resource consumption; in general, anything we can do for optimisation helps with power consumption as well. One thing I learnt is that some of the optimisations I was looking at also improved steady-state performance, so they're useful beyond the initial boot. And don't undervalue the pure intellectual curiosity of looking into a problem and coming up with a solution; that's a really strong motivator for me in this kind of situation.

So what do I mean by boot time here? I'm looking at the case where we boot directly into the kernel, so we're not going through a firmware or a bootloader. In Cloud Hypervisor we support the Linux PVH boot protocol, so we boot directly into the vmlinux. I'm ignoring the management layer that might sit on top, something like libvirt or Kata Containers, and I'm also disregarding what happens in user space afterwards. I'm really just focused on the window from starting the VMM to running the first instruction in user space.
So that's the time window I'm most interested in, and that's what I'm looking at here.

How do I measure that time? The easiest way was to instrument the kernel (we're running our own custom kernel anyway, since we're directly booting into it) to do something the VMM can recognise. The simplest approach is to trigger a VM exit, and one of the easiest ways to do that is to write to an I/O port; we use the 0x80 debug port. Inside Cloud Hypervisor we catch that and emit a log event at that point in time, and because our logging starts from the very beginning of the VMM, it's easy to measure the elapsed time. There are alternative approaches: you can use the KVM tracepoints to trace the actual PIO exit and get a timestamp from there as well. There's a link for that here; these slides are uploaded on the website, so you can use this material.

There are several different ways to analyse that number and go about improving it, and you can split them into two ideas. You can look at things in the frequency domain: how often does a thing occur? A great tool for that is perf, either sampling with the standard counters and looking at the call graph, or using explicit counters for the VM exits; in Cloud Hypervisor we have a counter for VM exits, and with perf you can also look specifically at the KVM events and the types of KVM exits. Or you can look at things in the time domain, which is tracing. I built some custom tracing infrastructure to generate that kind of output, and the logs from the VMM help too: we log at some key points in time and you can reference back to those. And actually the guest dmesg, if the kernel is compiled with timestamps, gives a really close approximation to the boot time if you know how long you spend before executing the kernel. That was a really useful approximation for checking whether things were making any difference, without the overhead.

So this is my initial test. This is pretty much the standard recommended kernel and Cloud Hypervisor configuration: we're outputting to the serial port and booting with a virtio block device. We use io_uring in our block device, so it has pretty good performance in terms of throughput and latency. But with this configuration the boot time was 166 milliseconds, which is not the 64 milliseconds I quoted earlier, and it's not below the 100 millisecond goal I set when I started this work. So, okay, we'll have to do better.

So I loaded up perf; here we're looking at things in the frequency domain. I'm sorry, if you can't read this slide you don't need to go and see your optometrist, it is very small and quite blurry. But basically it shows that we were getting a lot of PIO exits from the serial port, and every one of those had to be handled in Cloud Hypervisor.
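Every one of those serial writes takes the same path as the boot-time marker I described a moment ago: a port I/O exit that the VMM has to pick up and dispatch. As a purely illustrative sketch (this is not Cloud Hypervisor's actual code, and the port number, types and dispatch are simplified), the VMM side of the measurement idea looks roughly like this, logging the time since the VMM started when the guest writes to the debug port:

    use std::time::Instant;

    /// The debug I/O port the instrumented guest writes to (illustrative).
    const DEBUG_PORT: u16 = 0x80;

    struct BootTimer {
        vmm_start: Instant,
    }

    impl BootTimer {
        fn new() -> Self {
            // Capture the reference point as early as possible in the VMM.
            BootTimer { vmm_start: Instant::now() }
        }

        /// Called from the vCPU loop for every OUT the guest performs
        /// (a KVM_EXIT_IO); logs when the marker port is hit.
        fn handle_pio_write(&self, port: u16, value: u8) {
            if port == DEBUG_PORT {
                println!(
                    "guest marker {:#x} after {} ms",
                    value,
                    self.vmm_start.elapsed().as_millis()
                );
            }
        }
    }

    fn main() {
        let timer = BootTimer::new();
        // ... create the VM, load the kernel, run the vCPUs ...
        // A KVM_EXIT_IO for DEBUG_PORT would eventually land here:
        timer.handle_pio_write(DEBUG_PORT, 1);
    }

On the guest side, the instrumented kernel (or the first user-space process) just performs a single byte write to that port once it reaches the point you want to measure.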
Each of those exits also required a bus lookup. All the Rust VMMs use this common bus infrastructure, which originally came from crosvm, and walking that bus to find the underlying device that should handle the PIO exit has a small cost. So we want to minimise those PIO exits, and the easiest way to do that was to switch to virtio-console. You do miss the early logging prior to the virtio-console being initialised, but since we're directly booting into the kernel we're less interested in things like firmware logging, so not having the serial port is not a big disadvantage. When I did that, I went below 100 milliseconds on my test machine. Removing the PIO exits for the serial port was a really big improvement, not just from the VM exits themselves but from all the handling we were doing behind the scenes to actually produce that serial output.

But we can go a little further. We can add "quiet" to the kernel command line; remember, the kernel command line is under our control because we're directly booting the kernel, and if you're using a framework for functions-as-a-service or containers-as-a-service workloads, you'll have control of that command line as well. You don't necessarily need that output, and it might not be particularly helpful. That shaves some more time off the metric, so we're getting closer to the 50 millisecond goal I was hoping for when I wrote my talk proposal; I'll explain a bit more about that later.

But when I went back to the analysis with perf, we now had a big increase in the amount of activity on our virtio-block device; we were being bottlenecked by the handling for the block device. Now, in Cloud Hypervisor we also implement a virtio-pmem device, which triggers far fewer exits because the reads and writes don't need to go onto a virtqueue; instead the memory is mapped directly into the guest. There is still some use of the virtqueue for signalling flushes, but it's much reduced, and during boot we see far less activity on the equivalent virtio-pmem thread. But this actually increased the boot time, which was frustrating. Then I looked at the profile again (it's always good to re-evaluate the profile after every change) and now the cost was coming from mapping those pages into the guest.

So, how can we improve on that? What happens if we use huge pages? We know the VM is going to be of a reasonable size, and we know how it's going to use its memory, so huge pages are a perfect choice here. If you enable huge pages, the boot time improves to 64 milliseconds, which is the number I quoted at the start. Out of interest I tried 1 GiB huge pages, and it gets worse again, just from zeroing all that memory that you don't necessarily need at boot time. So 2 MiB huge pages were the sweet spot for booting the VM. I then did a little more analysis, and it confirmed the idea I had, which was that the VM exits were the cause of the slowness during boot.
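To make the huge-page step a little more concrete: the mechanism that matters is how the guest RAM is backed. Below is a simplified, hypothetical sketch (using the libc crate; this is not Cloud Hypervisor's actual memory manager) of backing a guest RAM region with 2 MiB huge pages through memfd_create and mmap. It assumes the host has enough huge pages reserved, for example via /proc/sys/vm/nr_hugepages:

    use std::io;

    /// Map `size` bytes of guest RAM backed by 2 MiB huge pages.
    /// Simplified sketch: no alignment checks, minimal error handling.
    fn map_guest_ram(size: usize) -> io::Result<*mut libc::c_void> {
        unsafe {
            let fd = libc::memfd_create(
                b"guest_ram\0".as_ptr() as *const libc::c_char,
                libc::MFD_CLOEXEC | libc::MFD_HUGETLB | libc::MFD_HUGE_2MB,
            );
            if fd < 0 {
                return Err(io::Error::last_os_error());
            }
            if libc::ftruncate(fd, size as libc::off_t) < 0 {
                return Err(io::Error::last_os_error());
            }
            let addr = libc::mmap(
                std::ptr::null_mut(),
                size,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED,
                fd,
                0,
            );
            if addr == libc::MAP_FAILED {
                return Err(io::Error::last_os_error());
            }
            Ok(addr)
        }
    }

    fn main() -> io::Result<()> {
        // 1 GiB of guest RAM; this fails if the host has no 2 MiB pages free.
        let _ram = map_guest_ram(1 << 30)?;
        println!("guest RAM mapped with 2 MiB huge pages");
        Ok(())
    }

The effect is that the guest touches far fewer host pages as it faults its memory in for the first time, which is exactly the mapping cost that showed up in the profile.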
As you can see, the number of VM exits drops as we move from the worst case, with the serial port and its PIO exits, through the exits related to memory, all the way to huge pages, where the number of VM exits is significantly reduced.

But what about THP? I don't know if anyone here has experience of working with transparent huge pages, but I could not get it to work for our use case at all. It reminds me of that old quote about SCSI: in theory, it should just work. I think it might be some interaction with the fact that we use memfd_create for creating the file descriptor for our memory, but at no point could I coax anything into using THP, even after asking some experts and exploring. I would actually love to get some feedback if there's a way of making THP work here.

One thing I did was go through our history and look at our boot time using one of our automated tools for tracking performance. I spotted changes in our boot time going all the way back to March 2021, through to the current version, and we had some improvements. The eagle-eyed amongst you will notice that some of those times are actually below 50 milliseconds. That's because this test suite, which is part of our automated testing, wants to avoid any discrepancy coming from the underlying performance of the block device, so it uses a tmpfs for the guest's file system; and if that tmpfs is then exposed as PMEM, you get very, very good performance.

So what optimisations led to those changes? There were a few key ones. The first was that when I did the analysis, I discovered a lot of probing of the PCI buses beyond the first one, which was an interesting problem since we only had one PCI bus at the time. As I mentioned earlier, like Firecracker and crosvm we have a common way of looking up devices on this bus object and finding the underlying device, which is very relevant for serial devices and for the PCI config I/O port, and I wanted to work out a way of optimising that. Loading the kernel also took quite a chunk of time. Now, we only support one bus per segment (we support up to 16 segments), and the MCFG static table lets you say how many buses are associated with an individual segment, that is, with an individual host bridge. By changing that value from 255, or whatever it was, down to one, the boot time significantly improved, as the kernel was no longer probing for all those devices that don't exist. I had a look at some of the other Rust VMMs: Firecracker doesn't yet have PCI support, so it doesn't apply there; in crosvm the bus count is calculated from the PCI MMIO config size, and it might be better to calculate it the other way around, since I don't think they support multiple buses either, so that might be an improvement they could apply. Doing this improves performance significantly by skipping the extra probing that we just don't need, because we don't have any devices on those extra buses.

The PCI config I/O fast path was also really important. Simply by checking, at the point at which we get a VM exit, whether the I/O port matches the PCI config I/O port, we can bypass the bus lookup and go straight into the PCI config device.
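That fast path can be sketched very simply: before falling back to the generic bus walk, check whether the exiting port is one of the legacy PCI configuration ports (0xcf8 for the address register, 0xcfc through 0xcff for data) and, if so, hand the access straight to the PCI config device. The following is a hedged sketch with hypothetical types; Cloud Hypervisor's real implementation differs in its details:

    /// Stand-in for the PCI configuration-space device behind 0xcf8/0xcfc.
    struct PciConfigIo {}
    impl PciConfigIo {
        fn write(&mut self, _port: u16, _data: &[u8]) {
            // Program the config address / data registers here.
        }
    }

    /// Stand-in for the generic rust-vmm style I/O bus.
    struct IoBus {}
    impl IoBus {
        fn write(&mut self, _port: u16, _data: &[u8]) {
            // Walk the registered port ranges to find the owning device;
            // this per-exit lookup is what the fast path avoids.
        }
    }

    struct Vm {
        pci_config_io: PciConfigIo,
        io_bus: IoBus,
    }

    impl Vm {
        /// Handle a port write coming from a KVM_EXIT_IO.
        fn handle_pio_write(&mut self, port: u16, data: &[u8]) {
            if (0xcf8..=0xcff).contains(&port) {
                // Fast path: config-space accesses dominate PCI probing,
                // so dispatch them directly without touching the bus.
                self.pci_config_io.write(port, data);
            } else {
                // Slow path: everything else still goes through the bus.
                self.io_bus.write(port, data);
            }
        }
    }

    fn main() {
        let mut vm = Vm { pci_config_io: PciConfigIo {}, io_bus: IoBus {} };
        // A config-address write, as the kernel would issue during probing.
        vm.handle_pio_write(0xcf8, &0x8000_0000u32.to_le_bytes());
    }

Because PCI enumeration hammers those two port ranges, skipping the generic lookup for just this one case removes a small cost from a very hot path.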
The combination of those two changes led to a big improvement in boot time. With the same configuration I was talking about earlier, that gave a 37% improvement; just optimising the PCI config probing and discovery, which is so significant during boot, gave that really substantial gain.

I also wanted to look at things outside the frequency domain. The majority of the profiling and performance analysis I'd done, using tools like perf, had been in the frequency domain, and I wanted a nice trace graph, a bit like bootchart, but for the Cloud Hypervisor start-up. I could get quite a long way just by looking at our logs, because they print the time relative to the start of the VMM, but I wanted to go a bit further. So I added some trace points, built a little tracing infrastructure into Cloud Hypervisor, and then wrote a Python script to generate an SVG, which was surprisingly difficult, actually, getting the colours to look halfway readable.

That's what led to the idea of loading the kernel asynchronously. In this setup we load the kernel in a separate thread, which just reads it from disk, and at the same time we can do some of the device creation; there's a rough sketch of the pattern a little further on. On a very simple VM with only a few devices, device creation doesn't take too long, but if you start to add a lot of network devices, maybe some VFIO devices, a vfio-user device, a vDPA device, or lots of segments, the time that takes swells and you get much more of an advantage out of the asynchronous kernel loading. Right now the function entry point shown here is predominantly waiting for the kernel loader thread to finish. So we got some benefit from this optimisation, and the benefit increases the more complex your configuration is. I'm hopeful I can add tracing to other places too, to look at other specific situations we have beyond just boot.

It's all very well doing this as a one-off experiment to find what the boot time is right now and aim for a goal, but we also wanted a way of monitoring what's going on with our performance over time, and not just boot time: things like virtio-block and virtio-net throughput and latency as well. So we've got this metrics suite; you can go and visit the website and it has a fancy graph showing the metrics. We run the suite on a bare-metal machine and use it to tell us whether our patches are good quality and whether there are any big regressions, and here's a great example of where it showed something really valuable. There's a big peak here where the boot time jumped. This is an Ice Lake server, so it doesn't necessarily have the highest clock speed, but it obviously has a lot of cores; that's why the absolute numbers are higher than what I was getting on my desktop system. This peak occurred a few builds after I started looking at adding support for an updated kernel, so I had to go and find out what was going on, and I was able to bisect the kernel to one particular change.
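To illustrate the asynchronous kernel loading mentioned above, here is the basic shape of the pattern: push the kernel load onto its own thread, do the device creation on the main thread, and only join the loader once everything else is ready. This is a minimal sketch with hypothetical stand-in functions, not Cloud Hypervisor's actual loader:

    use std::thread;

    // Hypothetical stand-in: read the vmlinux image from disk
    // (the slow, I/O-bound part of VM creation).
    fn load_kernel(path: &str) -> Vec<u8> {
        std::fs::read(path).expect("failed to read kernel image")
    }

    // Hypothetical stand-in: build virtio devices, PCI topology, and so on.
    fn create_devices() {
        // ... device creation work, overlapped with the kernel load ...
    }

    fn main() {
        // Kick off the kernel load on its own thread...
        let loader = thread::spawn(|| load_kernel("vmlinux"));

        // ...and create the devices on the main thread in the meantime.
        create_devices();

        // Block on the loader only once everything else is ready; with a
        // complex device configuration this wait is often close to zero.
        let kernel_image = loader.join().expect("kernel loader thread panicked");
        println!("loaded {} bytes of kernel", kernel_image.len());
    }

The more device creation there is to overlap, the more of the kernel load time disappears from the critical path.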
That kernel change doesn't look particularly suspicious on its own, and I'm still looking into exactly why it caused the regression, so for now I went back and stopped using the latest kernel. It's really interesting that we've got the ability to run this automated tooling, both for regular builds and for being able to bisect a particular situation like this.

So how can we go below 50 milliseconds? I did actually achieve that, but I had to take some extreme measures: I had to disable networking in the kernel. I thought about that for a while. There was a functions-as-a-service system I was looking at which worked entirely by starting a process, feeding it standard input, reading standard output, and sending that over HTTP, but then I realised that if you have a function that can't access any network resources, it's not especially useful. So I think you do need networking. You can possibly go a bit further with really, really fast storage: if you have something like PMEM, persistent-memory-style storage, you can improve things and get closer to that boot time. As we saw, when the guest runs directly off RAM in a tmpfs you do get below 50 milliseconds, and we already have quite a slimmed-down reference kernel for Cloud Hypervisor. So I think it's doable, and as clock speeds and IPC improve you're going to get better and better, and the same mechanisms that we've talked about can continue to be used for optimisation.

So, in summary: the custom tooling was a really interesting exercise and showed some useful information. It showed, for instance, where costs such as building the ACPI tables sit, and that working out how to do more of the work asynchronously is valuable. But then I was telling some of my colleagues in the Kata Containers project, yes, we're working on optimising the boot of the VM from 150 milliseconds down to 100 milliseconds and below, and they said that doesn't really help them; they were talking about times on the order of multiple seconds for everything that leads up to the start of the VM through Kubernetes. So, fine; it was still a good experiment to do. I also had a preview of the latest kernel. I tried the 6.0 RC (I know it's been released now), and the boot time shot up to 240 milliseconds because of some new speculative execution prevention mechanisms, which was a bit unfortunate. So maybe we're not going to be able to keep achieving that goal if we want to continue with those kernel updates. But there's still the value of the automated monitoring, which is really, really useful for helping us understand what's going on, and there's the value of the learning exercise as well.

Booting fast, I want to say, can sometimes be used as an alternative to templating. There's a lot of excitement for templating, that is, starting VMs quickly by taking a copy of an existing running VM and starting a new one from it, but I think that possibly you could just boot quickly instead to solve that problem. Thank you very much. Any questions?

Great. The question was, did I compare different sizes of VM, based on core count and different amounts of memory? Out of interest I did look at those things, and they do scale as you would expect.
It's not linear, but they do scale as you would expect. For the purpose I was looking at, functions-as-a-service style workloads, within the range of about 512 MiB to, say, 4 GiB of RAM there was very little difference, and between two and four vCPUs it didn't make a difference. If you scale up to a very large number of vCPUs and a very large amount of RAM then, yes, you do get some differences, but within the range that makes sense for a functions-as-a-service or containers-as-a-service workload I didn't see enough variation to want to put it in the material here.

Okay, great question. The question was, can I give details about the kernel config? We use the PVH boot infrastructure, so that takes the vmlinux.bin directly, an uncompressed binary, and I think it's about 48 megabytes in the testing I was doing. This kernel is the standard one we publish the config for; it's our recommended config and it's currently based on 5.15. It has just the right set of devices for running with Cloud Hypervisor and virtio devices, and it doesn't have a lot of excessive functionality in it. And you had a second question?

Great question, okay. The question was about whether I was using an initramfs. Because we're booting directly into the kernel rather than going through a bootloader, I'm not using an initramfs: the command line has root=/dev/vda and there's no initramfs in use. It would be an interesting exercise, if you had a particular workload, to keep all your interesting logic inside the initramfs, maybe with a large-ish initramfs and perhaps temporary persistent storage, network-based persistent storage, or no persistent storage at all. That might lead to an interesting situation where you could get almost the same performance as booting directly from RAM. Of course, you've had to load that initramfs into memory, but if it's already in your page cache, that would help.

The question was, did I look at the causes of those PIO exits and how much time I was spending in them? Yes. Although I'm focused on those exits, if I can bring the slide back up: of these exits, actually only very few result in going into the VMM itself; a lot of them are handled by KVM. And I was looking at areas that we, as Cloud Hypervisor, can work to optimise. I did have some monitoring of how long I was spending inside the PCI and serial exits, but it wasn't of particular interest; it demonstrated pretty much what you would expect. That sounds really interesting; maybe we can have a chat about that later, that would be really cool.

Okay, so that was a good recommendation. The suggestion was to map the address space that we use for PCIe config directly into the guest, rather than having it go through the PIO probing. We kind of need the PIO path because we do support firmware, and firmware wants to do very simple things like accessing config space via PIO. So it's a difficult compromise, but it's actually quite a fun idea. Are there any other questions?

So, Cloud Hypervisor does support those use cases already. The question was asking for a little more detail about VM forking, or using copy-on-write snapshots for fast start-up.
In Cloud Hypervisor we do support that functionality: when you snapshot the VM and boot it again from the same snapshot, the new VM has copy-on-write mappings of the underlying RAM. We also support very fast live migration into a new VM by sharing the memory file descriptors for the guest, so as well as snapshotting to disk, you can just live-migrate into a new copy. I'm not convinced about the value of templating, although VM forking, where you can actually share the address space, might be really valuable. I think templating is sometimes used to work around the problems of slow boot, and I wanted to see whether, if we boot quickly, maybe we can avoid the complexity of the templating, because with templating you do have other potential issues, like what happens with the guest MAC address, or the block storage, and things like that. So maybe booting quickly can mitigate some of that, if your situation allows it. But I agree, you can get really fast startup times that way; in particular, if you're templating with a live program already running in user space, and your user-space program takes a long time to start, then VM templating might be really valuable, because you template with the user space already up and you no longer care how long the boot takes. I do know people who are using Cloud Hypervisor with a long-running templated user space; they hot-plug the network device and a separate persistent storage device for their data, and otherwise they use a read-only root filesystem. That's how they avoid all those problems: by hot-plugging their later devices. Wonderful. If there are no more questions, thank you very much indeed, everyone. Cheers.