Start the presentation, good luck.

OK, thank you. Well, hello, everyone. I'm Matias Vara. I am here to talk about some work I did over the last couple of months to improve the boot time of the Toro kernel, which is a kernel dedicated to running in virtual machines. Can you hear me well? There are no speakers here. OK, cool.

So what exactly is Toro? It is an application-oriented kernel made of, more or less, five main units, such as process, memory, file system, and networking. These units provide a simple API to the user application. For example, if you look at the process unit, you will see the API to create threads and everything needed to manipulate threads. If you look at the file system unit, you will see an API to manipulate files, and so on.

In the case of a microservice, which is a user application, we propose to compile the microservice together with the kernel, and this results in a single binary. The microservice can decide which components are going to be included in this binary, and that is done directly from the syntax of the microservice. So for example, in this case, I say, OK, I will include memory, file system, the ext2 driver, and the Intel Gigabit driver. As a result, we generate a binary, which is an ELF. From this binary, we have a set of tools with which we can first generate an image, by relying on another tool called CloudIt, and then launch a virtual machine.

To demonstrate all this work, I will show a simple demonstration. Here, for example, we have Toro running a simple microservice, which is serving index.html. Here you see the result. What you see here is more or less the initialization of all the drivers, the network, and so on. And then from here, you see the user application start to execute. Let's come back to the slides.
So actually, we first spend some time compiling the kernel and the microservice. This takes more or less one second, because it is only compilation and then generating the image. We take another second and a half, more or less, to boot the virtual machine. This time only goes up to the point where the kernel starts to execute, not after that; it is just until the kernel main starts to execute. The image in its current state is around four megabytes. What I figured out is that this time is very limiting if you want to use the kernel for, say, continuous deployment of microservices, or if you want on-demand instantiation of microservices, because for that we need fast instantiation.

If we look in more detail at the boot process of Toro, you first have the initialization of the virtual machine monitor, which includes, for example, the initialization of the device model, the BIOS, and other stuff. I'm not completely sure about this part. Then you have the bootloader, which is already part of the kernel. It initializes most of the hardware, switches the processor to long mode, enables paging, reads the kernel from disk, and puts it in memory. In this talk, I'm only going to talk about the work I did, which focuses on these first two steps. In the next slides, I'm going to talk about how we tried to improve this time.

So yes, I will start by talking about how I improved the bootloader, then the initialization of the virtual machine monitor. I will present some tests that I did, and then we conclude.

The problem that I had at the beginning was that, as you can see, the generated image that includes the user application and the kernel was very big. It was like this because, in order to make the bootloader simpler, we decided that the image would be a copy of the kernel as it is in memory. So the whole thing is an image of the kernel in memory.
And we did it like this because it makes the bootloader a little bit simpler: you just need to read the image and put it in memory, and you don't need any decompression or anything like that. The main problems with this are, first, that you end up with an image that is huge, as you can see, about four megabytes. Second, the bootloader is still complex, because you still need to initialize some hardware, enable paging, and so on.

So the proposal, what I have been working on, was to use the -kernel option that QEMU provides to load the kernel. I will explain a bit what exactly this is. What you can do is pass QEMU a binary, which must be a multiboot kernel. And what exactly is it going to do? Well, this is not exactly how the binary looks, but I'm using these pictures just to illustrate a bit. The binary, which must be ELF32, has of course all its sections, and you need a sort of special header that QEMU is going to read. Based on this header, it is going to read all the sections of the binary and then put them in memory. I mean the memory of the virtual machine, of the guest. When QEMU starts to execute the multiboot main, which is a point in the .text section, the processor is already in protected mode, so most of the hardware is already initialized. All the code that you needed in the bootloader to initialize the hardware is no longer needed, because you already have it here. And then, at some point, you jump and start to execute the kernel.

The benefits of using this QEMU option were many. For example, I could reduce the size of the image from 4 megabytes to 130 kilobytes, because it only includes the binary, right? It also reduces the complexity of the bootloader a lot, because you don't need to jump to protected mode and that kind of stuff, since QEMU is already doing this for you.
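To make the "special header" a bit more concrete, here is a minimal sketch of how the first three fields of a Multiboot (version 1) header are laid out and how the checksum is derived. The magic value and the rule that magic, flags, and checksum must sum to zero modulo 2^32 come from the Multiboot specification; the function names and the choice of flags are illustrative assumptions, and this is not taken from Toro's actual bootloader.

```python
import struct

# Magic value defined by the Multiboot (version 1) specification.
MULTIBOOT_MAGIC = 0x1BADB002

# Two commonly used flag bits from the specification.
FLAG_PAGE_ALIGN = 1 << 0   # align loaded modules on page boundaries
FLAG_MEMORY_INFO = 1 << 1  # ask the loader to provide memory information


def multiboot_header(flags: int) -> bytes:
    """Pack magic, flags and checksum as three little-endian 32-bit words.

    The checksum is chosen so that magic + flags + checksum == 0 (mod 2**32).
    """
    checksum = (-(MULTIBOOT_MAGIC + flags)) & 0xFFFFFFFF
    return struct.pack("<III", MULTIBOOT_MAGIC, flags, checksum)


def is_valid_header(header: bytes) -> bool:
    """Check the magic value and the checksum rule on a packed header."""
    magic, flags, checksum = struct.unpack("<III", header[:12])
    return magic == MULTIBOOT_MAGIC and (magic + flags + checksum) & 0xFFFFFFFF == 0


# Hypothetical flag choice, for illustration only.
header = multiboot_header(FLAG_PAGE_ALIGN | FLAG_MEMORY_INFO)
print(header.hex())
print(is_valid_header(header))
```

In a real kernel these three words are emitted by the assembler or linker rather than by a script, and the specification requires them to sit 32-bit aligned within the first 8192 bytes of the image so that the loader can find them.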
And also, since you don't have to read the whole image from disk and put it in memory, because QEMU is doing this, you reduce the boot time a lot. I started with one and a half seconds, and I ended up with half a second to boot the kernel.

The main drawback of this is that, well, not all virtual machine monitors support this feature, so in the end you are porting your kernel to that virtual machine monitor. Second, QEMU only supports ELF32, so you need some sort of magic to put an ELF64 inside an ELF32, which is extra work that you have to do. And when the kernel starts to execute, it is still in protected mode. If you want to use 64 bits, you have to go to long mode, and you still have to do that initialization in the bootloader, so it is not that you can just start to execute the kernel.

Some people were working on exactly this in a project called qemu-lite. They tried to add this feature to QEMU so that you don't need to enable long mode, because when you jump to the kernel, it is already in long mode, which also reduces the complexity of the bootloader a lot. But they stopped working on this, I think, one year ago. If you want to see the port that I did for Toro, you can find it in that issue on GitHub. I stopped working on that one year ago, I think.

The next topic that I want to talk about is how I worked on the virtual machine monitor side to try to improve the initialization. What I did was look at three approaches that try to improve the virtual machine monitor. These approaches are based on QEMU and KVM, and they are QBoot, NEMU, and Firecracker; maybe you already know them. The big picture is that each of them tries to improve some aspect of the virtual machine monitor, for example by simplifying the loading of the kernel or by reducing the device model. So I think they tackle different aspects of the whole virtual machine monitor. To talk about these approaches, I wanted to first introduce some concepts here.
First you have the kernel, the Linux kernel, with the KVM driver, which provides all the virtualization features. Then you have the virtual machine monitor, which provides, for example, device emulation and the BIOS to the guest, and then you have, of course, the guest running in guest mode.

The first approach I want to talk about is QBoot, which is also based on QEMU. The idea is that it proposes a firmware which is much more optimized. There is not a lot of information about what exactly the optimization is, but one of the drawbacks they mention is that they limit the size of the kernel that you can load: it is 8 megabytes. You can find more information there. What is cool about this is that you can just compile it and then use it by passing the firmware with the -bios option.

The next approach that I tried was NEMU, which is based on QEMU, but they only focus on two platforms. The idea of this project, or at least the idea that I got from it, is that they try to reduce the whole device model in order to reduce the attack surface and also, for example, the memory footprint of the virtual machine monitor. They also propose a new machine type, which is named virt, and with this machine type you can only boot from UEFI. I did not work too much on this, but you can find more in the link that I used.

The last approach that I tried was Firecracker, which I think became public in December. The nice thing here is that they don't have any device emulation; all the hardware is based on virtio. What is also nice is that you don't have the multiboot feature that QEMU has, but you can, for example, give Firecracker an ELF64 binary and it is just going to start the kernel, with the processor already in 64-bit long mode and with paging enabled, which is quite nice. That reduced a lot the hardware initialization I had to do before I execute the kernel.
So how did I compare these approaches? Well, the test was quite simple. What I measured was the time from when I launch the VM until the VM is shut down. To shut down the VM, I just added some code in the kernel main. I did this for the three approaches and then compared the times. If you want to see the script that I used, you can see it here. Of course, you can give me some comments about the script, because it is just a sort of sketch.

So that is the test, and these are the results that I have. This is the hardware that I used. There are many interesting things that you can see here. For example, here you have QEMU, where I use the four-megabyte image, and then you can see the difference when you start to use just the binary. That is reasonable, because you reduce a lot the amount that has to be loaded into memory. Then you can see that you gain 300 milliseconds if you use QBoot. Then you have NEMU, where you see a difference of 152 milliseconds compared with QEMU; I don't know why, but there is a lot of difference. And what is really interesting is what you get by using Firecracker, because it only takes 27 milliseconds to start to execute the kernel main.

There is also a note that I left here just to keep the times in perspective. For example, if you follow this blog and check how much time is spent in Echo, it is more or less two and a half milliseconds. And here you have a whole virtual machine, a sort of micro virtual machine, as they call it.

So, yeah, I'm going to conclude. What I get is that when we use a solution that combines a multiboot kernel and QBoot, which is a firmware that is optimized, you see an improvement over my current bootloader of a factor of 11. I agree that my old bootloader was a bit crap, but still. Then we have a factor of 85% when we try with Firecracker.
My impression is that the choice of one solution over another is just a trade-off between the work that we have to do to adapt the kernel to that solution and the resulting boot time that we expect.

So, yeah, that's me. I don't know if someone has any questions. Maybe it was too fast. Yeah, that's it.

Yeah, sorry? No, I didn't try. Ah, sorry: the question was whether I had tried kvmtool. No, I haven't. They told me to do it a couple of days ago, but I had no time. I will do it next time.

Implementation of what? Sorry, yeah, the question was whether the implementation was in Assembler and Pascal. The implementation of what? Well, yeah, there is a part in Assembler and a part in Pascal. I mean, all the work that I presented here is mostly on the bootloader, so it is mostly Assembler.

So, if you have no more questions, that's all.
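The measurement described in the talk (wall-clock time from launching the VM until it shuts down, repeated for each approach) can be sketched as follows. This is not the script referenced in the talk: the helper names are hypothetical, and a trivial Python process stands in for the virtual machine monitor command so the sketch runs anywhere. The 1.5 s and 0.5 s figures used in the last lines are the ones quoted in the talk for the old bootloader and for the multiboot kernel.

```python
import subprocess
import sys
import time


def time_until_exit(command, runs=5):
    """Launch `command` several times and return the wall-clock seconds
    each run took from launch until the process exited."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return samples


def speedup(baseline_seconds, improved_seconds):
    """Return how many times faster the improved boot is than the baseline."""
    return baseline_seconds / improved_seconds


# Stand-in command (assumption): replace it with the actual virtual machine
# monitor invocation for a guest that shuts itself down from the kernel main.
samples = time_until_exit([sys.executable, "-c", "pass"], runs=3)
print(f"best of {len(samples)} runs: {min(samples) * 1000:.1f} ms")

# Boot times quoted in the talk: about 1.5 s before, about 0.5 s with multiboot.
print(f"speedup: {speedup(1.5, 0.5):.1f}x")
```

Measuring until shutdown includes process start-up and tear-down of the monitor itself, so taking the best of several runs gives a more stable figure than a single launch.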