So our next talk is actually from a colleague of mine, and he will talk about the work he has done on his virtualization interface for multiple VMMs. So here's Alexander Boettcher.

Thank you. Welcome to my presentation about microkernel virtualization under one roof. The outline of my talk looks like this: I will give a short introduction and the motivation. Then I will go through several kernel interfaces, looking at the common features and the differences. Based on that, I will try to show how the virtual machine interfaces of all the kernels can be harmonized in such a way that we can run virtual machine monitors on top which are portable across kernels.

If you look at off-the-shelf virtualization solutions today, you will see that they are ridden with complexity. For example, if you look at Linux KVM, you have a monolithic operating system running with virtualization extensions. There is no isolation between your drivers or the subsystems in the kernel. Or if you think of dom0 on Xen, there is a huge Linux kernel running there, and all your virtual machine monitors, your QEMU instances, have to trust that it works perfectly fine. That's millions of lines of code you have to rely on. But applications for virtualization actually call for trustworthy solutions, and this complexity somehow defeats trust.

There are of course alternative approaches around. One way to go, not the only one, is to use a microkernel with support for hardware-assisted virtualization and build a system that is trustworthy from the architectural point of view. How does such an architecture look in principle? I took NOVA here as an example; in theory you can replace it with other kernels. What you need first is, of course, a microkernel, which is quite tiny and which also has the features to run as a hypervisor. On top you need some user-level environment: your drivers run well protected from each other in their own protection domains, and of course you have native applications on top and some resource-management components. Relevant for this talk is that you have some virtual machine monitor; together with the hypervisor it can drive your unmodified guest operating system. The nice thing about such an architecture is that you can have multiple virtual machine monitors, and not all of the same implementation. They can be specific to your guest operating system, if you want to optimize for low code size or for security reasons. So you have some room to play.

Taking away from this picture, you see that we need a microkernel, we need a user-level environment, and we need virtual machine monitors. The Genode OS Framework is such a user-level environment. It consists of more than 100 ready-to-use components. You have native drivers, at the bottom left. You have low-complexity resource multiplexers for network, disk, and the graphical user interface; here I'm talking about something like a thousand lines of code, not millions of lines of code. Then we have support for several kinds of libraries; you can even use your GNU software. And the most interesting thing is that this user-level environment, which is specifically designed for microkernels, already supports eight kernels. In the middle you see some of them. The nice thing about it: I think about two years ago we even got to the point where we have a unified API for all the kernels.
For a specific hardware platform you are targeting, you develop your Genode component once, you compile it once and link it once, and then it will run on all the kernels immediately. You don't have any recompilation step or adjustments to make; it just works. Why is this interesting? If you want to build some general-purpose OS, then you have to package all the components, all the software, for each specific kernel again and again, which takes some effort. With this step, for example for x86, we just have to build once and package once, and it will run on all the kernels. Regarding virtualization, we already support several virtual machine monitors, VirtualBox and so on, but I will come to this point later.

When looking at the kernels, that's the current state: we support eight in principle. Actually, five of them are capable of running unmodified guests, so they have hardware-assisted virtualization features. On the one hand, there is the NOVA microhypervisor, which my machine here is currently running on. Then we have a kernel called base-hw, typically just called "hw", developed by Genode Labs. Then we have seL4, as you have heard beforehand, we have Fiasco.OC support, and we even support Linux to some degree.

When we look at the inventory of virtual machine monitors we have on offer on Genode currently, you will see that for our own kernel, hw, we have a custom implementation, mainly used on ARM machines. Thanks to the Muen separation kernel project, we even have x86 support for our own kernel; originally it was designed just for ARM. Together with that, we can run Windows 7 on Muen with our own kernel; here we use a port of VirtualBox 4. On the other hand, we have the NOVA microhypervisor, which runs on everything that is x86. The nice thing about it is that it already supports multiple vCPUs and 64-bit guests. The actual point here is that we already have support for three different virtual machine monitors. You can run them all in parallel, so you don't have to decide on one; depending on what you want to run, you can choose.

The question now was: the unified API I was talking about works for Genode components which are not virtual machine monitors. The virtual machine monitors are currently built directly against the kernel API, which means you cannot easily move them to seL4 or Fiasco.OC; it's not possible at the moment. The question was: can we do it? What does it take to get to this point? That's what I want to talk about now.

In the beginning, we said, okay, let's focus on the microkernels first, and on x86, because there the intersection is the biggest. Leave out Linux for the moment; Linux KVM would also be a nice thing to look into, but for the moment we want to have the microkernels running. All of these microkernels support virtualization on this specific platform. The approach is simply that we take the interface we have from our own kernel, where we have good experience on ARM and also, together with the Muen separation kernel, on x86, and extend it to the degree that we can run on seL4 and Fiasco.OC with this interface.

Now I would like to go a bit through the kernel interfaces to see the differences. Beforehand, I just want to briefly describe how such a virtualization event looks in principle. Once you have your guest operating system running, at some point you will get an exit, because the guest is doing something which is privileged, or it touches a virtual device.
So the hardware will cause a VM exit. The hardware will then transfer some of this data to a specific region, called the VMCS on Intel. The task of the kernel is to look up to whom this event should be delivered, who takes care of it; the microhypervisor will not handle it itself. With each guest operating system there is a virtual machine monitor associated, and when the kernel has found the specific thread to which the message will be delivered, it will also find a shared memory region between the handling thread and the kernel, called the user thread control block. Via this user thread control block, some of the state from the VMCS is transferred to the virtual machine monitor. The virtual machine monitor looks into it, decides what to do, emulates things, or whatever. So this is the principal flow in general.

Now I would like to go a bit through the differences. For the NOVA hypervisor, it looks like this: the state is transferred via the so-called UTCB, which is attached to each thread. The layout of the UTCB is agnostic regarding whether you are running on Intel or AMD, so you don't see the differences in the layout. The other thing is that if your virtual machine monitor is clever enough, it can even specify which state should be transferred to the virtual machine monitor, so you don't always have to get the whole state.

When looking at Fiasco.OC, your handling thread also has a UTCB, but there you set up another shared memory page between kernel and virtual machine monitor, where the state from the VMCS, or the VMCB if you are running on AMD, is copied. On Fiasco.OC, the full state is always copied, and the layout of the vCPU state is not agnostic regarding Intel and AMD. So this is the difference.

On seL4, they decided to go in a third direction. They also have some shared memory per thread; it's called the IPC buffer, not UTCB, but it's semantically the same. When you create a vCPU, you also set up some vCPU state in the kernel, but they decided that it is not shared with the virtual machine monitor directly. Instead, when you get a VM exit, you get a fixed set of registers transferred via the IPC buffer, 17 registers. If you want more, you have to make a synchronous system call per item in the VMCS to ask for the rest. And when you want to enter, want to resume, you transfer three registers, which is fixed, and if you want to transfer more, you have to make a synchronous system call beforehand and fill each item one by one. I have looked it up: something like 60 registers in the VMCS are relevant at most, so it means you can produce quite some overhead.

So now we have seen how the state is copied forth and back, and the next question is how the control flow works. When you get an exit on the NOVA hypervisor, the kernel sets up an IPC call for you, which means your handling thread gets a synchronous call, and as soon as it replies, the vCPU gets running again. On Fiasco.OC, it's the other way around. You have a thread in the virtual machine monitor, you assign some vCPU object to it, and once you have done this, you can make a blocking system call named vm_resume; then the thread in the virtual machine monitor blocks until the point when the vCPU exits again. And for seL4 it's the same as for Fiasco.OC: you have a thread, you attach some vCPU object to it, and afterwards you are allowed to make this blocking system call. So it's semantically the same as for Fiasco.OC.
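To make the difference in state transfer concrete, here is a small, purely illustrative C++ sketch contrasting the two models. All types and the syscall_vmcs_read function are invented stand-ins, not actual NOVA or seL4 bindings; only the VMCS field encoding for guest CR3 is real (per the Intel SDM).

```cpp
#include <cstdint>

/*
 * NOVA-style: the kernel copies a (selectable) subset of the vCPU state
 * into shared memory, so the VMM reads registers without kernel entries.
 * The layout below is invented; NOVA's real UTCB layout differs.
 */
struct Shared_vcpu_state          /* same layout on Intel and AMD */
{
    std::uint64_t ip, sp, flags;  /* ...up to ~60 relevant registers */
};

std::uint64_t read_ip(Shared_vcpu_state const &state)
{
    return state.ip;              /* plain memory read, no syscall */
}

/*
 * seL4-style: a fixed set of 17 registers arrives with the exit message;
 * every further VMCS field costs one synchronous system call. The
 * function below is only a stand-in for such a call.
 */
std::uint64_t syscall_vmcs_read(unsigned vcpu, std::uint32_t field)
{
    (void)vcpu; (void)field;      /* stand-in: one kernel round-trip */
    return 0;
}

enum : std::uint32_t { VMCS_GUEST_CR3 = 0x6802 };  /* real encoding */

std::uint64_t read_cr3(unsigned vcpu)
{
    /* with ~60 relevant fields, one round-trip per field adds up */
    return syscall_vmcs_read(vcpu, VMCS_GUEST_CR3);
}
```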
When we look at the kernel developed by Genode Labs, there it's done in a more non-blocking and asynchronous way. When your vCPU exits, the vCPU is stopped and the kernel sets up an asynchronous notification. The next time the virtual machine monitor thread is scheduled, the signal will be delivered, and when you want to resume the vCPU, you make a call to the kernel which is non-blocking. That means the virtual machine monitor thread is always ready to do something else in the meantime; it's not blocking at all.

So now we have seen the common features and the differences. Now the question is how a harmonized interface can look, how we can support it on Genode in general, or whether it is possible at all. The main idea here is that the virtual machine monitor should just be a normal Genode component. Then the question is: what is a normal Genode component? A Genode component is designed in a way that it is driven by events, in order to easily implement state machines. That means when you start a Genode component, you have a thread which is called the entry point, and the only job of this thread is to register for all kinds of events it is interested in. Afterwards you leave the scope, so you are not blocking or spinning or anything like this; you just leave the scope of your initialization, and then you wait for something to happen. When an event comes in, you become ready, and only during this event, this state transition, do you actually do something useful. As soon as you leave the handler, you are ready again to handle something else. Such events can either be synchronously incoming remote procedure calls or asynchronous notifications; from the handler's point of view, you don't see the difference.

For a virtual machine monitor now, such a VM exit, such a VM event, should just be another event source; we don't want to do anything special about it. Also, if some I/O is ready, if for example a network packet arrived for your guest, this I/O event should also just be another event source. We don't want to do anything special. Of course, we want to have a kernel-agnostic API, and the vCPU state in this interface should also be agnostic regarding whether you are running on Intel or AMD. It should not matter; we don't want to expose this to the normal developer of a Genode component, of the virtual machine monitor component.

What I have told you now in a picture, maybe: when you start a virtual machine monitor, you have your entry point, which registers because, for example, you want to have timeouts or you want to get network packets delivered to your VM, and when you create vCPUs, you just get a notification that something happened. This picture was mainly so that you can build single-threaded virtual machine monitors, which is nice in order not to have to care about locks and stuff, but of course you can also go multi-threaded. If you have multiple vCPUs and multiple physical CPUs, and you want to put the vCPUs on the physical CPUs, then you can of course spawn several entry points. That's also possible.

Now regarding the kernel-agnostic API. What we do on Genode is that the Genode base library is part of the dynamic linker. The API of this shared library is, as I said, kernel-agnostic, so it's fixed. The actual implementation of the library, of course, can be kernel-specific. That means, depending on which kernel you are running on, you get a different dynamic linker, but the interface stays the same.
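As an illustration, here is a minimal sketch of such an event-driven component, based on Genode's component API (Component::construct plus a Signal_handler); details may differ between Genode releases.

```cpp
#include <base/component.h>
#include <base/signal.h>
#include <base/log.h>

struct Vmm
{
    Genode::Env &_env;

    /* executed by the entry point whenever the event fires; whether the
     * source was synchronous or asynchronous is invisible here */
    void _handle_vm_event()
    {
        Genode::log("vCPU event - inspect state, emulate, resume");
    }

    /* registering the handler is all the constructor does; afterwards
     * the entry point just waits for the next event */
    Genode::Signal_handler<Vmm> _vm_event_handler {
        _env.ep(), *this, &Vmm::_handle_vm_event };

    Vmm(Genode::Env &env) : _env(env) { }
};

void Component::construct(Genode::Env &env) { static Vmm vmm(env); }
```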
So that means all our components are typically dynamically linked. And yes, we put this interface just into the base library, and then we have some freedom to implement it kernel-specifically.

How does this interface look from an abstract point of view? On Genode, when you want to have a service, you set up a connection. If you get this connection, you get back a capability; so you have a session. This implicitly sets up the address space for the guest, which is empty in the beginning. Then you can populate it by creating new vCPUs, getting access to the state of the guest, and attaching memory, so backing the virtual machine with some memory. And then you have a specific handler class where you actually register for this exit event. Then you can run or pause a vCPU, and these calls are actually non-blocking.

In a picture it looks like this: when you start up your kernel, the first task is core. Then you have some management component in it, which spawns a new instance, the virtual machine monitor in this case. When the entry point starts, it sets up a connection; it asks the parent: I want to have this service, please give it to me. There you have a policy attached, which can be enforced. Then you get this connection to the service, get back a capability, and then you can talk to the service directly. So on Genode, when you have a VM session, it means you have a connection to some service, and via this service you can create vCPUs and attach memory, the operations you have seen beforehand. And whenever an exit happens on one of the vCPUs, the exit will be delivered directly to the virtual machine monitor.

What you see from this picture is that we have some freedom to decide where to implement things, which depends heavily on the kernel, the kernel interface, and how nicely we can hide the kernel's special behavior. In this picture, this task is kernel-specific; it must be. And the implementation in the library here is kernel-specific as well. For NOVA and Fiasco.OC, the delivery of the exits, for example, goes directly to the virtual machine monitor. In theory, we could also divert the exit here and then get a notification over there.

How does it look in numbers? Just to give you an idea: on the service side, just to get your virtual machine running, we need something like 200 to 500 lines of code. On the client side, for NOVA and seL4, we need something like 500, which is mainly copying the state of the vCPU forth and back. For Fiasco.OC you see it's double that amount, mainly because you have to do it twice, for Intel and for AMD, and you cannot nicely abstract that away. For our own kernel, you see it's actually just a thin system-call wrapper, so there we don't implement anything special currently.

Now I would like to go a bit through the control-flow handling. These are the pictures you have seen beforehand, for our own kernel and NOVA; the direction of how things are called on an exit looks similar. In more detail: when you get an exit from your vCPU, the kernel gets running, and on our own kernel we set up a signal, an asynchronous notification, on NOVA an IPC call, and then the handler becomes running. The fact that it was a synchronous call or an asynchronous notification is not relevant for the handler; it just becomes running, it got its event. When you then decide to resume the execution, on our own kernel it's just an asynchronous notification, a system call to the kernel which is non-blocking.
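Expressed as code, the session interface described a few paragraphs above could look roughly like the following hypothetical C++ sketch. The names (Vm::Session, Exit_handler, create_vcpu, and so on) are invented to mirror the operations just mentioned; they are not taken from the actual Genode headers.

```cpp
namespace Vm {

    /* vCPU state as seen through the API: agnostic of Intel vs. AMD */
    struct Vcpu_state;

    /* implemented by the VMM to receive exit events */
    struct Exit_handler
    {
        virtual void handle_vm_exit() = 0;
        virtual ~Exit_handler() { }
    };

    /* one session per guest; opening it implicitly creates the
     * (initially empty) guest address space */
    struct Session
    {
        /* all calls are non-blocking */
        virtual unsigned    create_vcpu(Exit_handler &)           = 0;
        virtual Vcpu_state &state(unsigned vcpu)                  = 0;
        virtual void        attach(void *ram, unsigned long guest_phys,
                                   unsigned long size)            = 0;
        virtual void        run  (unsigned vcpu)                  = 0;
        virtual void        pause(unsigned vcpu)                  = 0;

        virtual ~Session() { }
    };
}
```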
On NOVA, when you actually say "I want to run", the implementation will just remember this fact, and as soon as you leave the handler, this handler class, the IPC reply will be done. If you said run, the kernel will take care of resuming the vCPU, and if you said don't run, we have a small kernel extension so that the vCPU will not be resumed.

The next topic: say, for example, you have a virtual device model which programmed some timeout, and the timeout fired. So you get an event, and your entry point, which is non-blocking, is running. The first thing it does is call pause to get the vCPU stopped. On NOVA this is the recall system call, and the nice thing is that on both kernels it is non-blocking. At some point the VM gets stopped, and as soon as this happens, you again get your IPC call on NOVA, or a signal on our own kernel. Then you look at the state, write the right registers because you want to inject an interrupt so that the timeout happens, and then you can just resume it. So both kernels are quite similar, and the NOVA-specific parts we could hide nicely.

Looking at seL4 and Fiasco.OC, they have the same control-flow direction: you have a blocking thread, a thread which makes a blocking system call to resume the vCPU. This makes life complicated for us, because on Genode we don't want to have blocking system calls. The kernels actually provide mechanisms to cancel this ongoing blocking system call, but we decided in the first take not to look into this in detail, because it caused some headaches how to hide it nicely at the API level. So the current workaround is that per vCPU you set up in the kernel, we also spawn an extra handler thread in the virtual machine monitor. What does that mean? Whenever you create a vCPU, you also create a handler thread. Why are we doing this? So that the entry point can just issue the non-blocking run call while the handler thread does the actual blocking call. That means immediately after this run, the entry point is still ready to do something else; it's non-blocking. For example, it can just run another vCPU from the same thread.

Regarding an I/O event you want to inject, for example you again have a timeout: instead, the entry point will, on the API level, call pause. What actually happens now is that we go to the kernel, where we have set up this asynchronous notification object on seL4, on Fiasco.OC it's an interrupt object, and the ongoing blocking call will be canceled. That means here we pause and return immediately; the entry point is ready to do something else, it's non-blocking, which is good. Afterwards, the ongoing system call from the handler thread gets canceled: it returns, it gets the information delivered that it was canceled because of something, and then we set up, just with Genode primitives, a normal asynchronous signal. You look at the state, inject the interrupt, and then you go back to the handler thread, and the handler does this blocking call again. So with this workaround we can nicely abstract the blocking call away.

So now we have seen how we get this harmonized interface in principle. Now the question is whether we can actually run virtual machine monitors, with a virtual machine on top, on it.
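The thread structure of this workaround can be sketched as follows. The example is purely illustrative: the three functions at the top are invented stand-ins for the seL4/Fiasco.OC system calls and the notification/IRQ object, not real kernel bindings.

```cpp
#include <cstdio>
#include <thread>

/* stand-ins for the kernel interface described above */
void blocking_vm_resume(unsigned vcpu)   /* blocks until exit or cancel */
{
    std::printf("vCPU %u: resumed, blocking in the kernel\n", vcpu);
}
void cancel_blocking_call(unsigned vcpu) /* pokes notification/IRQ obj. */
{
    std::printf("vCPU %u: blocking resume canceled\n", vcpu);
}
void signal_entrypoint(unsigned vcpu)    /* ordinary Genode-style signal */
{
    std::printf("vCPU %u: event delivered to entry point\n", vcpu);
}

/* one handler thread per vCPU: it alone issues the blocking call, so
 * the component's single-threaded entry point never blocks */
void vcpu_handler_thread(unsigned vcpu)
{
    for (int i = 0; i < 3; i++) {   /* a few iterations for the demo */
        blocking_vm_resume(vcpu);   /* returns on VM exit or cancel  */
        signal_entrypoint(vcpu);    /* hand state to the entry point */
    }
}

/* the API-level pause(), as called from the entry point: it only pokes
 * the notification/IRQ object and returns immediately */
void pause(unsigned vcpu) { cancel_blocking_call(vcpu); }

int main()
{
    std::thread handler(vcpu_handler_thread, 0u);
    pause(0);                       /* non-blocking for the caller */
    handler.join();
}
```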
Of course you start first with simple unit tests: you start with some simple instructions where you know they will exit when executed, and then you check the control flow. When this was working, you add several vCPUs, then you put several entry points on several CPUs, have multiple vCPUs running, and get this unit test running. After a while it was running quite nicely on NOVA and Fiasco.OC, it was no big deal, but on seL4 we hit some issues.

The first thing was, more or less by accident, we called a VM system call on a thread which had no vCPU attached, and what happened? The kernel just fell over; there was a null-pointer dereference in the kernel, and the kernel died. So we patched this away and notified the seL4 developers, and they acknowledged it's a bug. So it's not that we did something wrong; it's just a bug.

The next issue: the test began running, and we got invalid-guest-state errors on seL4. On Fiasco.OC and NOVA it was running fine, but on seL4 it was not. After some time we recognized that we were running in 16-bit mode and had no unrestricted-guest support: the hardware has the support, but the kernel does not enable it. So we looked through the Intel specification and enabled it in the seL4 kernel, because we don't want to change the test just because seL4 can't do it.

The next thing was a specific case where a vCPU was just spinning and not exiting, and as soon as this thread became running, the whole system starved; everything blocked, nothing happened anymore. We have some top utility where you can see the utilization, and after we put the top utility on a high priority level and the vCPU below it, we saw that the vCPU was running at 100% and all the threads next to it on the same priority level were not scheduled anymore. So we looked through the kernel, it must be a bug, we found the place, we reported it to the seL4 developers, and they acknowledged it: yes, it's a bug. They missed a state check in the scheduler.

Finally, we were done. The tests were running, we were happy. Then we just wanted to wrap up and run our nightly tests on seL4, and the kernel would just not boot on some machines. We noticed that as soon as you enable this feature, the kernel cannot be used on CPUs which do not have VT-x available. If you think about a general-purpose OS where you want to use this kernel, this is not going to work, because you don't actually know the target machine. So we patched this out, because we don't want to provide several kernel builds. At the end of the day, the toy virtual machine monitor was running on all three kernels. It also works on AMD, on my private home machine, but seL4 has no support for this.

The next step was that we would like to run real virtual machine monitors, not the toy stuff. We have Seoul and VirtualBox 5 as ports. I decided for Seoul first, because it's much, much smaller than VirtualBox; if you run into trouble, you can find the bug more easily, of course. Mainly, I ripped out all the NOVA-specific parts and replaced them with the new interface, got it running after some days on NOVA, and then the idea was: now it must just run on seL4 and Fiasco.OC. It turned out: no. It took me several days, summing up to weeks, to get it running, on the one hand because I did things wrong, on the other hand because there were subtle differences. I have some backup slides about this, but I cannot go into detail now.
The final count was that we needed a further patch for seL4 and one tiny patch for Fiasco.OC to get it running. At the end of the day, all three kernels were running the Seoul virtual machine monitor. That means we have a Linux VM running, with network, even with SMP, but the SMP part not on seL4: as soon as we enabled SMP for the Linux guest, the kernel just died again with some fault in the kernel. This time I did not investigate it.

The next step was that we want to have the VirtualBox 5 virtual machine monitor running, because with that you get support for Windows 7, Windows 10, Ubuntu, whatever you like, and this is currently still work in progress. The VirtualBox binary is already kernel-agnostic, so it is the same compilation for all kernels. It runs quite fine on NOVA for some simple Genode VMs; I used Genode as the guest VM because it is easier for me to debug if something goes wrong. On seL4 it comes up to some point, but something is still not correct; we will find out.

Besides unknown remaining challenges, the VirtualBox 5 port requires access to the guest's FPU state. NOVA provides this access in principle, but it is currently missing in the virtual machine interface, so we still have to add it. For seL4 and Fiasco.OC, I am not quite sure whether it is supported; from a quick look it does not look like it, but maybe I am wrong. And on seL4 there is actually no 64-bit guest support, which means you will not get your Windows 10 running at all.

So, the final conclusion. At the beginning, when I prepared the start of this work, I was quite skeptical; that's why my subtitle now reads more like "it could be possible". Of course there are some restrictions depending on the kernel you are using, and you should be aware of them. The roadmap now is that we will finish the VirtualBox 5 adaptation, getting all the VMs we are currently running on NOVA back again with the same functionality, and when we are done, then, depending on the time we want to invest in seL4 and Fiasco.OC, we will also do this there. When this gets upstream, the next idea is that it is then time to also add virtualization extensions for x86 to our own kernel. And the same approach you can in principle apply to ARM or RISC-V, but this is optional.

The benefits of this work are that you get portable virtual machine monitors across the kernels, and at the end of the day the Genode user has the ultimate choice regarding the kernel. Depending on the stability and the features of the kernels, you can, within some minutes, choose another kernel and run the same compilation, the same virtual machine monitor, and the same VM on Genode. That's all from my side. Thank you. Any questions?

Yeah. So your question is whether this is possible in principle. I don't know the overhead, but for the NOVA case it's not possible: you cannot do this cross-core, which means that if you have a vCPU running on some physical core, you have to have the entry point on the same physical core.

Any more questions? So the question was how we found the bugs in seL4, what the process was, and why they were not caught by seL4's formal verification. The actual point is that typically they don't verify everything, or every configuration; especially this virtualization extension for x86 they did not verify, but probably they will, and should, do so. Okay. Sounds good. Nice to hear.
The question was regarding the address spaces of the various components, in particular in light of things like Secure Encrypted Virtualization from AMD and things that are coming like this, where one is