So, hello everyone. My name is Sergio López. I'm a software engineer working in the virtualization team at Red Hat, and in this presentation I'm going to introduce you to libkrun, a library that might help you use virtualization-based isolation for your workloads.

So, let's get to the big question, which is: what is libkrun? If I had to define libkrun in a single quote, it would be that libkrun is a dynamic library that enables other programs to easily gain virtualization-based isolation capabilities, with the minimum possible footprint. And a fun fact: originally this quote said "KVM-based" instead of "virtualization-based", hence the title of this talk, but over the last couple of months libkrun has also acquired the ability to run VMs using the Hypervisor.framework on macOS, so it's no longer strictly tied to KVM.

So, let's talk a bit about the goals and non-goals of libkrun, to set the expectations properly. libkrun intends to be easy to use, to integrate all the features needed for its purpose with minimal external dependencies, to be as small as possible in code size, to have the minimum possible footprint, and to provide a friendly environment for microservice and container workloads. This last point doesn't imply that libkrun cannot be used for other kinds of workloads, but we are currently focusing on those two. Also, and this is a very important thing that should set the expectations properly, libkrun does not intend to support conventional virtualization workloads. In other words, libkrun is not a replacement for QEMU and is not a replacement for VirtualBox. It cannot, in fact, run a full guest operating system, and there is no intention whatsoever to implement such a feature in the future.

Okay, so what does libkrun actually provide? We have two libraries: one is libkrun itself and the other one is libkrunfw, the firmware library. The first one provides the C bindings to interact with the library, a virtual machine monitor based on rust-vmm crates and Firecracker, the arch-dependent devices, an integrated virtio-fs server, and a minimal set of virtio devices: virtio-console, virtio-fs, virtio-balloon (with only the free page reporting feature) and virtio-vsock. libkrunfw provides an interface to access the guest payload, which is the data that is going to be injected inside the guest memory, and a bundled minimalist Linux kernel as that payload. It's very likely that in the near future libkrunfw will also include some kind of minimal firmware, possibly written in Rust, which is needed to be able to run memory-encrypted VMs.

Okay, so some of you may be wondering now: why ship all these features as a dynamic library? To explain this, let me first walk you through what would be needed if you had a runtime and you wanted to create a VM using an external virtual machine monitor. First, your runtime would need to locate the executable of that virtual machine monitor through the file system. And the virtual machine monitor itself would need to locate its own dependencies, possibly other libraries, a kernel image, possibly some firmware, everything through the file system. This is not a problem unless the runtime intends to switch between contexts, which is something that OCI runtimes tend to do.
If that's the case, and the runtime switches to a different context with a different mount namespace, it's very likely that the runtime won't be able to find the VMM (the virtual machine monitor) executable through the file system anymore, nor any of the dependencies of the latter. This means the runtime will need to either avoid switching contexts, which is bad for security, or somehow carry the payload between the different contexts, which can be complicated, because the runtime may not know all the dependencies of the virtual machine monitor, and in any case it won't be really efficient anyway.

So what happens if the runtime is using libkrun? Well, if the runtime is using libkrun, it dynamically links against libkrun and libkrunfw, and the moment the runtime is executed, the dynamic loader maps all the components of libkrun and libkrunfw into the memory map of the runtime itself. This way the runtime can safely switch between contexts, can change to a different mount namespace, and all the data and code that is needed to run the VM will be carried along with it.

Okay, switching topics, let's talk about doing storage without block devices. Some of you have probably noticed that when I enumerated the virtio devices that libkrun supports, virtio-blk wasn't present there, and virtio-scsi wasn't on the list either; in fact, there is no device that would allow the guest to access any kind of block device. Instead, libkrun uses virtio-fs, which is able to use any directory on the host as the guest root file system. So basically what's happening here is that when the guest operating system needs to access some file on its file system, it will use FUSE over the virtio-mmio transport to communicate with the virtio-fs server integrated in libkrun, and this virtio-fs server will act on behalf of the guest operating system to access the files, which are located on the host file system, starting at some directory on the host that is acting as the root file system for the guest.

So let's take a look at how this works in practice. What we have here is chroot_vm, which is an example that comes with libkrun itself. It is linked against libkrun and libkrunfw, and it expects to receive a directory to be used as the root file system as its first argument, and a command to be executed inside the VM as its second argument. So what we are going to do here is create a directory that is suitable to be used as the root file system of the VM, in a completely manual way. I'm going to create first a directory, rootfs, then I'm going to create some support directories inside it (tmp, bin), and now I'm going to copy some binary in, so we have something to actually execute inside the VM, to have an entry point at least. What I'm going to use is the busybox binary that is shipped with the Fedora package, which is statically linked, which is very convenient for this kind of use case, and I'm going to copy it into rootfs/bin. I'm also going to copy it again as sh, so I have a shell that I can use as an entry point inside the VM. Now I'm going to run chroot_vm with rootfs as the first argument, so rootfs is going to be used as the root file system for the guest, for the virtual machine, and /bin/sh as the command to execute inside the VM. And here we have a shell that is running inside a completely freshly created VM.
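Before we continue inside the VM, the steps I just went through look roughly like this (a sketch of the demo, not the literal commands from the recording; the busybox path on your system and the location of the chroot_vm binary may differ):

```
# Create a minimal root file system by hand (paths are illustrative).
mkdir rootfs
mkdir rootfs/bin rootfs/tmp

# Copy a statically linked busybox as the entry point, and again as "sh",
# so we have a shell inside the VM. On Fedora, busybox lives under /usr/sbin.
cp /usr/sbin/busybox rootfs/bin/
cp /usr/sbin/busybox rootfs/bin/sh

# Launch a VM using rootfs as the guest root file system and /bin/sh as the
# command to run inside it (chroot_vm is the example bundled with libkrun).
./chroot_vm rootfs /bin/sh
```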
We don't have any commands here, but we can use busybox's integrated functionality to install the commands for us. Now we can in fact confirm that we are running in a fresh VM that is running a different kernel than the host system: my host is using the kernel shipped with Fedora, and the guest is using a different one, the kernel bundled with libkrunfw. We can also run any arbitrary executable within the root file system, and even pass arguments to it.

Of course, creating the root file system by hand is not really convenient for daily use, but I wanted to do it this way to illustrate how simple it actually is to create such a structure. I can easily imagine some kind of microservice orchestrator, or some system to deploy microservices in independent VMs, working simply by creating this small and simple root file system hierarchy and dropping a static binary into it.

Of course, there are better ways to create root file systems. For example, you can use OCI images, and for creating and managing images you can use any container tool. I'm going to use Podman here. What I'm going to do is create a new container based on the Debian image. Now I'm going to create a directory to hold the contents of the container, which I'm going to call debian_rootfs. I'm going to export the container contents and extract them into this directory. Now my contents are there, and I can remove the newly created container. And I should be able to execute chroot_vm using debian_rootfs as my root file system, and this time I'm going to use /bin/bash. And here we are: a freshly started VM, but this time using a Debian root file system.
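Again, roughly what that demo amounts to (a sketch; the container name is just a placeholder, and Podman could be any other container tool):

```
# Create a container from the Debian image (the name is an arbitrary placeholder).
podman create --name debian-tmp debian

# Export its contents into a directory that will serve as the guest root file system.
mkdir debian_rootfs
podman export debian-tmp | tar -xf - -C debian_rootfs

# The container itself is no longer needed.
podman rm debian-tmp

# Boot a VM with the Debian root file system and bash as the entry point.
./chroot_vm debian_rootfs /bin/bash
```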
Okay, so going back to the presentation, let's talk about the advantages and disadvantages of this mechanism. For instance, using virtio-fs implies zero storage management requirements: we don't need to deal with disk images or image formats, we don't need to partition anything or lay a file system on it, and we don't need to shrink or grow the image or worry about running out of space. It also means that it's very easy to share files between the host and the guest; we don't need to configure some kind of shared folder or anything, because that's how it works by default. And it's very friendly to microservice and container workloads, which, as we've seen before, is one of the main goals of libkrun at this point.

As for disadvantages, we can say that the performance is not as good as when using block-based devices, mainly because the guest is not able to rely as much on its own file system cache as it would be able to do with a block-based device, and it needs to communicate with the host more frequently for data. But on the other hand, this is actually good for keeping our memory footprint low, because if the guest is using a significant amount of memory as file system cache, that memory cannot be returned to the host to be freed. Another disadvantage is that the attack surface is larger than when using virtio-blk, because we have more code and we need more syscalls.

That said, it's very likely that libkrun will gain integrated support for virtio-blk soon-ish. The main reason is that it's better suited for running encrypted workloads: as we've just seen, the attack surface is lower, and we don't need that many syscalls, so we are also able to build a stricter seccomp policy. And the way things would work in that case is that, instead of having some directory on the host, there would probably be some kind of trusted component on the platform that takes the guest payload, whether it's an OCI image or something else, creates a disk image, lays some kind of encryption on that disk image, and stores the guest payload inside it. All that said, if virtio-blk finally gets integrated into libkrun, it will probably be as an independent flavor of the library instead of a configuration option. The main reason is that I would like to keep libkrun as opinionated as possible and not ship features that are not going to be used by the runtime.

So, switching to another topic: similar to what will happen with virtio-blk, among the list of virtio devices that libkrun supports there is no virtio-net, nor any other kind of device that would allow the guest to use a virtual network interface. So how are we doing networking without a network interface in the guest? Well, basically we are using a novel approach called Transparent Socket Impersonation, or TSI, which is implemented inside the custom guest kernel provided by libkrunfw and doesn't require any kind of changes in the user space applications, so you can use the binaries shipped in an OCI image or whatever other binaries. The trick here is that when the user space application requests an AF_INET socket from the kernel, what it's actually getting is an AF_TSI socket with compatible semantics. This AF_TSI socket wraps an AF_VSOCK socket and an AF_INET socket inside it.

So what happens when the user space client attempts to establish a connection to a local endpoint, to some service that is running within the context of the guest operating system itself? Well, in that case, the user space client will request the connection through the TSI socket (not being aware that this is a TSI socket; the user space client thinks it's an AF_INET socket), and the TSI socket will attempt to fulfill that request using the inet personality in the first place. As there is a user space server in the local context, the request is fulfilled, the connection is established, and the user space client and the user space server can start communicating between them in the usual way. This is very similar to what would happen if the user space client were using a conventional AF_INET socket.

But let's see what happens if the user space client attempts to connect to an endpoint that is not local to the guest, that is outside the guest, on the host or on some other network device outside the reach of the guest operating system. At first, what happens is the same: the user space client requests the connection through the TSI socket, and the TSI socket attempts to fulfill the request using the inet personality. But this time there is nothing in the guest that can fulfill that request, so TSI will attempt to fulfill it using its vsock personality. The vsock socket will communicate with the virtio-vsock device, which is integrated in libkrun and running in the context of the runtime, outside the context of the guest, and the virtio-vsock device will try to connect to the external endpoint, which, again, may be located on the same host or on whatever other external network resource.
If the virtio-vsock device is able to connect to it, the connection is established, and the user space client will start communicating with the user space server across the VM boundary, without either of them being actually aware of the situation. This is completely transparent for both of them, and there is no need for any kind of explicit support.

And what happens if, instead of having a user space client using the TSI socket, we have a user space server? What happens is that once the user space server starts listening on the TSI socket (which, again, it doesn't know is a TSI socket), the TSI socket will implicitly start listening on both the inet and the vsock personalities at the same time. If a connection is received through the inet personality, the new socket that is created to serve that connection will be a new TSI socket where the inet personality is the primary one. And if a connection is received through the vsock personality, the new TSI socket that is created will be one with the vsock personality as the primary one.

Okay, so let's see this in action. What I'm going to do is use again chroot_vm with the same root file system we created before, which contains busybox. And busybox has an implementation of wget we can use for this purpose. So what I'm going to do is start an HTTP server on the host, outside the VM, listening on port 8000, and I'm going to connect to it from inside the VM, using the IP address of my host. And the connection completes successfully. Again, the external endpoint doesn't need to be on the host, it can be anywhere else, and if I try to connect to an external server, the connection also works. Of course, I can do it the other way around and open a port inside the VM. Then I can connect to it from outside the VM, from the same host, on port 1234. Hello from the host. Hello from the VM.

And what's actually happening here, if we pay more attention to the sockets, is that what, from the host's perspective, is listening on port 1234 is something called chroot_vm, which has the process ID 2328. And if we search for this process ID, we'll see that it is actually the chroot_vm binary we are executing, which is linked against libkrun and is acting as the VMM for this guest instance.
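Pieced together, that demo looks more or less like this (a sketch, not the literal commands from the recording; the HTTP server, the host IP address and the ports are illustrative):

```
# On the host: serve something over HTTP on port 8000 (any server will do).
python3 -m http.server 8000

# Inside the VM (busybox wget): fetch it through TSI using the host's IP address.
wget http://192.168.1.10:8000/

# The other way around: listen inside the VM on port 1234...
nc -l -p 1234

# ...and connect to it from the host.
nc 127.0.0.1 1234

# From the host's perspective, the listener on port 1234 is the chroot_vm
# process itself, acting as the VMM for the guest.
ss -tlnp | grep 1234
```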
Okay, so going back to the presentation: the advantages and disadvantages of this strategy. One of the advantages is that the network configuration you need to do in the guest is very minimal: you just need to configure the DNS, but you don't need to configure an IP address, and you don't need to configure a gateway or any kind of routes. This also allows libkrun to act on behalf of the user space applications running in the guest without the need to implement a TCP/IP network stack in the library; everything that happens between the user space client, the kernel, libkrun and the user space server is happening at the socket level. And from the host's perspective, all connections appear to come from and go to the libkrun-enabled runtime, and are visible in the network namespace of the runtime's context, so we can still manage them with normal iptables rules. And as a result of all of the above, the environment is really friendly to container workloads, to the point that things such as sidecars work out of the box, without any kind of specific support for libkrun.

As for disadvantages, we can say that it requires explicit support for each address family. Currently, only AF_INET streams are supported, though it's very likely that support for AF_INET datagrams is going to arrive very soon, and there is no support for raw sockets.

So, before jumping to the next section, I would like to talk a bit about memory footprint. I actually would like to talk more about this, because it's a very important topic for libkrun, but the problem is that it's a very complex topic: it requires a lot of time to establish the proper context, to be able to explain what we are actually measuring and to actually show what's happening, and I couldn't find a way to fit this into this presentation. I will probably write a blog post or add some documentation to the libkrun repository. But anyway, I would like to highlight some of the strategies implemented in libkrun.

One of them is that the kernel payload is directly mapped from the library into the VM's memory space, allowing the read-only sections to be shared between multiple instances. This is true only for the KVM implementation, as the Hypervisor.framework on macOS for some reason does not allow building the VM's memory space from more than one region. libkrun is also using a minimalist kernel, both in features and in capabilities: this kernel has a limited number of CPUs it can use and a limited amount of RAM it can address, and all of that contributes to lowering the footprint of the library. There is also minimalism in the VMM: we already saw that we are only shipping the devices we need, and even in those cases we are only implementing the features of the devices that we are actually going to use. And we are also using virtio-balloon with free page reporting, which is a nice feature that allows the guest to periodically report to the host which pages are no longer in use, so the host can return them to the free pool.

And something that is still work in progress, but which I hope will arrive soon, is the ability to use virtio-fs with DAX. This allows us to bypass the guest file system cache, which is something we talked a bit about before, and this is actually good because it ensures the guest keeps using as little memory as possible, so we can keep the memory footprint as small as possible. If the guest were using a significant amount of memory as file system cache, that memory couldn't be returned to the host.

So let's stop talking about libkrun itself and let's talk about how you can use it. First of all, how can you obtain libkrun? Being a relatively new project, libkrun is not yet shipped officially by any distribution, but there are unofficial repositories for Fedora, for openSUSE and for macOS. You can also, of course, build it from sources, which I think is relatively simple to do; you just need to do it in this order: first build libkrunfw and then libkrun itself. And once you have all the software in place, you only have to worry about one single header, which includes the documentation for each function, and about linking against the libkrun library, which will bring in libkrunfw by itself, because it's dynamically linked against it. This is a minimal example of using libkrun in under 10 lines of code.
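Reconstructed from the description that follows, the code on that slide looks roughly like this. It's a sketch in the spirit of the chroot_vm example: the krun_* names come from libkrun's public header, libkrun.h, but check the header for the exact signatures, and a real program should of course check every return code.

```c
#include <stddef.h>
#include <libkrun.h>

int main()
{
    /* Error codes are deliberately ignored to keep the example minimal. */
    int ctx = krun_create_ctx();

    /* 1 vCPU, 512 MiB of RAM. */
    krun_set_vm_config(ctx, 1, 512);

    /* Use the "rootfs" directory as the guest root file system (virtio-fs). */
    krun_set_root(ctx, "rootfs");

    /* Run /bin/sh as the entry point, with no arguments and an empty environment. */
    char *const no_args[] = { NULL };
    char *const no_env[] = { NULL };
    krun_set_exec(ctx, "/bin/sh", no_args, no_env);

    /* Start the VM; this call does not return while the guest is running. */
    return krun_start_enter(ctx);
}
```

Assuming both libraries are installed, compiling it should be as simple as something like `gcc -o minimal minimal.c -lkrun`.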
This will create and start a VM with just one vCPU and 512 megabytes of RAM, using rootfs as the root file system and /bin/sh as the command to execute inside the VM as the entry point, with no arguments and an empty environment. Of course, I don't expect anyone to be using libkrun this way, because we are ignoring all the error codes, but I still think it's a nice way to illustrate how easy it can be to create a lightweight VM using libkrun. So, in fact, let's pick up this code, paste it into a file, and compile it. And here we have, again, our fresh VM: the same thing we had with chroot_vm, but with our minimal example.

Okay, back to the presentation. And lastly, I would like to give some examples and use cases, some projects that may give you an idea of what you can do with libkrun. For instance, there are already some projects that are using libkrun, such as krunvm, which is a tool for creating lightweight VMs from OCI images using libkrun and buildah. In fact, I think we are good on time, so let's take a minute and play with krunvm a little. And I'm going to take this opportunity to also illustrate how libkrun works on non-Linux platforms: I'm connecting to the M1 Mac device that I have right here. With krunvm list, we can see the configuration of the virtual machines that we have already defined, with a Fedora test image. I'm going to create a new one based on Ubuntu. What I'm specifying here is the image name, in the same way you would do with any container tool, so instead of Ubuntu this could be nginx or whatever other image that is publicly available; in fact, you can also specify the tag you want to use. And I'm going to give it the name ubuntu-devconf. This will take a few seconds. And now I have my newly created VM. I can also, right away, increase its RAM, and then I can start the VM.

Here we have it; this is /bin/sh, so let's switch to bash instead. Here we have our freshly started lightweight VM on macOS, on a foreign operating system. And thanks to TSI, despite the fact that, well, I don't have the tools here to show it, so trust me, there is no other interface than the loopback interface, despite that, I can connect: I can start right away updating the package repositories and installing whatever I want, to turn this into a build machine. In fact, if I wanted to, I could also change this VM to share an additional directory, for example my home directory, inside the guest. Now I have access to my own home file system from within the VM.

Okay, back to the presentation. Another cool project that already supports this is crun, the OCI runtime. This is cool because, if you enable virtualization-based isolation, it will do that inside the container context, so you get container isolation and virtualization-based isolation at the same time. Other ideas we are exploring with libkrun are the ability to run fully encrypted workloads using hardware support such as AMD SEV-SNP and Intel TDX, and the ability for conventional services to self-isolate. There are already some services, such as HTTP servers, that are able to use chroot to self-isolate, and it would be cool if, in addition to using chroot, they were able to self-isolate in their own VM, without any kind of maintenance cost for the system administrator, simply by enabling some option in the configuration of the service itself.
And lastly, the other idea would be to enable microservice platforms to deploy functions in virtualization-isolated environments, which is something we already hinted at before, when we were talking about how easy it is to create a root file system and drop some kind of static binary into it. And that's all I have. Thank you for listening, and if you have any questions, I will be more than happy to try to answer them.

Okay, thank you Sergio. We do have some questions from the attendees here in our Q&A. The first question is from Jen Yan: would KSM memory deduplication work with these VMs in the end? It would probably work, yeah. I haven't tested it, but I'm not sure if the cost of running KSM would compensate the benefit, because in the end most of your footprint will be a unique footprint; better said, the unique footprint of each VM will be something like 100 megabytes for each instance or something like that. So I don't know if it will be worth it, but it's something that is worth exploring.

Okay, our next question is from Stefan: how is /proc populated inside the VM? Is there some kind of process inside the VM that also sets up the standard file system mounts? Sure. Inside libkrun, bundled within the virtio-fs server, there is an integrated init binary, which has a special inode number, and it is the one in charge of mounting /proc and some other directories, setting up the environment, and doing some other minimal initialization stuff.

Okay, the next question is from Andrea: does this project match Kata Containers' need to spawn a very light VM to run a container in it? Not exactly. Kata Containers works by running multiple containers in a single VM when you are using a pod, while with libkrun each container gets its own VM. So if you happen to have a pod, what will happen is that those containers share the same mount namespace, and probably the same network namespace, but each of those containers will be running in a different VM, and they will communicate among themselves using virtio-fs through the mount namespace and TSI through the network namespace.

Our last question is from Alexander: is anything similar to QEMU's snapshots or CRIU-powered checkpoint/restore in scope for libkrun? Not so far. The idea is that libkrun should be as simple as possible and the VM life cycle should be relatively simple, so the VM will start and will die, and there will be no support for live migration, or even for just saving the state. But again, this is something that may be worth exploring at some point.