Good morning and welcome to the KVM status report at KVM Forum 2021. My name is Paolo Bonzini, I'm a distinguished engineer at Red Hat, and I'm going to present what happened in KVM from October 2020 to September 2021. October 2020 corresponds to both last year's KVM Forum and Linux 5.9, while September 2021 corresponds roughly to Linux 5.14. So we had five releases during this timeframe, and one more is underway, with the development period already over.

During this last year we had over 1400 commits, with about 10% of those destined for stable releases as well, so around 10% of the commits were fixing a bug. Some of them were old bugs, some of them were things that simply weren't caught during the stabilization period of each Linux release. 1400 is actually a lot of commits for KVM, and you can see that we had about 400 commits in three out of five releases; this has happened pretty rarely in KVM in the past. This other graph plots the commits in each group of five releases, and you can see a substantial uptick in the number of commits starting around 5.4, which bodes well for the health of the KVM hypervisor. About half of the commits were contributed by Google, mostly due to developers moving to Google from other employers. Still, there is a very long tail of employers contributing to KVM, including multiple cloud providers, hardware vendors, and operating system vendors. About two-thirds of the commits were for the x86 architecture, followed by Arm, which had some interesting projects that I will talk about later; and then there's also Power, s390, and MIPS.

Regarding the generic commits, the most interesting one is probably the optimization to the MMU notifiers, which avoids taking spinlocks in the notifiers unnecessarily. There's also a binary statistics API. The topic of statistics and binary statistics has been raised several times over the years, and there were also talks about it at KVM Forum, for example in 2019. So finally we have an efficient binary statistics API exposed directly through KVM, and not just the debugfs statistics as in the past. It's also very interesting that over 10,000 lines of selftests were added, which means that about half of the selftests were added over the last year. This is important in order to avoid regressions and to simplify testing of new patches that are sent to the hypervisor.

On x86, we have to report the completion of two projects that were presented at the 2019 and 2020 KVM Forums. One is the new page table management code, also known as the TDP MMU, in KVM. The new code enables multiple optimizations, such as parallel handling of page faults and lazy allocation of the reverse maps from guest physical addresses to host physical addresses. The new MMU will be enabled by default in the next release, 5.15. Another important new feature is the exposure of dirty page information through a ring buffer; this has interesting performance characteristics that were presented by Peter Xu at KVM Forum 2020.

On the topic of confidential VMs, several new features were merged over the past year. SEV-ES received guest support in 5.10 and then host support in 5.11. Also for SEV, we have support for sharing the encryption context across multiple VMs. This is useful for live migration, and a hypervisor API for live migration itself was merged in 5.13.
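As a minimal sketch of how the binary statistics API mentioned above can be consumed, assuming a 5.14 or newer kernel that provides KVM_GET_STATS_FD: the VM created here is empty and only exists to be queried, and error handling is omitted for brevity.

```c
/*
 * Minimal sketch: dump KVM's binary statistics for a freshly created,
 * empty VM. Assumes a 5.14+ kernel providing KVM_GET_STATS_FD.
 */
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int vm_fd = ioctl(kvm, KVM_CREATE_VM, 0);
    int stats_fd = ioctl(vm_fd, KVM_GET_STATS_FD, NULL);

    struct kvm_stats_header hdr;
    pread(stats_fd, &hdr, sizeof(hdr), 0);

    /* Each descriptor is followed by a name of hdr.name_size bytes. */
    size_t desc_sz = sizeof(struct kvm_stats_desc) + hdr.name_size;
    char *descs = malloc((size_t)hdr.num_desc * desc_sz);
    pread(stats_fd, descs, hdr.num_desc * desc_sz, hdr.desc_offset);

    for (uint32_t i = 0; i < hdr.num_desc; i++) {
        struct kvm_stats_desc *d = (void *)(descs + i * desc_sz);
        uint64_t value;

        /* Read only the first element of each statistic, for simplicity. */
        pread(stats_fd, &value, sizeof(value), hdr.data_offset + d->offset);
        printf("%s = %llu\n", d->name, (unsigned long long)value);
    }

    free(descs);
    close(stats_fd);
    close(vm_fd);
    close(kvm);
    return 0;
}
```

The descriptors only need to be read once; a monitoring tool can then keep re-reading just the data block, which is what makes this cheaper than repeatedly parsing debugfs.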
All this SEV work was presented in the past, of course, at KVM Forum 2019; in 2020, instead, TDX, Intel's Trust Domain Extensions, was announced and presented by Intel. The work on upstreaming TDX support for KVM is underway.

As has been usual for the past few years, we had several changes for nested virtualization. Migration of nested hypervisors has gotten remarkably stable, and there's also support for multiple new features in the nested hypervisor on Intel, for example the wait-for-SIPI activity state and nested TSC scaling. There are also optimizations that allow KVM to run faster when it is itself a nested hypervisor on Hyper-V hosts, and those optimizations have been extended to AMD processors. Unfortunately, we also found a few security vulnerabilities on AMD processors. Finally, on the topic of nested virtualization, I would like to mention the Google Summer of Code project for improving SVM emulation in the TCG binary translator. This project has produced new test cases, which will be useful in further development of nested virtualization in KVM.

We had a few miscellaneous new features, such as exiting to user space on MSR access, the interesting Xen interface implementation, and static calls. The last one is the continuation of a project presented by Andrea Arcangeli at KVM Forum 2019, but this work was actually contributed by Akamai. It's good to see developers pick up projects that were not completed for one reason or another, ensuring that everybody can benefit from those performance improvements. Also interesting is the selective enabling of Hyper-V hypercalls, which is important for reducing attack surface when features are disabled.

On the topic of Arm, the most interesting feature is probably protected KVM. This was also presented last year at KVM Forum by Will Deacon, and Will's talk has also been covered on Linux Weekly News, LWN.net. With ARMv8.1, the virtualization host extensions were added in order to optimize KVM and other hypervisors that are part of an operating system kernel. However, protected KVM does not use the virtualization host extensions, and instead separates the hypervisor, running at exception level 2, from the host kernel, running at exception level 1. The host Linux kernel is then isolated from the hypervisor, and guests are not visible to the host kernel at all. This can be used on Android, for example, to isolate various binary blobs that want to run in a trusted environment. Despite the name, the actual relationship between these binary blobs and Linux is more akin to mutual distrust, and protected KVM provides something that satisfies both parties: the custom binary blobs are happy because they are not visible to Linux, and the Linux developers are happy because those proprietary binary blobs are effectively deprivileged.

This is complex work that comprises a lot of steps, and several of those were merged over the past year. For example, the hypervisor now protects itself from the host kernel using stage 2 page tables. The hypervisor component can now manage page tables without any reliance on the host kernel, and uses these to create a stage 2 mapping for the host kernel as well. This host kernel mapping is mostly an identity map, but it is populated lazily; this way, the hypervisor itself is never mapped into the host kernel. Also, pages in a protected KVM system now have a notion of their owner, which can be the hypervisor, the host kernel, or even a guest.
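As a purely illustrative sketch of that ownership model (not the actual pKVM code), you can picture something like the following, where the donation discussed next is the only transition that moves a page from the host to the hypervisor:

```c
/*
 * Purely illustrative sketch of the "page owner" idea; this is NOT the
 * actual pKVM code. Each page tracked by the EL2 hypervisor carries an
 * owner, and ownership only changes through explicit transitions.
 */
enum pkvm_page_owner {
    OWNER_HYP,      /* hypervisor-private, never mapped into the host */
    OWNER_HOST,     /* host kernel, present in the host's stage-2 identity map */
    OWNER_GUEST,    /* donated to a protected guest, unmapped from the host */
};

struct pkvm_page {
    unsigned long pfn;
    enum pkvm_page_owner owner;
};

/* Donation: the only way a host-owned page becomes hypervisor-owned. */
static int donate_to_hyp(struct pkvm_page *page)
{
    if (page->owner != OWNER_HOST)
        return -1;              /* only host-owned pages may be donated */
    page->owner = OWNER_HYP;
    /* The real code would also unmap the page from the host's stage 2. */
    return 0;
}
```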
What was merged in 5.15 is the code that lets the host donate memory to the hypervisor, to be used for guests or for stage 2 page tables. Those stage 2 page tables are then made inaccessible to the host kernel. Likewise, when a protected guest is initialized, all of its memory is owned by the guest and is made inaccessible to the host. Unfortunately, this does not work well for virtio, which necessitates sharing memory back to the host, and for this reason work is underway for guest-host page sharing and for device DMA support in protected guests. There will be a new hypercall interface to facilitate the sharing, and also bounce buffering to prevent sharing sensitive information that may reside in the same page as the DMA buffers. Protected KVM will also introduce a new restricted DMA mechanism that instantiates the software bounce buffers on a per-device basis. All this work is expected to ship in Android 13, and the integration with the Android user space plus the kernel development is a crazy amount of work, so it's really cool that KVM got this new user, and a lot of us will probably be users of it.

So, what's next? In the realm of confidential VMs, as I said before, TDX support is on the way, and also the next iteration of Secure Encrypted Virtualization for AMD, SEV-SNP, which stands for Secure Nested Paging. There will certainly be more work on nested virtualization, especially for AMD: for example, nested TSC scaling and possibly optimizations for the nested-nested virtualization scenario, where you have more levels of nested hypervisors. This includes features such as the virtual global interrupt flag (vGIF) and virtual VMLOAD/VMSAVE, and possibly other optimizations to the code. On the MMU topic, SNP and TDX both require some care in managing the page tables, and this work will probably be done only on the new MMU code, so we might even decide to deprecate or remove the legacy MMU support for two-dimensional paging. Finally, the binary statistics API has not yet been plumbed into the user space stack, and this is certainly something that would be very useful. It would be great if next year we were to have support for this in QEMU, in libvirt, and possibly even in monitoring tools and in libvirt clients such as KubeVirt or OpenStack. Thank you, and enjoy the rest of KVM Forum.

Hey there, my name is Eric Ernst. I am a software engineer at Apple, where I work on cloud infrastructure. I'm super psyched to be here to learn at KVM Forum and also to have the opportunity to give you an update on Kata Containers. My interests for the last several years have been the mix of containers and virtualization, bringing the two together in a Kubernetes environment, and because of that there's no greater place to be than the Kata community, where we work on exactly these things, and I have the opportunity to be part of the Architecture Committee. So today I want to give an update on what's new and what we're up to next, as well as provide a refresher on what exactly Kata Containers is and how it works. Starting off, what is Kata? Well, we're an open source project and we're part of the Open Infrastructure Foundation, and because of that, early on as the project was starting we had a big focus on making sure it wasn't just open source, but that we also did open design, open development, and open community, instead of just throwing things over the wall. With that, we do have an Architecture Committee.
We have open governance, so twice a year we vote for the different seats, and anybody who has committed anything to the project is able to vote. We have a pretty good diversity of great engineers from different companies, and then myself as well: Archana from Intel, Fabiano from Red Hat, Peng Tao from Ant Financial, and Samuel Ortiz from Apple. On top of that, it's pretty interesting, we have a lot of other people who are important stakeholders, users, contributors, and everyone else around the project. So if you look at the contributor list, it's a pretty good mix, and it can give you a good indication of who the stakeholders in the community are and what companies make up the Kata Containers community.

But as far as the actual project itself, what are we providing? It's a secure container runtime. Basically, if you are security conscious, defense in depth is always a good design pattern, and we're applying that in the Kata Containers project: we're taking container isolation and then adding a second layer on top of it, which is hardware virtualization. If we think about it from a threat-model perspective, we don't trust the workload. So imagine I'm an infrastructure provider doing remote code execution as a service: I don't trust you, and I'm going to protect my hosted infrastructure, including the other workloads running on it, from your workload. That is the primary threat model that we have today with the project.

And as such, on the left-hand side we have a traditional container, and on the right-hand side a drawing of what Kata really represents. On the traditional container side, they are doing a great job of virtualizing the host from a software perspective, leveraging standard Linux primitives like cgroups, namespaces, dropping capabilities, and everything else, in order to give the user a sense that they have their own machine while also limiting the impact that different processes on a single host can have on each other. The problem, again from the defense-in-depth point of view, is that all of those primitives come from that same host Linux kernel. So what we do is use those same primitives, but from a guest VM instead, using a unique kernel per pod (in Kubernetes, a pod is a group of containers, the minimal schedulable unit). So, again, we leverage hardware virtualization to create those virtual machines and use that as a second layer.

So how do we do it? Here is a quick overview of how we take these container primitives and what it takes to run a workload, mapping all of that to the inside of a virtual machine. In a Kubernetes environment, on every single node you'll have a kubelet, or something like a kubelet, which is going to reconcile state and say: okay, here's a workload, it's assigned to me, I need to go ahead and create that workload and run it on my node. The kubelet itself doesn't want to actually do this; it goes through the Container Runtime Interface to a CRI runtime like CRI-O or containerd, which is then responsible for pulling the image and actually running the workload itself.
In actuality, CRI-O or containerd don't do all of the work of managing the lifecycle of your workload; that's passed down to an OCI (Open Container Initiative) runtime, like Kata Containers, or, in the traditional case, runc. So in the Kata case, of course, we're going to start up a virtual machine. Generally we're using a VMM with KVM underneath, and we're going to run a minimal guest kernel, to minimize footprint as well as attack surface, and then a simple user space program, the Kata agent, which actually administers the lifecycle and creates the workload itself.

But in order to communicate with that agent, to articulate the actual lifecycle, signal handling, and everything else, we need a medium for communication between the host and the guest. What we use for that is virtio-vsock, which, depending on the VMM implementation, can be a hybrid approach, where a Unix domain socket on the host maps to vsock, or a vhost-vsock solution. That's what we use to communicate between the host and the guest.

Of course, we're going to need networking. Typically, in a Kubernetes or container environment, the workload, and in our case even the VMM, runs inside a network namespace. In a simple case, Kata will be provided a veth inside that namespace, and the other end of the veth could be attached to, say, a Linux bridge or something simple on the host. What we need to do now is map that veth to something that the actual VM can consume, like a tap device. There are multiple ways to do it; the way we do it in Kata Containers is to use TC mirroring, where essentially the tap device's RX is mapped to the veth's TX and vice versa, and that way we can guarantee connectivity. In the past we used macvtap, as well as a naive Linux bridge implementation, but we found TC mirroring to be the best solution for us.

On top of that, you need the actual workload to get inside, as well as maybe certificates, tokens, config maps, and everything else that in a Kubernetes environment would typically just be a directory on the host that is bind-mounted in. In our case, we can't just bind-mount a directory in; we need some mechanism for file system sharing, and in our case that's virtio-fs. What we do is bind-mount into a shared path on the host, and then we have a vhost-user application, virtiofsd, which acts as the back end of virtio-fs and presents it inside the guest. This works great, including for passing in the rootfs. But sometimes we don't just have a mounted file system; we may have a raw block device that we want to pass through, say a persistent volume. In that case, what we have to do is map that in, and that's just going to be a simple virtio-blk solution. So we have our persistent storage, we have our networking, we have the volumes and secrets as well as the rootfs. We can now go ahead and communicate with the agent and start up the actual container workload, or workloads, since it is a pod. And at this point, any standard I/O or anything else from the workload is going to be mapped over virtio-vsock back onto the host for consumption by the upper layers of this architecture.
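As a minimal sketch of what that host-guest transport looks like from the host side, here is an AF_VSOCK client, the socket family that virtio-vsock exposes on Linux; the guest CID and port are made-up values, and the agent's actual RPC protocol is not shown.

```c
/*
 * Minimal sketch of a host-side AF_VSOCK client. The guest CID and port
 * are made-up example values; the agent's RPC protocol is not shown.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

int main(void)
{
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = 3,        /* hypothetical guest CID */
        .svm_port   = 1024,     /* hypothetical agent port */
    };

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    /* A real runtime would speak the agent's RPC protocol here. */
    const char msg[] = "hello from the host";
    write(fd, msg, sizeof(msg));

    close(fd);
    return 0;
}
```

Depending on the VMM, the host end may instead be a Unix domain socket bridged to vsock (the hybrid approach mentioned above), in which case the connect() would target that socket path instead.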
So I mentioned that we use a couple of different virtual machine monitors: we have Cloud Hypervisor support as well as Firecracker, though in the Firecracker case there is no shared file system, while the rest of this picture is accurate, and we also have QEMU support.

So what's new with the project? What have we been up to in the last year or so? There have been a bunch of different improvements and updates. First of all, we had a 2.0 release, and the major, breaking change is that we reimplemented the guest agent that runs inside of the virtual machine. It was written in Go before; we rewrote it in Rust and then added asynchronous support for more efficient execution. As motivation, you could say, okay, we wrote it in Rust, it must be more secure, and there is something to be said for that, but one of the biggest benefits is that the footprint of the application is much smaller after moving to Rust instead of Go. So instead of tens of megabytes, we're now at less than a megabyte, which, if you're running many pods, really adds up.

We switched the default for our shared file system to virtio-fs. It used to be 9pfs, and working with some of the great developers on virtio-fs we were able to test and use it, first as an experimental option, but now it's the default and is what we're using out of the box for Kata Containers. We also had a limitation with virtio-fs: because it's based on FUSE, there is no way for the guest application to watch a file that could be changing on the host, because the watch cannot be mapped to that particular file descriptor on the host. What that means is that if you have an application using that shared volume for, say, a configuration or a secret, the application would typically want to use inotify to watch the file, and that would fail in our case because it's mapped through virtio-fs. So what we did is implement a workaround where, inside the agent, we poll the shared file system mount point and manually copy the files if things change. This is definitely a big hack, but it's an area where I think we can contribute upstream, and we are working on adding actual inotify support.

Also, we added OpenTelemetry support. Tracing is very helpful, especially when you're running at scale; we had OpenTracing support, but have now migrated to OpenTelemetry. On the QEMU side, we changed our default to Q35, which was probably a long time coming.

So what's next? What are some of the things that we're looking at, that we are actively doing right now and that are coming in the future? One area is performance improvements and performance isolation. These are less about making changes in the VMM or anything like that; this is more about working with the greater cloud-native community to provide more VM-native interfaces. So instead of giving me a veth, if you know that you're going to be running inside of a VM solution, provide me with a tap device instead, and then we can avoid a little bit of latency and a little bit of complexity in the infrastructure.
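As a rough sketch of what "hand me a tap device" means in practice, this is how a tap interface can be created from user space via /dev/net/tun; the interface name is a made-up example, and a real VMM would go on to use the returned file descriptor as its virtio-net backend.

```c
/*
 * Rough illustration of "hand me a tap device": create a tap interface
 * from user space via /dev/net/tun. The interface name is a made-up
 * example. Requires CAP_NET_ADMIN.
 */
#include <fcntl.h>
#include <linux/if_tun.h>
#include <net/if.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/net/tun", O_RDWR);
    struct ifreq ifr;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;   /* L2 tap device, no packet-info header */
    strncpy(ifr.ifr_name, "tap-example0", IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        perror("TUNSETIFF");
        return 1;
    }

    printf("created %s; it exists for as long as this fd stays open\n", ifr.ifr_name);
    pause();
    return 0;
}
```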
It's similar with storage. If you're providing me a file system volume today, it's a block device which is formatted and then mounted on the host, then shared into the guest over virtio-fs, and then mounted inside the guest, which is quite a bit of complexity and requires mounting on the host. Why not just provide that block device itself and let us mount it only inside of the guest? That's maybe a slightly better security posture, and it's also more efficient to use virtio-blk rather than having all these file system mounts. Similarly, we're looking at better management of how we do CPU sets and at mapping the physical topology on the host to the topology that we present inside the guest. As an example, we can apply CPU sets on the host to the different vCPU threads and then make sure we have a mapping of each vCPU thread to a guest CPU as well. So there is different work going on in this area.

Another area that we're looking at, which is definitely super exciting, is confidential computing, and this gets a lot more into virtualization technology. As I discussed before, the threat model that we have today is that I am a remote-code-execution-as-a-service provider and I want you to run on my hardware, but I sure don't trust you, and neither do the other tenants on this hardware trust you, so we have to provide extra security. As a user running on my service, you inherently have to trust me: there's nothing stopping me from reading your memory and seeing exactly what you're doing. So if you have very, very sensitive data, your trust inherently has to be in the service that I'm providing. With confidential computing, you're flipping that. You're saying: okay, I'm the user now, and I don't trust you, even though I'm going to use your infrastructure. I'd love to use your infrastructure, but I need some kind of guarantee that I can trust you, and that's what confidential computing provides. What we're doing is preventing the host from being able to attack the workload, and by attack I mean even just viewing the actual contents of what's executing inside the guest.

So this is pretty neat. To do it, we no longer just protect the data when it's in transit, with encryption, or when it's at rest; we also need to protect it while it's being processed. So the active memory of the actual workload's processes, as well as the CPU state, needs to be encrypted so that only the guest can access it. Similarly, I need to provide a mechanism so that you can measure the hardware and software stack and know that you can trust what you're running on. Now, all of this is very much architecture specific, and there have been a lot of changes here in the last few years. There are different implementations from AMD, IBM, and Intel, as well as Arm upcoming, and these are at different levels of maturity: some of these are specs, some of these are hardware that you can get today. As you would expect, there are also hypervisor dependencies, where some of this is still in progress; for example, the TDX work from Intel is still being upstreamed today. And then you need to actually find a VMM that's able to leverage this and provide a mechanism to do it. So I think this is a very exciting space, and there's a lot of work going on here. If any of this is interesting to you, I would love to have more people get involved.
We already have a very robust group working in this space, and I think we would be happy to have more input. So that's all I have today. I really appreciate your time. Thanks, bye.