Hello everyone. Today we will talk about KubeVirt and the cost of containerizing VMs. I am a master's student from Japan, and I have been researching container and lightweight virtualization technologies. I joined the Google Summer of Code program in 2020, where I worked under the libvirt organization and contributed to an open source project there; I implemented features and supported advanced performance tuning. I feel truly grateful to have Dario and Vasiliy on the team. Dario is an experienced virtualization specialist at SUSE. He has been working on both the Xen and KVM projects. Vasiliy is a senior software engineer at SUSE. He has been working on container and VM convergence technologies, and he is also an active contributor to KubeVirt. Both of them are very kind and supportive.

Today's specific topic is the performance evaluation and tuning of KVM and KubeVirt. Throughout the talk, we will show you the effect of vCPU pinning and virtual topology on VM performance, as well as the tuning capabilities available on both KVM and KubeVirt. We will also explain some cases where improper tuning can actually lead to quite a significant performance loss.

Just a quick recap. KVM is an open source virtualization solution that is built into the Linux kernel and runs on x86 machines. KubeVirt is a Kubernetes add-on that allows you to manage VMs and containers in a unified manner. Here is an interesting comparison of KVM and KubeVirt. With KVM, you have full control of the tuning capabilities, which can also be a disadvantage, because tuning can be complex. In addition, with the traditional approach you can decide where each VM runs, but managing them can quickly become troublesome if you have thousands of them. This is why KubeVirt came into play. KubeVirt inherits Kubernetes' ability to orchestrate VMs and scale. It hides the VM configuration complexities behind a high-level YAML definition, which also turns out to be a disadvantage, since you don't have full control of all the VM parameters. It's worth mentioning that the container runtime might introduce some overhead due to resource usage accounting and limitations. From now on, Dario will talk about the environment setup and tuning.

Hi, everyone. Since this talk is about the results of some experiments that we have run, let's have a look at the experimental setup. We used a server with 32 physical CPUs, both as the KVM host and as the KubeVirt worker node, which basically means this is the hardware where the VMs were actually running in both configurations. It had two NUMA nodes with eight cores each and hyperthreading enabled, as you can see in this slide, together with some more information about it. This is how our server looks if you ask lstopo. As the host OS, we used Ubuntu LTS. We used the latest available KubeVirt release, which was 0.44 and includes QEMU 5.2 and libvirt 7. For the KVM experiments, we built from source and used the very same versions of those software components. We ran our experiments inside a one-vCPU VM and then repeated them inside a four-vCPU VM. The VM had eight gigabytes of RAM in both cases and was running openSUSE Leap 15.2. We used MMTests as our benchmark suite, as it can orchestrate running benchmarks inside one or even multiple VMs. We have run several benchmarks, and we have considered multiple configurations for many of them. We ran cyclictest, with the wakeup threads either pinned to the vCPUs or not.
We ran the NAS Parallel Benchmarks with OpenMP, with two threads running in parallel (in the four-vCPU VM, of course) and executing the various computational kernels. We also ran STREAM, again with two threads, still in the four-vCPU VM. We ran hackbench with either two or four groups of threads, resulting in quite a few communicating tasks. We ran kernbench with either one, two, or four build jobs in parallel, and we ran IOzone for synchronous IO with different file sizes. We ran all those benchmarks inside a one-vCPU VM and then inside a four-vCPU VM, as I said, and we considered different load conditions for the host too: an idle host, a host with 50 percent load (without considering our VM), and a host with 100 percent load, again without considering our VM. For generating the load on the host, we used stress-ng, launched in such a way that its threads simulate having other VMs running on the host at the same time as ours. In this presentation, we will show a subset of the results that we have achieved.

So how can we tune the performance of a VM? Well, one way of doing that is some kind of static or semi-static resource partitioning. This means doing vCPU and memory pinning, and maybe defining a virtual topology for the VM as well. We can also hope to improve performance by using huge pages, by isolating the vCPUs from the interference of any IO activity, and with a couple of other things.

Using huge pages, or large pages, means using memory pages bigger than the default four kilobytes for backing the VM memory. This improves performance because walking the various levels of the page tables for translating guest virtual addresses into host physical addresses has to happen less often, lowering the pressure on the TLB, and when it does happen it is also faster.

If our VM has more than one vCPU, we can define a virtual topology for it, which means that its virtual CPUs can be arranged in virtual cores with virtual threads, virtual sockets, and even virtual NUMA nodes. The software running inside the VM, both the kernel and userspace programs, will then see such a topology and will make assumptions and try to make optimizations based on this information.

vCPUs can be pinned to the host physical CPUs, which means that the host scheduler will only be allowed to move the vCPUs around the various pCPUs within certain limits, if at all. Pinning can be very effective for cutting down the overhead of vCPU migration, for achieving more consistent performance results, and, even more important, for partitioning the host resources (physical CPUs, in this case) among the various VMs, or in general among the various entities and activities that we have running on the host.

And of course we can try to combine vCPU pinning and virtual topology and achieve, at least potentially, very good results, if we pin the vCPUs in such a way that the virtual topology that has been defined for the VM matches the physical topology of the group of pCPUs where the vCPUs of the VM run. We can, however, also completely mess up our performance if we badly mismatch the guest and host topologies, for instance by pinning the vCPUs of a virtual core on pCPUs from different physical cores, and even worse if we also get the resource partitioning part wrong. If we are pinning the vCPUs on the pCPUs of a specific NUMA node, then it also makes sense to try to make sure that the memory of the VM is entirely allocated on that NUMA node, to avoid the latency of remote memory accesses.
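For reference, this is a minimal sketch of what such a tuned libvirt domain XML could look like for the four-vCPU case. It is only an illustration, not the exact file used in the experiments: the guest name and the pCPU numbers are assumptions (it assumes that 8/24 and 9/25 are SMT sibling pairs on NUMA node 0), and they have to be adapted to what lstopo reports on the actual host; the OS, disks, and other devices are omitted.

    <domain type='kvm'>
      <name>tuned-guest</name>                 <!-- hypothetical name -->
      <memory unit='GiB'>8</memory>
      <memoryBacking>
        <hugepages/>                           <!-- back guest RAM with huge pages -->
      </memoryBacking>
      <vcpu placement='static'>4</vcpu>
      <cputune>
        <!-- 1:1 pinning: sibling vCPUs land on sibling host threads
             (assumed SMT pairs: 8/24 and 9/25 on NUMA node 0) -->
        <vcpupin vcpu='0' cpuset='8'/>
        <vcpupin vcpu='1' cpuset='24'/>
        <vcpupin vcpu='2' cpuset='9'/>
        <vcpupin vcpu='3' cpuset='25'/>
      </cputune>
      <numatune>
        <!-- keep all guest memory on the NUMA node where the pCPUs live -->
        <memory mode='strict' nodeset='0'/>
      </numatune>
      <cpu mode='host-passthrough'>
        <!-- virtual topology: 1 socket, 2 cores, 2 threads per core -->
        <topology sockets='1' cores='2' threads='2'/>
      </cpu>
      <!-- os, disks and the other devices are omitted from this sketch -->
    </domain>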
And there are also a couple of other things that we can do, but I am not going into the details about them in this presentation, for time reasons. It is also possible to pin the threads that deal with IO events, in an attempt to improve the scalability of IO and to isolate the vCPUs from the interference of the IO itself. We actually did some experiments with these settings, but again I will not go into details, and we will not show the results, for time reasons. Finally, still about IO, we saw that for our configuration, which uses a preallocated raw image as the disk of the VM, using native asynchronous IO was really important. For instance, the IOzone benchmark wasn't even able to finish unless we used io=native. We are therefore using it for all the configurations that we show results of. Apart from that, IO tuning at the virtual device level happens by specifying the caching mode and whether or not we want to use multi-queuing.

Now, for KVM, in our experiments we considered four different configurations. The first is the default one, with no pinning at all for the vCPUs and no virtual topology defined for the VM. Then there are a "pinning + default topology" one and a "pinning + custom topology" one, let's call them like that, where we do one-to-one vCPU pinning for both, and either leave the virtual topology as the default one or define a custom one with all the vCPUs defined as cores of the same socket. Finally, we checked the fully tuned setup, or at least what we call the fully tuned setup, where we do one-to-one vCPU pinning coupled with a perfect guest-to-host virtual topology mapping. For reference, this is how we specify each of these configurations in the libvirt VM XML file. Of course, we expect that it is the best tuned configuration that will give us the best results. Let's see if that's true.

For KubeVirt, we tried to do the same. However, in this case it is Kubernetes and KubeVirt that are in control of the VM configuration details, such as where the vCPUs are pinned and how the virtual topology is defined. So this is, again for reference, how we tried to achieve the same configuration that I showed before for KVM, and this is all we can do: basically specifying the topology and asking for the vCPUs to be pinned in the VMI YAML file (a minimal sketch is shown right after this paragraph). But check the results, especially for the tuned case, which was supposed to be the best one: the topology has been defined and the pinning has been established, but the mapping between virtual threads and physical threads is actually wrong. And there is really nothing that we can do about it, at least not for now, because that's entirely under Kubernetes' control. So, well, I think this means that we should probably expect weird results, especially from this configuration on the KubeVirt side, which was indeed supposed to be the best one. But again, let's see.
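For illustration only, this is roughly what the CPU-related fragment of such a VMI spec could look like. It is a sketch, not the exact manifest we used: the resource name is hypothetical, the disks, volumes, and the rest of the spec are omitted, and it assumes the worker node runs the Kubernetes CPU Manager with the static policy, which is what dedicatedCpuPlacement relies on.

    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: tuned-vmi                  # hypothetical name
    spec:
      domain:
        cpu:
          sockets: 1                   # virtual topology: 1 socket,
          cores: 2                     # 2 cores, 2 threads per core
          threads: 2
          dedicatedCpuPlacement: true  # ask for pinned, exclusive pCPUs
        memory:
          hugepages:
            pageSize: 2Mi              # back the guest with huge pages
        resources:
          requests:
            memory: 8Gi
        # devices, disks, volumes and the rest of the spec are omitted

As you can see, we can ask for a topology and for dedicated CPUs, but we cannot say which pCPUs the vCPUs should be pinned to; that decision is left to the CPU Manager.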
Here are our benchmark results for memory bandwidth. In KVM, the default case, without vCPU pinning, performs pretty well, because the host scheduler manages to move the vCPUs around in a way that leads to good performance. For example, the host scheduler can put the vCPUs on different cores, since the host is doing nothing else, so sibling threads will not interfere with each other. It will still give us pretty good performance, as you can see from the graph here. On the other hand, if we pin the vCPUs without considering the host topology, we might end up in a situation where two vCPUs happen to be siblings at the host level and compete with each other for resources, which explains why we have a significant performance drop here. If we pin the vCPUs in a way that matches the host topology, as in our tuned case, we are guaranteed to get the best performance, as you can see from the graph here.

When we tune the vCPUs in KubeVirt, the virtual topology does not match the host topology. Unfortunately, there is not much we can do, since vCPU pinning is out of KubeVirt's control: CPU allocation is managed by the CPU Manager in Kubernetes, and KubeVirt has no control over which CPUs will be allocated. To further confirm our understanding, we put some medium load on the host. As you can see from the graph, the performance gap between the default and tuned cases becomes smaller. This is even more obvious when we put a high load on the host: there is almost no difference between the default and tuned cases in KubeVirt. This is perhaps because, when all CPUs are busy, the topology is probably not too important.

The trend is pretty much the same in the NAS Parallel Benchmarks; here, however, we have a more quantitative measure of CPU performance, as it basically runs different kinds of computational kernels. The tuned configuration still guarantees the best performance in KVM. The default configuration can manage well with the help of the host scheduler, but the difference is small when the host is idle. When we put load on the host, the improvement of the tuned case is more obvious. In the KubeVirt case, the topology is mismatched but the vCPUs are pinned, so the host scheduler does not even have the freedom to move the vCPUs around, which leads to quite a bit of performance loss and inconsistent results.

There is a similar trend in cyclictest, where we can clearly see the benefits of pinning. The topology is probably not that important here, because cyclictest is just waking threads up, doing some computation, and going back to sleep. In the unbound case the situation is similar, except that the latency is larger compared to the pinned cases. However, one thing we still don't understand is that in KVM the latency for the idle case is smaller than for the pinned case; we are actually still investigating the reasons.

Kernbench is different from the previous benchmarks: we are essentially running build jobs using one, two, and four vCPUs, whereas the previous benchmarks used only half of those four vCPUs. As you can see from the graph, when there is only one job running, the pinned configurations outperform the default one: since there is no overhead of vCPU migration, all the pinned configurations beat the default case. When we start running two jobs, the tuned configurations outperform the default one by very little, which is what we expect, and of course the mismatched topology suffers. When we saturate all the vCPUs, the default configuration comes out on top in all cases, again because the host scheduler can put those threads on four different cores. This is perhaps our favorite benchmark, not only because it matches all the previous results, but also because it shows the whole picture of how the different configurations behave in the case of one, two, and four busy vCPUs.

IO performance is a complicated one: it can be affected by CPU, memory, cache, or a combination of them. For now, we just want to show you the general tendency. There is not much difference between KVM and KubeVirt in the case of sequential reads.
However, we see a larger standard deviation when we put some load on the host, and of course we see a small performance drop when the host is highly loaded. Random reads show a similar tendency, and there is a significant performance drop compared to sequential reads, which is what we expect; after all, it's random read. The same goes for sequential writes. Random writes experience a large performance drop. Overall, we don't see much difference between KVM and KubeVirt, which is not surprising, because we are using the same kind of disk backend and configuration.

To conclude, we have found that matching the host CPU topology in the VM can guarantee good performance, while a mismatched one can lead to a great performance loss. In general, the host scheduler can manage really well in the default case, but only if there is not much load on the host. There are some inherent limitations in KubeVirt with respect to CPU pinning, since CPU allocation is managed by the CPU Manager in Kubernetes, and we cannot do much about it. I guess this is perhaps a trade-off they have to make, because full manual control would contradict the idea of automatic orchestration; still, the default configuration works quite well in general. Last, I think KubeVirt can actually be improved to avoid a mismatched CPU topology.

And this is the end of the presentation. Thank you for listening, and please don't hesitate to contact us if you have any questions; we are happy to discuss anything. Again, thanks to Dario and Vasiliy for all the support along the way. Last of all, we really appreciate the KVM Forum committee for organizing this event. Hope to see you all again.