Hello everyone, welcome to my presentation. My name is Sebastian Ene, and I work for Google as part of the Protected KVM (pKVM) project. In today's presentation we are going to talk about a vCPU stall detection mechanism. This mechanism is not pKVM-specific; it can also be used for regular KVM guests. It was added to the upstream Linux kernel and, apart from that, it will be available in Android U.

Moving on to the agenda: first of all, I want to talk about the reasons for adding a new watchdog-like device. Then we are going to explore what the current Linux infrastructure has to offer in this area. After that, we are going to take a look at the proposed solution, which involves a crosvm backend driver as well as a Linux kernel driver on the frontend side. And last but not least, we are going to look at the lessons learned and the next steps.

So let me briefly discuss the problem we were facing. First of all, why add a new stall detector for KVM guests? It turns out that previously there was no mechanism to detect stalled guests from the outside world. In our case this meant that the vCPU threads appeared as runnable, but we didn't know which code path they were stuck on or what state the guest was in. If you look at existing solutions, for example the Chrome OS SMC watchdog, they don't account for stolen time, and in our case that can result in spurious resets. What is stolen time? It represents the time taken away from the guest while the host is busy doing something else. Here is how it currently works on ARM: the host shares a page with the hypervisor, the hypervisor is then responsible for updating the stolen time in it, and the guest reads that page. We also need to account for vCPU hotplug events, for example when a CPU goes online or offline.

Let's take a look at the existing watchdog infrastructure that Linux has to offer. The Linux kernel exposes the /dev/watchdog interface, which is used for receiving user-space notifications from a daemon. This is not a KVM-specific mechanism; it's bare-metal behavior and has existed for a while. In normal operation, the notifications inform the watchdog that the system is running in the expected state and everything is in order. If a notification is not received by the watchdog, the system dumps its state before rebooting, prints a message to the log, and then resets so that it comes back up in a clean state.

If you take a look at this diagram, it shows that vCPU threads are scheduled independently: they rely on the host scheduler. One thing is that we need to account for stolen time here, because vCPUs are backed by POSIX host threads, and if the watchdog expires while the vCPU is not running we can get spurious resets. Since stolen time is accounted separately for each vCPU, we require strong CPU affinity when sending the heartbeat notifications, which cannot be guaranteed from user space. We also have to consider CPU hotplug, for example.
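As a concrete reference for the /dev/watchdog interface mentioned above, here is a minimal sketch of the bare-metal pattern: a user-space daemon that periodically pets the watchdog. The 30-second timeout is just an illustrative value, not something from the talk.

    /* Minimal sketch of a /dev/watchdog petting daemon. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/watchdog.h>

    int main(void)
    {
        int timeout = 30;                          /* seconds until reset */
        int fd = open("/dev/watchdog", O_WRONLY);

        if (fd < 0) {
            perror("open /dev/watchdog");
            return 1;
        }
        ioctl(fd, WDIOC_SETTIMEOUT, &timeout);     /* request the timeout */

        for (;;) {
            /* The heartbeat: tells the driver the system is still healthy. */
            ioctl(fd, WDIOC_KEEPALIVE, 0);
            sleep(timeout / 2);
            /* If this loop ever stalls, the timeout expires and the machine resets. */
        }
    }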
If we follow the current infrastructure, using the existing Linux kernel watchdog drivers, we can end up in the following scenario: the guest runs and sends an MMIO notification to the emulated watchdog device. While it is writing the device register, the guest exits on the data-abort path. The VMM on the host then receives the notification and rearms the timer for the next expiration period. The timer starts decrementing its internal counter, but the guest is not scheduled to run. This can end up pretty badly: the timer expires simply because the guest wasn't scheduled to run, and that triggers a spurious reset.

The solution to this problem that I came up with adds two different blocks. One of them is the vCPU stall detector emulated device. This is part of crosvm, so it's a crosvm-emulated device; it uses the MMIO interface to talk to the guest, and we define a set of registers for this emulated device. We also added a frontend driver in the Linux kernel that knows how to interact with it.

This slide explains in a bit more detail what the crosvm backend driver is doing. On the right-hand side you can see the list of registers, which are defined per vCPU. Every MMIO device is abstracted in crosvm by a bus device interface, and a device registers on the memory bus by providing its size and memory region to KVM. The MMIO events are dispatched to the registered device, which performs the necessary logic. We detect a stall when the internal clock, which is accounted per vCPU, decrements an internal counter down to zero; that is when the vCPU stall detector fires. In order to account for stolen time, we look at the procfs stat entries that exist per vCPU.

The Linux kernel frontend driver is a standard miscellaneous device. It ended up this way because we received some upstream objections to our proposed inclusion in the watchdog framework. On the right-hand side you can see a small DT node that describes our device. Initially we tried to create this device as part of QEMU, and after that we also created the crosvm MMIO device. The stall detector is probed using the device tree, and this guest driver is responsible for delivering heartbeat notifications, which inform the system that the guest is still up and running. The Linux kernel driver is also responsible for registering for CPU hotplug events: whenever a CPU goes offline, we need to disarm the hrtimer so that we don't end up in a state where it can reset the guest.

These are the states of the frontend watchdog driver. It starts in the initialization state, where it configures the internal clock and computes the number of ticks from which the counter will start decrementing. It then jumps into the normal operation state by writing the watchdog status register. While in normal operation, it is expected to deliver periodic heartbeat notifications by writing into a per-vCPU register, the load count register. It ends up in a locked state when no notifications are sent from the guest.

What did we end up adding? We added a patch to the upstream Linux kernel after 12 revisions. We also have some crosvm changes that are not yet merged, and we are looking forward to supporting this mechanism in Android U. There is nothing Android-specific here, so it can also be used by other OSes. One thing we want to do as a next step is to report a diagnostic message from the virtual machine monitor, in our case from crosvm, so that we can inform another service that our guest ended up in a stalled state.
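To illustrate the heartbeat path described above, here is a minimal sketch of what the guest-side, per-vCPU petting logic can look like with hrtimers. The register name and offset are hypothetical placeholders, not the upstream driver's actual layout, and the probe and device-tree plumbing are omitted.

    /* Illustrative sketch only; register layout is a placeholder. */
    #include <linux/kernel.h>
    #include <linux/types.h>
    #include <linux/hrtimer.h>
    #include <linux/io.h>
    #include <linux/ktime.h>

    #define VCPU_STALL_REG_LOAD_CNT 0x04    /* hypothetical register offset */

    struct vcpu_stall_priv {
        void __iomem *base;                 /* per-vCPU MMIO window */
        struct hrtimer vcpu_hrtimer;        /* per-vCPU heartbeat timer */
        u32 ticks;                          /* counter value to reload */
        ktime_t period;                     /* heartbeat period */
    };

    /* Heartbeat: reload the backend counter before it counts down to zero. */
    static enum hrtimer_restart vcpu_stall_pet(struct hrtimer *t)
    {
        struct vcpu_stall_priv *p =
            container_of(t, struct vcpu_stall_priv, vcpu_hrtimer);

        writel_relaxed(p->ticks, p->base + VCPU_STALL_REG_LOAD_CNT);
        hrtimer_forward_now(t, p->period);
        return HRTIMER_RESTART;
    }

    /* Armed on the target CPU, e.g. from a CPU hotplug "online" callback. */
    static void vcpu_stall_start(struct vcpu_stall_priv *p)
    {
        hrtimer_init(&p->vcpu_hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
        p->vcpu_hrtimer.function = vcpu_stall_pet;
        hrtimer_start(&p->vcpu_hrtimer, p->period, HRTIMER_MODE_REL_PINNED);
    }

    /* Disarmed on the "offline" path so an unplugged vCPU cannot cause a reset. */
    static void vcpu_stall_stop(struct vcpu_stall_priv *p)
    {
        hrtimer_cancel(&p->vcpu_hrtimer);
    }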
Thank you. If you have any questions? Yes, please? Yes. So the question was that I should explain in a bit more detail why we need to account for stolen time and why it matters in our case. To answer it: because the host scheduler is responsible for scheduling the vCPU threads, if the system is busy scheduling between guests, it can end up not servicing your guest frequently enough, and the problem with that is that it can trigger spurious events on the watchdog side.

Yes, that's a great question. So the watchdog itself is responsible for detecting stalled guests, but when the guest is simply not scheduled to run, we don't want to reset it in that case. Does that answer your question?

I mean, my question really is: what is a particular situation where you accept that the host is so overloaded it can't run guests in, say, 30 seconds? The normal situation for watchdogs is that they're used on spaceships, where you need to reset the thing if it's not responsive within some time. In this case the guest is not responsive in that time, and therefore some action has to be taken. So I don't think these reboots are spurious. Maybe the reboot is the wrong action to take, but it's not a spurious event; it's a real thing.

Yes, I see your point. For example, if we are stuck on some MMIO-emulated device, a device which is MMIO-emulated in crosvm or whatever VMM, you don't want to track that time as being part of the periodic expiration of the watchdog.

Yes? Yes, so what I'm looking at is whether the vCPU threads are running or not. I'm not looking at the entire system, for example whether a guest application was scheduled. On the current infrastructure, what we have is a daemon which is petting the watchdog, and what I'm trying to do is add hrtimers to deliver that heartbeat. Yes, so it is stolen inside the guest kernel. It can run in weeds, for example.

Maybe another, it's not exactly another way to do it, but another consideration could be that the watchdog might not reset immediately and instead just send an event somewhere, and that component could look at the stolen time and decide based on it whether or not to reset the VM. If you see that there is heavy over-committing, perhaps the solution is not to kill that VM but to kill some other VM that is less important. If you have this kind of extreme over-committing, at some point you have to do something like an OOM killer for vCPU users, essentially.

Yes, that's a strong point, yes. The ESB device which I have to write in KVM does actually allow two different actions, so you can have a pre-action that happens before you do the reboot, and that pre-action might be something parallel to what you're suggesting. Yeah, I think that's a good extension of what I've currently added, so I think that's doable. Thank you for the suggestion. Okay, thank you so much for your attention.
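As a closing illustration of the stolen-time accounting discussed in the Q&A: one way a host-side component can estimate how long a vCPU thread has been waiting to run is to read that thread's schedstat entry. The field order follows Documentation/scheduler/sched-stats.rst; whether crosvm uses exactly this source is an assumption of this sketch, and schedstats support must be available on the host.

    /* Sketch: per-vCPU-thread run-queue wait time as a proxy for stolen time. */
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/types.h>

    /* Returns the run-queue wait time in nanoseconds, or 0 on error. */
    static uint64_t vcpu_runqueue_wait_ns(pid_t vmm_pid, pid_t vcpu_tid)
    {
        char path[64];
        unsigned long long on_cpu = 0, waiting = 0, slices = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/task/%d/schedstat",
                 (int)vmm_pid, (int)vcpu_tid);
        f = fopen(path, "r");
        if (!f)
            return 0;
        if (fscanf(f, "%llu %llu %llu", &on_cpu, &waiting, &slices) != 3)
            waiting = 0;
        fclose(f);
        return waiting;     /* time the vCPU thread spent waiting to be scheduled */
    }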