Hello, and good almost-afternoon. I'm standing between you and your lunch break, but I think if you're here, you're as excited about Hypervisor.framework as I am. So really quick, who am I, what are we talking about? I'm Alex Graf, I work at Amazon, mostly on KVM-related things, but this is not KVM. I'm going to look quickly at how this compares to KVM, but here we're looking at macOS technology. All opinions are my own, this is not official company work. So on macOS, if you want to virtualize a system, if you want to spawn a virtual machine, traditionally as a user you just see: hey, I'm running my hypervisor and things just work, right? But under the hood, what that meant is that your virtual machine monitor, the thing that actually shows you the display and so on, needed to go and leverage CPU instructions that are privileged, which means it had to have code running inside of the macOS kernel. Now, I don't know if you're familiar with the code quality of hypervisors, but let's just say that KVM is probably leading the bunch there, which means Apple was very enthusiastic about getting these people out of the kernel, because they just kept getting bug reports about things that really were not related to their code at all. And they were thinking hard about how to actually get there, which eventually led them to the development of something called Hypervisor.framework. So Hypervisor.framework is a combination of two things. It's a piece of code in the macOS kernel that is able to, on Intel, drive VMX, and on ARM, drive all of the EL2 extensions to spawn a virtual machine, plus a user space component, a library that gives you a simple interface into the actual kernel API. And because they now moved the actual hypervisor code into user space, what that allows them to do is lock down their kernel.
And that's exactly what you're seeing on these Apple Silicon Macs, where you are no longer allowed to inject even your own code into the Apple kernel. So that's some background on why that whole framework exists. I'm not saying we should follow the same path with Linux if you don't want something that is completely locked down, but this is why they did it. What ended up happening is, they built something that actually allows us, or QEMU for example, to go and reuse the exact same framework to do virtualization ourselves. So now we, as an open source community, can actually use this framework that they built to solve a purely commercial problem for them as a lever to use virtualization. And if you look at this picture, this kind of looks almost like KVM, doesn't it? If you would draw a picture of how KVM communicates with things, it's like: yeah, you have a layer in between your hardware and your user space that happens to live in the kernel. It's the same, right? Well, yes and no. Let's take a quick look at the API. The Hypervisor.framework API for ARM, this is the AArch64 version, is super minimal. I don't know if you've ever looked at the KVM API; it's hundreds of API calls that you can do, plus a plethora of features. Hypervisor.framework is simplistic like nothing else. It basically just says: hey, you can create a VM context, you can create CPUs, you can destroy them, and you can set a couple of registers. That's it. So if you want to put this in pictures, you would basically say: with Hypervisor.framework I can create a VM container, same as KVM. In that VM container, I can map memory between my VMM process and the virtual machine, same as KVM. I can create VCPU containers that then contain VCPU state, same as KVM. But now comes the big difference.
So when you go and start to execute that virtual machine, and you say: hey, I have a virtual CPU, go into this context and run (same as KVM, by the way), the return values are vastly different. What happens is, KVM always interprets everything on your behalf in the kernel and does a lot of work for you. Hypervisor.framework is trying insanely hard to do as little as it can in the kernel. Anything that's policy, any logic that would not break the security model of the macOS kernel, is moved to user space. And that means what you get as the exit return value from this vCPU run execution is literally the same thing that the hardware delivers into the real EL2. So you no longer have two different documents. You don't have the situation we have on KVM, where an architecture definition describes how the architecture does things, and then a completely different ABI describes how KVM actually operates. You just have one: the architecture definition tells you exactly the encoding for every single exit reason there is. The only thing missing now to create a functional machine is interrupts, and that one is trivial as well, because in a super simplistic, deep-down world, what you have on ARM is just two interrupt lines. You have an IRQ line and an FIQ line. That's it. Nothing else. So what you can do in Hypervisor.framework is literally just set each of them as active, which again is very different from KVM, in case you know that ABI. Let's take a look at what happens if you want to actually run Linux. Let's say you want to do qemu -kernel and boot a Linux kernel straight into the system. So you take your Linux image. You load it into the virtual machine. You create this virtual machine context with RAM and with CPU state. You load that image straight in.
And now your register state contains your Linux entry state. Your RAM contains the image. And all you need to do now is basically say: go and run the CPU, please. And then it does that. And the first thing you see is: hey, I got an HVC exit. Why did you get an HVC exit? Very simple. You ask Hypervisor.framework: can you give me the reason why this HVC exit happened? HVC exits typically are SMCCC calls, which is a specific calling convention on ARM, where X0 contains the function you called. So you just ask it: can you give me the value of X0? It returns the value of X0, and you go: whoa, this is a PSCI call. Again, an ARM thing. And so you just return success. Well, ideally you also do some work. But from the Hypervisor.framework point of view, all you then need to do is set registers again and keep going. You just run your VCPU again. You do that a couple of times, also handle a few things like page faults, and suddenly you have Linux running. Easy, isn't it? Anyone could write a hypervisor. As you can see in the statistics down there, to get to that screen, which means a full Linux kernel booted up to the point where it just couldn't find its root device, we had a whopping total of six HVC calls, which are PSCI calls just to probe the system and check if there's another CPU. We had two system registers to handle; I don't remember which ones. And all the rest, the other almost 30K exits, are MMIO accesses: accesses to the interrupt controller, accesses to the serial port, accesses to enumerate your PCI Express and so on. And almost all of these data aborts would also have to be handled in KVM. Okay, so if you look at what Hypervisor.framework does, it really just handles the bare minimum that is security relevant for handling virtual machines.
It handles the MMU, because you don't want to leave your memory management to user space, right? So it handles the actual stage-2 page table generation, so that your virtual machine only has access to its own memory. It handles the world switch, where you switch register state between the two worlds. It handles the few system registers that are security relevant. Not every system register actually contains sensitive information. Things like ID registers: which CPU number am I? What CPU features do I have available? All of these are important to create a virtual machine that is consistent and works, but they're not important from a security point of view. So they moved all of that to user space; they don't handle it in the kernel. And vtimer support. The vtimer is important because it is an integral part of the CPU core, so that one needs to be configured by the kernel. And the whole rest of everything that virtualization entails is in user space. So you have your typical device emulation and VM layout in user space, same as KVM, but you also implement your interrupt controller in user space. You implement the PMU in user space. You implement most of the remaining system register traps in user space, like feature probing. You implement PSCI in user space. And you basically implement anything the kernel doesn't know in user space. Is there an access to some system register it didn't know? Is there a hypercall it didn't know? Is there anything else it didn't know? It just immediately goes to user space. Now let's compare that to KVM. If you look at KVM, KVM in kernel space handles basically everything. KVM in kernel space handles the world switch, the MMU, the system registers, and the vtimer, like Hypervisor.framework does, but on top of that it also handles almost all the other pieces that user space does in Hypervisor.framework. It handles the PMU. It handles PSCI. It handles device assignment, if available.
It handles nested virtualization, which I don't even have an answer to how Hypervisor.framework would do. And most importantly, it does the catch-all. So if there's any functionality, anything that the guest executes that the hypervisor doesn't know, then in KVM it almost always gets terminated inside of KVM, which means you need to patch the kernel to get enlightened about anything at all, which can be frustrating at times. User space is pretty much only responsible for anything device-related, MMIO exits, and a bit of device layout and memory layout customization. So let's compare the two. Hypervisor.framework is very little code. It does as little as it can in the kernel, only what it has to from a security point of view, not from a functionality point of view. It moves all of that flexibility into user space. If you wanted to, you could basically fake a CPU that doesn't exist. You could create something that reports itself as an A57 from user space. It wouldn't actually execute instructions like an A57, but you could expose it as that, because user space gets all the flexibility the hardware capabilities allow. But it doesn't handle a lot of the more sophisticated virtualization features, such as vPMUs; you cannot use real hardware counters using that framework. You cannot use the hardware-accelerated interrupt controller, which means your interrupt injection path is always going to be slower, since it goes through user space. There's no nested virtualization right now. I don't know if they have plans on implementing it, but it's not there. I also haven't seen anything at all related to device assignment. How would you even do that? How would you take an interrupt? How would you share pages there? Maybe they will find a good solution to make device assignment work as well, but today it's just not there.
Whereas KVM has all that support. KVM has support for almost everything you can think of in the virtualization land on these systems, on x86 as well as ARM. It has fast paths for interrupts and for exits. It has, for example, ioeventfds, where you have a fast path in the kernel that realizes: hey, this write to a register really just should be a write into a file descriptor on the host. So you have a lot of very advanced features that allow you to fast-path things. But at the end of the day, because you keep all of that logic in KVM, you remove a lot of flexibility from user space. So if you wanted to abuse virtualization for completely different purposes than what KVM envisions, you just don't get to do it. So KVM is different from Hypervisor.framework in its philosophy, but these things look very similar, don't they? Hypervisor.framework is available on x86 as well as AArch64. It's supported in QEMU, by the way. It has support for hardware-accelerated interrupt virtualization, like the one I mentioned, on x86, but not on ARM yet. So I don't know if they are working on something; I don't work for Apple. But after about eight or so years of Hypervisor.framework's life, they did eventually come around to implement hardware-accelerated interrupt virtualization on x86. Maybe they will do the same thing for ARM. In QEMU, everything is upstream and just works, as of last year already. So if you just take a recent version of QEMU and use -accel hvf and -cpu host (that one's important), it literally just works. Most of your QEMU command lines will just run. So what I'm curious to brainstorm about, and maybe spark some ideas here: could we use that concept somewhere else? The idea is super powerful: have a super minimal interface that doesn't try to boil the ocean. KVM tries to boil the ocean every time, right?
But do you actually need to boil the ocean for every use case? Apple was able to build a fully viable virtualization environment with literally a shared page and a handful of calls. Could we do the exact same thing with nesting, for example? Could we build a nesting interface that looks like Hypervisor.framework? Should we maybe start on macOS, build a genuine Hypervisor.framework nesting interface, and just say: hey, if you want to nest, you don't get super fancy fast interrupt virtualization, but do you actually care? And the second thing is, KVM in its evolution over time has started off from a super big monolithic environment, where we tried to handle everything in the kernel, and has slowly moved to a model where we're moving more and more into user space as well. If you look at the progression between architectures, from x86 via ARM to RISC-V for example, we are already naturally moving in the Hypervisor.framework direction. RISC-V is almost the culmination of that, where all the register set and get interfaces are literally hardware encodings for those registers. We have, I think, about five or so ioctls only there. It's much, much more simplistic than our traditional old-school x86 interfaces, which used different structs that need to move things around, and if the architecture changes, the whole thing just falls apart. Should we move more in that direction? Should we rethink maybe even the KVM interfaces on x86 and ARM as well? On x86, for example, what happened recently is that we added MSR filtering support: MSRs can be deflected into user space. User space can just say: hey, this MSR over here, I want to handle it, please leave it alone. Should we do more of these things? Should we at least give user space the ability to be as flexible as it can be? With that, who wants to see a demo? Oh, awesome.
I have a super tiny QEMU command line over here, the simplest thing you can write basically, which is really just a small VM that runs Windows, because showing Linux is something I usually do. This is more fun. It just works. It's decently performant. I was actually surprised by how well it performs, especially if you think about the fact that the interrupt controller is all in user space. Even all your IPIs go via user space. Everything is in user space, right? And it still is a well-working solution. It's an old image of Windows, so don't worry. It even has outdated signatures now, but it just works. It literally is a VM that boots and is functional. Linux obviously works as well; that's the easy case. macOS works as well, you can virtualize macOS using this. Cool. With that, more questions? Yes. How does Hypervisor.framework handle debuggability for the guest? There's a special set of callbacks. I think I had them somewhere. There, that one. You see these debug functions there? Trapping debug exceptions, debug register accesses and so on. A mixture of these and actual GDB stub settings. In QEMU, that one's not implemented yet. I had a private branch where I tried to at least make single-stepping work, and that kind of worked, but upstream doesn't do debug support for the GDB stub yet. So if you set a breakpoint through the GDB stub on a Hypervisor.framework VM, it won't trap. I'll add that it doesn't have to be the GDB stub that does it. There is a callback into the framework to set a hardware breakpoint, because that is part of the framework; we just don't implement it. But feel free, you probably have a Mac somewhere, to go ahead and change that. Cool. More questions? Volunteers to create a nesting interface? Okay. Wait a second. I think we do need a microphone for this; it's going to get longer than I can remember. You're not audible on the stream?
No, it's just the stream. It's not this room, it's that the stream wouldn't hear you. So just speak into the mic, it's easier. Thank you. For the combination of work with QEMU: if you want to have one Cortex-A core running theoretically at 3.5 gigahertz, another one at 2 gigahertz, another one with SVE, another one without SVE, that should be allowable for simulating an SoC with multiple chips. And KVM should be able to second QEMU in its intent to simulate that platform. So in that case, a framework of "don't do everything, but do what you're told by the user" is much better. I was trying to give you a real context for why KVM may have good reasons to evolve that way. Yeah, I mean, there are also always different ways to solve that problem, right? You could also just tell KVM to expose something very different and cache those values in the kernel. But in general, I feel like we are over-indexing in KVM on removing that tiny last mile between kernel and user space, which is not a super slow switch, instead of removing the actual exit, which really is the thing that kills our performance. Yeah, do you want the mic, or is it short? So first a comment, because Sergio is here: another thing that runs on Hypervisor.framework is krunvm, and for those of us who use Macs and need to do Linux builds or stuff like that, it's a really cool tool, because it boots in 100 milliseconds or something like that and lets you run your Linux commands in a context that persists from one run to the next. That's really, in my opinion, life-changing compared to running a full VM. And then a question: what about advanced devices? I have a VM that runs 3D graphics and audio. But one thing I noticed is that it crashed the host once when I used audio. So are there other glitches like this that you're aware of, anything to do with 3D acceleration, audio, yada yada?
All of this is orthogonal to Hypervisor.framework, because at the end of the day you're just creating a device model that allows you to do some either PV or non-PV acceleration of these operations. I'm not aware of audio crashes. I mean, if there's a bug report, I'd be happy to take a look at it. But in general, QEMU is able to do a lot of these things if you have a Linux guest and a host backend that supports 3D acceleration, for example. But you probably just want to look two seats to your right and ask Garret whether we actually do have support for 3D acceleration in QEMU as the backend, because I don't know. Okay, it does work, cool. So there are some extra patches needed to make 3D work. Okay, cool, yeah, absolutely. And there are also different ways and variants of how this whole 3D acceleration thing works, but with Linux guests we definitely have everything we need to make it happen. macOS guests are slightly more difficult, because they have a proprietary interface. More questions? Yeah. What about overhead? There are tons of additional switches to user space. These chips are so insanely fast at switching from kernel to user space, it just doesn't matter. As I said earlier, we traditionally have indexed quite heavily on optimizing the kernel-to-user-space switch. I have slides from like ten years ago where I was already showing that that is only about 10 to 20 percent of the performance loss you have on a VM exit. So what you want to optimize for is to not exit. You don't want to optimize the super tiny last path of going from the kernel to user space. Negligible. If you have to go to the kernel already, you have basically already lost, if you have some hot path that keeps poking kernel space. But then again, you're not using the hardware-accelerated interrupt controller. Correct, for interrupts that is a slow path. I was surprised by the performance you get out of this already.
Can you do a comparison between the two? Yes, you could. If you want to, we can sit down later, install Asahi, and do an apples-to-apples comparison. There is a native Linux port for these systems now, so we could literally just run a couple of benchmarks and do an apples-to-apples comparison. My guess is you would probably see something like a 30 percent slowdown, at least in something that's interrupt- or IPI-heavy. But for the majority of simplistic use cases, like: hey, I just want a maybe even single-CPU Linux VM to compile a couple of things, that's not your case. There it's probably going to be in the single digits. Yeah? Okay, the question is: is there an equivalent to VFIO on macOS? Yes and no. It's out of scope, but I'm happy to answer it really quick. There's nothing that actually is VFIO-ish. So you cannot just go and tell the kernel: hey, give me access to this device, please. What you can do, since two years ago or so, is write device drivers for PCI, including Thunderbolt devices, in user space, in a super tiny constrained environment. So it's not just a user space application. It's really just a tiny environment that doesn't even have access to normal syscalls, but that can answer Mach IPC calls and share memory with another entity. So you could write a user space application that talks to a tiny driver that then basically shares the BAR regions again, and then you can build your own VFIO driver using that. Do we have the mic still? Thank you. More than five words is not something I can easily repeat. I've taken a bit of a look at the unified memory architecture, and I was curious what your take is. I've heard that it would be better if it was SBBR compliant for virtualization purposes, although I'm not entirely sure that that's important.
And then I guess the only other thing I was curious about was: do you think it would at all be possible to do something like mediated pass-through for a GPU, like the M1 GPU? I'm not sure I'm able to extract the actual question out of the question. Well, I mean, the question is that VFIO pass-through obviously involves, at least in the context of a VFIO mediated device, if I've understood correctly, something to do with SBBR, although I'm not an expert on ARM architectures. SBBR is just a spec that defines a couple of firmware primitives that you need to have, but SBBR is really just about booting. That's orthogonal. So what use case are you trying to fulfill? I was thinking about trying to share the GPU between the host and the guest with an unmodified guest driver. Okay, an unmodified guest driver is out of the question. In general, what you can do is PV, and PV has two different mechanisms basically, two different paths that are easily viable. One is using our own PV mechanism, which is what we mentioned earlier. The other one is the proprietary Apple one, which you could reverse engineer and then use the paravirtualized graphics interface to access a PV Metal interface in the guest. I have some more questions, but maybe we can take it offline afterwards. Yeah, absolutely. Okay, cheers. Thanks a lot, and happy lunch.