Virtual machines. This is a hardware talk and it's in English. Unfortunately, there is no German translation; we just assume that you can speak English well enough. Virtual machines are all the rage for many reasons: for security reasons, for practicality reasons. But the fact is that just because you're running a VM and you're just a provider, your hypervisor can still read and write into it. And that may not be such a good thing. So here are Janos, raise your hand, and Claudio. Claudio is from Italy, but he's been in Germany for, I think, 10 years, something like that. And they come from Chaos West in a wider sense; I think they're based in Stuttgart. And they're going to tell you about the challenges of protected virtualization. Not only a VM, but a protected VM, so that you can, worry-free, just say this is a black box and have nothing to do with it. Let's have a big hand, please, for Claudio and Janos. All right, let's start off with a big greeting from our corporate lawyers. So what are protected VMs? I mean, he already told us that. We'll have a look at the basic definition, and then at what we did: how we implemented protected VMs on our platform, s390, or you may know it better by the name mainframe. Claudio here is a KVM and QEMU developer for s390, and he actually gave another talk about this topic, which you can find on the KVM Forum 2019 YouTube channel. And this is Janos. He is the KVM s390 maintainer and also a QEMU developer. He has also given other talks at the KVM Forum, but not about this topic. OK, let's step in. So protected virtualization, or sometimes also secure virtualization, allows for virtual machines whose state is not observable or alterable by the hypervisor. So the hypervisor can't reach into memory, it can't get the registers, and that brings us a lot of benefits. One is we can protect against malicious operators. 
So if you are hosting a cloud environment, you might fear that one of your operators extracts customer data from VMs and sells it to somebody else. We can protect against malicious hypervisors: maybe the operator himself was not able to get access to the VM, but he was able to infect the hypervisor and then get access to the VM that way. Protected VMs kind of also give some protection against bugs, because we provide protection against memory overwrites into the VM. And then there's the big topic, mostly for banks, which is compliance. When we speak about secure entities, secure VMs, we currently mostly speak about secure enclaves. And enclaves are pretty small. You need to adapt your code for them so they can run. But they provide some of the most important things: memory integrity and confidentiality. They don't provide denial-of-service protection, but basically nobody provides that, because you can just unplug the machine. Protected VMs, on the other hand, can take basically any workload. It's just a VM. It requires moderate kernel support, but user space can basically stay the same. We currently do not have any changes to our user space. And they also provide integrity and confidentiality for the memory. As with everything, they are still vulnerable to DoS attacks. As an example, Intel and ARM have SGX and TrustZone as secure enclaves. And the hot new thing, protected VMs, is provided by AMD, who stepped forward first with SEV, and they're also currently working on add-ons for that. IBM Z, the mainframe: we currently work on Protected Virtualization, which is a working title, so don't pin us down on that. And IBM Power has Protected Execution. So the typical parts of a server are a CPU, memory, some form of boot image, persistent storage, and other devices like network I/O. And for VMs, a big part is also host memory management: swapping, migration, and all of that. How do we protect it? 
Well, first of all, for the state, for the memory, and for the CPU, we can give encryption protection. That's actually what AMD SEV does. They have different memory encryption keys for the hypervisor and the VM. So if the hypervisor reads from VM memory, it will read data that was encrypted with the VM key, but decrypted with the hypervisor key, so it will basically read garbage. The same thing applies if it writes to the VM. Then we have hardware-assisted access protection. That's basically one of the key things to have, in our opinion, because it allows us to also protect against write accesses. Then the hypervisor not only cannot read from the VM, or reads just garbage, but it also cannot corrupt the VM. We also need mapping protection, so that if we have a page, the hypervisor cannot just remap it to a different address, if the hypervisor knows the contents of that page, and maybe alter execution within the VM. And then we can also give integrity protection of the data, which is not completely the same as access protection, because if we think about swapping, then you also need to make the data readable. But if you give it back to the VM and swap it in again, it should be integrity protected, so that what the VM gets back at the end is still the same as what was swapped out beforehand. All of these can and basically should be combined to give maximum protection. Let's go ahead to the boot image. Well, for the boot image and for booting in general, we used to have the TPM, and the current implementations of boot image protection are basically a successor of that. We can have encryption, which allows us to pre-seed a VM with secrets, SSH keys, LUKS passwords, and so on. And we can have remote attestation, where a VM comes up and boots, and at some point in time it will be measured, so a hash will be generated. That hash will be sent to the customer's server, where it is validated whether it is a trusted boot image. And then the VM continues running. 
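The combination of per-owner encryption keys and mapping protection can be sketched in a few lines. This is a toy model only: a hash-derived XOR keystream stands in for real hardware memory encryption (e.g. SEV's AES with an address tweak), and none of the names correspond to a real API.

```python
import hashlib

def keystream(key, tweak, length):
    # Pseudo-random keystream from key + tweak (toy stand-in for
    # tweaked hardware memory encryption).
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + tweak + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xcrypt(key, page_addr, data):
    # XOR with a keystream tweaked by the page address, so remapping a
    # known page to a different guest address also yields garbage.
    ks = keystream(key, page_addr.to_bytes(8, "big"), len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

vm_key = b"vm-owned-key"
hv_key = b"hypervisor-key"
plain = b"guest page contents"
stored = xcrypt(vm_key, 0x1000, plain)           # what actually sits in RAM

assert xcrypt(vm_key, 0x1000, stored) == plain   # the guest sees its data
assert xcrypt(hv_key, 0x1000, stored) != plain   # the hypervisor reads garbage
assert xcrypt(vm_key, 0x2000, stored) != plain   # remapped page: also garbage
```

The address tweak is what gives the mapping protection described above: even with the right key, the same ciphertext at a different address decrypts to garbage.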
Attestation is really nice to have, but having the ability to just encrypt the boot image, and also have it integrity protected just by a hash, is great, because of the ability to pre-seed it, and because it reduces the complexity of the whole attestation stack. I/O is a bit of a problem, because for I/O you need to make VM memory visible to the hypervisor. If you have a look at virtio, that's all done through VM memory. So everything that the VM sends and receives needs to be encrypted. You need to have full disk encryption. You need to use HTTPS, SSH, encrypted protocols, except if you want to make something really readable, which is maybe boot messages or just error output via the standard console. The VM cannot and should not make assumptions about the behavior of a device. But, well, in today's times, how often is that the case anyway? So that's just normal behavior currently. In the future, we might have dedicated hardware which actually provides a protected channel between a VM and the hardware, and then you could communicate with that device without encryption. But I'm currently not aware of any hardware devices that support such things. Swapping and live migration are a completely different beast for protected VMs, because you normally can't read the memory and you can't write the memory. So normally, you get it in an encrypted form, transfer it to the other host for migration, and then you need to bring it back. But when you bring it back, it needs to be integrity protected, the integrity needs to be checked, and you need to check that everything has been transferred to the new host. So that's a whole other level of complexity, which a lot of people are currently working on. So Claudio here will give a brief overview of the implementation on IBM Z. And as we can't trust the hypervisor anymore, the question is: which entity can we trust now? Well, we did a poll, and the name of that new entity is Ultravisor. The problem will be: what's the next entity after that? 
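The measurement step of the remote attestation flow described above boils down to hashing the boot image and comparing the result against a known-good value on the customer's side. A minimal sketch with made-up function names, not the real protocol:

```python
import hashlib
import hmac

def measure(boot_image):
    # The "measurement" is simply a cryptographic hash of the image.
    return hashlib.sha256(boot_image).hexdigest()

def attest(measurement, trusted_measurements):
    # Customer-side check: only let the VM continue running if the hash
    # matches a boot image the customer actually trusts.
    return any(hmac.compare_digest(measurement, t)
               for t in trusted_measurements)

image = b"\x7fELF...kernel+initrd..."   # placeholder image bytes
trusted = {measure(image)}
assert attest(measure(image), trusted)
assert not attest(measure(image + b"tampered"), trusted)
```

In the real flow the measurement is produced by the trusted entity, not by the hypervisor, and is sent over an authenticated channel; the sketch only shows the comparison logic.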
Let's see. And of course, on the mainframe, we have an instruction for that. So let's have a look at what this ultravisor is and what it does. The ultravisor is a trusted entity, and it's implemented in hardware and firmware. It basically takes over some of the tasks traditionally performed by the hypervisor, like the boot process. So it decrypts and verifies the boot image. It protects the guests from the hypervisor and from the other guests. And it proxies all the interactions between the guests and the hypervisor. So it does a lot of work. This is a little table that shows what we trust and what we don't trust. So we don't trust the hypervisor, as we said. We don't trust the normal guests, because they're basically the same as the hypervisor from our point of view. We trust the protected guests for themselves, as in: we don't care if they shoot themselves in the foot. That's their problem. But we don't assume that they will not try to do fishy things. So we don't trust them not to try to access each other's memory, for example. We will prevent that. The only thing we really trust is the ultravisor, which is trusted system-wide. It's the only entity that is really trusted. And here is a scheme that shows what the interactions between all the moving parts look like. So normal guests are not trusted. The hypervisor can read into normal guests. The normal guests, in theory, should not be able to read into the hypervisor, but they could if there is, for example, a bug in the hypervisor that a malicious guest exploits. Nothing stops that. On the other hand, you can see that protected guests can only access their own memory. They cannot access anything else. The ultravisor can access everything. As we said, it's the one trusted entity that can do everything. And nobody can access the memory of the ultravisor. The hypervisor, in general, cannot access the memory of the protected guests. 
But you see there is an asterisk, because, of course, you need to share some memory with the hypervisor in order to perform any kind of I/O. Otherwise, yeah, you have a very secure guest, but not a very useful one. And of course, there's no interaction allowed between the guests themselves. This is the grand scheme of how everything works; basically, the rest of the presentation explains how we make this work. So, protection: as we said, the guest memory is protected by hardware unless it's shared. It's only accessible by the owner, so the VM itself, and by the ultravisor. When it's shared, also by the hypervisor, but never by other guests, never by other secure guests. The guest and CPU state, so some blocks of memory that contain the state of the CPU or the state of the guest in general, are only accessible by the ultravisor. So nobody else can access them, read them, or write them. The host memory is never accessible by secure guests. This means that, for example, a secure guest will have a very hard time doing any kind of break-out-of-the-box exploit, because a secure guest cannot in any way access the memory of the host. The guest-to-host mappings are protected by a reverse mapping table. So in case the hypervisor, for any reason, maliciousness or a bug, tries to map a guest page to a different guest address, that will not work. And I/O: as I said, the guest shares the buffers needed for performing I/O, and it's the guest that decides which pages to share, of course. I put bounce buffers in parentheses because it's not necessary to have bounce buffers in theory, but in practice, this is done with bounce buffers. Finally, there is some emulation data, which is bounced and checked by the ultravisor. There are some bits, some pieces of information, that are sometimes needed to perform emulation of some instructions. 
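The sharing rule for I/O can be modeled as a tiny access-control check: the hypervisor may touch only the pages the guest has explicitly shared as bounce buffers. A toy sketch under that assumption; all names are hypothetical:

```python
# Toy model: guest memory is protected by hardware except for pages the
# guest has explicitly shared.
PAGE_SIZE = 4096

class GuestMemory:
    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(num_pages)]
        self.shared = set()

    def share(self, page):
        # Guest-initiated: only the guest decides which pages become
        # bounce buffers for I/O.
        self.shared.add(page)

    def hv_read(self, page):
        # Hypervisor-side access; hardware blocks anything not shared.
        if page not in self.shared:
            raise PermissionError("hardware blocks hypervisor access")
        return bytes(self.pages[page])

mem = GuestMemory(4)
mem.pages[0][:7] = b"secret!"      # private guest data stays private
mem.share(1)
mem.pages[1][:7] = b"virtio "      # guest copies the I/O payload here first
assert mem.hv_read(1)[:7] == b"virtio "
try:
    mem.hv_read(0)
except PermissionError:
    pass  # unshared page: access denied
```

The "copy the payload into the shared page first" step is exactly the bounce-buffer pattern mentioned above.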
Those are copied back and forth by the ultravisor, and they're checked for correctness, so that the hypervisor, for example, cannot provide values or bits in registers that are not allowed by the architecture. Of course, not everything is checked perfectly, but this will at least prevent the worst attacks. So what's left in the hypervisor then? The hypervisor still has to do any kind of I/O. So if there is a disk access, the hypervisor will need to read the data from the disk and provide the device model. Probably it's virtio, but it doesn't have to be. Of course, scheduling: from that point of view, it's just a VM, and in KVM in particular, it's just a process that needs to be scheduled. This, of course, means that the host could decide to never schedule a specific secure VM. That's a denial of service, but it's unavoidable, because, as Janos said before, you can always pull the plug anyway, so there's nothing we can do about that. Denial of service is never addressed, because it's always possible and there's nothing we can do about it. We don't care. Housekeeping: some instructions are executed by the firmware or hardware, but the hypervisor still needs to know about them. And finally, some instructions actually need to be executed by the hypervisor. For example, I/O instructions, not just those, but in particular I/O instructions, are handled by the hypervisor. So, let's have a look at the life cycle of a secure guest and a secure host, and see what this looks like now in this complex scenario. First, the guest boots, and it boots in normal, standard, non-protected mode. Then it loads an encrypted blob into memory, which is the actual protected boot image. This second step can be skipped in some cases: if you load the kernel image with a QEMU command line parameter, then it's already in memory, so you don't need to load it. But those are details. And then the guest performs a reboot into secure mode. 
Basically, the guest asks the hypervisor to reboot, and as the boot device for the reboot, it specifies this blob. At this point, the hypervisor will call the ultravisor and say: hey, I need to create a secure VM, and I need to create this many secure CPUs. Then it will basically pass this blob from the guest directly to the ultravisor. The blob potentially contains all the secrets of the protected guest, so of course the hypervisor will not be able to do anything with it. It is just passed directly to the ultravisor, which is the only entity that is able to actually make sense of the blob. The blob also contains some other configuration parameters and some other keys that are used for other purposes. At this point, each page of the image is then, as we say, unpacked: the hypervisor asks the ultravisor to decrypt the page. The page is first made secure, so it's made inaccessible, and then it's decrypted. So the hypervisor will never be able to see anything inside. Once that's done, we have a boot image in memory that is ready, it's been decrypted, and the hypervisor just continues execution of the guest. It simply needs to use a different format for the CPU block, because now it's not a normal CPU, now it's a secure CPU, so some things are necessarily different. So what do we normally have in the CPU block? CPU flags, a program counter, some registers, including all the special registers, some timer information, all the interception data, basically the reason why there was a VM exit and the instruction that caused it, and some hypervisor control flags that alter the behavior of the virtual machine. This, of course, is not something that you want to have readable in a protected virtual machine, so we do it differently when the vCPU block is for a protected CPU. 
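The create-and-unpack flow described above can be sketched as follows. This is a toy model: a symmetric XOR "cipher" stands in for the real cryptography (the actual scheme wraps the image key with a machine key, as discussed later in the Q&A), and all names are made up. The point is the ordering: only the ultravisor can unwrap the image key, and each page is made secure before it is decrypted.

```python
import hashlib

def toy_crypt(key, data):
    # Symmetric XOR "cipher" derived from a hash; a stand-in only.
    ks = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, ks))

class ToyUltravisor:
    """Toy model of the create/unpack flow; all names are made up."""

    def __init__(self, machine_key):
        self.machine_key = machine_key   # never leaves the hardware
        self.image_key = None
        self.secure_pages = set()

    def create_secure_vm(self, wrapped_image_key):
        # The hypervisor passes the blob through untouched; only the
        # ultravisor can unwrap the image key from it.
        self.image_key = toy_crypt(self.machine_key, wrapped_image_key)

    def unpack_page(self, page_id, encrypted_page):
        # Order matters: the page is first made secure (inaccessible to
        # the hypervisor), and only then decrypted in place.
        self.secure_pages.add(page_id)
        return toy_crypt(self.image_key, encrypted_page)

# Image owner's side: wrap the image key for one specific machine.
machine_key = b"machine-private-key"
image_key = b"image-key"
wrapped = toy_crypt(machine_key, image_key)

uv = ToyUltravisor(machine_key)
uv.create_secure_vm(wrapped)
page = toy_crypt(image_key, b"kernel code page")  # one encrypted page of the blob
assert uv.unpack_page(0, page) == b"kernel code page"
```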
So, first of all, when a protected CPU starts, the state is not taken from the block of memory that is in hypervisor storage, but from one of these ultravisor-reserved areas, because it needs to be protected. Except for a couple of bits, which are always copied back to the hypervisor, because, for example, for interrupt management, the hypervisor needs to know when specific CPUs are enabled for interrupts, because otherwise it cannot inject the interrupts. Of course, in some cases some more information is needed. Let's say for an I/O instruction, there are some small memory blocks and some registers that contain important information that needs to be shared with the hypervisor. So the ultravisor will actually, on a case-by-case basis, copy this information back to the hypervisor, but only exactly those bits that are strictly needed by the hypervisor. And if needed, some information is copied back from the hypervisor into the guest, and these bits are also checked. Again, this happens only for those instructions where the hypervisor is expected to provide information to the guest, and those bits are checked for architectural compliance, to prevent the hypervisor from playing any tricks on the guest. That's not all. Instruction interception is normalized, so the instruction text is not the actual text of the instruction, but a normalized version. So the hypervisor will always see the instruction with the same registers; of course, the values in the registers will be in the right place, and the ultravisor will take care of putting things back afterwards. 
There's a new interrupt injection API, because normally we just read from memory and write back into memory to perform the interrupt injection, but that's not possible anymore, so this has to be done through the ultravisor. And of course, there are new interception codes, new reasons to get out of the VM: for example, if some instructions need to be interpreted in a secure way, or if, in some cases, an instruction has already been executed, but the hypervisor still needs to be notified about it, because it needs to take care of some housekeeping. Apart from these new interception codes, there is, interestingly, less to check, because the ultravisor will take care of most checks. And there's a secure instruction data area, which is also something that normally is not there. Some instructions have some small buffers that are usually just accessed by the hypervisor. In this case, they cannot be accessed, but since they are small, it's also hard to use bounce buffers for them. So what happens is that the ultravisor will just copy those into a specific page, this secure instruction data area, and if needed, it is copied back into the guest. It's used, for example, for console data, serial console and stuff like that, or boot messages. So, as I said, interrupt injection needs to be done differently. This is done through the state description: before running the VM, the hypervisor sets some bits to tell the ultravisor, please, when you start, inject these interrupts with these parameters. 
Of course, the interrupts are not always allowed. Only a few program interrupts are allowed to be injected, these are like exceptions, and they are only allowed when they are expected. So you cannot just randomly inject an invalid-instruction exception or a page fault; that's only allowed when the instruction allows for that to be injected. And of course you can never inject interrupts into the VM if the VM is not enabled for those classes of interrupts, and this is why the ultravisor needs to give those few bits to the hypervisor all the time, about which interrupts are enabled. Swapping is interesting, because this breaks everything: now the hypervisor needs to read the memory and save it to disk, and then put it back when needed. So how do we do this? We export the page, which means the page gets encrypted by the ultravisor and then unprotected, so made available. This is initiated by the hypervisor. The hypervisor asks the ultravisor to export the page; at this point it's encrypted and readable, and it can be written to disk. But also a hash is saved somewhere in a protected memory area, so that once we swap the page back in, we can check if the content has changed, because we have to guarantee integrity. At this point the page can be swapped, okay, so the hypervisor can swap the page to disk and use that memory for something else. Once the page is needed again, the hypervisor needs to swap it back in. So, okay, after reading from disk, you have the encrypted page in memory, and at this point the hypervisor asks the ultravisor to import the page. So the page is made secure, it's decrypted, and the integrity of the page is checked. If the check wasn't successful, the page is not imported; otherwise it is, and the guest can continue. 
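The export/import dance for swapping can be sketched like this. A toy model, with a hash-derived keystream standing in for the ultravisor's real encryption; the key point is that the integrity hash lives in protected memory and is checked on import.

```python
import hashlib

def toy_crypt(key, data):
    # Hash-derived XOR keystream; a stand-in for the real cipher.
    ks = hashlib.sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, ks))

class SwapUltravisor:
    """Toy export/import for swapping; all names are made up."""

    def __init__(self):
        self.key = b"uv-internal-key"
        self.hashes = {}  # protected area: page id -> integrity hash

    def export_page(self, page_id, plaintext):
        # Save the integrity hash, then hand out an encrypted, readable copy.
        self.hashes[page_id] = hashlib.sha256(plaintext).digest()
        return toy_crypt(self.key, plaintext)

    def import_page(self, page_id, blob):
        # Decrypt, then refuse the import if the content changed on disk.
        plaintext = toy_crypt(self.key, blob)
        if hashlib.sha256(plaintext).digest() != self.hashes[page_id]:
            raise ValueError("integrity check failed, page not imported")
        return plaintext

uv = SwapUltravisor()
blob = uv.export_page(7, b"guest page")        # hypervisor may now write this to disk
assert uv.import_page(7, blob) == b"guest page"
tampered = bytes([blob[0] ^ 1]) + blob[1:]     # a bit flipped while swapped out
try:
    uv.import_page(7, tampered)
except ValueError:
    pass  # denial of service at worst, never silent corruption
```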
If the check was not successful and the page is not imported, that basically means that the guest cannot continue execution, so that's it for the guest. But that counts as a denial-of-service attack, and we said before there's nothing we can do about that. Anyway, now let's see what the guest has to do. Basically, it needs to start booting, check if protected virtualization is available, and if yes, load the blob and do this reboot. This is done basically at the boot loader stage. What the kernel has to do is actually quite simple: check if it's running inside a protected guest, and if so, set up the bounce buffers to be able to perform I/O, and then use the bounce buffers. So the changes in the guest are really, really minimal. By the way, the changes to the guest are upstream already; the rest is still a work in progress. So let's have a comparison of how this behaves and what the characteristics are in comparison with, for example, SEV from AMD. SEV is already there, so you can already use it, while Protected Virtualization is not there yet. But apart from that, with SEV you can read the state of the CPU, so you can read the registers, unless you have the SEV-ES extension, which I think the newest CPUs have, whereas on Protected Virtualization, it's never possible. You can always read the memory with SEV, but it's encrypted, so it's not an issue. Whereas on Protected Virtualization, it's only readable when it's shared. And the same for writing: it's only writable when it's shared on Protected Virtualization, whereas on SEV, it's always writable, unless the SEV-SNP extension is enabled, which I think was presented last month for the first time. I don't think there are any CPUs with that available yet. On the other hand, swapping is not supported yet with SEV; it is supported with Protected Virtualization. Migration is supported with SEV, and it's not supported yet with Protected Virtualization. So, what can we get out of this? 
Protected virtual machines mean the hypervisor cannot access the state, cannot read or write the state. The state is technically protected, and the boot image is protected as well. This is the basic idea behind this. What still needs to be done, and what the challenges are in general: yes, making memory accessible requires bounce buffers. If you don't use bounce buffers, then you need to make memory accessible and inaccessible on demand, which is terrible for performance. Swapping pages can be hard; for example, AMD doesn't even bother to do any swapping, because it's hard. Migration also can be very hard. If you have any questions, you can find us around. Don't hesitate. These are our DECT numbers. Yeah, the current hypervisor patches are on the mailing lists, on the KVM list and the linux-s390 list. They are under active discussion. And this year at the KVM Forum, we've seen that there has been a lot of discussion between the architectures. AMD came out with SEV as the first ones. Some others will follow, I guess. ARM came out... no, Power came out, but they actually implemented their solution independently from the s390 version. So, get in touch with us. We always want to know your thoughts and the newest ideas. We want to create a kind of community around this, because most platforms are working on protected VMs, and they most often have the same problems. So, thank you. Okay, now, as usual, we can field questions. There are two lit mics, please. Okay, you have a question? Go ahead, one sentence, just a question, please. So, on slide 31, discussing the provisioning with the protected blob: what is the encryption key used for that blob, and how does it prove to the customer who is launching this guest that it is actually being launched in secure mode? Good question. So, the blob is encrypted with a public key? No, I mean, the blob is encrypted with an image key. But that image key is then encrypted with the public key of the machine. 
And for that you get the public key when you ask your cloud provider that you want to deploy on that platform, so on a specific machine. So, basically, it's a classic public-key system. The private key is somewhere in the hardware, and the public key is public. You just encrypt your image, and if your VM is running, you know that it has been decrypted by a system that has the private key. So, one of those machines, yeah. Okay, next question: is there a question from the internet? Give us a sign. No. Okay, go ahead. Hi, you make the distinction between migration and swapping. From a technical point of view, where's the difference? Well, for swapping, you only page out and page in pages, from disk or network. But for migration, you need to transfer the memory and then also the CPU state. And the CPU state is actually the bigger problem, because you need to have some kind of protocol that transfers it in a specific way, with specific structures and all that. It needs to be integrity protected, and in the end, swapping only does integrity protection on one page, but if you transfer the whole state, you need to have integrity of the whole state, of all of the CPU states and of the whole memory. Also, you need to guarantee that the destination is a proper system, yeah, an authorized host that is able to run the VM. Okay, I see you. Ask your question, please. Are there any plans for nested virtualization? No. None whatsoever. It's just not possible. I mean, we already have two levels of virtualization on the mainframe. Going deeper, well, we implemented it, but we would need hardware assistance for that, and more than two levels is not in our plans, as far as I know. Are any other architectures considering it? I don't think so. I haven't heard of any. Thank you. I mean, of course you can have unlimited levels of virtualization, but without protection. Which we don't want. Okay, any further questions? Okay, well, let's call it a day. 
Let's have a final big hand for Janos and Claudio.