Hello, everyone. My name is Vitaly. I am from the virtualization engineering team at Red Hat. Normally, I work on various public clouds and third-party hypervisors, making sure Linux is a first-class citizen there. I'm also a sub-maintainer in KVM, working on things like Hyper-V enlightenments and various x86 things. Today, I'm going to talk about supporting the new thing in the cloud, confidential VMs, and what we can, should, or must do in Linux to make this all work. If you've been following the cloud landscape over the last couple of years while sitting at home, you could have noticed that all major hyperscalers released something called a confidential VM. I think Google was the first one with an AMD SEV option in 2020. AMD SEV gives you memory encryption, and basically that's it. Then we have an offering from Microsoft Azure; since last June, it's even in GA. They do AMD SEV-SNP, which is an advanced version of AMD SEV: it not only gives you memory encryption, but also register encryption, or register protection, and integrity protection. We now have Intel TDX, which is a technology very similar to AMD SEV-SNP from a customer's perspective; the implementation is very different, but the customer-visible characteristics are similar. And Amazon has just released an SEV-SNP option for their existing C6a, M6a, and R6a instance types. So you must have been wondering what these confidential VMs are about and what they give you as a user. Why are they better? Why would you want to use them? Basically, what they promise is that these VMs will keep your data confidential. It's a bold claim, but what does it really mean? Data confidentiality means that nobody but the owner of the data can get access to it. In the case of VMs, there is always the host which runs your VM. So confidential VM technology is something which allows you to remove the host from the trusted computing base.
You don't need to trust your host anymore, so even a malicious or compromised host cannot get access to your data. However, all these confidential VM technologies give you no additional protection from within the VM. If you have some application running in your VM and this application was compromised or hacked into, that will still allow for data leakage. Confidential VM technology protects your VM from the outside, from the host it runs on, but it doesn't protect processes inside the VM. Another thing you must remember is that none of the confidential VM technologies give you any guarantee that your VM will actually run. If the host which runs your VM is compromised, somebody can easily disrupt the execution of your VM: terminate it, or simply not allocate any time slots for it to run. This is always possible. So it only protects your data; it doesn't allow data to leak to the host, but nothing else. When we're talking about data protection, we normally talk about protecting data at runtime, at rest, and in transit. I'll start with transit because it's not really confidential-VM specific. We've been protecting data in transit for years because we never trusted our networks in the first place: we are working over public Wi-Fi, and who knows who administers all the routers on the way of your data. So there are cryptography-based solutions for protecting your data while it is in transit. What confidential compute technologies give you on a CPU level is something which helps to protect the data at runtime. When you have your app with your confidential data being executed somewhere in the cloud, where is your data? Your data is likely in memory, because you need to process it. Some of this data is in CPU registers, because you cannot really operate on memory all the time: to perform an operation on your data, you need to put it in the CPU.
These confidential compute technologies, which are basically newer CPU features, allow hiding all of this from the hypervisor. As I've already said, AMD was pioneering this with their SEV. And I have to make a caveat: I'm only talking about technologies which protect the whole VM. There were technologies in the past which were designed to create confidential enclaves, such as Intel SGX, but they don't protect the full VM; they protect specific applications which are built to work with these enclaves. So there was AMD SEV. AMD SEV protects your memory: your memory is always encrypted, and only the guest can decrypt it. However, it gave you no protection for CPU registers. Whenever your data is in CPU registers, it can be seen by the hypervisor, which significantly limits the protection, because the hypervisor can stop your VM at any cycle and take a look at what is in the registers. So it's really not hard to steal your data even when the whole memory is protected, just by observing CPU registers, because once in a while you will have your data in CPU registers. Then AMD came with an upgrade called AMD SEV-ES, for encrypted state, where they also hid the registers; now the hypervisor cannot observe them. What was still missing there is integrity protection. The hypervisor can do things like this: imagine you have two encrypted pages, and the hypervisor just swaps them for your VM. It cannot see the data, but the behavior of your VM is likely going to change, and you may, for example, do I/O from memory you didn't want to do I/O from. The latest iteration, SEV-SNP, gives you this integrity protection, so the hypervisor will not be able to do any of this. Last but not least is protecting data at rest, because once you've done processing some piece of data, you will likely be putting it into storage.
And the storage comes from the host, so you must be sure that it's also protected. How are you going to do that? I've already mentioned these technologies. There is guest-side support for almost all of them in the Linux kernel nowadays. There is still something missing for supporting these technologies on the Microsoft Hyper-V hypervisor, but it's coming. The thing is that Microsoft always does things differently from the rest of the world, but yes, we are getting there. Now I'd like to talk a little bit about protecting data at rest. First, let's think about what we really want to protect. It's kind of obvious that you want to protect your sensitive data: the data your application is processing must be confidential. This is great, but then you have your operating system, and if we are talking about a generic Linux operating system, it has a number of things which you would like to protect from the host. Some of this data, think about binaries: they are built from open source, they are not any secret, everybody can get the same Bash binary as you have. You don't want to hide it from the host, but you want to at least write-protect this data. But some of the data, even operating system data, must actually be read-protected from the host. Think about your SSH host key: if the host is able to steal it, it can try to impersonate you and present some other VM which will look exactly like your VM. So for general-purpose operating systems, to solve this problem, we normally want to do full disk encryption, because encryption also gives us integrity protection. Even though we have plenty of stuff which we don't want to hide, it's easier to reason about it this way: let's just encrypt everything and be done with it. We've done this with Linux for years. We have things like LUKS, which is great, which works, and which is able to create an encrypted volume for you. You can use this on your laptop: in the installer, you provide a password.
And then everything is encrypted. So what needs to be done here? The problem with confidential VMs in the cloud is that you need to think about how the guest is going to get the password, or the key, or anything, because this must also be protected from the host. You may want to do something like: OK, when my VM boots, I'm going to go to the console and enter some password and decrypt my root volume. But first, it's really inconvenient: you don't want to have to go to the console every time you are starting a VM in the cloud. But that's not the main problem. The main problem is that this is inherently insecure, because this virtual console is emulated by the host, so the host can easily see what's going on there. Once you've entered this password on the console, you're done: the host can easily get to your volume after that. So we must come up with a way for the guest to receive the key in an automated fashion. But we need to make sure that we are giving the key to a true confidential VM, for example that its memory is fully protected from the host and the host won't be able to steal the key. And we must be sure that everything which was executed in this virtual machine before that point is known to be good, because if the host managed to inject something untrusted into your boot chain, then you cannot give any sensitive information to such a VM, because it will be stolen from you. So how are we going to do this, namely, in Linux? Let's take a look at how Linux normally boots. That's a very high-level, generic picture of what's going on. You have some platform firmware, which in most cases is UEFI firmware for x86. Then you have a bootloader, or actually a chain of bootloaders. After that, you boot your Linux kernel. The Linux kernel usually takes an initramfs, which has all the drivers to mount your root volume, switch there, and start executing from there. So we want to do full volume encryption.
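As a reminder of what LUKS full-volume encryption looks like on a regular system, the flow is roughly the following. This is a sketch: the device path is illustrative, and the commands need root.

```shell
# Create a LUKS2-encrypted volume on a block device (path illustrative),
# open it, and put a filesystem on the resulting mapped device.
cryptsetup luksFormat --type luks2 /dev/vda3
cryptsetup open /dev/vda3 root        # prompts for the passphrase
mkfs.ext4 /dev/mapper/root
```

On a laptop, the passphrase prompt at boot is fine; the whole problem discussed next is that in a cloud confidential VM nobody can safely type that passphrase.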
Obviously, we cannot encrypt everything, because then we get into a chicken-and-egg problem: if the whole disk is encrypted, who's going to decrypt it? You still need to have some code which is going to decrypt it, and you cannot delegate this to the host; you don't trust the host anymore. Of course, there are solutions like host-based volume encryption, but you cannot use them in confidential VMs. So which options are on the table? The first obvious option is to keep our bootloader unencrypted and encrypt the rest of the system. This may be good from some perspective, because we are reducing the unencrypted surface, the code which we must verify in some different fashion, since we cannot rely on full disk protection for the parts which are not encrypted. But then we need to do quite complex tasks in the bootloader, and normally we want the bootloader to do as little as possible. Another option would be to make Linux do the decryption: let's keep Linux and its initramfs unencrypted. The main advantage is that we can use standard Linux tools; Linux already knows how to work with encrypted volumes, so we don't need to do much. But then we get into a situation where all three artifacts need to be somehow verified, and this includes the Linux initramfs. So, summarizing these two options: we can do decryption in the bootloader, but then we need to have a bootloader we trust, and in the open source world we unfortunately don't have many options for that. We would have to teach the bootloader new tricks, like working with complex devices such as TPMs in the way we want. And the problem is that when we are writing a bootloader, we cannot use any existing library from Linux, because it's a very different environment. We would have to either borrow the code into the bootloader and support it ourselves, or write it from scratch.
Usually, we end up writing everything from scratch in the bootloader, saying: oh, we are going to get away with something very simplistic. Then it grows, and then it has its problems. And remember that this part needs to be completely trusted by you, because this is what's going to protect your data. Unlocking from Linux using standard Linux tools seems to be a bit better, and at least some major Linux vendors seem to be converging on this, because we don't need to implement anything; we already have the tools. But then, as I showed you, we have the bootloader, the Linux kernel, and the initramfs, which all remain unencrypted, and we need to make sure that they are trusted somehow. So how are we going to do that? Well, how do we usually check the integrity of the boot chain in Linux? We have two main technologies: one is called SecureBoot, without a space; the other is called Measured Boot, with a space. Don't ask me why. A crash course for those who don't know what these things are. Secure Boot is basically establishing a chain of trust from the hardware, where we check the signature of every binary we load before executing it. We start with the Microsoft certificate, which is embedded in hardware. Then we load some bootloader, then the bootloader loads the kernel, and so on, and every artifact which loads the next one is supposed to check the signature on it, to verify that it's a good thing. Then we have Measured Boot, which means that everything we load, and every significant fact about system boot, gets measured into the TPM, which is a chip in your system, either physical or virtual, that can record a sequence of events. This is all great, but none of these currently cover the initramfs, or the kernel command line for that matter. So to make things better, people in the systemd community came up with a concept called the Unified Kernel Image, and a Unified Kernel Image is a very simple thing.
It basically says: let's take our kernel, initramfs, and kernel command line, bundle them all together into one UEFI binary, and sign it. That's the most important part. Once we do that, if we load this and the signature is correct, we can trust the whole thing. Sounds great. So how do we build it? We take something called systemd-stub, which is a really, really simplistic loader: it takes the kernel, the initramfs, and the command line out of itself, puts them all in memory, and launches the kernel, nothing else. The signature is checked on load, and if it all matches, with Secure Boot we now also cover the initramfs, because it's fully trusted; we can grant the key to our data to it. On Fedora and RHEL systems, if you use them, you can build a UKI with the standard dracut tool, the same way you are building your initramfs now; you just say --uefi, and it will build a unified kernel image for you. And as it's a UEFI PE binary, you can load it directly from your firmware or from something like the shim bootloader; you don't need complex things like GRUB. Although we are actually working on adding support to GRUB to load UKIs even on BIOS-booted systems, just to be able to reuse the same unified kernel image everywhere, including on BIOS-booted virtual machines. Otherwise, we would have to provide two separate images: one for UEFI and confidential VMs, another one for BIOS boot. We don't like that; the patches are on the mailing list, and eventually, we hope, they are going to be merged into GRUB. So how does this all work? Just to give a high-level picture of how we boot and how we get the key: we start with UEFI firmware, and then we rely on Secure Boot. We check the signature of the first-stage bootloader, which in Linux is shim. Shim is signed by Microsoft, if you don't know. It doesn't change very often; it's very simplistic, and its purpose is basically to carry a vendor certificate in it.
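As a sketch, building a UKI with dracut and then signing it with a vendor key that the shim certificate chain will accept might look like this. Paths, key, and certificate names are illustrative; sbsign and sbverify come from the sbsigntools package.

```shell
# Build a unified kernel image: kernel + initramfs + command line
# bundled into a single UEFI PE binary.
dracut --uefi --kver "$(uname -r)" esp/EFI/Linux/linux.efi

# Sign the UKI so it passes Secure Boot verification under shim,
# and check the signature afterwards.
sbsign --key vendor.key --cert vendor.crt \
       --output esp/EFI/Linux/linux-signed.efi esp/EFI/Linux/linux.efi
sbverify --cert vendor.crt esp/EFI/Linux/linux-signed.efi
```

Because the signature covers the whole bundle, replacing the initramfs or the command line inside the UKI invalidates it.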
So we don't have to go to Microsoft with every kernel build we do, or every UKI build for that matter. As I told you, yes, you can boot your UKIs directly from the firmware, but if you want to do it with Secure Boot enabled, you would have to go to Microsoft and ask them to sign your UKI, which may not be what you want. So we do it through shim, the same way we do with GRUB: from shim, we check the signature of the UKI, which in cases like Fedora and RHEL is going to be signed by Red Hat already. Every binary we load gets measured into PCRs. I'm sorry for not explaining much about what PCRs are, but think about them as extendable hash registers: the previous value gets extended with the next one, so the only way to arrive at the final value is to go through the same chain of extensions with the same hashes. When the UKI boots, at this point, we think that we are safe. We know all the measurements, so we can create a policy which basically says: if your system is in this state, this is a good state we know, and then we can give the root volume key to this VM. A great concept, but it comes with limitations. Your initramfs is now static. Previously, you were building it on the target system; now it's built by your vendor, so you cannot put more stuff there anymore. Or, I mean, you can, but then you will probably have to sign your own kernel, which defeats the purpose if you do it. So, for example, in RHEL and Fedora we now ship a package called kernel-uki-virt, and "virt" is there for a reason: we put in all the drivers which we think you might need in popular cloud and virtualized environments, like virtio, VMBus, NVMe, stuff like that. And we say: well, this should be enough for the major use cases. If you need more, unfortunately, you have to talk to us; you have to open a Bugzilla: please put more drivers in the UKI. Yes, you can rebuild it yourself, literally, but you'll have to deal with signatures.
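To illustrate the "extendable hash register" behavior of PCRs, here is a minimal Python simulation of the TPM extend operation for a single SHA-256 bank. Real TPMs keep many PCRs and hash banks, and the event digests come from hashing the actual binaries; this is only a model.

```python
import hashlib

def pcr_extend(pcr: bytes, event: bytes) -> bytes:
    # TPM extend: new PCR value = H(old PCR value || H(event))
    digest = hashlib.sha256(event).digest()
    return hashlib.sha256(pcr + digest).digest()

# PCRs start as all zeroes after platform reset.
pcr = bytes(32)
for blob in (b"shim", b"second-stage", b"UKI"):
    pcr = pcr_extend(pcr, blob)

# Changing or reordering any measured event produces a different final
# value, so the final PCR value attests to the exact boot chain.
```

This is why a release policy can simply compare PCR values: there is no way to "un-extend" or reorder measurements to forge a known-good state.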
Shim offers you the MOK (Machine Owner Key) mechanism, so basically you can enroll your own keys which are then going to be used for Secure Boot verification. That's one thing. The other now-static artifact we have is the kernel command line. Previously, we used to put things like root=... on the kernel command line because we were creating it on the target system. We cannot do this anymore; we need to have a command line which works for everyone, so honestly, we can't really put much there. Basically, Fedora and RHEL ship with something like console=ttyS0, and that's it. So we need other mechanisms for standard tasks, like finding your root volume: what's going to be the root volume if you cannot pass this parameter anymore? We rely on features from systemd which allow for auto-discovery of the root volume. You may still need to modify your kernel command line in some cases. For example, we realized that we have things like crashkernel=... on the kernel command line. If you don't know what it is: you reserve some memory for a crash kernel, and in case a crash happens, you boot into a special kdump kernel which is just going to save your memory to a file, so you can file a bug with your vendor and say: oh, my kernel crashed, here are the logs, here is the memory dump. And we cannot come up with a one-size-fits-all solution there, because different systems may need different sizes. But there is a recent development in the systemd project for systemd-stub: a signed extension mechanism. Basically, you can produce a signed file, put it on your EFI system partition unencrypted, and then it's going to be sourced, if the signature is correct, of course. These will either come from your vendor, or you will again have to deal with issuing your own Secure Boot keys. Yes, I've talked a little bit about the TPM policy under which we release the secret, the root volume key. What would this policy be?
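One concrete way to implement such a key-release policy on a systemd-based guest is systemd-cryptenroll, which seals a LUKS key slot against TPM2 PCR state. A minimal sketch, with an illustrative device path, on a system with a (v)TPM2 device:

```shell
# Bind an existing LUKS2 volume to the TPM: the key is released only
# when PCR 7 (Secure Boot policy) and PCR 11 (UKI measurements made
# by systemd-stub) match the values recorded at enrollment time.
systemd-cryptenroll --tpm2-device=auto --tpm2-pcrs=7+11 /dev/vda3
```

At boot, systemd can then unlock the root volume automatically, but only if the measured boot chain matches the enrolled state.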
What do we need to check to say that your system is in a good state? First, as we heavily rely on Secure Boot, you must be sure that Secure Boot was actually enabled on the system, so at least this must be included. Then you must be certain that the artifacts which you trust were booted, and there you have options. You either trust the certificate which signed these artifacts, basically saying "everything built by Red Hat is trusted", or "everything built by Canonical is trusted by me", and then your policy will only contain the certificate. Or you can be more strict and say: I want to make sure that these exact binaries, with these hashes, were used in the boot; I only trust these binaries, which I know. Currently, for example, for Azure confidential VMs, we use the second approach. The reason is that we at Red Hat use the same certificate to sign our kernels and our UKIs, which makes them indistinguishable from the certificate perspective. This means that somebody could take our normal kernel with any initramfs they want and boot that, and it would pass the policy if we only checked the certificate. And from a random initramfs, one would be able to steal the password for your data; we don't want to allow that. So we actually bind the secret to the hash of the particular UKI and of the bootloader before it. So, all right: you trust your vendor, or you did all of this yourself, put it on the cloud, and your VM has booted. Can you start using it? It must be confidential, right? The web UI shows you green: it booted. Well, ask yourself a question: how do I know that this is actually a confidential VM and not something completely different? Think about this attack: the host creates a non-confidential VM somewhere else, on non-confidential hardware, and changes everything inside it, so when you log in and ask "am I confidential?", it tells you: yes, yes, I am confidential.
You need to find a way to test the system, basically to prove its properties: that it's a true confidential VM, that it was using the image you expected, that all these technologies we just mentioned were actually used, and that it's not some completely different image. This process is called attestation. I encourage you to come to the next talk by Kristoff, who is over there, and who is going to talk about establishing a chain of trust, making sure that what you're using is what you expect to be using. There are also attestation services: cloud-based services from your cloud vendors, like Microsoft Azure, and some coming from hardware vendors; for example, Intel is working on something called Project Amber, which is a cloud-based attestation service. I also wanted to talk a little bit about the remaining potential attack vectors, but I think we are already a little over time, so I'd rather go to Q&A right away. I hope we still have time for some questions. Thank you. Yeah, so to rephrase the question: what do we do about the firmware, the UEFI firmware? It's also in the picture; it's what gets started first in the VM, and we need to have some trust in this artifact. Two ways. First, for example, what Microsoft does with Azure today: they have some sort of attestation mechanism in their infrastructure, and the result of the attestation is that the firmware gets the private key, basically the state of the vTPM, loaded into it. So the only way for it to obtain the key is to pass the attestation, which means that whenever we see a vTPM with a certain private key, we trust that it passed the attestation. In this case, we trust Microsoft as an organization to have this established; this is not happening on the host, and we trust them on that.
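As a toy model of what attestation gives you, here is a hypothetical sketch: the "hardware" holds a key the host cannot read and authenticates the guest's launch measurement together with a verifier-chosen nonce. Real SEV-SNP/TDX reports are signed with vendor-certified asymmetric keys; the HMAC here only illustrates the protocol shape.

```python
import hashlib
import hmac
import os

# Stand-in for the CPU's attestation key, inaccessible to the host.
HW_KEY = os.urandom(32)

def attestation_report(measurement: bytes, nonce: bytes) -> bytes:
    """Guest/hardware side: authenticated report over the launch measurement."""
    return hmac.new(HW_KEY, measurement + nonce, hashlib.sha256).digest()

def verify(report: bytes, expected: bytes, nonce: bytes) -> bool:
    """Verifier side: accept only a fresh report over the expected image."""
    good = hmac.new(HW_KEY, expected + nonce, hashlib.sha256).digest()
    return hmac.compare_digest(report, good)

nonce = os.urandom(16)                           # freshness: prevents replay
report = attestation_report(b"known-good-UKI", nonce)
```

A VM faked on non-confidential hardware has no access to the hardware key, so it cannot produce a report that verifies, no matter what it claims about itself.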
Also, I mean, not Microsoft, but I heard that various cloud providers are working on something which is going to be called "bring your own firmware", where you would be able to come with your own OVMF binaries. But then again, you will need to build something like attestation for your firmware: the firmware will load, talk to a server, and say: hey, I'm good, I can prove that I'm good; please give me some private state so I can continue executing. So this is a tough question with firmware, but we are getting there. And, I mean, cloud providers are actually interested in making these confidential VMs confidential; they're on your side. They don't want to have access to your data, because it helps them a lot to be able to say: we can never get to your data; you can use our infrastructure and pay us, but we have no responsibility for your data. Go ahead. Yes, yes, there are. The question is: what about other, non-x86 architectures, namely IBM System Z? Yes, they have the second generation of their Secure Execution technology. I personally haven't looked much at how they do things, although we just had KVM Forum, and I chatted with them. They do something very similar to UKI; they just don't call it UKI. They were asking me: how can we get in there? And I said: just start calling your thing a UKI, because what you're doing is already like a unified kernel, and you're done. There is also this confidential computing technology in the Arm world, I forgot the name, which is coming in the Armv9 specification. So it's not exclusive to x86, and systemd-stub can now be used to build UKIs for both x86 and Arm, at least for what gets booted through UEFI. It's just that without all these confidential extensions, it's a pointless exercise; you're not going to gain much.