Hello. So welcome to this session. I'm Christophe de Dinechin, and today I'm going to talk about five big problems with confidential containers, and why we need KVM developers to help.

So the agenda for today is, first, a quick primer on confidential containers for those of you who already forgot what it is. We are going to talk about attestation, which is about ensuring that you know what you're running. We are going to discuss performance overhead, and why confidential containers are a hardware vendor's dream. We are going to talk about image download, downloading images for containers, which is a matter of lather, rinse, repeat. And finally, access control, which means we need to rethink Kubernetes credentials. I'm going to quickly browse over debugging, to say we can't.

So first, a quick primer about confidential containers. When I talk about containers, I always use this quote, because in computing circles nobody knows who Gretchen Rubin is, but she hits the nail on the head: "I love containers as much as anyone, but I found that if I get rid of everything I don't need, I often don't need a container at all." Amen.

So first, confidential containers is a rather active project. There will be a number of QR codes, so you can scan them, and you can also scan them from the PDF on the website. This is a project that was accepted as a CNCF Sandbox project this year. You see there is a large number of contributors, many repositories, and it touches a number of areas. So there are many repositories that cover really different projects.

So confidential computing, what is it? Why should the infrastructure see your data? The software that we develop now runs on hardware that most of us don't own. It runs on a cloud provider, for instance. And the hardware resources are owned by the host. So the containers inside will carve out resources from the host, and that relies on a number of sandboxing mechanisms, like chroots, et cetera.
But these are designed to protect the host from the containers, not the other way around. And so an admin on the host system can freely peek inside a container, for instance read its memory, dump memory, et cetera. For that reason, it's difficult for multiple computing tenants to share hardware when they process sensitive data there.

Now, the problem is already solved for data on disk or on the network, where we encrypt everything, so the VM host cannot really tamper with it. But the memory is essentially in the clear. So if you do a memory dump, you can see the passwords; you can get all the confidential data there. And the idea of encrypted memory is that we can apply some kind of encryption on this that makes it difficult to read. I hope you can't decrypt what is in the green text here.

There are other aspects to confidential computing, including the fact that you want integrity, to ensure that the host cannot corrupt or poison the CPU state or the RAM content. It protects interrupts as well. And there are aspects outside of, I'd say, the kernel itself, like attestation, which is designed to prove where you're running and what you're running.

So there is a vendor landscape that is rather complicated. AMD started with Secure Encrypted Virtualization, SEV, with variants called SEV-ES, Encrypted State, which encrypts the CPU register file, and SEV-SNP, Secure Nested Paging, which adds integrity protection. Intel has something called Trust Domain Extensions, TDX; I'm sure you're familiar with that. IBM s390 has Secure Execution, the protected execution facility. Arm has the Confidential Compute Architecture. All these technologies are based on virtualization, and that's why we need you guys. But they all work differently. So there will be zombies when we try to deal with that stuff. Scary.

So, a quick overview of Kata Containers, which is the project we started from.
If you start from something like the Red Hat Marketplace, we are going to go through a program called the kubelet to send commands down to something called the Container Runtime Interface, which is implemented either by CRI-O or containerd. And then that's going to invoke a runtime, typically runc or crun. They like to be creative with names. So that's going to start your container. This also relies on a Container Network Interface and a Container Storage Interface to be able, for instance, to deal with images and volumes.

Now, when you run your containers inside a VM, you go through a different runtime, called the shim v2 at the moment, for version 2 of Kata Containers. That itself launches a hypervisor, which launches the VM; inside there is a Kata agent, and there you run your container. So the idea is to have the containers ecosystem, but with virtual machine isolation.

Now, that's not per se encrypted, so let's get there. In order to implement confidential containers, we had to make changes in the Linux kernel, in the firmware and hardware. There are changes in the runtime. There are changes in the hypervisor. And the images have to be encrypted. Oh, and there's an external component called the relying party that I'm going to get back to later.

What you see on this diagram is three big colors. In red, you have the trusted platform that offers confidentiality guarantees that are enforced by hardware cryptography. You have the host that manages and offers the resources that are used to run the container, like CPU, memory, I/O, et cetera, without being able to peek inside and know what's happening. And you have the tenant, or trusted execution environment, often called TEE. That's the confidential area that is not accessible to the host, even when it's running on that host.
So the first phase, obviously, was to activate the hardware, the confidential computing technologies that were developed in the kernel and so on. That's in the hardware, in the firmware, in the kernel, runtime, hypervisor, and changes in the Kata agent as well. Those are actually the most complicated for us, but this phase was mostly completed last year.

Now, when you're there, you're not really offering any kind of serious confidentiality yet, because, for instance, your images are still downloaded on the host. So the next phase is to secure the image pull, that is, to make the image pulling happen from the guest instead of from the host, and then store the images, for instance, on encrypted transient storage. Now, the problem is that this requires a change in the API architecture, because we want to delegate the pull image feature to the agent. We did that implementation in containerd. It was rejected as is, because they want to do a better one. So for now, that's working, but we need to rework that part.

So attestation is about measuring what we run, using cryptography. There is pre-attestation, where you measure the payload before you let it start; that's the original approach. And there is post-attestation, where the code in the payload can confirm its identity using measurements that it made, in order to get secrets from the relying party. And the purpose of all that is really to attest the workload itself, for instance by getting secrets from the attestation service, or more precisely the key broker, to make sure that the container can run.

So how does this attestation work? It's relatively simple. You do a cryptographic measurement of the area of interest. And you respond to a cryptographic challenge by sending a proof of identity that you computed based on the keys and the measurement you received. If that works, then the attestation service is going to tell the key broker it's OK to release the keys. And now your workload can run.
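To make that challenge-response flow concrete, here is a toy sketch in Python. This is not the Confidential Containers code: the names `measure`, `prove`, and `verify` are hypothetical, a random HMAC key stands in for the hardware-held endorsement key, and a real TEE computes the measurement in hardware rather than in software.

```python
import hashlib
import hmac
import os

def measure(payload: bytes) -> bytes:
    # Cryptographic measurement of the area of interest. SHA-384 is used by
    # several TEE vendors; here it is just a software digest.
    return hashlib.sha384(payload).digest()

def prove(measurement: bytes, endorsement_key: bytes, nonce: bytes) -> bytes:
    # Guest side: answer the verifier's challenge by binding the measurement
    # to a fresh nonce with a key that only the genuine platform holds.
    return hmac.new(endorsement_key, measurement + nonce, hashlib.sha384).digest()

def verify(reference: bytes, endorsement_key: bytes, nonce: bytes,
           proof: bytes) -> bool:
    # Attestation service side: recompute the expected proof from the
    # reference measurement and compare in constant time.
    expected = hmac.new(endorsement_key, reference + nonce,
                        hashlib.sha384).digest()
    return hmac.compare_digest(expected, proof)

# Toy flow: the guest measures its payload, the verifier sends a nonce as
# the cryptographic challenge, and only a matching workload passes.
key = os.urandom(32)                    # stands in for the endorsement key
payload = b"container image + kernel + agent"
nonce = os.urandom(16)                  # the challenge
proof = prove(measure(payload), key, nonce)
assert verify(measure(payload), key, nonce, proof)          # matching workload
assert not verify(measure(b"tampered"), key, nonce, proof)  # mismatch detected
```

Only after `verify` succeeds would the attestation service tell the key broker to release the keys, which is the step described above.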
And the security of the scheme, obviously, depends on the workload actually depending on the secrets. Because it's remote attestation, it means you can invalidate the workload if, for instance, it's compromised or has a security weakness, and say: no, you can't run.

So what is the problem? It looks like everything is under control, right? Well, let's start with attestation. The problem with standards, to quote Andrew Tanenbaum: the good thing about standards is that there are so many to choose from. Attestation is actually a very general thing, and so there is more than one way to do it. There is an IETF discussion about something called Remote Attestation Procedures, or RATS, which is in itself a rather complex topic. The schema that we have here is what we are trying to implement in confidential containers. But there are also other aspects that are closely related, at least in the mind of customers, like secure boot or trusted platform modules. And the question of how to implement a trusted virtual TPM inside confidential computing remains a topic of hot research. Like, we are discussing manufacturing transient virtual TPMs on the fly.

Confidential containers have their own attestation mechanism. You see the diagram here; I'll refer you to the website if you want more details. The key point is that it needs to plug into all these technologies that I mentioned earlier. So one thing for you guys here is: could we hide these platform differences from user space? After all, we can hide the differences between this spinning piece of rust here and this little bit of flash card over there, so that it all appears the same to user space. Maybe you could make it so that the secrets from AMD and Intel look the same.

So, attestation and key brokering. As I said, attestation is only robust if it hands over secrets that the workload needs. The attestation service, or AS, which is called the Verifier in RATS, essentially checks your ID.
And the key broker service hands over the secrets. So it has a database of valid measurements and of the secrets it can release. And you need that secret to be actually necessary to run the workload; it should not be some kind of optional secret. There is the KBS protocol that was defined by the confidential containers community; you have the QR code here if you want. It was actually first implemented by Sergio Lopez for libkrun, so kudos to him. And the problem is that the services will come from different players; in that case, for instance, that's Microsoft Azure. That means we need to define protocols, not just code. We need to set some kind of standards. And that's difficult to do when you don't have the first version working.

The next problem is performance. Edward Teller once said that a state of the art calculation requires 100 hours of CPU time on the state of the art computer, independent of the decade. Confidential computing is going to guarantee that that remains true. So the cost of confidentiality is mostly that it gets in the way of deduplication. The downloaded images, for instance, are now encrypted, which means that you can't really share them between pods on your machine. So that increases disk and networking costs. The disks also must be encrypted, and that means no deduplication on disk. And because memory is encrypted, that means no same-page merging. So I'm waiting for the genius in this audience who will invent a patch for crypto-KSM. Please, please, please.

The good news is that the runtime cost itself is manageable. I did not get permission to share the numbers here, so I won't tell you what they are exactly, but this shows you the order of magnitude of the kind of overhead we have. The baseline is running the container on its own, on the host. The square bar is a regular VM. And the rounded bar is not confidential containers; it's actually confidential workloads. Again, thank you to Sergio for these numbers.
But you can see that, essentially, it seems reasonable. We have some outliers, and we need to investigate those, but it looks relatively good. And if I zoom in on the middle area, you see that most of the stuff is within 5%.

So, regarding image download, it's the lather-rinse-repeat school of programming. As Will Durant once said, history is always repeating itself, but each time the price goes up. And that's the case here. So how can we manage to have cache plus dedup plus crypto? That's really difficult. There is something called Nydus that helps with caching container images, and these are the numbers they got; these numbers come from the Nydus project. You can see that the savings are really significant if you can do some kind of local caching. However, that local caching is based on file system analysis; it's essentially a cache of files. The problem is that in our case, because it's encrypted, we want to have a cache of blocks. So we need to build that on dm-crypt, dm-verity, et cetera. And, for instance, I got some help from Richard to get some stuff going on the host side to use nbdkit to fetch the blocks.

So the idea there is that instead of having to decrypt the whole layer to be able to do your verification or decryption, you can do that block by block. Maybe a bit weaker, but it's much faster. And we can do that because we need to change the way we build container images anyway. The reason we need to do that is shown on that slide, which is a busy slide that I imported from elsewhere: we need to split between the part that goes on a public infrastructure and a part that is encrypted that you want to keep really close to home. And that split, which happens where there is a little lock here, that split is new. For instance, in the case of confidential workloads, Sergio has implemented something called oci2cw, I think. So it's a tool that does this split, and we need the same thing for any block-based image encryption.
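The block-by-block idea above can be sketched with a toy example. This is not dm-verity itself: dm-verity builds a Merkle tree over the blocks, while this sketch uses a flat hash list, and the function names are hypothetical. The point it illustrates is that you can verify only the block you read, instead of processing the whole layer.

```python
import hashlib

BLOCK = 4096

def build_hash_list(image: bytes) -> list:
    # At image-build time, hash every block. Shipped as (public) metadata
    # alongside the encrypted image.
    return [hashlib.sha256(image[i:i + BLOCK]).digest()
            for i in range(0, len(image), BLOCK)]

def read_block(image: bytes, index: int, hashes: list) -> bytes:
    # At run time, fetch and verify a single block on demand, without
    # touching the rest of the layer.
    block = image[index * BLOCK:(index + 1) * BLOCK]
    if hashlib.sha256(block).digest() != hashes[index]:
        raise IOError("block %d failed verification" % index)
    return block

image = bytes(range(256)) * 64          # a 16 KiB toy "layer", i.e. 4 blocks
hashes = build_hash_list(image)
good = read_block(image, 2, hashes)     # verifies only the block we read
assert good == image[2 * BLOCK:3 * BLOCK]
```

With per-block hashes like this, a host-side cache (nbdkit in the setup mentioned above) can serve individual blocks, and the guest only decrypts and verifies what it actually touches.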
So the idea is, again, to use the host as a block cache for images downloaded by the guest. At least for guests that share an encryption key, we could cache the blocks on the host.

What about access control? The problem here is that we need to rewrite most of the Kubernetes access control mechanisms from scratch. Well, I'm exaggerating a bit, but the reason is, as H. L. Mencken said, for every problem there is a solution that is simple, clean, and wrong. And authentication in Kubernetes is the simple, clean, and wrong thing here. The problem is going from two to three in this diagram, because you see you have three colors now, not two. And that means that you need credentials that go one way or the other depending on what command you're applying. And actually it's worse than that: if you look inside the APIs, you see that many of the APIs mix stuff that goes in the green or in the red. For example, if you want to read the logs: that used to be something that you would do from the host, but now you want to encrypt that. You want to read the logs only if you actually have the tenant credentials. So hence we need a new host/tenant split in the API, to decide which credentials we apply on the fly, and to switch between a protocol that is encrypted, on the right, or the usual protocol going to the host, on the left, when you're talking about host resources.

What about debuggability? Let me call that the FBI school of debuggability. And I'm going to quote Julian Assange for this: a nation can't solve what the press won't let it perceive. Well, if you debug and your logs look like this, you've got a problem. So my suggestion is: don't panic. Not in the sense of do not panic, but in the sense of don't ever call panic. Because if you call panic, we offer a cryptographic, hardware-enforced guarantee that you won't see the logs. So any good idea on how to address this in a sensible manner is welcome.
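Coming back to the host/tenant split in the API: the routing decision described above could be sketched as a classification table. The verb names and their classification here are purely illustrative, not the real Kubernetes API surface, and the real work is deciding this classification for every existing API.

```python
# Toy classification of API operations across the host/tenant split.
# Which operations land on which side is an illustrative assumption.
TENANT_OPS = {"logs", "exec", "attach"}    # touch data inside the TEE
HOST_OPS = {"create", "delete", "scale"}   # pure resource management

def required_credential(op: str) -> str:
    # Decide on the fly which credential, and therefore which protocol
    # (encrypted to the guest, or plain to the host), an operation needs.
    if op in TENANT_OPS:
        return "tenant"    # encrypted channel, tenant key required
    if op in HOST_OPS:
        return "host"      # ordinary host-side credentials suffice
    raise ValueError("operation %r not yet classified" % op)

assert required_credential("logs") == "tenant"   # logs now need tenant keys
assert required_credential("scale") == "host"
```

The hard part is exactly the `ValueError` branch: today's APIs mix both sides, so every operation has to be reclassified before a split like this can work.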
My conclusion is that confidential containers are a real opportunity to do new stuff and interesting stuff, et cetera. But it's a real challenge in terms of practically everything. The key takeaways I want you to get are the high cost of confidentiality and sandboxing. Release one of confidential containers is just around the corner. But practical, portable, uniform attestation is still a challenge; it doesn't work yet. Performance issues, as I said, are mostly a matter of vastly increased resource usage, which no hardware vendor is going to complain about, frankly. Image download is a huge contributor to this, but this one is probably something we can fix a little in the case of shared encryption keys. Access control has to be seriously rethought, and that's not going to happen in release one. I'm not even sure it's going to happen in release two, because it really goes all the way up the stack in Kubernetes and OpenShift. And debuggability or confidentiality: pick one. That's it, thank you.

I think we have a little bit of time for questions, so please ask away. I spoke very fast so that you could ask questions, so make the most of it.

So the question is whether I have seen the confidential computing work for vGPUs. The answer is I have not. I've heard about it; I've not looked at the code. GPUs at the moment are a really tricky topic, which Dave Gilbert summarized quite well, saying that when they say confidential computing in vGPU land, it means they don't trust the CPU. So it means essentially it's confidential computing from the other side. And we don't have really a good working model for how to exchange data with vGPUs in a way that would let us run compute-intensive workloads, this kind of thing. As far as I know.

So, to try to summarize your question: is using the GPU MMU to do IO translations something that can be supported in containers? The answer is yes.
And I think that works in containers, to some extent. You really have to follow the instructions to the letter, but there are instructions on the Kata Containers website on how to run, for instance, virtual GPUs and this kind of thing in Kata Containers. To that extent, it works. As far as I know, it doesn't work for the confidential case. And we could use regular bounce buffers or some stuff like that to do non-encrypted IOs. If you don't care about the data that you send to your GPU, you can still use that.

Yes. So the question is: what are the security properties that container storage should have? Practically none; from our point of view it's a bag of blocks. We just encrypt everything on it, and the property that we enforce is that we force the container description to use encrypted storage. But from the host point of view, it's just blocks. It's just blocks it doesn't understand. Did I answer the question?

So the question is: is it virtio devices that supply devices to the VM? Yes. Kata Containers by default uses something that is called virtio-fs. virtio-fs doesn't support encryption, and would not really suit our needs because it exposes too much metadata anyway; you could do pattern recognition of the workload, this kind of thing. So for that reason we completely gave up on the virtio-fs side, and we only use block devices and virtio-blk.

Yes — so, I'm sorry, I see the sign, I don't see whether it's five minutes? Three, okay. So, I'm sorry, ask again? So the host? Yes, so the comment is that the host knows which block devices belong to that particular guest, even if it doesn't know what's inside. That is partly true. If you use something like dm-crypt plus dm-verity, you add some layers of indirection inside the guest that make it harder to decipher exactly what is going on. There is not a one-to-one mapping between, for instance, file system accesses in the guest and the kind of block requests you're going to see on the other side.
That was the question? So, the threat model is a model called WIP, which stands for Work In Progress. It's actually a document; there is a PR that has a hundred comments on it or something like that. So it's not closed yet, as far as I remember. And some of the questions we have are around these ideas of: do we consider it a threat that you can detect a pattern, or make a signature of the workload, this kind of thing? One very simple example: when we download images right now, we download them using ocicrypt, which lets you have a mix of encrypted and non-encrypted layers. And typically most people will not encrypt the shared layers; they will say, I don't know, my Linux image, the standard one, I don't encrypt that one. That means the pattern for this is very well known, so you know which release of Linux that is. Okay, one minute. The pattern itself may be sufficient to find weaknesses, but you can also have someone who decides to use some kind of weak encryption along the way and expose data, or to send data over HTTP or whatever. So all these kinds of bad-idea threat models, we don't know exactly how to catch.

Last question, I think? Yes. So the question is: since we are bringing together a number of rather disparate advanced technologies, can we hide the complexity from the user? Can we bring a unified user interface? We want to. Right now we are about as close as we can get, meaning that many of our components have several subcomponents. I showed you an example with the modules for the attestation server that are going to hide the differences between attestation from this or that provider, or attestation secrets from this or that hardware, et cetera. So we are trying to hide that, but clearly, for instance, in the case of plain confidential computing in virtual machines, it's not completely there yet. It's not transparent.

I think I'm about out of time. Thank you very much. Thank you.