Hi everybody, my name is Alice Frosi, and the other speakers are Christophe de Dinechin and Sergio Lopez Pascual. We are all from Red Hat, and today we are going to talk about confidential computing and containers. In the introduction we will give a brief overview of confidential computing. Then I'm going to focus on confidential workloads with libkrun and Kubernetes, Sergio is going to explain the support for libkrun and SEV, and Christophe is going to focus on Kata Containers and confidential containers.

First of all, what is confidential computing? Confidential computing is the protection of data in use, achieved by performing computation in a hardware-based trusted execution environment. We already have well-established techniques to protect data at rest and in transit; what we are missing is protection of data during computation. Without confidential computing we cannot really guarantee that critical data hasn't been leaked by a malicious cloud provider, a system administrator, or anybody who has simply succeeded in acquiring root privileges on the system. That is exactly the goal of confidential computing, and in this talk we are going to focus on confidential computing technologies combined with containers.

It is important to understand the difference between confidential workloads and confidential containers. Confidential workloads are containerized workloads transformed into a special form that can be deployed with libkrun and confidential computing technologies. Confidential containers are the deployment of regular containers with an OCI runtime, for example Kata Containers, plus confidential computing technologies. Confidential workloads operate at the container level, which implies a single container per encrypted virtual machine. Confidential containers, on the other hand, operate at the pod level.
If you are already familiar with Kubernetes, you know that a pod is a group of containers: Kubernetes sees and deploys the pod as a unit. This implies multiple containers per encrypted virtual machine. Confidential workloads are a special form of containerized workload and also use a special form of container image. Confidential containers use layers like a regular container, and those layers are encrypted. Generally speaking, confidential workloads are more restricted and limited than confidential containers: they don't fit all use cases, but they have a simpler architecture and try to reuse the existing Kubernetes infrastructure without modification. Confidential containers probably fit more use cases and are more generic, but that comes at the cost of a more complex infrastructure; for example, part of the infrastructure, such as image downloading, moves inside the trusted environment.

Now I'm going to focus on confidential workloads with Kubernetes and libkrun. Kubernetes is the orchestrator and is responsible for scheduling the workload onto a node. When you want to deploy a pod, you submit its description in the form of a YAML manifest, like the example on the right. When the workload is scheduled, Kubernetes passes the information to the container engine, in our case CRI-O. The container engine is responsible for pulling the container image onto the node, preparing the bundle with the root file system of the container, creating the configuration with the container information, and passing all of that to the OCI runtime. The OCI runtime is the actual launcher of the container; in our case we are using crun. crun already has support for libkrun in order to run more isolated containers using KVM; to deploy a regular container, crun uses another library called libcrun. In the example on the right I have highlighted the important fields involved.
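The manifest referred to as "the example on the right" is not reproduced in this transcript. A sketch of what such a pod description could look like is below; every annotation key, image name, and node label here is an illustrative placeholder, not necessarily the exact keys CRI-O and crun use:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: confidential-workload
  annotations:
    # Placeholders for the two annotations described in the talk:
    # the attestation server endpoint, and the crun annotation
    # that selects libkrun as the launcher.
    example.com/attestation-url: "http://attestation.example.com:8080"
    example.com/use-libkrun: "true"
spec:
  containers:
  - name: workload
    # The special single-layer image carrying the LUKS-encrypted rootfs.
    image: registry.example.com/my-app-encrypted:latest
    # Fake entrypoint: the real command arrives later via the attested secret.
    command: ["/entrypoint"]
  nodeSelector:
    # Illustrative label marking SEV-capable nodes.
    sev-capable: "true"
```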
Annotations are a standard way in Kubernetes to pass additional information. The first annotation contains the HTTP endpoint of the attestation server. The second annotation is already used by crun, and it allows us to select the right library and deploy the containerized workload as a confidential workload. In the containers section you can see the image; this is a special form of image, and we are going to look at it in more detail. We also have to specify a command. However, Kubernetes and the container engine run in an untrusted environment, so we don't want to give them any information about the process we are going to run; that is the reason for the fake entrypoint. The last field is the node selector: we have to inform Kubernetes that we want to schedule the workload on a node that is SEV-capable, and that is the reason for the label.

The attestation is the moment where we verify that we are running on genuine hardware. The attestation phase is started by libkrun, which already knows where to find the attestation server because we provided that information through the annotation. On the trusted side we have a component called the encrypted workload coordinator, which is responsible for the attestation and validation. libkrun starts a session and sends a measurement that has to be validated and approved by the encrypted workload coordinator. If the attestation is successful, the coordinator sends back an encrypted secret. The secret contains the kernel command line, where we can find the critical information: the LUKS passphrase, the process we actually want to run in the trusted environment, and additional information such as environment variables and parameters. Once libkrun has received the secret, it can inject it and boot the encrypted virtual machine.

In order to run a confidential workload, we need a file system, and this is provided in the form of an image.
In our case we are using an OCI image with a single layer. In this layer we have the file system of the container, encrypted using LUKS, plus the fake entrypoint. We can transform a regular container image into this special form, and this process has to be done on a trusted build server. You can still build the regular container as you did before, maybe using your existing CI/CD pipelines, and maybe a validation tool proving that your image doesn't contain vulnerabilities. Once you feel ready, you simply add one additional step and transform the regular container image into the encrypted image. Then you can publish and push this encrypted image to the registry; the registry is once again an untrusted environment. When the workload is scheduled by Kubernetes onto a node, CRI-O is responsible for pulling the encrypted image from the registry onto the node. It then extracts the encrypted disk image from the layer and puts it in a known location, so that libkrun is able to find it. libkrun already expects this special form, so it knows how to pass it to the virtual machine. Together with the kernel command line that we got from the secret and this LUKS disk image, we can simply boot and deploy the confidential workload. The decryption of the LUKS volume happens inside the trusted environment. So this is basically the architecture and the flow with which you can deploy a confidential workload using libkrun and Kubernetes. Thank you very much.

Thank you, Alice, and hello everyone. My name is Sergio Lopez, and I'm going to talk a bit about the changes we needed to apply to libkrun to enable it to run confidential workloads using AMD SEV. The first problem we faced is the fact that the VMM integrated in the original version of libkrun writes a number of data structures directly into the guest memory before starting the VM.
This is fine if you are running regular VMs, but it becomes a problem with SEV-enabled ones, because those internal data structures become part of the launch measurement. This implies that the remote attestation server you use to validate the contents of the VM, and possibly to send it some secret, would need to somehow recreate those internal data structures to obtain the digests from them and do the attestation itself. That is technically doable, but it is not a nice thing to do, because it makes the remote attestation server very dependent on the behavior of the VMM. So we instead opted for a more traditional and conservative approach by implementing a minimal firmware. The SEV-enabled version of libkrun loads this minimal firmware, alongside the kernel image and an initramfs, into the guest memory before starting it, and those three components become part of the launch measurement. Since those components are shipped with libkrun, the remote attestation server just needs a copy of this library to obtain the digests from them and do the attestation. After the VM has started, the minimal firmware writes the data structures required to run the kernel, and those data structures are no longer part of the launch measurement.

Another change we needed to make was replacing virtio-fs with virtio-blk. The regular version of libkrun uses virtio-fs, which allows us to use any directory on the host as the root file system for the guest. The SEV-enabled version of libkrun uses virtio-blk, with a pre-encrypted image on the host as the storage backend. The reason for this change is that, while virtio-fs fits very nicely with the regular container isolation use case, we felt it was not the best solution for the confidential workload use case.
Basically, virtio-fs allows us to follow the same workflow we would have with a regular container: download some OCI image, expand it into some directory, and use that directory as the root file system for the guest. But for confidential workloads it falls short. On one hand, we needed to add encryption to the mix, and even if we had found an acceptable file-system-level encryption mechanism, the implementation would probably leak too much information: while the host wouldn't be able to see the plaintext of whatever we are reading or writing, it would still be able to see when we remove a file, copy a file, create a file, change permissions, and so on. On the other hand, the implementation of virtio-fs itself is quite large and complex compared with virtio-blk, and it requires a larger number of syscalls, which implies a more permissive seccomp filter, which makes it a little bit worse from the security point of view.

Another good reason for switching to virtio-blk in the SEV-enabled version is that it allows us to easily rely on LUKS. And that's great, because LUKS2 has the ability to combine encryption with integrity protection, and the combination of the two brings both confidentiality and integrity to the table by using authenticated encryption with additional data (AEAD). With these mechanisms we are protected against almost every kind of tampering, except data replay, which would probably require specialized hardware anyway.

Another problem we had to face is that the regular version of libkrun uses a small binary running inside the guest to set up the environment for the workload entry point, and this binary relies on the integrated virtio-fs server. As the SEV-enabled version of libkrun doesn't have a virtio-fs server, we needed an alternative, and the obvious one was to incorporate a simple initramfs.
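As a concrete illustration of the LUKS2 mechanism mentioned above, cryptsetup can format a LUKS2 volume with authenticated (AEAD) encryption, combining dm-crypt with dm-integrity. This is a generic sketch, not the actual tooling libkrun uses; the device path and key file are placeholders, and the commands require root:

```shell
# Format a LUKS2 volume with authenticated encryption: each sector is
# encrypted with AES-GCM and carries an integrity tag, so host-side
# tampering with the ciphertext is detected on read.
cryptsetup luksFormat --type luks2 \
    --cipher aes-gcm-random --integrity aead \
    --key-file passphrase.txt -q /dev/vdb

# Open it; I/O through /dev/mapper/cw_root is now both
# confidentiality- and integrity-protected.
cryptsetup open --key-file passphrase.txt /dev/vdb cw_root
```

Note that, as discussed, this does not defeat whole-volume replay: the host could still roll the device back to an older but validly authenticated state.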
This initramfs includes a variant of the init binary, a statically built version of the setup script, and some support directories and device nodes. The SEV-SNP and TDX versions will probably also need to include a small attestation client, as attestation no longer happens at the VMM level. From a high-level perspective, the job of this initramfs is to open the LUKS device, potentially using an injected secret, and set up the environment for running the workload entry point.

Lastly, I would like to share with you the big picture of an application node running a confidential workload using libkrun. On the bottom side we have the untrusted components: the host kernel, the kubelet, and CRI-O, which downloads the encrypted image and starts the execution of the confidential workload by using crun plus libkrun. On the top side we have the trusted components running inside the guest: the minimal firmware, the kernel image, and the initramfs. All three of them are part of the launch measurement, as we have seen before, and can be remotely attested. We also have the LUKS-based root file system, which includes the workload entry point and data, and which is pre-encrypted and authenticated using AEAD.

There are a couple of things I would like to highlight about this big picture. One is the fact that, if we look at the untrusted part, it is very similar to what we would see on an application node running a regular container, both in the number of components and in the execution workflow. In fact, the execution workflow is preserved until the very last stage, where crun, after setting up the environment, uses libkrun to start the VM that contains the confidential workload. Everything else is pretty much preserved.
On the other hand, we are also adding just a small number of components, which are small in code size and also tightly coupled and well constrained. The combination of these two things means that this is the least disruptive option for enabling an application node to run confidential workloads. And that's all I had to share. Thank you for listening, and I'm leaving you with Christophe, who is going to talk about Kata's approach to confidential containers. Bye-bye.

Hello, I'm Christophe de Dinechin, and I'm going to tell you about the transition from Kata Containers to confidential containers. First, let me give you a very quick overview of Kata Containers. As Alice already pointed out, Kata Containers, unlike confidential workloads, is designed to run existing containers, described the usual way: the same YAML manifest files, the same container image format, the same existing storage volumes, networking, and so on. But we want to run them inside virtual machines, with their own independent kernel and a very minimal user space, basically just the Kata agent that starts the container. So we are talking about the existing ecosystem of containers with the additional sandboxing provided by virtualization. This is made possible because the Kubernetes architecture is very flexible, with a number of interfaces where you can add plugins: the Container Runtime Interface (CRI), the Container Network Interface (CNI), and the Container Storage Interface (CSI). The Kata runtime plugs into the Container Runtime Interface.

Let me start with a problem statement: can we trust the host? Your containers run on a host that is typically managed by a third party, like a cloud provider. The existing sandboxing offered by either the operating system or the virtualization technology goes only one way: it is designed to protect the host from the containers, not the other way around.
The resources that you use in your container (CPU, memory, disk, networking, and so on) really belong to the host, which owns them and has free and unrestricted access to the data in your container. The containers are carved out of the host's resources, and so that begs the question: as a container owner, what do I need to do if I start considering the host as potentially hostile? Now, why would I think that? Well, the host can read the data inside the container, so exposure of that information is possible. That means multiple tenants may not want to share the same host, because of the confidentiality risk presented by these undesired data exchanges. There can even be legal concerns that preclude the use of containers if you cannot guarantee confidentiality, that is, if you cannot guarantee that the data does not escape from the container to the host, and from there possibly elsewhere.

Now we have a new emerging technology that helps us address these problems: confidential computing. Confidential computing is not just about memory encryption. Memory encryption ensures that any secret you have inside your container will be seen only as garbage by the host. But there are other features, like integrity protection, which ensures the host cannot corrupt the guest or inject malicious data into it. And there are also attestation mechanisms that let the guest owners, or tenants, validate what runs inside their guests. There are many vendor-specific technologies. AMD, for instance, offers Secure Encrypted Virtualization (SEV), with two more recent variants: Encrypted State (SEV-ES), which protects the CPU register file, among other things, and Secure Nested Paging (SEV-SNP), which offers additional integrity protection for physical memory, among others. Intel offers Trust Domain Extensions (TDX). IBM mainframes have Secure Execution, the POWER processor family has the Protected Execution Facility, and Arm recently announced the Confidential Computing Architecture (CCA).
All these technologies are based on virtualization, and each of them works in a slightly, or actually markedly, different way. So there are a number of obstacles to overcome when we try to integrate them into something like Kata Containers. First, let me explain the concept of separation of trust realms between the platform, the tenant, and the host. The trusted platform, drawn in red on this diagram, offers confidentiality guarantees using hardware-level cryptographic enforcement. The host, shown in blue on the diagram, offers and manages the physical resources used to run the container: CPU, memory, disk, networking, and so on. Finally, the tenant's security realm includes a confidential area, or enclave, that is carved out of the host's resources but that the host can neither see nor access. This security realm also includes things that may live outside of the host, in some relying party: for instance, a key broker that offers keys to the guests, attestation services, container image download services, and so on.

In order to enable confidential computing for Kata Containers, we need to modify a number of components, highlighted in red on the diagram. The first one, of course, is the Kata runtime, where we need to pass the right options to the virtual machine monitor, QEMU for example, in order to activate confidential computing. The virtual machine monitor itself needs to be aware of confidential computing and able to set up an encrypted virtual machine. The kernel, both on the host and in the guest, needs to be modified: the host kernel with low-level hardware support, for example changes in page table management, and the guest kernel to expose, for instance, secrets that the confidential platform gives the guest, for which the kernel provides an interface to user space.
There are also new firmware services, for example to control page validation, which is the transfer of pages from the host to the guest, making sure that once the guest owns them, the host no longer has access. And there is hardware support, like encryption in the memory controller of the CPU for the data that is sent to memory. The Kata side of this development is done for most platforms; however, we still have insufficient hardware to test with for many of these architectures.

The next step is securing the download of images. We currently pull images on the host, and we need to be able to do that from inside the guest instead, in other words from inside the tenant's security realm. Today the kubelet delegates the pull-image operation to the image service of the Container Runtime Interface; that is where the image gets downloaded. We need to forward that request instead, through the Kata shim to the Kata agent, so that the agent can do the download itself. That API situation is quite typical of the sort of issues we run into on this project, where the existing APIs are either insufficient or target the wrong component for what we want to do. Also note that for the initial prototyping, the key has to be pulled out of some magic hat: we inject it directly into the image. That is not a scalable solution; in a cloud, you could not build a new image with a new key every time you want to run a container. But for prototyping the image download service, it is okay. We will also add the ability to store the images on some guest-owned encrypted volume, in other words a volume where only the guest has the decryption keys, so the host cannot access the images.

The attestation process is a bit different from what we saw for confidential workloads. The kubelet sends a container creation request, which is forwarded to the Kata shim v2.
The Kata shim is going to pick up a boot image; that part is somewhat similar. But there may be a new step, called pre-attestation, where we want to validate the boot image before we even start the virtual machine: we want to check that we are booting the right thing, with the right contents inside the boot image. Then, when we start the containers, that is typically done by the Kata shim sending an API request to the Kata agent saying: now start the container. We don't want to do that before we have attested that the container is allowed to run. This means that the APIs going from the Kata shim to the Kata agent over vsock will now have to be restricted and validated one by one. The Kata shim itself has to be extended with facilities to manage keys and to deal with the encryption of the container images. So we need new components there, as well as new processes in the image, like skopeo or umoci, and an attestation agent.

So what happens when you start your container now is that the attestation agent first makes a measurement of what is running inside the image; that process is called remote attestation. From that measurement it can send a quote to an attestation service, which gives a go or no-go and can, for instance, authorize a key broker to deliver keys to the guest. These keys can then be used, for example, to decrypt the container images you got from your container image registry, as well as, if you have some kind of local disk storage, to decrypt that storage. The container images are now in pod scope: they cannot be shared across pods. Initially we are probably going to use a RAM disk for them, but encrypted memory is a precious resource, so as soon as possible we want to be able to store them on disk, to free the memory for other uses. That work is planned for the last quarter of 2021.
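The measure/quote/verify loop described here can be caricatured in a few lines of shell. This is a toy model only: in reality the measurement is taken and signed by the hardware, and the attestation service and key broker are remote components; here both sides are plain SHA-256 hashes inside one script.

```shell
# Toy remote attestation: hash the launched components into a
# "measurement", compare the guest's quote against the verifier's
# known-good reference, and decide whether keys may be released.
set -eu

workdir=$(mktemp -d)
printf 'firmware-v1' > "$workdir/firmware.bin"
printf 'kernel-5.x'  > "$workdir/vmlinuz"
printf 'initrd-v1'   > "$workdir/initramfs.img"

# One hash over all launched components, like a launch measurement.
measure() {
    cat "$workdir/firmware.bin" "$workdir/vmlinuz" "$workdir/initramfs.img" \
        | sha256sum | awk '{print $1}'
}

reference=$(measure)   # known-good value held by the attestation service
quote=$(measure)       # value reported by the running guest

if [ "$quote" = "$reference" ]; then
    verdict="go: release keys"
else
    verdict="no-go: withhold keys"
fi
echo "$verdict"
```

Tampering with any component (say, overwriting the kernel) changes the measurement, and the verdict flips to no-go, which is exactly the property the key broker relies on.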
The next step after that is to think about the configuration of the virtual machine. The reason is that hot-plugging is currently used to add memory, CPUs, or devices to the pod. The pod API that exists today in Kubernetes does not give us any information about, for instance, container sizes, so we need to add resources dynamically when we create the container. If you start a container with two gigabytes of memory and two CPUs, we have to hot-plug two gigabytes of memory and two CPUs to make room for it. This adds a lot of complexity to the Kata runtime, like support for hot-plugging, but it is also quite inefficient, because it means we either need much larger page tables in the guest than we would otherwise need, or we need to wait for the hot-plug to complete, which is a relatively slow process. But in the context of confidential containers, the real issue is that we cannot guarantee integrity if the configuration can change at runtime, once you have measured and attested your virtual machine. So we cannot allow that, and that is why memory hot-plugging and ballooning mechanisms conflict with encryption and validation. Today the expectation is that we would add and validate all the memory given to the guest at Linux boot time; it is much easier to do it at that point, and adding memory after that is not really implemented yet. We also have issues with any device that could do any kind of direct memory access: a virtual GPU, for instance, would not be part of the trusted platform, and so we cannot hot-plug it after the fact, because that would completely break the integrity of the confidential platform. So we are moving towards what we call immutable pods, which are fully defined at the time you create the pod, before you boot the virtual machine. And that requires a massive change to the existing Kubernetes APIs, because the APIs we have today typically put things in the wrong place.
An example is that we send logs to the host, even though the data in the logs typically belongs to the tenant. So we now need to expose an interface so that only the tenant sees the logs and the host cannot. This will also massively simplify and optimize the non-confidential case for Kata Containers, by removing things like hot-plugging, and that is why it is likely that Kata 3.0 will be largely defined by the changes we need for confidential containers. This leads to the need for a shadow control plane: the tenants now need their own isolated administrative realm for things like getting the logs, the container metrics, and so on, which the host should not be able to see. The host itself is still needed, though, in order to manage physical resources: pod creation and destruction, raw disks, hardware metrics, and so on. This means that you now need two sets of credentials: host credentials for physical resource access, and tenant credentials for what is inside the virtual machine. Getting there will clearly take a couple of years, because we need to dispatch the APIs. Ideally we can hide that in tools like kubectl, or oc for OpenShift, which would dispatch to one set of credentials or the other depending on the command you are using. But obviously that is a lot of work, and it is going to take a long time. So this is basically the vision we have for confidential containers with Kata Containers. There are some things we are doing today, some things we are planning to do, and basically a lot of work ahead in the coming two years. I hope this interested you, and if you are interested, please come and join us; it is really an exciting project. Thanks.