Hi everyone, this is Ziyue Yang. It is a great honor to present at Open Source Summit Japan. Today, my colleague Haokun and I will present our topic: improving the boot-up performance of containers with overlay images in a TEE (trusted execution environment).

This page shows the agenda of this presentation. In the first part, we will introduce the background and motivation. We know that containers are widely deployed in both public and private clouds. When containers run in the public cloud, there are strong requirements on the data integrity and data security of users' workloads. There is an open-source project named Confidential Containers, and its community is very active. It mostly relies on hardware trusted execution technologies to protect containers. Even when the data is protected, customers still have strong requirements on the performance of their workloads deployed in those containers. In this presentation, we would like to address the performance issues that arise when customers' workloads are deployed under Confidential Containers protection. In the second part, we will introduce the status of the current Confidential Containers project. In the third section, we will introduce our optimization work. It is about how to accelerate container boot-up, including accelerating the retrieval of multiple keys when a container image has multiple layers. We will also introduce how to use hardware acceleration technologies to speed up decryption operations on container images. In the last part, we will summarize the presentation.

The first part is the background and motivation. Customers who run containers on public clouds have strong security requirements on workloads in the following three situations: data at rest, meaning data stored in external storage; data in transit, meaning data transferred over the network; and, finally, data in use.
Currently, there is an open-source project named Confidential Containers. It is designed to address such requirements in those three dimensions. Confidential Containers mainly relies on hardware trusted execution environment (TEE) techniques to protect containers. There are some well-known TEE techniques, such as Intel SGX (Software Guard Extensions), Intel TDX (Trust Domain Extensions), AMD SEV (Secure Encrypted Virtualization), and Arm CCA (Confidential Compute Architecture). With this project, customers' data security requirements can gradually be satisfied. However, another issue is not well addressed: the performance of the workloads. Even though the data of customers' workloads is protected, customers still want workload performance with little impact; they do not want their workloads to run more slowly than with normal solutions. Besides hardware TEE solutions, there are other approaches, such as homomorphic encryption. Compared with those techniques, the hardware TEE solution impacts the performance of workloads in containers much less. Even so, customers still have a strong desire for good performance of workloads deployed with the trusted solution provided by Confidential Containers. So in this presentation, we propose some techniques to address the performance issues, mostly related to how to quickly start customers' containers. We provide the following two approaches for workloads deployed in confidential containers. The first approach is to accelerate the retrieval of multiple keys for overlay image layers during attestation with the key broker service or attestation service. The second is to accelerate the decryption of image layers through hardware offloading; for example, we can use Intel QuickAssist Technology (QAT).
In this part, we will introduce the background of Confidential Containers, usually called CoCo. This page introduces the characteristics, goals, and core projects of CoCo. CoCo has three key characteristics. The first is data confidentiality: unauthorized entities cannot view data. The second is data integrity: unauthorized entities cannot add, remove, or modify data. The third is code integrity: unauthorized entities cannot add, remove, or modify the executing code. The project has five goals. First, it allows cloud-native application owners to enforce their application security requirements. Second, it makes running unmodified containers transparent to customers. Third, it supports multiple TEEs and hardware platforms. Fourth, it proposes a trust model that removes the cloud service provider from the trust boundary of guest applications. The last is the least-privilege principle for the Kubernetes cluster. Currently, CoCo has two implementation projects: Kata Containers and enclave-cc. Kata Containers is a VM-based runtime: the container process runs inside an independent VM, and the VM is protected by hardware such as TDX or SEV. enclave-cc is a process-based runtime: the memory of the container is encrypted by Intel SGX. In one word, CoCo aims at providing a secure environment for containers, and it wants to integrate confidential containers into the current Kubernetes ecosystem. This page displays an overview of the key components. First, the blue components include the hardware and hypervisors that support the confidential container; they provide the fundamental capabilities to launch a secure container. We will call the confidential container CC in the following parts. Second, CoCo relies on some services, colored pink. The image build service helps to convert a plaintext image into an encrypted image, or to build an encrypted image from scratch.
The image registry stores encrypted images. The key management service manages the encryption keys. The attestation service validates that a CC is running in a valid TEE. The key broker service is an intermediate service that helps validate the CCs and transfers the decryption keys; we will call it KBS in the following parts. The first part of our work helps the CCs communicate with the KBS to accelerate decryption key retrieval. The second part of our work accelerates image decryption inside the CCs with hardware. Our optimization is implemented in the agent of Kata Containers or enclave-cc. The three pictures on the left show the deployment modes of the agent. In the left one, TDX or SEV creates a secured VM; the VM's memory and file system are encrypted, and the agent and the real application containers run inside the guest machine. In the middle one, the agent and the application run on the host machine, but their memory is encrypted by SGX, and a LibOS such as Gramine or Occlum helps us run the container with unmodified images. In the right one, a normal VM is created first; the agent and the application run inside the guest machine and are protected by vSGX, which is the SGX device passed through into the VM. Finally, the rightmost picture shows the tasks the agent needs to do: download, decrypt, decompress, and store the image layers. Our improvements are in step 1 and step 3. In this part, we will briefly introduce some concepts of encrypted images to help understand our optimization of image decryption. First, one symmetric key encrypts one image layer, and that symmetric key is itself encrypted by another wrap key. The encryption information is stored in the annotations entry of the manifest file, as the top picture shows; the key information is colored yellow. The public options contain the layer encryption algorithm of the symmetric key and the digest of the layer, as the middle picture shows. The private options are shown in the bottom picture.
The annotation entry named after the attestation agent indicates that the symmetric key was encrypted by a key provider service called the attestation agent. The private options are sent to that attestation agent service to retrieve the real symmetric key; the address of the attestation agent is read from a config file. In the private options, the real wrap key is represented by the key ID, and the IV and wrap type indicate the encryption parameters of the wrap key. The wrapped data is the encrypted symmetric key. In one word, the agent needs to send the private options to the service for every encrypted layer to retrieve the symmetric keys. Obviously, the network communication is too frequent. From the previous description, we can identify the overheads of using encrypted images in CCs. The first is network delay on symmetric key retrieval: multiple layers send multiple requests and wait for responses, and the network delay blocks image unpacking. The second is the extra decryption of encrypted image layers: the decryption operations add noticeable delay to image unpacking and put extra workload on the CPU. Given these overheads, optimization of image unpacking and boot-up deserves to be prioritized to improve confidential container performance. In this section, we will introduce our optimization work. Generally, we propose to accelerate the boot-up of containers with multiple overlay image layers. First, we look for potential optimization points in the current Confidential Containers solution. The first is retrieving the multiple keys for different overlay image layers in one round during the attestation process. The second is the parallel download of image layers, leveraging hardware-based accelerators such as Intel QAT as an example. This page shows the original diagram of container boot-up in CoCo with Intel TDX; the original diagram is from page 18 of the referenced slide deck. From this diagram, we can see that there is a confidential virtual machine.
Users go through kubelet, which invokes containerd; containerd invokes the Kata shim v2, which talks with the kata-agent inside the confidential virtual machine. We can see there are about 11 steps to boot up the containers. In the fourth and fifth steps, the kata-agent communicates with the container image registry to download the images. Since the images are encrypted, the kata-agent then communicates with the key broker service and attestation service to get the key for each layer; these are steps 7 to 10. We can see that there is still some optimization space. For example, we could download the image layers from the container image registry in parallel. Also, in this diagram, the agent gets the key for each layer in a separate round, which is not optimal. We would like to download the layers together; then the kata-agent talks with the key broker service and attestation service to retrieve the whole key list. We think this can save time if the container has multiple image layers. This diagram shows our optimization approach. In the fourth step, we pull the different layers of the image together; in the fifth step, the kata-agent downloads all the layers of the image in parallel. After that, the kata-agent still communicates with the key broker service and attestation service, but it sends one request to get the key list, meaning all the keys for the downloaded layers, and the key broker service sends back the whole key list in one round. We think this optimization is feasible and will also be efficient. This page shows that the Confidential Containers project leverages image-rs for container image operations.
The left diagram shows the original setup inside the kata-agent. There is an ocicrypt module in image management, and it leverages skopeo to download container images from the image registry and uses the umoci tool for the image-unpacking operations. In the right diagram, we can see that Confidential Containers leverages image-rs in the image management module, and image-rs is written in the Rust language. So, on the next page, we will propose solutions to leverage QAT to accelerate these image operations written in Rust. This page shows our contributions to improve the key retrieval procedure. The agent uses the image-rs crate as its image management module. The left figure indicates that currently the agent handles layers in parallel, and every layer sends a key retrieval request to the key provider; these multiple communications increase the network delay. The right figure shows our design: we retrieve all keys in one shot before the layers are processed. To implement the design, we need to add a batch-retrieve API between the provider and its clients, and make sure each symmetric key can be mapped to its corresponding image layer. This page shows the other part of our work: balancing decryption between the CPU and accelerators. The left diagram shows that currently all decryption is carried out by the ocicrypt-rs crate, which dedicates all of it to the CPU; this keeps the CPU busy and may occupy time slices of other tasks. The right diagram shows our design: a balancer that delegates decryption to the CPU or to accelerators according to the attributes of the tasks. The accelerators can share the load with the CPU, and furthermore they can handle part of the decryption tasks faster than the CPU does. Our design decreases the workload on the CPU by offloading decryption. Accelerators can help with decryption, and most of them provide C or C++ libraries, but the agent is written in Rust.
There is a language gap between C and Rust. This page shows our contributions on wrapping the hardware C libraries for Rust applications. Currently, our wrapper crate provides three kinds of functionality: encryption/decryption, compression/decompression, and digest, all backed by QAT. In the future, we plan to wrap more hardware to support more operations. The crate exposes the low-level APIs, so advanced users can use all the APIs provided by the C libraries. We also wrote some high-level abstract functions and exposed their APIs. The high-level APIs handle all the low-level details of using the hardware and only consume the necessary inputs. For example, if we want to encrypt a file with the AES algorithm, we only need to provide the key, IV, and file data. Different accelerators can be further optimized by tuning for the use case. Our wrapper crate tries to enable more features; for example, it leverages the asynchronous operation feature of QAT. The right diagram shows the asynchronous operation workflow of QAT. In step one, the application calls the perform-operation API. In steps two and three, the operation is delegated to a QAT device. Then, in steps four and five, control returns to the application immediately, so the application and QAT work in parallel. Once the operation completes, QAT informs the application through callback functions. If we take full advantage of hardware features such as QAT's asynchronous operation, we can further improve the agent's performance. This page shows the workflow of decryption in image-rs when combined with QAT. The procedure has four kinds of threads. The image-rs main thread is responsible for scheduling operations and the other threads. The daemon thread initializes QAT, and the polling thread polls for results and invokes the callback functions. Every layer has a layer worker thread, which prepares the layer data and sends it to the daemon thread. Let us go through the procedure.
Step one: image-rs launches the daemon thread. Step two: the daemon thread initializes the QAT device. Step three: the daemon thread launches the polling thread. Step four: the polling thread enters a loop to poll results and invoke callbacks. Step five: the daemon thread waits for layer data and submits it to the QAT device. Step six: all keys are retrieved. Step seven: image-rs launches a thread for every layer. Step eight: every layer thread sends its data to the daemon thread. Step nine: the layer threads exit. Step ten: image-rs waits for all layer threads to exit. Step eleven: the daemon thread waits for all callbacks to finish. Step twelve: the polling thread finishes its task and exits. Step thirteen: image-rs waits for the daemon thread to exit. Step fourteen: the daemon thread exits. At last, image-rs continues to run.

Now comes the conclusion section. Let me summarize what we talked about in this presentation. We believe it will be the trend to run workloads in containers in a trusted manner. The Confidential Containers project, which relies on hardware-based TEE techniques, was initiated to address this need. Besides security and trust, customers still have expectations for performance, even with the provided confidential container solution. In this presentation, we proposed some performance optimization methods to accelerate container boot-up in a hardware TEE environment. We provided the following approaches: parallel image download and decryption with hardware-based accelerators, and the retrieval of multiple keys for different layers in one round. Finally, we will continue our work and contribute it to the Confidential Containers open-source community in the near future. Thanks for your attention. Due to COVID-19, we cannot attend the event face-to-face. If you have any questions, you can reach me and my colleague, Haokun, through email.