This is Jianfeng from Ant Group. I will present this work together with my colleague, Tianyu. From the title, you can see that the work is mostly focused on gVisor and performance optimization: we did some architectural refactoring to make the performance better. The word in the title works as a verb here, meaning to minimize and speed up gVisor. A little bit about Ant Group: it is the company behind Alipay.

Our optimization target today is the container runtime, mostly the low-level container runtime, as compared with containerd, which is a high-level container runtime. Low-level container runtimes fall into several categories. The first one, which you may be most familiar with, is Linux containers, maybe the most popular category; there are also others written in Rust or in C for different scenarios. These containers basically leverage Linux kernel technologies such as cgroups and namespaces to do the isolation. In essence, they share the host kernel, so they cannot provide very clean isolation. Some people think the virtual machine is a great way to do the isolation; there was a runtime named runV that made the virtual machine behave like a container. After that, people wanted to make it even more container-like, and a lot of micro-VM runtimes came out: Kata, Firecracker, Cloud Hypervisor, and more. There is another kind of container runtime, the sandboxed container. The typical one is gVisor; another is Nabla Containers, open-sourced by IBM, which is not very active now; and there is another one written in Rust. Different container runtimes provide different features.

So what is gVisor? gVisor actually consists of three things. The first is syscall interception: all syscalls from the application processes are intercepted into gVisor's guest kernel. There are three platforms for this in upstream gVisor, such as ptrace and KVM, and this year a new platform named systrap was upstreamed. After interception, gVisor does some emulation of the Linux API to implement the syscalls for the applications; this is mostly written in Go. Finally, gVisor restricts the syscalls it makes into the host kernel for security, so that is syscall restriction. In the philosophy of gVisor, following the Lock-in-Pop idea, only the popular paths in the host kernel should be reachable.

A little bit about the history of gVisor. In 2018, Google open-sourced gVisor, concentrating on small footprint and quick startup time. Then Xu Wang shared some comments on gVisor; he actually gave a tech share with a quantitative comparison of Kata and gVisor, and they have different advantages. After that, some papers came out of academia: one basically says that the performance of gVisor is really bad, and another says that the security of Firecracker and gVisor is not that satisfactory, because according to their experiments these container runtimes can invoke much more code execution in the host kernel. After that, a startup open-sourced the Quark container, which is very similar to gVisor, but the guest kernel is rewritten in Rust to replace Go. We treat it more like a virtual machine, because it has a clear barrier between the guest kernel and the hypervisor, and it even has a mechanism named QCall to accelerate that communication. So we treat it a little bit more like a virtual machine.
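To make the syscall-interception part above more concrete, here is a minimal, illustrative sketch of ptrace-based interception in the spirit of gVisor's ptrace platform. It is not gVisor code: the traced program and the idea of merely printing the syscall number instead of emulating it are assumptions made for the example.

```go
// Minimal ptrace-based syscall interception sketch (Linux, amd64).
// A tracer stops the child at every syscall boundary, similar in spirit
// to gVisor's ptrace platform. Illustrative only, not gVisor code.
package main

import (
	"fmt"
	"os/exec"
	"runtime"
	"syscall"

	"golang.org/x/sys/unix"
)

func main() {
	runtime.LockOSThread() // ptrace requests must come from a single OS thread

	cmd := exec.Command("/bin/true") // placeholder tracee
	cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	pid := cmd.Process.Pid

	var ws unix.WaitStatus
	unix.Wait4(pid, &ws, 0, nil) // child stops at its first exec

	for {
		// Resume the child until the next syscall entry or exit.
		if err := unix.PtraceSyscall(pid, 0); err != nil {
			break
		}
		if _, err := unix.Wait4(pid, &ws, 0, nil); err != nil || ws.Exited() {
			break
		}
		var regs unix.PtraceRegs
		if err := unix.PtraceGetRegs(pid, &regs); err == nil {
			// Orig_rax holds the intercepted syscall number; a real sandbox
			// would emulate the call here instead of letting it run natively.
			fmt.Println("intercepted syscall", regs.Orig_rax)
		}
	}
}
```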
Around that time, there was some architectural refactoring in gVisor, like VFS2 and lisafs. In 2022, Falco added threat detection for gVisor, so the security model is not only to prevent container escape but also to discover abnormal behavior inside the container. Then last year, Google Cloud Run introduced a second runtime besides gVisor: they originally used gVisor as the container runtime, but they adopted another, micro-VM-based container runtime as well. This generated a lot of concern in the community: is gVisor still a workable approach? That is the question most people probably want to figure out. After that, actually this year, we have seen a lot of active development in gVisor, with some nice optimizations: the TCP/IP netstack, directfs, the systrap platform, and the rootfs overlay.

So the feedback from both industry and academia has actually been very pessimistic about this kind of technology. We have invested in this direction for more than five years, so today we are going to talk a little bit about why we still invest in it. Both sandboxes and micro-VMs provide stronger isolation than the native container, but at a cost, and they have different pros and cons. Sandboxes have native resource elasticity, because to the host kernel they are just processes: every freed memory page is returned to the host kernel immediately, and the same applies to CPU, so this is very elastic. Another good thing about gVisor is that it is easy to customize and allows full-stack optimization. gVisor has about 160k lines of code in total, so we can optimize everything from the vDSO to the guest kernel to the container runtime; we can do a lot of things. We treat it as a transformer, actually: it can even be turned into a unikernel, so we think the architecture is very flexible. But the disadvantages are also obvious: one is performance, the other is compatibility. Let me address compatibility first. Anything that tries to emulate Linux is not Linux: gVisor emulates about 270 syscalls, with limited parameters, while Linux has about 350 syscalls in total. So the compatibility issue cannot be totally solved in this direction.

The other direction is the micro virtual machine. First we need to state that there is a broad spectrum of implementations, with different guest kernel configurations, different devices, and different hypervisors, so there are many possible configurations. The micro-VM has the benefit of relatively good compatibility, because the guest kernel is still Linux. Another thing is the great ecosystem: the Linux guest kernel, then the paravirtualized device ecosystem, then the hypervisors; there are a lot of VMMs, and these projects are shared by different micro-VM efforts. The third thing is predictable overhead: you can just switch from a normal container to a micro-VM without many surprises. But it also has a disadvantage: it is hard for most people to tune, because there are too many options and too many configurations, and most of the optimization leads back to the paravirtualized devices, so the guest and the host have to be matched.

So our work today is trying to reach this goal: can we provide better performance than the normal container, that is, runc?
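As a small illustration of the "native resource elasticity" point above: a process-model sandbox can return freed pages to the host immediately with madvise. This is only a sketch of the general mechanism, not the actual gVisor memory manager; the sizes and the flow are made up for the example.

```go
// Sketch of returning freed memory to the host kernel immediately,
// the mechanism behind the "native resource elasticity" of a
// process-model sandbox. Illustrative only.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 64 << 20 // 64 MiB of anonymous memory
	mem, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	for i := range mem {
		mem[i] = 1 // touch the pages so they are actually backed
	}

	// When the guest frees this range, the sandbox can tell the host at
	// once; the host reclaims the pages without waiting for a balloon
	// device or a guest-kernel reclaim cycle, unlike a typical micro-VM.
	if err := unix.Madvise(mem, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	fmt.Println("pages returned to the host kernel")
	unix.Munmap(mem)
}
```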
This is what we have done over the past few years; we will go through it one by one.

First, syscall interception is a real cost. The structural cost of syscall interception is really large, even with CPU virtualization, where we push everything into the guest kernel. As you can see on the right side, we have a getpid benchmark. A native syscall takes about 60 nanoseconds, and even with KPTI it only needs about 200 nanoseconds. But for gVisor, looking at the next three bars: ptrace is really bad, and the recently upstreamed systrap needs about one microsecond even in the good case. For CPU virtualization we use KVM to accelerate this, and the KVM platform still needs about 800 nanoseconds, which is too much. To minimize the cost as much as possible, we made some changes to the architecture. First, we avoid the CR3 register writes, which cost about 200 nanoseconds. Second, we do syscall pre-routing to bypass the whole Sentry; by the way, the Sentry is the guest kernel of gVisor. What do we get? First, we get down to about 600 nanoseconds, and if we pre-route the syscall so that it goes directly into the syscall emulation code, we can get to about 100 nanoseconds. So the cost becomes low.

A little more about the CR3 register writes: why do we need them in the first place? In the original design, the Sentry and the application have conflicting address spaces: both are laid out in the lower half of the address space, so we need different page tables for them. We try to merge them into one, so we split the lower half of the address space into two parts, putting the Sentry in the upper part and the application in the lower part. The 64-bit address space is very big, so we can do this.

Even though we use virtualization to accelerate syscall interception, it slows down the sandbox syscalls. A sandbox syscall is a syscall the guest kernel needs to make into the host kernel, and that path is slowed down. On the right side there is a number: for CPU virtualization with the KVM platform, we need about six microseconds for that. This path is not that frequent, but it is still too heavy for us. So we introduced a platform named NVM, a null virtual machine; it is based on Dune and is much lighter than KVM. The syscall path from the sandbox to the host is then reduced, because we no longer need the expensive transition out of guest mode that the original path required, and it can be brought down to about 4,000 nanoseconds.

The previous slides showed how we reduce the cost of syscall interception, but it is still a cost. Then we need to do the syscall emulation: how can we do that much more efficiently? We need to implement the hot paths in very efficient ways; the hot paths include scheduling, networking, and logging, that is, file writes. Basically we treat the Sentry as a syscall router, and there are several paths through it. The first path is the vDSO: those calls do not take the Sentry in their path at all, they stay in user space. The second path is implemented entirely in the Sentry. The third and fourth paths are very heavy: in the third path, the Sentry needs the help of the host kernel to finish the work, and the fourth path involves an IPC to make it work. To make this more efficient, first we bypass the Sentry where possible with the syscall pre-routing noted previously.
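The per-syscall latency numbers quoted above come from getpid-style micro-benchmarks. A minimal sketch of such a measurement follows; run it natively, under runc, and under gVisor with different platforms to reproduce the comparison. The iteration count is arbitrary.

```go
// Tiny getpid latency micro-benchmark, the kind of measurement behind
// the per-syscall numbers quoted above (~60 ns native, ~1 µs systrap,
// and so on). Run natively and under different gVisor platforms.
package main

import (
	"fmt"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	const iters = 1_000_000
	start := time.Now()
	for i := 0; i < iters; i++ {
		// Getpid is a real syscall here (Go does not cache it like glibc),
		// so every iteration pays the full interception cost.
		unix.Getpid()
	}
	elapsed := time.Since(start)
	fmt.Printf("getpid: %d ns/syscall\n", elapsed.Nanoseconds()/iters)
}
```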
Second, we rewrite some components in C and Rust: Go is very efficient for coding, but not that efficient in execution. Third, we do device passthrough to bypass the host kernel. Why does the Sentry need the help of the host kernel here? Because there is no device driver in the guest kernel, in the Sentry, so it must leverage the host kernel driver to send a packet out of the machine. With device passthrough it is no longer necessary to involve the host kernel to finish the job, so this helps us eliminate the third path.

And last, we want to eliminate the Gofer. Let me go into some detail here. Why do we need a Gofer in the first place? The Gofer proxies the I/O syscalls for security, but IPC is poor on the performance side, and the current multi-process model is too heavy for function-as-a-service; we will explain in later slides why we think multi-process is bad. And in containers, the rootfs and bind mounts are the most challenging obstacles to removing the Gofer. Sorry, let me add one thing: the Gofer exists to avoid the open syscall going directly into the host kernel, because gVisor considers open too dangerous.

So how do we do it? We want the Sentry to contain the file system I/O, making it gofer-less. First, we introduce EROFS for the rootfs, with a snapshotter. Second, we use a raw-file-backed tmpfs as the upper layer; a similar approach was also adopted by the gVisor project in the first half of this year. But we think that is not enough: we need asynchronous I/O to keep dirty-page reclaim as infrequent as possible. Third, directfs with pre-donated mount-point file descriptors completely removes the Gofer process.

So this is the new architecture, following the earlier statement that we still treat the Sentry as a syscall router. First, you can see the pink path: we have syscall pre-routing to bypass the whole Sentry and go directly into the reimplemented networking and storage parts, and these syscalls can send packets directly to a virtual function on the SmartNIC, so the whole path involves neither Sentry code nor host kernel code. The second one is the yellow path: upper-layer writes are implemented with a tmpfs, and in the host kernel there is a raw file backing it, which is very similar to the raw file in the virtual machine scenario. And third, the lower-layer files are on a user-space EROFS, handled by the user-space snapshotter. There are some other optimizations, like making the I/O for logging better by introducing something like buffered I/O, and we also do some optimization in the Go runtime, like scheduling and memory management.

Here are some numbers picked from our evaluation. There are four kinds of workloads; as you can see, they are all network-intensive workloads, so we do a little better there. But indeed, we can provide better throughput than the normal container. If you are familiar with kernel-bypass technologies like DPDK or RDMA, you may expect something like a 10x improvement, but this does not work that way; the gains are of the magnitude shown here.
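Before moving on, here is a hedged sketch of the "pre-donated mount-point file descriptor" idea behind the gofer-less, directfs-style setup described above: a trusted setup step opens the rootfs once, and later file access goes through openat relative to that descriptor. The paths and the exact escape-prevention policy are placeholders, not the real runsc implementation.

```go
// Sketch of the "pre-donated mount-point FD" idea behind a gofer-less,
// directfs-style setup: a trusted setup step opens the container rootfs
// once, and all later file access goes through openat() relative to that
// descriptor, so the sandbox never opens absolute host paths itself.
// Paths and policy details are illustrative only.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Done once by the trusted runtime before dropping privileges:
	rootFD, err := unix.Open("/var/lib/containers/rootfs", // placeholder path
		unix.O_PATH|unix.O_DIRECTORY|unix.O_CLOEXEC, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(rootFD)

	// Later, inside the sandbox, opens are always relative to rootFD.
	// O_NOFOLLOW (plus RESOLVE_* flags with openat2 in a real system)
	// keeps the lookup from escaping the donated subtree.
	fd, err := unix.Openat(rootFD, "etc/hosts",
		unix.O_RDONLY|unix.O_NOFOLLOW|unix.O_CLOEXEC, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)
	fmt.Println("opened file relative to the pre-donated mount FD:", fd)
}
```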
Okay. My colleague Tianyu will now introduce the security trade-offs of this architectural change and some scenarios.

Hello, everyone. My name is Tianyu. Please allow me to introduce the trade-offs we have made in gVisor and the adoption of gVisor in the production environment of Ant Group. First, let's take a look at the trade-offs we have made in gVisor.

In this part, we mainly focus on the security concerns in the guest kernel, the Sentry, and the host Linux kernel. First, let's go through some basic ideas of how to ensure security for a sandboxed container. First, we should use the popular and stable code paths in the host Linux kernel rather than unpopular or newly merged code; for more details, please refer to the Lock-in-Pop paper in link 1. Second, we should maintain a seccomp allowlist for the container, restricting it to only invoke the system calls on that list. Third, we should minimize the amount of host Linux kernel code that can be reached by the sandbox during its lifecycle, in order to make it less likely to trigger kernel bugs. Researchers have shown that both Firecracker and gVisor reach more host kernel code than a native Linux container, although with different frequencies.

For our production use, we needed to modify gVisor to balance security while optimizing performance. We had to drop some security features and introduce less secure ones to achieve significant performance improvements. Specifically, we removed kernel page-table isolation (KPTI), allowing the guest kernel Sentry and the guest user application to share the page global directory (PGD), thereby avoiding the overhead of writing to the CR3 register when switching between guest user mode and guest kernel mode. At the same time, we introduced C and Rust to accelerate the hot paths in the guest kernel Sentry, such as transferring network packets. We further reduced the seccomp list to minimize the attack surface from the Sentry to the host kernel: we used a static analysis approach to identify all possible system calls that the Sentry may invoke and filtered them down to a minimal set based on our own workloads. Furthermore, we divide the sandbox lifecycle into two stages, initialization and runtime; the system calls that are only needed in the initialization stage are removed after initialization has finished. We believe that gVisor still retains a good level of security after these changes, relying on its multi-layer defense-in-depth model.

Next, I will introduce two important use cases of gVisor in our production. The first one is the container runtime for long-lived containers. We have implemented mandatory access control (MAC) in the Sentry, making it possible to audit the behavior of tasks inside the container and control them through user-defined rules. We still rely on gVisor's defense-in-depth model to keep it difficult for an attacker to escape from the container sandbox and compromise the host Linux kernel. As shown in the picture, we introduced an additional layer of security because we are using device passthrough in our network stack. In addition to this vertical defense, we also implemented a horizontal defense measure, network access control: we introduced it both in the Sentry and in the network stack, making it possible to allow or drop network packets by ACL rules.

As for the timeline, we began to use gVisor as the container runtime for core applications in December 2018. It cost 52% more CPU utilization than runc, which is the basic Linux container without any guest kernel. Half a year later, in July 2019, the overhead of gVisor had dropped to 10% compared to runc. And two years after that, in November 2021, the CPU usage of gVisor on the core applications was less than runc. So you can see the modified gVisor has beaten runc in both security and performance in our production environment.
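The two-stage syscall filtering described above (a wider set during initialization, a minimal set at runtime) can be sketched with stacked seccomp filters, as below. This assumes libseccomp and the libseccomp-golang bindings are available; the allowlists are purely illustrative, and a real list must also cover everything the Go runtime itself needs.

```go
// Sketch of two-stage syscall filtering: seccomp filters stack, so a
// second, stricter filter loaded after initialization removes syscalls
// that were only needed during startup. Allowlists are illustrative.
package main

import (
	seccomp "github.com/seccomp/libseccomp-golang"
)

func load(allowed []string) error {
	// Anything not on the list fails with EPERM instead of reaching the host.
	filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(1)) // 1 == EPERM
	if err != nil {
		return err
	}
	for _, name := range allowed {
		sc, err := seccomp.GetSyscallFromName(name)
		if err != nil {
			return err
		}
		if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
			return err
		}
	}
	// Note: a real multi-threaded runtime must also sync the filter to
	// all of its threads.
	return filter.Load()
}

func main() {
	// Stage 1: the wider set needed while the sandbox initializes.
	initStage := []string{"read", "write", "mmap", "openat", "close",
		"futex", "clone", "exit_group"}
	// Stage 2: the minimal runtime set; "openat" and "clone" are dropped.
	runtimeStage := []string{"read", "write", "mmap", "futex", "exit_group"}

	if err := load(initStage); err != nil {
		panic(err)
	}
	// ... perform initialization here ...
	if err := load(runtimeStage); err != nil { // filters stack: both still apply
		panic(err)
	}
	// ... serve the workload with the reduced syscall surface ...
}
```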
The second use case is the runtime for short-lived function instances, which are the basic unit in function-as-a-service. In order to reduce the cold-start time of a function instance, we modified the way a container is created in gVisor: we switched from the original way, which is fork and exec, to a simple clone system call from a seed sandbox. The seed sandbox is pre-created; it contains an initialized guest kernel Sentry, network stack, and user-space language runtime. We added a clone interface into the runtime for the Sentry to call after the seed sandbox has finished its initialization. Inside this interface, the seed sandbox stops the world, in the Go runtime sense, and reclaims all its threads; it finally becomes a single-threaded task waiting to call the clone system call. Each function-instance creation request causes this seed sandbox to invoke the clone system call, and the new task recovers into a multi-threaded task and starts the world to run the guest user application. As Linux copy-on-write applies to such a clone operation, the new instance consumes less memory, since it shares most of its pages with the seed sandbox. Thanks to this method, the modified gVisor outperforms existing industry-ready fast container runtimes, such as upstream gVisor, Firecracker, and RunD, in both cold-start time and memory usage: we can create a ready-to-work function instance within 7 milliseconds, consuming less than 1 megabyte.

Now, please welcome my colleague Jianfeng to summarize this presentation. Thank you.

So, some takeaways from this presentation. Through architectural refactoring, we think gVisor can provide much better performance than runc, at least for some scenarios. gVisor can be used at scale for both long- and short-lived workloads. And last, micro-VMs and sandboxes are complementary ways to provide stronger isolation. We are trying to upstream the work mentioned above to the gVisor project. Thank you. We can take questions now, if you have any. Yes, please.

Audience: Hello, thank you for the presentation. Do those 7 milliseconds shown there refer to a cold start, or is it warmed up?
Speaker: By cold start we mean it is ready to work: the initialization has already finished after the start, so it can take requests immediately.
Audience: Okay, thank you.

Audience: Hi, thanks for the presentation. Is it already an OCI-compliant runtime that I can try out with containerd to launch the sandboxes, or how does it work? Did you modify runsc, or is it a separate runtime now, for plugging into containerd?
Speaker: Could you repeat that?
Audience: gVisor uses runsc, right, to launch the sandboxes, and you have to configure containerd with that. So is it already possible to make it work with NanoVisor, or how do you use it right now?
Speaker: Yes, we already use it that way for both scenarios.
Audience: So did you modify runsc, or is it a separate runtime?
Speaker: Oh, it is still runsc. We do not want to split the project; it is actually still gVisor, and we are trying to upstream all of this work back to gVisor.
Audience: Okay.

So, I think that's all. Thank you very much.