Hello, good afternoon, everyone. Welcome to my session about Kata Containers. As the title says, I will compare Kata Containers with gVisor, the Google open-source project, and show some benchmark data. I will also put the slides online, so you can find all the content after the session. Most of the tests were done by my colleague, Fu Panli, who is not here today, but much of the data comes from him, so I put his name here. I am from the Kata Containers community, and this morning I showed you a list of all of our sessions from the upstream developers; this is the corrected version. After this session, at 4:20 and 5:10, we have two forums about Kata Containers, both on level three. The first is about a proof of concept, and the other is about the philosophy of container security. Tomorrow morning, if you are interested in the integration with Kubernetes, we have another forum about that, where I, Eric, and other upstream developers will discuss it. Our project update is also on level three, in the M3 room. It is a pretty small room, so if you are interested in our project update, please visit. Now, to my topic. Kata Containers is, in a way, about adding another layer of indirection. The talk has four parts: a general introduction to secure container technologies, a comparison of the architectures, the main part about the benchmarks, and, at the end, some words about the future. First, about Kata Containers. Kata Containers itself is a container runtime, just like runC. The project was initiated by us at hyper.sh together with Intel, and it is hosted by the OpenStack Foundation. I think we are the first pilot project of the foundation.
We launched the project last December, so it is now about one year old. We are at version 1.3, and 1.4 will come very soon. Many people still ask me: what is Kata Containers, is it a VM or a container? In short, it is a container runtime. It is OCI compatible, so just like runC, you can use it in any position where runC can run: put the Kata runtime there and integrate it with Kubernetes, with Docker, or maybe with your own private systems. But unlike runC, we use virtualization technology, so when a Kata container is launched, from the outside, from the host, you see a VM there. gVisor, coming from a different direction, is also an OCI-compatible container runtime. It is an open-source project by Google; the Google folks said they had developed it internally for four or five years and open-sourced it in May, at KubeCon Europe in Copenhagen. It is a user-space kernel, written in Go, that implements a subset of the Linux system call interface. Linux has more than 300 syscalls in total, and gVisor implements fewer than that. Both projects aim at what is called secure containers, and both provide two layers of isolation. The upper layer is the Linux API, so a standard container that runs on any OCI-compatible runtime should, in theory, run on either Kata or gVisor, though in practice that is not always true. And unlike runC, there is another layer of abstraction: in the middle there is a black box. For Kata, it is a VM with a kernel inside; for gVisor, it is the Sentry. Now let's go deeper into the architecture comparison. Running Kata containers is pretty easy, even if it does not look that way. You just call the Kata runtime.
Just as you would call runC, if you have Docker you can set the runtime to kata-runtime and launch it. Kata uses existing VMM technology, such as QEMU. Inside the guest there is a standard Linux kernel; we apply some configuration and a few small patches, but in general it is a standard Linux kernel. On top of it, alongside the container apps, we have an agent, and we also have shims on the host side. Around the container apps inside the VM there is the sandbox: if you run Kubernetes, that sandbox is the pod, and a pod may contain several containers. The agent manages all the container apps. gVisor looks a bit simpler. gVisor ships a single binary, which may run as different processes, but there is only one binary; I think that is a bit more friendly for users. The binary is called runsc. When you use runsc to launch a container, it starts another process called the Sentry, which is the user-space kernel; it launches the container app and does the setup work. The host kernel just catches the syscalls and forwards them to the Sentry; the Sentry processes them and returns to the container. I will go into more depth later. There is another process, called the Gofer, for disk I/O; gVisor only supports file-system I/O, and it uses the 9P protocol. The philosophy behind gVisor is this: the container may call any of the Linux syscalls, but the Sentry implements the syscalls itself, and only a subset of them is passed to the host. The gVisor developers believe that the hot paths, the syscalls that are called frequently, are proven safer than the other code paths. So they use namespaces and seccomp to limit the syscalls that reach the host kernel. They also investigated syscalls and exploits.
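As a sketch of the "set the runtime and launch it" step: Docker can register extra OCI runtimes in its daemon configuration and select one per container. The binary paths below are illustrative and depend on how Kata and gVisor were installed.

```shell
# Illustrative /etc/docker/daemon.json entry registering both runtimes:
#   {
#     "runtimes": {
#       "kata":  { "path": "/usr/bin/kata-runtime" },
#       "runsc": { "path": "/usr/local/bin/runsc" }
#     }
#   }
# After restarting dockerd, choose the runtime per container:
docker run --rm --runtime=kata  busybox uname -r   # reports the guest VM's kernel
docker run --rm --runtime=runsc busybox uname -r   # reports the version the Sentry emulates
```

The same containers run unchanged under runC, kata-runtime, or runsc; only the `--runtime` flag differs, which is what OCI compatibility buys here.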
Most of the CVEs come from syscalls like open, the ones that operate on the file system. So they separate file-system access into another process, and in this way they enhance security. That is the overall picture; now some more detailed architecture. On the Kata side, it is classic hardware-assisted virtualization. There is KVM in the host kernel, and normally a QEMU process; inside the QEMU process there is a guest kernel running in guest mode, in ring 0 and ring 3, which manages the whole VM. There is an agent inside it, and there are the application processes. The VMM provides a virtual hardware interface to the guest, so the guest kernel runs on top of the hardware instruction set. So there are two layers of isolation: the first is KVM, and the second is inside the guest, between ring 0 and ring 3. Especially for I/O, because this is hardware-assisted virtualization, we can use all the existing virtualization technologies, such as hardware pass-through, vhost, and vhost-user with DPDK for high-performance networking. We can also use the classic QEMU device model. I think almost everyone is pretty familiar with this architecture, so let's move on to the new things. gVisor is a bit different. gVisor's idea is syscall interception: it catches the syscalls, passes them to the user-space kernel to process, and then returns to the application process. Only a small set of syscalls is passed on to the host kernel, filtered by seccomp and namespaces. In this way, gVisor also has two layers of isolation. The first layer is the Sentry: it is a minimal guest kernel, a brand-new implementation in Go, so they believe it can be made safer than the big monolithic kernel.
There is also the ring protection between the host kernel and the user-space kernel, and, as I just said, file-system access is moved into another process. From the architecture, one thing you may wonder about is how the syscall interception works, because gVisor supports two so-called platforms: one is ptrace, and the other is KVM. The ptrace model was used in UML about 15 years ago, and ptrace itself is a tracing technology; in other words, it is slow. Going a bit deeper: the other platform is KVM, where gVisor uses KVM to do the syscall interception. In this mode, the Sentry maps itself into three different places: one part on the host side, which handles the interception, and the other parts located in the guest, at different places. Across these places it is still a single program; it just puts itself in three different locations so it can jump between them. In other words, KVM here is used for catching the syscalls, not for isolation; the isolation is still the same as on the previous slide. That is the gVisor architecture, and the main concern for gVisor remains the interception performance. Here is a summary of the architecture comparison, from which we can draw some expectations. First, isolation: both technologies, Kata Containers and gVisor, provide two isolation layers, so you can consider both much better than runC, which runs containers on the single host kernel; if there is a bug and an exploit is found, it breaks the kernel and gains access to everything. As for compatibility, Kata runs a general Linux kernel, so compatibility should not be a problem.
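The platform choice described above is a runtime flag of runsc. A sketch of switching a Docker-registered gVisor runtime from the ptrace default to KVM, again with illustrative paths (the KVM platform needs access to /dev/kvm on the host):

```shell
# Illustrative daemon.json runtime entry passing the platform flag to runsc:
#   "runsc-kvm": {
#     "path": "/usr/local/bin/runsc",
#     "runtimeArgs": ["--platform=kvm"]
#   }
# ptrace (the default) needs no special hardware; kvm trades that portability
# for cheaper syscall interception:
docker run --rm --runtime=runsc-kvm busybox echo ok
```

Since the isolation story is the same on both platforms, the choice is mostly a performance one, which is why the interception cost keeps coming up in the benchmarks below.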
But gVisor is a new implementation, and its authors may implement the syscalls as they think is correct, which may break some compatibility; we will test that. About performance, you might expect both to have very little overhead on CPU and partly on memory. On the other hand, because of its simpler architecture, gVisor may have a smaller footprint, but due to the syscall interception method it may suffer on syscall-heavy workloads. And yes, we have many ways to optimize. Now we come to the main part of the talk, the benchmarks, and we still have half the time. We did some functional tests, then standard benchmarks: boot time, memory footprint, CPU and memory performance, and I/O and networking performance. Then we did some real-life cases, and I leaked some results here. We actually ran more than three real-life cases, but some of them could not complete on gVisor. Maybe some syscall has not been implemented in gVisor, or maybe it is implemented with different semantics. I do not want to say gVisor implemented them wrongly; they just implement different semantics. Maybe the Linux implementation is itself wrong with respect to the original POSIX or Single UNIX semantics, but every user-space application assumes it runs on the Linux kernel, so if a kernel does not reproduce even the bugs of the Linux kernel, those applications cannot run on top of it. So I just say they are different. We also tried a Jenkins build test, but it failed at the git clone step. So we only have three results from the real-life cases. About the setup: the full test description is pretty long, and we only show the bare-metal results here. The tests ran on bare-metal servers from Packet, with E3 or E5 CPUs, I think, and for some disk I/O tests we had more powerful servers with additional disks.
The data is here. Most of the test containers have eight gigabytes of memory unless noted otherwise. For the networking tests, we used two servers playing different roles, to avoid the client interfering with the server. First, we did a syscall coverage test. I mean, the design of gVisor assumes that limiting the host to a small set of syscalls makes things more secure, and because gVisor is a fairly new open project with its code in a single repository, we can see from the code that it only makes about 70 syscalls to the host. For Kata, we collected the statistics from the QEMU process: in different cases it used about 30 to 53 syscalls. Maybe more, but normally the numbers are similar. For the memory footprint, we gave each container half a gigabyte of memory, and we had two cases. The first is launching a single container: for Kata Containers, a single container consumes a bit more than 70 megabytes, and for gVisor it is about 25, I think. Here we enabled the template for Kata; that is a method to save memory and accelerate boot. You may also enable KSM, which can save memory as well, just in a different way. If we use the template, we can share the memory of the read-only parts of the containers, so if we run about 20 containers, the average memory per container is lower, about 50 megabytes. For gVisor it is similar, about 20 to 25, I think. So the result is clear: gVisor has the smaller memory footprint, but Kata can actually do pretty well. About boot time, there are different configurations of Kata Containers; we have four, and we compare them with gVisor and runC. Because of its smaller footprint, gVisor can launch very, very fast, in under 500 milliseconds.
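A sketch of how a per-process syscall count like the Kata/QEMU figure above can be gathered: attach strace to the VMM process for a while and count the distinct syscall names in its summary table. The process lookup and the tracing window are illustrative, not the talk's exact method.

```shell
# Attach to the qemu process backing a Kata container (pid lookup is a guess
# at the process name) and record a per-syscall summary for 60 seconds.
strace -f -c -p "$(pgrep -f qemu-system | head -n1)" -o qemu-syscalls.txt &
sleep 60; kill %1

# Summary rows start with a percentage and end with the syscall name;
# count the distinct names to get the coverage figure.
awk '$1 ~ /^[0-9.]+$/ && $NF ~ /^[a-z_0-9]+$/ {n++} END {print n, "distinct syscalls"}' qemu-syscalls.txt
```

The equivalent figure for gVisor can be read directly from its source, since the seccomp allowlist it installs for the Sentry is part of the repository.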
runC, including the image-processing part, launches in a bit more than 500 milliseconds. For Kata, if we enable the factory, it is pretty fast, at a similar speed, and with the ongoing shim v2 work it is about 700 milliseconds. The vanilla, most basic configuration takes more than one second. There is some variance between runs; we did about 10 or 20 runs of each test and show the statistics here. So the result shows that with proper configuration, Kata Containers can get good boot performance, though gVisor can go even lower; they are pretty similar. About CPU and memory performance: with modern CPUs and hardware virtualization, most of these technologies deliver CPU performance at the same level; you can hardly find any difference between them, they all reach the same level in the CPU tests. For memory, this test is based on sysbench memory bandwidth. For sequential write there is some difference; for random write they are more similar. So there are some small gaps in the memory part. Here is our I/O performance test, on the systems I showed before, tested in different configurations. The test uses fio, and fio cannot run on gVisor, so only Kata did this test, a bit lonely. Compared with the host file system: for buffered I/O, forget it; but for direct I/O we got better performance, which confused me. I think you have the same question: why can Kata Containers reach even better performance than the host? Let me try to explain. For buffered I/O, when we use 9P, you know the performance of the default 9P configuration is really bad; I have to say that, even though it is the default. For production usage of Kata you should avoid 9P in most cases.
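Illustrative fio invocations of the kind used for these comparisons, covering the buffered-vs-direct and sequential-vs-random axes; file path, size, and runtime are assumptions, not the talk's exact parameters.

```shell
# Sequential write, direct I/O (bypasses the guest page cache):
fio --name=seqwrite --rw=write --bs=1M --size=4G --direct=1 \
    --filename=/data/fio.test --runtime=60 --time_based

# Big-block random read, the case that gave the surprising result:
fio --name=randread-bigblock --rw=randread --bs=1M --size=4G --direct=1 \
    --filename=/data/fio.test --runtime=60 --time_based
```

Dropping `--direct=1` gives the buffered variants; running the same job files on the host, under runC, and under Kata (9P vs passed-through disk) yields the comparison in the charts.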
But if we pass the disk through to the guest, we achieve better performance. And because the default Kata Containers kernel supports multi-queue block I/O, with some extra CPU usage we can achieve better performance than the host, which did not support that yet; most of the gain comes from the multi-queue part. That is just because we took the systems from the provider as they were. The only confusing one is the big-block random read test; I cannot explain it either, but we ran the test several times and it shows the same results. We may investigate this further. Also on I/O performance: because we cannot run fio on gVisor, we ran some dd tests on top of it instead. The blue bars are gVisor and the brown ones are Kata with 9P. gVisor actually uses 9P as well, and in some cases the gVisor 9P shows much better performance, but for smaller blocks I think it introduces too many syscalls, which brings the performance back down. One thing we have to say about 9P: natively, it is a network file-system protocol. That means even if you use direct I/O, it can only do direct I/O on the guest side, to avoid the page cache in the guest; on the host side, the server side, there is no guarantee that the page cache is avoided. So we just put the results here; they may make some sense. That is the I/O performance. Now for the networking part: from the beginning, many people wondered about the networking performance of Kata and of gVisor. Nowadays, virtualization technology lets VMs achieve throughput similar to the host; even with vhost you can get good networking throughput, and with vhost-user and DPDK you may get even better networking performance. But for gVisor, we got a not-so-good network performance.
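A dd run of the kind used in place of fio might look like this; block size and count are illustrative, not the talk's exact parameters.

```shell
# Sequential write with dd; conv=fdatasync flushes the data to stable storage
# before dd reports its throughput, so the page cache does not inflate the
# number. Add oflag=direct to bypass the guest page cache where the
# filesystem supports O_DIRECT (9P, as noted above, gives no such guarantee
# on the host side).
dd if=/dev/zero of=ddtest.bin bs=1M count=64 conv=fdatasync
rm -f ddtest.bin
```

Varying `bs` reproduces the small-block vs big-block comparison: the smaller the block, the more syscalls per byte moved, which is exactly where gVisor's interception cost shows up.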
The gVisor team said that security is the first thing they think about, and performance is not their focus, which may explain the results. But yesterday I talked about this with Samuel, and he told me he had run this test several times as well. He wanted to believe we had some wrong tests, wrong configurations, and that gVisor's network performance should not be so slow. But I think if nobody can point out what is wrong, then this is the result. Similarly, we have a real-life case with nginx; yes, in the table, this column is runC, this is Kata, and this is gVisor. Here we moved the www root of the Kata test from 9P to tmpfs to avoid the bad 9P performance. If we leave it on 9P, it is a totally different story; but if you do not use 9P, you can get pretty good performance in the Kata nginx test. For gVisor, we did not find such a way to improve the numbers. On the other hand, we can compare some details: the orange line is runC and the dark blue line is Kata. You can see runC is pretty stable across all the connections, but it looks like one or two connections in Kata get a bad result. This is something we are working on: giving every connection more equal treatment so the performance is constant across all connections. That comes later. The gVisor line is pretty smooth, but not so good. Another test is Redis. Redis has its own benchmark, and both gVisor and Kata support the test pretty well, so if you want to run gVisor and your workload is Redis, that is fine. The test shows some gap between runC and Kata, and a bigger gap between Kata and gVisor. I think the result comes from the memory performance and partly from the I/O and networking performance. So that is the Redis result, covering memory and networking.
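A sketch of the Redis case: start the server under a given runtime and drive it with Redis's own benchmark tool. The runtime name, port mapping, and request counts are illustrative.

```shell
# Redis under Kata (swap --runtime for runsc or omit it for runC):
docker run -d --runtime=kata --name=redis-kata -p 6379:6379 redis

# redis-benchmark ships with Redis; -n requests, -c parallel clients.
# Run it from the second machine to keep client load off the server,
# as described for the networking tests above.
redis-benchmark -h 127.0.0.1 -p 6379 -n 100000 -c 50 -t set,get
```

Because redis-benchmark exercises memory, networking, and a little I/O together, gaps here line up with the gaps already seen in those individual benchmarks.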
Another case is TensorFlow, but this is a pure CPU test; there is no GPU involved. The results show most of the candidates performing pretty similarly. I think this is because most of the workload runs on the CPU itself, so the different candidates do not show many differences. And here is the configuration. Now, a summary. Both Kata and gVisor have a smaller attack surface than runC and can be more secure, and their CPU performance is identical to the host and to runC. For Kata Containers, some suggestions: if you can pass a device through for I/O, pass it through; and if you can use the factory or shim v2, use them to accelerate boot and reduce the memory footprint. In most cases Kata has good throughput performance with a proper configuration, and later we will have more configuration suggestions for different users. On the gVisor side, it has faster boot performance and a lower memory footprint, but there is still room for improvement on the compatibility side for general usage. Later we may run some further tests: fine-tuned benchmarks on the networking side, NUMA-specific tests, nested-virtualization tests, different kinds of networking tests, and more real-life cases. That is the future work. On the same topic, we will have a session at KubeCon Seattle, where we will update the material, add more results, and correct anything that is wrong. So, about the two projects; oh, look at the time, it seems to have stopped. Kata is on the way to becoming more efficient and more feasible, with GPU support and other accelerators. As an open-source project, any contributions are welcome: just visit our site, and you can use Slack, IRC, and the mailing list to reach the project. And on our GitHub, you can file issues and PRs.
You can also participate in our weekly meetings. And on the gVisor side, there is another, similar project; we have the link here. Yes, time is up. Thank you, everyone. Are there any questions? No? Thank you.