Hello, everyone. Welcome again to OSS. It's Friday, it's lunchtime. Thank you for coming to our session. So my name is Ying Xiong. I'm with Futurewei Technologies. My colleague, Yulin, is here as well. We're going to present and talk about Quark, a new high-performance and secure container. Here is the agenda we have prepared for our talk. I will get started with the introduction and background, introduce ourselves a little bit, and give some background on why we're building these new Quark containers. And then Yulin will take over. He will talk about the Quark design and architecture, and he will explain the difference between the Quark container and other open-source containers. We have some performance test data comparing Quark with other open-source container projects, and Yulin will show and analyze some of that performance data. There are also some very interesting features built into Quark containers, mainly for performance optimization. Again, Yulin will talk about that, especially the RDMA communication. Hopefully at the end, we'll have time to answer questions you may have. So who are we? Again, my name is Ying Xiong, and we are part of CloudLab at Futurewei. Futurewei is an R&D organization that focuses on open research, open innovation, open standards, and of course open-source projects. Currently, I'm leading the CloudLab, and Yulin is one of the chief architects in our lab. So CloudLab, as the name suggests, builds and optimizes cloud-related open-source projects. Here I show some examples of the open-source projects we have worked on; some of them we're still working on. I won't go through each of them in detail. But I do want to mention one project called Centaurus. This is a two-year-old project focused on cloud infrastructure. It's actually a Linux Foundation project; we donated it to the Linux Foundation in 2020. So it's a Linux Foundation cloud infrastructure project.
The Centaurus project actually has four or five sub-projects, focused on different domains. Arktos is for cloud compute and Mizar is for the cloud network. We had two sessions, yesterday and the day before yesterday, Wednesday, to talk about Mizar. And then we have one project for AI and one project for edge. So please go to the website, centaurusinfra.io, if you want to know more about this project. Now, back to today's talk. So what is Quark? First, Quark is an OCI-compatible container runtime engine. OCI stands for Open Container Initiative; it's an open industry standard for container runtimes and container formats. It has two specifications: one is the runtime specification, and the other is the image format specification. So when I say Quark is compatible with OCI, it means that given a Docker image, you can run your applications from that Docker image within Quark containers. Quark also provides VM-level isolation and security, similar to Kata. For those who have not heard about Kata Containers: Kata is a secure container built on a lightweight virtual machine that feels like and performs like a container, but it's actually a lightweight virtual machine that provides strong isolation for the workload using hardware virtualization technology, like a hypervisor. So basically it is a virtual machine. Now, Quark is quite different from Kata. Yulin will talk more about the difference, and we also have test data to show the performance difference between Quark and Kata. Quark also supports transparent RDMA communication for the container network between containers. When I say transparent, I mean users don't need to change their code.
They continue to use TCP/IP sockets to communicate from one container to another, but underneath, when running in the Quark container environment, we use an RDMA connection, which provides much better networking performance than going through the whole TCP stack. Quark is also developed in the Rust programming language, which is considered a more secure, memory-safe language than other programming languages. Of course, Quark is open source; please check it out on github.com with the link provided in this slide. So you may ask why we are building another container when we have Kata containers from the Kata community, gVisor from Google, and WebAssembly. In yesterday's keynote, Matt from a WebAssembly company showcased a WebAssembly-based microservice that starts up within one millisecond. He literally said one millisecond to start up a WebAssembly application. So why do we need Quark? Before I answer that question, let's look at some of the requirements and use cases we have for high-performance and secure containers. First is the serverless requirement. We have a serverless container platform where we run customers' serverless functions and applications within a shared cluster. Due to the multi-tenancy requirement, we have to run customers' functions within a secure container so that they don't impact each other. Currently, we do use Kata as the secure container in our platform. However, like many Kata users, we encountered some performance issues. It's very heavy because it's a VM, and it takes a long time to start up. It doesn't meet our serverless platform requirements. So the team is looking for something to replace Kata. The second requirement is lightweight, low-overhead containers for edge applications. On an edge server, the resources are very limited; in our case, memory in particular is very limited.
So our edge platform team is looking for a new, lightweight container to replace the container they're using today, which is Docker. The third is high-performance container networking, mainly driven by AI and other network-sensitive applications. For the AI and machine learning training jobs running on our platforms, the communication between the training tasks has a huge impact on performance, and they are looking for high-performance container networking. So we bundled these requirements together and decided, hey, we may need a new container. But how about gVisor from Google? Well, gVisor was in fact a potential solution for us. However, gVisor's design has its own limitations and performance issues, and Yulin will talk more about this. We have test data comparing our Quark containers with gVisor and with Kata, and so far, from preliminary results, we have much better performance than gVisor. There's also news that gVisor is going to be reimplemented in the Rust language, but that's just news. Now, how about WebAssembly? Well, when Quark was started almost two years ago, WebAssembly was considered a container for browsers; it only ran in the browser. Today, as far as I know, it still doesn't support multi-threading or network IO access. So we are not sure it is actually ready for business-critical enterprise applications. But the community is working very hard to make it mainstream for cloud computing. For us, we don't think Quark and WebAssembly are mutually exclusive. Technically speaking, we can run WebAssembly code within Quark containers. But that's a whole other story, and of course it's out of scope for today's talk. We are actually talking to some of the WebAssembly companies to see whether there's potential to collaborate and integrate the two together so that Quark can support WebAssembly bytecode. So this is the basic background and introduction.
Now I'm turning it over to Yulin. He will talk more about Quark itself: the design, the architecture, and the difference between Quark containers, gVisor, and Kata. There you go. Yeah, let me go to Quark's architecture and design. Quark's goal is to provide a secure, high-performance, and Linux-compatible container runtime. For compatibility, we hope that Quark can run Linux containers just like runC, just like Docker, without changes. And actually, just like the Linux kernel, it provides a Linux system call interface, and we hope it is a Linux-compatible system call interface. It also integrates with the CNCF tools, like Docker and Kubernetes. It supports the OCI and CRI interfaces so that Kubernetes and Docker can schedule it through CRI. Another thing: unlike runC, we want it to be secure, with virtual-machine-level isolation like Kata. So, just like a Linux virtual machine, it is based on KVM virtual machine isolation. And it is written in the secure programming language, Rust. The next goal is performance. We hope we can provide a high-performance container runtime. But because Quark needs to be secure, and security is not free, if we want high performance we have to make some trade-offs. So we compromise on some of the container use scope. For example, runC is just a common Linux process: it can run on an embedded device, it can run on a mobile phone. But Quark is optimized only for cloud-native applications running in a server environment. Next, I will give more detail on how we do the optimization for cloud-native applications. Let's go to the architecture. Actually, Quark's architecture is like a Linux virtual machine. There are four layers. The bottom layer is the Linux host kernel. The next is the VMM, the virtual machine monitor. Then the guest kernel, and then the application. For a Linux virtual machine, the VMM is QEMU.
And the guest kernel is the Linux kernel, and the Linux container applications are the same Linux container applications. A Linux virtual machine's virtualization happens at the device layer, between QEMU and the guest Linux kernel: QEMU provides device virtualization. QEMU can support a Linux kernel or a Windows kernel; it virtualizes just like a real machine so that another OS kernel can run on top of it. Quark is different. In Quark, virtualization happens at the system call layer, between the Linux container application and the guest kernel. Quark can be considered a virtualized user-space OS kernel. It intercepts system calls, using the x86 syscall instruction, and based on that we do the virtualization. And unlike a Linux virtual machine, Quark's implementation includes QKernel and QVisor. QVisor is the VMM, and QKernel is the guest kernel. But different from a Linux virtual machine, QKernel knows it's running on QVisor, and QVisor knows it's running on the Linux kernel. So it's not designed to run on any real hardware; it only runs on the Linux host kernel. Based on that, we can do some optimizations, and later I can give more detail. Here, we can consider Quark as a user-space, virtualized OS kernel. Just like the common Linux kernel or the Windows kernel, it has OS components, like memory management, just like any other common kernel: it's based on page tables, it has page faults, it does copy-on-write, et cetera. It also has a task manager: a process manager and a thread manager. Quark supports the Linux process and thread semantics. It also has a file system and an IO subsystem, like a common OS kernel. For the performance part: because Quark and a Linux virtual machine both use the x86 virtualization instructions, for the CPU part there's not much difference. For the memory part, they all run with the KVM shadow page table.
So there's not much difference there either. The major performance optimization is on the IO side. The IO includes disk IO and network IO. For example, say a container application wants to send some data to a remote machine. The container application first sends the data to the guest kernel, then the guest kernel sends it to the host application, then the host application sends it to the host kernel, and then it goes to the hardware. The normal way of doing this needs a context switch between the guest and the host. The common, standard way to do that is to go through hypercalls; that's the standard terminology on the virtual machine side. Quark supports that. But hypercalls have some performance issues. A hypercall needs to do the guest-to-host context switch. That includes storing many registers into memory, including the floating-point registers; it may be more than 1 KB stored into memory. It also includes the page table switch. The cost is pretty high. Because Quark implements this kind of system call virtualization, the guest kernel and the host VMM know they are working together, so we can do some optimization there. For this kind of guest-kernel-to-host communication, we implemented another mechanism, the Quark call, or QCall. The idea is that we use a memory map to create a queue between the QKernel and the QVisor. When QKernel wants to send a request to QVisor, it just puts it on the queue. On the host side, we have a dedicated thread to take the request and process it. While the request is being processed, QKernel's virtual machine can switch to another task, because we have our own task manager. When QVisor finishes this IO request, it just puts the task onto the ready queue, and QKernel can schedule it later. Based on that, we don't need to do the context switch, so we get better performance. The next is the io_uring call.
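The QCall path just described — a shared queue, a dedicated host-side thread, and a guest that keeps scheduling instead of exiting to the host — can be sketched in a few lines of Rust. This is a toy model, not Quark's code: the names (`IoRequest`, `spawn_host_worker`, `run_demo`) are our own, the memory-mapped queue is stood in for by a channel, and the "host I/O" is simulated.

```rust
use std::sync::mpsc;
use std::thread;

// Illustrative request/completion types; the names here are ours, not Quark's.
struct IoRequest { id: u64, data: Vec<u8> }
struct IoCompletion { id: u64, bytes_written: usize }

// Dedicated host-side (QVisor) thread: drain the queue and post completions,
// so the guest never pays a hypercall/VM-exit per I/O.
fn spawn_host_worker(
    rx: mpsc::Receiver<IoRequest>,
    tx: mpsc::Sender<IoCompletion>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for req in rx {
            // Stand-in for the real host I/O (e.g. a write on a host fd).
            let bytes_written = req.data.len();
            let _ = tx.send(IoCompletion { id: req.id, bytes_written });
        }
    })
}

// Guest side: submit three requests without blocking, then collect completions.
fn run_demo() -> usize {
    let (req_tx, req_rx) = mpsc::channel();
    let (done_tx, done_rx) = mpsc::channel();
    let worker = spawn_host_worker(req_rx, done_tx);

    for id in 0..3u64 {
        // Enqueue and return immediately; a real QKernel would now schedule
        // another task instead of exiting to the host.
        req_tx
            .send(IoRequest { id, data: vec![0u8; 16 * (id as usize + 1)] })
            .unwrap();
    }
    drop(req_tx); // close the submission queue

    let total: usize = done_rx.iter().map(|c| c.bytes_written).sum();
    worker.join().unwrap();
    total
}

fn main() {
    println!("completed {} bytes", run_demo());
}
```

The key property the sketch shows is that submission is just an enqueue: the expensive part (the actual I/O) happens on another thread, so the guest-side caller never blocks on it.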
We can get better performance with the io_uring call. io_uring is the new IO infrastructure in Linux that appeared just a few years ago. It is the fastest IO mechanism in Linux now; it's faster than epoll, than poll, than select, these kinds of async IO infrastructures. The basic idea is that io_uring creates a shared memory queue between the application and the kernel. When the application wants to make an IO request, it just puts the request in the queue, and a kernel thread will take it and process it. Quark uses this too. The basic idea is that Quark maps the io_uring shared memory queue into QKernel, so that QKernel can send requests to the Linux kernel directly and the Linux kernel can process them. We bypass QVisor, so we get better performance. Furthermore, we have another IO call, the RDMA call. I'll get into RDMA in more detail later. With the RDMA call, QKernel can send data to the RDMA NIC directly, bypassing the Linux kernel, and get better network performance. This is the IO optimization; based on it, we get better performance. Here are some test results we measured earlier. Because Quark is designed to run serverless applications, one problem dimension is memory overhead. We can see Quark's memory overhead here, measured against runC: Quark takes maybe twice as much memory as runC to run a small container. gVisor takes 28 megabytes, almost twice Quark's. And Kata, because it runs a full Linux kernel, takes much more than Quark. Next is the startup latency. For serverless applications, we want to start the container as quickly as possible. We can see that runC's startup latency is about 600 milliseconds, and Quark is about 20 milliseconds more than that, so almost the same. gVisor is about 700. And Kata runs a full Linux kernel, so it's the slowest.
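Returning to the io_uring mechanism for a moment: the core of it is a fixed-size ring indexed by monotonically increasing head and tail counters. Below is a minimal single-threaded Rust sketch of that ring discipline. It is only the index arithmetic — real io_uring rings live in memory shared with the kernel and update the indices with atomic release/acquire stores, and the entries are real submission descriptors, not `u64`s.

```rust
const RING_SIZE: usize = 8; // power of two, so `index & (RING_SIZE - 1)` wraps

// A submission-queue-style ring: head and tail only ever increase, and the
// masked index picks the slot. Single-threaded here purely for clarity.
struct Ring {
    buf: [u64; RING_SIZE],
    head: usize, // consumer (kernel-side) position
    tail: usize, // producer (application-side) position
}

impl Ring {
    fn new() -> Self {
        Ring { buf: [0; RING_SIZE], head: 0, tail: 0 }
    }

    // Producer: returns false when the ring is full instead of blocking.
    fn push(&mut self, entry: u64) -> bool {
        if self.tail - self.head == RING_SIZE {
            return false; // full: would overwrite unconsumed entries
        }
        self.buf[self.tail & (RING_SIZE - 1)] = entry;
        self.tail += 1; // real io_uring makes this a release store
        true
    }

    // Consumer: returns None when the ring is empty.
    fn pop(&mut self) -> Option<u64> {
        if self.head == self.tail {
            return None; // empty
        }
        let entry = self.buf[self.head & (RING_SIZE - 1)];
        self.head += 1;
        Some(entry)
    }
}

fn main() {
    let mut ring = Ring::new();
    for i in 0..RING_SIZE as u64 {
        assert!(ring.push(i));
    }
    assert!(!ring.push(99)); // full
    assert_eq!(ring.pop(), Some(0));
    assert!(ring.push(99)); // one slot freed
    println!("ring demo ok");
}
```

Because neither side ever needs a syscall to hand an entry across, mapping such a ring straight into QKernel is what lets Quark skip QVisor on the submission path.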
We also did some throughput tests for Quark. We used Redis, etcd, and Nginx benchmarks. For Redis, we used the Redis benchmark tool; etcd has a benchmark tool literally named benchmark. The results are compared with runC. The Redis benchmark has many sub-tests, for example PING, SET, GET. The 100% here means we compare against runC's throughput: runC is 100%, and if Quark is slower than that, it's maybe 90%, et cetera. This chart shows that. We can see that for Redis, Kata and Quark are almost like runC, a little over and a little under. But gVisor is slow. For etcd, it's a similar thing: runC is much better. In some scenarios it's better than Quark, gVisor, and Kata, but in the majority of scenarios Quark is better than gVisor and Kata. The Nginx one is the most interesting. runC is much better than Quark: runC is about 34K RPS, but Quark is about 20K. gVisor is 10K, so Quark is twice gVisor, but Kata is very slow. I don't know why, but it's super slow. Since then, Quark has had some optimization done, and it has already reached about 30K; that's just 10% less than runC. But because it needs some manual configuration, we didn't put that result here. Here we can see that, very likely, runC is Quark's performance ceiling. We cannot exceed runC's performance. It is easy to explain: Quark itself is just a Linux process running on the Linux kernel, so it's hard for its performance to exceed a native Linux container's. It's almost impossible. But because Quark is designed for container applications running in a server environment, inside a data center, maybe we can find some opportunities there. One opportunity is TCP over RDMA. Let me give a brief introduction to RDMA. On the left side is normal TCP.
That's the data transfer from one machine to another. When an application wants to send data to another machine with TCP, it sends the data to the socket, the socket passes it to the TCP/IP protocol stack, the protocol stack sends it to the NIC, and then it goes over the wire to the other machine, where it goes up through the NIC, the protocol stack, and the socket to the application. RDMA is simpler. RDMA can send from user-space memory directly to remote memory, so it bypasses the Linux kernel. We get lower latency and higher throughput, and all this performance improvement comes with less CPU consumption. It is so good, but we don't use it for cloud-native applications. That's because for cloud-native applications, when we use the network, we use TCP-based RPC, like gRPC, like HTTP. And the RDMA API is different from the TCP API. So if applications want to use RDMA, they need to make code changes, and that takes time. Also, for cloud-native applications running in Kubernetes, there's more challenge, because in Kubernetes we have container network virtualization: one container has one virtual IP. When a client needs to talk to its server, it needs service discovery to find the server IP, because that is assigned at runtime; we don't know the server IP beforehand. If we want to support RDMA, we need to solve both of these issues: one is API compatibility, the other is container network virtualization. Quark tries to solve them in two steps: first, TCP socket virtualization; second, container network integration. Because Quark implements this kind of system call virtualization, it can intercept the system calls from the application. When a client application wants to send data to a remote, it uses a syscall, like send. The send syscall copies the user's buffer into the kernel buffer.
If it goes through the TCP/IP protocol stack, this kernel buffer is sent through the TCP/IP protocol stack to the remote. But with Quark, because Quark intercepts the system call, it's Quark's role to decide how to implement this send. It can use the TCP/IP stack, or it can use RDMA. With RDMA, Quark just sends this kernel buffer to a remote kernel buffer using an RDMA send, so it bypasses the whole host protocol stack on both sides. Next is the container network integration. This is our current implementation architecture. On each machine, we have one Quark RDMA service, with a shared-memory connection to all the Quark containers. A Quark container sends the data to this Quark RDMA service, and the RDMA service sets up RDMA connections to all the other nodes. Based on that, all the TCP connections are carried over these RDMA connections; multiple TCP connections can be multiplexed over one single RDMA connection, and we get better performance. We also integrate with the Kubernetes API server, so we can integrate with the Kubernetes container network. We have finished the majority of the code for this, and we're working on performance testing and function tuning for this solution. So far, initial tests show that the TCP-over-RDMA performance is better than the container network. The container network here means port mapping over the Linux bridge; we perform better than that. But the improvement doesn't yet meet our target; it doesn't exceed the container network by that much, so we are still doing performance tuning. In some scenarios, the performance is even better than the host network on a bare-metal machine, but not in all of them. But one place where it is much better than both is TCP connection setup time.
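Before getting to the connection setup numbers, the transparent-send idea above can be reduced to a small Rust sketch: the application-facing call never changes, while the runtime picks the transport underneath. Both backends here are in-memory stand-ins of our own invention (`LoopbackTcp`, `RdmaStub`) — this is not Quark's implementation and not real RDMA, just the dispatch shape.

```rust
// The application keeps calling a send()-like API; the runtime decides
// which backend carries the bytes.
trait Transport {
    fn send(&mut self, buf: &[u8]) -> usize;
}

struct LoopbackTcp { delivered: Vec<u8> } // stand-in for the kernel TCP path
struct RdmaStub { remote_mem: Vec<u8> }   // stand-in for an RDMA write to remote memory

impl Transport for LoopbackTcp {
    fn send(&mut self, buf: &[u8]) -> usize {
        // In reality: copy to a kernel socket buffer, segment, and transmit.
        self.delivered.extend_from_slice(buf);
        buf.len()
    }
}

impl Transport for RdmaStub {
    fn send(&mut self, buf: &[u8]) -> usize {
        // In reality: a one-sided RDMA write lands the bytes in the peer's
        // registered buffer with neither host kernel on the data path.
        self.remote_mem.extend_from_slice(buf);
        buf.len()
    }
}

// The intercepted send syscall: unchanged application code, runtime-chosen path.
fn intercepted_send(transport: &mut dyn Transport, buf: &[u8]) -> usize {
    transport.send(buf)
}

fn main() {
    let mut tcp = LoopbackTcp { delivered: Vec::new() };
    let mut rdma = RdmaStub { remote_mem: Vec::new() };
    assert_eq!(intercepted_send(&mut tcp, b"hello"), 5);
    assert_eq!(intercepted_send(&mut rdma, b"hello"), 5);
    assert_eq!(rdma.remote_mem, b"hello");
    println!("same call, two transports");
}
```

The application only ever sees the `send` signature, which is exactly why no code change is needed when the backend switches from TCP to RDMA.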
When we set up a TCP connection to a server, the RDMA connection is much faster. On the host network, the TCP connection latency is maybe 300 to 500 microseconds, but the RDMA connection takes just 50 or 60 microseconds. We are continuing to work on improving the performance. Next is the optimization for service mesh. Service mesh is popular now, and it adds flexibility to container programming and production maintenance. But this flexibility comes at a price: performance. So far, the major service mesh implementations are based on sidecars. That is, when the TCP client sends data to the remote server, the TCP connection is intercepted by the sidecar on its machine, then the connection is forwarded to the remote sidecar, and the remote sidecar sends the data to the remote server. The interception is based on iptables. The data transfer between the client and the sidecar is inter-process communication, which introduces a lot of memory copies; throughput is low and latency is high. It actually introduces two extra TCP hops, so the performance is bad. But Quark can do a better implementation of this service mesh. Because Quark can do system call interception, it gets the user's data before it is segmented into IP packets; it takes the data from the application directly. It can check the HTTP header, find the remote IP address, and just send it. So it can implement the sidecar logic inside the QKernel, and based on that, there's no extra hop between client and server, so we get better performance: higher throughput and lower latency. That's it for service mesh. Next is the two-stage container start process. As I mentioned, Quark is a kind of user-space virtualized OS kernel.
So we can do some optimizations that are maybe harder for the Linux kernel to implement. For example, this two-stage container start. For serverless applications, we run multi-tenant applications in a shared-resource environment, and when a tenant's traffic comes in, we want to start the container for this tenant as quickly as possible, so we get better performance. There are two ways to start a container. One is the cold start: we just start the container from zero. It takes a long time. The other way is the warm start: we start the container beforehand, put it there, and wait for the customer's traffic. It is fast, but the downside is that it holds resources even when there's no traffic. The CPU usage is zero, but because it holds CPU and memory, the customer needs to pay for that. Can we have another way to start a container, somewhere between the cold start and the warm start? That's what we do with the two-stage container start. Let's look at the container start itself. A container start has two stages. First is the sandbox setup; the sandbox here is the Kubernetes CRI sandbox definition. It includes, first, scheduling the container: choosing the node and notifying the node to start the container. The node will download the image and decompress it. Then it needs to set up the network, register with the load balancer, et cetera. Then the container needs to start: it creates the Linux namespaces, creates the cgroup, mounts the file system, and loads the ELF. After that, the sandbox start is finished, and it goes to the application start. For the application start, the Linux kernel just sets the CPU's instruction pointer register to the application's entry point.
The Linux application takes over CPU control and starts to run, including memory allocation, and then waits for customer traffic. For stage one, the memory usage is actually low, because only the sandbox consumes memory, and it's controllable. But the latency is high: it may take from 100 milliseconds to maybe 10 seconds, because sometimes the image has to be downloaded. Because Quark is a user-space OS kernel, we can do some hacks here. For example, we can start the whole sandbox and block just before the entry point, before we enter the entry point. At that time, we have another state for this container. It's not a cold start, not a warm start; it's in between. In this state, the sandbox setup is already finished, but we haven't entered the user's application. So the user application cannot consume any CPU, and it hasn't consumed any memory so far. The memory usage is just the sandbox's, about 20 megabytes. When the customer traffic comes in, we can just enter the application's entry point and start the application. So we get lower latency than a cold start. Actually, we can do it in an even more aggressive way. Because Quark also handles the page table management, including page faults — if there's a missing page, there's a page fault and Quark allocates memory for it — it knows how much memory the user's application consumes, and it can block the user's application at any time. So we can also run the user's application more aggressively: for example, we can block the user application when it has consumed a certain amount of memory, say 20 megabytes or 50 megabytes. When it consumes that much, we stop and block it there. Based on that, we can further minimize the application's start-up time.
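The "between cold and warm" lifecycle described above can be modeled as a small state machine. This is our own illustrative sketch — the state names and the 20/50 MB figures echo the talk, but none of this is Quark's actual API:

```rust
// States and thresholds are illustrative, not Quark's implementation.
#[derive(Debug, PartialEq)]
enum ContainerState {
    SandboxReady { mem_mb: u32 }, // sandbox built, blocked before the app entry point
    Running { mem_mb: u32 },      // entry point entered on the first request
    Paused { mem_mb: u32 },       // blocked again once a memory budget is hit
}

use ContainerState::*;

// First tenant traffic arrives: unblock the entry point.
fn on_first_request(state: ContainerState) -> ContainerState {
    match state {
        SandboxReady { mem_mb } => Running { mem_mb },
        other => other,
    }
}

// A page fault grew the application's memory: enforce the budget.
fn on_page_fault(state: ContainerState, grown_to_mb: u32, budget_mb: u32) -> ContainerState {
    match state {
        Running { .. } if grown_to_mb >= budget_mb => Paused { mem_mb: grown_to_mb },
        Running { .. } => Running { mem_mb: grown_to_mb },
        other => other,
    }
}

fn main() {
    // Sandbox alone costs ~20 MB; the app is not running yet.
    let s = SandboxReady { mem_mb: 20 };
    let s = on_first_request(s);
    assert_eq!(s, Running { mem_mb: 20 });
    // The app grows within budget, then hits it and is blocked in place.
    let s = on_page_fault(s, 35, 50);
    assert_eq!(s, Running { mem_mb: 35 });
    let s = on_page_fault(s, 50, 50);
    assert_eq!(s, Paused { mem_mb: 50 });
    println!("two-stage start demo ok");
}
```

The point of the sketch is that both transitions hinge on facts only the user-space kernel can see cheaply: the blocked entry point and the per-application page-fault accounting.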
The cost is that we consume the 20 or 50 megabytes of memory, but it takes less start-up time. So it just gives more flexibility to the system, to the serverless scheduler. These are some promising optimizations for Quark. That's all for my talk. Any questions? Yeah, thank you, Yulin. Hopefully we have time to open it up for any questions you may have. The Quark part was very detailed; thank you for explaining it. We have one question. I'm sorry, I'm having a hard time hearing you. Can you repeat that? Let me repeat the question. The question is: is there any operating system other than Linux that runs the Quark container? Is that right? Well, we've only tried Linux container OSes; we have not tried any other OS yet. So everything is built on the Linux OS. We are not 100% compatible with the Linux kernel. So, you know, some applications don't work with gVisor, so I'm wondering if maybe some applications just don't work with Quark containers. Okay, that's a very good question. Let me repeat the question again: will all applications run within Quark? The answer, just like with gVisor, is no. We implement maybe 70 to 80% of the system calls. That means if your application uses a system call outside that 70 or 80%, it will not run. But we believe that 70 or 80% of the system calls covers 80% or maybe 90% of applications. That's a very good question. I think I see the stop sign. One more question, if you have any? No? Okay. Thank you very much again for coming to the talk. Thank you.