Hello everyone, good morning, good afternoon, and good evening. Thanks for attending my talk today. My topic today is Kata Containers performance evaluation and optimization on ARM64. My name is Justin He, and I mainly focus on container and virtualization technology on ARM64.

Here is today's agenda. First, I will give you a brief introduction to what Kata Containers is. Then I will give you a status update for ARM64, that is, what we have done so far. Then I will go through the performance evaluation from several aspects. At last, I will introduce two simple user stories from real cases.

If you have used Docker containers, maybe you have thought about this question: how do you make the container more secure? You can drop some Linux capabilities, for example you can mount the rootfs in read-only mode. You can use SELinux and AppArmor to protect the container. You can also use seccomp to allow or disallow some system calls, but the more rules you add, the thicker the seccomp layer becomes, and the more performance overhead you get.

Kata Containers is actually a combination of, or a trade-off between, the virtual machine and the container. It is compatible with the OCI runtime spec, therefore it works seamlessly with the Docker engine. Besides, it also supports Kubernetes and CRI through CRI-O and containerd. In other words, you can select between runc and Kata Containers.

This is the Kata Containers architecture design flow chart. The kata-agent is a process running in the guest as a supervisor for managing the containers and the processes running in those containers. The kata-proxy offers access to the VM's kata-agent to both the kata-shim and the kata-runtime. Its main role is to route all the I/O streams and the signals. It connects to the kata-agent on a UNIX domain socket, and it uses yamux to multiplex the gRPC connections on that connection. The shim process runs in the host environment, handling standard I/O and the signals on behalf of the container process, which is inside the guest.

After Kata Containers 2.0, some parts of this flow chart changed a little. These are the new items in Kata 2.0: for example, with CRI, containerd, and shim v2, it removes many parts of the architecture design flow chart. It uses the Rust agent to replace the Golang agent. It uses ttrpc as a tiny gRPC implementation to replace the original gRPC library. It uses vsock to replace virtio-serial, and by default Cloud Hypervisor will replace the QEMU hypervisor. It also supports guest cgroup v2.

So what's the status of Kata Containers on ARM64? Generally speaking, it runs smoothly on ARM64. You can install Kata Containers on ARM64 in two ways: the first way is snap install, and the second is to build it from source code. To run Kata Containers, you can use the ctr command line; the sketch below shows the examples. You can use ctr image pull to pull the container image, and run a simple application with ctr run. Also, you can run Kata Containers on a Raspberry Pi 4 platform with minor changes.
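To make those install and run steps concrete, here is a minimal sketch, assuming containerd with the Kata shim v2 runtime is configured, and using busybox purely as a placeholder image:

```bash
# Install Kata Containers via snap (one of the two install paths above)
sudo snap install kata-containers --classic

# Pull an image and run a simple application under the Kata runtime
sudo ctr image pull docker.io/library/busybox:latest
sudo ctr run --runtime io.containerd.kata.v2 --rm -t \
    docker.io/library/busybox:latest demo uname -r   # prints the guest kernel version
```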
Here is the feature comparison between ARM64 and x86. As for hypervisor support, x86 supports QEMU, Firecracker, Cloud Hypervisor, and ACRN, but ACRN is not supported on ARM64. For the persistent memory feature, Kata Containers lets the hypervisor create an NVDIMM device and then mounts it as the guest rootfs; it can speed up the boot time and share the guest rootfs, and this rootfs feature is used for the container rootfs, which I will introduce in later slides. The mark for ARM64 QEMU here means that upstream QEMU has supported this feature, but we still need some time to introduce it on ARM64 for Kata Containers. Next, the VM template is a useful technique to speed up the boot time, and this feature is supported on both architectures; the Rust agent is supported on both as well. For memory hotplug, upstream has supported it on ARM64, but we will introduce it to Kata in the future. For VM vCPU hotplug, someone has posted a patch series to support it in both the QEMU and kernel communities, but it hasn't been merged yet. For the next KVM feature, the ARM64 KVM community has discussed one big patch series; after the community merges it, we can introduce it on ARM64 for Kata Containers.

virtio-fs is a shared file system that lets a virtual machine access a directory tree on the host. Unlike the existing approaches, it is designed to offer local file system semantics and performance. The low level of virtio-fs is a FUSE implementation, and the FUSE protocol is not based on a network protocol, which means faster performance and better file system semantics and capabilities. There is an independent virtiofsd daemon process, which is more secure, and you needn't maintain the device mapper. There is also a DAX mode: with DAX enabled, the host and the guest can share memory, which improves the performance; you can also bypass the guest page cache and avoid unnecessary VM exits; and the back end is in user space, which is convenient for the user to tune further. But we once observed that virtio-fs will increase the system-level memory footprint, because it uses additional shared pages, which KSM is allowed to merge. Our internal tests showed that it significantly improves file system performance in Kata compared with virtio-9p: for this chart we used a file read-write test, and the throughput increased by about 10 to 20 times across the different cases.

This is what we have done for functional feature development. First, we enabled the runtime and the Rust agent on ARM64, and we maintain the CI test subsystem. We enabled Firecracker and even Cloud Hypervisor on ARM64; Cloud Hypervisor in particular is another independent repo, and we enabled it from scratch on ARM64. We also finished the Kubernetes integration test with Kata Containers. There is a to-do list for the future, such as memory hotplug and nested virtualization.

To summarize the performance comparison between the different architectures, we chose some important aspects, such as the boot time, the binary code size, and the memory footprint. Here is the hardware and software setup info: the host, the guest, the QEMU version, and the Kata version.

This is the evaluation of boot time. I started a simple container application, exited it at once, repeated this 100 times, and then calculated the average boot time; a sketch of this loop follows below. The X axis is the boot phase relative to the starting point, and the Y axis is the time in milliseconds. We observed that the biggest gap is between "VM started" and "agent started", i.e. the guest kernel boot time plus the QEMU boot time, and the boot time might differ a little between different configurations. The total boot time of the guest kernel is not so long, but we still found something to optimize: we reduced the kernel boot time from about 117 milliseconds to 81 milliseconds.
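For reference, this is a minimal sketch of that measurement loop, assuming containerd plus the Kata shim v2 runtime, with busybox standing in for the simple application; timing with shell wall-clock around ctr is an approximation of the method described, not the exact harness we used:

```bash
# Start and exit a trivial Kata container 100 times, then average the wall time.
sudo ctr image pull docker.io/library/busybox:latest
total_ms=0
for i in $(seq 1 100); do
    start=$(date +%s%N)                                   # nanoseconds
    sudo ctr run --runtime io.containerd.kata.v2 --rm \
        docker.io/library/busybox:latest "bench-$i" true  # start, run, exit at once
    end=$(date +%s%N)
    total_ms=$(( total_ms + (end - start) / 1000000 ))
done
echo "average start+exit time: $(( total_ms / 100 )) ms"
```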
Another tunable is the systemd services, but given that most distros enable systemd, we didn't remove it the way Alpine does. From this chart, we concluded that the boot time gap on ARM64 compared with x86 is not in the guest kernel boot time.

These are the tuning items. For example, we disable the PMU initialization if the user doesn't want to use it. By default the SCSI scan mode is synchronous, and we can set it to none (the scsi_mod.scan=none kernel parameter). Also, by default ARM64 will create about 32 virtio-MMIO devices even if the user doesn't use them; we can disable them by default in the kernel configuration.

There is another, more aggressive way to speed up the boot time: the VM template. The VM template is a feature that enables new VM creation using a cloning technique. When it is enabled, the new virtual machine is created by cloning from a pre-created template, and they share the same initramfs, kernel, and agent memory in read-only mode, just like a process fork. It is expected that QEMU doesn't write anything to the guest RAM until the virtual machine starts, but ARM64 QEMU does, so there was an exception on ARM64 when we enabled the VM template feature. Actually, the ROM blob data is filled into RAM during the ROM reset. In the migration-incoming case, the ROM filling seems not to be required, since all the data has been stored in the memory backend already. So we bypassed that process and made the VM template work on ARM64.

This is the binary size comparison. Kata starts a guest with limited memory and CPU resources, hence reducing the binary size is a tuning aspect. From this chart, we reduced the Kata binaries by about 20 to 30%. We also tuned the VMM (currently only QEMU) by customizing the build configuration and by stripping the binary: the binary size was reduced by about 20% with configuration tuning, since we can cut off all the unnecessary device creation, and by stripping we can reduce the code size by a further 60%.

This is the memory footprint comparison. You can see from this chart that a possible reason why the virtual memory footprint of QEMU on ARM64 is bigger than that on x86 is the firmware device creation: there are two pflash devices taking about 128 megabytes, and they are created unconditionally. The resident memory, i.e. the physical memory, of the QEMU footprint on ARM64 is as good as or even better than on x86. That is the comparison result. This is the resident memory summary comparing the sizes of the Go agent and the kata-runtime processes; the sizes on the two architectures are close or even the same.

This is the network throughput. BBR and CUBIC are two different TCP congestion control algorithms in the kernel. BBR was set as the default, but the test results show that, at least in our local lab environment, BBR has lower performance than CUBIC. Here is the test result: by changing the algorithm, we increased the throughput from nearly 11 Gbit/s to 15 Gbit/s.

These are the performance tuning items we have done: we enabled the VM template, virtio-fs with DAX, and the persistent memory support, and we changed the default TCP congestion algorithm from BBR to CUBIC; a minimal sketch of that last switch follows below.
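This sketch uses the standard Linux sysctls for TCP congestion control; whether you apply it on the host, in the guest, or on both ends of the throughput test depends on your setup:

```bash
# Show the current and the available TCP congestion control algorithms
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control

# Switch the default congestion control from BBR to CUBIC
sudo sysctl -w net.ipv4.tcp_congestion_control=cubic
```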
Finally, I will introduce the two user cases. Baidu is the dominant Chinese search engine operator. Its Baidu AI Cloud is a complex network with huge amounts of traffic and complicated deployment scenarios; the peak traffic is about 1 billion page views per day, with 50,000 containers for a single tenant. Baidu chose to use Kata Containers after doing extensive research on secure container technologies, and determined that Kata Containers is a highly secure and practical container technology. Besides, as one of the important founders and maintainers, Alibaba uses Kata Containers in its ECS bare-metal instances plus Kubernetes as serverless infrastructure.

So that's all for today's presentation. Are there any questions?