Hi, folks. Welcome to the last topic of the container track today. I'm Bo Chen from Alibaba Cloud, and I'm one of the maintainers of the DADI open source project.

Hello, everyone. I'm Haibin from Intel. I'm a DADI committer, and I'm now focused on cloud-native software support for Intel accelerators.

It is our pleasure to show you our recent work at the Open Source Summit. This talk is about how to accelerate remote container images with Intel technology, and Haibin and I will present in turn. So let's get started.

The first thing I'd like to introduce is DADI. DADI means "Earth" in Chinese, and it has been widely deployed in Alibaba Group and Alibaba Cloud. The overlaybd component of DADI is a remote image format based on block-level layering. The concept of overlaybd is similar to overlayfs, but it does things in a different way; in later slides we'll explain why we chose block instead of file system. In 2021 we published two papers at the USENIX Annual Technical Conference: the first introduced the basic design of DADI, and the second demonstrated a large-scale function compute scenario enabled by DADI. Next, please.

In terms of open source, we used to incubate the DADI projects by ourselves, but since last year containerd has adopted our two core projects, accelerated-container-image and overlaybd. The first is basically a snapshotter plugin together with an image converter and build toolkit; the second specifies the remote image format and includes the implementation of the storage backend. You're welcome to check out these two projects, especially if you're a cloud provider willing to accelerate image distribution for your customers. Next, please.

Why block? I think the first and most significant reason is that block is simple, and simplicity almost always brings high reliability. Implementing a remote, bug-free and POSIX-compliant file system is challenging, while the industry already has decades of experience with block. The second reason is that block requires fewer kernel dependencies. Some fancy FS implementations require a recent kernel, and on old kernels overlayfs sometimes hits the cross-layer reference problem, which is to say that modifying a large file is slow, because the kernel needs to copy the entire file up to the upper layer.

Another advantage is from the perspective of prefetch. As we know, the prefetch mechanism warms the local cache, including the page cache and file cache, to accelerate container startup. Google's stargz format does prefetch at file granularity, but prefetching at block level is preferable, because block I/Os are much easier to capture and understand, and relatively easier to replay, and block has better granularity compared to the FS (a rough sketch of this record-and-replay idea appears at the end of this section).

Besides, block also has many interesting features. For example, native writable layer support: you no longer need to reserve disk space on the host for the overlayfs upper layer. Multiple file system support means that you can now choose your own FS type. Some programmers may have encountered a situation where their code works on ext4 but breaks on XFS or another file system; by specifying your desired FS type in your own image, you can now eliminate this situation. What's more, Windows containers might become feasible, considering there is still no good equivalent of overlayfs on Windows.
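To make the record-and-replay point concrete, here is a minimal sketch, not taken from DADI itself, of what block-level prefetch replay can look like: a trace of (offset, length) pairs captured during an earlier container startup is replayed with pread() against the block device to warm the cache. The trace format and file names are hypothetical.

```c
/* Hypothetical sketch: replay a recorded block I/O trace to warm caches.
 * Assumes a text trace of "offset length" pairs from a previous run. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <block-device> <trace-file>\n", argv[0]);
        return 1;
    }
    int dev = open(argv[1], O_RDONLY);
    FILE *trace = fopen(argv[2], "r");
    if (dev < 0 || !trace) { perror("open"); return 1; }

    unsigned long long off, len;
    char *buf = NULL;
    size_t cap = 0;
    /* Each replayed read pulls the referenced blocks into the cache
     * ahead of the container's real accesses. */
    while (fscanf(trace, "%llu %llu", &off, &len) == 2) {
        if (len > cap) {
            buf = realloc(buf, len);
            cap = len;
        }
        if (buf && pread(dev, buf, len, (off_t)off) < 0)
            perror("pread");
    }
    free(buf);
    fclose(trace);
    close(dev);
    return 0;
}
```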
At last, block can be integrated with EBS, where we can sink the I/O into the underlying cloud disk to get better performance. After all of this, I just want to say that I'm not here to judge or deny the importance of overlayfs. It's absolutely a fine solution for local image storage, but when it comes to remote images, I'd suggest you consider a better alternative such as overlaybd. Next, please.

As you can see here, the dark boxes are the four major components of DADI. The first one, overlaybd, is essentially a log-structured format that merges image layers into a sequential block view. TCMU is responsible for exporting the virtual block device. The third part is the caching library; it can cache remote image blocks from the registry or a P2P system at small granularity. The last one is a high-performance streaming decompression module called ZFile, and this module also calculates checksums to verify data integrity. Next, please.

Our cooperation with Intel is on this ZFile. Why do we want to accelerate ZFile? For a traditional tar image, which is basically an OCI v1 tarball, the snapshotter has to download all the blobs, verify and decompress them together, and the container runtime then performs random access on those local files. For a remote image like overlaybd, however, the runtime is actually doing preads on a virtual block device. After overlaybd's translation, a bunch of HTTP range requests are sent simultaneously to multiple blobs in the registry (a minimal illustration of such a range request appears below), and finally all the reads are merged into the sequential block view. Of course, that last step, merging into the sequential view, belongs to overlaybd rather than ZFile. In a word, all the I/O happens on the fly, so it's not hard to imagine that DADI is very sensitive to I/O latency, and we definitely want a faster ZFile. On the other hand, the CPU workload has changed: during the container startup of a remote image, the workload now consists of checksumming and decompression as well as the user application itself, so CPU offloading becomes meaningful. Next, please.

Let's take a look at how we originally implemented ZFile. By design, ZFile has to be seekable, so that decompression or CRC calculation can be done at any position in the file. We chose the chunk size to be equal to the sector size of the block device. We use LZ4 instead of gzip as the compression algorithm because it is considerably faster, and the CRC function comes from a GCC built-in, enabled through the SSE 4.2 instructions. A sketch of this per-chunk design also follows below.
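As a hedged illustration of the read path described above, here is a minimal sketch with libcurl of fetching one block extent from a registry blob using an HTTP range request; in the real overlaybd path many of these are issued concurrently. The URL and the extent values are hypothetical.

```c
/* Hypothetical sketch: fetch one block extent from a registry blob with an
 * HTTP range request, the kind of read the overlaybd path fans out into.
 * Build with: gcc -O2 range.c -lcurl */
#include <curl/curl.h>
#include <stdio.h>

/* In the real path the bytes would land in the block cache; here we
 * simply discard them. */
static size_t sink(void *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)ptr; (void)userdata;
    return size * nmemb;
}

int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    unsigned long long off = 1ULL << 20, len = 4096; /* hypothetical extent */
    char range[64];
    snprintf(range, sizeof(range), "%llu-%llu", off, off + len - 1);

    /* Hypothetical blob URL; a real client resolves it from the manifest. */
    curl_easy_setopt(curl, CURLOPT_URL,
                     "https://registry.example.com/v2/app/blobs/sha256:...");
    curl_easy_setopt(curl, CURLOPT_RANGE, range); /* byte range "start-end" */
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, sink);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(rc));
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```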
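And as a hedged illustration of the seekable per-chunk design, here is a minimal sketch, not ZFile's actual code, of serving a read at an arbitrary offset: an index maps each fixed-size uncompressed chunk to its compressed extent, the one covering chunk is decompressed independently with LZ4, and its checksum is verified with the SSE 4.2 CRC32 instruction. The structure layout, names, and chunk size are assumptions.

```c
/* Hypothetical sketch of a seekable compressed format in the spirit of
 * ZFile. Build with: gcc -O2 -msse4.2 sketch.c -llz4 */
#include <lz4.h>
#include <nmmintrin.h>  /* _mm_crc32_u8, SSE 4.2 */
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 4096u          /* uncompressed chunk size; an assumption */

struct chunk_entry {              /* one index entry per chunk */
    uint64_t comp_offset;         /* offset of the compressed chunk */
    uint32_t comp_size;           /* compressed size in bytes */
    uint32_t crc;                 /* CRC32-C of the uncompressed chunk */
};

static uint32_t crc32c(const uint8_t *buf, size_t len)
{
    uint32_t crc = ~0u;
    for (size_t i = 0; i < len; i++)
        crc = _mm_crc32_u8(crc, buf[i]);
    return ~crc;
}

/* Decompress and verify only the chunk covering `offset`, then copy the
 * requested bytes out. Returns bytes copied, or -1 on a corrupt chunk. */
int read_at(const uint8_t *file, const struct chunk_entry *index,
            uint64_t offset, uint8_t *out, uint32_t len)
{
    const struct chunk_entry *e = &index[offset / CHUNK_SIZE];
    uint8_t chunk[CHUNK_SIZE];

    /* Seekable: no need to decompress the whole stream from the start. */
    int n = LZ4_decompress_safe((const char *)(file + e->comp_offset),
                                (char *)chunk, (int)e->comp_size, CHUNK_SIZE);
    if (n < 0 || crc32c(chunk, (size_t)n) != e->crc)
        return -1;

    uint32_t in_chunk = offset % CHUNK_SIZE;
    if (in_chunk >= (uint32_t)n)
        return 0;                      /* read starts past chunk's data */
    if (in_chunk + len > (uint32_t)n)
        len = (uint32_t)n - in_chunk;  /* clamp to this chunk */
    memcpy(out, chunk + in_chunk, len);
    return (int)len;
}
```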
Then we saw the opportunity to work with Intel folks on the upcoming SPR platform, where we were told that hardware can be used to accelerate both CRC and decompression, so that the CPU load is reduced. So next, Haibin will give us more details about how to accelerate ZFile using Intel technology. Haibin, please.

Okay, thanks, Bo. Now I will introduce DSA and how we optimize the CRC with it. First of all, DSA is the Intel Data Streaming Accelerator, an accelerator built into SPR that performs common data-moving operations. The DSA and the CPU cores are connected inside the socket. With DSA we can copy data like a DMA engine, for example from the network card to memory. But DSA has an additional capability: it can compute a CRC checksum on the data while moving it and check whether the checksum is right or not. If it is not right, the data can be dropped directly, which reduces memory occupation; if the checksum is right, the data is copied from the network card into memory. And while transporting data from the network to memory, DSA can apply other functions as well, such as DIF checking or dualcast.

DSA has four classes of operations: move, fill, compare, and cache flush. The move operation has four types. The first is memory move, which transfers data from a source address to a destination address; the source and destination regions can be anywhere in addressable memory. The second is CRC generation, which generates a CRC checksum while transferring the data. The third is DIF, the Data Integrity Field operation, which can check, insert, strip, or update DIF tags as data is transferred. The fourth is dualcast, which copies data simultaneously to two destination locations. Fill fills a memory region with a fixed pattern. The compare operation also has several types: comparing two memory regions, comparing a memory region against a fixed pattern, and the delta-record create and merge functions. Pattern-zero detect is a special case of compare where, instead of a second input buffer, an 8-byte pattern is specified, which may be zero or any other value. The last one is cache flush, which is useful for evicting the cache lines in a given address range from all levels of the CPU cache.

Now let us see how it works inside the DSA. It consists of several work queues and engines, whose composition can be configured through a global configuration. Work queues are on-device storage holding the descriptors that have been submitted to the device. An engine is an operational unit; any engine in a group may be used to process a descriptor posted to any work queue in the same group. The next slide has an example of such a combination. Each descriptor at the head of a work queue is available to be dispatched by the group arbiter to an available engine in the group. There is one arbiter per group, and each group arbiter dispatches descriptors from the work queues in its group according to their priority. This slide gives an example of mapping work queues onto engines: group 0 is composed of work queue 0 and work queue 1 together with engine 0 and engine 1, so engine 0 can process descriptors from work queue 0 and work queue 1, and engine 1 can do the same.

Now let us look at the software architecture. The software has two parts: configuring DSA and using DSA. First, to configure the groups, work queues, and engines, we use the accel-config application. After configuring DSA, we can use the libaccel-config library or DML to submit tasks to DSA; with libaccel-config we submit one task at a time, while with DML we can submit many tasks in batches.

After using DSA to accelerate the CRC calculation, we ran some performance tests, and this is the performance data. First, as we know, the CRC calculation accounts for roughly ten percent of the overall computation. With DSA, CRC computing performance improved about five times, comparing one SPR core against one DSA instance. And after offloading the CRC calculation, the CPU no longer has to compute CRC itself.

On project status: we have fully enabled DSA through the libaccel-config library and pushed the work to the community on GitHub, and we have also enabled DSA in overlaybd for CRC calculation; that code has been merged into overlaybd. Usage guides can be found in the libaccel-config, DML, and overlaybd repositories.
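To make the work-queue and descriptor model concrete, here is a minimal, heavily simplified sketch, not the merged overlaybd code, of submitting a CRC generation descriptor from user space through the kernel idxd driver. It assumes a dedicated user-type work queue already configured with accel-config and exposed at /dev/dsa/wq0.0, and a platform with shared virtual memory so the device can use the process's pointers directly; the path and sizes are assumptions.

```c
/* Hypothetical sketch: offload CRC generation to DSA through a dedicated
 * user-mode work queue. Build with: gcc -O2 -mmovdir64b dsa_crc.c
 * Assumes a WQ configured via accel-config and exposed at /dev/dsa/wq0.0. */
#include <fcntl.h>
#include <immintrin.h>    /* _movdir64b, _mm_pause */
#include <linux/idxd.h>   /* dsa_hw_desc, dsa_completion_record, opcodes */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dsa/wq0.0", O_RDWR);
    if (fd < 0) { perror("open wq"); return 1; }

    /* Map the work queue's submission portal. */
    void *portal = mmap(NULL, 0x1000, PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
    if (portal == MAP_FAILED) { perror("mmap portal"); return 1; }

    static char data[4096] = "hello dsa";
    /* Completion record must be 32-byte aligned and zeroed. */
    static struct dsa_completion_record comp __attribute__((aligned(32)));
    struct dsa_hw_desc desc = { 0 };

    desc.opcode = DSA_OPCODE_CRCGEN;
    /* Ask for a completion record so we can poll for the result. */
    desc.flags = IDXD_OP_FLAG_CRAV | IDXD_OP_FLAG_RCR;
    desc.src_addr = (uintptr_t)data;
    desc.xfer_size = sizeof(data);
    desc.completion_addr = (uintptr_t)&comp;

    /* Dedicated WQ: submit the 64-byte descriptor with MOVDIR64B. */
    _movdir64b(portal, &desc);

    while (comp.status == 0)          /* spin until the device finishes */
        _mm_pause();

    if (comp.status != DSA_COMP_SUCCESS) {
        fprintf(stderr, "DSA error, status 0x%x\n", comp.status);
        return 1;
    }
    printf("crc = 0x%llx\n", (unsigned long long)comp.crc_val);
    return 0;
}
```

Libraries such as DML wrap this descriptor handling, batching, and completion polling behind a portable API.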
As for future work at Intel, we are working on optimizing decompression performance using QAT, Intel QuickAssist Technology, which is another hardware accelerator on SPR, the Sapphire Rapids server platform; it offloads decompression from the CPU to QAT. We also plan to improve the QAT part of the software stack, the user-space library framework that applications use to access the QAT device. Okay, next.

Thanks, Haibin. As Haibin just said, the decompression acceleration is not yet in DADI, and we're waiting for his excellent work. As for DADI's own future work, the first thing is to try to get adopted by OCI as one of the official remote image formats, and the second is to try to get merged into the kernel as a kernel module, as an equivalent of overlayfs.

Okay, thank you very much for participating in today's session. My contact information and email are on this slide; you are welcome to contact us. We have ten minutes to answer questions, and you are welcome to ask. Thank you very much for attending. Thank you.