Hello everyone, thanks for listening to our topic. My name is Xiang Gao. Today, Xinying and I are going to talk about the Nydus image service on in-kernel EROFS. Sorry, due to COVID-19, we are not able to give this talk at the conference in person. Now let's get started.

This topic consists of the following parts. First, I'd like to recap the EROFS file system, which has been in the Linux kernel upstream for about 4 years. Then, I'll show the Dragonfly Nydus architecture. Next, we're going to dive into the EROFS over fscache feature, which has been in the kernel since Linux 5.19. After that, Xinying will talk more about the real practice of EROFS over fscache and play a quick demo. Finally, we'd like to spend a minute on our future work.

Now let's recap the EROFS file system. EROFS stands for Enhanced Read-Only File System. It was originally started in late 2017. After it had already been used on smartphones, the preliminary version was merged in Linux 4.19 and it formally landed in Linux 5.4. It is currently contributed to by community developers, Alibaba Cloud, ByteDance, Coolpad, Google, Huawei, Oppo and more. It is a block-based file system with a tail-packing inline feature, which saves space and gives better performance. It is targeted at various high-performance read-only solutions, such as system partitions and APEXes for Android smartphones; other embedded systems such as routers, IoT devices, etc.; live CDs such as the Arch Linux ISO; and container images such as the Nydus image service, which is the main part of this topic. It supports transparent data compression with LZ4, and LZMA since 5.16. It also has many useful features, and more are actively under development. The following two figures show, on the left side, the Android smartphone use case, and on the right-hand side, RAFS v6, the EROFS-compatible container image use case.

This is the EROFS ecosystem. As you can see, as a self-contained file system, EROFS can work on block devices, files or DAX, so you can use an EROFS image on various types of storage anywhere. It also supports booting with U-Boot, and GRUB support is still work in progress. Nowadays, many Linux distributions support EROFS, such as Android, Arch Linux, Buildroot, Debian, Fedora, Gentoo, OpenAnolis, openSUSE, Yocto and Ubuntu. In addition, EROFS can also work on macOS with macFUSE. As another important usage scenario, Nydus now supports EROFS smoothly, connecting with the rest of the ecosystem such as runc, Kubernetes, CRI-O, Dragonfly, Harbor, Kata Containers, Podman and Sealer.

Then let's take an overview of the Dragonfly Nydus architecture. This slide is not new stuff, since it has been introduced many times. Typically, the OCI image format is defined by the OCI Image Format Specification, which means a container image is a set of layers, just like the illustration on the right-hand side. Each layer is stored in a file archive format, specifically tar.gz. The container rootfs is generated by merging these layers, and it has layer-level deduplication only. Specifically, the core part of each OCI blob is tar.gz, which is a file-based format that is easy to extract and unpack, but it has no on-disk index. Compared with a real file system, you have to traverse the whole archive to get the directory tree. Also, the plain tar.gz stream is non-seekable. Research shows that pulling packages accounts for 76% of container start time, but only 6.4% of that data is actually read. So image pulling is the most time-consuming step during container startup. There could be some in-house workarounds, but cold start is still pretty much limited by the image pull stage.
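To make the tar.gz limitation above concrete, here is a minimal Go sketch (an illustration added for this write-up, with a placeholder file name) that lists the entries of one layer blob; the whole compressed stream has to be decompressed and walked sequentially, because there is no index to seek into.

// list_layer.go - walk a plain tar.gz OCI layer; there is no way to jump
// straight to a single file, the stream must be read from the beginning.
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
)

func main() {
	// "layer.tar.gz" is a placeholder path for a downloaded OCI layer blob.
	f, err := os.Open("layer.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	tr := tar.NewReader(gz)

	// Every header (and the file data before it) has to be streamed through,
	// even if we only care about one file near the end of the archive.
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s\t%d bytes\n", hdr.Name, hdr.Size)
	}
}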
It is also quite hard to slim down every container image, especially when third-party base images, language runtimes, frameworks, etc. are used. So several open-source solutions have appeared, going in different directions. Lazy pulling solutions can be roughly divided into two different paths: working with a file-based format or with a block-based format. Nowadays, most solutions adopt a file-based format, since a file-based format itself is more flexible and has better compatibility with the original OCI image. In addition, because such solutions are mostly self-contained, the security impact is typically confined to the solution itself. Also, blob update can easily work in this way. More details are shown in the table below. In contrast, there are some block-based format solutions treating container images like virtual machine images, but since they heavily rely on an arbitrary local file system on top, the OCI image compatibility and the attack surface of the overall solution can be challenged. Also, blob update seems almost impossible for block-based format solutions.

Apart from lazy pulling, let's take a minute to discuss other drawbacks of the current OCI images. By design, OCI image layers are stored and downloaded as a whole whenever metadata or data is updated. Also, deleted files or duplicated data will still be downloaded. For example, as illustrated below, layer 1 and layer 3 have the same file A, but layer 2 only has file B. In this case, file A will be duplicated unnecessarily. Furthermore, OCI image file data is untrusted at runtime; at least there is no self-contained approach to verify it. Finally, because deduplication works only at layer granularity, data deduplication results are not ideal.

We've seen many issues with the OCI image design already. It needs to be optimized from the aspects of format, construction, distribution, operation and so on. First of all, we try to make the image layers store only the data part of the files, that is, the blob layer in the figure. The blob layer stores each chunk of the file data; for example, a 10-megabyte file is divided into 10 chunks. The advantage of this is that the granularity of deduplication is refined so that deduplication can be done at the chunk level, and it allows the container to pull only the required chunk data instead of the entire file. Then the metadata of all layers is stacked together and placed as a single layer, the meta layer in the figure. The meta layer records the metadata of each file, such as the file name, permission bits, size, etc. This is also called the bootstrap. In addition, and most importantly, it records the index of the location of each chunk in the blob, along with the hash of each chunk, so that we can do runtime data verification for each chunk of each file in the image.

The design ideas mentioned before are now implemented in the Nydus project, which is a sub-project of the Dragonfly project incubated by CNCF. Nydus implements image metadata and data separation, lazy pulling and decompression on demand, and chunk-based data deduplication and verification. It flattens the metadata layers and can directly present the entire file system view, which reduces the overhead of overlayfs stacking. It is compatible with OCI artifacts and facilities, such as registry storage, container runtimes, etc. At present, several companies have participated in the development and core construction of Nydus, such as Alibaba Cloud, Ant Group and ByteDance, and it has been run at large scale in production in these companies. Note that Nydus is an officially supported image acceleration solution of Kata Containers.
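As a hedged illustration of the chunk-level deduplication and verification idea described above (a rough Go sketch with an assumed chunk size and naming, not the actual Nydus on-disk format), identical chunks collapse into a single copy in the blob, while the per-file chunk digests form the index that the bootstrap would record:

// chunk_dedup.go - split file data into fixed-size chunks addressed by digest.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

const chunkSize = 1 << 20 // 1 MiB chunks, an assumed granularity

// blob maps chunk digest -> chunk data; duplicate chunks collapse automatically.
var blob = map[string][]byte{}

// addFile splits a file into chunks and returns the ordered list of chunk
// digests that a metadata layer ("bootstrap") would record for that file.
func addFile(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var index []string
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(f, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			digest := hex.EncodeToString(sum[:])
			if _, ok := blob[digest]; !ok { // chunk-level deduplication
				blob[digest] = append([]byte(nil), buf[:n]...)
			}
			index = append(index, digest)
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break
		}
		if err != nil {
			return nil, err
		}
	}
	return index, nil
}

func main() {
	// "rootfs/bin/ls" is a placeholder path.
	idx, err := addFile("rootfs/bin/ls")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("file stored as %d chunk(s), %d unique chunks in blob\n", len(idx), len(blob))
}

At runtime, the same digests can be recomputed over the downloaded chunks to verify them.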
Nydus has good cooperation with open-source communities, such as the Dragonfly, Kata Containers, OpenAnolis, Harbor and Sealer communities, and Nydus also provides related solutions for the OpenAnolis, Harbor and Sealer communities.

The next part is one of our recent works, EROFS over fscache. I'd like to discuss what sort of problems we'd like to resolve when the image acceleration service is implemented in user space. Nowadays, an image acceleration service usually implements advanced features, such as deduplication, compression, lazy pulling, etc. In order to support these advanced features, a customized image format is usually needed, and thus the image acceleration service is usually implemented in user space, since no suitable in-kernel file system is available. However, this design may suffer in performance, especially in high-density deployment scenarios. For example, when processes access the image, they switch to kernel space through system calls like read or write, and then switch to user space to parse the customized image format or implement the advanced features mentioned above. Frequent switching between kernel and user space becomes the performance bottleneck.

As for Nydus, we try to make it an in-kernel solution so that the performance overhead of switching between kernel and user space can be avoided. There are two technologies adopted to achieve this. Firstly, we implemented an in-kernel image format called RAFS v6. It is based on the in-kernel EROFS file system, with which we can parse the image format inside kernel space. Secondly, we implemented an in-kernel lazy pulling technology based on fscache. This was merged in 5.19. In this case, processes only switch to user space on cache misses; on cache hits, processes do not switch to user space anymore. With these two technologies, Nydus is more of an in-kernel solution, and since there is no frequent switching between kernel and user space anymore, it behaves better in performance.

Next, we will discuss these two technologies in detail. Prior to the introduction of the RAFS v6 format, Nydus used to handle the image format in user space, working with FUSE or virtiofs. However, as mentioned above, the user-space solution suffers great performance overhead due to frequent switching between kernel and user space. To address this, we introduced the RAFS v6 image format, a container image format implemented in the kernel based on the EROFS file system. EROFS has been in the Linux mainline since Linux 4.19. It is a native read-only file system suitable for various scenarios. It can save space effectively while keeping high performance. In the past, it was mainly used for smartphones. Over the past year, we made several improvements and enhancements to the EROFS file system, adapting it to container image storage scenarios and finally making it a container image format implemented on the kernel side. In addition, RAFS v6 also carries out a series of optimizations on the image format, such as block alignment, more compact metadata, etc.

Another important feature for image acceleration is lazy pulling. Prior to this, almost all lazy pulling solutions available were implemented in user space. A user-space solution involves frequent kernel and user space switching and memory copying between kernel and user space, resulting in performance bottlenecks.
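To make the in-kernel path above a bit more concrete, here is a minimal Go sketch, assuming a cachefiles on-demand daemon such as nydusd is already running and the bootstrap blob has been registered with it, of mounting an EROFS image in fscache mode; the fsid value and the mount point are placeholders.

// mount_erofs_fscache.go - roughly equivalent to
// "mount -t erofs none -o fsid=<id> <dir>" on a kernel >= 5.19 with
// CONFIG_EROFS_FS_ONDEMAND enabled.
package main

import (
	"log"
	"syscall"
)

func main() {
	// "myimage" and the mount point are illustrative placeholders.
	err := syscall.Mount("none", "/mnt/container-rootfs", "erofs", 0, "fsid=myimage")
	if err != nil {
		log.Fatalf("mount erofs (fscache mode) failed: %v", err)
	}
	log.Println("EROFS image mounted; cache hits are now served entirely in the kernel")
}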
The switching problem mentioned above is especially prominent when all the container image data has already been downloaded locally, in which case file access will still switch to user space. In order to avoid this unnecessary overhead, we can decouple lazy pulling into two parts: one is cache management of the image data, and the other is fetching data through the network on a cache miss. If we implement cache management in kernel space, we can avoid kernel and user space switching when the image data is locally ready. This is exactly the main benefit of the fscache-based lazy pulling technology we are discussing today.

As the name indicates, this technology is based on fscache. FSCache/CacheFiles is the major file caching solution in the Linux operating system and is widely used with network file systems. Our attempt is to make it work for lazy pulling with local file systems such as EROFS. In this case, when the container accesses the container image, fscache will check whether the requested data has been cached. On a cache hit, the data will be read directly from the cache file. This is processed entirely in the kernel, and the kernel will not switch to user space. On a cache miss, the user-space daemon will be notified to process the request while the container process sleeps on it. The daemon will fetch the data from the remote, write it to the cache file, and wake up the original sleeping process. Once awakened, the process is able to read the data from the cache file.

Apart from that, there are other advantages of the fscache-based lazy pulling technology. The first optimization is called async prefetch. After the container is created, nydusd can start to download the image even when no cache miss has been triggered. Nydusd will download the data and write it to the cache file. Then, when the requested file range is within the prefetched range, the process will read directly from the cache file without switching to user space. The second optimization is network I/O optimization. When a cache miss is triggered, nydusd can download more data at one time than requested. For example, when a 4-kilobyte I/O is requested, nydusd can actually download 1 megabyte of data at a time to reduce the impact of network transmission delay. Then, when the container accesses the remaining data within this 1 megabyte, it won't switch to user space anymore. A user-space solution cannot work like this, because its cache management is implemented in user space, and thus processes still need to switch to user space to check whether the requested range has been downloaded or not.

The next part will be handed over to Xinying. He will show some problems and solutions encountered when EROFS over fscache is landed. Thanks.

Hello everyone. I'm Xinying. I work in the infrastructure department of ByteDance. I'm very happy to introduce the work we have done during the practice of the Nydus EROFS over fscache solution. Today my topic is about how to enhance the reliability of the Nydus image service. At present, for most image lazy pulling solutions, the reliability problem mainly comes from the user-space daemon participating in the I/O path, which introduces additional reliability dependencies. When the user daemon restarts or exits due to failure or upgrade, it may cause I/O errors or I/O hangs on the container side. First of all, let me introduce the I/O path of the EROFS over fscache scheme, which can be divided into two parts: on-demand data requests and local data requests.
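Before walking through that I/O path, here is a rough Go sketch of the network I/O amplification described a moment ago: on a cache miss, the daemon widens the requested range to an aligned window before downloading. The 1 MiB window and the function names are assumptions for illustration, not the actual nydusd implementation.

// readahead_window.go - expand a missed 4 KiB read into a larger aligned fetch.
package main

import "fmt"

const fetchWindow = 1 << 20 // 1 MiB, an assumed amplification size

// expand widens a missed (offset, length) request to a window-aligned range,
// clamped to the blob size, so nearby reads will hit the cache afterwards.
func expand(off, length, blobSize int64) (start, end int64) {
	start = off / fetchWindow * fetchWindow
	end = (off + length + fetchWindow - 1) / fetchWindow * fetchWindow
	if end > blobSize {
		end = blobSize
	}
	return start, end
}

func main() {
	// A 4 KiB read at offset 5*4096 inside a 10 MiB blob...
	start, end := expand(5*4096, 4096, 10<<20)
	// ...turns into a single 1 MiB ranged download.
	fmt.Printf("fetch bytes [%d, %d) (%d KiB)\n", start, end, (end-start)/1024)
}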
When a read request on the container side is forwarded through EROFS to the fscache framework, it will determine whether the required data is ready in the local cache. If so, the data is read directly in kernel space, as shown in step 5 in the figure. If the required data is not in the local cache, the on-demand path will be triggered, as shown in steps 1 to 4 in the figure: the kernel initiates an on-demand request to the user daemon, nydusd pulls the remote data, fills the local cache with it and notifies the kernel that the data is ready, and then the kernel reads the cached data and returns it to the container side.

For this overall I/O processing flow, when the user daemon restarts, maybe due to failure or upgrade, what we hope to achieve is the following. For on-demand requests, do not pass I/O errors to the container side; there can be a short I/O wait on the container side, and the I/O will be restored immediately once the user daemon restarts. For local requests, the container side should be completely unaware of user daemon restarts; the user daemon can even exit after the image is completely downloaded, to achieve daemonless operation.

In order to achieve this, we have made the following designs. We will still discuss them according to the two types of I/O requests. For local requests, the current I/O path has already completely bypassed user space. However, due to a limitation of the fscache framework, after the user daemon exits, the kernel will set the fscache to a non-present state, and all data requests will be directly returned with I/O errors. The current workaround is pretty simple: we just keep the device fd alive in user space, for example by passing the device fd to a supervisor through a unix domain socket, so as to prevent fscache from entering the non-present state after nydusd exits. At the same time, we are also discussing a kernel-side solution with the community.

For on-demand requests, it is a bit more complicated because it involves a lot of resource and state maintenance and recovery. In this part, we will talk about the recovery of fd resources. Firstly, let me introduce which fds are involved when the user daemon is running. The following figure describes the user daemon initialization and the EROFS mount process. Firstly, the user daemon opens the device file /dev/cachefiles to obtain the device fd as a communication channel with the kernel. Then, when EROFS is mounted in fscache mode, the fscache framework will create and open the local cache files and pass the fds of these cache files as anonymous fds to the user daemon through the device fd. The user daemon will then use these anonymous fds to fill the local cache. So when the user daemon is running, one device fd and a large number of anonymous fds are maintained. If all these fds were kept alive in user space, for example by sending them all to the supervisor, it would be very complicated and error-prone. So our solution is: for the device fd, keep it in the supervisor in user space; when the user daemon is restarted, it pulls the device fd back from the supervisor and restores the communication channel with the kernel. For the anonymous fds, a new feature is implemented in the kernel: for local cache files whose fds have been closed, an on-demand request can re-trigger the creation and sending of an anonymous fd. In this way, we can ensure that the fd resources are reliably restored after the user daemon restarts.
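As a hedged illustration of the fd hand-off idea above, the following Go sketch sends an open fd to a supervisor process over a unix domain socket using SCM_RIGHTS ancillary data; the socket path and message format are placeholders, not the actual nydusd/supervisor protocol.

// fd_handoff.go - pass the /dev/cachefiles device fd to a supervisor so it
// stays open across a daemon restart.
package main

import (
	"log"
	"net"
	"os"
	"syscall"
)

// sendFD passes one open file descriptor over an already-connected
// unix domain socket using SCM_RIGHTS ancillary data.
func sendFD(conn *net.UnixConn, fd int) error {
	oob := syscall.UnixRights(fd)
	_, _, err := conn.WriteMsgUnix([]byte("devfd"), oob, nil)
	return err
}

func main() {
	// Open the cachefiles device whose fd we want to keep alive.
	dev, err := os.OpenFile("/dev/cachefiles", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholder supervisor socket path.
	raddr, err := net.ResolveUnixAddr("unix", "/run/supervisor.sock")
	if err != nil {
		log.Fatal(err)
	}
	conn, err := net.DialUnix("unix", nil, raddr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if err := sendFD(conn, int(dev.Fd())); err != nil {
		log.Fatal(err)
	}
	log.Println("device fd handed off to supervisor; daemon may now restart safely")
}

A restarted daemon would then fetch the fd back from the supervisor over the same socket to restore its communication channel with the kernel.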
For on-demand requests, there is another state that needs to be restored, which is the in-flight requests. An in-flight request is an on-demand request that the user daemon has received from the kernel but has not yet replied to. This kind of request may be lost when the user daemon exits, and an I/O error would then be passed to the container side. For the handling of in-flight requests during a user daemon restart, our solution is: firstly, when the user daemon exits, instead of passing an I/O error to the container side, we make the I/O wait; then we implement a restore command, which puts the in-flight requests back into the unread state. In this way, the waiting I/Os can be reprocessed after the user daemon restarts.

The above are the main design points of our work, and the final effect is what we expected: the container side does not perceive the restart of the user daemon and gets the same stable experience as with OCI images. We have already submitted our patches to the kernel community, and thanks to the experts from Alibaba Cloud for their suggestions and support during the development.

OK, then I will show an overall demo of the latest EROFS over fscache solution. First, let's check the kernel version and load the cachefiles module, delete the local cache to clean up the environment, start the nydus snapshotter, and check our test script. This script will run until the service is available, and then exit. Then let's try to run a nydus container. It took 16 seconds in total, and we can see there is an EROFS mount point. Then we will clean up the environment again and try to run an OCI container. We can see that before starting an OCI container, all the data needs to be downloaded; it took 27 seconds. The container tested in the demo downloads a large amount of data immediately after startup, which is often not friendly to lazy pulling schemes. But even so, the end-to-end time of the nydus container is much shorter than that of the OCI container.

Next, I will introduce some future work plans. For duplicated data chunks, support page cache sharing in the kernel. An in-kernel fscache failover solution. Add the standard EROFS compressed format to the Nydus image service. Connect with fs-dax in the secure container scenario to support memory sharing. Explore the combination of fscache and overlayfs as a unified cache system. Support lazy pulling of standard OCI blobs with Nydus.

Well, that's all of our talk. Welcome to visit the Nydus website, and scan the QR code to join the discussion group. Thank you.