Thank you for joining today's session. The subject of this session is Dragonfly: make image distribution efficient and safe. I'm Yuxing Liu from Alibaba Cloud, and I'm one of the maintainers of Dragonfly. Today, my partner Peng Tao will join me to share this session with you. First, let's take a look at the agenda, which contains three parts. The first part is an introduction to Dragonfly. The second part introduces the architecture of Dragonfly, so that you can get a quick understanding of its core technical ideas. The third part will be shared by my partner Peng Tao; it's about the Nydus image service, which is now a sub-project of Dragonfly.

Let's get into the introduction. First, let me try to answer: what is Dragonfly? Dragonfly is an open-source, intelligent, P2P-based image and file distribution system. Its goal is to tackle all distribution problems in cloud-native scenarios. The core objectives of Dragonfly are to improve download efficiency at large scale, reduce the pressure on the source server, and save costly bandwidth. We can take a look at the comparison chart on the right, which shows the time taken to download files in native mode versus Dragonfly mode. Native mode means that all nodes download directly from the source. In the chart, the x-axis is the number of concurrent downloads and the y-axis is the download time; the green curve is native mode and the red curve is Dragonfly mode. From this chart, we can quickly draw the following conclusions. As the number of concurrent downloads increases, the average download time remains fairly consistent when using Dragonfly. At the same time, when the scale reaches a high level, the source server in native mode can even go down, causing downloads to fail; Dragonfly does not have this problem. Now let's quickly pass through several important milestones of Dragonfly.
In 2015, Dragonfly was born out of the internal needs of Alibaba Group. In 2017, after large-scale internal verification, we decided to open source it to the community. In 2018, Dragonfly added support for image distribution and was donated to the CNCF as an image distribution tool. In April this year, it entered the CNCF incubating stage. In the future, Dragonfly still has many directions to explore, and we welcome all of you to join the project. Currently, the Dragonfly community has more than 5,200 stars, more than 70 contributors, and 8 maintainers from 4 different companies. More than 50 companies are already using Dragonfly, including Alibaba Cloud, China Mobile, Bilibili, and so on.

You may be more interested in the core architecture of Dragonfly than in the community, so next let's focus on that. First, let's analyze the core of Dragonfly: its P2P and CDN mechanisms. Dragonfly has two components, dfget and supernode. dfget is a client-side component that needs to be installed on each node; it takes the role of a peer in the P2P network, both downloading files from other peers and acting as an uploader so that other peers can download files from it. Supernode is a server-side component that mainly provides the following capabilities. First, it is the tracker and scheduler in the P2P network, choosing appropriate download paths for each peer. Second, it is also a CDN server that caches data downloaded from the source, to avoid downloading the same files from the source repeatedly. In fact, the supernode is also a special peer that holds the complete contents of the files in the P2P network. Another important point is that the granularity of Dragonfly downloads is the block instead of the entire file: each file is divided into multiple blocks according to certain rules. Here, we can see that the blocks currently being downloaded by each peer are different.
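To make the block granularity idea concrete, here is a minimal sketch (not Dragonfly's actual code; the block size and hashing scheme are purely illustrative) of splitting a file into independently addressable blocks:

```python
import hashlib

def split_into_blocks(data: bytes, block_size: int) -> list[tuple[str, bytes]]:
    """Cut a file into fixed-size blocks, each addressed by its own digest.

    Peers can then fetch, verify, and re-serve individual blocks instead of
    the whole file, and identical blocks need to be transferred only once.
    """
    blocks = []
    for off in range(0, len(data), block_size):
        chunk = data[off:off + block_size]
        blocks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return blocks
```

Because every block carries its own digest, a peer that receives a block from another peer can verify it immediately, without trusting the uploader.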
Then, let's take a quick look at Dragonfly's P2P and CDN mechanisms through the process of downloading a file with Dragonfly. First, we trigger a file download with dfget on a host. dfget sends a request to the supernode to obtain the scheduling result, that is, which peer has which blocks. After the supernode receives the download request, it first checks whether the file is cached locally; if not, it downloads the file from the source and caches it. Then the supernode returns the scheduling result, computed by the scheduling algorithm from the recorded relationships between peers and blocks. If no node other than the supernode has any blocks of the file, the supernode itself acts as a peer node and serves the file's blocks to the other peers, and the scheduling result will not contain other peers. Otherwise, the result contains only a certain number of blocks (the default is 4). After dfget gets the scheduling result, it downloads the corresponding blocks from the peers accordingly. After downloading a block, dfget first reports to the supernode that it has the block, and then it can serve that block to other peers. These steps repeat until all nodes have downloaded the entire file content. I hope that through the above explanation, you have a good understanding of Dragonfly's core principles.

So far, we have been talking about file download. You may be more concerned about how to download an image through Dragonfly, so next let's take a look at how Dragonfly participates in image distribution. First, we need to know what happens when we run the image pull command. According to the OCI distribution specification, pulling an image revolves around retrieving two kinds of components: a manifest and one or more layers. A manifest is a JSON document that defines an OCI image, including its layers, sizes, and digests.
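As a concrete illustration of the manifest just described, here is a minimal sketch of an OCI image manifest expressed as a Python dict (the digests and sizes are made up for illustration, not from a real image):

```python
# A minimal OCI image manifest. The client fetches the manifest first,
# then each layer blob listed under "layers", addressed by digest.
manifest = {
    "schemaVersion": 2,
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": "sha256:" + "c" * 64,   # illustrative digest
        "size": 1469,
    },
    "layers": [
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": "sha256:" + "a" * 64,
            "size": 2811478,
        },
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": "sha256:" + "b" * 64,
            "size": 304,
        },
    ],
}

def layer_digests(manifest: dict) -> list[str]:
    """List the layer blobs a client must pull (skipping ones cached locally)."""
    return [layer["digest"] for layer in manifest["layers"]]
```

Each entry under "layers" is one blob to fetch with a GET request, which is exactly the kind of request Dragonfly can take over.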
A layer is the binary form of content stored in the registry, addressable by a digest. The process of pulling an image is then as follows. The first step is to retrieve the manifest by sending a GET request. The client should verify the returned manifest signature before fetching layers, and then pull the layers that are not already present locally. We can see that the process of downloading an image consists of standard HTTP requests, which is similar to downloading files. If the requests that download an image can go through Dragonfly, then large-scale image downloads can use the capabilities of Dragonfly.

So, to meet the needs of image distribution, Dragonfly added a new component, dfdaemon. The role of dfdaemon is to act as a proxy, responsible for intercepting blob file download requests and calling dfget for the P2P download. Let's take Docker as an example. The process of downloading an image through Dragonfly becomes: run the docker pull command with Docker configured to use Dragonfly's dfdaemon, then download the required layer blob files through the Dragonfly P2P mechanism. When all the layers needed for an image are downloaded, we get a complete OCI image. At present, Dragonfly has good practices in combination with Docker and containerd; here is the key part of their configuration for the image registry proxy. In fact, as a proxy component, dfdaemon can support almost all container engines: as long as the file download requests are forwarded to dfdaemon, the subsequent download logic is handed over to Dragonfly to complete.

At this point, you have learned about the main architecture and process design of Dragonfly. In addition, Dragonfly provides more advanced functionality, such as network bandwidth limiting, transmission encryption, and so on. If you are interested, welcome to join us. That's all I have to share. Let's welcome Peng Tao to share with us about the Nydus image service. Thank you. Thank you, Yuxing.
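The interception role described above can be sketched as follows. This is a toy illustration, not dfdaemon's actual implementation: it only shows the routing rule a registry-facing proxy would apply, where layer blob GETs (whose path shape follows the OCI distribution spec) go to the P2P path and everything else passes straight through to the registry:

```python
import re

# Registry blob downloads look like /v2/<name>/blobs/<digest>. A
# dfdaemon-style proxy hands those to the P2P network and forwards
# everything else (manifest fetches, auth, pushes) to the registry.
BLOB_RE = re.compile(r"^/v2/.+/blobs/sha256:[0-9a-f]{64}$")

def route(method: str, path: str) -> str:
    """Return "p2p" for layer blob GETs, "passthrough" otherwise."""
    if method == "GET" and BLOB_RE.match(path):
        return "p2p"
    return "passthrough"
```

Because the container engine speaks plain HTTP to the registry, this single routing decision is enough to plug a P2P backend under any engine that can be pointed at a proxy or mirror.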
Hello, everyone. I am Peng Tao, a software engineer from Ant Group. I have been working on containers for quite a few years. Today, I am going to introduce the work we have been doing over the past year, trying to improve Dragonfly with a better way of distributing container images. A new sub-project, Nydus, has been added to the Dragonfly project family, and I will dive into the details of why we built it and what it looks like.

Before we start, let's look at why we did it. In the container ecosystem, a neutral organization, the Open Container Initiative (OCI for short), was created to standardize container images, runtimes, and distribution. To do that, the OCI has defined an image spec, a runtime spec, and a distribution spec. They work together pretty well and have helped to increase the prosperity of the container ecosystem. Here, we will focus on the OCI image spec for the rest of the session. On the right side of this slide is a very simple illustration of a container image. It consists of several JSON files and tarball data files. The tarballs are the container image data layers, containing all the files and directories of a container image. The OCI image spec defines the format of each JSON file and how the tarballs are combined to form a file system view.

In recent years, as the container ecosystem has evolved, the OCI has recognized some drawbacks of the image spec. The most important one is that, before we can start a container, we must first download and extract the image to the local file system, which adds a lot of delay to container startup time. What's ironic is that, even if we only want to access a very small portion of the container image data, we still have to download the entire image. Another less obvious but still important problem is that the OCI image can only support deduplication at the layer level, and a layer is a tarball. There are three consequences of this defect.
The first is that if we change only a file's metadata, the file's data is copied to the upper layer as well. The second is that if a file is modified multiple times while building a container image, it is saved in different image layers, and we have to download it multiple times when starting the container, even though only the last modified version of the file is actually usable by containers. The third is that deleted files and directories are still saved in lower layers and need to be downloaded when starting a new container, even though they won't be used by containers at all.

The OCI community is well aware of these drawbacks and has been talking about an OCI v2 image spec for some time. Here is a link to a shared document from a brainstorm that happened in the OCI community. Besides discussing what the OCI v2 image spec should be, the community has also seen several projects attacking the existing defects in different ways. For example, the Dragonfly and Kraken projects speed up image distribution via a P2P network. CernVM-FS employs a file-level local cache to avoid downloading the entire image. Slacker tries to solve the deduplication problem with a storage layer's snapshot capability. CRFS hacks the tar.gz layer format and makes it seekable to support on-demand loading. FILEgrain and umoci break image layers apart, saving each file or sub-file as an object in the image registry. All of these are great works and solve the problems that they deem most important. However, each of them suffers from different problems. For example, the P2P image distribution solutions, while fast, still require downloading an entire image before starting a container. The file-level local cache in CernVM-FS reduces the problem, but it still requires a file to be fully downloaded before it can be accessed, so it is still not true on-demand loading. Slacker relies on a storage snapshot capability that is not always available.
CRFS requires a mount point for every image layer and incurs performance overhead with each layer, which is slow if there are many layers in the image. FILEgrain and umoci both suffer from the problem of too many objects per image, which is a great burden on the image registry.

Observing all this, we designed and implemented the Nydus image service to improve the current state. With careful design, the Nydus image service has these key features. Container image data is downloaded on demand. Files are split into chunks, and Nydus supports chunk-level data deduplication. Image metadata and data are flattened, so no intermediate layers are maintained. Nydus also supports end-to-end data integrity checking. And it is compatible with the OCI artifacts spec and distribution spec, allowing it to be easily integrated with existing image distribution deployments.

Let's take a look at how these are done in detail. First, the architecture. At a high level, Nydus provides a user-space file system daemon, as shown in the middle of the picture. When starting a new container, Nydus only needs to download a very small container image metadata layer, which is usually very fast. The daemon interprets the metadata layer and exports file system views to containers regardless of how they are sandboxed, so Nydus can support both traditional runc containers and virtualized containers, like Kata Containers. Because the Nydus image format is compatible with the OCI artifacts spec and distribution spec, images can be stored in a Docker registry as well as in OSS in a cloud environment. Users can choose to deploy a P2P distribution network like Dragonfly between the registry and the compute nodes to further accelerate image data distribution at large scale. Inside the Nydus daemon, an optional local cache can save image data locally to boost future data access. At the core of Nydus is an image format designed to support all of its key features.
Each container image is divided into a metadata layer and a data layer. The metadata layer utilizes a Merkle tree, also known as a hash tree, data structure. Every file and directory is a hash node in the tree, with its hash digest calculated either from its data or, in the case of a directory, from its descendants. Then we can easily verify a file or a directory by recalculating its digest. Each file's data is cut into fixed-size chunks; each chunk has its own hash digest calculated from its data and is saved in a shared data layer. Because the data layer is separated from the metadata layer, Nydus can represent a container image file system view with just the metadata layer. And the data layer can be shared among different metadata layers at the data chunk level, so chunks are shared between different container images. With this chunk data sharing, global chunk-level deduplication is possible as well.

After learning about the Nydus image format, let's look at how it can benefit us. The immediate benefit is that we can launch containers much faster. In a container startup time benchmark, we measured that conventional container startup time is roughly proportional to the size of the container image. For example, a BusyBox container can be launched very quickly, since its image is pretty small. However, a large image like TensorFlow takes several minutes to start, and most of that time is spent pulling the container image. When running with Nydus, the container startup time is pretty consistent, all below four seconds. At the bottom of the graph, we can see that Nydus always finishes pulling the container image in less than one second, thanks to the image format design that separates image data from metadata.

Another very useful benefit brought by Nydus is end-to-end data integrity. With conventional container images, images are downloaded and verified at the image level, and then uncompressed to a local file system.
After that, we cannot know if the data is changed, intentionally or unintentionally, or if there is a bit flip or silent data corruption on the local disk. This has bothered long-running applications for a long time, because disk failures and silent data corruption cannot be anticipated. However, Nydus can be configured to always check image data checksums before returning data to container applications. This ensures that whatever is read by the container application is always verified and consistent. Even if there is a disk failure or silent data corruption, Nydus can detect it and resort to the backend for the original copy of the data.

There are more benefits of running the Nydus image service that we do not cover here. We have open sourced it as a sub-project of Dragonfly, and we encourage the audience to check it out in our GitHub repository. In the future, we plan to introduce better integration with the container image ecosystem, such as integration with BuildKit. We want to support more data compression algorithms as well; currently, only LZ4 is supported. We are also looking at flexible chunk sizes and flexible deduplication levels. Last but not least, we will propose to the OCI community to let Nydus serve as a reference implementation of the OCI v2 image spec.

Okay, that finishes our introduction. Thanks for watching. Here is our contact information: the Dragonfly website, the GitHub repository, the Twitter handle, and the DingTalk group. Please feel free to contact us by any method you prefer, and we can answer questions in the Q&A session. Thank you.