Hello everyone, the topic of this talk is best practices for accelerated image distribution using Dragonfly. There are two speakers. I'm Wen Boqi, working as a maintainer of Dragonfly at Ant Group. My partner is Yiyang Huang from ByteDance, who is also a maintainer of Dragonfly. First, let me introduce the Dragonfly project and its research and development over the past year. Dragonfly is an open-source, P2P-based file distribution and image acceleration system. It is hosted by the Cloud Native Computing Foundation (CNCF) as an incubating project. It is designed to improve the efficiency and speed of large-scale file distribution, and it is used in the fields of application distribution, cache distribution, log distribution, and image distribution. At this stage, Dragonfly 2 builds on Dragonfly 1: on the basis of maintaining the capabilities of Dragonfly 1, it has been upgraded in major areas such as system architecture design and product capabilities. Dragonfly has been selected and put into production use by many internet companies. It was open-sourced in 2017 and entered the CNCF sandbox in October 2018, and in 2020 the CNCF TOC voted to accept Dragonfly as an incubating project. Dragonfly developed its next version through production practice, absorbing the advantages of Dragonfly 1 and making many optimizations for known problems. Dragonfly has now been released more than 270 times, and the project has had active commits for a long time; we can refer to the first picture, which shows the commit activity for the past year. Dragonfly maintainers come from different companies, including Ant Group, Alibaba, ByteDance, Baidu, and GitLab. Listeners who are interested in the project can join the community through the link below and discuss the future development of the project with us. Some listeners may not know much about the Dragonfly project, so I will introduce the architecture of the Dragonfly project and the role of each service.
First, Dragonfly consists of four services: the manager, the scheduler, the seed peer, and the peer. The manager component in the project serves as the management service, and the scheduler component serves as the scheduling service. dfdaemon can run as either a peer service or a seed peer service. Let me introduce each service. First of all, the manager, which is a management service, is used to manage the relationships between multiple clusters. A P2P cluster includes a scheduler cluster, a seed peer cluster, and multiple peers, and multiple P2P clusters are managed by the manager. At the same time, the manager also provides dynamic configuration management, such as controlling the load of peers and seed peers. The manager can also act as a certificate-issuing CA service, issuing leaf certificates to schedulers, seed peers, and peers. It also provides a front-end console, which makes it easy for users to manage P2P clusters, and it provides user-related features such as user management and RBAC. It also provides open APIs, such as the preheat API, to be called by other services. Of course, a main feature of the manager is to manage the membership of the cluster: it can remove inactive instances from the cluster. The scheduler is a very important service in Dragonfly. Its main feature is to select candidate download parents for the downloading peer, and when a peer's download fails, to direct the peer to download the task back-to-source. Of course, the scheduler is still quite complicated: it builds a DAG of peers, and during scheduling it goes through filters and evaluators to select candidate parents. The seed peer can be triggered by the scheduler to download back-to-source and divide the resource into pieces, acting as the root node for the cluster's resources; it also has all the features of a peer. Usually, seed peers in the cluster are given high-performance machines and a high-quality network environment.
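The four services above are usually deployed together. As a rough sketch, a Helm values file for the official Dragonfly chart might enable them like this; the key names and replica counts here are assumptions that vary between chart versions, so check the chart you actually install:

```yaml
# Hypothetical values.yaml fragment for the Dragonfly Helm chart.
manager:
  enable: true
  replicas: 3        # management service: clusters, dynamic config, preheat API
scheduler:
  enable: true
  replicas: 3        # selects candidate download parents for peers
seedPeer:
  enable: true
  replicas: 3        # back-to-source root nodes; give these good machines
dfdaemon:
  enable: true       # the peer service, typically one per node
```

It would then be installed with something like `helm install dragonfly dragonfly/dragonfly -f values.yaml`.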
When a task is downloaded for the first time, seed peers can download the resource back-to-source at a faster speed. The peer is the client in the P2P network. It can both download and upload. It starts a dfdaemon process for distribution, and it gets its parents by exchanging information with the scheduler. dfdaemon and the command-line tools dfget, dfstore, and dfcache use a client/server architecture: for example, dfget downloads files from the command line by calling the gRPC API of dfdaemon. The picture above shows a download process. First, the peer gets the address of the best-matching scheduler from the manager. Then the peer registers the task with the scheduler, and the peer builds a stream with the scheduler. If the task is being downloaded for the first time, the scheduler triggers the seed peer to download back-to-source. The scheduler returns the candidate parents to the peer. The peer then downloads pieces from the parents and assembles the pieces into a complete file. Users can deploy seed peers in a place where the network path to the source is better, so as to ensure that the first back-to-source download is as fast as possible. Of course, we can see that Dragonfly supports different types of protocols for P2P transmission. First, Dragonfly supports object storage, such as AWS S3, Google Cloud Storage, Azure Blob Storage, and Alibaba Cloud OSS. Object storage is also a common solution for storing machine-learning models, so Dragonfly can also serve inference workloads and accelerate the process of pulling models. Dragonfly also provides acceleration for container images, and it supports many clients, including Docker, containerd, CRI-O, and others. Dragonfly also supports the HTTP protocol and the HDFS protocol; the support for HTTP expands Dragonfly's range of use. Dragonfly and Nydus together allow both lazy loading of container images and downloading with P2P technology.
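The piece mechanics just described can be sketched in a few lines of Python. This is a toy illustration with a tiny piece size and an in-memory "file", not Dragonfly's real implementation (real pieces are megabyte-sized and travel between peers over gRPC):

```python
import hashlib

PIECE_SIZE = 4  # bytes here; real Dragonfly pieces are far larger


def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Divide a resource into fixed-size pieces, as a seed peer does."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def assemble(pieces: list[bytes], expected_digest: str) -> bytes:
    """Reassemble pieces fetched from parents and verify the whole file."""
    blob = b"".join(pieces)
    if hashlib.sha256(blob).hexdigest() != expected_digest:
        raise ValueError("digest mismatch: corrupt or incomplete download")
    return blob


resource = b"hello dragonfly"
digest = hashlib.sha256(resource).hexdigest()
pieces = split_into_pieces(resource)
# In a real cluster, each piece may come from a different parent peer.
restored = assemble(pieces, digest)
assert restored == resource
```

The digest check mirrors the idea that a peer validates the completed file rather than trusting any single parent.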
We provide users with a set of container image acceleration solutions based on Harbor and Dragonfly. First, we build the image and upload it to Harbor, and then download the container image through containerd. When downloading the layers of the image, the download is accelerated through the Dragonfly P2P network. In this way, users have an image acceleration solution based on P2P technology. The Nydus acceleration framework implements a content-addressable filesystem that can accelerate container startup by lazy loading. It has supported the creation of millions of accelerated image containers daily; currently, its users include Ant Group, Alibaba Cloud, and ByteDance. Nydus has many features: an image can be fetched on demand in chunks for lazy loading, with lazy pulling to speed up container startup; it is deeply integrated with the Linux kernel's EROFS and fscache, enabling in-kernel support for image acceleration; from build time to runtime, Nydus covers the container ecosystem, including Docker, containerd, Podman, BuildKit, and nerdctl; it builds faster than the OCI v1 gzip or eStargz formats, and it can also directly accelerate OCI v1 images; and its ability to deduplicate data at the chunk level between the local host and the server side reduces the storage waste of over 50% caused by incremental builds. We also provide users with a set of container image acceleration solutions based on Harbor, Dragonfly, and Nydus, which is a more complete solution. First, we build the image into a Nydus image and upload it to the hub, then download the image through the Nydus snapshotter, which of course lazy loads it. When downloading the data of a file, the download is accelerated through the Dragonfly P2P network. In this way, users have an image acceleration solution that loads on demand and enjoys P2P acceleration. Next is the performance of a single-machine image download after integrating Nydus in mirror mode with Dragonfly P2P.
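As a rough sketch of how the pieces are wired together on a node, containerd is pointed at the Nydus snapshotter, and the snapshotter's mirror mode sends registry traffic to the local dfdaemon proxy. The socket path and plugin names below follow common nydus-snapshotter setups, but treat them as assumptions and check your own deployment:

```toml
# /etc/containerd/config.toml (fragment): register the Nydus snapshotter
# as a proxy plugin and make CRI use it.
[proxy_plugins]
  [proxy_plugins.nydus]
    type = "snapshot"
    address = "/run/containerd-nydus/containerd-nydus-grpc.sock"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "nydus"
  disable_snapshot_annotations = false
```

In the nydusd mirror configuration, registry requests are then proxied to the local dfdaemon (a common default proxy address is 127.0.0.1:65001, but this depends on your dfdaemon config), so both layer pulls and on-demand chunk reads join the P2P network.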
The test runs the version commands of images in different languages; for example, the startup command used to run the python image is the python version command. The tests were all performed on the same machine because of the influence of each machine's own network environment. The absolute download time is not what matters; what matters is the ratio between the download times in different scenarios. The tests cover different scenarios: OCI v1, using containerd to pull the image directly; Nydus cold boot, using containerd to pull the image through the Nydus snapshotter without hitting any cache; Nydus and Dragonfly cold boot, using containerd to pull the image through Dragonfly and the Nydus snapshotter, transferring the traffic to Dragonfly P2P based on Nydus mirror mode, with no cache hits; hit Dragonfly remote peer cache, pulling the image through the Nydus snapshotter with traffic transferred to Dragonfly P2P based on Nydus mirror mode, and hitting the remote peer cache; hit Dragonfly local peer cache, the same setup but hitting the local peer cache; and hit Nydus cache, the same setup but hitting the Nydus local cache. The test results show that with Nydus mirror mode and Dragonfly P2P integrated, downloading images with Nydus effectively reduces the image download time compared with the OCI v1 mode. The cold-boot times of Nydus alone and of Nydus with Dragonfly are basically close, and every scenario that hits a Dragonfly cache is better than Nydus only. The most important thing is that if very large Kubernetes clusters use Nydus to pull images, the download of each image layer generates as many range requests as needed, which causes the QPS against the source registry to be relatively high.
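Since the point above is that ratios matter more than absolute times, the comparison can be sketched as follows. All numbers here are invented placeholders standing in for the talk's charts; only the shape of the computation is meaningful:

```python
# Invented placeholder times (seconds); the real data is in the talk's charts.
times_s = {
    "oci_v1": 60.0,                     # containerd pulls directly
    "nydus_cold_boot": 20.0,            # Nydus snapshotter, no cache
    "nydus_dragonfly_cold_boot": 21.0,  # + Dragonfly P2P, no cache
    "dragonfly_remote_peer_cache": 12.0,
    "dragonfly_local_peer_cache": 8.0,
    "nydus_local_cache": 5.0,
}

baseline = times_s["oci_v1"]
# Ratio of each scenario's time to the OCI v1 baseline: lower is better.
ratios = {name: t / baseline for name, t in times_s.items()}
```

Comparing ratios rather than raw seconds keeps the conclusion stable even though the machine's own network conditions shift the absolute numbers.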
Dragonfly can effectively reduce the number of back-to-source requests and the back-to-source download traffic against the registry. In the best case, Dragonfly can make the same task go back to source only once. Dragonfly also has very good practices in machine learning. Inference engines, including TensorFlow Serving, Triton Inference Server, and TorchServe, download models from object storage. Dragonfly supports different types of object storage protocols, allowing these inference engines to download models through P2P technology. This prevents the object storage from reaching its bandwidth limits and serving model downloads slowly: it accelerates the download speed of models and reduces the bandwidth pressure on the object storage. Many companies currently use Dragonfly to accelerate model downloads for their inference engines. Hello everyone, my name is Huang Yiyang. I'm a software engineer from ByteDance. I will introduce best practices for using Dragonfly in ByteDance's cloud service, Volcano Engine. Before starting, let me give a short introduction to the Volcano Engine products that Dragonfly is used with. Container Registry (CR) is a product of ByteDance's cloud service Volcano Engine, which provides secure and highly available container image hosting services, making it easy for users to manage the entire lifecycle of container images. Volcano Engine Kubernetes Engine (VKE) provides high-performance, container-centric Kubernetes cluster management services through deep integration of next-generation cloud-native technologies, helping users quickly build containerized applications. Volcano Engine Container Instance (VCI) is a serverless containerized computing service.
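To illustrate the model-download path just described, an inference service can route its object-storage requests through the local dfdaemon HTTP proxy so that downloads join the P2P network instead of all hitting the bucket directly. The URL, bucket name, and proxy port below are hypothetical; this is a sketch of the idea, not the real loading code of any inference engine:

```python
import urllib.request

# Hypothetical model location and local dfdaemon proxy address; the proxy
# port depends entirely on your dfdaemon configuration.
MODEL_URL = "http://example-models.s3.amazonaws.com/resnet50/model.tar"
DFDAEMON_PROXY = "http://127.0.0.1:65001"


def fetch_model(url: str, proxy: str) -> bytes:
    """Download a model through dfdaemon so the transfer uses P2P peers."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy})
    )
    with opener.open(url) as resp:
        return resp.read()
```

With every replica of the inference service fetching through its local peer, only a handful of requests ever go back to the object storage itself.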
Currently, VCI can seamlessly integrate with the container service VKE to provide Kubernetes orchestration capabilities. With VCI, users can focus on building the application itself without purchasing and managing infrastructure such as the underlying cloud servers, and they only pay for the resources consumed by the actual running of the container. In Volcano Engine, users use CR to store and manage their images, and there are limits in two scenarios. First, in a VKE Kubernetes cluster, the number of clients keeps increasing, and the images used by applications may reach GB level. Second, if an image is converted to the Nydus format, the number of requests increases by an order of magnitude. Therefore, concurrent pulls are ultimately limited by bandwidth and QPS. We did some research on the P2P projects in the community. There are two production-ready P2P projects, Dragonfly and Kraken, and here is the comparison between them. Both are highly available, support containerd, support HTTPS artifact registries, and are production-ready. But Dragonfly has some advantages. Its community is more active and it has more users: it is used by Ant Group, Intel, Baidu, DiDi, Kuaishou, and many other companies, while Kraken is used by Uber and NetEase. Most importantly, Dragonfly is Nydus-compatible, and the architectural complexity of Dragonfly is lower than Kraken's. So Dragonfly is the better option, and we chose to use it. How do we deploy Dragonfly in Volcano Engine? We need to consider three points: how to deploy in VKE, how to serve VCI, and how to serve Nydus. In Volcano Engine, VKE and VCI pull images from CR. A product feature of VKE is that Kubernetes nodes are deployed on ECS, so it is very suitable to deploy dfdaemon on each node, making full use of the bandwidth of each node and therefore making full use of the P2P capabilities. A product feature of VCI is that there are some virtual nodes with sufficient resources at the bottom level, and the upper-level services are carried by pods.
So it is impossible to deploy dfdaemon on each node as in VKE. Instead, several dfdaemons are deployed as caches in the form of a Deployment, to take advantage of their caching capabilities. VKE or VCI clients may also pull Nydus-converted images; in this scenario, dfdaemon needs to be used as a cache, because using too many dfdaemons as caches would put scheduling pressure on the scheduler. Here is our architecture with Dragonfly. Resources in Volcano Engine belong to a main account. The P2P control components, the scheduler and the manager, are isolated at the level of the main account: each main account has its own set of P2P control components. The control plane implements a P2P manager controller, through which all P2P control components are managed. The P2P control components are deployed in the VPC of the control plane on the CR side, and they are exposed to user clusters through a load balancer. In a VKE cluster, dfdaemon is deployed as a DaemonSet, one instance on each node. In VCI, dfdaemon is deployed as a Deployment. containerd on ECS accesses dfdaemon through localhost. By deploying a controller component in the user cluster, and based on PrivateZone, a certain domain containing the cluster ID is generated in the user cluster. The controller selects dfdaemon pods, including pods in VKE and VCI, according to certain rules, and A records for them are added to this domain. By visiting this domain, clients can access dfdaemon: the Nydus daemon on ECS accesses dfdaemon through this domain, and image service clients and the Nydus daemon on VCI access dfdaemon through this domain as well. Here is our benchmark data. The environment: the CR bandwidth is 10 Gbit/s; the ECS instances are 4C8G with local SSD, and their bandwidth is 6 Gbit/s. The test images are nginx and TensorFlow. The version of Dragonfly is 2.0.8. The resource limit of dfdaemon is 2C6G. We deploy the scheduler and the manager with two replicas each; the request is 1C2G and the limit is 4C8G. Here is the nginx benchmark data, from pod creation to container start.
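Before the numbers, the PrivateZone-based endpoint selection described above can be sketched as follows. The pod structure, the domain shape, and the "publish only ready pods" rule are illustrative assumptions, not the actual controller code:

```python
# Hypothetical sketch of the PrivateZone controller: pick ready dfdaemon
# pods (from VKE and VCI) and publish their IPs as A records on a
# per-cluster domain so clients can reach a dfdaemon by one stable name.
def select_dfdaemon_records(pods, domain, max_records=10):
    """Return (domain, ip) A-record pairs for ready dfdaemon pods."""
    ready_ips = [p["ip"] for p in pods if p.get("ready")]
    return [(domain, ip) for ip in sorted(ready_ips)[:max_records]]


pods = [
    {"name": "dfdaemon-a", "ip": "10.0.0.2", "ready": True},
    {"name": "dfdaemon-b", "ip": "10.0.0.3", "ready": False},
    {"name": "dfdaemon-c", "ip": "10.0.0.4", "ready": True},
]
records = select_dfdaemon_records(pods, "p2p.<cluster-id>.example.internal")
# Only the ready pods are published behind the domain.
assert [ip for _, ip in records] == ["10.0.0.2", "10.0.0.4"]
```

Resolving the per-cluster domain then spreads clients across the healthy dfdaemon endpoints without any client-side configuration.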
And here is the TensorFlow benchmark data, from pod creation to container start. In a scenario of large-scale image pulling, using Dragonfly, and Dragonfly together with Nydus, can save more than 90% of the container startup time. With Nydus, the startup time is even shorter because of the lazy-loading feature, which only needs to pull a small part of the metadata. Here is the peak bandwidth comparison for pulling the nginx image with and without Dragonfly, and here is the peak bandwidth comparison for pulling the TensorFlow image with and without Dragonfly. Here is the back-to-source traffic comparison for pulling the nginx image with and without Dragonfly, and here is the back-to-source traffic comparison for pulling the TensorFlow image with and without Dragonfly. In large-scale scenarios, the number of images Dragonfly pulls back-to-source is small, while in the OCI scenario, every image pull must go back to source. Therefore, the peak bandwidth and back-to-source traffic when using Dragonfly are much less than in the OCI scenario. With Dragonfly, as concurrency increases, the back-to-source peak bandwidth and traffic do not increase significantly. Finally, thanks. I hope more developers will pay attention to Dragonfly and Nydus. Interested developers can scan the QR code to join our discussion group, join our Slack channel, or follow our Twitter, where we release version information and feature upgrade information. Thank you.