Hello everyone. My topic today is "Dragonfly: Introduction, Updates, and Practice in AI Model Distribution". My name is Wenbo Qi, and I'm a maintainer of Dragonfly. I hope this introduction gives you a clear picture of the current state of the Dragonfly project.
Dragonfly provides file distribution and image acceleration based on P2P technology, and aims to be the best practice and standard solution in cloud native architectures. It is hosted by the Cloud Native Computing Foundation (CNCF) as an incubating-level project. It is designed to improve the speed of large-scale file distribution, and it is used for application distribution, cache distribution, log distribution, image distribution, AI dataset distribution, and AI model distribution. In the container registry section of the CNCF landscape, there are two projects at the graduated or incubating level. One is Harbor, as an artifact registry, and the other is Dragonfly, as an image acceleration and file distribution solution.
The project has more than 100 contributors, and the maintainers come from Ant Group, Alibaba Group, ByteDance, Intel, Baidu Group, JiHu, and Dalian University of Technology. Many companies now use Dragonfly to speed up file distribution and image distribution, such as Alibaba Group, Ant Group, Intel, DiDi, ByteDance, Kuaishou, Baidu Group, Xiaomi, Bilibili, and so on. The commit history over the past year shows that contributions have stayed active.
Next, let me introduce Nydus, a sub-project of Dragonfly. It provides a content-addressable file system based on the RAFS format. Its most important features are that container images are downloaded on demand in chunks, and that chunk-level deduplication works across layers and across images, which reduces storage, transport, and memory costs. It can reduce the end-to-end cold start time of containers from minutes to seconds. Its maintainers come from Ant Group, Alibaba Group, ByteDance, and so on.
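
To illustrate the chunk-level deduplication idea mentioned above, here is a minimal sketch. It is not Nydus code and does not use the RAFS format: the `ChunkStore` class, the tiny fixed chunk size, and the use of SHA-256 digests as chunk keys are all assumptions made only for this example.

```python
import hashlib

CHUNK_SIZE = 4  # assumed tiny chunk size, chosen only to keep the example readable


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split a blob into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical content-addressed store: each chunk is keyed by its digest."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}

    def add_blob(self, data: bytes) -> list[str]:
        """Store a blob and return the ordered chunk digests that rebuild it."""
        digests = []
        for chunk in split_chunks(data):
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk that is already present (from any layer) is stored only once.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        return digests

    def read_blob(self, digests: list[str]) -> bytes:
        """Rebuild a blob from its chunk digests."""
        return b"".join(self.chunks[d] for d in digests)


store = ChunkStore()
layer_a = store.add_blob(b"AAAABBBBCCCC")  # stands in for one image layer
layer_b = store.add_blob(b"AAAABBBBDDDD")  # a second layer sharing two chunks
print(len(store.chunks))  # number of unique chunks actually stored
```

The two layers contain six chunks in total, but the two chunks they share are stored and would be transferred only once. A real implementation would also track chunk offsets so that a reader can fetch just the chunks it touches, which is what makes on-demand loading possible.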
Let's take a look at the important milestones of the Dragonfly project. Dragonfly has been adopted and put into production use by many internet companies since it was open sourced in 2017. It joined the CNCF as a sandbox project in 2018 and became an incubating project in 2020. In 2020, Nydus became a sub-project of Dragonfly and was widely used for image acceleration. In 2021, Dragonfly version 2 was released. Dragonfly is not only used for image acceleration but also has many use cases in file distribution and AI model distribution.
Now, Dragonfly focuses on three areas: image acceleration, file distribution, and AI infra. In the field of image acceleration, Dragonfly supports container clients such as containerd, Docker, CRI-O, ORAS, and so on. It provides three solutions for image acceleration. The first solution is to use Dragonfly to distribute images based on P2P technology, which is suitable for large-scale clusters. The second solution is to use Dragonfly and Nydus together to distribute accelerated images, which is suitable for large-scale clusters that also need faster container launching. The third solution is to use Nydus alone to distribute accelerated images, which is suitable for faster container launching.
In the field of file distribution, Dragonfly supports large-scale file distribution and uses P2P technology to eliminate the impact of origin bandwidth limitations. Nydus supports file distribution over protocols including HTTP, HDFS, and so on. It also supports different object storage protocols, including S3, OSS, OBS, and so on. Dragonfly added Dfstore to expand its file distribution capability. It can depend on different types of object storage and provides stable object storage capabilities. We have been working in the field of image acceleration and file distribution for almost six years, and Dragonfly has become the standard best practice in this field.
In the field of AI infra, Dragonfly supports distributing data during AI training and AI inference.
In AI inference, Dragonfly supports Triton Server and TorchServe using Dragonfly to distribute models. In AI training, Fluid downloads datasets through Dragonfly when running based on JuiceFS. It also supports downloading models and datasets from the Hugging Face Hub with the SDK through the Dragonfly HTTP proxy. In the future, we will pay more attention to AI infra, because we believe that P2P technology is the best solution to accelerate AI dataset distribution and AI model distribution.
Next, I will explain why to use Dragonfly. Take, for example, a Kubernetes cluster with 1000 nodes, where each node needs to download a file or image at the same time. For the storage, that means 1000 concurrent download requests. When files are downloaded in large batches, the storage bandwidth can easily reach its limit. This causes slower container startup and slower file downloads in the cluster. How do we resolve the problem? There are three solutions. The first solution is to increase the bandwidth of the storage. But no matter how much the bandwidth is increased, a backend storage always has a limit, so this is not the best solution. The second solution is to use P2P technology, which uses the idle bandwidth of the nodes to eliminate the impact of storage bandwidth limitations. This is also the best solution for large-scale downloading, and it is what Dragonfly does. The third solution is to reduce the size of the download: remove duplication during building and load on demand during download. Nydus deduplicates files and downloads on demand. So Dragonfly covers both the second solution and the third solution.
In the next part, I will introduce the architecture and important features of Dragonfly. Dragonfly includes the manager, the scheduler, and the peer. The manager is a management service that manages the relationship between multiple P2P clusters. A P2P cluster includes a scheduler cluster, a seed peer cluster, and multiple peers.
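
Returning to the 1000-node example above, a rough back-of-the-envelope estimate shows why using the idle bandwidth of the nodes helps. This is only a sketch under strong assumptions: the file size and bandwidth figures are made up, and the P2P model is deliberately crude (the origin serves one full copy, then the set of nodes holding the file doubles each round). It does not describe how Dragonfly actually schedules pieces.

```python
import math


def origin_only_seconds(nodes: int, file_gb: float, origin_gbps: float) -> float:
    """Every node downloads from the origin, so they all share its bandwidth."""
    total_gbits = nodes * file_gb * 8
    return total_gbits / origin_gbps


def p2p_seconds(nodes: int, file_gb: float, origin_gbps: float, node_gbps: float) -> float:
    """Idealized P2P: one copy leaves the origin, then holders double each round."""
    file_gbits = file_gb * 8
    seed_time = file_gbits / origin_gbps    # first copy comes from the origin
    rounds = math.ceil(math.log2(nodes))    # doublings needed to reach all nodes
    return seed_time + rounds * (file_gbits / node_gbps)


# Assumed numbers: 1000 nodes, a 1 GB file, a 10 Gbit/s origin, 1 Gbit/s per node.
direct = origin_only_seconds(1000, 1.0, 10.0)
p2p = p2p_seconds(1000, 1.0, 10.0, 1.0)
print(f"origin only: {direct:.1f} s, idealized P2P: {p2p:.1f} s")
```

Under these assumed numbers the origin-only download takes 800 seconds, while the idealized P2P download takes about 81 seconds, and the origin sends the file only once. Splitting the file into pieces so that nodes can upload while they are still downloading would shorten the P2P time further.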
At the same time, the manager also provides dynamic configuration management, such as controlling the load limit of peers and seed peers. It also provides a front-end console, so users can easily operate the P2P cluster. The scheduler is a very important service in Dragonfly. Its main job is to select candidate parents for the downloading peer. When a peer's download fails, it instructs the peer to go back to the origin to download the task. The peer is the client in the P2P network. It can be both a downloader and an uploader. It starts a dfget daemon for distribution, and it gets its parents by exchanging information with the scheduler. For example, dfget downloads files from the command line, and dfget calls the gRPC API of the daemon to download files.
The manager includes a front-end console. It covers cluster operations, P2P metrics, user management, job management, and so on, so users can easily operate the P2P cluster.
This part introduces the scheduler, which is an important module of Dragonfly. Scheduling includes two steps. The first step is to filter for available parents. Unavailable parents are filtered out based on the parent's load limit, download state, download piece cost, and so on. Once the filter step is complete, the scheduler has all available parents. The second step is to evaluate the parents based on different features and sort them by their evaluation. Finally, the scheduler returns the best set of parents to the downloading peer.
Earlier versions of the scheduler used a tree to represent the network download topology, and there was no filter step. This caused two problems. The first problem is that a tree can only represent downloading from a single parent and cannot represent downloading from multiple parents concurrently. The second problem is that without a filter step, a bad node could become the root node of the tree, which would cause all nodes to download files slowly.
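
The two scheduling steps described above can be sketched as follows. This is a minimal illustration, not the Dragonfly scheduler: the `Parent` fields, the piece cost threshold, and the scoring weights are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Parent:
    peer_id: str
    upload_count: int         # uploads this parent is currently serving
    upload_limit: int         # load limit (assumed field name)
    state: str                # "running", "succeeded", or "failed"
    finished_pieces: int      # pieces this parent has already downloaded
    avg_piece_cost_ms: float  # average time spent downloading one piece


def filter_parents(candidates: list[Parent], max_piece_cost_ms: float = 1000.0) -> list[Parent]:
    """Step 1: drop parents that are overloaded, failed, or too slow."""
    return [
        p for p in candidates
        if p.upload_count < p.upload_limit
        and p.state in ("running", "succeeded")
        and p.avg_piece_cost_ms <= max_piece_cost_ms
    ]


def evaluate(parent: Parent, total_pieces: int) -> float:
    """Step 2: score a parent; the 0.6 / 0.4 weights are assumptions."""
    piece_score = parent.finished_pieces / total_pieces
    load_score = 1 - parent.upload_count / parent.upload_limit
    return 0.6 * piece_score + 0.4 * load_score


def schedule(candidates: list[Parent], total_pieces: int, limit: int = 2) -> list[Parent]:
    """Filter, then sort by score, and return the best set of parents."""
    available = filter_parents(candidates)
    ranked = sorted(available, key=lambda p: evaluate(p, total_pieces), reverse=True)
    return ranked[:limit]


candidates = [
    Parent("a", 2, 10, "succeeded", 100, 50.0),
    Parent("b", 10, 10, "running", 90, 40.0),    # at its load limit
    Parent("c", 1, 10, "failed", 30, 60.0),      # download failed
    Parent("d", 5, 10, "running", 50, 80.0),
    Parent("e", 1, 10, "running", 80, 5000.0),   # piece cost too high
    Parent("f", 0, 10, "running", 20, 30.0),
]
best = schedule(candidates, total_pieces=100)
print([p.peer_id for p in best])
```

Returning several parents rather than one is what lets a peer download from multiple parents concurrently, and the filter step keeps a slow or failed node from being handed out as a parent at all.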
Therefore, the scheduler now has two steps, filter and evaluation, to schedule the best parents, and the tree has been changed to a DAG so that files can be downloaded from multiple parents at the same time. The latest scheduler can increase the P2P network transfer speed. The community is now working on a scheduling solution based on machine learning. It is trained using probe data between peers and download cost data, and it predicts parents for the downloading peer based on a GNN and an MLP. The model includes download features and network features for prediction.
Of course, it is very simple to download once in a P2P cluster, but it is difficult to ensure the stability of downloading in a large-scale P2P cluster, and exception isolation is very important for service stability. Dragonfly provides three levels of isolation: service level, peer level, and download task level. Service level means that if managers, schedulers, or seed peers are unavailable, the exception is isolated in the cluster. Peer level means that if a peer's network has an exception, this peer is isolated from the P2P cluster. Task level means that if a download task has an exception, this task is isolated from the P2P cluster. Exception isolation ensures the stability of each download.
In the field of AI inference, three solutions are provided. The first solution is to use Dragonfly directly to distribute AI models. Users can use Dragonfly with the inference framework directly, without any extra work. The model file system solution is to build a model file system through Nydus, using its deduplication and on-demand loading features, and each chunk is distributed through the Dragonfly P2P network. The model image solution is similar to the model file system solution, but the difference is that it uses the image format. Many companies use this solution because the artifact is simple: it is just an image version.
There is no need to depend on other services, which reduces communication between different departments. Of course, Triton Server, TorchServe, and TensorFlow Serving can use Dragonfly through plugins to distribute AI models. This lets users simply use Dragonfly to distribute AI models. The community has also open sourced the plugins. For example, TorchServe can use the Dragonfly endpoint plugin to redirect its model download requests to the Dragonfly P2P network.
That is the end of my sharing. You can scan the QR code to follow Dragonfly and Nydus. Thank you, everyone.