Hello, everyone. My name is Qi Wenbo; you can call me Gaius. I'm a maintainer of Dragonfly. The other speaker is Jiang Han, who works for Ant Group. I hope my introduction lets you know the current state of Dragonfly, and that many developers will become interested in it. Okay, the first page. I will introduce what Dragonfly is. Dragonfly provides image acceleration and file distribution based on P2P technology, aiming to be the best practice and standard solution in cloud-native architectures. It is hosted by the CNCF as an incubating project, and it is designed to improve the speed of file distribution in large-scale clusters. It is used in the fields of application distribution, log distribution, cache distribution, image distribution, AI dataset distribution, and AI model distribution. Okay, let's look at the current state of the CNCF landscape. There are two projects at the graduated and incubating levels in this area: one is Harbor, a registry, and the other is Dragonfly, which provides image acceleration and file distribution. Now we have more than 100 contributors, and the maintainers come from Ant Group, Alibaba Group, ByteDance, Intel, Baidu AI Cloud, Jeep AI, and Dalian University of Technology. Next page, I will introduce Nydus. Nydus is a sub-project of Dragonfly. It provides a filesystem in the RAFS format; it removes duplicate chunks at build time and downloads on demand, chunk by chunk, at runtime. So Nydus can reduce the end-to-end cold start of a container from minutes to seconds. Its maintainers come from Ant Group, Alibaba Group, and ByteDance. Okay, let me introduce the important milestones of Dragonfly. Dragonfly has been selected and put into production by many internet companies since it was open-sourced in 2017. It joined the CNCF as a sandbox project in 2018 and became an incubating project in 2020.
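The chunk-level deduplication Nydus performs at build time can be illustrated with a toy sketch. This is not the real RAFS format or its code — just a minimal, hypothetical model of the idea: a blob is split into fixed-size chunks, the manifest references chunks by digest, and each distinct chunk is stored only once.

```python
import hashlib

CHUNK_SIZE = 4  # toy chunk size; real chunk sizes are far larger


def dedup_store(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (manifest, unique_chunks): the manifest lists chunk digests
    in order, while unique_chunks keeps each distinct chunk's bytes once."""
    manifest, unique = [], {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i : i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        manifest.append(digest)   # every position is referenced...
        unique.setdefault(digest, chunk)  # ...but duplicates are stored once
    return manifest, unique


# A blob with repeated content: 4 chunk references, only 3 chunks stored.
blob = b"AAAABBBBAAAACCCC"
manifest, unique = dedup_store(blob)
print(len(manifest))  # 4
print(len(unique))    # 3
```

At runtime the filesystem only needs to fetch the distinct chunks a workload actually reads, which is what turns cold starts from minutes into seconds.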
And in 2020, Nydus became a Dragonfly sub-project and is now widely used for image acceleration. In 2021, Dragonfly released version 2, and today Dragonfly is used not only for image acceleration but also for file distribution and AI data distribution. The Dragonfly maintainers now believe the project is ready for graduation, so we submitted the Dragonfly graduation proposal. Currently Dragonfly focuses on three parts. The first part is image acceleration. Dragonfly supports container clients such as Docker, containerd, and so on, and it provides three solutions for image acceleration. The first solution is to use Dragonfly to distribute the container image, which is suitable for large-scale clusters. The second solution is to use Dragonfly and Nydus together to distribute the accelerated image, which is suitable for large-scale clusters and faster container launching. The third solution is to use Nydus alone to distribute the accelerated image, which is suitable for faster container launching. In the field of file distribution, Dragonfly supports protocols including HTTP and HDFS. It also supports object storage protocols including S3, OSS, OBS, and so on. We have been working on image acceleration and file distribution for almost six years, and I think Dragonfly is the standard and best practice in these fields. For AI infrastructure, Dragonfly supports distributing models and datasets based on P2P technology. Dragonfly now supports Triton Server and TorchServe for distributing models, and it also supports distributing datasets and models from Hugging Face via its SDK through Dragonfly. In the future we will pay more attention to AI infrastructure, because we believe P2P technology is the best solution for model distribution. Okay, in the next part I will introduce why we use Dragonfly.
For example, suppose we have a Kubernetes cluster with 1,000 nodes, and each node downloads a large file, large image, or large model at the same time. For the storage, that is 1,000 concurrent download requests, so the storage can easily reach its bandwidth limit, which causes slower file downloads and slower container launching. So how do we solve this? There are three approaches. The first is to increase the bandwidth of the storage, but no matter how much you increase it, the storage remains a bottleneck with some limit, so this is not a good solution. The second is to use P2P technology to exploit the idle bandwidth of the nodes and eliminate the impact of the storage bandwidth limit; this is what Dragonfly does. The third is to reduce the amount of downloaded data: remove duplicate content at build time and download on demand at runtime to reduce the file size. Nydus removes duplicate content chunk by chunk and downloads on demand chunk by chunk in the filesystem. So Dragonfly covers both the second and the third approach. Okay, in this talk I will focus on AI inference. When we download a large model file in a large-scale cluster, we must face the problem of the storage's bandwidth limit. Dragonfly supports the Hugging Face SDK, the Git LFS protocol, Triton Server, and TorchServe for distributing models, so we can easily use Dragonfly in AI inference. The best approach is to use Dragonfly to forward the model download traffic into the P2P network. Users upload the model to the model registry, and the inference service downloads the model from the model registry through Dragonfly, so we can use the idle bandwidth of the nodes to eliminate the storage bandwidth limit.
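The argument above can be made concrete with some back-of-envelope arithmetic. The numbers here are illustrative assumptions, not measurements from the talk: a 10 GB model, a 100 Gbps origin, and 25 Gbps of usable peer bandwidth.

```python
def origin_only_seconds(nodes: int, file_gb: float, origin_gbps: float) -> float:
    """Every node pulls from the origin: total bits divided by origin bandwidth."""
    total_gbits = nodes * file_gb * 8
    return total_gbits / origin_gbps


def p2p_lower_bound_seconds(file_gb: float, origin_gbps: float,
                            peer_gbps: float) -> float:
    """Rough lower bound with P2P: the origin serves the file once, then
    peers fan it out in a pipeline using their own idle bandwidth."""
    return file_gb * 8 / origin_gbps + file_gb * 8 / peer_gbps


# 1,000 nodes each pulling a 10 GB model through a 100 Gbps origin link:
print(origin_only_seconds(1000, 10, 100))    # 800.0 seconds, origin-bound
# With P2P, the origin pays for one copy and peers share the rest:
print(p2p_lower_bound_seconds(10, 100, 25))  # 4.0 seconds, a lower bound
```

The exact figures depend on topology and scheduling, but the shape of the result is the point: origin-only cost grows linearly with the number of nodes, while the P2P cost is roughly independent of it.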
Okay, this is the performance testing of Hugging Face using Dragonfly to distribute models and datasets. For details, you can find the document on the Dragonfly website. The next part is the performance testing of the Git LFS protocol downloading AI models through Dragonfly. And this is the performance testing of TorchServe using Dragonfly to distribute AI models via TorchServe's native plugin. You can also find that document on the Dragonfly website. Okay, for the future: we will release a new version in the first half of this year, and we will keep focusing on AI inference. There are two important features in the new version. The first is that Dragonfly will support writing large files directly into a peer. When a large file is written to a peer, the peer only needs to report the metadata to the Dragonfly scheduler; the file can then be downloaded by other peers without uploading the model to another storage such as S3. This feature will let AI models and datasets be read and written faster in the P2P network. The second feature is that Dragonfly will support RDMA for faster network transmission in the P2P network; it can also load the model into memory directly, with CPU offload and zero copy. Okay, the following part will be shared by Jiang Han. Thanks.

Bonjour, everyone. Well, it looks like everyone is talking about AI, so let's take a look at something else. I will start with the background of the project we built. We built a system called HUSE, which means Hyper-Unified Service Engine. We built it to handle both serverless functions and Spark jobs from big data applications, and we also support tasks from Flink. Why did we build this? The answer is easy: we wanted to standardize the allocation of resources.
Well, you can see that we have lots of schedulers in different systems: a scheduler in Kubernetes used to schedule pods, and a scheduler in YARN used to deploy Spark tasks. So we wanted to build a unified engine to schedule everything. That's the background. Okay, next I will give a brief overview of the system. You can see that the whole architecture is much like Lambda from AWS. I will focus on the node level and skip a lot of details to reduce redundancy. Okay, let's take a look. When a user invokes a function service, we download the files from different storage backends with Dragonfly. We have a node worker on the node side, a bit like the kubelet in Kubernetes. You can see that we download different files from different backends: Spark JAR packages for big data, and the user code and runtime images for serverless functions. After downloading them, we start a container — something like a container, but not exactly. Yeah, that's the whole architecture. Next, I will explain why we chose Dragonfly as the download utility for the whole system. The first and most important reason: we need to download files from different backends — Git LFS file storage, content from Git repositories, and also HDFS and OSS. I have to say that Dragonfly inherently supports all those backends, so it saved us a lot of time when developing the whole system. I also showed data from a benchmark website for AWS Lambda: you can see the time cost of a cold start is under 150 milliseconds. That's very fast. And what is that time spent on? Downloading the user code and starting the container on the node.
That number underscores the need for a swift download: you cannot spend too much time downloading what you need when you want to build a serverless architecture. And the last reason I chose Dragonfly is that the maintainer, Gaius, sits right beside me, so if I hit a problem in the production cluster, you know, I can find him. Okay. Next, I will show you some special scenarios we meet in the cluster. The first is that there are significant differences in file size in the system, because we download huge images for serverless functions and for Spark tasks, but sometimes a Spark task's files are very small — maybe 50 MB. Let me give an example: you have two types of tasks. The first type needs to download a file of about 50 MB, and the second type needs to download 10 GB. In such a scenario you need to consider file size when you schedule — you cannot schedule all the type-one tasks onto one node. The next issue is that in some scenarios we need to download a large number of files, as with Spark. Many Spark tasks contain lots of files — configuration, some initialization files — and the file count can reach 1,000. I ran a benchmark test, and you can see that when the file count reaches 2,000, the time cost grows to over two minutes. We dug into why the file count causes such a result: more files put more pressure on Dragonfly's scheduler. So the solution is to compress all the files of a Spark job into one archive and decompress them after downloading. Pretty easy, right? Okay, next I will describe how we deploy the whole system, focusing on the node level.
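The pack-then-unpack workaround above can be sketched in a few lines. This is a generic illustration, not HUSE's actual code: many small Spark job files become one gzip'd tar, so the P2P scheduler sees a single download task instead of thousands.

```python
import io
import tarfile


def pack(files: dict[str, bytes]) -> bytes:
    """Bundle many small files into one gzip'd tar archive, so the
    distribution layer schedules one task instead of one per file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack(blob: bytes) -> dict[str, bytes]:
    """Decompress after downloading, restoring the original files."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        for member in tar.getmembers():
            out[member.name] = tar.extractfile(member).read()
    return out


# 1,000 tiny config files collapse into a single object to distribute.
files = {f"conf/part-{i}.properties": b"key=value\n" for i in range(1000)}
archive = pack(files)
assert unpack(archive) == files
```

The archive also compresses well when the small files are repetitive, so this trades a little CPU on each node for far less scheduler load.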
We bundle the necessary components into one image — the Docker daemon, the Dragonfly daemon, and some runtime binaries — and then deploy that image as a pod in native Kubernetes. You can see that we allow this pod to run under different quality-of-service levels; for example, we run the pod under the batch group, which has lower priority than the online group. With this deployment method we can improve the resource utilization of a single node. This deployment method brings us lots of benefits. The first is that the whole cluster is scalable: we have a scheduler that detects a lack of resources, and when it finds we need to scale out, it calls the API server to create a new pod. The second is that the whole system supports multi-tenancy: we support different configurations for different tenants in a cluster by putting them in different ConfigMaps and mounting different ConfigMaps into different pods. And the whole system is configurable: if you modify a ConfigMap and apply it, it takes effect in the cluster immediately, without restarting the pod. Okay, to give you a brief insight into our scale, here is a data panel from our dashboard. You can see that during the last week the maximum task creation reached 1.44 million — pretty big, right? Okay, and to show the performance of Dragonfly, here is the data for the last 24 hours. You can see there was a spike around 10 AM, because this is an active time for the online business; during this interval the batch group is constrained, so the pod's performance is constrained as well. Apart from this interval, you can see Dragonfly's download speed is very stable across the whole day.
This is data from a single day, but I have to point out additionally that Dragonfly has sustained a stable download speed over the last three years, which is pretty good. Okay, that's all the content I want to show. If you're interested, you can scan the QR code to access Dragonfly's official website. That's all — if you have any questions, I'm happy to answer. Thank you.

Okay, thank you. I'm Charles from Huawei, and I have one question about peer-to-peer image distribution. From my point of view, it is very important to decide which is the first node to pull the image, since that node can distribute the image to the second and third ones, and the second and third could then become new peer-to-peer sources. So how do you choose the very first one? Do you have a scheduler, do you have algorithms related to that scheduler, and is it open-sourced? Could you give me and our audience more information on that? Thank you very much.

Okay, I got your question. We do have a scheduler in the Dragonfly system, and all the technical details are on our website — the documentation there is very detailed, and I trust you can find the answer yourself. I actually contributed to Dragonfly later, so if you want more details, I can find the guy to answer the questions. Do you want that, or do you just want to find the answer yourself? Is it okay to... Gaius, okay, come on. Thank you very much.

For peer-to-peer, the first image comes from the origin storage. Dragonfly selects the first peer much like a BitTorrent tracker: if you know the BT protocols, you know that BitTorrent has a tracker to do the scheduling. Dragonfly has a scheduler in the cluster, and Dragonfly has two types of peers: the first is the normal peer, the second is the seed peer.
So in our clusters, when a file is downloaded via peer-to-peer, the scheduler first targets a seed peer, which downloads from the origin source, and then the other peers download from the seed peer. We can deploy the seed peers on nodes that have better network, so we can pull the first copy quickly. Okay.

Well, thanks for the question. Additionally, as I mentioned in the slides, if you download too many files it will put pressure on Dragonfly's scheduler, so you need to take care of that. So, anything else, guys? Okay. Okay, thank you, guys. That's all. Enjoy.