All right. Hello, everyone. Welcome to our session. This session is about Fluid, a Kubernetes-native data orchestration framework. Fluid is an open source project that was started in September last year. The founders and maintainers are from Alibaba Cloud, Nanjing University, and the Alluxio community, and they have a lot of experience in Kubernetes and big data.

First, let me introduce the presenters today. My name is Yang Che. I'm from the Alibaba Cloud Kubernetes team, and I'm a co-founder of the Fluid community. The other speaker is Yuandong Xie. He is a key contributor to the Fluid project, and he is working at Tencent Cloud now. Both of us have contributed to Fluid from the very beginning.

In this session, I will first talk about the Fluid project itself, including the history of the project, why we wanted to create it, and what problems we'd like to solve. Then Yuandong will go through the overall architecture as well as some of the major design thinking behind it, and at the end he will give a demo showing how Fluid speeds up deep learning applications. We will also discuss the roadmap of the Fluid project, and we'd like to invite all of you to join if you are interested. Now let's start.

In recent years, we have noticed a big trend in big data and AI. According to Gartner's prediction, 70% of AI applications will run in containers on Kubernetes platforms by 2023. In addition, according to a report from The New Stack, Google is moving Spark application scheduling from YARN to Kubernetes.

By reviewing the changes in data-driven application components, we found that the architecture changed along three dimensions. First, the location relationship between computing resources and storage services. Second, the diversity of storage and workload types. Third, the deployment mode of computing, which is changing from a fixed mode to a dynamic mode.

Let's take a look. In 2010, MapReduce tasks running on HDFS were the mainstream, and the two were tightly coupled. In 2015, task types gradually extended from Hadoop to Spark, Flink, and TensorFlow. Storage types were also enriched to various systems, such as HDFS and Ceph, and computing and storage were separated. But at this time, applications were still deployed in a fixed mode, without any elasticity or containerization capabilities. Now, in 2020, we can see a lot of further changes. Task and storage types have become even more diverse, disaggregated computing and storage is the mainstream, and the environment has become hybrid cloud and Kubernetes.

One of the most typical features is elasticity: computing is no longer fixed on certain nodes. This architecture brings many benefits, like on-demand creation and automatic elastic scaling of computing resources. As data applications are deployed into Kubernetes clusters, developer productivity improves and cost goes down. But it also brings many new technical challenges. For example, Kubernetes is not good at handling heterogeneous data sources, and the disaggregated computing and storage architecture introduces data I/O latency. Furthermore, the scheduler lacks the capability of being aware of data affinity when scheduling workloads. These challenges slow down the adoption of data-driven applications on cloud-native platforms like Kubernetes. OK, let's go into the details and take a look at the first challenge.
The first challenge is multiple data storage services in hybrid cloud. Data-driven applications need to access heterogeneous underlying storage: different data services expose various low-level data access APIs, like POSIX and HDFS. As the business develops, building new AI models requires accessing data across different storage systems from various business departments together. For example, some data is in Ceph and some data is in HDFS. How do you use them together in the same application? The developer has to write code against different data access APIs in the same application. So we introduce Dataset, a high-level abstraction of data that hides the implementation details of the different data sources.

Second, slow data I/O. The cloud platform uses a disaggregated computing and storage architecture, which introduces I/O overhead. This makes applications slow and leaves computing resources like CPUs and GPUs idle. In addition, running on elastic cloud instances, which are created and destroyed on demand, breaks data locality: the result is remote data access, even when we already have a cache from the last run. Moreover, the bandwidth limits of the storage services, such as HDFS and Ceph, also affect data access performance. To solve these problems, we need to speed up data access through elastic distributed caching.

The last challenge is inefficient workload scheduling. The current Kubernetes scheduler is not aware of data location, so it does not reuse the existing cluster-level cache information, and it is not able to prefetch data intelligently for a workload before the run starts. To solve this problem, we designed a scheduling strategy on Kubernetes that coordinates applications with the cache runtime automatically.

Now, let's see what Fluid actually does. First, we provide the Dataset abstraction, which allows a pod to access multiple independent storage systems through the same POSIX or HDFS API. Rather than communicating with each individual storage system, the pod delegates this responsibility to Fluid, which handles the underlying storage systems. Second, under the hood of the Dataset abstraction, we make the cache workers autoscalable and portable. Finally, the information in the Dataset can also be used by the Kubernetes scheduler to improve data-affinity scheduling.

So Fluid is an elastic data abstraction and acceleration platform for cloud-native environments. Fluid is a CNCF sandbox project, and it is also an open-source community: there are many contributors from different companies, working together to discuss features and contribute code. In the next part, Yuandong from Tencent Cloud will introduce the Fluid architecture, how to use it, and the roadmap of the community. Thank you.

Hi, everybody. I'm Yuandong Xie from Tencent Cloud. Thanks to Yang for explaining what problems the Fluid project solves. Now, let me introduce the architecture of Fluid. From top to bottom: first, the user needs to define two custom resources. One is Dataset, and the other is Runtime. The relationship between a Dataset and a Runtime is one-to-one. So far, we support three kinds of Runtime: AlluxioRuntime, JindoRuntime, and GooseFSRuntime.
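To make this concrete, here is a minimal sketch of such a Dataset/Runtime pair, assuming Fluid's v1alpha1 API with an AlluxioRuntime; the mount point, names, and sizes are hypothetical placeholders, not from the talk:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: demo
spec:
  mounts:
    # Hypothetical source; several mounts from different storage
    # systems can be listed together under the same Dataset.
    - mountPoint: https://mirrors.example.com/training-data/
      name: demo
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: demo        # must match the Dataset name: the binding is one-to-one
spec:
  replicas: 2       # number of distributed cache workers
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 2Gi  # cache capacity per worker
        high: "0.95"
        low: "0.7"
```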
Inside the Fluid operator, there are two major components. The one on the left is the Fluid controller manager, which is responsible for scheduling Datasets. In Fluid, we convert the scheduling problem of a Dataset into the scheduling problem of the data acceleration engine. The controller manager contains three components. The first is the Dataset controller: it manages the lifecycle of a Dataset, including creation, binding and unbinding with a Runtime, and deletion. The second is the Runtime controller: it manages the lifecycle of a Runtime, including creation, scaling in and out, cache prefetch, and cleanup. The AlluxioRuntime is one of the Runtime implementations. The third is the volume controller: it is responsible for the creation and deletion of the data volumes related to a Dataset.

On the right is the application scheduler. It is responsible for assigning pods to a proper node using the information about the data cache. There are two components: the cache co-locality plugin and the prefetch plugin. The cache co-locality plugin schedules the job onto the nodes holding the cache intelligently, with no need for the user to specify it. The prefetch plugin notifies the Runtime to prefetch data to a specified node where the job will consume it.

Next, let's quickly introduce how to use Fluid. The core concept in Fluid is Dataset, which is essentially a CRD and provides the abstraction of data. Creating a Dataset is very simple: data sources from IDC and the cloud can be accessed through a unified interface via the CRD definition. At the same time, the user can define the scheduling information of the data. The following example comes from a real customer AI training case. Its training data is relatively large and already preprocessed, so it is placed on cloud object storage, like S3. But the validation data is sensitive and cannot be placed on the cloud; it needs to stay on IDC storage, like Ceph. Both sources are declared as entries under the same Dataset's mounts list. On one hand, the Fluid CRD provides a unified view; on the other hand, it provides distributed cache acceleration. With node affinity, you can dynamically move the data to GPU instances on the cloud at training time and accelerate data access. When training is not running, the data can be migrated to low-cost CPU nodes, avoiding occupying GPU instances and the dedicated network line.

The method of creating a Dataset is very simple: the user only needs to define the YAML fields as shown in the figure on the top right. Once created, you can see the total amount of data in the Dataset, how much data has been cached in total, and the cache ratio. This information can be used to trigger elastic scaling later. A Dataset can take data sources from OSS, COS, S3, HDFS, and Ceph, all accelerated through the distributed cache, and it can also accelerate an existing PVC (PersistentVolumeClaim). In this example, we use Alluxio as the runtime to accelerate the Dataset.

If you need to deploy an application that uses the Dataset, you only need to specify the name of the Dataset as the PVC, as in the first sketch below. The scheduler will then select a proper node based on the data cache information: it selects nodes with cache capabilities first, and among those, the nodes with more cached data. This is completely transparent to the user.

At the same time, we also support automatic elastic expansion of the Dataset cache volume. In this example, we again use the Alluxio cache engine as the default runtime. By default, when the cache usage ratio reaches a certain percentage, data eviction is triggered, which affects data access performance. Therefore, we provide an automatic expansion mechanism: when cache usage reaches a certain percentage, the distributed cache is expanded to provide greater caching capacity, as in the second sketch below.
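Here is the pod sketch mentioned above: a minimal, hedged example assuming a Dataset named demo, relying on the fact that Fluid exposes each Dataset as a PVC with the same name; the image and paths are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  containers:
    - name: trainer
      image: ubuntu:20.04   # hypothetical workload image
      command: ["sleep", "infinity"]
      volumeMounts:
        - mountPath: /data  # the cached Dataset appears here as a POSIX filesystem
          name: demo-vol
  volumes:
    - name: demo-vol
      persistentVolumeClaim:
        claimName: demo     # simply the Dataset name
```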
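And for the automatic expansion just described, one possible way to drive it, sketched along the lines of Fluid's autoscaling tutorial, is a standard HorizontalPodAutoscaler targeting the AlluxioRuntime. This assumes the cache usage is exposed to the custom metrics API (for example via a Prometheus adapter); the metric name and thresholds here are illustrative:

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-cache-scaler
spec:
  scaleTargetRef:            # scale the cache engine, not the application
    apiVersion: data.fluid.io/v1alpha1
    kind: AlluxioRuntime
    name: demo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Object
      object:
        metric:
          name: capacity_used_rate  # assumed cache-usage metric exposed
                                    # through the custom metrics API
        describedObject:
          apiVersion: data.fluid.io/v1alpha1
          kind: Dataset
          name: demo
        target:
          type: Value
          value: "90"        # expand before eviction kicks in
```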
In this demo, we will show how to use Fluid to accelerate a machine learning training job. We will compare the training speed and the final accuracy of the task, and we will also put this demo on GitHub for your reference.

First, let's take a look at the effect of reading training data directly from object storage. Here, you need to define the PV and PVC for the object storage in advance, and the CSI driver of the PV needs to be designated as the object storage driver. After creating the PV and PVC, we can view the data through the Kubeflow Arena command line. Next, we use Arena to submit the deep learning task, specifying the data parameter as the data volume name. Checking the log of the training task, we can see the image loading speed during training is about 638 images per second, and the final top-5 training accuracy is 92.26%. The entire task runs for a total of one hour, with every machine having eight V100 GPU cards.

Now let's take a look at the training after cache acceleration through Fluid. Here, we take the Alluxio cache engine as an example. First, we define the Dataset, specifying the source address of the training data in the cloud object storage. After creating the Dataset, you can also view it through the Arena command line and create the deep learning task in the same way. We can see that the image loading speed during training is about 2,800 images per second, and the final top-5 training accuracy is 92.5%. Comparing the two experiments, we find that, without affecting the final training accuracy or the user's habits, Fluid compresses the training time of the task from one hour to 24 minutes; the training speed is increased by about 4.5 times.

Up to now, we have released version 0.6.0, which mainly satisfies the use cases of data infrastructure engineers, who can simply and effectively use the cache to accelerate AI applications. When version 1.0 is released, we hope that data infrastructure engineers can achieve performance acceleration for both big data and AI workloads without having to care about data management. To achieve this goal, we need to implement the following features before version 1.0. The first is to further accelerate big data applications on Kubernetes through Fluid, on the basis of the already supported AI applications. The second is to improve data locality through scheduling optimization, thereby improving the efficiency of data access; that means comprehensive scheduling of data and workloads through the scheduler. The third is to ensure the ability to manage the lifecycle of Datasets in large-scale applications, through performance and stability optimization in the Kubernetes environment.

Beyond the 1.0 release, the plan can be divided into the following parts. The first part is automatic operation, maintenance, and observability: ensure the stable operation of the system through component self-recovery, multi-master support, and seamless upgrade to achieve runtime high availability; and use various declarative CRDs, such as metadata backup and recovery, data cache migration, data prewarming, and so on, to achieve the purpose of automatic operation and maintenance.
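As one concrete example of these declarative CRDs, here is a hedged sketch of data prewarming using the DataLoad resource that Fluid provides as of v0.6; the resource names and paths are placeholders:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: demo-prewarm
spec:
  dataset:          # which Dataset to warm up
    name: demo
    namespace: default
  target:
    - path: /       # prewarm everything under the mount root
      replicas: 1   # how many cache replicas to load
```

Declaring the warm-up as a resource, rather than running an imperative command, means the prefetch can be created by pipelines or controllers right before a training job is submitted.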
By integrating with Prometheus, Fluid automatically collects Dataset metrics to provide monitoring and alerting capabilities. The second part is scheduling: through Dataset scheduling strategies, such as affinity and toleration, schedule the cache to the proper nodes; in addition, report the cache location information to guide workload scheduling to those nodes and speed up data access. Part three is multi-runtime support: currently, the community supports the Alluxio, JindoFS, and GooseFS cache engines. Part four is the Fluid agent, which can obtain node cache information through an agent pull or push model, and can also implement cache cleanup and other operations through the agent. Part five is elastic scaling: through the definition of elastic scaling strategies, horizontal and vertical expansion and contraction of the cache instances adapt to the elasticity of workloads. Part six is application access modes: by supporting FUSE (Filesystem in Userspace) interfaces and HCFS (Hadoop Compatible File System) interfaces, Fluid can play a role in both AI and big data scenarios.

The community is currently developing steadily, and there are more than 18 contributors from different companies and organizations. If you are interested in participating or have questions, you can contact us through the CNCF Slack by searching for the fluid channel, or you can reach us through GitHub. We hold a community meeting every two weeks on Wednesday, Beijing time, where we discuss the progress of features, share proposals, and so on. You can also visit our website for more information. Welcome to join us.

At present, many companies have already used Fluid in production environments. For example, the voice AI company Unisound, the autonomous driving AI company HAOMO.AI, the social network company Weibo, and the telecom operator China Telecom are all using Fluid to accelerate AI workloads, and the security company Qihoo 360 is using Fluid to accelerate big data applications. At the same time, Alibaba Cloud and Tencent Cloud have also integrated Fluid's acceleration capabilities into their Kubernetes-based container platforms. I would like to thank you for your time and open it up for questions and answers. Thank you for joining.