Hello everyone, I'm Li Yongchuan from the storage team of OPPO, and I'm a maintainer of the CubeFS project. It's my great honor to introduce CubeFS today.

CubeFS went open source in March 2019 and released version 1.0 in April. At the end of 2019 we joined the CNCF sandbox. In 2020 we made several updates and supported more interfaces, including S3 and HDFS. In 2021 we released version 2.4, which enhanced stability. At the beginning of this year we released version 3.0, a significant release that supports erasure coding to reduce storage cost, and it has already been online for one year on OPPO's storage platform. We are honored that CubeFS became a CNCF incubating project in June, and in August we released version 3.1, which supports hybrid cloud acceleration and a QoS flow control component.

In terms of open source community growth, we compared the numbers before and after joining the sandbox. The growth indicates that CubeFS is being used more and more in practical applications: many software engineers are willing to join and contribute to the project, and many end users are willing to test it or use it in production environments.

In the field of file storage, some technical challenges are commonly acknowledged to obstruct practical applications. For example, we have to deal with files ranging from KB to TB, and above all, it is hard to serve mixed read and write patterns, whether sequential or random. Of course, there are many more difficulties than these.

These are the pain points of metadata management mentioned before. CubeFS provides a scalable metadata subsystem. Dentry and inode storage is partitioned by range, and the metadata is kept fully in memory for performance. Snapshots and the Raft replication pipeline ensure strong consistency between metadata replicas. The figure shows the in-memory organization of the shards in the metadata node. A B-tree is used to manage dentries and inodes, and features such as copy-on-write are implemented, which makes operations such as snapshotting and persistence convenient. The whole index design is relatively sophisticated; compared with peer products, this design better solves the problems of scalability and strong consistency. We will talk later about the multi-version feature, which is mainly implemented in these two trees, one for dentries and one for inodes.

CubeFS is also optimized for small files. Small files are stored centrally on the storage nodes by merging them into one big file, which improves throughput and performance, especially on HDD disks. It uses the punch-hole interface of the file system to free the disk space occupied by deleted small files, which greatly simplifies the engineering work of dealing with small file deletion.

This page is the CubeFS performance comparison. From the figure, we can see that sequential read and write throughput on large files is comparable to Ceph, while random read and write performance is better. This is due to the following reasons. First, each Ceph MDS only caches a portion of the file metadata in memory; in the case of random reads, the cache miss rate increases dramatically as the number of processes grows, causing frequent disk I/O. In contrast, each metadata node of CubeFS caches all the file metadata in memory to avoid expensive disk I/O. Second, overwrites in CubeFS are done in place, which does not require updating the file metadata; in contrast, an overwrite in Ceph has to go through multiple stages, and only after both the data and the metadata have been processed and synchronized can the commit message be returned to the client. CubeFS stores all file metadata in memory to avoid expensive disk I/O during file reads. In the case of small file reads and writes, the CubeFS client does not need to ask the resource manager for new extents; instead, it sends the request to the data node directly, which further reduces network overhead.
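To make the punch-hole trick for deleted small files concrete, here is a minimal sketch. It only illustrates the Linux fallocate call (via golang.org/x/sys/unix), not the actual CubeFS data node code; the file name and offsets are made up.

```go
// A minimal sketch of the punch-hole idea described above: once many small
// files have been merged into one large extent file, the space of a deleted
// small file can be handed back to the file system without rewriting or
// compacting the extent. Illustrative only; not CubeFS internals.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.OpenFile("extent.dat", os.O_RDWR, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Suppose a deleted small file occupied bytes [4096, 4096+8192) inside the
	// merged extent. Punching a hole keeps the logical file size but frees the
	// underlying blocks, so no background compaction pass is needed.
	const off, length = 4096, 8192
	err = unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, off, length)
	if err != nil {
		log.Fatal(err)
	}
}
```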
The above was about the CubeFS features before OPPO joined the project. Now let's talk about the use cases in OPPO and the improvements we made.

OPPO is a consumer electronics and mobile communication company with more than 20,000 engineers. We use CubeFS in three areas in practice: the AI platform, the data lake, and the Spark remote shuffle service.

In the AI platform, as the underlying storage, CubeFS solves a series of open problems in machine learning scenarios, such as massive small file storage, millions of directories, and hotspot access. It stores the various raw corpora and model data produced during inference and training, and supports advertising, games, recommendation, CV image processing, and other online businesses.

In the data lake there are many problems to be solved, and the main ones are about HDFS. First, the metadata is centralized in the NameNode, which limits scalability, and the data scale of a single cluster is limited. Second, storage and computing are coupled, and operability is poor. Third, cluster stability is poor. Fourth, the storage cost is high and HDFS is not friendly to small files. OPPO uses CubeFS to replace HDFS: CubeFS is friendly to both large and small files, supports multi-tenant isolation, and greatly improves stability. A single cluster can hold a much larger data scale, and the erasure coding engine reduces storage costs.

In the Spark remote shuffle service, the native shuffle is strongly bound to the local disk. CubeFS decouples the dependency on local storage resources and gives full play to the elastic advantages of distributed storage.

This page is a case study of the data lake solution. First, we need to know what a data lake is. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to structure it first, and run different types of analytics on it, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to support better decisions.

For the data lake, what are the requirements on CubeFS? First, we need to provide HDFS-compatible interface capability. Second, erasure-coded storage to reduce cost. Third, a client-side local cache for performance. Fourth, fault domains to enhance reliability in massive-disk scenarios. Fifth, QoS flow control. By now CubeFS has implemented all of these capabilities.

Let's look into the solution for part of the requirements. As can be seen from the picture, CubeFS has three interface capabilities: HDFS-compatible, POSIX-compatible, and S3. We developed the HDFS interface and use it in our production environment. All of the interfaces are built on top of the core client SDK, which includes features such as caching and streaming data processing for high performance.
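To make that layering concrete, here is a rough sketch of how several protocol gateways could share one core client SDK. The type names, the Read signature, and the bucket-to-path mapping are all hypothetical, not the real CubeFS SDK or S3 gateway.

```go
// Sketch only: HDFS-compatible, POSIX, and S3 front ends all built on the same
// cached, streaming read/write core, as described above.
package main

import "fmt"

// CoreSDK is a hypothetical stand-in for the shared client kernel.
type CoreSDK interface {
	Read(path string, off int64, buf []byte) (int, error)
	Write(path string, off int64, data []byte) (int, error)
}

// An hdfsGateway and a posixGateway would translate their protocols into the
// same CoreSDK calls; only the S3 side is sketched here.
type s3Gateway struct{ core CoreSDK }

func (g *s3Gateway) GetObject(bucket, key string) ([]byte, error) {
	buf := make([]byte, 4<<20) // illustrative fixed-size read
	n, err := g.core.Read("/"+bucket+"/"+key, 0, buf)
	return buf[:n], err
}

func main() { fmt.Println("gateway layering sketch only") }
```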
Let's move on to the erasure coding subsystem, which is a significant subsystem of CubeFS. The left figure shows the components of the whole erasure coding subsystem and the flow of data. The right figure shows the distribution of data under different AZ and EC modes. To be specific, we can use the Reed-Solomon code RS(3,3), which contains three data fragments and three parity fragments, all stored on different storage nodes within a single AZ. We can also use RS(10,4), which has higher durability and lower cost and can be enabled through configuration. We use RS(6,10) and RS(15,9) respectively under two-AZ and three-AZ deployments; all fragments are evenly distributed across nodes in the different AZs. On this basis, even when all the fragments in one AZ are entirely lost, we can still recover the data from the fragments in the other AZs. Additionally, we implemented Local Reconstruction Codes (LRC), with reference to the paper "Erasure Coding in Windows Azure Storage". LRC makes it possible to reconstruct a lost data fragment from fewer surviving fragments. The most important benefit of LRC is that it reduces the bandwidth and disk I/O incurred by data recovery while still maintaining a significantly low storage overhead.

The erasure coding subsystem has five advantages: large cluster capacity; high durability; lower TCO, reducing redundancy from 3x to about 1.3x or less; multi-AZ deployment with multiple configurable erasure coding modes; and online EC, which writes erasure-coded data directly, avoiding data migration and saving storage space.

Let's talk about the architecture. It includes four subsystems: the client, cluster management, the metadata subsystem, and the data subsystem. We made improvements mainly in the data subsystem when upgrading from version 2 to version 3; for example, we added erasure coding. Generally speaking, these are the main traits of the upgraded architecture. First, CubeFS now contains lower-cost erasure coding. Second, it is compatible with the former versions and supports smooth upgrades to version 3. Finally, volumes now include both replica volumes and erasure-coded volumes.

We made many optimizations to the erasure coding subsystem while keeping its durability guarantees. Degraded reads are allowed, and background asynchronous repair is performed for failed data blocks; through this repair mechanism, the tail-latency problem can be effectively resolved. For small files, we trade space for time: introducing redundant data blocks during EC encoding effectively reduces the number of I/Os and improves read performance, mirroring the tiny-file handling in the replica subsystem. The erasure coding subsystem reclaims space through the file system's punch-hole semantics, which avoids the large amount of data relocation introduced by asynchronous compaction during garbage collection.

Multi-replica storage is simple to implement, but its data durability is average, resource utilization is low, and storage cost is high. Erasure coding is more complex to implement, with higher data durability, higher resource utilization, and lower storage cost. Comprehensively, we evaluate data redundancy strategies along multiple dimensions such as business data scale, access model, durability, and cost requirements. Taking the OPPO mobile phone cloud business as an example, it serves hundreds of millions of users; the data volume is large and the cost pressure is strong. At present, the lower-cost erasure coding engine has been fully adopted for it in OPPO.
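To make the erasure coding arithmetic above concrete, here is a small, hedged example using the general-purpose github.com/klauspost/reedsolomon library rather than the CubeFS blobstore code. It encodes an RS(3,3) layout as in the talk, simulates losing two fragments, and prints the redundancy factor; for an RS(n,k) layout the overhead is (n+k)/n, so a wide layout such as RS(10,3) would land near the 1.3x figure mentioned above.

```go
// Illustrative only: RS(3,3) encoding with a generic Reed-Solomon library.
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	const dataShards, parityShards = 3, 3 // RS(3,3) as in the talk
	enc, err := reedsolomon.New(dataShards, parityShards)
	if err != nil {
		log.Fatal(err)
	}

	payload := bytes.Repeat([]byte("cubefs"), 1024)
	// Split allocates data+parity shard slots; Encode fills the parity shards.
	shards, err := enc.Split(payload)
	if err != nil {
		log.Fatal(err)
	}
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate losing two fragments (e.g. two disks) and rebuilding them from
	// the survivors.
	shards[0], shards[4] = nil, nil
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	// Storage overhead of RS(n,k) is (n+k)/n: 2.0x here, versus 3x for three
	// full replicas; wider layouts push it toward 1.3x or less.
	fmt.Printf("redundancy factor: %.2fx\n",
		float64(dataShards+parityShards)/float64(dataShards))
}
```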
Next, I want to introduce the hybrid cloud acceleration feature. In hybrid cloud AI training scenarios, most of the GPU computing power sits in the self-built cloud machine room, and a small part of the computing power uses public cloud GPUs. When the self-built cloud GPU resources cannot meet a user's sudden demand for computing power, tasks can be dynamically scheduled to the public cloud. The advantage of this scheme is that the number of self-built cloud GPUs can be kept under control, which not only meets users' normal training needs but also covers sudden incremental needs. The self-built cloud only needs to maintain a normal level of computing power, which effectively reduces the overall TCO.

However, this brings new challenges to the underlying storage. In the hybrid cloud scenario we often face challenges of data transmission and storage between the public cloud and the private cloud. First is the cross-region storage performance problem. Due to network latency of nearly two milliseconds, the storage access performance of a public cloud GPU host drops dramatically compared with a private cloud GPU host; the number of samples processed per GPU card per epoch differs by nearly two to four times. The I/O bottleneck leads to low GPU utilization, which is a great waste of training time and GPU resources. Second is the high cost of data migration. In the initial state, a task pod dispatched to a public cloud GPU mounts the public cloud file storage, which requires users to migrate the CubeFS data from the private cloud to the public cloud storage, a high cost for users. Third is the data security issue on the public cloud. A cache acceleration mechanism is therefore designed on the client side: it builds a local cache pool that pulls remote training data close to the GPU computing nodes, reducing CubeFS access latency to the microsecond and hundred-microsecond level.

This page shows the details of how the cache is designed and deployed. The picture below shows the deployment architecture. The data acceleration module is deployed on the public cloud GPU hosts in DaemonSet mode. The cache disk is mounted into the task pod, and the block cache pod accesses it by way of hostPath, so as to reduce requests to the unified storage in the private cloud. There are two types of requests, metadata and data, so we built mechanisms to cache both of them. For metadata, first, the cache needs to adapt flexibly to different types of training tasks: if the cache expiration time is set too long, the memory cannot be released in time; if it is set too short, the cache hit rate suffers. Second, we need to consider the scope of caching: based on a specified directory, we build the cache of dentries and inodes so as to manage the cache scope and lifecycle flexibly (a small sketch of such a directory-scoped, TTL-based cache appears after the test results below). The right picture shows the architecture of the inner components of cache acceleration: the metadata cache lives in the memory of the CubeFS client, which manages the cached inodes and dentries. For the data cache there are two key points. First, building the cache service has to consider resource limits and generality. Second, it needs index management and data management. Therefore, the data-side block cache is designed as a lightweight service module which mainly provides three functions: cache read/write, cache data index management, and cache disk space management.

The picture shows the test comparison results. It can be seen from the data in the figure that after the public cloud GPU hosts enable the CubeFS client cache acceleration, the throughput of ResNet-18 increased by 360% and 114%, and AlexNet increased by 130% and 80%, at 16 and 24 workers respectively. The effect is very obvious. Compared with private cloud GPU access, it also increased by 12 to 27 percent, with significant benefits. The cache hit rate has a very important impact on overall performance. Although the cache disk capacity of a single GPU host is limited, it is bounded by the number of GPU cards and the resources of a single machine, and the number of training tasks running at the same time is not large, so the cache hit rate stays basically stable at a high level.
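As promised above, here is a minimal sketch of a directory-scoped, TTL-based metadata cache. The types and names are hypothetical, not the CubeFS client code; it only illustrates the scope and expiration trade-off just described.

```go
// Entries under a chosen directory are cached with a TTL, so the cache scope
// and lifetime can be tuned per training workload.
package main

import (
	"strings"
	"sync"
	"time"
)

type cachedInode struct {
	inode    uint64
	expireAt time.Time
}

type MetaCache struct {
	mu      sync.Mutex
	rootDir string        // only paths under this directory are cached
	ttl     time.Duration // too long wastes memory, too short hurts hit rate
	entries map[string]cachedInode
}

func NewMetaCache(rootDir string, ttl time.Duration) *MetaCache {
	return &MetaCache{rootDir: rootDir, ttl: ttl, entries: map[string]cachedInode{}}
}

func (c *MetaCache) Put(path string, ino uint64) {
	if !strings.HasPrefix(path, c.rootDir) {
		return // outside the configured cache scope
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[path] = cachedInode{inode: ino, expireAt: time.Now().Add(c.ttl)}
}

func (c *MetaCache) Get(path string) (uint64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[path]
	if !ok || time.Now().After(e.expireAt) {
		delete(c.entries, path)
		return 0, false
	}
	return e.inode, true
}

func main() {
	c := NewMetaCache("/dataset/train", 30*time.Second)
	c.Put("/dataset/train/img_0001.jpg", 42)
	if ino, ok := c.Get("/dataset/train/img_0001.jpg"); ok {
		_ = ino // cache hit: no round trip to the metadata service needed
	}
}
```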
This page is about QoS. When designing QoS, two aspects were mainly considered: the first is the maintainability of the system, and the second is the characteristics of the flow control requirements of a file system itself. Therefore, in our implementation there is no dependency on external modules, and there are mainly two modules: one is the flow control module on the client, and the other is the global flow distribution management module on the master. Through periodic client reporting and traffic requests, the master allocates and pre-allocates traffic quota for clients. We try to reduce the interaction between the client and the master as much as possible, since the master may need to serve about 10,000 clients. For clients that have not issued requests for a long time, the period for obtaining traffic from the master is extended to reduce interactions. (A small sketch of the client-side limiter appears after the snapshot discussion below.)

Next is the snapshot feature. In storage for the AI training platform, the characteristics of CubeFS match the storage characteristics of the data required for AI training, especially the mass storage of small files, which is a pain point for many storage systems. At the same time, on a multi-user AI training platform, thousands of people may operate on the same batch of data, so data security and historical recovery have become a real demand, which is also a data storage demand inside OPPO. Therefore, CubeFS began designing the snapshot feature in OPPO, and this feature has also become a key competitive capability compared with similar systems.

It can be seen from the figure that different colors represent different versions, and the different colors of the clients represent the versions they read; only the newest version can be read and written. Multiple versions of the metadata are stored in the metadata node. The colored blocks in the diagram also indicate which version each data block belongs to, but there is no need to identify the version on the data node: the version is controlled entirely by the metadata. The feature has several nice properties: first, snapshots are created in seconds; second, lock-free snapshot version reads; third, no write amplification; fourth, no space redundancy for metadata or data; fifth, strong consistency.

This figure shows the index design inside the metadata node. As mentioned earlier, the index is stored in a B-tree. The picture shows the relationship between the different versions of the extent information inside a file: the versions are indexed according to their time order, and reading any version can retrieve the complete extent information.

Snapshot is a complex feature. First, multiple components are involved in implementing a snapshot operation, and almost all key internal routines need to be updated. Second, the snapshot deletion scenario involves visibility-range judgment and operations on inodes and dentries. Third, there is the problem of excessive data fragmentation caused by modifications to data fragments, which we addressed through optimization and merging. Fourth, we need to guarantee consistency between the client cache, the metadata, and all the nodes.
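Going back to the QoS mechanism described above, here is a hedged sketch of the client-side flow control idea: the master periodically hands each client a traffic quota, and the client enforces it locally with a token bucket, so normal reads and writes do not need a round trip to the master. The names and the refresh protocol are illustrative, not the actual CubeFS QoS code; golang.org/x/time/rate stands in for the limiter.

```go
package main

import (
	"context"
	"time"

	"golang.org/x/time/rate"
)

type FlowController struct {
	limiter *rate.Limiter
}

// refreshFromMaster stands in for the periodic report/allocate exchange with
// the master; here it just applies whatever bytes-per-second quota it is given.
func (fc *FlowController) refreshFromMaster(quotaBytesPerSec float64) {
	fc.limiter.SetLimit(rate.Limit(quotaBytesPerSec))
	fc.limiter.SetBurst(int(quotaBytesPerSec)) // allow roughly 1s of burst
}

// Write blocks until the local quota allows n bytes to be sent.
func (fc *FlowController) Write(ctx context.Context, n int) error {
	return fc.limiter.WaitN(ctx, n)
}

func main() {
	fc := &FlowController{limiter: rate.NewLimiter(rate.Limit(64<<20), 64<<20)}
	go func() { // pretend the master re-allocates quota every few seconds
		for range time.Tick(5 * time.Second) {
			fc.refreshFromMaster(128 << 20)
		}
	}()
	_ = fc.Write(context.Background(), 4<<20) // a 4 MiB write under flow control
}
```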
Okay, next we will talk about the remote shuffle solution. The technical evolution of big data computing engines has always been inseparable from the optimization of shuffle, whether it is optimizing execution plans to avoid shuffle operations as much as possible, or evolving various shuffle mechanisms to shorten the time consumed by shuffle as much as possible. Two problems in particular became the key factors for the efficiency and stability of big data computing. One is fragmented reads and writes on disk: the map side spills to disk multiple times, and the reduce side reads only part of each partition's data, which hurts efficiency. The other is that the reduce side reads local data on the map side, which requires many remote network connections and hurts stability. The evolution of shuffle technology has been advancing along solutions to these two problems, for example the remote shuffle service.

The principle of the remote shuffle service is that map tasks push data of the same partition to the remote shuffle service, and the remote shuffle service aggregates the data of the same partition. Shuffle then uses a distributed file system as its storage base (a small, purely illustrative sketch of this write path appears at the very end). Today, when distributed storage technology is so well developed, we do not need to spend much effort optimizing storage ourselves: professional things are handed over to professional systems. The main advantage gained from this is, of course, reducing the complexity of the shuffle system itself and improving its stability. Second, the distributed file system itself has the advantages of good stability, scalability, and load balancing. Third, it adapts to a variety of distributed file systems, supporting diversification and making full use of the advantages of different systems. Fourth, it decouples the shuffle workers from local storage, separates storage and computing, and makes it easier to deploy on the cloud.

Let's introduce the flexible replication strategy used for shuffle. The upper figure shows the deployment with three replicas, where the client and the data subsystem are deployed independently. The lower figure shows the deployment with a single replica, where the client can be co-deployed with the data subsystem, which then prefers local storage and will not choose other nodes as long as there is sufficient local space. For the replica option, we can choose one, two, or three replicas; the fewer the replicas, the lower the overall cost. Focusing on the single replica: it reduces TCO and network traffic, and it also significantly reduces write latency.

The last page is about the roadmap of CubeFS. Several important components are in development. First is CubeKit, a structured storage for mobile applications and various devices. Second is an HDFS-protocol-compatible gateway, so that HDFS clients can access CubeFS seamlessly without any modification. Third is operation atomicity, to recover and clean up the residual intermediate states left when routines are interrupted by failures in the distributed environment. And there are two main items in the release plan: the first is the snapshot feature mentioned before; the second is the reconstruction of the erasure coding subsystem, reducing the number of components so that it becomes easier to deploy.

Thank you.
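As promised, here is the small, purely illustrative sketch of the push-style shuffle write path discussed earlier: map tasks group records by partition and push each partition's data to a file on the distributed file system, so reducers later read one contiguous partition instead of many local fragments. The paths and types are hypothetical, not the actual shuffle service API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// pushPartitions appends each partition's buffered records to its own file
// under a distributed file system mount point (hypothetical layout:
// <mount>/shuffle/<app>/partition-<id>).
func pushPartitions(mount, appID string, partitions map[int][][]byte) error {
	for pid, records := range partitions {
		dir := filepath.Join(mount, "shuffle", appID)
		if err := os.MkdirAll(dir, 0o755); err != nil {
			return err
		}
		f, err := os.OpenFile(filepath.Join(dir, fmt.Sprintf("partition-%d", pid)),
			os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
		if err != nil {
			return err
		}
		for _, rec := range records {
			if _, err := f.Write(rec); err != nil {
				f.Close()
				return err
			}
		}
		if err := f.Close(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The mount path is made up; in practice it would be a CubeFS mount.
	_ = pushPartitions("/mnt/cubefs", "app-001", map[int][][]byte{
		0: {[]byte("k1\tv1\n")},
		1: {[]byte("k2\tv2\n")},
	})
}
```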