Hello everyone, I'm Li Yongchang from OPPO, and I'm also a maintainer of the CubeFS community. It's a great honor for me to once again participate in the KubeCon North America conference. Today I will be talking about machine learning best practices based on CubeFS.

First, let's look at the development history of CubeFS. On this page, we can see that CubeFS has gone through several key milestones. In 2019, we joined the CNCF sandbox. In early 2022, we released version 3.0, which introduced the erasure coding subsystem. In the middle of 2022, we successfully became a CNCF incubating project. Throughout 2023, we released several versions, including a refactoring of the erasure coding subsystem, support for auditing, and improvements in stability. In our most recent release, we added support for an atomic POSIX interface and space quota management. The upcoming version will include features such as snapshots, automated disk migration, lifecycle management, and a recycle bin.

This is the ecosystem of CubeFS. CubeFS covers various domains. As shown in the diagram on the left, we provide storage services for databases and analytics engines such as Elasticsearch and ClickHouse, which are focused on computation. We also support traditional big-data technologies such as HDFS, HBase, and Spark. Additionally, we integrate with associated ecosystem components like Prometheus and Kubernetes. On the right side, we have an overview of the community and its operations. Currently, our community contributors include OPPO, BIGO, Xiaomi, JD.com, and many others.

In terms of architecture, CubeFS was also introduced at last year's conference. CubeFS is a system with scalable multi-protocol interoperability. As you can see from the picture, the three protocol components, HDFS, POSIX, and S3, are interconnected, so the same data can be accessed through any of the three protocols. Data written through S3 is recorded by the object node in the picture, but it can also be read through the POSIX interface. On the storage engine side, we support both a multi-replica engine and an erasure coding engine. Both subsystems are horizontally scalable, and our metadata nodes are horizontally scalable as well. Sharding is performed within the metadata nodes, and within each shard we form a replica group to maintain strong availability.

At the same time, CubeFS is a high-performance system. Our metadata subsystem supports a full in-memory metadata organization based on B-trees. As will be mentioned later, we have built a multi-level cache system. Especially on the cloud side, we have made a lot of optimizations, including GPU Direct Storage, which will be discussed later. We are also making adjustments in terms of architecture, which we will discuss in the hybrid cloud part later.

What you see in the diagram is our erasure coding subsystem, which is an independent system. In the diagram, you can see that we have the access layer, the metadata management system, the automatic health check and repair system, and the storage pool. In the metadata section, we use Raft for three-replica consensus. In terms of multi-zone deployment, it specifically supports deployments across one, two, or three availability zones. Different configurations of data blocks and parity blocks can be used, and we also support local parity blocks within each zone, enabling fast repairs, as the sketch below illustrates. Additionally, we have implemented special handling for small files to reduce system fan-out and improve performance by trading space for time.
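To make the repair benefit of local parity concrete, here is a back-of-the-envelope sketch in Python. The layout, three zones each holding four data blocks plus one local parity block with global parity on top, is invented for this illustration and is not CubeFS's actual erasure-coding configuration; the point is only that a single lost block can be rebuilt from same-zone reads instead of cross-zone ones.

```python
# A back-of-the-envelope sketch of the repair-cost argument above. The layout
# (3 zones, each holding 4 data blocks plus 1 local parity block, with global
# parity on top) is invented for illustration; it is not CubeFS's actual
# erasure-coding configuration.

DATA_PER_ZONE = 4            # data blocks placed in each availability zone
ZONES = 3                    # the volume spans three availability zones
K = DATA_PER_ZONE * ZONES    # 12 data blocks total for the global code

def blocks_read_to_repair(local_parity: bool) -> int:
    """How many surviving blocks must be read to rebuild one lost block."""
    if local_parity:
        # Local repair: decode inside the failed zone using its own
        # local parity block, so only same-zone reads are needed.
        return DATA_PER_ZONE             # 4 reads, no cross-zone traffic
    # Global Reed-Solomon repair: read any k surviving blocks,
    # most of which live in other zones.
    return K                             # 12 reads, mostly cross-zone

print(blocks_read_to_repair(local_parity=True))   # -> 4
print(blocks_read_to_repair(local_parity=False))  # -> 12
```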
The topic today is best practices for AI. The workflow of large models can be divided into three stages. The first stage is data storage and processing. The second is model development and training. The third is model release and online inference.

Let's talk about the first stage. In this stage, the storage requirements are scalability and efficiency. Large models rely on a massive amount of data that may be distributed across different data centers, clouds, applications, or systems. Collecting this data is a prerequisite for training large models. To collect data efficiently, a high-capacity and scalable storage platform is needed. Additionally, the platform should support various methods for fast data collection and data import from different sources. Furthermore, processing tasks such as cleaning, labeling, deduplication, and augmentation require sharing and interoperability of the same dataset across different data processing platforms.

The second stage is model development and training. In this stage, the storage requirement is to fully utilize the GPUs without waiting. Large models have high training costs and require significant computational power. Due to the enormous number of parameters in large models, training often involves parallel computing on thousands of GPUs for several months, so the parallel IO capability and stability of the storage directly impact the cost of training large models.

The third stage is model release and online inference. In this stage, the storage requirement is distribution within seconds. Once the model passes evaluation and validation, it needs to be deployed quickly to the online inference service to generate business value. The entire update process requires fast and synchronized distribution, ensuring that all inference nodes complete the model update at the same time. This avoids inconsistency among inference nodes that could affect inference results and the user experience. Achieving second-level synchronization requires storage with high concurrency and throughput capabilities.

The training process involves repeatedly reading the same dataset and performing multiple iterations of computation. It has the following IO characteristics. First, the same dataset is read repeatedly: each epoch reads every item from the same dataset, and each item is read only once per epoch. Second, the data read by a given GPU node in each epoch is random: random shuffling is applied to the dataset to prevent model overfitting, so the portion of data processed by a GPU in any two epochs is randomly selected. The writing of checkpoints during the training process is also important. Checkpoints are written periodically to allow training to resume from the most recent point in case of interruptions, and the checkpoint writing process blocks the training, as sketched below.
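As a minimal sketch of the access pattern just described, the loop below reads every sample exactly once per epoch in a freshly shuffled order and writes checkpoints synchronously, so training stalls while the write is in flight. The names `dataset` and `save_checkpoint` are hypothetical stand-ins, not a real framework API.

```python
import random

def train(dataset, epochs, checkpoint_every, save_checkpoint):
    """Sketch of the epoch-level IO pattern: every sample is read exactly
    once per epoch in a freshly shuffled order, and periodic checkpoint
    writes block the loop (the GPUs idle) until they complete."""
    order = list(range(len(dataset)))
    step = 0
    for _ in range(epochs):
        random.shuffle(order)          # a new random order each epoch
        for idx in order:              # each sample read exactly once
            sample = dataset[idx]      # re-reads can hit the cache in later epochs
            # ... forward/backward pass elided ...
            step += 1
            if step % checkpoint_every == 0:
                save_checkpoint(step)  # synchronous write: training pauses here
```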
Considering the IO characteristics of training large models, building a large-scale distributed data cache between the computing nodes and the data center can significantly improve the parallel IO throughput of training. It also helps relieve IO pressure on remote storage and enhances overall system stability.

Now let's look at the challenges of large-model workflows. The first challenge is high-capacity and efficient data interoperability. The workflow of large models relies on the rapid collection of massive amounts of data, on model training and governance, and on the ability to share and exchange data at different stages. This requires the storage to hold large-scale data while unifying storage for data with different lifecycles and hotness.

The second challenge is training acceleration capability. Matching the powerful computing capacity of GPUs and eliminating GPU idle time puts the parallel IO capability of the storage to the test. Sufficiently stable storage, fast checkpoint saving, and quick recovery after interruptions are crucial to avoid blocking training when interruptions occur. Robust storage performance and stability are key to reducing the training cost of large models.

The third challenge is accelerated checkpoint writing capability. Efficient checkpoint writing is important for keeping model training progressing. The ability to quickly write checkpoints while training is ongoing is crucial for minimizing disruptions and delays.

The fourth challenge is rapid online deployment capability. The efficiency of model updates directly impacts business benefits. The fast, frequent, and consistent distribution of gigabyte-scale models to the online environment places high demands on the concurrent IO throughput of the storage and on streamlined supporting infrastructure. The ability to distribute models quickly and consistently is essential for successful online deployment.

So let's look at data interoperability. Throughout the entire workflow, large-model data flows through different stages, from preprocessing to model packaging and distribution. How can data flow efficiently between different platforms? The key lies in storage that supports multi-protocol access, allowing data sharing among different access methods while storing only a single copy of the data.

The second topic is training acceleration. For this, we built a three-level cache. The first-level cache is the computing-side cache. It uses the local idle memory and disk resources of the computing nodes to deploy an acceleration service, and it caches the recently accessed data of the business. The first-level cache includes two parts: a memory-based metadata cache and a memory- or disk-based data cache. The metadata cache includes an inode cache and a dentry cache, which can significantly reduce metadata queries during file lookup and open operations. The data cache mainly caches file data blocks. The first-level cache has good performance but is limited by the local node's storage capacity; it can provide local cache acceleration for tens of millions of metadata entries and terabytes of data.

In terms of cache eviction strategies, the traditional LRU algorithm, which suits the random hotspot accesses of typical business scenarios, is not suitable for large-model training. Data reading in large-model training follows a regular pattern: the entire dataset is read once before the next round of access begins. When the cache space is insufficient, LRU preferentially evicts the data with the earliest access time. Due to the massive size of the data in large models, if the cache space is smaller than the training data size, cache breakdown can occur. This means that in the second half of an epoch, the data cached in the first half may be evicted, while this data is precisely what the subsequent iterations will need. This cycle leads to no cache hits in the later stages, severely limiting the effectiveness of cache acceleration for large models and even causing negative effects, as the simulation below shows.
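A quick simulation makes this failure mode concrete. Replaying the once-per-epoch read pattern against a plain LRU cache that holds 80% of the dataset yields no hits at all for in-order re-reads and only partial hits for shuffled ones; the sizes are arbitrary and the code is an illustration, not a benchmark of CubeFS.

```python
from collections import OrderedDict
import random

def lru_hit_rate(n_items, cache_size, epochs=5, shuffle=True):
    """Replay the once-per-epoch read pattern against a plain LRU cache."""
    cache, hits, reads = OrderedDict(), 0, 0
    order = list(range(n_items))
    for _ in range(epochs):
        if shuffle:
            random.shuffle(order)              # a new random order every epoch
        for key in order:
            reads += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)         # refresh recency on a hit
            else:
                cache[key] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict the least recently used
    return hits / reads

# The cache holds 80% of the dataset, yet LRU recovers few of the re-reads:
print(lru_hit_rate(10_000, 8_000, shuffle=False))  # 0.0: cyclic re-reads always miss
print(lru_hit_rate(10_000, 8_000, shuffle=True))   # partial hits only
```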
To address this reading characteristic of large models, CubeFS provides two cache strategies. One strategy is time-to-live (TTL) caching. TTL performs data eviction based on expiration time, maximizing the cache hit rate during acceleration: the data remains in the cache until it reaches its TTL, ensuring cache hits across multiple iterations. These strategies can be configured appropriately based on the actual scenarios and business needs.

So now we know that the first level has its limitations, and that is why we developed the Level 2 cache. The Level 2 cache is an independent distributed cache system designed to be deployed on the business side, particularly in public cloud environments. From the diagram, we can see that it supports hybrid deployment across multiple environments. Data is divided into one-megabyte chunks and hashed into a slot array based on consistent hashing. The slots are assigned to flash groups, which can have one or multiple replicas, enabling deployment across multiple availability zones. Additionally, the client side constructs its own slot routing based on latency, achieving proximity-based access, as the routing sketch below illustrates. The Level 2 cache can be deployed as disk-based or memory-based. It can be distributed on the application side or deployed independently, complementing the client-cache capabilities and addressing the capacity limitations and scalability issues of the Level 1 cache. Compared to the Level 1 cache, accessing the Level 2 cache requires network communication. However, it offers larger capacity and higher throughput, and it can dynamically scale in and out. The Level 2 cache is particularly suitable for accelerating data in large-model training scenarios ranging from hundreds of terabytes to petabytes.

Now the Level 3 cache. The Level 2 cache effectively utilizes client-side resources, but those resources may be limited, and in a backend system that uses erasure coding, the cost is low but performance may be a bottleneck. In such cases, there is a need for an acceleration system with a certain amount of storage capacity that is also suitable for future architecture evolution. Therefore, the multi-replica subsystem is equipped with caching functionality, which will be further upgraded in the hybrid cloud scenario discussed later. Additionally, the Level 3 cache supports a large-capacity preheating feature, allowing data to be loaded in advance to meet the acceleration requirements of the business.

The overall layout of the three-level cache is shown in the diagram. Level 1 and Level 2 are suitable as cache support for the public cloud, while Level 3 is ideal as an acceleration cache for the erasure-coded cold storage system. Each level also supports regional affinity for caching. By using the three levels in combination, the requirements of large-model training in scenarios ranging from public cloud to private cloud can be addressed effectively. The combination of multi-level cache acceleration and data prefetching can significantly improve data access performance, providing read throughput from hundreds of gigabytes to terabytes per second for the training framework. This meets the IO bandwidth requirements of parallel training of large models with thousands of GPUs.
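The slot routing of the Level 2 cache described above might be sketched as follows. The slot count, hash function, chunk size, and flash-group names are illustrative stand-ins rather than CubeFS's actual implementation; the idea is simply that every chunk hashes to a stable slot, each slot belongs to a flash group, and any client can compute the owner locally.

```python
import bisect
import hashlib

SLOTS = 1024  # fixed-size slot ring (illustrative)

def make_ring(flash_groups):
    """Spread the slots evenly across the flash groups."""
    ring = []                                    # sorted (slot, group) pairs
    for i, group in enumerate(flash_groups):
        for j in range(SLOTS // len(flash_groups)):
            ring.append((j * len(flash_groups) + i, group))
    ring.sort()
    return ring

def locate(ring, volume, offset, chunk_size=1 << 20):
    """Map a 1 MiB chunk of a file to the flash group that caches it."""
    key = f"{volume}:{offset // chunk_size}".encode()
    slot = int.from_bytes(hashlib.md5(key).digest()[:4], "big") % SLOTS
    i = bisect.bisect_left(ring, (slot,))        # first entry at or after slot
    return ring[i % len(ring)][1]                # wrap around the ring

ring = make_ring(["flash-group-az1", "flash-group-az2", "flash-group-az3"])
print(locate(ring, "vol-train", offset=5 << 20))  # always the same stable group
```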
Moreover, the design of the Level 1 cache and the Level 2 cache is modular, enabling businesses to deploy them as needed to meet the requirements of their scenarios. For language models with data sizes in the range of several terabytes, which require small-batch reads with strict latency requirements, the Level 1 cache is suitable. On the other hand, for image models, which typically have larger data volumes and limited local storage space on the computing nodes, caching data in a Level 2 cache backed by high-performance SSDs is more appropriate.

Beyond the aforementioned cache acceleration, we have also started investing in hardware-level performance exploration. CubeFS utilizes GPU Direct Storage technology, which enables the direct association of GPU memory with CubeFS's remote storage using direct memory access. This allows data transfers between the GPU and CubeFS to bypass the CPU, accelerating the data transfer process. This direct transfer reduces the wall-clock time of IO from the client side to the storage side. By leveraging GPU Direct Storage, the GPU can directly read and write the data generated by machine learning from the storage device. This saves time and bandwidth while providing higher data throughput.

The above is some of the work we have done in the field of AI. In addition, there are ongoing architecture adjustments in CubeFS, with hybrid cloud storage being a key focus of our future work. Let me provide a brief introduction to this aspect. Hybrid cloud storage refers to the combination and integration of on-premises storage infrastructure with cloud-based storage services. It aims to leverage the benefits of both environments, allowing organizations to store and manage data across multiple locations, including their own data centers and public cloud platforms.

The hybrid cloud business represents a leap for CubeFS in terms of data storage. Initially, we considered hybrid cloud storage to address the cost issues of CubeFS, aiming to achieve cost control through data tiering. However, we discovered that leveraging the lifecycle concept of S3 storage enables better data mobility: mobility can occur within a private cloud or a public cloud, within or across availability zones, and even between public and private clouds. In more detail, mobility can span different storage systems. By incorporating various backend systems within a unified namespace, with metadata management, an acceleration system, and flexible support for different storage subsystems, we can effectively meet the diverse data storage needs of different business scenarios. The unified namespace enables directories to correspond to different storage systems, providing greater convenience in meeting various business requirements. As seen in the diagram, the client-side mount point corresponds to object storage, HDFS storage, and a regular POSIX interface backed by internal storage. To achieve this capability, the client side needs to interface with these different backends.

The lifecycle is crucial for the overall data flow. For volume-level operations, it is important to associate volume data with lifecycle policies and the corresponding resource strategies. The foundation of the lifecycle is resource management by the master node and the scheduling of migration tasks. The metanode, which handles metadata, needs to support metadata changes during data migration without affecting regular data reading. The workers, on the other hand, can focus on their own migration tasks, which keeps the process simple for them.
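Since the lifecycle model mentioned above follows S3 semantics, a rule of this kind can be expressed with the standard S3 API through boto3. The endpoint, bucket name, prefix, storage classes, and day thresholds below are all invented for this sketch; they only illustrate the shape of a tiering-plus-expiration policy, not CubeFS's exact configuration.

```python
import boto3

# Connect to an S3-compatible object gateway; the endpoint is a placeholder.
s3 = boto3.client("s3", endpoint_url="http://object-gateway.example.com")

# Tier aging training data to a cheaper storage class, then expire it.
s3.put_bucket_lifecycle_configuration(
    Bucket="training-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-datasets",
                "Filter": {"Prefix": "datasets/"},
                "Status": "Enabled",
                # After 30 days, move objects to an infrequent-access tier.
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                # After a year, delete them entirely.
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```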
The unified cache acceleration ensures timely acceleration of data from different backend storage systems. In line with the concept of the replica cache mentioned earlier, this approach enables one piece of data to be stored in multiple storage systems. However, it is important to consider the impact of migration on the cache, as well as the impact on data consistency when users modify the data. The concepts and content above, including hybrid cloud storage and some of the cache acceleration features, are still under development and still need to be enhanced. Some of these features will be released in the future, and some are already on the open-source branches.

Thank you all. It's nice to be here today. Welcome everyone to visit our website and join our community. Thank you.