Hello, everyone. My name is Layne Peng, from VMware AI Labs. Today my colleague Fangchi Wang and I will introduce FATE-LLM, a project we work on in the Linux Foundation AI & Data community. It aims to empower large language models with federated learning, and it is built on a framework widely used in financial and telecommunications companies.

So first of all, I wonder how many people here have heard of federated learning before? Federated learning is generally considered to have been introduced by Google, in its blog post "Federated Learning: Collaborative Machine Learning without Centralized Training Data." It proposed a method to train a model on users' cell phones without collecting all the data from those phones in a central place. The method is not complex; on Wikipedia it is described in four steps. First, choose a model on the central server. Then, send the model to the clients as their initial model. Next, let each client train this model with local data on the client side. Lastly, after several epochs, each client has a locally trained model and sends it back to the central server for averaging. Then we repeat the second through fourth steps until we get a converged global model (a runnable sketch of this loop follows below).

This is the very basic flow. Of course, for privacy-preserving or non-IID cases, many different algorithms have been designed around this flow. The original setting was cross-device, but it obviously extends to different organizations within a large enterprise. And then you can envision a bigger scenario: how about different companies collaborating? For example, an internet company and a bank, an insurance company and a telecommunications carrier, and so on. Their data can complement each other for better modeling without data centralization, and privacy leakage can be prevented with differential privacy or MPC algorithms applied during federated learning.

Based on this, WeBank initiated the industrial federated learning framework FATE, initially for empowering bank models with internet company data collaboratively. It was developed and driven by the community and then contributed to the Linux Foundation AI & Data. Now FATE is the biggest industrial-grade open source federated learning system in the Linux Foundation AI & Data: over 4,000 engineers have participated in its development, and it is used in many banks and financial companies in production today. VMware is also one of the key contributors to this framework. We are members of FATE's technical steering committee, maintainers of the FATE and FATE-LLM projects, and the initiators of KubeFATE, which makes FATE easier to manage and provision on Kubernetes, and of FedLCM, which builds on KubeFATE and provides federated learning lifecycle management.

And the last keyword in our title is LLM. I believe most people here have heard of LLMs before, because this is the hottest topic this year. This image, from a presentation on AI development, shows that ChatGPT is the fastest platform or software in history to reach 100 million users, much faster than Instagram, Facebook, Twitter, and other famous names. LLMs are amazing. Unlike other machine learning algorithms, which are designed for one purpose, the same foundation model can adapt to a wide range of downstream tasks. And if you push harder, increasing the training scale shows even more capabilities. We call these emergent abilities, and they leave much more room for imagination. However, training such a large model is not easy: it requires a large dataset, and of course it needs more resources.
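To make the four-step flow above concrete, here is a minimal, runnable sketch of federated averaging. This is illustrative pseudo-training, not FATE's actual code: local_train is a stand-in for real client-side SGD, and the server averages the returned weights as in the original FedAvg recipe.

```python
import numpy as np

def local_train(global_weights, local_data, lr=0.1):
    # Stand-in for step 3: real clients would run SGD on their local data.
    # Here we just simulate an update so the loop runs end to end.
    rng = np.random.default_rng(seed=len(local_data))
    fake_gradient = rng.normal(size=global_weights.shape)
    return global_weights - lr * fake_gradient

def federated_averaging(client_datasets, rounds=5, dim=10):
    global_weights = np.zeros(dim)          # step 1: initial model on the server
    for _ in range(rounds):
        # step 2: send the current global model to every client,
        # step 3: each client trains it locally and returns the result
        local_models = [local_train(global_weights.copy(), d) for d in client_datasets]
        # step 4: average the returned models, weighted by local dataset size
        sizes = [len(d) for d in client_datasets]
        global_weights = np.average(local_models, axis=0, weights=sizes)
    return global_weights                   # steps 2-4 repeat until convergence

print(federated_averaging([range(100), range(200), range(50)]))
```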
This table shows the data scale and the resources used to train mainstream LLMs, including open and closed-source ones. Some of them use several terabytes of tokens and thousands of GPUs or TPUs; that is crazy. On the other hand, while ever more data is needed to train a large model, high-quality public data is running out. According to the paper "Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning," we may run out of high-quality training data as early as next year. Moreover, as large models move closer to local applications, they increasingly require private data as the training dataset. So how do we protect privacy when we use private data to train a large model? Federated LLM was proposed for this demand, and this is why we developed the FATE-LLM project.

But it is not easy; we face several obvious challenges. The first challenge: as I described before, federated learning is a collaboration across clients or organizations that communicate over the web, and, as I said, training a large model involves a huge volume of data and parameters. So how do we exchange such large-scale data over the web during training and keep the process within an acceptable time? The second challenge: as the table a few slides back showed, training a large LLM requires huge computing resources. However, many federated learning use cases do not live in a data center; we may need to train the local model on edge devices or even IoT devices. How do we train or apply a large model on such resource-limited participants? It is not easy, but we overcame these challenges and built FATE-LLM. I will let my colleague, Fangchi Wang, introduce how we did it in the project.

All right, hi, everyone. So Layne just introduced the background of federated learning and large language models and why we are attempting to integrate federated learning into the training of these models. The challenges he mentioned are indeed the issues the industry is currently trying to address. In the FATE project, we have already implemented, and are in the process of implementing, solutions to these challenges. All these implementations go into one of FATE's sub-projects, called FATE-LLM.

So first, let's take a look at some of the solutions and approaches proposed in the industry to address these challenges. Let's start with the first challenge, which is that the model parameters transmitted during federated learning are too large. The most basic way to apply federated learning to large language models is for all participating parties to jointly train or fine-tune a large language model with the same architecture. For example, here we have three parties, and together they can fine-tune a LLaMA model with their own local data. To address the high communication cost, the industry has tried various methods, such as the approach described in the paper on the right, which involves freezing certain layers of the model and only training and updating the other layers. However, experimental results have shown that although the communication cost may be lower, it can significantly impact the final model's performance. Therefore, we need to explore other, more effective ways. And on the other hand, we all know that the commonly used method for fine-tuning large language models nowadays is what we call parameter-efficient fine-tuning, or PEFT.
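To see why PEFT is so attractive for federated settings, here is a quick back-of-the-envelope comparison of per-round upload sizes. The parameter counts and the fp16 assumption are illustrative figures picked for the sketch, not measurements from FATE-LLM.

```python
# Rough per-round upload size for one client, assuming fp16 (2 bytes/param).
def upload_gb(num_params, bytes_per_param=2):
    return num_params * bytes_per_param / 1024**3

full_model = 7e9      # full fine-tuning of a ~7B-parameter LLM (illustrative)
lora_adapter = 4e6    # a LoRA adapter, on the order of 0.1% of the model

print(f"full model:   {upload_gb(full_model):9.2f} GB per client per round")
print(f"LoRA adapter: {upload_gb(lora_adapter):9.4f} GB per client per round")
```

At roughly 13 GB per client per round versus well under 0.01 GB, it is clear why transmitting only the newly introduced parameters makes federated fine-tuning of LLMs practical over ordinary networks.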
So the core idea of PEFT is that during the training process we do not adjust all of the model's parameters; instead, we introduce a small number of new parameters and only update these parameters during training. Here are some typical approaches, such as the adapter mechanism. Below that, there is LoRA, a popular PEFT method that uses low-rank matrices to represent the updates to certain weight matrices in the self-attention module. And there are other methods, like prompt tuning and prefix tuning, which involve inserting prefixes at the model's input layer or at each transformer layer. These prefixes are trainable, and during fine-tuning only this part is trained and updated.

So naturally, in federated learning we can attempt to apply these PEFT approaches. This means each party applies PEFT for its local training, and after that we only need to transmit and aggregate the updated part. The proportion of transmitted parameters is generally around one percent, or one thousandth, or even less of the original model's parameter count. The industry has proposed solutions like FedAdapter and FedPrompt, which combine PEFT with federated learning, and there is also a paper that compares various PEFT methods in federated learning settings. Basically, when these methods are applied in federated learning, the transmission cost can be significantly reduced while the model's performance does not decline much, so essentially we can achieve a balance between cost and acceptable performance. Furthermore, this article also inspired our implementation of FATE-LLM in certain ways: we want to provide a solution, a framework, that allows users to easily choose fine-tuning approaches based on their requirements, and to configure and validate different methods very easily.

So this is the first, how to put it, paradigm we implemented in the current FATE-LLM. Basically, it is a homogeneous federated learning framework for LLMs based on PEFT. The general idea is what we saw in those papers earlier. In our code, we have a module called PELLM, which stands for parameter-efficient LLM. Within this module we can work with a range of pre-trained models from the Hugging Face ecosystem, and it also works with Hugging Face's PEFT library to use different PEFT methods such as LoRA (a sketch of this client-side setup follows below). After that, we can leverage FATE's existing horizontal federated neural network trainer to perform the federated training.

Here is a simple example of the overall process. There are three parties, and each of them conducts several rounds of local training using LoRA. Afterwards, the parameters of the LoRA part are aggregated and updated, and this updated set of global parameters is used for the next round of local training. The process iterates until certain termination conditions are met. The aggregation here can use secure aggregation methods based on different technologies, like secret sharing or differential privacy, which are commonly used in federated learning to protect privacy. Basically, by using these methods, we can ensure that no party can access the parameters of the others, while everyone can still obtain the aggregated result. The community also has some practical validations: based on this PELLM module, FATE-LLM can support federated fine-tuning of homogeneous LLMs with over 30 parties.
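As a rough illustration of the client-side setup that a PELLM-style module automates, here is what wrapping a Hugging Face model with the peft library's LoRA looks like. The model choice, the hyperparameters, and the state-dict filter are assumptions for demonstration, not FATE-LLM's exact code.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # any causal LM from the HF hub

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # typically well under 1% trainable

# In each federated round, only the LoRA weights would be sent for aggregation:
lora_state = {k: v for k, v in model.state_dict().items() if "lora_" in k}
```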
And the final model's performance is better than any individual party's local model fine-tuned with only its own data. OK, so this is how to efficiently and privately do federated model aggregation for large models with the same architecture.

Apart from that, Layne also mentioned a second challenge, which is about the large computational resources required for training large language models. You know, in federated learning, the local resources available to each participant are very different. In many scenarios these participants, or clients, can include edge devices, IoT devices, and branches of various sizes. Unlike data centers or clouds, they do not have large-scale computing clusters, and they can also have very different data volumes and distributions. This scenario is what we call collaborative federated learning between cloud, edge, and end devices. In such a scenario, participants may want to fine-tune models of different scales. So how can we apply federated learning in these so-called heterogeneous settings?

For this, the industry also has some solutions and ideas. There is a paper from MIT that proposes an offsite-tuning approach. Basically, with this approach we can establish a federation between large models and data. What does that mean? In a model training scenario, there are data holders and model holders. We all understand that the data holder does not want to send its data out for fine-tuning, so often the model needs to be brought to the data for local training. However, in some scenarios this can be challenging too, for reasons like limitations of the local hardware, or because the model holder itself does not want to send the model to anyone, as it wants to protect information like the model structure and weights. Given these requirements, the offsite-tuning approach proposes that the model holder provide some lightweight adapters and a lossy, compressed emulator to the data holder. On the data holder's side, they fine-tune the adapters with the assistance of the emulator, and in the end the trained adapters are combined back into the original large model. This way, the fine-tuning process is relatively resource-efficient, as it essentially trains a small model, and at the same time the original model remains protected.

So this is one approach, and there are other ways. When we think about collaboration between large and small models, we often think of knowledge distillation. There are various types of distillation: response-based, feature-based, and relation-based. In this context, there are also explorations into applying knowledge distillation in federated learning for heterogeneous large language models (a sketch of the basic loss follows below). For instance, there is distillation initiated from the participant or client side. Basically, each party performs knowledge distillation from its own, different model to obtain a student model with a shared architecture. Each party then trains its respective heterogeneous model and the student model at the same time, using both local private data and knowledge distillation. The knowledge distillation is depicted in this figure beside the KD mark; it is mutual, a bidirectional process. Then, during the model aggregation phase of federated learning, we only aggregate and update the student models.
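To make the mechanics concrete, here is a minimal sketch of a response-based distillation loss: a softened KL term that pulls the student toward the teacher's output distribution, combined with ordinary cross-entropy on local labels. The temperature and weighting are conventional defaults, not values from the papers mentioned above; the bidirectional variant would compute an analogous loss with the teacher and student roles swapped.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened distribution (scaled by T^2,
    # the usual correction for the temperature's effect on gradient magnitude).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard supervised loss on the party's local private data.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class problem:
student_logits = torch.randn(4, 10)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
print(distillation_loss(student_logits, teacher_logits, labels))
```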
So by iterating this process, we allow the participants to continue benefiting from the training of others through federated learning, even though they have heterogeneous local models. Another research approach is a kind of server-side distillation, where the knowledge distillation is performed on the aggregation side. During federated learning, all participants still train and fine-tune the same smaller model, but this model's initial state is distilled from a pre-trained, stronger large model. As mentioned earlier, especially in the so-called cross-device federated learning scenarios, participants may not be able to train large-scale models due to reasons like resource limitations; they can only train and use relatively small models. But if this small model can receive assistance, via knowledge distillation, from a large model, it will help the performance of the final model. Google has a paper about this kind of large-model-guiding-small-model approach.

So those are some of the solutions for heterogeneous federated learning and collaboration between large and small models. Of course, these are relatively cutting-edge research areas, and related topics like privacy protection, efficiency, and more are still continually being investigated and evaluated by the industry.

The FATE-LLM project has also introduced another paradigm, called federated transfer learning for large language models, or FTL-LLM, to address this kind of heterogeneity and collaboration. The "transfer" here refers to the transfer of knowledge. Through FTL-LLM, we aim to provide a way to improve both large and small models simultaneously in federated learning. In terms of implementation, in the case of offsite tuning, FATE-LLM extends the original paper to accommodate federated learning scenarios with multiple participants. As shown here, the FATE aggregator, or server, first obtains the model from the Hugging Face interface. Then, following the offsite-tuning approach, it extracts the emulator and the adapters (a rough sketch of this split follows below). These elements are then distributed to the multiple participants, or clients, and each participant fine-tunes the adapters in a federated learning fashion, including local training and global aggregation. The aggregation process still relies on FATE's secure aggregation protocols to further protect the privacy of all the participants. This entire implementation is encapsulated within a specific implementation class called OffsiteTuningTrainer. There is also practical validation in the community: through this approach, the model's performance sees significant improvements, while the resource requirements for participants are lower compared to training the original model. This FTL-LLM paradigm and its implementation were released in the latest FATE-LLM version 1.3, which came out in early September this year.

So far, we've seen how we can address the challenges of large language models in federated learning and the specific implementations within the FATE-LLM project. From an engineering perspective, the FATE-LLM project plans to divide these implementations into different components, including a communication-efficient hub and a model hub; additionally, the implementations of other mechanisms, including those for privacy protection, are hosted in the privacy hub.
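Returning to the split step mentioned above, here is a rough sketch of how a server might carve trainable adapters and a layer-dropped emulator out of a transformer's block stack. The function and its parameters are hypothetical; the actual Offsite-Tuning recipe additionally distills the compressed emulator so it mimics the dropped middle layers, which this sketch omits.

```python
import torch.nn as nn

def split_for_offsite_tuning(layers: nn.ModuleList, n_adapter=2, keep_every=2):
    """Return (bottom adapter, emulator, top adapter) from a stack of blocks."""
    bottom = layers[:n_adapter]        # trainable adapter: shallowest layers
    top = layers[-n_adapter:]          # trainable adapter: deepest layers
    middle = layers[n_adapter:-n_adapter]
    # Lossy compression of the frozen middle: uniformly keep every k-th block.
    emulator = middle[::keep_every]
    for block in emulator:
        for p in block.parameters():
            p.requires_grad = False    # the emulator stays frozen on the client
    return bottom, emulator, top

# Toy usage on a stack of 12 "transformer blocks":
blocks = nn.ModuleList(nn.Linear(16, 16) for _ in range(12))
bottom, emulator, top = split_for_offsite_tuning(blocks)
print(len(bottom), len(emulator), len(top))   # prints: 2 4 2
```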
And as mentioned, FATE-LLM itself is a sub-project of the broader FATE ecosystem. So what does it look like to run FATE-LLM within the overall FATE architecture? Let's take a deeper look at the overall system design here. FATE itself operates as a system with an API and a scheduling service called FATE-Flow at the top. The FATE-LLM modules, as we see here, are the components below, orchestrated by FATE-Flow. Further down the stack, there are optional optimization and acceleration libraries such as DeepSpeed.

Moving on, we can introduce KubeFATE. The KubeFATE project organizes and deploys FATE systems in a containerized, cloud-native manner. It provides functionalities including the management of the underlying distributed computing engine and the hardware, and naturally, many of these functionalities are implemented on top of Kubernetes and cloud-native features. So now we have a system for running FATE-LLM. However, what we see here is just one participant, right? We need at least a second one, which is essentially another FATE-LLM system. In the middle here is the exchange component, OSX, which coordinates the tasks between the two parties. And below, there is the FedLCM service, used for deploying the federated learning systems of the multiple parties within a so-called federation. This includes deployment, operation, interconnection, and the management of tasks, data, models, and more.

So this is the typical deployment setup for FATE-LLM. Of course, FATE-LLM can also run directly on physical or virtual machines, but if you want to manage and deploy it in a cloud-native fashion and enjoy all the benefits that come along, we introduce KubeFATE and FedLCM. Regarding KubeFATE and so-called cloud-native federated learning, we've covered this at previous KubeCon events and other events, so we won't take a deeper dive here today. Just know that by using KubeFATE and FedLCM, you can quickly deploy and manage production-ready FATE systems to start your federated learning jobs.

And this is the FATE-LLM project roadmap. It is an active project that is continually evolving; there have been a couple of releases since the initial release in the first half of this year. As mentioned earlier, we recently released version 1.3 in early September, in which we introduced the FTL-LLM paradigm and the offsite-tuning framework. Meanwhile, the FATE project has another development branch known as version 2.0, dedicated to another crucial topic in federated learning: interoperability among federated learning systems. This involves a lot of refactoring work, so there is a separate 2.0 branch in the community, and the plan is to merge FATE-LLM, which is currently based on version 1.x, into the 2.0 release by the end of this year.

Okay, let's wrap up and provide a summary of our discussion today and what's next. We've explored federated learning and its value in the context of large language model applications, and we talked about the challenges we face when applying federated learning. Within the FATE open source project, the community has collaborated, and is continuing to collaborate, with the industry to introduce suitable paradigms for training large language models that address these challenges. Still, it is worth noting again that the research and application of these so-called federated large models are in a relatively early stage. So here we've highlighted some topics related to this and to the FATE-LLM project. One important aspect is the integration with DeepSpeed.
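As background for this point, here is roughly what handing a model to DeepSpeed looks like in generic usage. This is a hedged sketch of the standard DeepSpeed API with illustrative config values, not FATE-LLM's actual integration code, and it would normally be launched across nodes and GPUs via the deepspeed launcher.

```python
import deepspeed
import torch.nn as nn

model = nn.Linear(10, 10)   # stand-in for the real (PEFT-wrapped) LLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 2},   # ZeRO-2: shard optimizer state + grads
}

# Wraps the model (and optimizer) for multi-node, multi-GPU training.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)
```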
You know, FATE-LLM already supports multi-node, multi-GPU training using DeepSpeed for large models like ChatGLM and LLaMA, and we are currently investigating how to leverage native scheduling mechanisms in environments like Kubernetes to better integrate with DeepSpeed and other libraries. Another topic we discussed, and one I think most people are interested in, is privacy protection. We know that in federated learning the training data never leaves the local environment, right? But the final trained model might still carry information from the original data. Especially for generative models, could there be a way for the model to output this original data in some form? Currently there are no definitive answers. As far as we know, different paradigms have different privacy implications, and the community is actively exploring this area.

So again, FATE-LLM and the industry will continue to explore federated large models. This includes expanding support for even larger models, exploring better mechanisms and more effective paradigms for privacy protection, and achieving a balance between privacy, security, and efficiency. As an open source project in the LF AI & Data Foundation, we welcome everyone to follow the developments in this area, to participate in the project, and to contribute to the community. And I think that's basically all for our sharing. Thank you, everyone.