Hello, everyone. My name is Myung Jin. Thanks for coming. I'm really excited to be here to give a talk on the work I have been doing for the last two years. Today I'm going to talk about how to adopt emerging federated learning technologies. We are living in a very exciting time because of generative AI models such as GPT or, recently, Gemini, but today I will touch on a slightly different topic called federated learning. For those who may not be familiar with it, I will first briefly introduce federated learning, then shift gears to the challenges we face in adopting new federated learning practices and approaches. Then I will present some of the approaches we took and finish my talk. Okay. So what is federated learning? Federated learning is a machine learning technique that builds a model collaboratively with distributed data while preserving data privacy. You may be very familiar with centralized learning: all data is shipped to the cloud or some private data center, which then uses a lot of resources for training. As opposed to that model, federated learning involves devices that are distributed across many different places. Without sharing the data or shipping it to a central location, each participant does the training with its local data. Okay, this is a very different model, so you may wonder: what is the use case for federated learning? The canonical use case you might be familiar with is next-word prediction: when you type on your mobile device's virtual keyboard, the keyboard recommends the next word.
That recommendation is driven by a machine learning model trained through federated learning. Let me walk very briefly through how federated learning takes place. Say we have a bunch of participants, which are devices, and each participant maintains some private, local data. Using this local data, each participant does the training and builds a local model. These local models are then shared with a central node called the aggregation node or parameter server. The parameter server aggregates the individual local models and builds a global model. The global model is shared again with the participants, each participant does further local training on top of it, and this process repeats until certain criteria are met, such as reaching a target accuracy or completing a certain number of rounds. Here is a slightly deeper look at one of the most popular algorithms, called federated averaging (FedAvg). On the client (participant) side, local training takes place using techniques such as stochastic gradient descent; once training is over, the weights are sent to the server. The server first selects a certain number of clients at random, then receives the weights from each of the selected clients. In addition, it uses the number of samples (data points) at each client to do weighted averaging: w_{t+1} = Σ_k (n_k / m_t) · w_{t+1}^k, where n_k is the number of samples used by participant k and m_t is the total number of samples across all participants. With that weighted averaging, w_{t+1} is the global model that can be used for the next round.
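To make the round structure concrete, here is a minimal FedAvg sketch in plain NumPy, with a linear least-squares model standing in for the local task. This is an illustration of the algorithm just described, not FLAME code; all function names here are mine.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=1):
    """One participant's local training: plain gradient descent on a
    linear least-squares model, standing in for SGD on a real model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w, len(y)  # local weights and n_k, the local sample count

def fedavg_round(global_w, clients):
    """One FedAvg round: every client trains from the current global model,
    then the server computes w_{t+1} = sum_k (n_k / m_t) * w_k."""
    results = [local_update(global_w, X, y) for X, y in clients]
    m_t = sum(n_k for _, n_k in results)  # total samples this round
    return sum((n_k / m_t) * w_k for w_k, n_k in results)
```

For brevity this round uses every client; the algorithm described in the talk samples a random subset of clients each round before averaging.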
So why do we need federated learning, beyond the next-word prediction example I already mentioned? One motivation is privacy and, related to it, data sovereignty: in some cases, for example when dealing with medical records, we don't want to share the data with other hospitals or users. Second, it may be very expensive to send the data to a central location. Third, even if neither of these is an issue, organizations still face the data silo problem: even within a single organization, multiple teams may not be able to easily share data because of internal policies and so on, even if the data already lives in the cloud. To mitigate these challenges, federated learning may be a good fit. Many different types of federated learning have appeared since its inception: cross-device FL, cross-silo FL, vertical FL, synchronous FL, asynchronous FL. A very large number of approaches exist today, and more and more are being proposed. So one question we want to ask, as ML practitioners and especially federated learning practitioners, is: how do you choose among these and experiment with them quickly? To facilitate this experimentation with different federated learning technologies, we started a project called FLAME, an open-source federated learning project led by Cisco Research, but a truly open-source project. If you visit our repository through this QR code or URL, you can find more details.
I already alluded to the challenge: today's conventional FL has a one-size-fits-all problem. Say we want to train a model with a simple two-tier topology. At small scale it may be good enough, but as the number of participants increases, it may take longer for the model to train and converge. It may also be hard to personalize a model: for example, if I want to maintain one model for participants in Europe and another for participants in the US, that is not straightforward with this simple architecture. Instead you may want a hierarchical topology with intermediate aggregators, each maintaining a model for a particular group of users, while the global aggregator builds the global model. We also ran experiments under various conditions, such as stragglers, failing clients, or failing aggregators, across different schemes. I will not go over the details of the individual schemes (we tested classical FL, hierarchical FL, and hybrid FL), but the one key takeaway from this slide is that depending on the experimental conditions, each approach exhibits different performance: time-to-accuracy may differ, and the accuracy itself may differ across approaches. So what are the challenges? We have many potential topologies, which involve complex and varied training approaches, but existing frameworks are not flexible or easily extensible enough to support them. Another problem is that, unlike centralized learning, federated learning raises deployment questions: how to deploy, where to deploy, and which components to deploy.
We want to mitigate these challenges by introducing FLAME. FLAME's ultimate goal is to empower ML engineers by letting them focus on what they do best, which is model development, while FLAME handles the monitoring functionality, multi-cloud support, fault tolerance, and so forth. We already support various algorithms and topologies: the left-hand side shows the algorithms we have implemented in our framework, and the right-hand side shows the different topologies a user can play with. These are all available out of the box, but most importantly, FLAME exposes APIs and programming paradigms so that users can extend it easily. We published our work this year at the ACM Symposium on Cloud Computing, so you can search the internet and find our work in more detail; please read our paper. Now let me shift gears toward more details. In FLAME, we want to decouple the ML code from the topology description in order to simplify the workload specification. The user first thinks about what topology they want, which can be expressed as a topology abstraction graph, which we call a TAG. Our control plane takes the TAG and expands it into the real topology and workers based on the specification. If the user wants to change the topology, they only need to update the TAG. Of course, if we introduce a new component, the new component will be associated with some code.
This enables online updates and fast turnaround, because the user doesn't need to develop any custom code at all. A TAG is represented with two building blocks: roles and channels. A role defines the ML training behavior and is therefore associated with code. A channel can be thought of as an abstraction for communication; it has a specific attribute called groupBy, which allows clustering workers to support different topologies. I will not go over all the details, but I will show how these features are used with a small toy example. Say I want to build a hierarchical topology with two groups. It can be represented with this TAG. There are a bunch of things here, but what you want to look at is the dataset groups: in this example we have a west group and an east group, each with two datasets, (A, B) and (C, D). We specify this grouping using the group association and the groupBy attribute. Since we have four datasets, we need four workers, i.e., four trainers, and each trainer inherits its group information. We have two groups, so we create two intermediate aggregators. At the top we have only one group, the default group, and therefore we create only one global aggregator. This is a very simple example, but what if you want to add more groups? All you need to do is introduce a new group and associate the new group with new datasets.
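As a rough illustration of what such a specification looks like, here is a minimal sketch of the two-group example as plain Python data, together with the expansion rule the control plane applies. The field names (`groupBy`, `isDataConsumer`, and so on) are illustrative only, not the exact FLAME schema; see the repository for the real format.

```python
# A hypothetical, simplified TAG for the two-group hierarchical example.
# Field names are illustrative and NOT the exact FLAME schema.
tag = {
    "roles": [
        {"name": "trainer", "isDataConsumer": True},
        {"name": "middle-aggregator"},
        {"name": "top-aggregator"},
    ],
    "channels": [
        {   # trainers talk only to the middle aggregator of their own group
            "name": "param-channel",
            "pair": ["trainer", "middle-aggregator"],
            "groupBy": ["west", "east"],
        },
        {   # middle aggregators talk to the single global aggregator
            "name": "global-channel",
            "pair": ["middle-aggregator", "top-aggregator"],
            "groupBy": ["default"],
        },
    ],
    # four datasets, two per group
    "datasets": {"west": ["A", "B"], "east": ["C", "D"]},
}

def expand(tag):
    """Sketch of the expansion rule: one trainer per dataset, one middle
    aggregator per group, one global aggregator for the default group."""
    trainers = sum(len(ds) for ds in tag["datasets"].values())
    middle_aggs = len(tag["datasets"])
    return trainers, middle_aggs, 1
```

Adding a third group is then just another key in `datasets` plus an entry in the channel's `groupBy` list; for example, a hypothetical `"south": ["E", "F"]` entry would expand to six trainers and three middle aggregators, with no code changes.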
Then our system will automatically re-expand it into the new topology. In addition to the example I showed, we can easily express different topologies: a distributed topology can be expressed with a simple self-loop edge, and on the right-most side, we can combine the two-tier model with a self-loop to create a hybrid topology as well. To quickly summarize: FLAME's abstraction enables easy topology updates with only small code changes, and because of the communication abstraction, we can use different communication mechanisms without disrupting it. I hope these building blocks will help enable transparent ML operations for federated learning. Now let me shift gears toward how a user actually uses our system. Say a user wants to build a simple two-tier topology, basically expressed as a dumbbell topology. Our library (SDK) provides base classes such as the aggregator and trainer classes, which implement the behaviors required for an aggregator or a trainer. The aggregator should be able to aggregate local models and distribute the global model; the trainer should be able to upload its local model and fetch the global model. All of this is supported out of the box, and we also implemented other approaches such as the hybrid and hierarchical ones. What the user needs to do is inherit from these base classes and implement their own training logic: for example, how to load data, how to train the model, and how to evaluate it. These are the abstract methods that users are required to implement.
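The inheritance pattern just described can be sketched as follows. The base class below is a simplified stand-in, not FLAME's actual SDK class (which additionally handles channels, model exchange, and the run loop); only the three abstract methods mirror what the talk describes, and the subclass is hypothetical user code.

```python
from abc import ABC, abstractmethod
import random

class Trainer(ABC):
    """Toy stand-in for FLAME's trainer base class. The real base class
    also manages channels, model upload/fetch, and the training loop."""
    @abstractmethod
    def load_data(self): ...
    @abstractmethod
    def train(self): ...
    @abstractmethod
    def evaluate(self): ...

class MeanTrainer(Trainer):
    """Hypothetical user code: the 'model' is just the mean of local data."""
    def load_data(self):
        # stand-in for fetching the local dataset
        self.data = [random.gauss(5.0, 1.0) for _ in range(100)]

    def train(self):
        # stand-in for local SGD: fit the sample mean
        self.weight = sum(self.data) / len(self.data)

    def evaluate(self):
        return abs(self.weight - 5.0)  # error vs. the true mean
```

In the real framework, the surrounding machinery would then upload `self.weight` (the local model) over a channel and fetch back the aggregated global model; the user never writes that plumbing.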
This is then packaged as a job through the CLI tool or, recently, through the web-based GUI application we released in our repository. Once all these specifications are captured as a job, it can be submitted to the control plane, and the control plane takes care of the rest. When the user submits a job, the controller saves some state in a database and interacts with a cluster manager such as Kubernetes. The cluster manager instantiates the workers using our container image, which contains an agent called the flamelet. The flamelet connects to the control plane, fetches the details about the job, and launches the training app in the Kubernetes cluster. Each training app has its own specification: in this particular example, one training app has the role of trainer and the other has the role of aggregator. With this specification, the workers form the two-tier topology, the federated learning procedure is executed, and a model is generated, which we can save into a model registry such as MLflow. Now I will do a quick UI-based demo. As I mentioned, we released the system components, library components, and UI components in a monorepo, so users can use any of these components separately. If you want to quickly try FLAME on your local machine, you can use the library alone.
If you want to see how the full system works, you can run the system components using minikube. We also have a single-machine setting called fiab (flame-in-a-box), which you can use as well. And we have a UI: if you have a Kubernetes cluster, you can use the UI too. So let me quickly play the demo. Hello and welcome to the Flame dashboard application demo. The first section I want to present is the design section. The user can create a new design by clicking on the new button; they have to input a unique ID, a name, and an optional description. After the design is saved, the user can start to build the TAG. There are two ways to build the TAG. Building it from scratch, the user can add roles, connect those roles with channels, and see the details of each entity. The user can set a role as the data consumer. Furthermore, adding more groups to a channel will lead to multiple groups for each corresponding role. Another way to create a TAG is by using a predefined TAG template. After the user uploads a valid TAG schema, they can add the code files for each role. So I'll add one code file for the trainer, one for the middle aggregator, and one for the top aggregator. After the user uploads the code files, they can see how the TAG expands with some simulated workers. After all validations are complete, the user can save the changes by clicking on the save button. Once the design (TAG) is saved and valid, the user can create a job. Another important input for a job is datasets. Flame manages datasets without leaking privacy: users register metadata for their datasets, and Flame makes sure that the user's workers only access the dataset during training. Now let's move on and create a job. The user has to set a name and select the design; I'll go ahead with the one that I've just created.
For the backend there are two options, P2P and MQTT, plus a timeout in seconds. On the next step, the user selects a dataset for each role marked as data consumer. Moving on to the last step: if the user has a pre-trained model, they can input its name and version number. In the optimizer and selector sections, the user can choose from a list of options, each coming with a different set of arguments. Lastly, the hyperparameter section has some default parameters, and the user can add custom ones. Once everything is filled in, the save button can be clicked. After the job is created, we can start it by clicking on the start job option. After the job is completed, the user can check its results by clicking on its name. So I'll go to a completed job that was created with the same TAG I used for the previous one. Here is the real topology expanded from the hierarchical FL TAG. To see more details about the job, the user can click on the graph icon. The first section shows the timeline of each worker's metrics, for a better overview of how the job was executed. By clicking on each worker, the user can see more details about the different metrics and their values. And by clicking on the top aggregator, besides the metrics and hyperparameters, the user can check and download the model that resulted from the execution of this job. That would be all; thank you. So that's the short demo. We are still actively developing the UI, and we are going to add more of the features that exist in the CLI, so stay tuned; we will release them pretty soon, and it is all in our open-source code. Here is a list of the awesome people who contributed to this project.
Some backend developers and a couple of interns contributed to this project over the last two years, shaped its current architecture, and built a lot of the features as well. So, to finalize my talk: FLAME is a community-driven open-source project for federated learning. It is flexible, configurable, and, most importantly, easily extensible. As opposed to other frameworks that expose only low-level APIs, we accept those but also provide a well-defined structure for software engineers to develop different topologies or approaches. We think FLAME can facilitate easy adoption of fast-evolving, state-of-the-art FL technologies. That is all; I'm happy to take any questions you may have. Yes? [Audience: Are you aware of any real-life use cases that use federated learning? What are those use cases, and why is distributed learning more suitable there than centralized learning?] One of the canonical use cases, apart from mobile next-word prediction, is healthcare. Medical records are not easily shareable, and therefore individual hospitals do the training with their own local datasets. The problem with such an approach is that a model trained only on your local dataset is not easily generalizable: if you take that model and try to deploy it to other hospitals, it is very likely the model will not work well. That is the privacy-related case. In addition, even if hospitals put their data into the cloud, they are still reluctant to share it.
So even though the data is in the cloud, it is not shareable, and in such a case federated learning would be a good solution. [Audience: But doesn't that leave the problem that the data itself could leak into the parameters of the model? You don't release the data, but you do release the weights.] Yeah, absolutely. That is a different thread of research that the community has put a lot of effort into. Say you have only one user's data: if you release the model, the model effectively contains everything about that particular user's data, so the model may leak private information. That is an extreme case, but for situations like it, the research community has tried to apply differential privacy algorithms and so on. [Audience: Do you see any use cases in the IoT space where this could be a good fit?] I may not call it an IoT use case exactly, but think about a factory floor: a lot of sensors are deployed to monitor the production process, and predictive maintenance is an interesting area there. How to use sensory data across different factory floors, or across different factories, might be a good use case for federated learning too. [Audience: Do you see any industry using that? Last question.] I am not certain, but we are interested in this predictive-maintenance use case. Okay, if there are no further questions, thanks for attending. I hope you enjoy the rest of the event. Thank you.
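The differential-privacy mitigation mentioned in the Q&A is, at its core, per-example gradient clipping plus calibrated noise (the DP-SGD idea). Here is a minimal NumPy sketch of one such update; the parameter names are illustrative, and this is a general technique, not a FLAME feature.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, clip=1.0, noise_mult=1.0, lr=0.1, rng=None):
    """One DP-SGD-style update: clip each participant's gradient so no single
    user dominates the model, then add Gaussian noise before the step."""
    rng = rng or np.random.default_rng()
    # bound each gradient's L2 norm to `clip`
    clipped = [g * min(1.0, clip / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    # noise scaled to the clipping bound and the batch size
    noise = rng.normal(0.0, noise_mult * clip / len(clipped), size=avg.shape)
    return w - lr * (avg + noise)
```

Because every per-example gradient is clipped to norm at most `clip`, with the noise turned off the whole update can move the weights by at most `lr * clip`, which is what limits how much any single user's data can influence (and thus leak from) the released model.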