Welcome to this talk about federated learning, all in one box. This presentation was developed by my colleague Huamin Chen and myself, Ricardo Noriega. We both work in Red Hat's Office of the CTO, on the Edge Computing team, as part of the Emerging Technologies group. In this presentation, we are going to showcase a machine learning architecture called federated learning, and it will be running in a cloud-native environment.

First of all, what is federated learning? What does it mean? Federated learning is a distributed machine learning process in which each participant node, or party, retains its data locally and interacts with the other participants via a learning protocol. The main drivers behind federated learning are data privacy and regulatory requirements, as well as avoiding the risk and cost of moving data to one central location.

For this project, we have used the IBM Federated Learning Library, which provides a basic fabric for federated learning use cases. It is not dependent on any specific machine learning framework and supports different learning topologies and protocols. For example, it supports deep neural networks as well as classic machine learning techniques like linear regression, k-means, etc.

In order to understand the benefits of a federated learning architecture, we need to compare it to a more classical centralized learning architecture. In a centralized learning scenario, each entity or party sends its data to where the machine learning model is running, so the data can be processed there. Federated learning is an approach to conduct machine learning without centralizing training data in a single place, for the reasons I mentioned before: privacy, confidentiality, or data volume. However, solving federated machine learning problems also raises some issues. These can include setting up the communication infrastructure between parties, coordinating the learning process, integrating party results, understanding the characteristics of the training data sets of the different parties, etc. So as you can see in the slide, while the centralized learning architecture shows the different parties sending data to the cloud, or wherever the machine learning model is running, the federated learning example shows how the data is processed at each party and only the already trained machine learning model is sent out to the central location, where the aggregator usually runs, and then it will run its fusion algorithm to combine the results.

I mentioned before that for this project we have used a federated learning library developed by IBM. In this slide, I'm going to show you what its architecture looks like. In the center of the slide, we have the aggregator. The aggregator is in charge of running the fusion algorithm. A fusion algorithm queries the registered parties to carry out the federated learning training process. The queries sent vary according to the model or algorithm type. In return, parties send their replies as a model update object, and these model updates are then aggregated according to the specified fusion algorithm. Each party holds its own dataset, which is kept to itself and used to answer the queries received from the aggregator. Because each party may store its data in a different format, the IBM Federated Learning Library offers an abstraction module called the data handler. This module allows for custom implementations to retrieve the data from each of the participating parties.
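To make the data handler idea a bit more concrete, here is a minimal Python sketch of what a custom handler could look like. This is not the library's actual API; the base class, method names, and the CSV example are illustrative assumptions, but the shape mirrors the abstraction just described: each party plugs in its own code to load and pre-process whatever local data it has.

```python
# Illustrative sketch only -- class and method names are assumptions,
# not the exact IBM Federated Learning Library API.
from abc import ABC, abstractmethod
import csv


class DataHandler(ABC):
    """Abstraction each party implements to expose its local data."""

    @abstractmethod
    def get_data(self):
        """Return (x_train, y_train) loaded from the party's local store."""


class CsvDataHandler(DataHandler):
    """Example: a party whose data happens to live in a local CSV file."""

    def __init__(self, path):
        self.path = path

    def get_data(self):
        x_train, y_train = [], []
        with open(self.path, newline="") as f:
            for row in csv.reader(f):
                *features, label = row
                x_train.append([float(v) for v in features])
                y_train.append(float(label))
        return x_train, y_train


# The local training code on the party side would then call, for example:
#   x_train, y_train = CsvDataHandler("/data/party1.csv").get_data()
```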
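Similarly, to give an intuition for what the fusion algorithm does with the model updates, the toy example below averages the updates of two parties, weighted by their local dataset sizes, in the spirit of federated averaging. It is a self-contained sketch with made-up data, not the IBM library's implementation.

```python
# Toy illustration of a fusion step in the spirit of federated averaging.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])

def make_party(n):
    """A party's private dataset: features x and noisy labels y."""
    x = rng.normal(size=(n, 2))
    return x, x @ true_w + rng.normal(scale=0.1, size=n)

def local_update(weights, x, y, lr=0.1):
    """One local gradient step of linear regression on a party's own data."""
    grad = 2 * x.T @ (x @ weights - y) / len(y)
    return weights - lr * grad

def fuse(updates, sizes):
    """Fusion step: average party updates, weighted by local dataset size."""
    total = sum(sizes)
    return sum(w * (n / total) for w, n in zip(updates, sizes))

parties = [make_party(50), make_party(80)]   # the raw data never leaves each party
global_w = np.zeros(2)
for _ in range(20):
    updates = [local_update(global_w, x, y) for x, y in parties]  # local training
    global_w = fuse(updates, [len(y) for _, y in parties])        # aggregator side
print("fused model after 20 rounds:", global_w)  # should be close to true_w
```

Note that only the weight vectors cross the boundary between the parties and the aggregator; the training data itself stays local.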
A local training handler sits in each party to control the local training happening on the party side.

IBM Federated Learning is distributed as a library. If we want to run it in a cloud-native fashion, we need a platform that supports our edge computing use cases. We are presenting here a new Kubernetes distribution optimized for edge devices called MicroShift. MicroShift is a small-form-factor Kubernetes distribution that provides a minimal OpenShift experience. For those who don't know, OpenShift is Red Hat's Kubernetes distribution.

To give you a little bit of context, the edge devices deployed out in the field pose very different operational, environmental, and business challenges from those of cloud computing. These motivate different engineering trade-offs for Kubernetes at the far edge than for cloud or near-edge scenarios. MicroShift was designed to make frugal use of system resources like CPU, memory, network, and storage; to tolerate severe network constraints; to update securely, safely, and seamlessly; and to integrate cleanly with an edge-optimized operating system like Fedora IoT or RHEL for Edge, while providing a consistent development and management experience with standard OpenShift. We designed MicroShift to keep the mantra of "develop once, deploy anywhere": any application that you have developed to make use of OpenShift APIs, you will also be able to deploy on MicroShift. We believe these properties should also make MicroShift a great tool for other use cases, such as Kubernetes application development on resource-constrained systems like your laptop or small devices, scale testing, and provisioning of lightweight Kubernetes control planes. It runs as a single binary that can be deployed as an RPM package or as a container in many different environments such as Linux, Windows, and macOS.

This slide explains the end-to-end deployment workflow of a device. In a production environment, this all starts with a hardware vendor imaging an edge-optimized operating system like RHEL for Edge, which we provide. Usually the SSDs are imaged at scale prior to connecting them to the motherboards of the devices. The devices are then boxed and stored in a warehouse. It's important to highlight that the images are not associated with any given deployment site or location; they are just generic. Then, each device or set of devices will be shipped to its own location. Once there, a technician will screw the device to the wall, plug in the power and the network cable (it will probably be Power over Ethernet), and the work is done. The operating system on the device will start, and it will usually have a device management service endpoint baked into it.

There are two ways to authenticate our devices with our management system: the traditional approach, and a second way that makes use of the Secure Device Onboard (SDO) project. The FIDO Alliance is working on the SDO specification, and Red Hat is participating in that project too. SDO-supported devices will have a per-device key pair and an ownership voucher given by the manufacturer. SDO will automatically authenticate the device based on those vouchers and keys. The goal is that an edge-optimized operating system such as RHEL for Edge will use SDO as the foundation for this registration process. Once the device has been authenticated, it will receive configuration stored in a Git repo.
If we take a look at the example, the device will be configured based on this repo with config files, binaries, SSH keys, systemd services, et cetera. MicroShift will be one of the elements deployed and configured based on this repository. Finally, once the device is configured and MicroShift is up and running, the cluster will appear in what we call Advanced Cluster Management. It is a fleet manager for Kubernetes clusters, and the new cluster will appear there as an imported cluster, ready to be used for deploying applications on top.

Now that we have all the pieces to run a cloud-native federated learning workload, let's see what this demo looks like. In a production environment all these pieces would be distributed across different sites, but as promised, we are going to run a federated learning model all in one box, like a laptop or a server. In the base layer we have, of course, an operating system like RHEL for Edge, but it could be any other. The next layer above represents the container runtime; we have developed MicroShift and this demo to work with our reference runtime, which is CRI-O. Then, in order to provide cloud-native capabilities, we deploy MicroShift as our Kubernetes distribution. MicroShift not only runs the basic Kubernetes control plane and node components but also provides OpenShift APIs. MicroShift will be able to schedule and orchestrate our application components as pods.

The IBM Federated Learning project comes as a Python library. We have packaged this library in container images and created Kubernetes manifests to run different entry points for the aggregator and the different parties. In this demo we will run two parties. Data will be ingested into the parties through MQTT. We use an MQTT publisher to simulate sensors that inject data into the parties. Party pods connect to the MQTT broker, a.k.a. Mosquitto, and subscribe to a topic there. As a curiosity, we have also used the MQTT broker as a control plane for the MQTT subscribers to wake up the parties: when a subscriber accumulates enough data, it publishes a message to a control topic; the parties subscribed to that topic wake up, register with the aggregator, and start training (we sketch this wake-up mechanism in code a little further below). Once the training is performed by both parties, the trained model is sent back to the aggregator to execute its fusion algorithm and start the cycle once again.

Now it's time to move to our terminal and see this live demo. I hope you enjoy it.

So during this demo we are going to show you how to deploy MicroShift, which is a very easy process, and how to run our federated learning library components on top, in a cloud-native fashion. We are starting here from a fresh environment. It is a Fedora 34 environment, and what we are going to execute is the install script. The install script allows you to deploy MicroShift and prepare the environment in a very easy manner. It detects which Linux distribution you are running on. It detects the architecture, because MicroShift supports not only x86 but also ARM and other architectures. In case you're running Red Hat Enterprise Linux, it will register your subscriptions, apply SELinux policies, install dependencies, download and install our container runtime, configure it, et cetera. Right now it is getting the latest packages from the Fedora 34 repository. It is installing conntrack, firewalld, et cetera to make it more secure. Right now it is establishing some firewall rules and installing CRI-O. CRI-O, as mentioned during the slides, is our container runtime.
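As a quick aside, here is the rough sketch of the MQTT wake-up mechanism mentioned above. The broker address, topic names, buffering threshold, and helper function are illustrative assumptions rather than the demo's actual code; it just shows the idea of a party buffering simulated sensor data and using a control topic to trigger registration and training.

```python
# Rough sketch of the party-side MQTT wake-up mechanism -- all names and
# thresholds below are assumptions, not the demo's actual implementation.
import json
import paho.mqtt.client as mqtt

BROKER = "mosquitto.default.svc"   # assumed in-cluster service name for the broker
DATA_TOPIC = "sensors/party1"      # simulated sensor readings arrive here
CONTROL_TOPIC = "control/party1"   # "enough data, start training" signal

buffer = []


def register_with_aggregator_and_train(samples):
    # Hypothetical helper: in the real demo this is where the party would
    # register with the aggregator and kick off local training on the data.
    print(f"training on {len(samples)} samples...")


def on_message(client, userdata, msg):
    if msg.topic == DATA_TOPIC:
        buffer.append(json.loads(msg.payload))
        if len(buffer) == 1000:                      # assumed threshold
            client.publish(CONTROL_TOPIC, b"start")  # wake the party up
    elif msg.topic == CONTROL_TOPIC:
        register_with_aggregator_and_train(buffer)


client = mqtt.Client()  # paho-mqtt 1.x style constructor
client.on_message = on_message
client.connect(BROKER, 1883)
client.subscribe([(DATA_TOPIC, 0), (CONTROL_TOPIC, 0)])
client.loop_forever()
```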
Coming back to the demo: MicroShift will interact with CRI-O to deploy our applications, our infrastructure services, and so on. For your information, in order to run MicroShift you will need at least two CPU cores, two gigabytes of RAM, and 2.6 gigabytes of free storage space for the MicroShift binary. And it seems that the installation has finished. The installation script performs a very handy task, which is to create the kubeconfig file in the default location, so the kubectl binary will pick up that configuration by default; but in case you already have an existing configuration, it will merge both files and you will keep access to all your contexts, clusters, and users without erasing any pre-existing configuration.

Now we check the status of MicroShift, which is running as a systemd service, and we will read the journal to see some of the logs. MicroShift has an internal service manager framework that controls the boot-up of the different Kubernetes and OpenShift components. It has a dependency graph that will make, for example, etcd start first, then the kube-apiserver, and then the rest of the components like the controller manager, the scheduler, and finally the kubelet as the node component, plus kube-proxy. And once these core services are up and running, it will start applying some manifests to the kube-apiserver that will deploy our infrastructure services. These infrastructure services give you capabilities like DNS and service certificate rotation. We use the hostpath provisioner in order to be able to mount volumes into the pods and deployments. And we can see that now all the services are up and running and MicroShift is ready to deploy our federated learning library.

As we mentioned during the slides, the different parties will grab data from an S3 bucket. This S3 bucket is stored in AWS. And as you can see here, we are applying a Kubernetes secret with our AWS credentials stored in it. This way, the secret will be mounted into the aggregator and party pods, or deployments, and they will be able to access that S3 bucket.

Now what we are going to look at is the aggregator Kubernetes manifest. The aggregator manifest is basically a Kubernetes deployment. It has one container that runs the aggregator entry point, the Python script, with some environment variables to be able to configure it, and it will mount a couple of volumes: one volume is the AWS credentials and the other one is basically our working directory to store some data. So what we are going to do now is apply this manifest and wait for the deployment to be created. We have modified the aggregator and the party from the library in order to automate the initial boot-up, because otherwise you would have to type and execute some commands in order to start the aggregator and the training process; we have automated all of this. So if we take a look at the logs of the aggregator, we see that the aggregator has started successfully.

Now we will check the party manifest, the same way that we have done with the aggregator, and it's basically the same: another deployment with a container running the party entry point, the Python script, with some environment variables and the same two volumes with the AWS credentials and a working directory. We have applied the party manifest and we are going to check the logs of this party. Once the party starts, it will detect that the aggregator is running and it will try to register with the aggregator.
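For reference, here is roughly what a manifest like the aggregator deployment just described could look like, expressed as a Python dict and dumped as YAML. The image name, entry point, environment variable, secret name, and volume types are illustrative assumptions, not the exact manifests from the demo repository.

```python
# Sketch of an aggregator-style Deployment manifest: one container running
# the entry point script, plus the AWS credentials secret and a working
# directory mounted as volumes. Values here are assumed for illustration.
import yaml

aggregator_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "fl-aggregator"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "fl-aggregator"}},
        "template": {
            "metadata": {"labels": {"app": "fl-aggregator"}},
            "spec": {
                "containers": [{
                    "name": "aggregator",
                    "image": "quay.io/example/ibmfl-aggregator:latest",  # assumed image
                    "command": ["python", "aggregator_entrypoint.py"],   # assumed entry point
                    "env": [{"name": "NUM_PARTIES", "value": "2"}],      # assumed variable
                    "volumeMounts": [
                        {"name": "aws-credentials", "mountPath": "/root/.aws"},
                        {"name": "workdir", "mountPath": "/workdir"},
                    ],
                }],
                "volumes": [
                    {"name": "aws-credentials",
                     "secret": {"secretName": "aws-credentials"}},  # the secret applied earlier
                    {"name": "workdir", "emptyDir": {}},  # stand-in; the demo may use a
                                                          # PVC backed by the hostpath provisioner
                ],
            },
        },
    },
}

print(yaml.safe_dump(aggregator_deployment, sort_keys=False))
```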
So the party and the aggregator now have communication, and once the party has grabbed enough data, it will start training. We can see here that the training has started. This will take a little while, but we can also check what kind of information the aggregator has received. We check the logs of the aggregator again and we can see that one party has been registered and that the training has started. If we keep watching the party, we will see that at some point it will complete the training. As you can see, the local training performed by the party has now finished, and the party will generate a model update. This model update will be sent to the aggregator. The aggregator will run its fusion algorithm and send the result back to the party so it can train again, and the model will be improved at every cycle.

If you have enjoyed the demo, please scan the QR code and you will find the GitHub repo with all the assets to run the demo by yourself. Thanks for watching.