Hi again. The next speaker is Densuri. I think she's calling from India, and she has a very interesting talk about federated machine learning. Hi Densuri. Okay, now I'm finally unmuted. Hi Kiu, good to speak with you again. How are you? I'm good. How are you? Fine, fine. And you're calling from India, but where exactly? I am calling from the NCR region right now, Delhi. Okay. And it's been raining a lot today, so my house was flooded this morning. Okay, let's... Whenever you're ready, I think you've got it. Yes, so hello everyone. Thank you for joining me. Although I can't see you, I'm hoping that you're all there in Matrix right now, and we have the next 45 minutes, which are going to be extremely interesting, so let's dive right in. So let me tell you a little bit about myself. I'm someone who works with data. My background is in data science. I started as a data scientist, but now I'm working as an MLOps engineer. And as someone who works with data, there are a few things that often give me nightmares. As a consumer of intelligent, data-enabled solutions: how safe is my own data in the hands of the developers of those solutions? The second thing that gives me such bad nightmares is, as a developer of data-powered solutions, am I fully compliant with data protection regulations? And how much of a privacy risk am I really subjecting my users to? These questions, dilemmas, nightmares if you will, made me join Edelabs, a startup based in Bangalore, India, where we're working on making AI safe. It is at this place that I was exposed to the challenging field of privacy in machine learning. And I started working on federated learning for two reasons. The primary reason is that it has gained a lot of popularity since 2017, and we'll look at why. The second reason is that it is the easiest way to get started if you want to develop an intuition about private AI. So let's dive right into it.
Okay, so in this session, we're first going to address the privacy cost in AI, and then we'll look at a high-level introduction to building privacy-preserving machine learning applications. We will of course focus on federated learning: where and how we are using it right now, and the problems it solves. We will then build our very own minimal federated learning system. And as with any new field, there are several opportunities that come with it to address theoretical and engineering challenges, and we will review some of these opportunities towards the end. I'd like to make a small note here. Since this is a talk and not a workshop, I'll try not to make it sound like a monologue, and I'd like to make it engaging, since machine learning is in large part brainstorming and creative problem formulation. I won't be able to see your answers live, but I still highly encourage you to post them in the chat, and I'd love to look at them later. Okay, so now let's work on really driving home the idea that traditional machine learning comes at the cost of privacy. We generate a large amount of data with every waking second, whether it's through monetary transactions, ordering food online, through smart devices in our homes, or the wearables we love. I'm wearing a Fitbit right now. And of course, we love these applications. But often, we don't fully know what kind of data is being collected. And when we do know what data is being collected, most often we don't understand the full implications of sharing it. For example, if you install a Pomodoro timer app that also, say, analyzes which times of the day you're the most productive, the same application could require permissions for your phone book or your messaging history, which, of course, does not make sense. Right, so now I'm going to show you something that is even more mind-boggling. This is data breaches through the ages, or the history of the last two decades in data breaches.
As you can see, there is a massive increase in data breaches ever since the advent of our favorite apps and AI solutions. Cue Facebook, cue Google+, and now here we are. These data breaches range from design tools like Canva to financial data exchange platforms like Plaid. And there is also the data concern around privacy in partnerships among institutions. For example, take the partnership between 23andMe and GlaxoSmithKline, which involves the data of millions of unwitting customers. Now, you could argue that such data are often shared after de-identification, such as anonymization or hashing of PII, and that this is safe enough. However, events such as the Netflix Prize stand witness to the fact that de-identification with existing techniques is just not enough. It is an interesting story. In 2006, Netflix released one such de-identified dataset, inviting the tech community to build the best recommender system for its movie suggestions. However, through a statistical analysis of data obtained from IMDb, researchers were able to link the two datasets and uniquely identify individuals. This type of analysis is also known as a linkage attack. Okay, so now I hope I've piqued your interest in just how important privacy is in machine learning. So let's actually look at how it's done. PPML, as it is often abbreviated and how I will refer to it moving forward, is privacy-preserving machine learning. It is a set of techniques in machine learning workflows that range from the data never leaving its source of origin for training, to being de-identified enough that it resembles a different population while still preserving the distribution of the population. So if that has piqued your interest, we will actually take a look in the next couple of slides at how these things are carried out. But before that, let's circle back to what machine learning looks like so far. So you have a great problem to solve, and of course it relies on data.
Generally, you would collect this data from different sources, put it in a centralized server, and then our analyst in blue would try to formulate a problem on it, would try to make a model out of it, and we'd call it a day, right? However, whenever data moves from its source of origin, or is aggregated at a place or a server that you cannot trust, there is a privacy issue. So now let's actually look at what techniques aim to solve it. Let's start with the key theme of this talk, federated learning. The key idea with federated learning is that the data never leaves its source of origin. The code, essentially the model, comes to the data, and this is where the actual training is performed. The machines, also known as nodes, that participate in federated learning can range from our phones and computers to IoT devices, or even data servers across institutions. In most federated learning scenarios, and I'll explain why I say most just a bit later, there is a central trusted server that begins this training, and it sends out a global model to the participating nodes. After the nodes are done training, they send their local updates back to the server; the server then performs an aggregation on these updates and updates the global model. And then you just rinse and repeat. So we just saw what we can change in our model training process in order to ensure privacy. Now let's look at what we can do with the data itself. This next approach is differential privacy, and it's very interesting. It's actually more of a concept than an algorithm in itself. Essentially, differential privacy is defined as a promise to an owner of the data, which could be an application on your phone or, again, a data silo. And this promise provides a guarantee to the owner that they are safe no matter how their data gets used. Now let's see where this would be useful.
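Stepping back to the federated loop for a second before the differential privacy example: the send, train-locally, aggregate, repeat cycle just described can be sketched in a few lines of NumPy. The linear model, data, and hyperparameters here are all hypothetical; only the shape of the protocol matters:

```python
import numpy as np

def local_train(weights, X, y, lr=0.1, epochs=5):
    """One node: gradient-descent steps on its own private data only."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

def aggregate(updates):
    """Server: plain average of the local weight vectors."""
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Two nodes, each holding data that never leaves them.
nodes = []
for _ in range(2):
    X = rng.normal(size=(100, 2))
    nodes.append((X, X @ true_w))

global_w = np.zeros(2)
for round_ in range(10):                          # rinse and repeat
    updates = [local_train(global_w, X, y) for X, y in nodes]
    global_w = aggregate(updates)                  # only updates travel

print(global_w)  # approaches true_w without pooling any raw data
```

Only the weight vectors ever cross the network; the raw `X` and `y` stay on their nodes, which is the whole point of the setup.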
I'll actually quote an example, which is a slight tweak of an example from the original paper on differential privacy. So I come from India, as Theo had pointed out, where in most urban cities the air quality is poor. Suppose the government wanted to conduct a study on the effect of air quality inside people's homes on cardiac diseases. Now, my air purifier would work as a great source of this data. But let's add another interesting element to the story: I live near a factory that constantly produces emissions. This would drastically alter the data from my air purifier. Now enters a malicious actor, since of course with all privacy scenarios there is always a malicious actor, which, as you guessed, is the insurance agency. If the insurance agency somehow got hold of this data, by way of figuring out my age, sex, geolocation, and other similar attributes and linking them with this data, they could potentially get my insurance application refused. So differential privacy works by introducing noise into the system. Now think about it: where do you think the noise would be added? I guess if you look at the image, it should be a bit clearer by now. A differentially private algorithm does not actually add noise to the raw data itself, but to the queries, or the results of queries, on that data. There are two approaches to this: local privacy and global privacy. Within local privacy, the data generators do not trust the aggregator and add noise to the data at the source. Contrary to this, within global privacy, the data generators do trust the aggregator, and therefore they share the raw data; it is then the job of the aggregator to add noise whenever our malicious character Bob tries to query this data. Quite an interesting approach, right? Okay, so moving forward, let's look at another approach. What if you could encrypt the data and then have operations, specifically mathematical operations, performed on it?
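Before moving on to encryption, the "noise on the query result" idea can be made concrete. This is a minimal sketch of global differential privacy with the Laplace mechanism; the AQI readings, bounds, and epsilon are all made up for illustration:

```python
import numpy as np

def private_mean(values, lo, hi, epsilon, rng):
    """Global DP: the trusted aggregator computes the true mean,
    then adds Laplace noise calibrated to the query's sensitivity."""
    n = len(values)
    values = np.clip(values, lo, hi)     # bound each person's contribution
    sensitivity = (hi - lo) / n          # max change one record can cause
    return values.mean() + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(42)
aqi = rng.uniform(50, 300, size=1000)    # hypothetical in-home AQI readings
answer = private_mean(aqi, lo=0, hi=500, epsilon=1.0, rng=rng)
print(answer)  # close to the true mean, but every query is perturbed
```

A smaller epsilon means more noise per query and a stronger promise to any one purifier owner, including the one who lives next to the factory.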
Homomorphic encryption is a technique that works on exactly this principle. Current state-of-the-art techniques allow mathematical computations like addition, subtraction, and multiplication to be performed on encrypted data. In practice, it could look like a user-server setup where the user sends encrypted data to an untrusted server. The server performs some operations on it and sends back the result, which of course is also encrypted. This ensures that the server is unable to see the data, yet it can still operate on it. In this example, we see that Alice sends the numbers 6 and 7 to the server, which performs addition on them. Without homomorphic encryption, our malicious actor Bob, here he is again, is able to see these numbers, and worse yet, also able to see the results if he has access to the server. However, with homomorphic encryption, all that Bob sees is garbage. So where is the practical application of homomorphic encryption, you ask? It is great for environments where machine learning is offered as a service, for example machine learning on the cloud, where the server is untrusted and you would not want to share your private, sensitive data with it. So we've finally made it this far. Now let's look at some of the ways federated learning is implemented and some existing use cases it addresses, both across edge devices and enterprises. First, federated learning use cases on edge devices. My first example here is recommendation systems. A good use case of this is how apps like Spotify or Netflix could understand your media preferences by locally training on the data generated on your devices. Another example is routine device management. For example, Google Photos occasionally asks you to declutter your photos by removing screenshots, and it learns this by training locally on your phone. Another example is health monitoring.
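Circling back to Alice's 6 and 7 for a moment: the additive behavior described above can be reproduced with a toy Paillier-style scheme. This is a sketch with deliberately tiny primes, purely for intuition; it is not secure and not what any production system uses:

```python
import math
import random

# Toy Paillier cryptosystem -- tiny primes for illustration only;
# real deployments use moduli of 2048 bits or more.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Alice sends Enc(6) and Enc(7); the server adds them without seeing either.
c1, c2 = encrypt(6), encrypt(7)
c_sum = (c1 * c2) % n2        # multiplying ciphertexts adds plaintexts
print(decrypt(c_sum))         # 13 -- while Bob only ever sees "garbage"
```

Multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts, which is exactly the property the untrusted server exploits without ever decrypting anything.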
For example, fitness tracker apps can monitor your resting heart rate or the quality of your sleep simply by virtue of training on the data that's generated at the source. The next example is predictive typing. This example is the most widely known, because it comes from the study that practically started federated learning, so to speak: Gboard from Google. The paper was released in 2017, where they talked about employing it, and now here we are. Then, facial unlocking. Apple is actually known to employ federated learning in its Face ID, thereby serving as another interesting example of edge-device use cases. So if you can think of more edge-device use cases, please add them to the chat; I'd love to know what you come up with. Okay, so I promised you enterprise use cases, and here we are. The first two use cases, credit card fraud detection and credit lending, are quite similar in nature. Banks have data on your transaction history and purchase patterns, but what banks often do not have is in-house machine learning teams. So they end up employing external consultancies, and users' data is again moved from its source. Federated learning, as we have seen so far, would help in this situation. Disease prediction is another such case, where historical medical data about a person is generally present with hospitals or pathology laboratories, and oftentimes using these different sources of data together yields more accurate results. So think about it: in this way, federated learning is more in line with, or to phrase it better, more of an enabler of, data-centric machine learning. Similarly, the next use case is sentiment analysis, which is not just restricted to analyzing people's Twitter feeds, but can also be employed by companies as a human resource tool to understand the general sentiment of their employees towards the company's practices or its administration.
Autonomous vehicles, again, is a use case where data is generated in your car and federated learning can be employed to learn from this data. However, these previous use cases have yet to see production-level collaborations in this direction, so this is definitely an open area. If you want to work on it, check out the recent publications and dive right in; it's a very promising field. Now, the last use case is precision medicine. There is in fact a company called Owkin, which is actively employing federated learning in their product, Owkin Connect, to utilize data from sources such as medical records and genomic mapping to develop precision medicine. Actually, personalized medicine would also not be a wrong term here. So now we have looked at all the use cases and what privacy-preserving machine learning is in extensive detail. Let's actually cover some ways in which federated learning is implemented. As we go through these next slides, you will realize that some of these concepts are overlapping, and it essentially boils down to your problem formulation. For example, let's look at these tables. This is a fictitious dataset from Hospital Awesome, Hospital Bravo, and Hospital Curious; I personally find A, B, C quite boring after a point. What we notice here is that these datasets are essentially the same in feature space, that is, the vertical columns in these tables, but they're different in sample space, that is, the rows in these tables. This horizontal partition along rows is a great way to reason about horizontal federated learning. Can anyone guess how a federated learning model would be deployed in this case, or if it's similar to a scenario you're already familiar with? Do post that down in the comments; I'd love to know what you think. So coming back to our example: suppose the local authorities in a city want to analyze COVID outcomes. Of course, we're living in a pandemic; this does not need a lot of background.
So suppose this local authority wants to analyze what the COVID outcome for an admitted patient would be, given the symptoms they were admitted with, whether they have any comorbidities, and what their oxygen requirement was. And of course this dataset is similar across all these hospitals. Federated learning can effectively be employed here to carry out this analysis without actually moving the data out of Hospitals Awesome, Bravo, and Curious. In contrast to this, there is vertical federated learning. As you might have already guessed: we were just talking about a partition across sample space with the feature space constant; now we have the exact opposite. The feature space is different, but the sample space is constant. So imagine taking a table, a dataset, and essentially dividing it in two vertically. In a practical scenario, this doesn't mean that every person present in hospital A is also present in the fitness tracking app's data, but let's stick with this example for now. Let's go back to the authority that wanted to analyze COVID outcomes in patients. Now suppose some creative data scientist comes in, and they want to use features from a fitness tracking app for these patients. This is also something that federated learning covers. However, one caveat here would be an actual private set intersection of these datasets. That is to say, these two institutions would not want individual entities to be identifiable, yet they would still want to identify the intersection of the two datasets. Okay, that's a lot about vertical federated learning. Now let's look at another case, which is cross-device federated learning. I'm sure this image must be familiar to a lot of you already; it is from the Google AI blog, which, like I said, is where it all started.
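Before moving on, the horizontal/vertical distinction is easy to see on a toy table; the column names below are invented for illustration:

```python
import pandas as pd

# One logical dataset about patients (entirely fictitious):
df = pd.DataFrame({
    "patient_id":  [1, 2, 3, 4],
    "oxygen_req":  [2.1, 0.0, 5.3, 1.2],
    "comorbidity": [1, 0, 1, 0],
    "resting_hr":  [72, 61, 88, 67],
})

# Horizontal FL: same columns (features), different rows (samples) --
# e.g. Hospital Awesome holds two patients, Hospital Bravo the others.
hospital_a = df.iloc[:2]
hospital_b = df.iloc[2:]

# Vertical FL: same rows (samples), different columns (features) --
# e.g. the hospital holds clinical features, the fitness app holds heart rate.
hospital = df[["patient_id", "oxygen_req", "comorbidity"]]
fitness  = df[["patient_id", "resting_hr"]]

# Private set intersection would then align on patient_id without either
# party revealing which IDs it holds outside the overlap.
print(hospital_a.shape, hospital.shape)  # (2, 4) (4, 3)
```

In the horizontal case the row partitions train against the same model architecture; in the vertical case the parties first have to agree, privately, on which samples they share.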
Cross-device federated learning is essentially when a federated machine learning workflow, to be exact, is carried out across devices, plain and simple. You could also think of it as horizontal federated learning, since the user or analyst involved in such a scenario would already have control over the entire tech stack used to collect this data. So it would be similar in feature space, with the devices essentially being the difference in sample space. Okay, so now let's look at cross-silo federated learning. Exactly opposite to cross-device federated learning, this type of federated learning is carried out across data silos. This is a picture from the NVIDIA blog, where they very nicely explain what federated learning is, and it is from an initiative that NVIDIA and King's College London are carrying out. There is a famous challenge called BraTS, the International Multimodal Brain Tumor Segmentation challenge, and they're essentially developing a federated learning model to learn from the data present across these silos: for example, a community hospital, a research medical center, or even a cancer treatment center. Now, an interesting thing to note here is that cross-silo federated learning could be either horizontal or vertical. Again, let's reason about it. Or, how about I actually leave it to you to tell me how cross-silo federated learning could be both horizontal and vertical? Okay, so let's look at another implementation aspect, which mostly concerns the topology of how federated learning is implemented. You remember when we talked about federated learning wherein a central server sends models to the nodes where the data resides? The presence of a central server, which coordinates this learning procedure and performs aggregation on local updates from these nodes, is what gives this implementation its name: centralized federated learning. Contrary to this, we have decentralized federated learning.
Now, as is clear from the topology shown here, the nodes do not send updates to a central aggregator; in fact, there is no central aggregator. The nodes share their updates with the nodes adjoining them, and each node performs aggregation on the updates from other nodes as well as its own local update. This aggregated model is then shared back with that node's adjoining nodes. So now let's also look at federated aggregation. It is an area of active research, and I mean really active, because you can see papers, anonymous submissions in fact, published at ICLR as recently as March 2021. But coming back to the original point: how would you actually aggregate weights? The simplest, naive approach would be to average the weights or gradients from the local models. But there are often cases where this approach wouldn't hold, and there are other security challenges that federated aggregation faces, which we will discuss in the opportunities and challenges section. So now for the exciting part of this talk: let's build a minimal federated learning system. Okay, so what are the ingredients you would need for a minimal federated learning system? By minimal, I mean a toy example where we don't consider any of the restrictions, any of the technical challenges. We are essentially only focused on taking a dataset, dividing it, carrying out training across different nodes, and then aggregating the updates away from these nodes. The dataset that we're going to use is from the tutorial mentioned in the slide. It is from the Cleveland Clinic Foundation for Heart Disease, and with it we will model how the 13 features in this data can predict the likelihood of someone developing cardiac disease. The dataset is very small, essentially 303 samples and 13 features, and it comes as a tutorial bundled with Keras, titled "Structured data classification from scratch". So now let's look at the rest of the ingredients.
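Coming back to federated aggregation for a moment: the naive averaging mentioned above, weighted by each node's sample count as in the FedAvg paper, can be sketched as follows (the shapes and client sizes here are illustrative):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-layer weight arrays.

    client_weights: list (one entry per client) of lists of np.ndarray,
                    one array per trainable layer.
    client_sizes:   number of local training samples per client.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        # Each client's contribution is proportional to its data size.
        averaged.append(sum(w[layer] * (size / total)
                            for w, size in zip(client_weights, client_sizes)))
    return averaged

# Two hypothetical clients with one layer each and unequal data sizes:
a = [np.array([1.0, 1.0])]
b = [np.array([3.0, 3.0])]
print(fed_avg([a, b], client_sizes=[100, 300]))  # [array([2.5, 2.5])]
```

The size weighting is one small step up from a plain mean: the client holding three quarters of the data pulls the global model three quarters of the way towards its update.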
There is a centrally coordinating server. We will also have a modeling and data processing utility. Because the server and the clients, the participating nodes, need to communicate, we will need a communication channel; in this case, we will use WebSockets. We will also need a medium to transfer the local updates, and in our toy example, we're going to use Kafka. Okay, so a small note on Socket.IO: we're going to use the Python Socket.IO implementation. I have linked the code towards the end of the slides and you can check it out; it's on GitHub, you can run the demo on your own, and it comes bundled with all the requirements and any Docker Compose files you would need. Socket.IO enables real-time, bi-directional, event-based communication. Of course, we do need real-time bi-directional communication, because the nodes and the server are constantly going to sync, or align, on the sending of the global model, the receiving of the local updates, and so on, as we will see later. We will also use Kafka, which is a distributed event or message streaming platform. Why are we using a distributed platform? Because honestly, in my experience, it's just quite easy to work with, and it gives you a producer-consumer pattern. Yes, you could implement this with the basic threading mechanisms that Python offers, but using something robust and reliable out of the box, especially for production scenarios, is definitely the way to go. So now let's actually look at the recipe. Of course, we will have a server class. The server class will need to define a connect method, an event which the participating nodes can emit to in order to connect with the central server. We will also need a method to start a training round and send the global model. Then, of course, we will also need an event on which the server can listen for acknowledgements of updates from the nodes.
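Since Kafka itself needs a running broker, here is the producer-consumer pattern it provides, sketched with the standard library's `queue` and `threading` modules instead; the names and payloads are invented stand-ins for the real Kafka topic:

```python
import queue
import threading

updates_topic = queue.Queue()   # stands in for a Kafka topic

def client(node_id, local_update):
    # In the real demo this would be a Kafka producer publishing
    # base64-encoded layer weights after local training.
    updates_topic.put((node_id, local_update))

def server(expected_clients):
    # The server consumes until every participating node has reported.
    received = {}
    while len(received) < expected_clients:
        node_id, update = updates_topic.get()
        received[node_id] = update
    return received

threads = [threading.Thread(target=client, args=(i, f"weights-{i}"))
           for i in range(3)]
for t in threads:
    t.start()
result = server(expected_clients=3)
for t in threads:
    t.join()
print(sorted(result))  # [0, 1, 2]
```

Kafka buys you the same decoupling, but persistent, replayable, and robust to the dropouts and reconnects that a real federated deployment has to survive.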
The nodes will essentially send their updates to Kafka, so the acknowledgement isn't combined with the updates. This is actually a little bit in contrast to some other existing implementations out there, which use gRPC, but for now we have this. We will also need a function to consume these updates once they have all been received. And of course, the central part of having a central aggregator is aggregating all the weights that we receive from these local updates. Here we'll actually only aggregate the trainable weights, because dropout layers, regularization layers, and input layers don't really get aggregated. Then we will also have an evaluate method, which evaluates on the holdout set that we're going to keep, and we will also want to store the history of federated losses across these rounds and the overall accuracy of the global model as we continue training. Now let's look at the recipe for a client. This is actually quite simple; it is the server, in the centralized federated learning case, which has to do the major amount of the work. The client needs a method to connect to the server, and it also needs an event on which it can listen for acknowledgements from the server. Of course, this node is nothing without a method that actually trains the model and then sends the updates back to the server. In our case, we are going to take the individual layers of the model and encode them as Base64 strings; from the client we will send these updates over to the server via a Kafka topic, and from the server we will send the global model via an event. When our training is done, we will want to end the session: get the latest model weights, update the model that sits at the participating node, and perform any cleanup as necessary. And then finally, we will want to disconnect from the server.
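The Base64 encoding of layer weights can be sketched like this; the function names are illustrative, not the demo's actual API:

```python
import base64
import io

import numpy as np

def encode_layer(weights: np.ndarray) -> str:
    """Serialize one layer's weights to a Base64 string for transport."""
    buf = io.BytesIO()
    np.save(buf, weights)          # .npy format keeps dtype and shape
    return base64.b64encode(buf.getvalue()).decode("ascii")

def decode_layer(payload: str) -> np.ndarray:
    """Reverse of encode_layer, run on the receiving side."""
    return np.load(io.BytesIO(base64.b64decode(payload)))

# A hypothetical dense layer: 13 input features, 8 units.
layer = np.random.default_rng(0).normal(size=(13, 8)).astype("float32")
restored = decode_layer(encode_layer(layer))
print(np.array_equal(layer, restored))  # True: lossless round trip
```

Because the `.npy` header carries dtype and shape, the receiving side reconstructs the exact array, which is why the clients in the demo can decode the global model correctly.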
Okay, so Murphy's Law: whatever can go wrong will go wrong, and my house was flooded this morning, so I did not want to rely on the power staying up or on being able to carry out live coding here. So I have recorded a brief demo, where I will very quickly show this in action. Let me just play it. Okay, so here we have our Kafka service set up via Docker Compose. Now we start a server, which I have configured for 10 rounds, and it connects to the Kafka topic by way of a consumer. Now we're going to start two different clients, since our dataset is quite small: it has 303 samples, and after taking out a holdout set, we have 242 examples. So we're looking at two clients here with 121 examples each, each training for five epochs. We can see that the server correctly sends the model across to these clients, and the training is indeed being carried out. The clients were able to decode the model weights correctly from the Base64 representation. One thing I forgot to mention previously: if you run this model in a centralized setting, the baseline accuracy is 80 to 82%, and we will see that we receive roughly the same accuracy running it here. Now, this is a toy example, which is why we are getting the same accuracies across the centralized setup and the federated learning setup. This is not the case in real scenarios, and I'll talk about why that happens. Okay, the video started again; let's just go to the next slide. Online conferences! Okay, so we're finally at the section where we discuss opportunities and challenges in federated learning. The biggest problem with federated learning is its non-IID data conundrum. Now, it is easy to confuse federated learning with distributed learning. However, there are several assumptions that distributed learning makes that fail within a federated learning scenario.
These assumptions are that the data is equally distributed across nodes, that the classes within these data are balanced, and that no one node can introduce oversampling of any particular class. However, if you look at real scenarios where federated learning can be employed, you will see pretty soon that these assumptions break. The training data, especially in the case of edge devices, could be very biased towards a particular class on a particular client. It could also be unbalanced in size: the training data on one node could have a million entries, whereas other participating nodes could only have half a million or even fewer. There is also the problem of communication between these devices. Within a distributed machine learning system, you have control over the stack, over the technology. Whereas within a federated learning system, the key idea again is that you're deploying it in a location that does not belong to you. So other failures can also come in. First of all, how do you actually perform an EDA on this data? For example, if Tesla wanted to employ federated learning for its fleet, how would it actually get to process the data that sits in those cars? Clean it up, pre-process it? Then come the technical failures. There's the issue of network latency: out of all the connected nodes, there could be stragglers that affect the overall training quality or the training time. There could also be connection dropouts, and irregular reconnects can happen. This is a case we cover in the federated learning implementation that we have developed at Edelabs; we deal with a lot of these reconnects, dropouts, and irregular reconnects. For example, a node could be sending an update from a previous session or a previous round, which is either corrupted or stale.
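A common way to reproduce exactly this kind of label skew in experiments is a Dirichlet split over labels; a small concentration parameter gives each node a very biased class mix (the alpha value below is illustrative):

```python
import numpy as np

def dirichlet_partition(labels, n_nodes, alpha, rng):
    """Assign sample indices to nodes with Dirichlet-skewed class mixes.
    Small alpha -> each class concentrates on a few nodes (non-IID);
    large alpha -> roughly uniform (the distributed-learning assumption)."""
    nodes = [[] for _ in range(n_nodes)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Split this class's samples across nodes by Dirichlet proportions.
        props = rng.dirichlet([alpha] * n_nodes)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for node, chunk in zip(nodes, np.split(idx, cuts)):
            node.extend(chunk)
    return nodes

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)    # binary labels, e.g. disease/no
skewed = dirichlet_partition(labels, n_nodes=4, alpha=0.1, rng=rng)
print([len(n) for n in skewed])           # very unequal node sizes
```

With a partition like this in hand, you can measure how badly naive averaging degrades as the nodes drift apart, which is precisely the break from the equal-distribution assumptions above.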
And you wouldn't want these scenarios, because they would drastically affect the outcomes of your learning problem. Another important thing to note is that federated learning is not a standalone solution to privacy; I love this meme from xkcd. Predictive models, like most models, can overfit on certain types of rare data, and you wouldn't want that, because there are scenarios like a model inversion attack, where someone could query the model enough times to figure out those rare data points within the model and their particular labels. Also, when you're deploying models to such a number of uncontrolled devices, the attack surface increases. I remember a recent study wherein someone was able to insert malicious code into their neural network updates. These are some things any practitioner of federated learning would definitely have to keep in mind. There are, of course, a lot of consortia and organizations coming together to solve these problems and build collaborative solutions. To go back to the Owkin example, they already have a federated learning consortium within Europe, and it is worthwhile collaborating on these problems to really bring privacy about in machine learning. So this is where I will end my talk. I realize there is only a single minute left for questions, but I'll be more than happy to take them in the breakout rooms. Thank you for attending. Okay, this was a very interesting talk. Congratulations, Densuri. We have one question and I think that maybe we can... Yeah, go on. ...take the time to answer. Where does the holdout set on the central server come from? It would need to contain private data, right? Yes, this is actually a very interesting question, and we are also trying to work on it. Generally, what this would look like is obtaining synthetic data.
So there is another approach, private synthetic data generation, that I didn't cover. What you would want to do is, by way of a technique like differential privacy, get a dataset that sort of resembles the original dataset you're trying to model on, but isn't exactly the original dataset. Okay, I think a discussion is starting in the conference room, so you'll have to continue this in the breakout. Awesome. Thank you. Thank you very much. It was a very interesting talk.