Hi everyone! So this is the cutting edge of how machine learning can happen in a privacy-centric way. That's what federated learning does. We'll see what it is and how it can be used, and how we used it, for anomaly detection, which is one use case of it.

This is the current standard paradigm for how machine learning models are built. You have data being generated from multiple sources. It goes in as events to a centralized cloud system, the models are trained there, and then inference happens either on the cloud or on the device. An example: say you want to do credit card fraud detection on a point-of-sale (POS) machine. Somebody swipes a card, and the data from POS machines around the world goes into a centralized system as discrete events. Models are built there, and inference happens there to tell whether that particular transaction is fraudulent or not.

Now this paradigm is undergoing a bit of a change, because we are getting more and more into smart systems. Everything around us these days is becoming smart: we have self-driving cars, smart buildings, smart cities, and manufacturing plants whose control systems are driven by IoT devices. There are more than 10 billion IoT devices today, and it's one of the fastest-growing segments. So what does this have to do with how we do machine learning? It raises a few major challenges. One is that these devices are not in one location; they keep moving around. A self-driving car keeps going around, which means its ability to connect to the internet is not guaranteed. And if you've been to a manufacturing plant, they often just won't put an internet-connected device in there.
So things change when you have a setup where, either because of regulations or because it's just not feasible, there is no internet. For example, if you're drilling 50 or 100 feet underground, internet connectivity is very hard. It's not that communication cannot happen at all; there are other protocols you can use, like Bluetooth and other short-range communication, but you don't really have internet to send data out and get inferences from your models.

All these IoT devices also typically run on batteries, like what you have in your phone. If you keep it connected to the internet, the battery drains within a day; if you put it in airplane mode, it lasts much longer. So you might want to optimize for battery life too. Sometimes you want a device to just work for a week or a month before the battery can be replaced, because replacing it all the time isn't feasible.

These devices also generate a lot of data at extremely high velocity. A self-driving car is a great example: it generates on the order of a petabyte of data every hour, capturing high-resolution images, and decisions need to happen right then rather than going to the cloud for inference. There is really no time to send a high-resolution picture to the cloud and wait for an answer.

These IoT devices also have very limited processing power. Some of them are microcontrollers, really, really small. They are meant to do one simple task, maybe take a reading and then perform some action; they are not even as powerful as our mobile phones.

Central to all of this is data privacy. In a lot of applications the user, or the company, wants to maintain the privacy of its customers and doesn't want to give the data out. These are the various challenges that make the current way we do machine learning very difficult to carry on.
Now, when you have all these 10 billion plus devices, rapidly growing and moving around all the time, the attack surface is very elastic. It keeps expanding, and cybersecurity becomes extremely critical. One of the core tasks in cybersecurity is anomaly detection. In simple terms, anomaly detection works over a bunch of time series: we have various events happening across multiple use cases, and the goal is to find anomalies in them, either within a single time series or across several time series. That is what the most common security application looks like. So how do we make this work while also respecting privacy and ensuring that it's fast? This is where federated learning helps us.

Federated learning is just a fancy phrase for three things. First, decentralized training: you don't train in one centralized cloud system, you decentralize the training process. You might say distributed systems already do this, that you can stand up a cluster and do that; we'll see how this differs. Second, once you have decentralized learning, you need a way to aggregate the results into a better model, some way of ensembling the various models that are created. Third, how do you maintain privacy in such a system? These are the three components that make up federated learning.

Decentralized learning happens something like this. Say you want to get started on a plant or a self-driving car. The first thing you do is instantiate a model from the cloud, or a central server, onto the edge nodes. These are really edge nodes: it could be your phone, a Raspberry Pi, your IoT devices. Whatever it is, you send the pre-trained model onto the device.
And now, based on the data that gets generated on the device, the model gets trained. You can have many of these edge nodes, and each of them trains on the data that lives on that particular device. It takes the pre-trained model and trains on whatever data is there, on the device, not on the cloud server.

This has a few challenges. In machine learning you generally want IID data, data that is independent and identically distributed. If data is produced from a single source and you are just updating on it, that has its problems. The data frequency also varies: one sensor might generate ten data points every second, another one data point every five minutes, another one data point per day. So the data that comes in is entirely different from node to node, the frequencies differ, and there is no guarantee that it is IID, which can create challenges for the algorithm.

So far, the model arrives and keeps training on individual devices. How do we get to the next version of the model? Each device now sends its model back to the server, and there we do something very naive and simple: we just average, in a process called federated averaging. You have the same use case running on multiple devices; you take the model weights and average them, and you get the next version of the model. That version is then sent back to all the nodes. One note here: only the weight changes are sent, not the entire model, and the data is not moved at all. The data that gets generated stays on the devices. So you don't need to move the data, the communication cost goes down, and the network load goes down.
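The averaging step just described can be sketched in a few lines. This is a minimal NumPy illustration, not the production implementation; the function name, client weights, and sample counts are made up for the example. Note that federated averaging typically weights each client by its number of local examples, and plain averaging is the special case where all counts are equal.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of per-client model weights (federated averaging).

    client_weights: one list of np.ndarray per client (one array per layer).
    client_sizes:   number of local training examples per client,
                    used to weight the average.
    """
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    averaged = []
    for layer in range(n_layers):
        layer_avg = sum(
            w[layer] * (n / total)
            for w, n in zip(client_weights, client_sizes)
        )
        averaged.append(layer_avg)
    return averaged

# Three edge nodes, each with a one-layer "model" (a single weight matrix).
clients = [
    [np.full((2, 2), 1.0)],
    [np.full((2, 2), 2.0)],
    [np.full((2, 2), 3.0)],
]
sizes = [100, 100, 200]  # the third node saw twice as much data
new_model = federated_average(clients, sizes)  # every entry is 2.25
```

The server runs this once per round, then ships `new_model` back to the nodes; only weights cross the network, never raw data.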
But it also makes things more privacy focused, because the data stays on the edge devices. The new version of the model, averaged from all the devices, goes out from the server; new data gets generated on the devices; the models get updated and sent back to the server again; and the process keeps repeating. So this is how it happens.

Now, you might ask: how is the data labeled? I'll answer that in a few slides. The question is that we have training data, but somebody needs to label it. That problem arises if you want to do supervised learning; we'll talk about the different classes of models a couple of slides later. And yes, it's basically the gradients that get sent, correct. As for the difference from a parameter server, that comes on the next slide.

So far there is nothing groundbreaking in the way this process happens; any cluster computing or distributed computing setup works exactly like this. You have data, it gets distributed to multiple nodes, training happens, and you aggregate. Even with neural networks you can distribute the data or distribute the model; those are the two ways distributed learning can happen in neural networks. The new part comes when you introduce the privacy component, and the privacy component comes in how the updates go back. For example, if I just send my updates from here, and I know the model before a node's update came in and the model after, I can tell exactly what data was generated on that particular node.
So you want to be in a place where the data generated on a particular device cannot be reverse-engineered. That is what differential privacy gives you, and the way it works is by changing how the gradients are created: clip them, add noise, and you provide differential privacy. How do you do that? First, during training you average across batches: you don't use the gradient of just one batch, you take a few of the previous batches and average them. Second, the maximum gradient update that can ever happen is clipped: in one batch the update cannot exceed a maximum norm. Third, and this is what really makes the difference, you add random noise to the gradients. Even if somebody wants to reverse-engineer the update, they cannot recover the exact data, because random noise has been added. The effect is that the impact of any one data point on the model update really, really goes down; it is capped by the maximum norm. This makes reverse engineering really hard.

So two things happen: you do training in a decentralized way, and you add differential privacy to it. There is a third component, which we haven't tried, and frankly it's still at a very research stage: secure multiparty computation, where even the gradients that get sent are encrypted. We are not talking about encryption yet, but even the gradients can be encrypted. The first good, successful implementation of that was released really just last week, so it's something people are actively working on. But decentralized training and differential privacy are the two common things people use. If you want to do this, remember these are meant to run on the edge nodes.
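The clipping and noising steps described here can be sketched roughly as follows. This is an illustrative DP-SGD-style snippet with made-up parameter values, not a calibrated privacy mechanism; the function name, clip norm, and noise multiplier are assumptions for the example.

```python
import numpy as np

def privatize_gradient(grad, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Clip a gradient to a maximum L2 norm, then add Gaussian noise.

    Clipping bounds the influence any single data point can have on the
    update; the added noise makes the exact contribution unrecoverable.
    clip_norm and noise_multiplier are illustrative, not tuned settings.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    norm = np.linalg.norm(grad)
    # 1. Clip: scale the gradient down if its norm exceeds the cap.
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    # 2. Add noise calibrated to the clipping bound.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grad.shape)
    return grad + noise

g = np.array([3.0, 4.0])           # norm 5.0, above the cap
private_g = privatize_gradient(g)  # clipped to norm 1.0, then noised
```

In a real system this runs on the edge node before the update leaves the device, so the server only ever sees clipped, noised gradients.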
The way you do it has to work on large volumes of data on edge devices, on things that have less than 1 GB of RAM, maybe 512 MB. That makes things like exhaustive parameter search extremely hard, so you want algorithms that can work on the edge nodes. There are two major tool choices if you want to do this: TensorFlow Federated, which is based on TensorFlow, and PySyft, which is based on PyTorch and built by an organization called OpenMined. Both are really very well implemented. We use TensorFlow Federated.

A quick journey of how our models evolved. When we started doing anomaly detection, we just used z-scores: anything beyond plus or minus 3 sigma we categorized as an anomaly. Simple stuff that people start with. As we got more and more data, we moved to the two common unsupervised approaches: isolation forests, which of all the techniques currently works really well for us, and k-means, clustering the data and finding what is anomalous relative to the clusters. Eventually, once we had a lot more of these data points, we started having labeled data. The way it works is that analysts come and interact with the system: "you marked this as an anomaly, it isn't; you didn't mark this one, and it is an anomaly." So labels accumulate over a period of time; it doesn't happen on day one. On day one it's still unsupervised learning, for a few weeks or months, and when you have enough labeled data points you slowly transition into a mix of supervised and unsupervised models. That is when deep learning starts to make sense. We started with deep autoencoders, and now a significant proportion of what we do is a mix of deep autoencoders on the cloud and federated learning on the edge.

Okay, so the architecture has a couple of pieces.
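Before getting into the architecture, here is what that z-score starting point looks like in practice. This is an illustrative sketch on a synthetic sensor series, not our production code; the function name and the threshold are the plus-or-minus 3 sigma rule described above.

```python
import numpy as np

def zscore_anomalies(series, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean.

    The simple starting point: anything beyond +/- 3 sigma of the
    series' own mean is treated as an anomaly.
    """
    series = np.asarray(series, dtype=float)
    z = (series - series.mean()) / series.std()
    return np.abs(z) > threshold

# A synthetic sensor: 500 normal readings around 20.0, plus one spike.
rng = np.random.default_rng(1)
readings = np.concatenate([rng.normal(20.0, 0.5, 500), [35.0]])
flags = zscore_anomalies(readings)  # only the final spike is flagged
```

It is crude (a single global mean and sigma, no seasonality, and the outlier itself inflates the estimates), which is exactly why we moved on to isolation forests, k-means, and eventually autoencoders.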
You can do federated learning through Keras. If you want to build your own custom models and algorithms, that's done through a low-level interface called Federated Core. Any Keras model can be used in federated learning, and that's one of the biggest advantages of moving to it: if you've been building deep learning models in Keras, moving to federated learning is quite straightforward.

One more thing: how do you compress the updates? Quantization can really bring down the size of your weights, so the update transfer becomes much faster. That's something we're experimenting with. I'm not going to claim we have had massive success with it, but it's something we're spending a lot of time on.

Just summarizing where we are; this is the last slide. We get better model accuracy, because we can learn from a lot more data: not all the data can be moved to the cloud from federated devices, for the various challenges we discussed. It provides much lower prediction latency. It consumes less power, because you're only transferring model weights and not the entire data, and that also makes it privacy focused. Two more things: because updates can be encrypted, you can move models across organizations without worrying about GDPR and other regulations, and because you start from a pre-trained model, you can get going immediately.

So, summarizing: decentralized learning, federated averaging, and differential privacy can help you build truly privacy-centric models. This is me, this is my contact: BRGABA on Twitter and LinkedIn, and that's my email ID. Thanks. Yeah, we'll take the questions offline. Okay.