Alright, good afternoon, and welcome to the session. My name is Mohak Chadha, and today I will be talking about FedLess: Secure and Scalable Serverless Federated Learning. A bit about me: I'm a final-year PhD candidate at the Chair of Computer Architecture and Parallel Systems at the Technical University of Munich. I work mostly on solving various challenges in the domain of serverless computing, and I have experience in cloud computing, high-performance computing, parallel computing, and systems for ML.

This presentation is structured as follows. First, I will give a brief introduction to serverless computing and federated learning, followed by why combining them makes sense. Then I will introduce FedLess and how it addresses the straggler problem in federated learning, present some experimental results with the system, and finally cover what we are currently working on.

If you look at a brief history of the cloud, the x-axis denotes the focus on application or business logic and the y-axis denotes the level of abstraction in the virtualization stack. In the beginning, with traditional IT, the unit of scale was the physical server and the deployment model was bare-metal servers. These usually lived for years and could be deployed in hours or days. With the introduction of virtualization, VMs became the unit of scale, with Infrastructure-as-a-Service as the deployment model; VMs get deployed in minutes and live for weeks. Further, with OS-level virtualization, containers became the unit of scale, with Platform-as-a-Service as the deployment model; containers can be deployed in seconds and live for hours or minutes. All of these deployment models are essentially server-based, by which I mean that the user or application developer has to configure certain back-end server configuration parameters.

With the introduction of AWS Lambda in 2014, the new concept of Function-as-a-Service (FaaS) appeared. Here, infrastructure management is completely handled by the cloud service provider, and the application is decomposed into fine-grained functions. These functions became the unit of scale; they can be deployed in milliseconds and essentially live for seconds. Some prominent examples of serverless computing platforms today are commercial offerings such as AWS Lambda, Google Cloud Functions, Azure Functions, and IBM Cloud Functions, and open-source alternatives such as OpenWhisk, OpenFaaS, and Knative.
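As a concrete reference point, here is what such a function can look like: a minimal sketch of an HTTP-triggered handler in the style of AWS Lambda's Python runtime. The event shape assumes an API Gateway-style payload; the logic is purely illustrative.

```python
import json

def handler(event, context):
    # The platform invokes this on every event (e.g., an HTTP request).
    # Module-level code above this function runs once per cold start;
    # the handler body runs on every invocation, warm starts included.
    body = json.loads(event.get("body") or "{}")
    name = body.get("name", "world")
    return {"statusCode": 200, "body": json.dumps({"message": f"Hello, {name}"})}
```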
So how do these platforms actually work? Function-as-a-Service is an event-driven paradigm where functions are invoked on events such as HTTP or gRPC requests. On the occurrence of these events, the FaaS platform is responsible for providing resources to these functions and isolating them in ephemeral, stateless containers called function instances. On a function invocation, the platform first checks whether a function instance already exists. If it does not, the platform creates a new function instance; this process is what is called a cold start. The created function instance processes the event and then returns the response to the user. The FaaS platform can create concurrent function instances to handle multiple events or requests on demand, and when the number of requests decreases, it can automatically scale the number of active function instances down to zero.

To better understand the cold start process: when an event arrives, the platform downloads the code, starts a new instance, bootstraps the function runtime, and then executes the function handler. The whole process up to and including bootstrapping the runtime is the cold start. If there is already an active function instance that can handle the event, only the handler method is executed; that is called a warm start. Optimizing this path is essentially the responsibility of the platform.

Moving on to federated learning: it is a distributed learning paradigm that enables collaborative training of ML models among a group of participants, devices, or clients without sharing the data. In contrast to the traditional cloud-centric approach to deep learning, which requires training data to be collected and processed at a central server, in federated learning the data never leaves the devices; instead, the participants only exchange updated model parameters. It thus tries to solve the fundamental privacy problems in distributed learning.

How does it work? An FL system has two main components: a central server and the participating clients. The central server is responsible for coordinating the training process and holds the latest global model. Traditional FL training is done in synchronous rounds, and in each round a subset of clients participates in the training. At the start of a round, the central server sends the latest global model to a subset of devices to train, and waits for the responses from those clients or for a pre-configured timeout. When the devices receive the global model, they start training locally; after they finish, they push their updates back to the central server. Upon receiving the updates, or when the timeout occurs, the server constructs a new global model by aggregating the clients' updates. This process is repeated until a desired accuracy is reached.
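To make the synchronous round just described concrete, here is a minimal sketch of one round with FedAvg-style aggregation. The `client.train` interface and the sequential loop are illustrative simplifications (real systems invoke clients in parallel under a timeout); this is the textbook round structure, not FedLess code.

```python
import random
import numpy as np

def run_round(global_weights, clients, clients_per_round):
    # Select a subset of clients and send them the latest global model.
    selected = random.sample(clients, clients_per_round)
    updates, sizes = [], []
    for client in selected:  # conceptually parallel, bounded by a timeout
        local_weights, n_samples = client.train(global_weights)
        updates.append(local_weights)
        sizes.append(n_samples)
    # FedAvg: the new global weights are the per-layer average of the client
    # weights, weighted by each client's local dataset size.
    total = sum(sizes)
    return [
        sum((n / total) * update[i] for update, n in zip(updates, sizes))
        for i in range(len(global_weights))
    ]
```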
Clients in federated learning can be mobile devices, edge devices, institutions operating their own data centers, or virtual machines managed by infrastructure service providers. Based on the scale of the federation, FL can be divided into two categories: cross-device, where the participating clients are mostly mobile and edge devices with limited compute capabilities, and cross-silo, where the clients are essentially organizations with sufficient compute capabilities.

So, why bring serverless computing to federated learning? Does it make sense? If you look at the different challenges in FL: since FL is traditionally a synchronous process, clients that finish their training early often spend a lot of time waiting for others to finish before a new round begins, which leads to unnecessary cost and wasted resources. Most clients have heterogeneous hardware, which often leads to resource over-provisioning and cumbersome infrastructure management for the data holders. And there is scalability: at the end of each round, lots of clients suddenly report back their results at the same time. FaaS, on the other hand, fundamentally aims to solve many of these problems. It provides rapid scalability during request bursts, automatic scaling to zero when resources are unused, an attractive pricing and development model, and, finally, ease of use.

So, introducing FedLess. It is a research project bootstrapped by me and one of my colleagues at TUM at the end of 2021. What is it? It is a system and framework for federated learning on a heterogeneous fabric of FaaS platforms, meaning it can simultaneously work with clients in the cloud, on edge devices, or in on-premise data centers. It provides essential features for security, supports training with differential privacy, has a modular design, and can be easily extended. It currently supports seven FaaS platforms out of the box, supports multiple FL training strategies such as FedAvg, FedProx, and SCAFFOLD, supports training of arbitrary DNN models using TensorFlow, and offers various performance optimizations to fix the shortcomings of serverless functions.

Let's look at the system architecture of FedLess. It can be grouped into several components. Clients are essentially serverless functions, which get their data from an S3 bucket or a mounted network drive. The controller is a stateful process responsible for managing and monitoring the entire training lifecycle. It includes the strategy manager and a mocking system to simulate the platform as a whole on a single machine: for debugging, client and aggregator processes can run on a single machine without the need to deploy the clients on actual FaaS platforms. The developer only needs to specify the mock flag when running the controller, and the rest is handled internally by the mock invoker. The parameter server is MongoDB, which we chose because of its reliability, replication support, and support for RBAC rules. Finally, the aggregator functions are responsible for updating the model weights and computing the accuracy of the trained models.
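As a rough illustration of this architecture, a client function might look like the sketch below. The connection string, collection names, event fields, and the `load_local_partition` helper are hypothetical, not the actual FedLess API; the sketch only shows the flow described above: fetch the global model from the parameter server, train on local data, and upload the update.

```python
import numpy as np
import pymongo
import tensorflow as tf

def client_handler(event, context):
    # Hypothetical parameter-server address and database layout.
    db = pymongo.MongoClient("mongodb://parameter-server:27017").fedless
    # 1. Fetch the latest global model from the parameter server.
    doc = db.models.find_one({"session_id": event["session_id"]})
    model = tf.keras.models.model_from_json(doc["architecture"])
    model.set_weights([np.array(w) for w in doc["weights"]])
    # 2. Train on this client's own data (e.g., from S3 or a mounted drive).
    x_train, y_train = load_local_partition(event["client_id"])  # hypothetical helper
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
    model.fit(x_train, y_train, epochs=event.get("epochs", 1), verbose=0)
    # 3. Upload the local update; the raw data itself never leaves the function.
    db.updates.insert_one({
        "session_id": event["session_id"],
        "client_id": event["client_id"],
        "round": event["round"],
        "weights": [w.tolist() for w in model.get_weights()],
        "n_samples": int(len(x_train)),
    })
    return {"statusCode": 200}
```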
The security features of FedLess can be divided into four aspects: function ownership, authentication and authorization, parameter server access, and general security features. Data holders are responsible for their own functions, which enables complete trust through full control and flexibility. All participants and requests are authenticated: only the trusted server can invoke client functions, and only identified and approved functions can join the heterogeneous fabric. We also support custom identity providers, such as SAML 2.0 and OAuth 2.0.

The mechanism for this should be cloud-agnostic, that is, agnostic to any FaaS platform, and should require no manual intervention. For this we make use of AWS Cognito, which essentially uses JSON Web Tokens: all requests to clients contain tokens signed by Cognito, and the clients can easily check whether a token was actually signed by the Cognito user pool and sent by the FL server. We follow best practices for security and counter most of the vulnerabilities listed by the Open Web Application Security Project (OWASP) with our system design, for example by isolating authentication and authorization into separate functions.

A major problem in FL is privacy. The main idea is that the model parameters can leak information about the training data; there are essentially two types of attacks, called membership inference attacks and model inversion attacks. There are different ways to prevent this: for example, encrypting the model parameters using techniques such as secure multi-party computation or homomorphic encryption, or using differential privacy; in most cases, a hybrid approach is used. Encryption is problematic for FaaS because of its huge computational overhead and complex inter-function communication. DP, on the other hand, is well suited, since the clients can just add noise to the parameters before uploading them to the parameter server. FedLess by default implements local differential privacy, a form of record-level privacy in which the client functions add Gaussian noise to the parameters before uploading them to the parameter server.

The codebase of the system is entirely written in Python 3, with support for TensorFlow 2+ and Keras. It also incorporates built-in performance optimizations, such as streaming aggregation, which performs a running-average aggregation of the updated client model parameters so that the aggregator does not have to load all the model updates into memory at a single point in time. We also have an LRU cache in the function's global namespace, to work around the ephemeral, stateless nature of FaaS functions, and we implemented federated evaluation, which enables client-side evaluation of global models so that you get better, fairer accuracy results.
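A minimal sketch of the streaming-aggregation idea just mentioned: fold the client updates into the aggregate one at a time, so the aggregator never holds all model updates in memory at once. The document fields are illustrative assumptions, not the actual FedLess schema.

```python
import numpy as np

def streaming_fedavg(update_cursor):
    """Running weighted average over an iterable (e.g., a MongoDB cursor)."""
    agg, seen = None, 0
    for doc in update_cursor:
        weights = [np.array(w) for w in doc["weights"]]  # per-layer arrays
        n = doc["n_samples"]
        if agg is None:
            agg, seen = weights, n
        else:
            # Incremental weighted mean: agg <- (seen*agg + n*w) / (seen + n),
            # applied layer by layer. Equivalent to FedAvg over all updates.
            agg = [(seen * a + n * w) / (seen + n) for a, w in zip(agg, weights)]
            seen += n
    return agg
```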
Now, addressing one of the bigger problems in federated learning: stragglers, which are essentially slow clients. Stragglers can occur for various reasons. One is system heterogeneity: some clients can have low compute resources. Also, since in FL most clients have non-independent and non-identically distributed (non-IID) data, some clients can have more data samples than others, which leads to statistical heterogeneity. There can be networking issues or complete failures, and there can also be cold starts.

If you look at this effect in practice with the traditional FL training strategy, FedAvg: when we increase the number of stragglers in the system, we observe that it takes more time to train the model, and the accuracy of the final global model is significantly reduced compared to a scenario with no stragglers in the system.

So how do we account for that? We developed and integrated into FedLess a strategy called FedLesScan. FedLesScan is a semi-asynchronous, clustering-based strategy specifically tailored for serverless federated learning, and it can dynamically adapt to the clients' behavior at very little to no extra communication cost. It contains two parts: first, an intelligent client selection algorithm based on DBSCAN clustering, and second, a staleness-aware aggregation scheme to asynchronously aggregate delayed client updates.

Our intelligent client selection scheme has several features that distinguish it from other approaches. First, it is based on DBSCAN, which, unlike other clustering approaches, does not need the number of clusters to be determined before training. Its complexity is O(N log N), where N is the number of clients, so it is suitable for large-scale federated learning systems. We also consider failures and slow updates, grouping clients with similar behavior together to minimize the effect of stragglers on the rest of the clients.

How does it work? Our strategy partitions clients into three tiers: rookies, participants, and stragglers. Rookies are clients that have not yet participated in the FL training process and for which no behavioral data exists. Participants are the group of clients that can take part in the clustering process. Stragglers are clients that have missed one or more consecutive training rounds; these clients have the lowest priority in the client selection process.

To better explain the selection strategy, assume that all clients are in a jar. This jar can be divided into three jars based on the tiers we just mentioned. The rookies jar contains the clients that have never been called before; essentially, all clients are rookies in the beginning. The second jar contains the clients that participate in the clustering process, and the third jar contains the unreliable clients. For client selection, the jars are ordered by priority, and the strategy for picking clients from the three jars differs: we use random selection for rookies and stragglers, while the second-tier clients are selected based on clustering. Selection takes clients from the first jar, then moves on to the next if not enough clients are available. Moreover, as training progresses, clients can switch between the different tiers. After a few training rounds, the rookies jar will be empty, once all clients have been called at least once, and we then have clients in the second and third tiers. During training, we move clients between these two tiers based on a cooldown variable: it is set to demote clients that fail for a certain number of rounds to the third tier, and once it expires, they can move back to the second tier.
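A simplified sketch of this three-jar selection logic. The tier bookkeeping and the `select_by_clustering` callback are stand-ins for the actual FedLesScan implementation, which is described next.

```python
import random

def select_clients(rookies, participants, stragglers, k, select_by_clustering):
    """Pick k clients, honoring tier priority: rookies > participants > stragglers."""
    chosen = []
    # Tier 1: rookies, picked at random (no behavioral data exists for them yet).
    chosen += random.sample(rookies, min(k, len(rookies)))
    # Tier 2: participants, picked via the DBSCAN-based clustering scheme.
    if len(chosen) < k:
        chosen += select_by_clustering(participants, k - len(chosen))
    # Tier 3: stragglers, picked at random, only if we are still short of k.
    if len(chosen) < k:
        chosen += random.sample(stragglers, min(k - len(chosen), len(stragglers)))
    return chosen
```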
So how does the clustering actually take place? For clustering, from the second tier of clients we collect training times and missed-round ratios. From the missed rounds we compute a penalty factor; this factor gives a higher penalty to clients that missed recent rounds. Next, we compute an exponential moving average over both lists, yielding a single value per client: a training-time EMA and a missed-rounds EMA. We provide a way to automatically tune the parameters of DBSCAN to give the best clustering. After we obtain the clusters, we compute a total EMA for each client by adding the training time and the penalty time. We then sort the clusters and pick clients from the best clusters first to participate in the training process.

To accommodate slow updates from some clients, we provide a staleness-aware aggregation scheme that includes past updates, but with a dampening effect. The aggregation is triggered at the end of each round. Instead of fetching only the most recent round's results from the parameter server, we fetch all updates newer than a certain tunable parameter called tau, and then aggregate the updates according to the equation shown on the slide.
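The exact aggregation equation is on the slide and not reproduced in the talk; as a rough illustration of the idea, here is one common form that dampens each update exponentially in its staleness. The dampening function (beta raised to the staleness) is an assumption, a typical choice in semi-asynchronous FL, and not necessarily FedLesScan's exact formula.

```python
import numpy as np

def staleness_aware_aggregate(updates, current_round, tau=3, beta=0.5):
    """Weighted average of updates, dampened by how many rounds old they are."""
    agg, total = None, 0.0
    for doc in updates:
        staleness = current_round - doc["round"]
        if staleness > tau:
            continue                   # older than the tau cutoff: excluded
        d = beta ** staleness          # fresh updates get d = 1, stale ones less
        w = [d * np.array(layer) for layer in doc["weights"]]
        agg = w if agg is None else [a + x for a, x in zip(agg, w)]
        total += d
    # Normalize by the sum of dampening factors to get a weighted mean.
    return [a / total for a in agg] if agg else None
```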
So what does the training workflow in FedLess look like? The FL admin configures the training process. The controller then requests a new invocation token from the authentication server, using the credentials configured by the FL admin. Next, it fetches the client behavioral data from the FedLess database; this step depends on the training strategy. For example, with FedAvg, which does random selection of clients, this step is not performed. The controller then invokes the intelligently chosen FL clients for the round. In the next step, the clients contact the authentication server to validate the token. On successful validation, each client fetches the latest global model from the parameter server and computes its local model update. When training is complete, the clients upload the new local model to the parameter server, along with some behavioral data that we need for the intelligent client selection process. At the end of the training round, the controller adjusts the participants' behavioral attributes based on the strategy and pushes them back to the FedLess database. Afterwards, we invoke the aggregation function, which combines the clients' results into a new global model, and at the end of the round the controller invokes a subset of clients again to calculate the updated model's accuracy. This process repeats until the accuracy configured by the FL admin is reached on the global model.

Now I would like to present some experimental results obtained with our system. We have experimented with a wide variety of datasets, which are also available in our repo and can be easily configured with our system. Some of them are mentioned here, but we also support many others, and further datasets can be added and experimented with.

But first, let's check whether this whole idea actually makes sense by comparing a FaaS-based system with a traditional Infrastructure-as-a-Service-based system. What we did was deploy 100 clients on the same hardware with the same CPU capabilities, one time using a traditional FL framework called Flower, with Docker and cgroups, and another time using OpenFaaS functions. What we found was that our FaaS-based approach is actually a bit slower than the traditional IaaS-based approach. This is essentially because of overhead from the FaaS platform itself, partly due to the inherent nature of functions (cold starts, no persistent storage), and also due to the parameter server communication.

But if you look at cost: we calculate cost using the Google Cloud Functions pricing model, and we do not calculate point estimates but broader bounds. For this, our calculations include how the cost would change if the FedLess client functions took two or three times as long as the Flower clients, and if the Flower clients only took 0.5 times as long. Across all these experiments, we observed that FaaS was actually cheaper. However, when we increase the number of active clients relative to the total number of clients in the federation, FaaS becomes less cost-effective. This is intuitive: the better the resource utilization of the VMs, the more cost-efficient they are compared to serverless. Still, these results are really promising, since in federated learning there can be millions of devices in a federation, with only a few hundred of them active in each round.

Now, some results with our straggler mitigation strategy. These experiments were all performed on Google Cloud Functions, with up to 200 concurrent clients in each round. We observed that our strategy achieves faster convergence than the other strategies we compared against, in this case FedProx and FedAvg; on the right we show the results for the Google Speech dataset. For both the standard scenario and the scenario with 70% stragglers, our strategy achieves faster convergence, and across all the datasets mentioned on the previous slide, we actually achieve about 2% higher final accuracy. Looking at training times, our strategy reduces overall training time by 8% on average; the values in the plot represent the average across all the different straggler scenarios for a particular dataset. So our strategy is definitely faster. And looking at training cost, our strategy leads to a 20% reduction across all datasets and scenarios.

To give a brief idea of what we are currently doing in this project: one of the most implicit assumptions in current FL systems is that all FL clients must have a uniform ML model architecture in order to train a global consensus model. However, this assumption fails to address fundamental client-level challenges in practical FL systems. For instance, clients in FL can have significantly skewed, non-IID data distributions. As a result, in extreme non-IID scenarios, the uniform global model may show poor generalization performance following the model aggregation process, due to the higher variance among the trained client models.
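A standard way to produce such skewed partitions in experiments is Dirichlet sampling over the class labels, which is also what the alpha parameter mentioned below controls: small alpha means each client sees only a few classes with very unequal sample counts, large alpha approaches IID. A minimal sketch, not tied to the FedLess codebase:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices among clients with Dirichlet(alpha) label skew."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        # Divide this class's samples among clients with Dirichlet proportions.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, splits)):
            client_indices[client].extend(part.tolist())
    return client_indices
```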
In our experiments with three model architectures under such skew, we observed a 27.1% average decrease in accuracy when going from a uniform data distribution to a non-IID one. Towards this, we are looking into knowledge distillation (KD) techniques to train heterogeneous, personalized client models. Knowledge distillation is a popular technique in ML that facilitates the transfer of knowledge from a large and complex model, known as the teacher model, to a smaller and more efficient model, referred to as the student model. In this setting, you do not even need to exchange the model parameters or weights; only the class logits are exchanged between the clients and the central server.

We have integrated a popular federated knowledge distillation technique called FedDF, a server-side knowledge distillation technique (more details can be found by scanning the QR code), converted it into a completely serverless architecture, and integrated it into FedLess. We have observed some promising results. Here, alpha is essentially the parameter of a Dirichlet distribution controlling the skew of the class labels in each client's data distribution: a lower alpha value means there is a lot of skew, so some clients only have a few class labels and also have very unequal numbers of samples. When going from alpha = 100 to alpha = 0.1, we observe that across the different model architectures, which can be configured per client, there is no significant drop in accuracy. We are investigating this further with other datasets and other model architectures. In this case, the model architectures were simply CNNs with different numbers of layers: a two-layer CNN, a three-layer CNN, and so on.
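For illustration, server-side ensemble distillation in the spirit of FedDF can be sketched as below: the client models' logits on a public transfer set are averaged and used as the teacher signal for each (possibly different) student architecture. The temperature, optimizer, and data handling are illustrative choices, not the exact FedDF or FedLess implementation.

```python
import tensorflow as tf

def distill(student, client_models, transfer_data, epochs=1, temperature=1.0):
    """Train one student architecture against the averaged client logits."""
    opt = tf.keras.optimizers.Adam(1e-3)
    for _ in range(epochs):
        for x in transfer_data:  # unlabeled public batches; no private data
            # Teacher signal: ensemble (mean) of the client models' logits.
            teacher_logits = tf.reduce_mean(
                tf.stack([m(x, training=False) for m in client_models]), axis=0)
            teacher_probs = tf.nn.softmax(teacher_logits / temperature)
            with tf.GradientTape() as tape:
                student_logits = student(x, training=True)
                # KL divergence between teacher and student distributions.
                loss = tf.reduce_mean(tf.keras.losses.kl_divergence(
                    teacher_probs, tf.nn.softmax(student_logits / temperature)))
            grads = tape.gradient(loss, student.trainable_variables)
            opt.apply_gradients(zip(grads, student.trainable_variables))
    return student
```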
More details about our work can be found in these two papers. Great. So, thank you for attending this session today. If you have any questions, I would be happy to answer them.

Q: Thank you, that was a very fascinating talk. I had some questions as I was watching it, but you covered a lot of them throughout. One question that I still have is: how do you account for concept drift or distribution drift across time? For example, how the averaged gradients can diverge, or just how the data distribution can diverge across the participating clients?

A: It depends on the non-IID-ness of the distribution across clients. What we have observed is that FedLesScan converges quite well even in highly non-IID scenarios, but in some cases where there is too much skew, it does not converge that well. For those cases we are looking at KD techniques, which are more robust. But that is a really active research problem in FL; no one has a complete answer to it. It depends on the strategy, the aggregation scheme, the client model architectures you use, and a lot of other factors. Simple averaging techniques definitely diverge, which is why I think KD and personalizing client models, depending on the data distribution of a particular client, is the way to go. Training a global consensus model will not always work in extreme non-IID scenarios; that is why we are looking at client model personalization using KD.

Q: Okay, that makes sense. Thank you.

Q: Sorry for the delay, first of all. I saw the results regarding the quality of the model itself, but what about the overhead in training? When you need to share the weights all the time, I suppose that directly impacts the training time. What about that?

A: That is already accounted for.

Q: I'm talking about the training phase, not just the results, because in the last slide you showed the results on quality.

A: Yes, there was a plot with training time, which already included this.

Q: Ah, sorry, I missed it. Thank you.

A: Great then. Thank you.