Okay, so I'm currently a postdoc researcher at Ghent University. I'm José Santos. I graduated just last month, and this talk summarizes the latest work from my dissertation, which tries to bring network-aware information into the scheduling process in Kubernetes, so that workload allocation decisions can be based on that information. Unfortunately, my collaborator and co-speaker, Shen, was not able to be here today, so I'll be doing the presentation. This work is a joint collaboration between Ghent University and IBM. Today I'll talk about the motivation: why do we need network-aware scheduling in current Kubernetes clusters? I'll show you the main building blocks that we currently have in our network-aware framework, explain some of the implementation details and the considerations behind them, and walk through a complete example of how you can use our framework to deploy your applications in a Kubernetes cluster with better network-aware placement decisions. Then, if the demo gods and the Wi-Fi help me, I'll show you a live demo deploying the Online Boutique application, a typical application used in microservice research: a web-based application with several workloads. By deploying this application first with the default scheduler and then with our network-aware framework, you will see the differences you get in the Kubernetes cluster. Then I'll talk about the challenges and the lessons learned, and finally acknowledgments. When we started working on network-aware scheduling, we noticed that the Kubernetes scheduler mainly focuses on resource efficiency: as a developer, you set up CPU and memory requests and limits so that the scheduler can make better decisions for your pods. 
And that's particularly important when you are focusing on energy efficiency, but for certain applications, like latency-sensitive IoT applications and video streaming applications, low latency plays a major role, because you want to reduce the application's response time to give a pleasant experience to your end users. Currently there are very few scheduling plugins in the Kubernetes architecture that you can use with contextual awareness. For instance, when your application has several workloads, the scheduler allocates these workloads independently of each other: it has no knowledge of which workloads will communicate with which. There is also no knowledge of the infrastructure topology, meaning the available bandwidth between cluster nodes, or the communication latency or delay on your network links; this type of information is simply not available today. So we thought about how we can do better. The easy answer is to consider some sort of latency or network bandwidth metric in the scheduling process, but how can we do this? There is a lot of research on this, typically involving theoretical modeling that can handle all the constraints and all the complexity concerning heterogeneity of resources, network latency among cluster nodes, and bandwidth, but finding the optimal allocation scheme with such a model can take several hours, which is not scalable in production environments. So how can we address network-aware scheduling in an operational environment like Kubernetes? 
So we have proposed a framework that essentially considers two main aspects. The first is application characteristics, because we believe microservice dependencies play a major role. For instance, here you have two example applications, the Online Boutique application and the Redis cluster one, and you can clearly see that each has several workloads that communicate with each other. So when the scheduler is deploying one of these workloads, it should be aware of that: it should be aware of the dependencies each microservice has within a certain application. The second main aspect is the infrastructure topology: we want to establish network weights among cluster nodes based on different metrics. In our solution, as a developer, based on the knowledge you have of your infrastructure, you can manually define network weights among the cluster nodes. This builds on the topology labels currently supported in Kubernetes, like zones and regions. We believe all kinds of topologies will benefit from this framework, but of course high latency is a major concern in multi-region clusters. Even in a small-scale cluster, if your nodes have different network connections, they can benefit from latency-aware decisions. As an example, in a data center you have certain applications, like financial applications, where if you deploy dependent workloads far away from each other, in different zones of the data center, you may get high latency in the communication between those two workloads. So it's quite important to consider this information in the scheduling process. Before I jump to the main overview of our framework, I would like to ask a question: how many of you have already heard about the Kubernetes SIG Scheduling community and have used their scheduler-plugins repository? Please raise your hand. I see just a very few hands. 
So essentially we have designed our framework on top of their repository, on top of their current framework, and I'll explain the main building blocks that we currently have. We have two custom resources, each with its own controller. The first one we call an AppGroup, where you can say which workloads belong to a certain application and also establish dependencies among these workloads; I'll go into this in more detail later on. The second custom resource is a NetworkTopology, where you can establish network weights among different zones and regions in the cluster, so that these costs can be used later by our scheduling plugins. We also have three scheduling plugins. The Kubernetes SIG Scheduling community has opened up the kube-scheduler component with several extension points where you can develop your own algorithms to schedule pods in the cluster. We have three functions, a queue sorting function, a filtering function, and a scoring one, which I'll also cover later today. With these three functions, we try to approximate the optimal solution that a theoretical model would give us, but in a much more scalable way. As an additional component, we also have a NetPerf component that can typically be used in small- or medium-sized clusters: you run netperf measurements across your cluster nodes and save the measurements in a ConfigMap. We then have a controller that accesses this ConfigMap and updates the costs in the CR based on the netperf measurements. It's an additional component that we are also proposing. Another important part is bandwidth. We believe bandwidth should be considered a resource, just as CPU and memory are in Kubernetes. 
So we are advertising bandwidth as extended resources in Kubernetes, so that as a developer you can also make resource requests and set limits based on these extended resources. We also use the bandwidth CNI plugin available in Calico to limit the bandwidth of pod deployments; it currently supports ingress and egress traffic shaping. With this, we try to find better nodes in the cluster by also considering bandwidth as a resource. Now let's go through a complete example showing how our framework works. Imagine that you have an application group A1 composed of three pods. We have a YAML-based file where we say which workloads belong to this application group and establish dependencies among the workloads. In terms of dependencies, we currently support a minimum bandwidth requirement and a maximum network cost. Essentially, we tell the cluster that between two dependent workloads there is a minimum bandwidth that should exist so that they can communicate properly, and also a maximum network cost, which is the maximum threshold, in terms of communication latency, that should exist between the two cluster nodes allocating these two workloads. Then we have the NetworkTopology CR that I mentioned before, where you can manually input costs based on the knowledge you have of the infrastructure. Currently we are focusing on latency: we want to find placement schemes with low latency, so we establish costs between regions and zones in the cluster. As you can see, the YAML-based file corresponds to the figure on the slide. Now let's go through the scheduling plugins we are proposing. First, one of the things we thought about is: when you have several workloads in your application, how do we select the first pod to be scheduled? 
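To make this example concrete, here is a rough sketch of what the two CRs and a bandwidth-aware pod could look like in YAML. The API groups, field names, and values here are illustrative reconstructions from the description in the talk, not copied from the actual CRDs, so check the scheduler-plugins repository for the exact schema; the extended resource name in the pod is hypothetical, while the two bandwidth annotations are the standard CNI traffic-shaping ones.

```yaml
# Illustrative AppGroup for A1: three workloads, two dependencies.
apiVersion: appgroup.diktyo.x-k8s.io/v1alpha1     # assumed API group/version
kind: AppGroup
metadata:
  name: a1
spec:
  numMembers: 3
  topologySortingAlgorithm: KahnSort              # one of the six sorting options
  workloads:
    - workload: {kind: Deployment, name: p1}
      dependencies:
        - workload: {kind: Deployment, name: p2}
          minBandwidth: "100Mi"                   # minimum bandwidth P1 <-> P2
          maxNetworkCost: 30                      # maximum tolerated network cost
    - workload: {kind: Deployment, name: p2}
      dependencies:
        - workload: {kind: Deployment, name: p3}
          minBandwidth: "250Mi"
          maxNetworkCost: 20
    - workload: {kind: Deployment, name: p3}
---
# Illustrative NetworkTopology: manually entered costs between regions and
# zones, keyed by the standard Kubernetes topology labels.
apiVersion: networktopology.diktyo.x-k8s.io/v1alpha1   # assumed API group/version
kind: NetworkTopology
metadata:
  name: cluster-topology
spec:
  weights:
    - name: UserDefined
      topologyList:
        - topologyKey: topology.kubernetes.io/region
          originList:
            - origin: r1
              costList: [{destination: r2, networkCost: 20}]
        - topologyKey: topology.kubernetes.io/zone
          originList:
            - origin: z1
              costList: [{destination: z2, networkCost: 5}]
---
# Illustrative pod requesting bandwidth as an extended resource and shaped by
# the bandwidth CNI plugin annotations.
apiVersion: v1
kind: Pod
metadata:
  name: p1
  annotations:
    kubernetes.io/ingress-bandwidth: 100M
    kubernetes.io/egress-bandwidth: 100M
spec:
  containers:
    - name: app
      image: example/app:latest                   # placeholder image
      resources:
        requests:
          network.aware/bandwidth: "100M"         # hypothetical extended resource
        limits:
          network.aware/bandwidth: "100M"
```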
It's because the order in which you allocate all the pods of your application has an impact on its final end-to-end response time. So how do we solve this? Which pod should we consider first? We have a potential solution, which is an additional QueueSort plugin that sorts pods based on topological sorting. Topological sorting looks at the dependencies among the different workloads and finds a preferred order based on that topology information. We currently support six algorithms, and significant differences can be obtained depending on which sorting algorithm you select in the application group: as you can see here, different algorithms produce different orders. For the Online Boutique application we achieved lower latency with three of the algorithms, so you can clearly see that the order matters for how much we can optimize the latency in the Kubernetes cluster. Essentially, we attribute an index to each workload in the AppGroup, and based on that index we sort the pods in the waiting queue to be scheduled. Then we have a filtering and a scoring function. The filtering function tries to filter out nodes that would produce low scores, based on the requirements declared in the pod's AppGroup, which I've already mentioned. The scoring function is the main function of our framework, where we try to find nodes that ensure network latency is minimized for dependent workloads, based on the network costs available in the NetworkTopology CR. But how does this work? I'll show you an example. 
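To illustrate how the queue-sort step could assign indices, here is a small sketch of Kahn's topological sorting algorithm (one of the classic variants for this problem). This is my own illustrative Python, not the plugin's actual Go code, and the workload names follow the P1/P2/P3 example from the slides.

```python
from collections import deque

def kahn_sort(nodes, edges):
    """Topologically sort `nodes`, where an edge (u, v) means u must be
    scheduled before v.  Returns a {workload: index} map, or raises if the
    dependency graph has a cycle."""
    indegree = {n: 0 for n in nodes}
    adjacent = {n: [] for n in nodes}
    for u, v in edges:
        adjacent[u].append(v)
        indegree[v] += 1
    # Start from workloads with no unscheduled prerequisites (sorted for
    # a deterministic order among ties).
    queue = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = {}
    while queue:
        u = queue.popleft()
        order[u] = len(order)            # next free index in the queue order
        for v in sorted(adjacent[u]):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("dependency cycle detected")
    return order

# Example: P1 depends on P2 and P2 depends on P3, so dependencies come first.
print(kahn_sort(["P1", "P2", "P3"], [("P3", "P2"), ("P2", "P1")]))
# → {'P3': 0, 'P2': 1, 'P1': 2}
```

The controller would attach these indices to the workloads so the scheduler's waiting queue can be sorted by them.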
So let's assume we have an application group A1 with three pods, three deployments, with a dependency between P1 and P2 and another dependency between P2 and P3. We have the NetworkTopology CR that I showed previously, and two regions, four zones, and eight nodes in our cluster. For the filtering function, imagine we want to schedule the workload P1, and P2 is one of its dependencies. We know there is already an instance of P2 deployed on node N1. So how does the filtering function work? Essentially, it excludes nodes that would leave a higher number of dependencies unmet, reducing the number of nodes being scored. In this example, it checks the maximum network cost requirement between P1 and P2 in the AppGroup CR, and nodes whose cost would surpass this threshold are automatically filtered out. So in this case, in our topology, nodes N5 to N8 are filtered out and not considered for scoring. As for the scoring function, we calculate the accumulated shortest-path cost for each candidate node, based on the network weights in the NetworkTopology CR and all the workloads already scheduled in the cluster. We normalize the accumulated costs across all the nodes because we want to favor nodes with lower costs: those nodes are scored higher. As you can see in the example, the highest-scored nodes are N1 and N2, because based on our network topology these two nodes will produce the lowest latency in the end. Now I'll show you a live demo of how you could deploy all these components and deploy an application based on our solution. I have a Kubernetes cluster with 16 nodes. All the nodes belong to the same region, Belgium, but each node is in a different zone. 
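The normalization step described above can be sketched as follows: accumulate a network cost per candidate node, then map lower costs to higher scores on the 0-100 scale the kube-scheduler framework uses. This is an illustrative Python rendering of the behavior described in the talk, not the actual plugin code, and the non-zero costs are made-up values matching the example's topology.

```python
MAX_SCORE = 100  # kube-scheduler framework scores range from 0 to 100

def score_nodes(candidate_costs):
    """Given accumulated shortest-path costs per candidate node, return
    normalized scores where the cheapest node receives the highest score."""
    max_cost = max(candidate_costs.values())
    if max_cost == 0:
        # All candidates are equally cheap: give every node the top score.
        return {n: MAX_SCORE for n in candidate_costs}
    return {n: round(MAX_SCORE * (max_cost - c) / max_cost)
            for n, c in candidate_costs.items()}

# Example: N1/N2 share a zone with the dependency (cost 0), while N3/N4 sit
# one zone away (illustrative cost 10), so N1 and N2 score highest.
print(score_nodes({"N1": 0, "N2": 0, "N3": 10, "N4": 10}))
# → {'N1': 100, 'N2': 100, 'N3': 0, 'N4': 0}
```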
And I have emulated different network connections with varying delays using traffic control, so you can clearly see the differences in terms of latency. This morning I ran the additional component I mentioned, the NetPerf component, so I have all the measurements saved in a ConfigMap, and I will deploy our NetworkTopology controller, which will access these measurements and update the NetworkTopology CR accordingly. So let's go to the demo. First I will deploy the controller image. There are two images in the scheduler-plugins repository: the controller and the scheduler. You can use the additional plugins already available there, or you can develop your own algorithms in their framework, and that's essentially what we did. So I will deploy the controller. Let's hope the Wi-Fi is okay. I will check the logs to see if everything worked fine: we should see two additional controllers, the AppGroup one and the NetworkTopology one. Sorry, let me look more closely. Yes, there they are. Now I will show you the AppGroup. We have an application group for the Online Boutique application with all the dependencies and all the workloads that belong to Online Boutique, and I will deploy this CR to the cluster. Here you have all the workloads that belong to the application group, and our controller has also given an index to each workload, saying in which preferred order these workloads should be deployed in the system. Next I will show you the other CR, the NetworkTopology one. In this case I will not add any weights manually, because I ran the NetPerf component this morning, so I will just use those costs. I'll deploy it, and now if we check the logs you will see the costs added by the controller into the CR. So you can see here that I have bandwidth capacities. 
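For reference, the netperf measurements could be stored in a ConfigMap along these lines. The name, namespace, key format, and units here are all assumptions for illustration, not the component's actual schema.

```yaml
# Illustrative only: per-zone-pair latency measurements for the controller
# to pick up and translate into NetworkTopology costs.
apiVersion: v1
kind: ConfigMap
metadata:
  name: netperf-metrics      # assumed name
  namespace: kube-system     # assumed namespace
data:
  # one entry per origin/destination zone pair (values in ms, assumed)
  netperf.z1.z2: "5"
  netperf.z1.z3: "12"
  netperf.z2.z3: "8"
```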
I don't have any bandwidth allocated yet, because no pods are scheduled yet, and I have a cost list for each origin in the cluster. Now I just need to deploy the scheduler image: essentially I will deploy a second scheduler into my Kubernetes cluster that uses our plugins. That requires a different configuration for the scheduler, which enables our topological sort and our network overhead plugins, with the filtering and the scoring functions turned on. So I have the scheduler; I will also enlarge the font on this one and keep the logs here. The idea now is to deploy the Online Boutique application with the default kube-scheduler, use the Locust load tool available in Online Boutique to see the application's response time, and then deploy Online Boutique based on our framework, with this different version of the scheduler. So I have the scheduler here; I'll keep the logs running. Everything seems to be working. In fact, I'm deploying the version with our network-aware framework first, so I'll show you our network-aware performance first. Let me see if the pods are running so that I can run the Locust load tool. Some of them are still being created. Now I'll run the load generator and get its logs. Oh, it's not running yet. Here you see the performance we get with our network-aware framework: an average latency of around 400 milliseconds on some of the GET and POST requests. I'll keep it running until 100 requests; for the single GET request, for instance, we have a minimum response time of 200 milliseconds and a maximum of 600. Then I will deploy it with the default kube-scheduler so that you can see the differences in performance. 
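The scheduler configuration mentioned above, with a second scheduler profile that swaps in the topological-sort queue-sort plugin and enables the network-overhead plugin for filtering and scoring, could look roughly like this. The plugin names follow the ones used in the talk, but the API version, profile name, weight, and exact wiring are assumptions; check the repository for the real file.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
  - schedulerName: network-aware-scheduler   # assumed profile name
    plugins:
      queueSort:
        enabled:
          - name: TopologicalSort            # replaces the default queue sort
        disabled:
          - name: "*"
      filter:
        enabled:
          - name: NetworkOverhead            # filters nodes over maxNetworkCost
      score:
        enabled:
          - name: NetworkOverhead            # favors nodes with lower costs
            weight: 5                        # illustrative weight
```

Pods then opt in by setting `schedulerName: network-aware-scheduler` in their spec, leaving the default scheduler untouched for everything else.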
So I'll stop it here, and now I will delete my deployment with our network-aware framework, so that the cluster is back in the same state when I deploy with the default scheduler. All the pods and deployments are being deleted, and I will delete the load generator as well. Let me check if all the pods are gone: I still have the load generator, but it's terminating. Now I will deploy with the default scheduler. Let me see if all the pods are running before I start the load generator. I'll apply it, and now let's see the logs. It's running. Here you can see that when the default kube-scheduler deploys the Online Boutique application, we typically get higher latency values. With our network-aware framework we can reduce the expected latency in the cluster by at least 30 to 40 percent for most requests. I'll keep it running. Even for the minimum and average response times of the Online Boutique application, by specifying pod dependencies, workload dependencies, in the application group, and by having the network weights in the NetworkTopology CR, our framework can typically reduce the latency in the Kubernetes cluster. Now, to finish my presentation: what were the challenges and the lessons learned? Well, our plugins can significantly reduce the expected latency in Kubernetes clusters, especially if you properly define the application group based on the workloads of your application and establish the dependencies among them. Another aspect is that there are currently a lot of workload or pod grouping definitions. There is also the PodGroup concept, where you can deploy all the pods together, as in gang scheduling. 
We are proposing the AppGroup, and I believe that as a community we need to converge on a generic, uniform YAML-based file where we could have all of these definitions, so that we as developers can select the one we prefer to use. Another aspect I did not mention yet is that our plugins do not add significant overhead to the scheduling process. Based on the Go benchmark tool, scheduling a pod stays well below one second for a cluster with 10,000 nodes. We are not adding overhead because we access the information via custom resources. And my main message today is that we are looking for contributors and engagement from anyone in the community interested in network-aware scheduling. Currently we are focusing on reducing latency by specifying workload dependencies, but you may have different ideas and concepts that you would like to add to our current implementation, and we are open to it. We are actively involved in the Kubernetes SIG Scheduling community: we have a Kubernetes Enhancement Proposal (KEP) already accepted, and the initial PR has been submitted and is awaiting review. To finish my presentation, I would like to thank all the people who were involved in this work, mainly from IBM Research and Ghent University. I would also like to thank the Kubernetes SIG Scheduling community, because since the beginning, from our early KEP draft, they gave us a lot of valuable feedback that allowed us to improve our current implementation. So now I'm ready to answer any questions you may have; otherwise you can find me in one of the breaks, and I'm always happy to share insights with you. Thank you. Does anybody have a question? We have one online first, and then I'll come over there. We have one virtual question. 
So the first virtual question is: how would this network-aware scheduler react to network disruption? Would the application services lose data state while the workload is being rescheduled? Can you repeat the question? The last part, please. Would the application services lose data state while the workload is being rescheduled, if there is some network disruption? So, currently we have the additional NetPerf component, which you can run, say, several times a day so that you have updated costs in the ConfigMap, and our controller updates the costs in the CR so that we can make better scheduling decisions. Also, when pods are evicted or rescheduled, for all the workloads already deployed in the system, our framework will make sure to find the best nodes based on the current status of the infrastructure and the network costs that are available. Okay, he got his answer. Anybody else? Just raise your hand. Thanks, this is awesome. I'm curious how close you think this is to being production-ready? Currently we are waiting on the review of the PR we submitted. The main components I've shown you in my overview are running, so we have this version completely running, but we are waiting for it to be included in the SIG Scheduling repository, which we expect to happen in the next few months. Otherwise, you can access our current implementation via the fork I have of their repo; it's available in one of my branches, so you can test it out for yourself as well. Great, thanks. One more? So I was wondering, do you have any data or statistics on how this might affect resiliency? When you reduce latency, you tend to group things together, right? 
So another experiment we did was with the Redis cluster application, where you typically have several master instances and replicas constantly exchanging data. We ran the Redis benchmark tool against the deployment we had in this same 16-node cluster, and with our network-aware framework we were able to improve the throughput by 20% on average for that application. Hi, is there a possibility to extend this to dynamic network-aware scheduling? For instance, if there is a lot of congestion on the nodes: right now the costs are manually fed, so can we do that dynamically, on a real-time basis? One of the things we have planned is an additional plugin focused on monitoring the bandwidth, the incoming traffic on a given node, and based on that we will develop a plugin that schedules the workloads based on the bandwidth resource requests you set up in the deployment file and on the current usage of the node. There is a load-watcher component that currently does this for CPU and memory, and we are planning to extend it to bandwidth, so that we can consider the incoming traffic of a node and avoid deploying workloads on nodes that are congested, as you mentioned. Next question: regarding partition-tolerant applications, take your Online Boutique example: if you have ten pods of ten application types living together, and one of the pods is evicted or gets into a crash loop and is rescheduled somewhere else, what does the framework do? Can you shed some light on that? Based on the network benchmark, is it going to kill every pod and start rescheduling for network efficiency, or is it going to leave things as they are? 
I would say that as a cluster administrator you need to decide what you prefer to optimize in your cluster. Currently there are several plugins focused on resource efficiency, and we are proposing one more focused on network-aware information, which, based on workload dependencies and the network topology, tries to find placement schemes that reduce latency. We are currently focused on that, and even if pods are evicted, as you said, we believe our framework can find the best nodes for low latency based on the current status of the cluster. Okay, great session. Thank you, José. I think we are out of time, but we have a lunch break now, so if you want to gather here, that's fine. Thank you, folks. Thank you.