Welcome to this session. We are very happy to have you here. We are not sure why you picked this out of the many good presentations going on, but anyway, we are happy to see you. We have titled this presentation "Is there room for improving Kubernetes HPA?", and you may notice there is a question mark at the end. The reason for that question mark is that we don't know. We are not experts on HPA. The three of us have an academic background, so basically we have been thinking about HPA, tinkering with it, doing some experiments, and the goal here is hopefully to spark some discussion and have some conversation with you after the session.

A bit more about the three of us. At the end of the stage you have Bertha; she is a PhD student at Barcelona Tech. You have Gabor there; he is the CTO of L7MP.io, and he is also a professor at BME. And I am Alberto; I am a Tech Leader at Cisco, where I work in the Enterprise CTO team.

Before we go into the core of the presentation, we thought it could be interesting to describe where we are coming from, the work we have done in the past, and what motivated us to do this presentation, so that we are all on the same page about why we are here together today.

So let's start slow: auto-scaling. The idea is simple enough. You have some load that goes up, and some replicas that go up as well. The three of us, together with some people from our institutions, have been thinking about auto-scaling for some time. One of the things we were thinking about is: what does it really mean to auto-scale things? One implication is that you are consuming resources on your system. Typically people tend to think of resources in terms of CPU and memory, but one thing we realized some time ago is that another resource that is typically consumed, and sometimes goes unnoticed, is the network.

So we asked ourselves the question: is there any relationship between auto-scaling and the network? Of course there is some relationship. The idea is still simple: more replicas mean more traffic, and more traffic means more replicas. As requests come into your system, you have to scale up the microservice, and scaling the microservice means it may generate more traffic in turn. We did some work following that train of thought, and we eventually ended up with designs that take the number of replicas into account when deciding how to manage the network infrastructure. With that we were able to reduce the amount of guesswork we had to do on the network side and be more efficient in the use of resources. If you are interested in that work, Bertha wrote a very nice paper that we published at a SIGCOMM workshop. You have the reference there if you want the details, and you can come ask us.

For some time we were very happy with this and with the results we achieved. Then we kept thinking about auto-scaling and realized that we had barely scratched the surface, because typically you don't have a single service. You have something like this: the canonical Bookinfo example application from Istio that you have seen in half of the presentations around KubeCon. Basically, this is here to show that you typically have a microservice graph.
One service calls another service, which calls another service, and this can go on and on. So what are the implications of this from an auto-scaling perspective? The idea is the same: more traffic goes in, you scale up, that generates more traffic that goes to another service, which then needs to scale up, and so on. But the dynamics are a bit more complicated, because you now have multiple pieces that can interact with one another.

At that point we asked ourselves (we like to ask ourselves many questions, as you can see) another question: what really drives the auto-scaling dynamics of this microservice graph? As you can imagine, because it is in the title of the presentation, the answer is of course HPA. So this presentation is really about microservice graphs, HPA, and how the two relate to one another.

Before we go deep into that, just in case someone is here from the previous presentation or was just looking for an empty spot, one slide on HPA. What is HPA? There are basically three things that compose the way HPA works. First, you declare your workload with certain resource requests; in this example we show a request of 200m CPU. Second, you declare an HPA with certain resource thresholds; in this example, you want the number of replicas to stay between 1 and 10, and you are targeting an average CPU utilization of around 80%. The third thing that makes the magic happen is the HPA controller, which monitors the target metric and, depending on what it sees, brings the number of replicas up or down to try to stay within the defined thresholds. You will see that throughout the presentation we use CPU as the target metric. There are other metrics you can target as well, but we use CPU for simplicity and stick with it throughout.

So that's the intro. With that out of the way, we can focus on the cool stuff. As I said, this presentation is about microservice graphs. We are going to take a look at what happens when they autoscale, and why they behave the way they do when they autoscale. Then, finally, we are going to ask the question: can we do better? And we hope we can get some feedback from you on whether that is an option or not. With that, I'm going to pass it to Bertha, who is going to talk about what the heck is going on with microservice graphs.

Thanks, Alberto. Now that we have the background on HPA out of the way, as he said, we are going to see what happens with a whole application of different connected microservices. We will keep using the Bookinfo application as an example, so everybody knows what we are talking about. A microservice application, in the end, is just a set of multiple services connected back to back. An important thing to notice is that the output of one service is the input of the next one, and that is the case for all the services. If we take a look at the product page service, its output is the input to the reviews service, and the output of the reviews service is in turn the input to the ratings service. And although these services are connected, each of them has its own HPA and scales independently. All these resources are managed by their own HPAs.
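To make the earlier slide concrete, here is a minimal sketch of the pair of manifests it describes, using the product page service as the example (each service in the chain would get an analogous pair): a Deployment whose container requests 200m CPU, and an HPA that keeps replicas between 1 and 10 while targeting 80% average CPU utilization. Object names, labels, and the image are illustrative assumptions, not taken from the talk.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: productpage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: productpage
  template:
    metadata:
      labels:
        app: productpage
    spec:
      containers:
      - name: productpage
        image: istio/examples-bookinfo-productpage-v1:1.18.0  # illustrative image/tag
        resources:
          requests:
            cpu: 200m   # HPA computes utilization relative to this request
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: productpage
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: productpage
  minReplicas: 1          # keep between 1 and 10 replicas
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # target ~80% average CPU utilization
```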
We are going to see what happens when there is a sudden increase of load at the entry point of this microservice application, so the first service suddenly gets an increase in requests. We will do so with a thought experiment that we will walk through together, step by step.

Imagine that we have this three-service chain where, for starters, each microservice has just one pod or replica, which is the gray square. Each microservice is able to process one green square of requests. So if it gets just one green square of requests, everything is fine: the traffic flows and there are no losses. We are going to check what happens when an increase in requests that we were not expecting, which is the yellow square, arrives at the input of the first service. And we are going to track three different metrics for each of the microservices. The first one, which is kind of obvious, is the request rate, to see how it goes up. The second one is the number of replicas each microservice has deployed, which is also the number of pods, and which is driven by the HPA. And finally, we are going to check the number of failed requests that each microservice has at each point in time.

So let's see what happens when the yellow square, the increase in requests, hits the input of the first microservice in the chain. You can see on the first plot in the first row that we have this increasing request rate, which in turn increases the CPU utilization, because the microservice is struggling to deal with the added load. After some time, when the CPU utilization is high enough, it crosses the HPA threshold, and HPA notices that it needs more resources, so it fires an autoscaling event to start deploying a new replica. What is important to see here is the yellow highlight, which marks the time between the moment the request rate starts going up and the moment the new replica is actually ready to serve requests. During that window, if we have a container with very tight limits and requests, we can see failed requests going up, since we don't have enough CPU power to deal with all the load we are getting.

After a while, the new replica is deployed. The first service now has two replicas, so it is able to process the two squares of load, which are now both green. At the same time, the number of failed requests goes down, because everything goes back to normal and stabilizes. But now that amount of load is passed on and cascades to the second microservice, which faces a very similar situation to the first one: it gets this increase in load, its CPU goes up, and it triggers an HPA autoscaling event again. After some time the replica gets deployed and, you guessed it, there is a window in between where the failed requests go up and then, once things stabilize, go down again. And nothing is very different for the third one: it gets an increase in requests, the CPU goes up, HPA triggers an autoscaling event, and some requests fail while we are waiting for the new replica to be ready.
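For reference, the replica arithmetic behind each of these scale-ups is the rule the stock HPA controller evaluates every control period (this is the documented Kubernetes algorithm; the 160% figure in the worked example is our own illustrative stand-in for the doubled load):

```latex
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
      \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
% Worked example: one replica observed at 160% average CPU utilization
% against an 80% target gives ceil(1 x 160/80) = 2 replicas.
```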
After a while, all three microservices have their own replicas ready to go and serve these requests, so the system goes back to its initial state, able to process everything it gets. But we have had these three critical time windows, highlighted in yellow, which show that we had some problems along the way.

This was nice as a thought experiment, but we thought: come on, we really need to check on a real deployment whether this is actually what happens. So we went and implemented the same three-microservice chain to see if what we intuited would happen really did happen. We connected three services one after the other and checked the same three metrics: the load, or average CPU utilization; the number of replicas for each microservice; and finally the number of failed requests.

If we look at the first plot, you see that the green line, which corresponds to the first microservice in the chain, is the first one to get the increase in requests. Oh, I forgot to mention: for this test, we have a load that ramps up during the first minute and then stays stable until the end of the 15-minute test (there is a sketch of this profile below). So what you see on the green line is the increase in load during the first minute and then the stabilization. What you also see is the spacing between the three lines: that spacing is the time the traffic takes to go from the first service, once the first service has scaled up, to the second service, plus the time that one takes to scale up, and then on to the third. This is also what we see in the number of replicas per microservice: the same shift in time from the first one to the second one to the third one.

If we look at the green line for the first microservice again, there is a point where it has two replicas deployed while the second and third ones, the yellow and blue lines, still have only one replica. So the first one can process double the amount of traffic that the second and third ones can, and that is what brings us to the third plot. In the third plot we see the HTTP error rate for the second and third microservices. For example, the green line shows the error rate of the reviews service as reported by its parent microservice, the one that comes before it in the chain; since those requests are errors, the failing service cannot report them itself. Similarly, the yellow line shows the errors that the ratings service has had, because it has not been able to scale as fast as it would need to.
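The talk does not say which load generator we used for this test; purely as an illustration, the load profile just described (a one-minute ramp, then steady load until the end of the 15-minute run) could be expressed with a YAML-driven tool such as Artillery. The target URL and request rates here are hypothetical:

```yaml
config:
  target: "http://productpage:9080"   # hypothetical entry point of the chain
  phases:
    - duration: 60        # ramp up during the first minute
      arrivalRate: 1
      rampTo: 100         # hypothetical steady request rate
    - duration: 840       # hold steady for the remaining 14 minutes
      arrivalRate: 100
scenarios:
  - flow:
      - get:
          url: "/productpage"
```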
With that, we can take a step back and look at the big picture again: every time you have an autoscaling event, there is a delay between the moment the HPA trigger fires and the moment the new replica is actually ready to serve requests. During that delay there is a critical window with potential request failures, and if you look at the whole application graph and add up all these traffic losses, you end up with this green line here, which is the total failed request rate for the application. So we asked ourselves: how can we improve this line so that we don't have traffic failures? It is a tough balance between having traffic loss because you defined your resources with very small margins, and over-provisioning: you can set very low thresholds, but then you are wasting resources, and in turn money. Now Gabor will go on to explain why this is actually happening.

Thank you, Bertha. Let's stop for a moment and try to understand what is happening in this situation. Think about the Bookinfo application. It consists of three microservices: the product page, the reviews, and the ratings microservices. The product page has enough resources to generate a landing page for a user request. It then makes a call to the reviews service, so that the reviews service supplies the user reviews for the book to be included in the user response. And the reviews service in turn makes a call to the ratings service to generate the ratings for the book, which we also include in the response. Each of the three microservices must have enough resources provisioned for us to be able to return a meaningful and useful response to the user. If any of them lacks resources, there is a failed user request, or the user request is delayed beyond timeout, and this leads to a bad user experience.

So ideally, for such microservice applications, we want each of the services to start scaling at exactly the same time, so that at every instant each of these three microservices has enough resources to serve the elevated user request level. But this is not what we are seeing here. Instead, each of the microservices in this little service chain starts scaling sequentially, one after the other. So instead of scaling at the same time, we see a delayed response and delayed signal propagation along the service chain. The deeper the chain, the more services chained one after the other, the higher the traffic loss and the worse the under-provisioning transient, during which some of the microservices in the chain do not run with enough resources to serve the elevated request level.

If we look at the system, we see that there is a deep architectural reason behind this phenomenon: the whole microservice graph is not driven by a single HPA control loop. Instead, we have a separate, independent HPA control loop provisioned for each of the microservices, and these three control loops work in complete isolation from one another. Each of them works on purely local information, so it takes time for the news to get from the product page to the reviews service that scaling is in progress, and more time again for the news to reach the ratings service.

So what would a normal, sane person do in this situation? They would just go out and implement something to remedy it. We are not normal people. We are academics. We are scientists; it is in our souls.
Scientists work in strange and unexpected ways, so the first thing they do in such a situation is create an analytical model. This is where we feel at home: wherever there are formulas, wherever there is mathematics involved, we feel familiar. Once we have a working model, we believe we can convince ourselves that we have a deep understanding of what is going on.

There is a beautiful branch of engineering science, control theory, which is dedicated exactly to understanding such control loops. We are interested in the dynamics of this system, in how it behaves in time, and control theory is the right theory for answering such questions. To cast this system in the framework of control theory, we need two analytical models, two dynamic models: one for the application, describing its response in time to elevated request levels, and one for the controller.

The application, the thing we are actually controlling, is called the plant in this fancy control-theoretic terminology, and we are going to use that terminology here. So let's look at the plant. The plant has two inputs. One is the instantaneous HTTP request level, q_k, imposed on the system at time step k. Imagine there is a hypothetical function, A, that converts this q_k signal into an actual CPU consumption at time step k. That amount, divided by the amount of CPU provisioned by Kubernetes and by HPA to this microservice, gives the average CPU utilization of the microservice at time step k+1. We take this output and plug it into the controller. The other component here is the control law; it is basically just a reformulation of the control law that is programmed deep inside the HPA controller. These two systems, plugged together in a feedback loop, give us the required dynamics.

On the surface, this is a perfectly well-behaved discrete-time dynamic control system, so that's good. The bad news is that it is highly nonlinear: if you look at the plant and at the control law, both are nonlinear functions of their inputs. This is bad. If you are familiar with control theory, as long as the system is linear you are on safe ground, but as soon as it gets nonlinear, you are basically out there on your own. Still, we have been able to show that the system is stable. All control theorists worry about stability; stability here, in the sense of global asymptotic stability, means that this control loop alone will reach the required reference CPU level within some finite number of time steps, irrespective of the level of user input and irrespective of the initial state the system starts from. That's the good news, and it is thanks to Bertha. The bad news, however, is that when we plug these control loops into a microservice system, so that the output of the first system is the input of the second, things can really go wrong. At the moment we don't have such a stability proof for this composed system, and we know about some very pathological examples.
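As an aside, the plant and control law just described can be written down concretely. This is our own notation, not fixed by the talk: r_k is the replica count, c the per-replica CPU request, and u^* the target utilization.

```latex
% Plant: the request level q_k drives the CPU utilization one time step later.
% A(q_k) is the hypothetical function mapping requests to CPU consumption;
% c \cdot r_k is the total CPU provisioned across the r_k replicas.
u_{k+1} = \frac{A(q_k)}{c \, r_k}

% Controller: the HPA control law closes the feedback loop.
r_{k+1} = \left\lceil r_k \cdot \frac{u_{k+1}}{u^{*}} \right\rceil

% Both stages are nonlinear: A may be nonlinear, u_{k+1} divides by r_k,
% and the ceiling makes the control law nonsmooth.
```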
One pathological example is where the microservice graph has positive feedback: a directed cycle of microservices calling one another in a loop. In such situations the system can oscillate, or it can even exhibit runaway behavior. What we mean by that is microservice A making a call to microservice B, then B calling C, and then, at a certain point, still within the context of servicing the same user request, microservice C calling back to A. At the moment, we posit that such a thing can only happen by mistake. So if you have ever seen such a situation in your own practice, A calling B, B calling C, C calling back to A, in any of your production workloads, then please come and talk to us, because we are really curious.

The other good thing about analytical models is that you can use them for simulation, and this is what we did. We can get insight into the running system without actually having to deploy anything in production. So we plugged the system into MATLAB to see whether the plots we get are similar to the ones we have seen before, and we can now confirm that the system behaves as expected: on the left you see the delayed CPU response of the different microservices, and on the right you can see that the HTTP failure rate is very similar to the one we saw earlier.

So, what next? First, let's try to understand where we are at this point. The first thing to know: rest assured, there is a very good chance that if you switch HPA on for your workload, it will behave as expected. We have seen that we have to do very nasty things for HPA to go wrong. Perhaps your microservice graph is not that deep. Perhaps not all your microservices are constrained on the same resource, or constrained at all; for instance, perhaps you have databases deep down in the service graph that need no scaling at all. In such situations, HPA probably does what you want it to do, because the control law built into HPA is extremely robust. Its simplicity is both its elegance and its biggest drawback, because sometimes you see weird artifacts when these independent control loops act together in weird and unexpected ways, and to be able to avoid such artifacts you need a very good understanding of what is going on.

Can we do something about this? Yes, of course. We can implement and deploy something that makes the scaling process faster. One idea is to drive the HPA control loop, which at the moment is completely local and has no microservice-graph-level insight, from a more clever, compound signal. For instance, we can feed the CPU consumption of the earlier microservices, the parent and grandparent microservices, forward into our control loop, and drive HPA from a compound, aggregated resource usage metric. This might have a chance to speed up the process, and the good news is that when we plug this idea into our analytical model, we see a significant speed-up. What we see here is the CPU consumption of the last microservice, the ratings microservice, without our modifications, and the purple line is with our modifications. You can see a significant speed-up; in fact, in this very example, in the analytical model, it completely removes the scaling delay and makes auto-scaling instantaneous across the entire service graph.
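We do not spell out in the talk how such a compound signal would be wired up in practice. As one hypothetical sketch, a KEDA ScaledObject could drive the deepest service from a Prometheus query that aggregates CPU usage across the parent and grandparent services as well as the service itself. The service names, the query, and the threshold below are all illustrative assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ratings-compound
spec:
  scaleTargetRef:
    name: ratings            # scale the deepest service in the chain
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      # Hypothetical compound signal: aggregate CPU usage across the whole
      # chain (grandparent, parent, and the service itself), so the deepest
      # service sees the load spike as soon as it hits the chain entry point.
      query: |
        sum(rate(container_cpu_usage_seconds_total{pod=~"(productpage|reviews|ratings)-.*"}[1m]))
      threshold: "0.48"      # illustrative: 80% of 3 x 200m CPU
```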
Of course, we could also implement something that has a holistic understanding of the whole microservice graph and produces some optimal driving signal for the HPA control loop. At the moment we don't know how to do that, but it is also a distinct possibility.

So what's next for us? We will of course keep playing with these ideas, come up with different models and compound scaling signals, play with them, and try to implement at least one or two of the ideas. It seems that KEDA would be a perfect platform to do that, but we have also experimented with the SIG Autoscaling Balancer resource, which likewise looks like a perfect tool to try these ideas out. Balancer is also something that uses input from multiple different services, different deployments, to drive the HPA control loop; our idea is very similar to Balancer, it is just the way we produce the compound signal that differs. And of course we want to go back to the original network-based scaling idea, and we want more theoretical tools. But before doing anything on this front, we need your input and your comments, because we need to know the kinds of situations you see in production. So if you have an example where HPA behaved as you expected, or actually misbehaved, please come talk to us, because we need to know. And of course, don't forget that we are hiring for PhD positions. Thank you, and we are open for questions.