Hello. I'm Tony Allen. I'm going to talk about load shedding in Envoy. I'm a software engineer at MechaCorp, and you can find me at those social media platforms. This talk is going to take the form of two halves. The first half doesn't have all that much to do with Envoy specifically, but I'm going to cover some concepts I think are important, and that sets the stage for talking about all the different load shedding mechanisms in Envoy. There are going to be a bunch of graphs, and I'm hoping this changes the way people think about load shedding.

So let's start with: what is load? When I refer to load in this talk, I'm talking about layer 7 requests that are proxied through Envoy and that are using resources on some application the Envoy is proxying for. The resource on the application could be time on the CPU cores, memory, the often overlooked network bandwidth, and disk bandwidth. As the load increases on a server, these things are contended, and there's not enough to go around for all of the active requests on the server.

I'm going to talk a little bit about Little's Law here, which I'm sure folks have seen before; it was brought forth in 1961 in the journal Operations Research. It says the expected number of units in a system is equal to the expected time spent by a unit in that system times the rate of units entering that system. We can interpret this as: average concurrency equals average rate times average latency. It seems obvious to a lot of people, but there's some nuance here. It's independent of the arrival distribution of the requests; as long as the average is what you're plugging into the formula, you can have giant tails or every value can be exactly the same, and it doesn't really matter. And the concurrency in the system, and by "system" I mean an application here, can take on any form: it can be a queue, or it can be a bunch of in-flight requests that we're interleaving between and juggling until we tip over. Another interesting thing to keep in mind is that the time spent queued in the system is what contributes to the latency observed by whoever sent that unit into the system. This explains why our latencies go up as concurrent work increases: we're either interleaving between requests or waiting in line for service.

So if we apply this to a server, a request comes in, a response goes out, and we can explain what happens in between. Some servers have an executor that's pulling from some kind of work queue: you have some worker threads pulling from a queue, or in Go, a channel or something, and there's a fixed service time. I process a request, it takes me some unit of time, and it always takes that long. The latency observed by a client is the amount of time the request spends in the queue plus the actual service time, and that service time doesn't change with the size of the queue. In other cases, we just have a server that takes a request, spins off a goroutine while it does its thing, and keeps doing that. We end up in a scenario where we're interleaving between all of the requests, so the service time for a request is some function of the number of things we're juggling, the number of active requests we have in the system.
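Written out, that Little's Law relationship is just

$$L = \lambda W \quad\Longleftrightarrow\quad \text{avg. concurrency} = \text{avg. arrival rate} \times \text{avg. latency}$$

and it holds for both of these server models.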
And in this scenario, the service time is going to increase as you get more requests into the system. Since the latency is the queue delay plus the service time, we're directly increasing the amount of time it takes to work on each request as we put more requests in. The increased latency then means more concurrent requests, because we're slowing down, and then the service time increases again, and you get the idea. We end up in a feedback loop.

There was a 2011 paper where John Little revisits Little's Law in a computational context, and he graphs some of these things: active requests and latency. On the left side, we have requests per second on the x-axis and requests in process on the y-axis; in Envoy terms, that would be our upstream active requests or something. On the right side, we have requests per second on the x-axis and latency on the y-axis. You'll notice they both have this linear relationship in the beginning, and then things go real bad at around the same time. This is that feedback loop I was talking about earlier; it's also sometimes referred to as congestive collapse.

To avoid this kind of scenario, given that Little's Law relationship, there are a few things we can do if what we care about is fixing the latency. If we want to limit concurrency, we think of it as bounding queues, putting an upper bound on how much can be in there. In Envoy, we sort of do this with circuit breakers or adaptive concurrency, which we'll talk about later, and in an application you can do this with worker threads pulling from a bounded work queue, or queue timeouts in some cases. With rate, we can adjust things by just sending fewer requests, which is rate limiting; we can do that in Envoy or in the application. And then what some people don't think about is latency: we do have control over the request latencies and the service times. We do this with timeouts, so we place an upper bound on the amount of time spent servicing a request, and assuming we have some kind of cancellation mechanism in the application, we can also make use of deadline propagation.

So I'm going to do some simulations, because I think it's nice to see some of this stuff in action and to understand how it works. I'm using this bufferbloater project on my GitHub, which has 19 stars. It's a very popular open source project with a thriving ecosystem and community, so feel free to send a pull request. The way it works is you spin up this Go program and it kicks off a configurable number of client goroutines and a bunch of server goroutines. The clients proxy requests through some external Envoy process, and they're sent to one of the server goroutines. The actual server goroutine can juggle any number of requests, and it does the interleaving business: it'll randomly pick some active request, and each of these requests has some service time associated with it. We'll decrement 100 microseconds of remaining work and sleep for 100 microseconds. If we've completed that request, meaning we've decremented it to zero, then hey, send the reply, we're done. Otherwise, keep doing this and put it back in the pool. That loop looks roughly like the sketch below, and the output of the program looks like this.
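Here's a minimal sketch of that server loop under the behavior just described (100-microsecond work slices, random interleaving). The request type and serve function are made up for illustration; this is not the actual bufferbloater code.

```go
package main

import (
	"math/rand"
	"time"
)

// request is one in-flight request with some amount of service time left.
type request struct {
	remaining time.Duration // work left to do for this request
	done      chan struct{} // closed when the request finishes (the "reply")
}

// serve juggles all active requests, doing a 100µs slice of work on a
// randomly chosen request each iteration, so per-request service time grows
// with the number of active requests.
func serve(active []*request) {
	for len(active) > 0 {
		i := rand.Intn(len(active))
		req := active[i]

		req.remaining -= 100 * time.Microsecond
		time.Sleep(100 * time.Microsecond)

		if req.remaining <= 0 {
			// Done: send the reply and drop it from the pool.
			close(req.done)
			active = append(active[:i], active[i+1:]...)
		}
		// Otherwise it stays in the pool and we keep juggling.
	}
}

func main() {
	// Two requests with 3ms of work each, just to exercise the loop.
	reqs := []*request{
		{remaining: 3 * time.Millisecond, done: make(chan struct{})},
		{remaining: 3 * time.Millisecond, done: make(chan struct{})},
	}
	serve(reqs)
}
```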
So at the top we have the offered load, which is the RPS the clients are sending to the servers. We have the observed latency there in the middle; you see the banding because you can configure the generator to have a distribution of latencies, and each one of those points is an observed latency. Goodput is at the bottom. Goodput is the amount of useful work a system is doing; it's a portmanteau of "good" and "throughput." So if I send 10 requests and reply 200 to all of them, I have a goodput of 10 for whatever unit of time that was. If I time out half of them, I have a goodput of five requests per unit of time. And I should mention, I made this program before Nighthawk existed, and then I figured it was easy to reuse, so don't actually use it for anything substantial.

The active requests on the server are also plotted, and you can see that where there's a load spike in the offered load graph at the top, the latencies go up. The goodput slightly increases too, and the number of active requests spikes up. So you're seeing that the concurrency, as it increases, is affecting the latency, but we're getting more productive throughput for it. This is your classic latency versus throughput tradeoff. In case you didn't see it, there they are.

So I'm going to go through all the different load shedding mechanisms. First we'll talk about no load shedding and the scenarios I'll bring up, then rate limiting, circuit breaking, and then the fancy mechanisms towards the end. Are we allowed to just take questions in the middle before we go on? I want to make sure we're all on the same page as far as queuing goes. Okay, cool. We're all experts now.

So here's a scenario. I have some reasonable amount of load, and then there's a spike in traffic that increases it by, I don't know, 50 or 60 percent for a short period of time, and then things go back to normal. We can see this at the top with the offered load, and you'll see how it affects latency. Latency sits at whatever the service time was, and then it climbs until it hits three; the actual units don't really matter, we're just looking for the relationship. So things start timing out, and you can take a look at the goodput there at the bottom. Those vertical bars mark the region where the offered load was higher than the normal steady state. You'll see the goodput is kind of fine, hobbling along, until it eventually crashes to zero. We're timing out all of our requests because the server is spending all of its time juggling active requests, the majority of which have already timed out, so it's not doing any useful work, but we keep sending it more traffic anyway. This is your classic overload scenario, and we know the simulation works because we see this in real life all the time.

So let's use the local rate limit filter. In Envoy, we have local rate limiting and global rate limiting. Global rate limiting I'm not going to talk about, because it needs an external service and the interaction is kind of complex. So we'll just look at the local rate limit filter: local decisions with a token bucket. The way a token bucket works is there's some refill interval, and on each refill we put some number of tokens back in the bucket, roughly like the sketch below.
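A tiny token-bucket sketch of what that filter is doing conceptually; the tokenBucket type and its methods here are mine, illustrative only, not Envoy's implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenBucket is a conceptual stand-in for the local rate limit filter's
// bucket: refilled on a fixed interval, drained by one token per request.
type tokenBucket struct {
	mu        sync.Mutex
	tokens    int
	maxTokens int
}

// refill runs every fill interval and tops the bucket back up, capped at max.
func (b *tokenBucket) refill(tokensPerFill int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tokens += tokensPerFill
	if b.tokens > b.maxTokens {
		b.tokens = b.maxTokens
	}
}

// tryTake is called once per request; when it returns false, the request is
// rejected locally instead of being forwarded upstream.
func (b *tokenBucket) tryTake() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

func main() {
	b := &tokenBucket{tokens: 10, maxTokens: 10}
	for i := 0; i < 12; i++ {
		fmt.Println("request", i, "admitted:", b.tryTake())
	}
}
```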
And then whenever something wants to make it through the filter, it grabs a token. So if I put 10 tokens in the bucket, only 10 requests are going to get through until I refill again. I see a lot of people in places I've worked who want to protect a microservice or something by rate limiting, and it tends to work out. You see this thrashing, this up-and-down motion, with the latency in the middle there; that's just because we keep refilling with tokens, so we get a burst of requests, then we stop sending requests, then another burst, and so on. A queue forms, the queue burns down, a queue forms, the queue burns down. We observe that, but it didn't tip over. You'll notice the goodput is good; it's about 2,500, which is what I happened to configure the rate limit at. There it is, in case you missed it.

But what this doesn't really protect us against is server degradation. In that previous scenario we had a server that takes some number of milliseconds to service a request; there's not really a distribution, it's just a synthetic three milliseconds per request. And we offered more load, which means we increased the rate term in that Little's Law relationship. But if a server degrades, what that looks like is the service time per request increasing, even under ideal conditions. So if we bump it from three milliseconds to 10 milliseconds, what ends up happening? Oh no, this is trouble. We're not doing anything useful; we don't shed. Goodput drops down to about half, and then we go to zero. A queue formed somewhere, our latency increased, our goodput is lower. That makes sense, because under ideal conditions we're now taking more time per request, but then we had our congestive collapse anyway. It's because the rate limit config is derived from some kind of latency snapshot. We normally just look at some service and go, all right, things look good, it's at 10,000 RPS, let's set a limit of 10,500. But then some intern pushes code and performance degrades, or years and years of baggage accumulate and we never tweak our configs, or we move to a new instance type, and we end up in trouble. And we don't want to keep tweaking our rate limit configs.

So what we could do instead is put a bound on the concurrency. We could say: if I'm getting these high latencies because some queue is forming somewhere, let's just put a limit on how big that queue can get. Obviously this only works if we actually have control over that queue. If I have a whole fleet of Envoys and one server, and all requests go through that whole fleet of Envoys to that server, then any single Envoy has essentially zero control; it's not a closed system, we can't apply any of this hand-wavy queuing theory, and circuit breakers aren't going to be much help. But that's not the scenario we're talking about here.

So let's see how it performs. The traffic spike occurs, and our latencies look much better. The goodput increases to about 2,500. What's interesting here is that it lands at about 2,500 just like the rate limit did, and that's because of how I set things up: we have eight worker threads on the fake server, and each request takes about three milliseconds.
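Little's Law gives that ceiling directly:

$$\lambda_{\max} = \frac{L}{W} = \frac{8\ \text{concurrent requests}}{0.003\ \text{s per request}} \approx 2{,}666\ \text{requests per second}$$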
So if you just divide 8 by 0.003, you end up right around that 2,500. It's pretty cool: all I did was bound the queue, and I got the behavior I wanted. Oh, okay, there's the math, boom. We divide 8 by 0.003 and get 2,666 RPS; the error bars on that are good enough for me, especially since all we're doing is saying, hey, the queue can't get any bigger than this, don't admit anything else until the server finishes what's on its plate. So we nailed it. Use circuit breakers, not rate limiting, unless... well, we'll get to that.

Let's look at the server degradation scenario. The request latency does go up a little bit, which makes sense because the server degraded; under ideal conditions the latency would have increased anyway. But our goodput did not drop to zero this time. It dropped down to some other value, because we put a bound on the queue; it doesn't matter what the capabilities of that server are. So this is pretty sweet. I get excited easily.

So let's compare the two: the local rate limit filter versus circuit breakers, looking at the goodput graphs at the bottom. We see that the local rate limit filter had its congestive collapse somewhere in the middle of that degradation region, and the circuit breaker didn't; it's just another day in the life for it. You'll notice the goodput dropped to zero at the very end of that simulation; I think that's just how long the run was, and I didn't plot anything fancy there. There's only one event in these simulations.

Okay, so rate and bandwidth limits versus concurrency control. If you're trying to be resilient, like you want to protect a microservice from tipping over under load, just use circuit breaking: limit concurrency somehow, whether that's in the application or via Envoy. But you need rate limiting if you're, say, charging someone for an API: someone pays you for some plan, this is what you get, and why would I give you more, even if I have the capacity to service those extra requests? So that all depends, but you would still want circuit breaking underneath to actually protect the backend server. In the rate limiting cases, the precious resource is the number of requests, not server latency.

Circuit breakers have their own shortcomings. I knew how many worker threads that fake server had, and it was operating under ideal conditions. A lot of the time a server can juggle more requests than it has worker threads and still give you good performance, so the right limit isn't obvious. And circuit breakers are very unforgiving about bursts: this is the limit, and there's no allowance for temporarily going over and then burning that queue back down. So they are notoriously difficult to configure properly. I don't have a citation for that; it's just anecdote from my experience. People have to configure Envoy, they kind of guess at a circuit breaker value, and you don't check whether the seat belts work until you get into a crash, and then you find out, whoops, didn't configure that well at all. So what if I told you there was a better way?
What if I told you the adaptive concurrency filter can, well, it's not perfect, but what it does is adjust the concurrency limit in real time based on latency measurements. I didn't really get into the wizardry in the adaptive concurrency talk last year; I didn't go too in-depth about how this thing works, I just showed you what it did, and I hand-waved away all the complicated hellscape of YAML you need to actually get it to work. We're going to go all in on that now, so put on your robe and wizard hat.

It's going to take a measurement of ideal latencies as a baseline. So imagine this: I'm an Envoy, I'm fronting some instance, requests are coming in, and I'm letting them through, but I've picked some low bound for the concurrency. I decide this thing can handle, say, five active requests, I only let five active requests through, and I see how it does; I 503 anything in excess of that. I'm just getting some kind of baseline latency. Then, once I'm satisfied with that, I measure whatever the P95s end up being, I open the floodgates and let everything through, and I sample all the latencies that are coming in. We divide that up into periods. So I now have an ideal RTT, where RTT is just shorthand for round-trip time: I sent a request, I got the reply back. And I have sampled round-trip times, and I look at the difference between the two. I have this thing called a gradient, which is the ideal round-trip time divided by the sampled one. That has the convenient property that if the sampled latencies start to climb higher than what we'd actually want them to be, the gradient gets smaller, and if the sampled latency takes on the ideal value, the gradient is about one, so multiplying by it doesn't change anything, because you're doing something right.

This is based on latency-based TCP congestion control algorithms. Normally, in school, you learn: how does TCP work? I dropped a packet, so I cut my congestion window in half, and then I additively increase it again. In the latency-based case, I've measured how long my handshake took, and I go, yeah, this link, that's how fast it should be, and then I start tweaking things based on whether there's extra latency there, like if I'm sitting in a NIC buffer somewhere waiting to be routed.

So I told you how great the gradient is: if the ideal is less than the sampled, a queue has formed, because our latencies are climbing for some reason. If the ideal is about what the sampled is, we did it right. What I failed to mention, and I go into great detail in another talk, is that we periodically re-measure the ideal, because we migrate VMs, or who knows, AWS has a bad day. So let's look at this gradient: how are we calculating the concurrency limit? We adjust the concurrency limit by just multiplying it by the gradient, and this works out. And then, what about headroom? I want to burst. Yeah, no problem, just add on some headroom value and make it the square root of the new limit.
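Put together, the update described above looks roughly like this; an illustrative sketch, not Envoy's exact implementation (the function name and signature are mine).

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// newConcurrencyLimit recomputes the limit from the measured ideal RTT and
// the RTTs sampled over the last update interval.
func newConcurrencyLimit(currentLimit float64, idealRTT, sampledRTT time.Duration) float64 {
	// Gradient drops below 1.0 when sampled latency climbs above the ideal
	// (a queue is forming somewhere) and sits near 1.0 when things are healthy.
	gradient := float64(idealRTT) / float64(sampledRTT)

	limit := currentLimit * gradient

	// Headroom lets the filter keep probing upward instead of getting stuck
	// at whatever limit it last computed.
	return limit + math.Sqrt(limit)
}

func main() {
	// Healthy: sampled latency matches the ideal, so the limit creeps upward.
	fmt.Println(newConcurrencyLimit(100, 3*time.Millisecond, 3*time.Millisecond))
	// Queueing: sampled latency is double the ideal, so the limit is cut hard.
	fmt.Println(newConcurrencyLimit(100, 3*time.Millisecond, 6*time.Millisecond))
}
```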
This is nice, because now I just keep probing up: I keep letting more stuff in until something goes wrong, and then I clamp down aggressively on it. So this is a simulation of that; it isn't real, but it's the behavior we'd expect. This orange ideal concurrency, look at it, it degraded and then it went back to normal, and the concurrency limit in blue just follows the ideal, bobbing up and down. We probe, we get latency, we cut it down, we probe again, we get latency, we cut it down.

So this is an adaptive concurrency configuration. Okay, this is awful. This is one of the downsides of using this thing, but it's really nice once you get it dialed in. The sample aggregate percentile is the percentile I use when I'm summarizing samples like the RTTs; I'm not taking an average. So if you care about tail latencies, you'd set this to a P95 or something, and that's what you're tracking and keeping under control; if you don't really care, set it to the median, whatever. The concurrency update interval is how often we're updating the concurrency limit: I'm taking samples, and in this case, after 0.25 seconds I do something with those samples, update the concurrency limit, clear the samples, and start again. There's a buffer value, which says, if I exceed my ideal round-trip time by 0.1 percent or something, who cares; you give it some wiggle room with this buffer before you clamp down aggressively. And the request count is just how many requests I take in before I say, okay, this is the ideal round-trip time now. If you have a super high RPS service, who cares what this is, because you'll make your measurement very quickly. The actual math behind how this works is well documented in the Envoy docs. Oh, sorry, and then there's a min concurrency. This is basically what I clamp to while I'm measuring my ideal latency. In this case it's 10: you can handle 10, right? Of course you can. If you can't, whether adaptive concurrency works is beside the point; we have other problems. That's usually how I go about figuring this out. (These knobs are summarized in the sketch below.)

So let's see how it does with the traffic spike scenario. You'll notice there's this huge spike, but it's not really that bad. The ideal latency is three milliseconds, 0.003, and this shoots up to 0.125 briefly; some requests did that, and then it drops back down to some value I don't even know. But you'll notice the goodput didn't even sweat; it didn't notice anything. What the filter did was go, oh no, I let too many things in, then clamp down, let no more requests through, and then do the bobbing motion. There it is, in case you missed it. Server degradation scenario: same deal, pretty much the same behavior with the latency. Goodput drops down to about where it was when we had the perfectly configured circuit breaker. So, all right, it works.

Now, when does this not work? Well, here are some drawbacks. It's spectacularly complicated; you're dialing this in like it's a PID controller.
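Here's a recap of those knobs as a hypothetical Go struct. The field names and example values are mine; this is not Envoy's actual config schema (see the adaptive concurrency filter docs for the real fields).

```go
package main

import "time"

// adaptiveConcurrencyKnobs is a mnemonic summary of the gradient controller
// settings discussed above, not the real filter configuration.
type adaptiveConcurrencyKnobs struct {
	SampleAggregatePercentile float64       // e.g. 0.95: summarize sampled RTTs at p95 rather than averaging
	ConcurrencyUpdateInterval time.Duration // how often the limit is recomputed from the samples
	Buffer                    float64       // wiggle room above the ideal RTT before clamping down
	MinRTTRequestCount        int           // how many requests to sample when re-measuring the ideal RTT
	MinConcurrency            int           // the clamp used while measuring the ideal RTT
}

func main() {
	// Example values, loosely following the talk's description.
	_ = adaptiveConcurrencyKnobs{
		SampleAggregatePercentile: 0.95,
		ConcurrencyUpdateInterval: 250 * time.Millisecond,
		Buffer:                    0.1,
		MinRTTRequestCount:        50,
		MinConcurrency:            10,
	}
}
```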
So if you understand your environment pretty well, you can set these defaults and then just let it go. It's nice because you're not looking at one service and making the config bespoke for it; you're looking at your environment, setting some values, and that blanket applies everywhere. Maybe you'll tweak things here and there if you're a pro. And requests need to be a proxy for resource utilization. If I have a database query, who knows what that thing is going to do to my server? I don't know what's in it, and I'm not looking. Circuit breakers kind of break down here too. Database people out there, I'm sorry, I have no answer for you.

Let's talk about admission control. What does the admission control filter do? It probabilistically rejects requests based on a sliding-window error rate. Basically: how many of my requests were successful in the last period of time? Let's derive a success rate out of that, and based on that success rate I'll reject requests. We all should know what a sliding window is; we look at the successes over the total, that's our success rate, and we use it. If you want to know where this came from, the Google Site Reliability Engineering book is just a goldmine of wisdom, and this was in there; I saw it and went, let's do it. But first, let's change some things. That site reliability engineering book only has a linear relationship, which may be fine for some people, but it wasn't for the use case we needed, so I threw in this threshold value and this aggression term, which do awesome stuff. Check this out: the x-axis is success rate, and rejection probability is on the y-axis. This makes sense: as my request success rate drops toward zero, I am more and more likely to reject a request. Somebody also sent a pull request that added an upper bound to the rejection probability, so that's defaulted to 80 percent now. This graph is a little out of date, and I don't remember how I generated it.

Don't use this thing in this configuration, though. For all the other load shedding mechanisms I talked about, I'm assuming I'm an Envoy fronting this one instance: come at me, requests. In this case we do not want to do that, and here's why. If I increase offered load and then... sorry, hang on, how am I doing on time? Two minutes? Okay, we're almost done. You'll see this kind of behavior, and we don't want it: our success rate drops, so we go, oh no, and shed load; things recover, the success rate climbs, we let too many things through, and then we're waiting for more errors again. So we don't want that. What we want is client-side throttling with this admission control filter. Why is that useful? What if a server browns out, where the server is shedding load and even that is what's tipping it over? Or there are so many requests that the network is saturated, or you have so many clients that one request from any of them is just a raindrop in the flood. So that's it. Sorry. All right, can we do questions? Want to do questions if people have them? Do I, yeah, sure.
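For reference, here's a rough sketch of that probabilistic rejection logic. The type, field names, and curve shaping are mine, following the SRE book's client-side throttling idea plus the threshold, aggression, and rejection cap described above; this is not Envoy's exact admission control math.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// admissionController tracks request outcomes over a sliding window and
// rejects probabilistically as the success rate falls below a threshold.
type admissionController struct {
	total         float64 // requests observed in the sliding window
	successes     float64 // successful requests in the window
	srThreshold   float64 // success rate below which rejection kicks in, e.g. 0.95
	aggression    float64 // > 1 rejects more eagerly for the same success-rate deficit
	maxRejectProb float64 // cap on the rejection probability, e.g. 0.8
}

func (a *admissionController) rejectionProbability() float64 {
	if a.total == 0 {
		return 0
	}
	successRate := a.successes / a.total
	if successRate >= a.srThreshold {
		return 0
	}
	// Normalize how far below the threshold we are into [0, 1], then shape
	// the curve with the aggression exponent.
	deficit := (a.srThreshold - successRate) / a.srThreshold
	return math.Min(math.Pow(deficit, 1/a.aggression), a.maxRejectProb)
}

func (a *admissionController) shouldReject() bool {
	return rand.Float64() < a.rejectionProbability()
}

func main() {
	a := &admissionController{total: 100, successes: 60, srThreshold: 0.95, aggression: 2, maxRejectProb: 0.8}
	fmt.Printf("rejection probability: %.2f, reject this request: %v\n",
		a.rejectionProbability(), a.shouldReject())
}
```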
I had a question about... you mentioned that it wouldn't be protecting databases, but I'm wondering, from the client perspective, wouldn't any database latencies or issues just be reflected as server-side degradation anyway?

Right, yeah. But think of it like this: if I'm limiting concurrency to 10 requests, I'm thinking of that as, okay, I'm using 10 units of resources on this server. That makes sense for a lot of services, but for a database it doesn't, because I can have some just bananas, insane query, and then another one could basically be a no-op. So if I have 10 no-ops, I'm good; if I have 10 of those crazy queries, then I'm asleep at the wheel. That's what I meant by that.

Got it. Yeah.

I saw another hand somewhere. Okay, going once. Oh, okay. Oh man, at least I'm making them work for it.

So I guess the same question kind of applies to a heterogeneous service, right? Like if you have POSTs that take way the fuck longer or whatever it is, like one thing has an upper bound of 10 minutes because that's just how that API works. You'd have the same thing as with databases, right?

Do you mean, are you asking about, like, if you have a bimodal distribution or something?

Yeah, if you have however many APIs guarded by this Envoy config, I guess the question is how finely can you slice this?

Right, that's a real good question. This has come up a few times, and there's new information since I last addressed it. Okay, say I have a bimodal distribution: a bunch of requests take one millisecond, a bunch take 150, and that's totally normal; they each have their own little distributions. If you're tracking the p99 latency for the whole batch, then you're really tracking the mode that's in the tail, and you're kind of crossing your fingers hoping that the mix of those requests doesn't really change. That seemed good enough at the time. Now some issues have been opened about this, and we're thinking about, and by "we" I mean who knows who's actually going to do this, making it per-route. So you'd have whatever the typed per-route filter config is, and then you'd basically have a concurrency controller for each route, because they're super cheap to make and track. Admission control would really benefit from this too.

All right, that's it. Okay, I'm gone. Goodbye, everyone.