So welcome to my talk on protecting Envoy with the Overload Manager. I'm Kevin Baichoo, a software engineer at Google and an Envoy maintainer.

Many users run Envoy at the edge, and in edge deployments an attacker can disrupt your service either by taking out the service itself or by taking out the pipe that leads to it, in this case the Envoy proxy. In multi-tenant deployments, we also need to consider that the attacker might control an upstream. In that case the upstream isn't trusted, and the attacker might use it to take down the multi-tenant Envoy and cut off access for other tenants.

The reason Envoy could be taken down in these scenarios is that some resources weren't being protected, and that's exactly what the Envoy Overload Manager addresses: it measures a particular resource and takes action as needed. Let's explore what the Envoy Overload Manager can do.

First, timeouts are essential for distributed systems. They're how we ensure resources aren't tied up indefinitely. For example, if a client sends a request for foo, we want to bound how long that request can take. We don't want the client left hanging, and we don't want resources tied up throughout the system while we serve the response. Many systems have static timeouts: if the length of a spring represents the duration of the timeout, the timeout stays the same regardless of context. Envoy has scaled timeouts, which are effectively timeouts that can be compressed as resource pressure increases. So as resource pressure rises, the length of the timeout can decrease. I'll show a configuration sketch for this shortly.

Envoy has the ability to reset expensive HTTP/2 streams. This is Atlas, from Greek myth, holding up the world; for your traffic, that's Envoy. For HTTP/2 traffic, Envoy knows how many bytes it has buffered for a particular request and response, and it can use that information to drop the more expensive streams. As resource pressure increases, it can drop streams more and more aggressively to keep the proxy alive.

Envoy has the ability to stop accepting connections. Downstream connections are often where Envoy's workload is generated, so for an overloaded Envoy, disabling the listeners hopefully prevents additional work from being added and keeps it from crashing. Of course, this can harm both malicious and well-behaved clients.

Envoy has the ability to stop accepting requests. This is a way to fail fast, send a 503 response, and avoid tying up even more resources on an overburdened Envoy. As resource pressure increases, we can increase the probability of rejecting a given request. We have the same capability for rejecting incoming connections.

Envoy has the ability to tell clients to disconnect. This is particularly important in fleet-wide use. For example, suppose one Envoy has many clients and is overloaded, so we spin up some new instances. Those new instances aren't doing anything helpful right now, because the clients are still on the overloaded Envoy and having a lousy experience there. If the Envoy can tell its clients to disconnect, that gives us an opportunity to redistribute the clients and better utilize our fleet.
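To make the scaled timeouts concrete, here's a minimal sketch of how they can be wired up in the overload manager, assuming the fixed-heap resource monitor; the heap size is a placeholder, and the thresholds and minimum timeout mirror the experiment later in this talk:

```yaml
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    # Measure heap usage against a fixed ceiling (size is illustrative).
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648  # 2 GiB; pick for your deployment
  actions:
    # Scaled timeouts: start compressing at 60% heap usage and reach
    # full compression ("saturation") at 90%.
    - name: "envoy.overload_actions.reduce_timeouts"
      typed_config:
        "@type": type.googleapis.com/envoy.config.overload.v3.ScaleTimersOverloadActionConfig
        timer_scale_factors:
          - timer: HTTP_DOWNSTREAM_CONNECTION_IDLE
            min_timeout: 5s  # a 60s idle timeout shrinks toward 5s
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          scaled:
            scaling_threshold: 0.6
            saturation_threshold: 0.9
```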
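And here's a similar sketch for the other actions just covered: resetting the most expensive streams, failing fast on requests, refusing new connections, and telling clients to disconnect. The action names are Envoy's own, but every threshold here is an illustrative assumption you'd tune for your workload, and I've thrown in the heap-shrinking action I'll describe next:

```yaml
overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648
  # Per-stream buffer accounting, needed for reset_high_memory_stream;
  # the bucket value here is an illustrative assumption.
  buffer_factory_config:
    minimum_account_to_track_power_of_two: 20
  actions:
    # Reset the streams buffering the most data, more aggressively as
    # memory pressure rises.
    - name: "envoy.overload_actions.reset_high_memory_stream"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          scaled:
            scaling_threshold: 0.85
            saturation_threshold: 0.95
    # Fail fast: answer new requests with 503.
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95
    # Stop accepting new downstream connections (disables listeners).
    - name: "envoy.overload_actions.stop_accepting_connections"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95
    # Tell clients to disconnect so they can be redistributed across
    # the fleet (disables HTTP keepalive, draining connections).
    - name: "envoy.overload_actions.disable_http_keepalive"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.92
    # Return tcmalloc free-list memory to the OS (covered next).
    - name: "envoy.overload_actions.shrink_heap"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.9
```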
Envoy is written in C++ and uses tcmalloc as its allocator. tcmalloc really wants a fast allocation path, and to get one it maintains free lists of allocable objects. Envoy can tell tcmalloc to return some of that memory to the OS when memory limits are near.

So now let's shift into an experiment: static timeouts versus Slowloris. What is Slowloris? It's effectively a client being maliciously slow in order to tie up resources, for example by sending a request and not reading the response. In the following experiment, a client connects to Envoy over HTTP/1 and sends 60 KB worth of headers. Afterwards, it keeps the stream active by sending one byte every 15 seconds. In this setup, the attack could reach roughly 25k connections.

This is a graph of the task's memory usage, and you see all of those sharp spikes: each spike is a point where Envoy crashed, and the reason we keep getting data afterwards is automatic restarts. We can see the same thing in the active client connections: with this configuration, Envoy starts crashing around 18k client connections under this traffic.

Now let's conduct the same experiment with scaled timeouts. The timeout can scale between 60 seconds and five seconds, with scaling starting at 60% memory utilization and saturating at 90%. What that means is that at 90% memory usage, we've effectively turned the 60-second timeout into a five-second one. Here's the corresponding graph of memory usage with scaled timeouts. We see, again, a sharp rise in memory usage, but it levels off. Past the 60% threshold is where we start scaling the timeouts, and at 90% we would have reached saturation. We're scaling the 60-second timeout to somewhat under the 15-second interval the attack traffic uses to keep connections alive, so those connections start timing out. As a result, we're able to keep the proxy up and maintain around 16k client connections. This is the graph of client connections; you can see occasional spikes and drops due to the churn going on.

There are some caveats, of course. When you use the overload manager, it's very important to configure it for your workload and your requirements. Otherwise it might not help you, and it could actively hurt you. Small deployments can run into trouble with tcmalloc fragmentation overhead. And traffic diversity matters: depending on the traffic and its configuration, the overload manager might not be able to help.

Here are some pointers to get started with the overload manager, and thanks to all the folks who've contributed to this component. And here are some other great talks from past EnvoyCon years on protecting backends. Thank you.