Hi, my name is Axel, and mine is Mikkel, and we are here to tell you how to successfully fail at life. The baseless claim that a company celebrates failure is becoming as common a trope in modern tech as "have you tried turning it off and on again?" It turns out that repeated failure is not enough to ensure future success, and in practice celebrating failure often amounts to not firing people who are bad at their job. So it's worth thinking about how to fail forward, that is, to fail in a way that moves you meaningfully towards your goal. This talk is about how we failed five consecutive deploys at Spotify, and why we're reasonably proud of it.

But first, a little background. At Spotify, we have a Google load balancer in front of our microservices, and for a long time we've wanted to run Envoy as an HTTP proxy at the perimeter of our backends. Beyond having a unified perimeter, there is a long, long list of features we want out of this setup: common metrics, authentication, rate limiting, client IP lookups, access logs, and so on. As you might know, Envoy doesn't actually do all of those things itself. Our desired Envoy setup therefore includes a Docker sidecar running a second service that implements authentication, GeoIP lookups, and a few other things. Envoy calls this service for most incoming requests using the ext_authz extension.

We started this journey over a year ago by adding our first services behind Envoy. We then gradually added more and more traffic, and now it was time for the biggest deployment: the HTTP traffic from the actual Spotify clients. Before such an important move, we wanted to do a fair bit of testing to build confidence that this would actually work. We created a test setup that resembled the production setup as closely as possible. Instead of a real client, we used the wrk2 tool, which does open-loop testing, that is, it lets us set the desired request rate instead of just trying to fully saturate the system. This is almost always the correct way to load test a system. Our test setup used the same load balancer as production and an identically configured cluster, but with only one host, and various core counts on that host. Finally, our tests used a single upstream service named NOOP, a service whose reply time, status code, and payload size can all be configured on each incoming request.

So, what did we find? First of all, regardless of the number of cores on the host, metrics propagation always used one full core. Secondly, we found a few configuration bottlenecks, the biggest being the HTTP thread pool in our AuthZ sidecar. We tried 8, 16, 32, and 64 core hosts, and 32 cores offered the best throughput per core. We saw some failure rate elevation on slow requests, but we didn't investigate that further. And finally, we could see in the flame graphs that TLS used a bit more CPU than we expected. Good enough, let's go!

And we did. And thanks to that, we have this amazing news article. Needless to say, we had to roll back fairly quickly. What was going on? Well, it turns out that Envoy's circuit breaker, which is really just an outstanding-request limit, had triggered. It has a default limit of about 1,000 requests, which we had not changed. Handling 30,000 RPS per host with at most 1,000 requests in flight means that the average latency of your requests has to stay below roughly 33 milliseconds (1,000 outstanding divided by 30,000 per second). We had in fact checked that the median latency was much lower than that. But as usual, the long tail ruins everything.
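For reference, that limit lives under circuit_breakers on each upstream cluster in the Envoy config. A minimal sketch of the kind of override involved, with a hypothetical cluster name and illustrative thresholds rather than our actual production values:

  clusters:
  - name: client_gateway_upstream        # hypothetical name
    connect_timeout: 1s
    type: STRICT_DNS
    # endpoints, TLS settings and so on omitted
    circuit_breakers:
      thresholds:
      - priority: DEFAULT
        max_requests: 20000              # outstanding-request limit; the default is roughly 1,000
        max_pending_requests: 20000

Raising the limit only helps, of course, if the hosts behind it can actually sustain that much concurrency, which is what the rest of this story is about.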
So we had failed to check the right metric for this situation, which is the average, because Envoy doesn't report the average. The good news is that we could successfully reproduce the problem in our test environment, and we could validate that it went away when we adjusted the limit. So we adjusted the production circuit breaker settings and tried again.

And this time it worked. For a few minutes. It started out fine, but as time went on we got more and more errors. We noted this amazing graph showing the number of requests to each host. Basically, the load balancer seemed to throw everything it could at a random host until that host got overloaded. Then it threw the load at the other hosts instead, spiraling into more and more 500s from overloaded hosts. We adjusted our test environment to be more like production by increasing its size from 1 host to 10, and we could then see the same problem there. After some testing, we figured out that if we told the load balancer to target 15,000 requests per second per host, everything looked fine. We had assumed that a single-node test cluster would be fine. Looking back, it feels pretty naive to use a single node, but it's always easier when you have all the answers. We didn't know that the load balancer treated a single-node cluster as a special case. So we had failed, in order to save some money, to make our test setup similar enough to production.

Throttling to 15,000 requests per second stopped the flapping, but we had poor CPU utilization and an elevated failure rate. Clearly, something still wasn't right, but we didn't know where. So we tried to isolate the different parts of the system to locate the bottleneck. We started with the previously mentioned AuthZ sidecar. We disabled it, and RPS went from 15,000 to 20,000. This is expected, since the number of messages processed by the host goes down significantly. Also, CPU usage during these load tests still stayed well below 50%, so that was not the limiting factor. Next, we turned our eyes to the NOOP service. This fake service runs in Kubernetes. We did some quick profiling and found that each replica can handle 23,000 RPS. It is auto-scaled with a maximum of 100 replicas, which means it can handle roughly 2.3 million RPS. Once again, not the limiting factor.

Most Envoy users use the HTTP/2 stack, but between Envoy and our upstreams we use HTTP/1.1. Perhaps the HTTP/1.1 stack is somehow less scalable? We ran a test where Envoy responds to all requests directly, thereby bypassing any HTTP/1.1 stack, and we found that we could still only handle 30,000 RPS, at 10% CPU usage. Why? This is the low point. It is the part of the hero's journey known as the abyss, where we considered giving up on software development and finding a brand new career, one that makes sense, like carpentry.

But instead, we started looking at the number of connections between our load balancer and the Envoy hosts, and found that we had about 13,000 connections to each host. That's a pretty high number. And someone pointed out that the per-connection buffer size is 1 megabyte. With some math, you get a total buffer size of about 13 gigabytes. That's quite a lot of buffering for Envoy to do. So we tried decreasing it to 32 kilobytes per connection, and our requests per second increased from 30,000 to 60,000 on direct responses. We did try tweaking similar settings, like the number of concurrent streams and the window sizes, but we didn't find anything we thought was worth changing.
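To make that concrete, the buffer is a single field on the listener (the same field also exists per upstream cluster); a minimal sketch of the change, with a hypothetical listener name:

  listeners:
  - name: edge_https                          # hypothetical name
    address:
      socket_address: { address: 0.0.0.0, port_value: 443 }
    # The default is 1 MiB per connection; with roughly 13,000 downstream
    # connections that allows about 13 GB of buffered data on a single host.
    per_connection_buffer_limit_bytes: 32768  # 32 KiB

The HTTP/2 knobs we tried and left alone, like max_concurrent_streams and the window sizes, live under http2_protocol_options in the HTTP connection manager.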
But as soon as we hit 15,000 requests per second, latency started to increase again. This did not happen if we removed the AuthZ decorator from the request path. To check whether it was the service itself that was slow, we replaced it with a service that immediately returns 200 OK. Performance was still bad, and we still only got 15,000 requests per second. Clearly, we had isolated an issue in the communication between Envoy and the AuthZ sidecar. This was narrow enough for a teammate to realize that they had previously touched the network configuration on this cluster, and sure enough, we were using the Docker bridge network instead of the much faster loopback device. Throughput increased to 30,000 requests per second. But why didn't we see this earlier? Because it only increased latency, so badly that the load balancer started considering the host dead; it didn't actually limit throughput.

Finally, we had reached the end of our journey. Everything worked. We hit production, and everything was fine. For a few minutes. Then, once again, the error rates started to creep up, and RPS went down to the same old 15,000. At this point we decided to drill down into the various thread pools on the system to see if any of them were overloaded. What we found instead was that the main Envoy worker pool was extremely unevenly loaded: some threads were pegged at 100%, others were doing nothing. We assumed this was a locking problem and started working on profiling Envoy. That is, until someone noticed that the number of open connections on each worker thread was similarly lopsided. So why were some worker threads receiving all of the traffic and others none? We could not reproduce this problem in our test environment, which meant we were flying blind. We decided to reach out to the Envoy community, as well as our cloud provider, Google. We got a suggestion from both, in the form of Harvey Tuch: SO_REUSEPORT. This configuration option in Envoy is described as making inbound connections distribute among worker threads roughly evenly in cases where there are a high number of connections. Which raises the question: when would you not want connections evenly distributed among workers? Anyway, it worked.

But why couldn't we reproduce this problem in testing? It turns out that load starts out pretty evenly distributed and then slowly diverges. Our test cluster was either reconfigured often enough, or saw long enough breaks with no traffic, that things reset themselves, whereas our production cluster was always loaded.

So this is the end of our journey. We have now had four months without any major problems, and to get more certainty, we did a successful regional failover test where we killed one region and let all of its traffic go to our other regions. It just worked. So we have started doing fun things, like upgrading to version 3 of the xDS API, adding rate limiting, and looking at CORS configuration for our clients.

We did take it slow, rolling out gradually over an entire year, and we did spend a full week of performance testing before our last and final deployment. And still, we failed to identify five major scalability bottlenecks. Maybe spending an hour looking at all the available metrics while load testing our setup would have identified a few of these problems, but probably not all of them. Looking back, this journey was a lot of fun, even though it didn't always feel like that while it was ongoing, and we certainly learned a lot.
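For reference, the SO_REUSEPORT change mentioned above is a single listener option; a minimal sketch, with a hypothetical listener name, noting that the field was later renamed:

  listeners:
  - name: edge_https                     # hypothetical name
    # One accept socket per worker thread, so the kernel spreads new
    # connections across workers instead of letting one event loop win the race.
    reuse_port: true                     # renamed enable_reuse_port in later Envoy versions
    # The exact balance option that comes up in the questions below is an
    # alternative, where Envoy itself balances accepted connections:
    # connection_balance_config:
    #   exact_balance: {}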
So here are some suggestions we thought we would share. They would most likely have helped us, so maybe they can help someone else. First, make the default queue size scale per core, so you don't have to remember to change it when you switch to a machine type with a different number of cores. Second, make SO_REUSEPORT the default. We know this has some performance cost for low-traffic servers, but we figure efficiency matters more on high-traffic servers; another alternative would be to highlight it in the best-practices guide for running Envoy as an edge proxy. And last, add the average latency to the histograms. We know averages can be overused and misleading, but when doing math on connection settings they can be very helpful.

Okay, so how do you fail at life? By planning for it. Assume that you will fail, because you will. Try to think ahead to when you will fail and what you will need to do next, and make sure you have the tools at your disposal to do just that. This often means having the right metrics. Next, do your best to reproduce all problems outside of the production environment. Not only does doing so give you much more opportunity to see what happens in various related failure scenarios; the act of crafting a test environment often shows you blind spots you didn't know you had. And finally, communicate. Ask for help. Broadcast your shortcomings to anyone who can be made to listen, like you. Even if your mistakes are embarrassing, like ours, keep talking. Maybe some of those mistakes can be prevented through code changes, and if not, at least more people will know about the common pitfalls.

Hello everyone. Are there any questions in here? Thank you for all of the feedback and the thumbs up and whatnot. Let's see. "Have you looked at enabling exact balance on the listener?" Mikkel, I'm going to let you handle that one, because I don't know. I don't actually know what that is either. That was what I was too ashamed to admit. I'm not ashamed of things like that. I have no idea; I will look it up. Thanks for the tip.

A question about whether the HTTP/1.1 issue was identified. So, there was no HTTP/1.1 issue. That was a suspicion we had, that maybe the HTTP/1.1 stack was slower, or less battle-tested, or less scalable, or something like that, and it turned out to be wrong. We are still using HTTP/2 from the load balancer to Envoy, obviously, and then from Envoy to our microservices we're talking HTTP/1.1, and both seem to perform just fine.

"Running into very similar problems at Twitter." Overall, I would expect people with very large request volumes to have similar issues. There is a very good start on how to make Envoy an HTTP proxy for a large organization in the Envoy docs, but I think there are opportunities to improve the configuration, as well as that documentation, to make life even easier for large installations. On exact balance: it's another way of forcing connection balancing, then. We should look into it and see whether it works better or worse. Thanks for the tip.

Yes, Maxim: in our load testing we got about 1,000 requests per second per core; in production, I think we get a little bit less. And Matt Klein asks: why are you using HTTP/1.1 to the backends rather than HTTP/2? The answer to that is mostly legacy.
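For context, that protocol choice is made per upstream cluster in Envoy: a cluster speaks HTTP/1.1 upstream unless it is explicitly opted into HTTP/2. A minimal sketch with hypothetical cluster names, using the older-style field (newer Envoy versions express this through typed_extension_protocol_options instead):

  clusters:
  - name: legacy_http1_backend           # hypothetical; HTTP/1.1 upstream is the default
    connect_timeout: 1s
    type: STRICT_DNS
  - name: grpc_backend                   # hypothetical; opted into HTTP/2, e.g. for gRPC
    connect_timeout: 1s
    type: STRICT_DNS
    http2_protocol_options: {}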
To expand on that: Spotify has a very old network stack, about a decade old. We implemented our own transport layer instead of HTTP because we had a lot of scalability problems with HTTP. This transport layer, called Hermes, is basically very similar to HTTP/2 in most ways: it solves the same problems in mostly the same way, and it tries to be very HTTP-like in its API, but it is older than HTTP/2. We started working on it slightly before Google started talking publicly about SPDY. We are still transitioning away from this internal Hermes protocol. What we have today for our Hermes-based services is a library you can use to accept HTTP traffic as if it were Hermes traffic, and we are moving to using HTTP/2 and gRPC internally, and then in the future hopefully HTTP/3 and so on, modernizing our stack. But we're not there yet.

And Christopher, we're six people, I believe. Yeah, something like that? Yes. And the other people are more competent than me and Axel; that's why they kicked me out. And Louise, I'm not sure how many requests per connection we had; if you ask me on Slack, I can check.

As for whether we do much in the way of filters: we are using a few filters, for example to filter out users who are not allowed on some resources, and so on. But the big thing that reduces our efficiency, I would say, is that we are running both Envoy itself and this decorator sidecar, implemented as an ext_authz filter, on the same 32-core machine. So the three resource hogs on the machine are Envoy itself, which uses about half the CPU; the sidecar, which uses slightly less but still a significant amount; and the metrics propagation, which uses about one out of 32 cores. All three of those run on every single Envoy host. It also means that a message comes in to Envoy, is passed out from Envoy to the other service and back, then on to the next hop, and then the reply comes in, so there are six message-passing steps or something like that, not just the four you would expect. My math is probably wrong, but something along those lines. And our decorator is written in Java, so we also get some garbage collection fun.

Replacing it with filters: I don't think we have talked about that. I'm not sure why we decided to go with the sidecar; that decision is over a year old, from before I joined the team. I was very interested to hear one of the opening talks about using WebAssembly to write your own custom filters in Envoy. That could definitely be useful for us. We did not want to write our own C++ filters because, as a company, we have too few developers who are really comfortable with C++, and then it becomes a "who owns it" problem, whereas we have lots of Java devs. WebAssembly might help with that. We don't know; we'll see. But overall, I agree that the sidecar solution feels like probably not what we want to do long term.

And Maxim, we're running on managed infrastructure. I think that answers all of the questions. If someone posted a question that we didn't answer, it's not because we hate you, it's because we missed it, so please feel free to repost it in that case, or ask on the Envoy Slack. At least I'm there; I'm not sure if you are, Axel. I actually am. Fantastic. I know. That seems to be it. Sure does. Thanks a lot, everyone, for listening. This was great. I will now disconnect and go say hi to the third speaker from this conference, Titus. So, bye. Bye, everyone.