Hello, everyone. Nice to see you all. Welcome to our talk: reducing cross-zone egress at Spotify with custom gRPC load balancing. Really long title. My team is here; they're trying to get me to say it in French, and I'm not even going to attempt it and make a fool of myself here. My name is Alex, and this is my colleague, Yannick. We work at Spotify, specifically in the computer networking department. Our team concerns itself with service discovery: we manage the protocols in our backend-to-backend communication and everything that comes with that. One big concern is load balancing and everything around it, and that's why we're here today.

In our talk, we'll start quite high level and talk about load balancing in general, why we do it, and what we care about. Then we'll dive deeper into exactly how we do load balancing at Spotify, what tradeoffs we take, and how it works in general.

So, without further ado, the problem. In a microservice architecture, the choice of load balancing greatly impacts the performance of the product. What does that mean? If we look at an illustration, a common load balancing decision might look like this. There is a client that needs to talk to some API to get a response. That API is served by several servers; in this case, we have three instances, and we have to pick one. That's the choice of load balancing. As humans, we look at this and see that the first one is fast, the second one is slow, and the third one is on fire; it doesn't even work. So naturally, we pick the first one. Implicitly, we're optimizing for latency, error rates, utilization, and various other attributes that affect the product. In the case of Spotify, high latency means slow playback, and high error rates mean you can't see your playlist, and so on.

Now, on a slightly higher level, as a business we're always optimizing for cost. We need to spend less money than we make in order to be profitable, and how we do load balancing affects this cost in many ways. We're here specifically to talk about the cost of crossing networking boundaries. Again, what does that mean? I want to apologize to all data center experts in advance; this will be a very hand-wavy description of how a cloud works, but I think it will get the point across. Your favorite cloud provider probably has a room that looks like this: a data center with a bunch of servers hooked up to each other. To serve the needs of global apps, you need more than one of these rooms, so your cloud provider will have another room. The point we want to get across is that this second room is physically close but not the same room; it might be a kilometer away, or it might be another floor in the same building. At some point, workloads are going to communicate across these rooms, so your cloud provider needs to connect them to each other. And that looks like this. This is an accurate illustration of our clouds. That cable is quite expensive; it stands in for all the hardware you need: fiber optic cables, separate power supplies, et cetera. This costs money for the cloud provider. It's specialized hardware, so they're going to charge us for it. If you look at the list prices for all clouds, you'll see line items like network egress, where you pay per byte. So if we look at our load balancing example again, made slightly more realistic: we have the same client and two servers.
But the servers are separated across two zones, and zones here are availability zones: separate rooms in our cloud. We're of course still optimizing for the things we mentioned in the beginning, latency and error rates, but here we can consider other things as well. We can do this: keep all the traffic in the same zone. That's cheap, because we don't pay for crossing the network boundary. Or we can do this, where we pay for all the traffic we send. Probably we want something like this: we send the majority of the traffic within the same zone, but we keep some spillover to another zone in case of problems. That spillover is expensive, but it's fair; we can pay that price, because problems do happen. Your cloud provider can have an issue, or you can cause issues yourself and bring down parts of your infrastructure in a zone.

At the scale of Spotify, the cost of crossing networking boundaries is significant. With thousands of microservices and many, many pods talking to each other, this starts adding up. At some point, we realized it was not only significant, it was the majority of our traffic: two thirds of our backend-to-backend traffic was crossing expensive networking boundaries. Now you might think: Spotify, why are you doing this? Why are you sending the majority of your traffic across networking boundaries? There's a logical explanation. It's not intentional; it's a consequence of how things work in our setup. This is a very common load balancing decision in the Spotify backend. Given how we run our clusters, we have three zones, and the client again has three servers to pick from, separated across those three zones. We were talking about latencies and error rates, but it turns out that in the majority of cases things are working well; high error rates are the uncommon scenario. So if we squint and zoom out, things work mostly fine. Also, that shady-looking cable connecting the data centers turns out to be quite performant, so there's no major latency hit when sending traffic across availability zones. With these two things in mind, what happens is this: a uniform distribution, even when we're optimizing for all those attributes. So two thirds of our traffic crosses an expensive networking boundary. And the question we asked ourselves is: how do we manage this expensive cost? We're going to do a deeper dive into how our custom load balancing algorithm works, and for that I'll hand it over to Yannick.

Yes, thank you, Alex. I'm going to talk about how Spotify does client-side load balancing and how we actually solved the problem of avoiding cross-zone egress in our client-side load balancing algorithm. If you look at what our traffic architecture looks like, we mostly use gRPC to communicate between backends. And gRPC already ships with multiple choices of load balancing algorithm: there's round robin, which is pretty well known, and there's also least request, which you might want to use. But at Spotify, for historical reasons, we have built our own load balancing algorithm, and it has served us well for many years. In the following, I'll explain the high-level view, the ideas, and the design decisions that went into that load balancing algorithm, because that might be quite interesting.
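For reference, selecting one of those built-in policies is a one-liner in grpc-java. A minimal sketch, with a hypothetical target address:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class BuiltInPolicyExample {
    public static void main(String[] args) {
        // Select one of gRPC's built-in client-side policies by name.
        // "round_robin" spreads RPCs across all resolved backends;
        // without this, grpc-java defaults to "pick_first".
        // The target address below is hypothetical.
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("dns:///playlist-api.example.internal:443")
                .defaultLoadBalancingPolicy("round_robin")
                .build();
        // ... create stubs on the channel and issue RPCs ...
        channel.shutdown();
    }
}
```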
I will also then share how we integrated this new requirement of avoiding cross-zone egress. And I'm especially going to focus on the fact that you can easily build your own load balancing algorithms if you use gRPC, and you can have your customers, which in our case are our developers, use those algorithms and optimize them for your specific business needs. So hopefully the following slides will inspire you to explore this option within gRPC itself.
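To give a feel for what that option looks like: here is a minimal, hypothetical skeleton of plugging a custom policy into grpc-java. The policy name and class are invented for illustration, and the actual LoadBalancer implementation is omitted; this is a sketch, not Spotify's code.

```java
import io.grpc.LoadBalancer;
import io.grpc.LoadBalancerProvider;
import io.grpc.LoadBalancerRegistry;

// Hypothetical provider that registers a custom policy under its own name.
public final class ExpectedLatencyLbProvider extends LoadBalancerProvider {
    @Override
    public boolean isAvailable() {
        return true;
    }

    @Override
    public int getPriority() {
        return 5; // priority among providers registered for the same policy name
    }

    @Override
    public String getPolicyName() {
        return "expected_latency"; // invented name; clients select it by this string
    }

    @Override
    public LoadBalancer newLoadBalancer(LoadBalancer.Helper helper) {
        // A real implementation would return a LoadBalancer that tracks
        // per-endpoint measurements and builds a SubchannelPicker.
        throw new UnsupportedOperationException("sketch only");
    }

    public static void register() {
        LoadBalancerRegistry.getDefaultRegistry().register(new ExpectedLatencyLbProvider());
    }
}
```

After registering the provider, a channel can opt in with `defaultLoadBalancingPolicy("expected_latency")`, just like the built-in policies above.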
So first, let's talk about the design decisions that go into a load balancing algorithm. How would you approach building one? The first design decision we might want to consider is: how should the load balancing algorithm behave when everything works great? All servers are equally utilized, there's no big error rate anywhere, everything works nicely. In that case, we might want to always pick the fastest server. We want to optimize for very low latency, because especially in long call chains, latency adds up, so the client experience greatly depends on choosing low latency. For the healthy state it's quite easy: just use the lowest-latency endpoint, based on what you measured in the past, for example.

So what happens if we introduce errors? I want to stress this point, because it's maybe the one that most distinguishes the Spotify load balancing algorithm from the other choices within gRPC itself. As you encounter errors in your backend fleet, we want to propagate those errors very quickly: send them quickly to the Spotify client so it can retry them. We want to do that because if we propagate the error fast, so we fail fast, we basically don't expose the error to the user at all. Say you add a song to your playlist and that fails; if we retry quickly enough, you won't notice. So considering this design decision, we want a load balancer that, in case of a high error rate in the backend fleet, fails fast instead of failing slow. It should account for that.

And the last decision I want to talk about is the one Alex already alluded to: we want to avoid sending traffic across zones. If we have the choice between two servers, one in the same zone and one in a different zone, we probably want to choose the one in the same zone. However, there might be cases where we would prefer to send the traffic across zones: cases where we see a benefit worth paying that fee, because we would, for example, serve a successful response instead of a failing one, or the latency would be much lower. Our load balancer should account for that too.

Now, how do we put these high-level design decisions into a working algorithm? The Spotify solution is a concept we call expected latency. What expected latency refers to is that for every server the client can choose, you calculate what you expect the latency of a successful response to be. Assuming you can calculate that from past measurements, you can make a very accurate and good load balancing decision with high probability, because you can pick the server with the lowest expected latency. This sounds quite tricky: how do we actually calculate the expected latency, and how do we measure all the things we need for it? Let's build it up step by step.

Let's zoom back a bit and consider again the very simple scenario of not having any failures in our backend. If you look at the expected latency intuitively, you can observe that it is dominated by the fact that you need to send a request and wait for the response, and that there's probably already some load on the server, so the server has work to finish before it can get to your request. If we put that into a formula that models this observation, we arrive at something like: the expected latency, in a scenario without failures, is the success latency, the time between sending a request and receiving a successful response, times the size of the server's queue plus one. That's because, in our model, the server needs to process its whole queue, sending out a response for each queued request, and the plus one is our own request at the end. So assuming the client measures the queue size it has for a specific server, that is, the number of outstanding requests, and measures from past experience how quickly the server responded successfully, it can calculate that expected latency quite easily.

Now let's introduce failures and see how we can integrate them into this formula and the notion of expected latency. Consider that we send a request to a server with an observed success ratio of 50%: based on past experience, we know the server fails half of the time. The first request will fail with a probability of 50%. But we said we directly retry the request, right? So when do we actually get a successful response, under the model that the server fails with 50% probability? Well, the probability that the retry also fails, so that neither request returns a successful response, is one in four. If you continue this, you can calculate the expected number of attempts we need to make, and it comes out to one over the success ratio; in this case, two. What this adds to our notion of expected latency is that we also need to account for the retries we expect to do, and for the time it takes to notice that there was actually a failure. If we add this to our formula, we arrive at this: we have the success latency, times the expected number of attempts, integrated with the modeled work that's currently on the server. That's already a very sophisticated measurement to use in your load balancing decision, because if you calculate it for every server, you basically assign a weight to each server, and then you just pick the server with the highest weight.
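Putting that build-up into code: a small sketch of the expected-latency estimate as described so far. The names and the exact combination of terms are our reading of the talk, not Spotify's actual implementation.

```java
/**
 * Sketch of the expected-latency estimate from the talk.
 * All names are hypothetical; this is not Spotify's actual code.
 */
final class EndpointStats {
    double successLatencyMillis; // measured latency of past successful responses
    double successRatio;         // observed fraction of successful responses, in (0, 1]
    int outstandingRequests;     // client-side queue size for this endpoint

    /**
     * Expected latency of eventually getting a successful response:
     * each attempt costs roughly the success latency, we expect
     * 1 / successRatio attempts (e.g. 2 attempts at a 50% success
     * ratio), and the server must first work through its queue of
     * outstanding requests plus our own request (the +1).
     */
    double expectedLatencyMillis() {
        double expectedAttempts = 1.0 / successRatio;
        return successLatencyMillis * expectedAttempts * (outstandingRequests + 1);
    }
}
```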
Now, the last part we wanted to integrate into our load balancing decision, and therefore into the formula, is the notion of avoiding cross-zone egress, which is also the topic of our talk. So how are we going to do that? Let's again consider the design decision we need to take: in cases where the server in the same zone and the servers in other zones behave very similarly, with similar success latency and similar utilization, we want to pick the one in the same zone. However, if, for example, the latency of the same-zone server is 10 times higher than that of the cross-zone one, that's a good reason to actually pay some money and send the traffic across zones, because we improve the user experience.

So how do we integrate this? The idea is quite simple, and that's also the cool thing, or the result, of this talk: it's very simple to take this decision and integrate it heuristically into your load balancing decision. The idea is: let's just make the expected latency of cross-zone endpoints a bit higher than that of same-zone endpoints. By doing this with a hard-coded constant, we define a threshold for when we want to tip over. What we currently run with is that the expected latency of a cross-zone endpoint is considered 10 times higher than that of a same-zone endpoint. As the slide just showed, this means that if the cross-zone endpoint has a 10 times lower latency than the same-zone endpoint, we prefer the cross-zone endpoint. And because the formula already includes all these useful signals, failure rate, queue size, and utilization, we basically allow failover on all of those parameters too. So yeah, we're kind of just doing this. It's maybe a boring realization, but it's also a very simple approach with a lot of advantages, because you can iterate on it very quickly. You can fine-tune this threshold to your needs: if you want to avoid a lot of cross-zone traffic and you're fine with accepting higher latency in certain cases, you can choose a much higher penalty. It's nice that this solution is so flexible. So while it might seem simple, it's actually quite well thought out.
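Continuing the earlier sketch, the penalty then amounts to a couple of lines. The 10x constant comes from the talk; everything else here is a hypothetical illustration.

```java
import java.util.Comparator;
import java.util.List;

final class ZoneAwarePicker {
    // The tip-over threshold from the talk: a cross-zone endpoint must be
    // 10x better on expected latency before we prefer it over a same-zone one.
    static final double CROSS_ZONE_PENALTY = 10.0;

    record Endpoint(String zone, EndpointStats stats) {}

    static Endpoint pick(List<Endpoint> endpoints, String clientZone) {
        return endpoints.stream()
                .min(Comparator.comparingDouble(e -> {
                    double expected = e.stats().expectedLatencyMillis();
                    // Penalize endpoints outside the client's zone, so we only
                    // cross the zone boundary when it is clearly worth paying for.
                    return e.zone().equals(clientZone)
                            ? expected
                            : expected * CROSS_ZONE_PENALTY;
                }))
                .orElseThrow();
    }
}
```

In grpc-java, logic like this would live in the SubchannelPicker produced by a custom LoadBalancer such as the one sketched earlier.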
I want to give a shout-out to the maintainers of gRPC, because they enable us to do these things. As I said, they enable you to build your own load balancing algorithm, and I encourage you to take a look at the open source implementations and get inspired by them. They also offer some very cool features like ORCA. You might have heard about it; ORCA is essentially a gRPC streaming connection that sits on top of the already established HTTP/2 connection between the client and all the available servers, and it allows you to very easily exchange load information over that stream. What we use it for is that the client lazily learns the zone of each server and can then adapt the expected latency of each server. But you can use it for other things as well: you can get better insight into the actual utilization of a server and incorporate that into your load balancing decision. It's a very flexible framework, and it's very cool, so you should look into it if you're interested in this topic.

Now, for the results. As Alex already explained, before making this change we were sitting at two thirds of our traffic crossing network boundaries, which was not great. And just by making this very simple change, penalizing cross-zone traffic in our formula, we went down to 30% cross-zone traffic. I must admit this is probably not a perfect solution, and you can aim for much lower cross-zone traffic with the right tools, but it is very simple, so it's easy to understand. If you aim for very low cross-zone traffic, you will need to account for a number of edge cases, for example very heavy zone skew, by which I mean that the number of clients and the number of servers can be very imbalanced between zones. If you prefer same-zone traffic too much, you might risk the reliability of your backend, and at the scale of Spotify this is a very difficult trade-off to make. So we're very happy with this simple solution, as it already gave us great results. And with that, I hand it back to Alex to conclude the talk.

Okay. Before we end, we just want to share some of the learnings we had doing all of this. As Yannick alluded to, we want to shout out gRPC and other CNCF technologies. They allow us all to iterate faster; the community is solving many of the same problems, and by doing this, we help each other. So a big shout-out there. We also want to add that as platform or infrastructure teams, we sometimes struggle to see how we affect the end product. This is a problem where we directly touch top-line business metrics by looking at some quite nerdy things, and it does work. And finally, the message we want to end with is: aim for good enough. The short summary of this talk is that we multiply the weights by 10 in our algorithm, and then it works. We spent a lot of time overcomplicating this and trying to reach that ideal line in Yannick's graph. But as the saying goes, perfection is the enemy of progress. So aim for good enough, and then iterate. Thank you all for listening. If you have any questions, we're here, and we have a bunch of Spotifiers here as well, so feel free to approach them afterwards too. I think there's a microphone over there in the middle of the hallway.

Thanks for the talk. You just touched on ORCA, so I was wondering which language you used for the implementation, and what you used ORCA for? Yes, so we mainly use Java; our backend is in big majority Java, and this implementation is in Java as well. Why we used ORCA: we looked at different mechanisms for how we could transport this zone information. There were some in-house implementations of the same kind of thinking, creating channels between all servers and then exchanging the information. The reason we decided to use ORCA was, well, it's already there, but we can also control how often this information is propagated. In this case, we only need the zonal information once, as soon as the connection is established, and with ORCA we can configure the stream so that we get the message once and then no more information is exchanged.

Hi, thanks for the great talk. I was wondering on what level you determine the expected latency. Is it on a per-gRPC-endpoint level, or on a per-service level, per server? Where do you determine that? Do you want to take this? Yeah. So it's on an endpoint level; it's basically fully client-side load balancing. Each client determines and measures its own observed success latency, failure latency, and success rate across all its connected servers. And with a high enough request rate, you get very accurate results over time.

Hey, thank you for the talk. I have one question: how do you make sure that every team uses this load balancing algorithm? Yeah, that's a great question. I don't know if you saw the Spotify talk earlier today, but essentially at Spotify we have something we call fleet management.
That is a way to roll out changes over all repositories and backends at Spotify. So we rolled out this change and had users adopt it by making a simple commit in their repositories, and then it was active. Yeah.

Can I ask one more question? Have you thought about maybe force-redirecting traffic, something similar to what service meshes do? Yeah, that's a great question. We're definitely trying to adopt a more service-mesh-like approach, because you will probably get closer to that perfect score on cross-zone traffic, but admittedly it's very hard to adopt at scale if you didn't have the infrastructure before. That's what we're currently still working on, and that's why we phrase this as: aim for good enough, and then iterate toward the more complex solution that solves it close to perfectly. Okay, thank you very much.

Thank you for the talk. I just wanted to ask: with every feature deployment, the latency and the success ratio can differ a bit. So is it a static value or a dynamic value? Yeah, those are measured over time. They're basically just metrics that each client observes. For example, it might send a request and get a successful response, and it will directly track that in its internal metrics and use it for the next load balancing decision. I can quickly add something to that: we do have a tiny safety mechanism on the penalty. If we notice that the error rate is too high when we send traffic to the same zone, we reset the penalty and then slowly move it back to the level where we started from. So there's some safety in that mechanism, to your question.

Did you modify the autoscalers in any way to scale up a certain zone if it gets more traffic? Great question. We started with not doing that; we want to keep this at the load balancing level for now. At some point, if we continue iterating on this approach, we'll start hitting other problems that we can't fix with better load balancing, like zone load skew and how things are deployed. At that point, we might look at the wider infrastructure and see if we can align it better with this load balancing.

Hi. How exactly did you test this beforehand? I mean the ratio, the multiplier for the expected latency. Did you roll out some low value and gradually increase it, or how did you test that? That's a great question. Yannick, you did a lot of the testing, do you want to take it? Yeah, great question. We started by rolling it out without activating the penalty. Then we had a very advanced load-testing setup where we tried to fine-tune it: we tested different zone skews, which you can influence, with continuous load between deployments, and looked for the sweet spot. We made a very conservative start, going with 10, and that worked very well in production; we didn't see any issues over a long time. I can also add that we rolled this out in a tier-based way, which means lower-tier services that are not considered business critical started using this parameter before the higher, business-critical services did. Thank you.

Hey. From my understanding, you had to write a custom balancer to do that, right? Sorry, I didn't catch that. You had to write a custom gRPC balancer? Yeah, exactly. Yes. So my question is: it sounds like there are probably a lot of reusable parts in there.
And maybe you had to copy-paste some of the upstream code, right? Okay. So have you engaged with upstream to try to make the balancer internals more open, or maybe even to upstream your own formula? Yeah, for that specific one, we didn't consider making it open source. I mean, it's very specific to Spotify, but I don't want to exclude that possibility. I don't know. But thank you. I can add to the question of having to copy-paste code: yes, we pretty much started off with the round robin implementation, reused much of that, and built other things on top. So that was an issue, and we'd love for those APIs to become friendlier for this. So great question.

Hi. How long did this take to implement, from the moment you realized you were sending two thirds of your traffic in a way that was costing you extra money, until this was rolled out in production for everything? If you ask me, two months; if you ask our manager, maybe six months. But it's in the months, so three or four months, if I'm remembering correctly. Yeah, I can slightly add to that: the complexity is maybe not building it, but making sure that it's safe and rolling it out over the fleet. There are just so many issues that you face at scale, including people with special setups. That's more the challenge than actually building the algorithm, I'd say.