Hey, my name is Rob Scott, and I get to work with a great team at Google on Kubernetes networking. Today, I want to talk with you about improving network efficiency with topology aware routing. But before we get into that, here's a brief outline of how I want to structure today's talk. First, I'm going to go through a little bit of background on how we got to where we are, and then cover our first attempt at topology aware routing and some of its limitations. Then I want to explain our second attempt and what we're trying to do differently this time, as well as how it actually performs when we simulate millions of different inputs and scenarios. And finally, I want to provide a long-term vision for where we see this going in the months and years to come.

Before I go any further, I have to add some disclaimers. Topology aware routing is hard to get right. This talk attempts to show our thought process, and I'll be the first to say this approach is not perfect. We still have lots of areas for improvement, and things could still change. But I wanted to bring attention to the direction we're going and the improvements we're hoping to make here.

So let's get into the background a little bit. If you've used Kubernetes recently, you've probably created a service or a pod or both. And if it's a relatively new Kubernetes cluster, it probably has the endpoint slice controller running. That controller watches services and pods and creates endpoint slices based on them, and those endpoint slices are just lists of pod IPs for each service. kube-proxy watches those endpoint slices and configures iptables on each node.

So what does iptables look like here? The iptables configuration we're writing is very much based on probabilities. I copied a little snippet from the iptables config on a node in one of my Kubernetes clusters that had a two-endpoint service, which just means a service with two pods behind it. In this case, we have two key lines. First, if you're targeting this service, there should be a 50% probability that you'll choose this specific service endpoint; that's the 0.5 probability. And second, if you don't choose that endpoint, continue down the list, and since you haven't chosen the first endpoint, you should choose the second one. So each endpoint has a 50% chance of being chosen. You can see that same pattern scaled out to thousands or even 100,000 endpoints across different Kubernetes clusters. We rely heavily on iptables probabilities.

Now, this works, but it comes with some issues. Traffic is distributed randomly across all endpoints, regardless of where it actually originates. So in a three-zone cluster, traffic is more likely to go to another zone than to stay in the zone it originated from. And every instance of kube-proxy needs to keep track of every single endpoint in your cluster, which means the larger your cluster gets, the more endpoints every instance of kube-proxy has to manage, which results in slower updates and sometimes even more latency.

We also have to work within some constraints here. kube-proxy doesn't handle requests directly; it just programs iptables or IPVS. And that's great in many cases. It means if there's some kind of problem with kube-proxy, your networking doesn't break immediately.
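Since everything comes down to what's sitting in iptables, here's a rough sketch of those two key lines for a two-endpoint service, as they'd appear in iptables-save output. The chain and endpoint names here are illustrative, not the exact ones from my cluster:

```
# 50% chance: the statistic module matches this rule half the time.
-A KUBE-SVC-EXAMPLE -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-ENDPOINT1
# Fallthrough: if the rule above didn't match, use the second endpoint.
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-ENDPOINT2
```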
If something goes wrong with kube-proxy, traffic just keeps following whatever was already written into that iptables or IPVS config, which is good, because kube-proxy is not perfect. But unfortunately, it means we don't have visibility into request errors or timeouts, and it's really difficult to understand when an endpoint is overloaded.

kube-proxy is also deployed on each node in your cluster, so any significant change can be expensive. Think about it: because it's running on every node, any change to an endpoint anywhere in your cluster has to be sent to every instance of kube-proxy on every node. If we're updating endpoints more frequently, that's expensive. Similarly, more advanced logic increases CPU utilization, again on each node, which is significant overhead to add on every single node in your cluster. So we have to be careful with any approach we take here.

Okay, so with that background, let's talk about our first attempt. We added a new alpha API field to services called topology keys. This allowed for endless flexibility: you could specify arbitrary topology keys in any order, and kube-proxy would then filter endpoints based on those labels. Any endpoints with matching values for those labels would be filtered in, and everything else would be excluded from what kube-proxy considered on a given node. A wildcard character could then be used to indicate that if matching labels couldn't be found, traffic should be routed elsewhere.

Let's take some concrete examples of this configuration. First, if you wanted to require that traffic stay in the same zone or region, your topology keys would just list the zone or region label. And if, instead of requiring, you were fine with merely preferring that traffic stay in the same zone or region, you would add that final wildcard at the end of the list; I'll show a sketch of that form in a moment.

So this made sense. It was a very powerful API, but it did come with some real limitations. First off, this was a relatively complex API, and what most users wanted was very similar: traffic should stay as close to where it originated as possible. Unfortunately, that really common use case was relatively difficult to express with this API, and it required relatively significant and complex configuration on every single service.

Similarly, all the logic behind this functionality lived in kube-proxy. That meant extra processing on each node, and it also meant that all endpoints still needed to be delivered to every node in the cluster, even if that node was then going to filter those endpoints out.

It was also a relatively difficult API to implement. Ideally, topology keys would be given more weight if they appeared earlier in the list, but that would be quite difficult to achieve safely; there's a very real potential that we could overload endpoints. Take an example where we have a lot of traffic originating from a zone, but only one endpoint to serve that traffic. It's easy to see that if you had configured that service to prefer, or even require, same-zone traffic, it would be very difficult to avoid overload.

At first we just did some basic filtering that matched any labels based on topology keys. That kind of worked, but we still had the same risk of overload. And of course, if that wildcard character was passed in, then we couldn't really do any filtering, because, well, you could route traffic anywhere.
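To make that concrete, here is roughly what the alpha topologyKeys field looked like on a Service, in the form that prefers the same zone, then the same region, then anywhere. The service name and selector are just placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
    - port: 80
  # Considered in order: same zone first, then same region,
  # then the "*" wildcard lets traffic go anywhere.
  topologyKeys:
    - "topology.kubernetes.io/zone"
    - "topology.kubernetes.io/region"
    - "*"
```

Dropping that final "*" entry turns the preference into a requirement: if no endpoints match an earlier key, traffic has nowhere to go.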
So our initial implementation was very basic, and it was hard to get quite right, especially because there were so many potential combinations of configuration. What we wanted to do was prioritize endpoints matching earlier labels in the list while avoiding overloading those endpoints. We also wanted to avoid sending traffic nowhere: if there are no endpoints in the zone or region, we want to avoid the situation where traffic has no endpoints to go to. And similarly, we wanted to make that star character behave more like failover configuration instead of just removing filtering altogether. Although these all seem like great goals, they became quite difficult to achieve, especially with such a flexible API.

So with that background, we said, well, maybe we should try again. In our second attempt, we identified three goals. First, we wanted to build consensus around a small set of topology labels that would be clearly defined and officially supported. Second, we wanted to develop a simple approach that covered the most common use cases as automatically as possible. And finally, we wanted to deliver only the endpoints that each node would actually care about, instead of continuing to deliver all endpoints to all instances of kube-proxy. We thought this would make a very significant difference for both scalability and performance.

Because this is Kubernetes, each one of these goals had an associated KEP, a Kubernetes Enhancement Proposal. If you're interested in all the details of these proposals, I highly recommend you go to the Kubernetes enhancements repo on GitHub and look up each one of these KEPs. I'm going to provide a decent overview of each of them in this talk, but I simply don't have enough time to provide the same level of detail that each KEP provides on its own. For example, KEP 2004, the topology aware routing KEP, describes a lot of the different algorithms we considered and provides diagrams to show exactly how they might work. I don't have enough time to cover every algorithm we considered, but I hope I can still give a decent overview in this talk.

So with that said, let's talk about the first KEP: building consensus around a small set of topology labels. In this KEP, we wanted to standardize on two labels, one for region (topology.kubernetes.io/region) and one for zone (topology.kubernetes.io/zone). We wanted to define region and zone as hierarchical: a zone cannot be spread across multiple regions, but a region can contain multiple zones. We also wanted to state that these labels should be considered immutable. This is a very big deal and something that should dramatically simplify any implementation working with these labels. And finally, we've left the door open for a third or maybe even fourth key to be introduced in the future, but for right now, we want to keep it very simple and straightforward.
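Concretely, these are labels that cloud providers and the kubelet already apply to Nodes today. A node in one zone of a regional cluster might carry labels like these (the values are illustrative):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-a-1
  labels:
    # Hierarchical: every zone belongs to exactly one region.
    topology.kubernetes.io/region: us-west1
    topology.kubernetes.io/zone: us-west1-a
```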
Our second goal was likely the most significant one: developing a simple approach that covers the most common use cases as automatically as possible. An automated approach meant we needed a good algorithm. And not just a good algorithm, but a good way of evaluating lots of potential algorithms, so we could actually understand what a good algorithm was. So Rick Chen, an awesome Google intern, developed an open source tool called Kubernetes Topology Simulator. This let us plug in any number of algorithms and run them against millions of inputs to see just how well or poorly each one performed. We'll cover a bit more about the simulator later on.

Let's talk about the algorithm we ended up deciding on. And of course, this algorithm can still change and can still be improved. I've tried to simplify it into just a couple of sentences: once a service has enough endpoints, subset those endpoint slices by zone. If a zone doesn't have enough endpoints, contribute some from a zone that does.

So what does it mean for a service to have enough endpoints? Below a certain threshold, this approach results in a lot of churn and potential for overloaded endpoints. Our testing showed that approximately three times the number of zones was a reasonable starting point, so if you had three zones, approximately nine endpoints was a good starting point. Anything below that risks significant overload or significant churn. As an example, if you have three endpoints and three zones, you could theoretically distribute them across those zones evenly. But what if one zone is slightly larger than the others? Or what if all the traffic is coming from one zone instead of all of them? You have a significant risk of overload. And if anything changes along the way, you have a significant risk of churn, where everything has to change across all of these zones and all of these slices, and that gets expensive. So, at least initially, we're proposing to start subsetting once we have a little bit of room to work with. And not precisely at nine in the three-zone case: we need to add a little padding both above and below to ensure we're not constantly flapping back and forth between approaches. So the upper limit for switching between approaches might be closer to 12, and the lower limit closer to six.

Once the service has enough endpoints, we want to subset the endpoint slices by zone. We're introducing a new label that can be set on endpoint slices, and kube-proxy will be updated to read that label and only care about endpoint slices for its specific zone. Of course, for backwards compatibility, kube-proxy will continue to watch endpoint slices that don't have this label at all. But going forward, we imagine the vast majority of endpoint slices will get a label like this, indicating which zone they're intended for.

So what do we mean by a zone not having enough endpoints? In this case, we're talking about the number of expected endpoints, which, at least initially, will be based on the proportion of CPU cores in each zone. Let's take an example: a cluster that has 12 total endpoints and six total CPU cores. Three of those cores, half of them, happen to live in zone A, so we expect half of the endpoints, six of them, to be delivered to zone A. Similarly, one third of the cores live in zone B, so we expect one third of the endpoints, four, to be delivered there, and so on down the list.
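To sketch that proportion math, here's a small illustration of my own, not code from the KEP, showing how expected endpoints per zone fall out of CPU core counts:

```go
package main

import "fmt"

// expectedEndpoints estimates how many of a service's endpoints each zone
// should receive, proportional to that zone's share of CPU cores. This
// mirrors the example from the talk; the real KEP logic has more nuance.
func expectedEndpoints(totalEndpoints int, coresByZone map[string]int) map[string]float64 {
	totalCores := 0
	for _, cores := range coresByZone {
		totalCores += cores
	}
	expected := make(map[string]float64)
	for zone, cores := range coresByZone {
		expected[zone] = float64(totalEndpoints) * float64(cores) / float64(totalCores)
	}
	return expected
}

func main() {
	// 12 endpoints, 6 cores: zone A has half the cores, B a third, C a sixth.
	fmt.Println(expectedEndpoints(12, map[string]int{"a": 3, "b": 2, "c": 1}))
	// Output: map[a:6 b:4 c:2]
}
```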
That's relatively straightforward, but what happens if this isn't the case, if there's a zone that doesn't have enough endpoints? Well, to minimize churn, we're only going to redistribute endpoints once a threshold has been passed. We define an acceptable overload threshold of, say, 25%. And what does overload mean? In this context, we're defining it as the amount of extra traffic some endpoints might receive. So if we expected 10 endpoints in a zone and we only had eight, each of those eight endpoints would be 25% overloaded: they could get 25% more traffic than they would if everything were evenly distributed. Seven endpoints, on the other hand, would put us past that threshold, because now each of those endpoints could be getting about 43% more traffic than expected, which is something we're trying to avoid. Once we reach that level of difference, it's time to start redistributing endpoints. Before then, we rely on a kind of lazy rebalancing: when a new endpoint comes in, it's automatically delivered wherever it makes the most sense.

Now, there's a lot more to this algorithm than I can explain in this talk, so if you're interested, we've included a lot more diagrams and examples in the KEP itself. But you may be trying to picture exactly how this might work, so I thought an example would be helpful. In this case, we've got nine total pods: four in zone A, three in zone B, and two in zone C. With our original approach, all of those pods would be bunched into a single endpoint slice. With this automatic topology aware routing approach, what we're suggesting instead is that three endpoints go into a slice for zone A, and the same for zone B. For zone C, we only have two endpoints from zone C that we can contribute, but we have an extra one we can take from zone A, so we do that. That's a very simplified picture of how we imagine this kind of subsetting would work, and as I mentioned, there are a lot more examples in the KEP covering the large number of potential edge cases we could run into here.

And finally, we want to deliver only the endpoints that are closest to each instance of kube-proxy, instead of delivering all endpoints everywhere. We've introduced two new labels, one for zone and one for region, to indicate where an endpoint slice should be delivered. kube-proxy will be updated to watch endpoint slices with those labels, and it will specifically pay no attention to endpoint slices labeled for another zone or another region.

So in summary, there are three main things happening here. One, two official topology labels, zone and region, are now well-defined. Two, endpoint slices can be delivered to specific zones or regions. And three, users can opt into automatic topology aware routing on a per-service basis. This will likely start as an annotation, but could be extended in the future. And maybe, if we're lucky, we'll get this done well enough that we could even default it to being enabled, and we wouldn't need any configuration at all.
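Going back to the nine-pod example, the slice delivered to zone C might look something like the following. Keep in mind the delivery label key here is a placeholder I made up for illustration; the actual label names are defined in the KEPs:

```yaml
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: my-service-zone-c
  labels:
    kubernetes.io/service-name: my-service
    # Placeholder delivery label: only kube-proxy instances in zone
    # us-west1-c would watch slices carrying this label.
    example.kubernetes.io/for-zone: us-west1-c
addressType: IPv4
endpoints:
  # The two endpoints that actually live in zone C...
  - addresses: ["10.0.3.4"]
    topology:
      topology.kubernetes.io/zone: us-west1-c
  - addresses: ["10.0.3.5"]
    topology:
      topology.kubernetes.io/zone: us-west1-c
  # ...plus one contributed from zone A to bring the slice up to three.
  - addresses: ["10.0.1.6"]
    topology:
      topology.kubernetes.io/zone: us-west1-a
```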
So with all that background, now that you understand how this approach works, it would be helpful to see how our simulation actually scored it. To understand the results, it helps to know what we were evaluating algorithms on. We gave a high weight, 45% of the total score, to the percent of traffic that stayed in zone. Next, we cared about overload, the proportion of extra traffic that any single endpoint might expect to receive in a simulation. Our max overload, the worst case scenario, the single endpoint that got the most overloaded across the 39 million inputs we tested every algorithm against, got a 20% contribution to the total score. We also wanted to make sure that our average case was quite low, so we gave a 20% weight to our average overload as well. And finally, we cared about the proportion of new endpoint slices that would be required with an approach, so we gave that a weight of 15%.

Our automatic approach actually scored reasonably well here. It managed to keep 84% of traffic in zone, which is significantly better than the original approach. You would expect that, because the original approach wasn't even trying to keep traffic in zone. But we thought 84% was quite good, because these inputs included a lot of edge cases, the kind where you'd have hundreds of endpoints in one zone and only one in another. Given those highly lopsided inputs, that's a really impressive result, and what we observed with more balanced, more normal inputs was significantly better than 84%.

Next, overload. Our overload was quite good: on average it was around 1.7%, which we thought was a reasonable level. Although that's not quite as good as the original approach, it still felt like a reasonable compromise to get that very high level of in-zone traffic. And again, overload performed significantly better when we weren't dealing with all the wild edge cases in our inputs.

And finally, extra slices. The new automatic approach required approximately 37% more endpoint slices than the original approach. But even so, that didn't seem like a significant issue, and it definitely seemed worth all the advantages we got from keeping traffic in zone.

So overall, our simulator gave this automatic approach an 87% score, compared to a 73% score for the original approach. We recognize there's still room for improvement here, but we think what we have already is a significant step in the right direction.

And with that, I want to talk just a little bit about our long-term vision, because really, we're still in a pre-alpha stage. We're very early in this process, and we could still use feedback and a lot of work. Our first priority is to get this into alpha and get some feedback; we're hoping to hit that target in Kubernetes 1.21.

We also have a number of open questions. Obviously: how can we improve this approach? If you have ideas or feedback, don't hesitate to get in touch, comment on the KEPs, come to SIG Network, whatever it happens to be. We'd love to hear what you're thinking. We're also interested in whether we can use similar patterns for DNS. Is there additional configuration we need to provide for, say, power users or advanced use cases? And on the other side, can we eventually make this good enough that it's just the default, with no configuration needed at all?

Some longer-term, bigger-picture questions: how can we implement topology aware routing with real-time feedback? This is really the goal. If we had real-time feedback, we could detect overloaded endpoints and route traffic elsewhere automatically. It would be great to include this inside Kubernetes, but it requires a lot of changes all the way up and down the stack, and it's a bigger project that probably won't be done in the next couple of release cycles, at least.
But we're still thinking about it, and we're trying to picture ways we could actually achieve it. And finally, can we do any of this redistribution of traffic without updating endpoint slices on each change? Right now we're highly reliant on updating endpoint slices to describe what kube-proxy should do and where it should route traffic. That's fine, but it seems like there's probably room to add local optimizations, maybe in kube-proxy itself, that could require fewer updates to endpoint slices.

This is just a sampling of the things we're thinking about. There's plenty more, and many of them are addressed in the KEPs. So if you're interested, I definitely encourage you to get involved, ask questions, and provide feedback. We'd love to hear from you.

Thanks so much for coming. My name is Rob Scott. You can find me on Slack, GitHub, Twitter, etc. If you have any thoughts about how we can make this even better, I'd love to hear from you. Thanks so much.