Hey everybody, I'm Aditya Peruppa. I'm a software engineer at Aviatrix and an undergrad at UIUC. I'm Kamal Lee, a PhD student at UIUC. We're presenting Slate, our system that optimizes latency and cost in service meshes.

First, an introduction to the problem. When you have multiple clusters spread across multiple regions, the load between them can fluctuate drastically: a cluster might see high load at some times of day and relatively low load at others. The autoscaler takes care of these load imbalances, but usually on a timescale of minutes. So there are moments when cluster east is heavily overloaded while cluster west is not, and other moments when the opposite is true. Actually understanding and optimizing latency in these scenarios is hard.

The other problem we're attempting to solve is egress cost optimization. In some topologies, calls to a remote cluster or database are inevitable: the call has to go over to another region or another cluster. When you're making those calls, you usually want to send as little data as possible, because the data is traveling a long distance and cloud providers charge for data out, so-called egress fees. Consider a scenario with two clusters, each running a gateway, a frontend, and MP, a metrics-processor service, where MP calls DB. The call trees are very simple, but for some reason the database in US west is unavailable. This might happen because of GDPR or other data regulations, or because the database is simply down, things like that. In this case, the call path with conventional locality load balancing would be the red path: frontend, metrics processor, and then the database. Here the size of each arrow represents how much data is sent with that call. The issue is that we're moving a very large amount of data across clusters, which usually results in a lot of egress fees and a lot of latency. The optimal path for this call tree would actually be frontend to MP in US east, and then to the database: you make the large call locally, so you don't spend that much on egress.

Our two baselines here are multi-cluster load balancing, where you distribute requests between clusters evenly, round-robin, and locality-aware load balancing, which keeps traffic local, the red path. Our system improves both the bandwidth expenditure and the average latency for this scenario by a lot. In short, Slate adapts to the current load conditions and optimally partitions traffic between clusters in a way that results in the lowest predicted latency.

To go over the system architecture a bit: this is a multi-cluster deployment with ingress gateways and east-west gateways, and essentially we have WebAssembly plugins installed on every sidecar. Each plugin streams its current load, in the form of requests per second, and latency; it traces certain requests, which all sidecars agree on via a hash of the trace ID and a configurable sampling rate, and the load and latency are streamed to the cluster controller.
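To make that reporting concrete: the real filter is a proxy-wasm module running inside Envoy, but a minimal Python sketch of the same logic could look like this, assuming a SHA-256 hash for the shared sampling decision and a JSON report payload (both are illustrative choices on our part, not necessarily what Slate ships):

```python
import hashlib
import json
import time

SAMPLE_PERCENT = 5  # hypothetical sampling rate; configurable in the real plugin

def should_trace(trace_id: str) -> bool:
    """Deterministic sampling: every sidecar hashes the trace ID the same
    way, so they all agree on which requests to trace end to end."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return digest[0] % 100 < SAMPLE_PERCENT

class LoadReporter:
    """Counts requests and ships a periodic load/latency report
    to the cluster controller (payload shape is illustrative)."""

    def __init__(self, service: str):
        self.service = service
        self.request_count = 0
        self.latencies_ms = []

    def on_request_end(self, trace_id: str, latency_ms: float):
        self.request_count += 1
        if should_trace(trace_id):
            self.latencies_ms.append(latency_ms)

    def flush(self) -> str:
        """Called on a one-second timer; returns the JSON that would
        be streamed to the cluster controller."""
        report = {
            "service": self.service,
            "rps": self.request_count,
            "latencies_ms": self.latencies_ms,
            "ts": time.time(),
        }
        self.request_count = 0
        self.latencies_ms = []
        return json.dumps(report)
```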
The cluster controller aggregates all these service-level metrics and streams them to the global controller, which has essentially a god's-eye view of every cluster: the load on every service in every cluster and the latency each is incurring right now. Additionally, the cluster controllers routinely ping each other through the east-west gateways, which lets us gauge the network conditions between clusters, because you could be traversing the public internet. With all of this information, the global controller optimizes latency by pushing out rules describing the optimal traffic partitioning.

To delve into our sidecar plugin: the WebAssembly plugin is an HTTP filter built on the proxy-wasm ABI. Every second, or every few seconds, it reports the current load and the latency of every traced request, and we use Jaeger headers to figure out which requests we're tracing and what the parents of the current traced requests are. Building on proxy-wasm lets us run this extension without having to recompile Envoy, which is really nice. The cluster controller aggregates all these metrics and serves as a relay to the global controller. It also enforces the policy given by the global controller by editing VirtualServices, and it handles the inter-cluster pings mentioned earlier.

In the global controller, to calculate the optimal routing rule, we use a linear programming technique that mathematically minimizes latency. For this example we'll use the Bookinfo application; what you see is the intermediate representation inside the Slate optimizer. In this setup there are two clusters, yellow and red. The edges represent the flow of requests, and the numbers on the edges are the number of requests flowing from source to destination. You can see the productpage service, the details service, and the reviews and ratings services. If we have roughly equal load in each cluster, there's no reason to reroute requests from the local cluster to remote clusters. This is the ideal situation, and most production multi-cluster setups assume it's the common case: no cross-cluster routing.

But what if there's a severe load imbalance across clusters? Here we see 100 requests per second arriving at the yellow cluster and 1,000 at the red cluster. If the red cluster doesn't have enough resources to process those 1,000 requests, what we want to do is pay the network latency and send some of the requests to the underloaded cluster, yellow here, to minimize the latency of each request. If you just stick to a local routing policy, you will never do that. In Slate, the linear program digests the load-to-latency relationship reported by the sidecars through the cluster controllers and uses a polynomial function to predict latency from a given load. We fit this load-to-latency model for every service individually so that we can predict with high accuracy.
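As a concrete sketch of that per-service model: assuming the polynomial is fit with ordinary least squares at degree 2 (the actual degree and fitting method in Slate may differ), the fit could look like this, with made-up sample points:

```python
import numpy as np

# Observed (rps, latency_ms) samples for one service, as aggregated
# by the cluster controller. Values are illustrative, not measured.
rps = np.array([50, 100, 200, 400, 800, 1000])
latency_ms = np.array([8, 9, 11, 16, 40, 75])

# Fit a per-service polynomial so latency can be predicted from load.
coeffs = np.polyfit(rps, latency_ms, deg=2)
predict = np.poly1d(coeffs)

print(predict(600))  # predicted latency at 600 rps
```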
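And a toy version of the optimization step. The real system solves a linear program across every service and cluster; this sketch collapses it to a single service in two clusters and finds the best reroute fraction by direct search over the fitted models, with an illustrative cross-cluster RTT standing in for the controllers' ping measurements:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Fitted load->latency models for the same service in each cluster
# (coefficients are illustrative, as in the previous sketch).
lat_red = np.poly1d(np.polyfit([100, 500, 1000], [10, 30, 120], 2))
lat_yellow = np.poly1d(np.polyfit([100, 500, 1000], [10, 30, 120], 2))

CROSS_CLUSTER_RTT_MS = 35   # stand-in for the east-west ping measurement
RED_LOAD, YELLOW_LOAD = 1000, 100  # current rps per cluster

def avg_latency(frac_rerouted: float) -> float:
    """Average per-request latency if `frac_rerouted` of the red
    cluster's load is sent to the yellow cluster instead."""
    moved = frac_rerouted * RED_LOAD
    red = RED_LOAD - moved
    yellow = YELLOW_LOAD + moved
    total = (red * lat_red(red)
             + YELLOW_LOAD * lat_yellow(yellow)
             + moved * (lat_yellow(yellow) + CROSS_CLUSTER_RTT_MS))
    return total / (RED_LOAD + YELLOW_LOAD)

best = minimize_scalar(avg_latency, bounds=(0.0, 1.0), method="bounded")
print(f"reroute {best.x:.0%} of red's traffic to yellow")
```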
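To illustrate the enforcement step, here is a minimal sketch of turning the optimizer's per-destination request counts into VirtualService route weights. The subset-per-cluster layout and the resource names are our assumptions for illustration, not Slate's exact output:

```python
def to_weights(counts: dict) -> dict:
    """Turn absolute per-destination request counts into integer
    percentages that sum to 100 (naive rounding, drift patched
    onto the largest destination)."""
    total = sum(counts.values())
    weights = {dst: round(100 * c / total) for dst, c in counts.items()}
    drift = 100 - sum(weights.values())
    weights[max(weights, key=weights.get)] += drift
    return weights

# e.g. the optimizer said: keep 90 rps local, reroute 10 rps to cluster-0
weights = to_weights({"cluster-1": 90, "cluster-0": 10})

# Hypothetical VirtualService the cluster controller would apply.
virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "productpage-split"},
    "spec": {
        "hosts": ["productpage"],
        "http": [{
            "route": [
                {"destination": {"host": "productpage", "subset": subset},
                 "weight": w}
                for subset, w in weights.items()
            ]
        }],
    },
}
```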
Now, by running this linear programming model, we see that from the product page in the red cluster we want to reroute roughly 25% of the requests to the reviews service in the remote cluster, because that service is starting to overload, and the model calculates that this routing policy will minimize latency. We do a similar thing from the reviews service in the red cluster to the ratings service in the yellow cluster.

This is example output from our prototype system. Here we send load to the west cluster and the east cluster separately. After a profiling phase, once we decide we have enough traces to train the polynomial function, we start running the linear programming model and come up with the optimal routing rule. Here, in cluster 1, we want to reroute 10 requests per second from cluster 1 to cluster 0, from the ingress gateway to the product page. We translate this number into a percentage and then install the routing rule on every single service in all clusters.

As for where this project might go, there are some interesting future directions in this area. We think the current model of reporting metrics, latency, and load is not very robust, since we're using just HTTP filters in WebAssembly. A better alternative would be to move to Wasm services and do very little in the request path, which I think would improve the performance of the system a lot. Another idea that some students in our group are working on is per-request routing, where you can prioritize certain requests over others. Imagine a request tagged with a certain header: it would get priority throughout the service mesh and end up with the lowest latency, as opposed to all of the other calls. The final direction I want to offer is call graph prediction. Essentially, when you have these large microservice deployments, the call graph can be dynamic. Even in an application as simple as Bookinfo there's still a little bit of variance, since one version of reviews does not call ratings while the rest do. That isn't very predictable, but there are behaviors like a certain header tag resulting in a different call graph, and general call graph classification: given a certain method and path at the ingress gateway, can you figure out which upstream services will be called? That's pretty useful for things like cost optimization, where you need to know the call graph ahead of time, and it's useful for other things as well. So that's an area of future research that we envision could be very fruitful. Thank you for listening.