Hello, everyone. I'm Gangmuk. I'm a second-year Ph.D. student at UIUC, and we are a networking research group from UIUC. I'm looking for a '24 summer internship. My area is performance optimization for distributed applications, and she's into machine learning. Hi, I'm Aditya. I'm a second-year undergrad at UIUC and also a software engineer at Aviatrix, and I also work on this stuff.

Okay, so let's talk about our project, SLATE: service-layer traffic engineering. Microservice applications are very complex, which makes it really challenging to optimize their performance. One service can call multiple dependent services, and one application can consist of multiple tiers. So we raised the question: what would good request routing look like for this kind of complex microservice application? We need to consider multiple factors. Here's a setup: there are services A and B, and we have four different replicas of B. The first thing we need to consider is network RTT. Different geo-distributed regions incur different network RTTs; even within the same region, locality matters; and there can be network congestion. We also need to consider cost, for example bandwidth cost: you pay egress cost when you transfer data from your local region to a remote region. And there can be different request types: different requests might lead to different resource usage, or even different call graphs. Finally, this is not a simple single-hop client-server relationship but multiple layers, so we need to consider multi-hop optimization.

Before going into our own system, we want to share some interesting results from a survey we conducted in the Istio community about cross-cluster routing. The first question was: what kinds of load balancing do you use for multi-cluster services? A multi-cluster service is simply an individual service deployed in multiple clusters. The answers were locality-weighted and locality failover, which we'll talk about later in the presentation, plus more common load balancing policies like round robin, least response time, least request, and so on. The second question was: would it be useful if we provided a way to optimize cross-cluster routing for you? Most people said yes, and the reasons were to improve request latency, cloud bandwidth cost, load balancing, compute cost, and so on. So you can see there's a huge gap between what they're doing right now and what they want. The complete survey is available in our repo.

We want to fill this gap. This is our problem statement: request routing in microservice applications is a much more subtle problem than today's load balancing and traffic management address. What we want to do is globally optimize request routing based on the admin's intent. Some people might say, "I thought this problem was solved by fancy autoscalers, load balancers, and traffic management." Autoscalers provision resources, which can partially solve the problem; however, they don't control network traffic, which is request routing. They're also really slow: scaling takes seconds to minutes because of the sequence of steps involved, while request routing is a millisecond-timescale decision. Load balancers make a wrong assumption, namely that replicas are identical; they don't consider cost, they don't consider multi-hop paths, and so on. How about traffic management?
Service meshes like Istio provide their own traffic management schemes, like locality failover: if a service fails in one cluster, requests are routed to another cluster. However, this is a simple heuristic, and it's not optimal. What about advanced traffic management? Companies like Google or Meta provide their own solutions for traffic management. They do consider network RTT and have some global knowledge to optimize requests. However, you can often find they're suboptimal. We're not going to go into the details in this presentation, but they still don't consider network cost, they don't consider different request types, and there's no multi-hop consideration. So none of them solves for the factors we presented.

Our project, service-layer traffic engineering, is inspired by network traffic engineering, which refers to the optimization of routing to improve throughput, load balancing, or latency for different classes of traffic. We are lifting this concept from the network layer, L3, up to the service layer to solve the request routing problem.

These are our design goals. First, we want to react to changes in load and latency fast. Second, we want to make it easy to express different operator intents, like cost, availability, or latency. Third, we want instant pluggability into existing deployments. The project is still under active development.

So this is the SLATE architecture. Say there's an application with three services. We have our own data plane, the SLATE wasm plugin, plus a cluster controller; that's one cluster, and we have multiple of them, in this case two. On top, we have a global controller. The global controller receives the admin's intent, like dollars per millisecond, an interesting metric we're going to talk about later in the presentation. The data plane sends request start and end times, plus load information like RPS, to the global controller through the cluster controller. We run an optimization inside the global controller, and the output, optimized routing rules, is pushed down to the data plane through the cluster controller.

So what's happening inside the global controller? The global controller needs load information, cost information, and latency information. Latency we model as a function of load: right now we use linear regression, where the x-axis is load and the y-axis is compute time for a particular service. We do this for every service in every cluster. You could use a simpler function, like a step function, or an even more complex one, like a machine learning model. We express everything in one problem formulation, feed it into the Gurobi framework, and use linear programming to output the optimized routing rules. Here's an example routing rule: for the flow from A to B in the west cluster, we route 90% of the load locally within the west cluster and 10% to the remote cluster, the east cluster.
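To make the formulation concrete, here is a minimal sketch of the kind of linear program the global controller could solve for a single A-to-B flow across two clusters, using gurobipy since the talk mentions Gurobi. All numbers (RTTs, the latency model, capacities, the egress price, the latency-value intent) are hypothetical, and the real formulation is richer, covering every service, cluster, and request class:

```python
# Minimal sketch of the routing LP for one flow (service A -> service B)
# across two clusters. All numbers are hypothetical; SLATE fits per-service
# latency models from live telemetry and covers the full call graph.
import gurobipy as gp
from gurobipy import GRB

load = {"west": 1000.0, "east": 10.0}        # incoming RPS at A per cluster
capacity = {"west": 600.0, "east": 2000.0}   # rough capacity of B per cluster
rtt = {("west", "west"): 0.001, ("east", "east"): 0.001,
       ("west", "east"): 0.030, ("east", "west"): 0.030}   # seconds
compute = 0.005                              # intercept of B's latency model
call_size_gb = 1e-6                          # 1 KB payload per request
egress_price = 0.09                          # $/GB between regions
latency_value = 10.0                         # admin intent: $ per second of latency

m = gp.Model("slate-sketch")
# x[src, dst] = RPS routed from A in cluster `src` to B in cluster `dst`.
x = m.addVars(["west", "east"], ["west", "east"], lb=0.0, name="x")
for src in ("west", "east"):                 # every request must go somewhere
    m.addConstr(x.sum(src, "*") == load[src])
for dst in ("west", "east"):                 # don't overload a replica
    m.addConstr(x.sum("*", dst) <= capacity[dst])

# Latency term: RTT plus (simplified) constant compute time per request.
# A faithful model makes compute depend on the load x.sum('*', dst), which
# needs piecewise linearization; omitted here to keep the sketch linear.
latency = gp.quicksum(x[s, d] * (rtt[s, d] + compute)
                      for s in ("west", "east") for d in ("west", "east"))
egress = gp.quicksum(x[s, d] * call_size_gb * egress_price
                     for s in ("west", "east")
                     for d in ("west", "east") if s != d)
m.setObjective(latency_value * latency + egress, GRB.MINIMIZE)
m.optimize()

for s in ("west", "east"):
    for d in ("west", "east"):
        print(f"{s} -> {d}: {x[s, d].X / load[s]:.0%}")
# With these numbers, the overloaded west cluster keeps 60% local and
# spills 40% east, the same shape as the 90/10 example above.
```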
Now I'm going to talk about the use cases where our system performs best. The first one is instantly reacting to fluctuations in load. The second one is optimizing egress cost in dynamic service topologies that have multiple call paths. And then we have handling different request types differently to optimize traffic.

Okay, so the first case. Imagine a setup where you have two clusters, perfectly replicated, in different regions, so they're far from each other. Say traffic is stable between them, and all of a sudden there's a burst in US West. The first thing that happens is that product page gets overloaded (this is the default Istio Bookinfo application). When product page gets overloaded, it experiences latency. Because you're going to experience a lot of latency locally during a big spike in load, ideally you want to preemptively send some of that load off to US East so you don't incur that tail latency. So there's a tradeoff here: do you incur the network RTT plus the remote compute time, or the local compute time? This tradeoff is not trivial to reason about, but our system optimizes it based on the load conditions and the load-to-latency model, which it keeps fitting for every service in real time.

So the real question is: given different load conditions, how should we distribute requests across replicas in different regions? Istio says route everything locally until the local replica fails, but we say it's a little more nuanced than that. We ran simulations against two baselines. The first is local load balancing, where you only route requests within the cluster; it doesn't matter whether the local cluster is failing, you always route locally. The second is multi-cluster load balancing, where you split traffic between clusters 50-50; this is naive, but it's the other baseline. And then there's our system, where we have a notion of utilization in the current cluster: if we have capacity, we send requests to the local cluster, and once we've reached max utilization, we spill over to the remote cluster. In the left graph, which is a CDF, you can see that for the last 1% of requests, our tail latency is much lower than both local load balancing and multi-cluster load balancing.

So, the dollar-per-millisecond metric Gangmuk mentioned: what does that actually mean? It's just a way to express what the value of latency is in your case. If latency is super important, the value of this metric is high. Take this topology: the right cluster has a thousand requests and the left cluster has ten. If one is much more overloaded than the other and you don't care about cost, you want to route as many requests as you can to the remote cluster; in this case it's roughly 50-50, because latency is super important. As this value goes down, you route fewer and fewer requests away, until you only care about cost, don't care about latency at all, and route everything locally, just as you would by default. So this is a nice way to express, as a spectrum, how much you value latency in your case.
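A toy sketch of that spillover policy, purely to illustrate the decision; the capacity number is made up, and in the real system this threshold behavior emerges from the optimization rather than from an explicit rule:

```python
# Toy spillover policy: stay local while there's headroom, spill the excess
# once local utilization hits the max. The capacity is a made-up number;
# SLATE derives the equivalent behavior from its load-to-latency models.
def split(local_rps: float, max_rps: float) -> tuple[float, float]:
    """Return (fraction routed locally, fraction spilled to the remote cluster)."""
    if local_rps <= max_rps:
        return 1.0, 0.0                    # capacity left: keep it local
    spill = (local_rps - max_rps) / local_rps
    return 1.0 - spill, spill              # offload only the excess

# Burst in US West: 1000 RPS against ~600 RPS of local capacity.
print(split(1000, 600))    # -> (0.6, 0.4)
# The baselines from the simulation, for contrast:
#   local LB:         always (1.0, 0.0), even past capacity -> long tail
#   multi-cluster LB: always (0.5, 0.5), even with local headroom -> extra RTT
```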
The second use case is minimizing egress cost. The question here is: if we have multiple ways to make a cut between regions, between services, how do we choose? Same example: we have two geo-distributed clusters, and let's say one service goes down; ratings all of a sudden dies. What happens is that requests flow from the ingress to product page to reviews, and then it's like, oh, there's no ratings locally, so I need to go to the next available one, which is in US East. Now this is probably fine, except for the fact that you can make the cut in other places. You don't have to make it at reviews; it just happens there because of greedy load balancing. The issue is when the call sizes across these cuts differ. Say the call size from reviews to ratings is 100 kilobytes, the biggest of all of them. You have to route from reviews in US West to ratings in US East because you have no other option: you chose to stay local the entire time. If you had some foresight and knew that you would have to route away eventually, you should pick the cut that carries the least data, to minimize your egress cost, because cloud providers usually charge for egress, for data out. Egress cost is priced in dollars per gigabyte, so it depends on the amount of data you're sending. So the correct thing to do here is to route from product page in US West to reviews in US East, because that edge carries the least amount of data.

We ran this in an actual multi-cluster environment against Istio multi-cluster load balancing and locality load balancing; because they do something pretty simple, they didn't do as well. With foresight like ours, you save a lot of money on bandwidth cost, and on latency too, because you're sending less data over large portions of the earth.
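To illustrate the "cheapest cut" idea on the Bookinfo chain: given per-request call sizes on each edge, choosing where to cross regions reduces to a minimum over the candidate cuts. Only the 100 KB reviews-to-ratings figure comes from the talk; the other numbers are hypothetical, and the real system folds this choice into the global optimization rather than running a standalone search:

```python
# Toy egress-aware cut selection for the productpage -> reviews -> ratings
# chain, once ratings has died in US West. Call sizes are hypothetical
# except reviews -> ratings (100 KB, from the talk).
EGRESS_PRICE = 0.09   # $/GB, a typical inter-region rate (illustrative)

# Candidate edges at which to cross from US West to US East:
# (caller, callee, bytes sent per request along that edge).
cuts = [
    ("productpage", "reviews", 10_000),   # hypothetical: 10 KB
    ("reviews", "ratings", 100_000),      # 100 KB, the biggest call size
]

caller, callee, size = min(cuts, key=lambda edge: edge[2])
print(f"cut at {caller} -> {callee}: {size} bytes/request, "
      f"~${size / 1e9 * EGRESS_PRICE:.2e} egress per request")
# -> cut at productpage -> reviews, matching the talk's answer: send
#    productpage's calls to reviews in US East instead of letting reviews
#    call ratings cross-region.
```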
The third and final use case: when you have different request classes, how should you treat them in a multi-cluster setting? Say you have two request types with the exact same call graph, request types one and two, but with different characteristics. Request type one is very compute-heavy, meaning it triggers a lot of CPU cycles, but it doesn't send a lot of data over the network, so it's pretty network-light. Request type two is the exact opposite: a compute-light request type that triggers not that much compute, but a very data-heavy job. So what do you do when there's overload in US West? The correct thing is to start routing the compute-heavy stuff away first, so your local cluster gets relieved, and to let the network-heavy requests take advantage of the local network in US West. Classifying requests this way is hard, and that's something we aim to do as well.

As for status and future plans, there are currently some challenges in solving these problems. Latency is hard to model; it's subject to all sorts of conditions, like resource interference. And call-graph prediction at the per-request level is, to our knowledge, something that has not been done before. It's pretty hard, because all sorts of things in a request can trigger different call graphs, and there are things within the system, like caching. There's also making the system more compatible: right now it only supports two clusters, and we want to expand it to as many clusters as anybody wants. Also, one of our primary goals is to make the system pluggable: it uses WebAssembly plugins, so you can do one kubectl apply and it's installed in your cluster. We want that to happen. It's not fully there yet, but that's a goal for us.

And we have a demo. All right, we're going to show the demo for use case one, latency optimization; it'll take three to four minutes. Here, this is our global controller, and for now we send load only to the west cluster. SLATE requires a profiling phase, which takes some time; in this case we preloaded the latency model because of the time limitation. Right now we're not using SLATE, and you'll see the latency when everything is routed locally: 300 milliseconds average latency and three seconds tail latency. This is 100% local routing without SLATE. Now we send some load to the east cluster, and SLATE is profiling the east cluster; it'll be done pretty soon. Okay, now SLATE is on, and you see the numbers here: cluster zero to cluster zero, 100%, and cluster zero to cluster one, 0%, and vice versa. That means 100% local routing for now, because there's no load on either cluster. We're leveraging Istio's VirtualService; we'll show that again later. Now, because we just turned on SLATE, we send the same amount of load to the west cluster, and you see the numbers changing in real time every few seconds. Here you see 77% from cluster zero to cluster zero, which is local routing, and the rest, around 20%, to the remote cluster, cluster one. You can see the numbers changing dynamically every four to five seconds, and in a few seconds you'll see lower latency here: we're offloading the optimal number of requests to the remote cluster. It takes some time. Before we look at the latency, let me show you how we express the routing rules in a VirtualService. You see 65% from the west cluster to the west cluster; this is the connection between the ingress gateway and product page. From the west cluster to the east cluster is 35%, the remaining percentage. This is how we configure the routing rules. Now you can compare the latency: we achieve 52 milliseconds average latency, which was 300 milliseconds before, and 176 milliseconds tail latency, much lower than local routing. And when there's no load, SLATE falls back to 100% local routing automatically; you can see the configuration expressed in the VirtualService. That's how it works. All right, thank you. That's it. We'd appreciate any feedback.
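For reference, the 65/35 split shown in the demo corresponds to a standard Istio VirtualService with weighted route destinations. Here is a minimal sketch of generating one; the helper function and the subset names are illustrative, not SLATE's actual controller output:

```python
# Sketch: render a routing rule as an Istio VirtualService with weighted
# destinations, like the 65/35 west/east split shown in the demo.
# Subset names ("west"/"east") are illustrative; a real setup also needs
# a DestinationRule defining them.
import yaml  # PyYAML

def virtual_service(service: str, weights: dict[str, int]) -> dict:
    """Build a VirtualService spec; weights are percentages summing to 100."""
    assert sum(weights.values()) == 100
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": service},
        "spec": {
            "hosts": [service],
            "http": [{
                "route": [
                    {"destination": {"host": service, "subset": subset},
                     "weight": weight}
                    for subset, weight in weights.items()
                ],
            }],
        },
    }

# The ingress-gateway -> productpage split from the demo.
print(yaml.safe_dump(virtual_service("productpage", {"west": 65, "east": 35}),
                     sort_keys=False))
```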
Hey guys, great talk. I want to make sure I understand the control flow. Traffic is flowing through Envoy proxies via Istio, and that generates telemetry metadata that SLATE observes to see latency changes; then SLATE applies changes through the Istio API to respond to that change in telemetry latency?

Yes, except it doesn't actually scrape Envoy metrics specifically, because we found that going through those is a little too slow for instant reactions. We send our own metrics: we created a wasm plugin that we plug into every proxy that wants to participate, and we emit our own metrics over our own API.

And you push them rather than pull them?

Yeah, exactly, we push them.

Okay. So, thinking of how to tighten that control loop: is there something that could be done within the Istio API space or the Envoy API space that would let you take the intent you've already expressed and have it be much more responsive within, say, a single Envoy proxy, so it can make those decisions without requiring that loop out of Envoy, into a control plane, and back?

It's kind of hard to say, because it's nothing the Istio or Envoy API can do: you need the global view of all the loads in all the clusters to make the decision. We tried looking at doing something local, but whenever you do something local, you lose out on the global view, and we decided to stick with a global view because you need it for optimal routing. But the APIs are great. It was really easy to plug into the Wasm plugin API in Istio, and Wasm in general was really easy to work with. So the APIs are good.

Cool, thank you. Great talk, I really enjoyed it.

Thank you.

Do you have any plans to extend your traffic engineering approach to DNS, so that a request lands on the right ingress in the first place?

For now, we're not planning to expand to the DNS layer, but that's definitely something we could try. We want to solve this problem at the service layer first, in an Envoy and Istio context, and once that's done, we'd like to expand to other layers too.

Sounds great.

How do you run your global control plane in a highly available fashion?

The global controller is not highly available in the demo. Ideally, it would run geo-distributed somehow: logically centralized, but physically we could maintain multiple controllers. Also, one reason we use cluster controllers is to minimize the network bandwidth between the data plane and the global controller by aggregating information, so we can mitigate those risks.

Sure. In your case you have two regions, West and East; there might be use cases with many more regions, where it becomes much more important to have the control plane globalized, but in a fashion where it cannot really fail and is still fast enough for every ingress to access.

Thank you very much.

Thank you.