Hello everyone, my name is Tianyu. I'm a software engineer at Google, and today I will talk about our work on the design of a global rate limiting service. Here is the agenda for today's talk. First, I will start with a brief introduction to the background of our work. Next, I will talk about the existing design. After that, I will dive into the details of our new design.

First, the background. Why do we need rate limiting? Effective traffic management is key to letting customers ensure that their microservices and overall architecture are highly available, by preventing any particular client from exhausting service resources, and highly reliable, by being resilient to misbehaving clients that would otherwise overload the service. Envoy can already delegate the rate limiting decision to an external service, but that approach doesn't meet all of our needs, and in the next few slides I will explain why.

So first, let me describe the existing design. Here is a high-level overview of the request flow. On the left-hand side, multiple clients send requests to the service in the middle, which has Envoy deployed as a sidecar proxy. Before a request reaches the service, Envoy asks the rate limit server whether the request should be rate limited or not. If the answer is yes, the status code 429 is returned. If the answer is no, the request is allowed and forwarded to the service.

Let's zoom in a bit and look at the protocol between Envoy and the rate limit server. Here is a diagram of the request flow and the API interface. A few things to highlight. First, as you can see from the API interface, it uses the unary gRPC mode: the client sends a single request and gets a single response back, like a normal function call. Secondly, Envoy queries the rate limit server for every incoming customer request. Thirdly, on the left-hand side, the customer request is blocked while waiting for the response.

So what are the problems with the existing approach? I just mentioned some of them, but let's look at them all together. First of all, the old design doesn't scale very well. Because Envoy queries the rate limit server for every incoming customer request, the design is infrastructure-intensive: the rate limit server has to support the same volume of requests as the proxy itself, which could be huge. Secondly, there is high latency on the client side, because the client has to wait for the response from the server. Thirdly, there is low performance on the server side. As you can see from the diagram, in unary gRPC mode, if there are multiple backend servers available, each RPC could be sent to a different backend. What does this imply, and why does avoiding it lead to better performance? I will explain in the next few slides.

Now let's jump to the new design. The overall request flow architecture stays mostly the same: we still leverage Envoy's ability to delegate the rate limiting decision to an external service. So you may ask, what's new? There are three major new features to highlight, and I will go through them one by one.

First, as I mentioned, unary gRPC is not good for performance, so here we switch to bidirectional streaming mode. It provides a persistent connection between the client and the server: in bidi streaming mode, the client and the server can send an arbitrary number of messages back and forth over a long-lived stream.
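To make the difference between the two protocol shapes concrete, here is a minimal sketch in Go. This is not the actual Envoy or gRPC API; every type and function name below (CheckRequest, UsageReport, QuotaAssignment, Stream, runStream, and so on) is a simplified stand-in invented for illustration, just to contrast one blocking check per request against a single long-lived stream carrying reports and assignments.

```go
// Not the actual Envoy or gRPC API: every type and function below is a
// simplified stand-in invented for this sketch, only to contrast the two
// interaction shapes.
package ratelimitsketch

import (
	"fmt"
	"time"
)

// --- Existing design: one blocking unary RPC per customer request. ---

type CheckRequest struct{ Descriptors map[string]string }
type CheckResponse struct{ Allowed bool }

// shouldRateLimit stands in for the unary check: the data-path request is
// blocked until this round trip to the rate limit server completes, and the
// server must absorb one such call for every customer request.
func shouldRateLimit(req CheckRequest) CheckResponse {
	// ... unary RPC to the rate limit server ...
	return CheckResponse{Allowed: true}
}

// --- New design: one long-lived bidirectional stream per client. ---

type UsageReport struct {
	BucketID   map[string]string // which quota bucket the counters belong to
	NumAllowed int64
	NumDenied  int64
}

type QuotaAssignment struct {
	BucketID            map[string]string
	RequestsPerTimeUnit int64         // the rate limiting strategy, simplified
	TTL                 time.Duration // lifetime of the assignment
}

// Stream stands in for a bidi gRPC stream: both sides can send an arbitrary
// number of messages back and forth over one persistent connection.
type Stream interface {
	Send(UsageReport) error
	Recv() (QuotaAssignment, error)
}

// runStream keeps the stream open for the life of the client. Usage reports
// go out periodically, and quota assignments pushed by the server arrive
// asynchronously on the same stream, so customer requests never wait on it.
func runStream(s Stream, interval time.Duration) {
	go func() {
		for {
			a, err := s.Recv()
			if err != nil {
				return // stream closed; a real client would reconnect
			}
			fmt.Printf("assignment for bucket %v: %d requests per unit, TTL %v\n",
				a.BucketID, a.RequestsPerTimeUnit, a.TTL)
		}
	}()
	for range time.Tick(interval) {
		report := UsageReport{
			BucketID:   map[string]string{"env": "prod"},
			NumAllowed: 120,
			NumDenied:  3,
		}
		if err := s.Send(report); err != nil {
			return
		}
	}
}
```

In the real system the messages ride on a gRPC bidi stream; the point of the sketch is only that nothing on the data path waits for this loop.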
So why is this good for performance? First, let's look at the rate limiting server side. The streaming design helps us make full use of server-side functionality, for example caching, by avoiding the cost of cache misses. As you can imagine, repeatedly hitting different backend servers increases the chance of a cache miss, and the rate limit request then has to wait for the data to become available. Besides that, on the server side it also helps reduce synchronization between different backends. As a simple example, if one client reports its usage to multiple backend servers, then for the rate limiting service to figure out the total usage it has to synchronize across those backends to do the calculation, which introduces additional overhead. Regarding the protocol itself, streaming also avoids repeated RPC initialization, which includes things like starting a new HTTP request at the transport layer.

Next is our quota-based stateful push. What does quota-based mean? Basically, we group clients into quota buckets. Why are we doing that? Firstly, rate limits can be specified at various degrees of granularity. By default, all clients are equal, but you can also group clients so that you can allocate more of your capacity to higher-priority clients. For example, if you have a production client and a developer client, you may want to allocate more quota to your production client. Secondly, it helps associate responses with requests. In bidi streaming mode, the ordering of requests and responses is not guaranteed; the client and the server can write in whatever order they want. But the rate limit client needs to know which response corresponds to which request so that it can apply the rate limiting decision properly. The quota bucket, with a bucket ID as its identifier, serves as the bridge between the server and the client to establish the mapping between requests and responses.

Next is how we group them. We generate the bucket ID either statically or dynamically. Let me use the diagram to explain. On the left-hand side is the configuration; on the right-hand side is the generated bucket ID. For the static method, the key and the value from the configuration are used as they are. For the dynamic method, the value in the bucket ID is retrieved from the request header (highlighted in green), based on matching the key in the configuration against the key of the request header (highlighted in blue).
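As a rough illustration of the static and dynamic generation just described, here is a small Go sketch. The configuration schema is invented for this example and is not the actual Envoy configuration: each entry either carries a static value or names a request header to pull the value from.

```go
// An invented configuration schema (not the actual Envoy one) showing the two
// ways a bucket ID entry can be produced.
package ratelimitsketch

// BucketIDBuilder describes how one key/value entry of the bucket ID is built.
// Exactly one of StaticValue or FromRequestHeader is expected to be set.
type BucketIDBuilder struct {
	Key               string
	StaticValue       string // static method: value used as-is from the config
	FromRequestHeader string // dynamic method: header whose value is copied in
}

// buildBucketID turns the configured builders plus the request headers into
// the concrete bucket ID that groups this request with similar ones.
func buildBucketID(builders []BucketIDBuilder, headers map[string]string) map[string]string {
	id := make(map[string]string, len(builders))
	for _, b := range builders {
		if b.FromRequestHeader != "" {
			// Dynamic entry: the value comes from the matching request header.
			if v, ok := headers[b.FromRequestHeader]; ok {
				id[b.Key] = v
			}
			continue
		}
		// Static entry: key and value are taken from the configuration as-is.
		id[b.Key] = b.StaticValue
	}
	return id
}

// Example with illustrative values:
//
//	builders := []BucketIDBuilder{
//		{Key: "env", StaticValue: "prod"},                    // static
//		{Key: "client_id", FromRequestHeader: "x-client-id"}, // dynamic
//	}
//	buildBucketID(builders, map[string]string{"x-client-id": "mobile-app"})
//	// => map[client_id:mobile-app env:prod]
```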
Having said that, our design operates on a quota bucket basis instead of on individual clients. The quota usage for each bucket includes information like the number of requests allowed and denied, and the quota assignment for each bucket from the server includes information like the rate limiting strategy and the lifetime of the assignment.

Next, how do we make it a stateful push? The answer is pretty simple: we use a cache. We leverage Envoy's thread-local storage to cache the responses from the rate limit server. Using a cache first avoids redundant queries to the server, and it also improves latency on the client side whenever there is already a valid quota in the cache.

Last is our report-reply pattern with a subscription model. Basically, the rate limit client periodically reports its quota usage to the server, and the rate limit server sends back an assignment once it has collected enough reports to make a decision. The subscription model here means that the first report from a client serves as an indicator to the server that the client is subscribed to receive future updates from the server. What benefit does this model provide? It allows more intelligent rate limiting decisions: the rate limit server can now adjust the quota assignment based on real-time usage reports from the client. Think about the example below; call it assign-per-use. There is a social media website, a news website, and a shopping website. On Black Friday, there might be a spike of traffic on the shopping website. Once the server receives such reports from the client, it can choose to allocate more quota to the shopping website to allow more requests to go through.

Let me summarize the design and wrap up. First, this design is more scalable: the stateful infrastructure, namely the cache, avoids redundant queries to the server. The design is also more intelligent, because the rate limit server can now adjust the assignment based on real-time client usage reports. Last but not least, the design is more performant. On the server side, it can make full use of server-side functionality and avoid heavy synchronization between different server backends. On the client side, it avoids sending redundant queries and responds to customer requests faster.

Here is the acknowledgement: this is joint work across multiple teams at Google. Thank you, that's it. Questions?

Sorry, what's that? The question is, what time granularity is supported for the rate limit? We support both, I think: per minute and per second. It can be specified in the configuration.

Sorry, what's the question? So basically, right now it is bidi streaming mode, so we don't need synchronization on the server side, because one service always sends all of its data to one backend server. That avoids the synchronization on the server side.

I just wonder, if we have a client that creates multiple connections to multiple services, how do you synchronize that rate limit? You have multiple connections, so multiple clients, and you want to synchronize between different clients? Different services. For a service. So basically, our design doesn't really care about individual client information, because clients are grouped into quota buckets. What we care about is each quota bucket, so I think our design doesn't need to synchronize between different clients, if I understand your question.

Okay, so is it a local rate limit? It's global; yeah, it's a global rate limit. Maybe we can sync up afterwards. Yeah, please.

Yeah, so we are targeting early next year. The question was, do you have a target date for this design to be available to use? Yeah. Yeah, please.

Yeah, that's a good question. Basically, we have a predefined configuration: for example, if there is no assignment, we apply a predefined rule instead of querying the server. And also on the cache side, on the server side, as you mentioned, we can preload the decision instead of waiting for the data to be available. So yeah, on the Envoy side, there is caching for the rate limit responses from the server.
Yeah, and also on the Envoy side there is a predefined rule. For example, if there is no response in the cache, we use the predefined rule to say, okay, allow the request or deny the request. Then once the data is populated, we look at the data in the cache.

You mean the priorities? So this really depends on the configuration. And it is not only priorities: you can group requests based on other attributes, so priority is just one of them. It's customizable, yeah. Any additional questions? Thank you so much. Thank you.
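To tie the pieces above together, here is a rough sketch of the proxy-side flow described in the talk and the Q&A: a local per-bucket cache that applies the pushed assignment when one exists, falls back to the predefined default rule otherwise, and keeps the allowed/denied counters that the periodic reports (which also act as the subscription) send to the server. It reuses the QuotaAssignment type from the first sketch; as before, all names and the cap-per-interval "strategy" are illustrative stand-ins, not the actual Envoy implementation.

```go
package ratelimitsketch

import (
	"sync"
	"time"
)

// bucketState is what the proxy remembers about one quota bucket. In the real
// design the bucket ID is a set of key/value pairs; here it is flattened into
// a single string key for simplicity.
type bucketState struct {
	assignment *QuotaAssignment // nil until the server has pushed one
	expiry     time.Time        // when the assignment's lifetime (TTL) ends
	allowed    int64            // counters drained into the next usage report
	denied     int64
}

// QuotaCache is the local, per-worker style cache of bucket state.
type QuotaCache struct {
	mu           sync.Mutex
	buckets      map[string]*bucketState
	defaultAllow bool // predefined rule used before any assignment arrives
}

// NewQuotaCache builds an empty cache with the configured default rule.
func NewQuotaCache(defaultAllow bool) *QuotaCache {
	return &QuotaCache{
		buckets:      make(map[string]*bucketState),
		defaultAllow: defaultAllow,
	}
}

// Allow decides one request using only local state, so the request never
// blocks on the rate limit server. Seeing a bucket for the first time adds it
// to the cache; its usage then appears in the next periodic report, which also
// subscribes the bucket to future assignments pushed by the server.
func (c *QuotaCache) Allow(bucketKey string, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	st, ok := c.buckets[bucketKey]
	if !ok {
		st = &bucketState{}
		c.buckets[bucketKey] = st
	}

	allowed := c.defaultAllow // no assignment yet: apply the predefined rule
	if st.assignment != nil && now.Before(st.expiry) {
		// Extremely simplified "strategy": a plain request cap per report
		// interval (the counter is reset each time DrainUsage runs).
		allowed = st.allowed < st.assignment.RequestsPerTimeUnit
	}

	if allowed {
		st.allowed++
	} else {
		st.denied++
	}
	return allowed
}

// StoreAssignment is called from the stream's receive loop when the server
// replies to a report with a new assignment for this bucket.
func (c *QuotaCache) StoreAssignment(bucketKey string, a QuotaAssignment, now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	st, ok := c.buckets[bucketKey]
	if !ok {
		st = &bucketState{}
		c.buckets[bucketKey] = st
	}
	st.assignment = &a
	st.expiry = now.Add(a.TTL)
}

// DrainUsage snapshots and resets the counters; the periodic report loop turns
// the snapshot into usage reports sent to the server over the stream.
func (c *QuotaCache) DrainUsage() map[string][2]int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[string][2]int64, len(c.buckets))
	for k, st := range c.buckets {
		out[k] = [2]int64{st.allowed, st.denied}
		st.allowed, st.denied = 0, 0
	}
	return out
}
```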