Hi, everybody. Welcome to my presentation at KubeCon + CloudNativeCon North America 2020 Virtual. The topic here today is API Priority and Fairness: kube-apiserver flow control and protection. Before we get started, I should say that because this presentation is held online virtually, I have enriched the contents of these slides so that you can check out the details after it's finished.

The topic is a new feature gate introduced in the Kubernetes 1.18 release, named API Priority and Fairness. Earlier this year, in April, there was an official blog post on the Kubernetes site named "API Priority and Fairness Alpha", which basically tells you what happened in the alpha release. If you don't have any background information about this feature, I highly recommend you read this blog post; it's only going to take you about five minutes. For folks who are interested in the design or implementation details, there is also a KEP under the Kubernetes sig-api-machinery folder. It will give you a deeper look inside this feature, and you'll see things you might be interested in when it comes to further customizing it for your cluster.

Then a brief introduction about myself. I'm Min, and I'm working at Ant Group as a software engineer. I've been working on API Machinery for about three years, and I'm also a Kubernetes SIG API Machinery subproject owner, covering several subprojects. My GitHub handle is yue9944882; the same name is also my Gmail account. If you have any questions after this presentation, please feel free to contact me either by email or by that name on the Kubernetes Slack channels.

The squad building this feature covers many developers from different countries and different companies: there's Mike from IBM, Daniel from Google, and David from Red Hat. And there are also other contributors so far: Aaron, Jonathan, Bruce, Yu, Mengyi. Thank you all for your contributions.

Then let's take a look at the agenda of our presentation today. We're going to show you the background and motivation of this feature. Then there will be a retrospective of the system design. Next, I will show you the alpha-stage implementation, which basically covers the API model and the implementation of the system. Then we move on to the demo, in which I will show you how to customize flow control settings for your own clusters. We are using a kind cluster, a brand-new cluster just created by me, so that you can easily reproduce and rerun everything in your local environment. And at last, I will show you a few planned enhancements for the beta stage.

Okay, let's move on to the first part, the background and motivation. Basically, there are two high-level goals of this feature. The first one is self-protection, which covers two points: first, prioritize cluster-critical requests for self-maintenance; second, prevent misbehaving clients or buggy controllers from taking down the whole cluster. The first point is saying that we are going to sort all the client requests into different priorities. The second one is saying that we will not allow any single client's requests to spoil the whole cluster; there will be a degree of isolation between clients. As for protection, we are actually protecting the kube-apiserver from its incoming client requests.
So we should start by understanding that there are different kinds of client requests being served by a typical kube-apiserver: there are apiserver loopback requests, delegated requests from aggregated API servers or admission webhooks, controller requests, and daemon requests.

As for kube-apiserver loopbacks: the kube-apiserver instance will be sending requests against itself even if there are no client requests at all. For example, there is an informer factory singleton in each kube-apiserver process; basically, the kube-apiserver needs to know the actual status of the cluster by accessing the object cache provided by that informer factory, and all of these informers will keep issuing list and watch requests against the kube-apiserver until it shuts down. There are also several embedded controllers inside the kube-apiserver: for example, there are cluster CA rotation controllers, CRD-related controllers, and API server aggregation-related controllers. All of these controllers should be regarded as first-class citizens in the Kubernetes world, because they are strongly connected with the health status of the whole cluster. If you are failing the loopback requests, there are going to be bigger problems in your cluster; the cluster is very likely to go down.

As for the delegated requests from aggregated API servers and admission webhooks: basically, the kube-apiserver provides some extensibility that allows us to add new resources to the cluster and also to intercept requests to some of the resources. These extensions will be invoked while the kube-apiserver is serving an incoming request. For example, if you add an admission webhook on the pod resource, then when you send a request for a pod to the kube-apiserver, the kube-apiserver will issue another request against the admission webhook. That secondary request should have a higher priority than the original request; otherwise there is going to be a deadlock in the request chain, which will cause problems in the cluster.

And there are also controller problems. We had a situation where a bug in the deployment controller caused it to make requests in a tight loop under certain circumstances. We'd like these controller bugs not to take the whole system down. These controllers can also be custom controllers developed by leveraging scaffolding tooling, for example Kubebuilder, together with custom resource definitions. If these buggy controllers misbehave like that, it's going to harm the kube-apiserver as well. These controllers can be singletons, and there can also be daemons: a kubelet, kube-proxy, or other per-node controllers. If there are bugs in these per-node controllers, their impact on the system is going to be multiplied compared with the controller singletons, so we definitely want to contain them. On the other hand, the issues from daemons are not necessarily connected with a bug; it can be the cluster reaching its scalability limits. For example, if you have too many nodes in the cluster, then your cluster will be reaching its scalability ceiling, but we don't know the ceiling until we actually reach it. At least we don't want a newly added node to take the whole system down; there should be a way to keep the cluster running even if too many daemons are added to it.
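To make this concrete: the feature ships with mandatory and suggested configuration objects along these lines. As a rough sketch, patterned after that suggested configuration but with names and values that are my illustration rather than the shipped defaults, a FlowSchema can route all requests from the nodes group, i.e. the kubelets and other per-node daemons, into a dedicated priority level:

    apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
    kind: FlowSchema
    metadata:
      name: system-nodes-example   # illustrative name
    spec:
      matchingPrecedence: 900      # lower values are evaluated first
      priorityLevelConfiguration:
        name: system               # a priority level reserved for node daemons
      distinguisherMethod:
        type: ByUser               # keep each node's requests in its own flow
      rules:
      - subjects:
        - kind: Group
          group:
            name: system:nodes
        resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]
        nonResourceRules:
        - verbs: ["*"]
          nonResourceURLs: ["*"]

With something like this in place, requests from less critical clients land in other, lower priority levels and cannot crowd out the node daemons.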
There's another high-level purpose, multi-tenancy, which demands that we provide guaranteed capacity even for requests that are considered less important, and that tenants at the same priority level sharing the cluster get an equal share of the service. I believe that Kubernetes is designed to be shared by multiple tenants. There are many different definitions of a tenant so far: a tenant can be a namespace, can be a user, can be several users sharing a prefix, or several users in the same group. And don't forget that there's another subproject under Kubernetes SIGs named multi-tenancy, which has its own definition of a tenant, which is basically a group of namespaces.

There are also non-goals in the background of this feature. There will be no coordination between the API servers, nor with an external load balancer. We will also not attempt to auto-tune the capacity. And we will not attempt to reproduce the functionality of the existing EventRateLimit admission plugin: that plugin basically intercepts requests against the Event resource using a token bucket filter, and this non-goal is saying that we are not generalizing token bucket filters to all the resources or anything like that.

Let's move on to the second part, the system design retrospective. We need to know the basics about flow control algorithms. There are basically two kinds: the first is on the source side, the client side; and the second is on the gateway side, the server side. Client-side rate limiting is already supported, because the Kubernetes client-go library provides a token bucket rate limiter for throttling the clients, and there is even a dedicated admission controller for rate limiting events. But there are still a few known defects. The first is that a user can opt out of the rate limiting by granting the token bucket a negative, i.e. infinite, capacity. Also, it's sometimes tough to control the granularity if there are multiple clients built into the same component, in the same process. For example, if we want looser rate limiting for one controller and tighter for another, it's going to be a bit tough to configure.

Then let's take a look at the existing limiters in the kube-apiserver. There are two dimensions to the limits. The first is configuring the --max-mutating-requests-inflight and --max-requests-inflight flags, which let you set a limit on the request concurrency, for mutating and non-mutating requests respectively. And you can also apply a timeout to non-long-running requests, which basically means the non-watch requests.
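For reference, both of these knobs are plain kube-apiserver flags. Since the demo later uses a kind cluster, a kind config along these lines would set the in-flight limits and also turn on the new feature gate we'll need; this is just a sketch, and the numbers shown are the usual defaults rather than tuned values:

    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    featureGates:
      APIPriorityAndFairness: true                    # enable the new flow control feature
    runtimeConfig:
      "flowcontrol.apiserver.k8s.io/v1alpha1": "true"  # expose the alpha API group
    kubeadmConfigPatches:
    - |
      kind: ClusterConfiguration
      apiServer:
        extraArgs:
          max-requests-inflight: "400"            # concurrency limit for non-mutating requests
          max-mutating-requests-inflight: "200"   # concurrency limit for mutating requests
          request-timeout: "60s"                  # timeout applied to non-long-running requests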
Then let's learn from the Linux qdisc (queueing discipline) system, because it's really helpful for understanding the Linux networking rate limiting system, which is already proven to be successful. There are basically two kinds of qdiscs in Linux networking: classless and classful. There are a few outstanding algorithms under the classless qdiscs, such as round robin, CoDel, and TBF. Actually there's a much longer list, but we're not showing every algorithm, just a few outstanding ones. Anyway, under a classless qdisc, every request is treated equally. But in the kube-apiserver, there is already an authentication and authorization system, so there is already an abstraction of user identities. That's why we need the classful kind of qdisc when building flow control into the kube-apiserver.

There are a few classful qdisc algorithms built into the Linux networking system, such as Deficit Round Robin and Hierarchical Token Bucket. One thing to mention is that we learned a lot from both algorithms. I remember we spent several meetings discussing the differences between Deficit Round Robin and a few other algorithms, especially Hierarchical Token Bucket; it's a bit complicated, but it's proven to work well for most use cases.

So here's the abstraction of the classful flow control system we want to build in the kube-apiserver. There are basically three abstract components: the first is a rule-based classifier, then a queue assigner, then a queue scheduler. The three components basically work like lambda functions: the classifier maps a request to a request class, the queue assigner puts a request of a request class into a certain queue, and the queue scheduler applies delays to the queues where the requests are waiting.

Then let's move on to what we can do in the classifier. We extend the abstraction of a class from the Linux TC system to priority levels in the new flow control system in the kube-apiserver. A priority level is a band: requests in higher priority levels should be executed first. A priority level is a request class in which all matching requests are handled equally, and it is also the request class to which we apply the same rejection strategy. To classify requests into priority levels, the information we can get from the kube-apiserver request context is, first, the client identity; for now there are two useful kinds, the username and the user groups, which are like a list of tags on the user. Then there is the requesting target, which basically covers the requested namespace and other request metadata, for example the verbs, the target resource types, et cetera. So what we are doing in the classifier is basically mapping a request to a priority level using these identities.

Then let's move on to the queue assigner. Each priority level contains a group of request queues for scheduling, and the question is how to map a request to one of the queues. One intuitive way is to map one queue per user: if you have 10 users, then you have 10 queues in one priority level. But there are going to be problems if you have tens of thousands of users: with tens of thousands of queues in a priority level, there's going to be a significant memory cost. To avoid that, we use another technique named shuffle sharding. Shuffle sharding helps us bound the memory at a fixed, constant number of queues. In this picture, I'm showing shuffle sharding with a hand size of two: the requests from user one, for example, are dealt in a round-robin manner to both queue one and queue two, and similarly for the other users. This picture also shows that when there is some problem with user one, the impact spreads to queue one and queue two, so user three will not be affected at all, while user two and user four will be partially impacted. The higher the hand size, the less likely it is that any other user's queues are all impacted.
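To give a feel for why the hand size matters, here is a rough back-of-the-envelope note, assuming hands are dealt independently and uniformly at random: if a priority level has d queues and each user is dealt a hand of h of them, the probability that one user's hand falls entirely inside a misbehaving user's hand is

    \[ P(\text{full overlap}) \;=\; \frac{\binom{h}{h}}{\binom{d}{h}} \;=\; \frac{1}{\binom{d}{h}} \]

So, for example, with d = 8 and h = 2 that's 1/28, about 3.6 percent, and with d = 64 and h = 6 it's roughly one in 75 million.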
As for the queue scheduler, we are reusing an algorithm named fair queueing, which basically aims at the following goals when selecting requests from the queues: the first is an even distribution of the service capacity, and the other is maximum fairness. I've got a few more details about fair queueing on the slides, but we are not expanding on them today because the time is limited; if you are interested in the fair queueing algorithm, you can revisit these slides afterwards.

We built a variant of fair queueing for server requests, to overcome its limitations for the kube-apiserver. The first difference is that we are dispatching requests to be served rather than packets to be transmitted. The second is that multiple requests can be served at once. The third is that the actual service time is not known until a request is done being served. So we made a few modifications to the fair queueing algorithm so that it adapts to the kube-apiserver; again, I'm not going to expand on that, to save time.

So this is what the whole flow control system is going to look like: the priority level classifier, then shuffle sharding, then fair queueing.

Then let's take a look at the alpha-stage API definition model. You can acquire the new feature by enabling the feature gate and adding the new flags to the kube-apiserver's startup flags. This is an example FlowSchema, where we can see that it matches a user in a certain group, using certain verbs, accessing everything in the cluster; such requests will match this example schema. There is also a catch-all flow schema; one thing different about it is that it has a distinguisher method. If you are interested in the distinguisher method, you can take a look at the KEP to learn what it does. And in the PriorityLevelConfiguration, you have a place to configure the hand size for shuffle sharding, which we just talked about.

Then I'm going to do a demo of customizing this. We can create a new kind cluster using these settings, and here we have a priority level named workload-medium. We create it in the cluster, and then we create a new flow schema which matches requests from the user demo-user, and create that in the cluster as well. Then we can take a look at this metrics dashboard. In these panels, it shows you the capacity limit of each priority level, the QPS of each priority level, how the requests of each priority level are queued and have been waiting, and the actual execution time costs of each priority level. All of these times are calculated as P50, P98, and P99 percentiles. In this example, the workload-medium priority level configuration only has a concurrency share of one. Now we give it a higher concurrency share, for example 50, and apply it to the cluster. Then you can see the concurrency shares are recalculated: the workload-medium priority level configuration now gets a capacity limit of 120. Because the time is limited, I can't show you how everything works; if you have questions, I can give you a detailed demo offline.
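For those who want to reproduce the demo, the two objects I created look roughly like this; this is a sketch, where workload-medium and demo-user are just the names from my demo and the queuing values are illustrative:

    apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
    kind: PriorityLevelConfiguration
    metadata:
      name: workload-medium
    spec:
      type: Limited
      limited:
        assuredConcurrencyShares: 1    # raised to 50 later in the demo
        limitResponse:
          type: Queue
          queuing:
            queues: 64          # number of queues in this priority level
            handSize: 6         # hand size for shuffle sharding
            queueLengthLimit: 50
    ---
    apiVersion: flowcontrol.apiserver.k8s.io/v1alpha1
    kind: FlowSchema
    metadata:
      name: demo-user
    spec:
      matchingPrecedence: 1000
      priorityLevelConfiguration:
        name: workload-medium
      distinguisherMethod:
        type: ByUser
      rules:
      - subjects:
        - kind: User
          user:
            name: demo-user
        resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]

The recalculation you see in the dashboard follows the formula from the KEP: each limited priority level l gets an assured concurrency value of

    \[ \mathrm{ACV}(l) \;=\; \left\lceil\, \mathrm{SCL} \cdot \frac{\mathrm{ACS}(l)}{\sum_{k} \mathrm{ACS}(k)} \,\right\rceil \]

where SCL is the server's total concurrency limit (the sum of the two in-flight flags) and ACS are the concurrency shares. That's why raising one level's share both raises its own limit and shrinks everyone else's; for example, with the default total of 600 (400 non-mutating plus 200 mutating), a level holding one fifth of all the shares would get 120, which is consistent with the dashboard here, though the exact totals depend on the other levels in your cluster.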
Let's move back to the presentation. Next are the planned enhancements for the beta stage. There are blocking items and non-blocking items planned for this stage.

As for the blocking items, the first is improving the observability and robustness. We are adding a debug endpoint to the kube-apiserver, which is already done; I believe it was already published in the Kubernetes 1.19 release. There are also a few new metrics added to the system, which are definitely helpful, but I'm not showing them in this presentation because they are basically used for debugging; if you are interested, I can cover them in the Q&A session. The second one is providing approaches to opt out of client-side rate limiting. I'm not sure whether it's done yet, but the ultimate goal is to remove client-side rate limiting and move it all to the server side, or to the gateway. And the third one is the necessary E2E tests. That is our current progress; we are adding the E2E tests hopefully before the 1.20 release.

As for the non-blocking items, here are a few optional goals. The first is to support concurrency limits on long-running requests. The second is to allow constant concurrency in addition to relative shares in the priority-level API model: for now it's only a proportional share of the API concurrency, and we hope to allow users to configure a fixed, reserved concurrency for a priority level. The third is to automatically manage versions of the mandatory and suggested configurations; this is basically saying that we will provide a better user experience for updating and upgrading the system preset flow control settings, with flexibility. And the last one is discriminating against unpaginated LIST requests. This is what we are trying to achieve before the 1.20 release, because unpaginated LIST requests are very likely to harm the cluster, and it's happening almost every day in our production clusters, so we hope to prevent that from happening. So this is what we are going to do as non-blocking items for the beta stage. For more graduation criteria for the beta stage, you can also read the KEP.

And thank you for attending my presentation. Then let's move on to the Q&A session. Thank you.