Welcome everyone. This is the SIG Scalability Introduction and Deep Dive. I'm Marcel Junba, a software engineer at Google and a SIG Scalability member. Unfortunately, Wojtek won't be presenting with me today, but hopefully he will join for the Q&A session. He's a senior staff software engineer at Google and a SIG Scalability Tech Lead.

So first of all, let's start with what we do as SIG Scalability. There are five main areas that we are interested in, starting with defining and driving scalability definitions and goals. Once we have those goals, we are interested in monitoring and measuring the performance of Kubernetes. So we have goals, we have measurements; now it's time for improvements. With those measurements, we can find bottlenecks and drive performance improvements in Kubernetes. Those improvements can happen in two ways: either we contribute to Kubernetes directly, or we coordinate with other SIGs to make it happen. As you can imagine, with each Kubernetes release a bunch of new features is added, so we also want to protect Kubernetes from scalability regressions. Last but not least, we consult with and coach community members about scalability. Imagine you want to add a new feature to Kubernetes; before you start the implementation, it might be worth consulting with us to make sure there are no obvious bottlenecks from a scalability point of view. And most importantly, don't confuse SIG Scalability with SIG Autoscaling; it actually happens quite often. We will go through those five areas during this presentation.

So let's start with defining what Kubernetes scalability actually means. If you ask an average user what they want in terms of scalability, they will say scalable clusters. But if you ask what that means, most users unfortunately don't know. So let's look back at the history of Kubernetes. Kubernetes 1.0 was released in 2015 and officially supported 100 nodes. This number changed over time: it was 1,000 nodes, and then in 2017, 5,000 nodes. So you might ask, how many nodes does Kubernetes support right now? The answer is that this number didn't change at all; it's still 5,000 nodes. So you may ask, SIG Scalability, what were you doing for the last four years? Well, it turns out that scalability is not just the number of nodes that Kubernetes supports. In fact, scalability is something that you need to analyze in many more dimensions. The number of nodes is just one of them; we have a bunch of other dimensions, like the number of namespaces, secrets, and services, and for each service, how many backends you have per service, and so on. Based on those dimensions, what we would like to do is define a safe zone, the scalability envelope. By safe zone, I mean that if your cluster is within the safe zone, your cluster will be happy. But again, what does it mean that the cluster is happy? What is the safe zone? We will try to define and explain that in more detail on the next few slides.

So let's start with what it means that a cluster is happy. Here we have two key concepts: SLI and SLO. SLI is a service level indicator.
Imagine you have a cluster and you're interested, for example, in pod startup latency. Pod startup latency is the time it takes for a pod to be running since it was scheduled to a particular node. You make those measurements, you have a bunch of them, and then you take the 99th percentile of the pod startup latency measurements, and it turns out to be, let's say, three seconds. That would be the SLI for pod startup latency for your cluster. On top of that, what we do as SIG Scalability is add a threshold to it; for pod startup latency, let's say it's five seconds. And this is basically an SLO, a service level objective. So we can think of an SLO as an SLI plus a threshold that should be satisfied.

In SIG Scalability, we have a bunch of different SLOs. A few examples are API call latency, the already mentioned pod startup latency, and different types of networking latencies, like DNS latency. You can find all the SLOs that we support as SIG Scalability on our GitHub page.

So let's see what the definition of the API call latency SLO looks like. This will be a slightly simplified version; you can find the full version on our GitHub page. Basically, we divide API calls into two groups. One is write calls, and for 99% of write calls we would like the latency to be below one second. For read calls it's a little more complex. If you are just getting one object, let's say you are getting a pod, then the latency should also be within one second. But if you are trying to, for example, list all the secrets in a namespace, then it might take up to five seconds. On the other hand, if you want to list all the pods in your cluster, or all the secrets across all namespaces, then this type of request might be quite long-running, up to 30 seconds in total. And remember that it's 99%, so you can always be unlucky and land in that one percent, and your request will take longer than that.

Okay, so I think we have a pretty good understanding of what kind of SLOs SIG Scalability provides and what they look like. Now we can actually define what it means that a cluster is happy: a cluster is happy when all of those scalability SLOs are satisfied. Based on that, we can have a kind of framework: if you promise us that your cluster stays within the safe zone, then we can promise you that all scalability SLOs will be satisfied. But we still haven't discussed what the safe zone is. If we want to compute this scalability envelope, it's actually impossible to compute it precisely, because there are multiple dependencies between dimensions, which are quite complex, and obviously it's changing all the time. So instead of trying to find a precise scalability envelope, where you could put in whatever cluster configuration you can imagine and get an answer whether you are within it or not, what we can do is approximate it. We approximate it by a set of limits. As mentioned before, we have one limit on the number of nodes: up to 5,000 nodes. Then we have the number of pods; the number of pods should be less than or equal to 30 times the number of nodes, so we can think of it as 30 pods per node on average. Another limit that we have, for example, is the number of services, which is 10,000. You can find all those limits on our website. So at the end, if you have a cluster, what we are mostly interested in is whether it stays within those limits.
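To make the SLI/SLO distinction a bit more concrete, here is a minimal, self-contained Go sketch, not the actual ClusterLoader measurement code, that computes a 99th-percentile pod startup latency from some made-up samples and checks it against a five-second threshold. All numbers and names here are purely illustrative.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the q-th percentile (e.g. 0.99) of the given latency
// samples using the nearest-rank method.
func percentile(samples []time.Duration, q float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(math.Ceil(q*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	// Hypothetical pod startup latency measurements gathered from a cluster.
	samples := []time.Duration{
		2 * time.Second, 3 * time.Second, 2500 * time.Millisecond,
		4 * time.Second, 3500 * time.Millisecond,
	}

	// SLI: the 99th percentile of pod startup latency.
	sli := percentile(samples, 0.99)

	// SLO: the SLI plus a threshold, e.g. "99% of pods start within 5 seconds".
	const threshold = 5 * time.Second

	fmt.Printf("pod startup latency p99 (SLI): %v\n", sli)
	if sli <= threshold {
		fmt.Println("SLO satisfied")
	} else {
		fmt.Println("SLO violated")
	}
}
```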
So if those limits are satisfied, then we promise you that our scalability SLOs will be satisfied as well. Now we need to make sure that, within this scalability envelope, those SLOs actually are satisfied. And here we have our scalability testing infrastructure.

We will start with ClusterLoader. A few years ago we actually didn't have ClusterLoader, and our scalability tests were written purely in Go. You can imagine how hard it was to maintain those tests. Since then we developed ClusterLoader, which is really great: you can basically bring your own YAML with a test description. The test description consists of states in which you want your cluster to be, so you can have multiple states, and you can also specify how to transition between them. During the whole test, ClusterLoader is also gathering a bunch of measurements, so it can decide whether all scalability SLOs were satisfied during the test. On top of that, ClusterLoader provides a lot of extra observability for debugging, which we will cover later. You can find all those features in our perf-tests repository, which also contains the implementation of the ClusterLoader tool.

The other tool that we commonly use is Kubemark. Maybe it's not a tool; it's actually a cluster simulation. In the scalability envelope, you saw that we support up to 5,000 nodes, and testing a cluster consisting of 5,000 real nodes is quite expensive and time-consuming. So instead of having 5,000 actual VMs, what we do is create something we call hollow nodes. Here you can see that we have three actual VMs, and each of those VMs is running multiple hollow nodes. Those hollow nodes simulate regular nodes. The difference is that if you schedule a pod on a regular Kubernetes cluster, it will actually be running; but if you schedule it on a hollow node, then the kubelet, which we also call the hollow kubelet, will only report "okay, I'm running this container, this pod", but it will not actually run it. This allows us to run multiple hollow nodes on one VM, and each of those hollow nodes is running three components: the kubelet, kube-proxy, and the node problem detector. Those hollow nodes connect to a master, and this master is fully functional. So we test this master, and whether all the SLOs are satisfied, with our ClusterLoader.

The only problem with this kind of setup is: how do you actually deploy those hollow nodes? This becomes a little tricky, because let's say you have hundreds of VMs and you want to run those hollow nodes on them. Well, running multiple containers is something that Kubernetes actually solved. So those hollow nodes are scheduled by a separate master. This master is responsible only for scheduling those hollow nodes and running them on the physical machines, and that's basically it. So we have two masters, and this second master is not being tested. To give you some numbers: I mentioned that we support 5,000 nodes, and we also have 5,000-node tests, but using Kubemark, instead of using 5,000 cores, we can use around 700 cores and simulate a 5,000-node cluster. This is really great because it allows us to cut down the cost but also to iterate more easily. If you imagine a regression that happens only from time to time, let's say one in five runs, then with Kubemark we can easily run tens of those runs to find the issue more easily.
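To illustrate the hollow-node idea, here is a rough conceptual sketch in Go. This is not the real Kubemark code (the actual hollow kubelet wires a fake runtime into the real kubelet), and every type and function name here is made up for illustration. The point is simply that pods are reported as Running without any containers being started.

```go
package main

import (
	"fmt"
	"sync"
)

// hollowRuntime stands in for a container runtime: it accepts pods and
// immediately reports them as Running without starting any containers.
type hollowRuntime struct {
	mu   sync.Mutex
	pods map[string]string // pod name -> reported phase
}

func newHollowRuntime() *hollowRuntime {
	return &hollowRuntime{pods: map[string]string{}}
}

// RunPod pretends to start the pod and records it as Running.
func (r *hollowRuntime) RunPod(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	// No containers are created; only the state reported back to the master
	// matters, which is enough to exercise the control plane at scale.
	r.pods[name] = "Running"
}

// Status returns the phase the hollow runtime reports for a pod.
func (r *hollowRuntime) Status(name string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.pods[name]
}

func main() {
	rt := newHollowRuntime()
	rt.RunPod("nginx-0")
	fmt.Println("nginx-0:", rt.Status("nginx-0")) // "Running", but nothing actually runs
}
```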
So now we will go to observability and debugging. One of our best tools for visualization is Perfdash. It is really great because just satisfying an SLO is not always enough. Here you can see an example of pod startup latency: the blue line is the 99th percentile of pod startup latency, and on the X axis you can see different runs. It's around 100 points, so I would say about three months. You can see that we had around four seconds of pod startup latency, and then it decreased to, on average, around 3.6 seconds. So from four seconds to 3.6 seconds. The question is: was it some kind of scalability improvement? Actually, it was a regression that we found and then fixed; we will talk about it a little later. So Perfdash basically analyzes all those runs and allows us to compare different runs and different SLIs, but also, let's say, CPU usage or memory usage of a VM. It's very useful for finding regressions that are not necessarily violating an SLO.

On top of that, for each run, ClusterLoader has the ability to run Prometheus within your cluster and gather multiple metrics, for example kube-apiserver metrics, or etcd and scheduler metrics, and so on. Also, if you are running, for example, a DNS latency test, then you can quite easily configure it to also scrape metrics from the DNS pods that are making requests to check the latency. With those metrics, we have a pretty nice Grafana setup that allows you to quite easily check, for example, what kinds of API calls were executed in which part of the test, and so on. ClusterLoader also automates gathering profiling data: mostly CPU, but also memory and mutexes. So once you run a test with ClusterLoader, at the end you also get a bunch of profiling data, which comes in handy when dealing with a CPU or memory regression on the Kubernetes master.

So now let's go to scalability tests. What kind of tests do we have? We have periodic tests; these are among the most important for finding regressions in Kubernetes. We have release-blocking tests but also non-release-blocking tests. For release-blocking tests, we have performance tests. A performance test basically tests all those SLOs at two scales, 100 nodes and 5,000 nodes, but we also test correctness. Correctness is not just about whether all the SLOs were fine (that's the purpose of the performance test); in the correctness test, we also check whether the cluster is actually functional. Maybe I will tell you a little more about how the performance test works. I would say the performance test has three main stages. We start with an empty cluster, and the first stage is to actually load the cluster: we have the scalability envelope, and we try to maximize all those dimensions that we think should be in the safe zone. Once we do that, the next stage is scaling different deployments up and down, updating DaemonSets, a bunch of different things that generate pod churn. And the last stage is basically deleting all those things. We test this at two scales, 100 nodes and 5,000 nodes, because 100-node runs sometimes help us pinpoint the exact commit that broke Kubernetes, while the 5,000-node test runs only once a day. For non-release-blocking, we have different tests, starting with the Kubemark tests that I mentioned before, but also storage tests and benchmarks.
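Going back to the profiling data mentioned a moment ago: if you wanted to grab a CPU profile from the API server by hand rather than through ClusterLoader, a minimal sketch could look like the following. It assumes `kubectl proxy` is running locally and profiling is enabled on the API server; otherwise the URL and authentication would need to change.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// Assumes `kubectl proxy` is serving on 127.0.0.1:8001 and the API server
	// exposes its standard pprof endpoints.
	const url = "http://127.0.0.1:8001/debug/pprof/profile?seconds=30"

	resp, err := http.Get(url) // blocks for ~30s while the profile is collected
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetching profile:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	out, err := os.Create("apiserver-cpu.pprof")
	if err != nil {
		fmt.Fprintln(os.Stderr, "creating file:", err)
		os.Exit(1)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "writing profile:", err)
		os.Exit(1)
	}
	fmt.Println("wrote apiserver-cpu.pprof; inspect it with `go tool pprof apiserver-cpu.pprof`")
}
```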
One of my favorite benchmarks is actually the Golang compiler benchmark. What we do is take a performance test and fix all the dependencies except for the Golang compiler. So we take one Kubernetes version, all the dependencies stay the same, and the only thing we change is that we compile Kubernetes with newer and newer Golang compilers, and we check whether any regression was introduced by Golang. This is super interesting because in the past we actually had a bunch of different regressions caused by Golang.

Besides periodic tests, we also have presubmit tests. This is basically just a performance test at the scale of 100 nodes, and we run it as a presubmit for Kubernetes. So if you have ever created a PR against Kubernetes to improve something, among the list of presubmits you see, one of them is our SIG Scalability presubmit.

So how are we protecting the scalability of Kubernetes? First of all, we have Testgrid. It's publicly open; everyone can go there and see, for example, what the status of the performance test at scale is, or what the status of the scalability test at the scale of 100 nodes is. It's basically one of the first tools we use for protecting Kubernetes from scalability regressions. The thing is that scalability is very sensitive, so we've seen regressions coming from pretty much everywhere. As I mentioned before, one example is Golang, but also the operating system, controllers, API machinery, the scheduler, etcd, the kubelet. What we do is either fix them ourselves or route them to the other SIGs.

I would like to tell you a little more about two regressions that we prevented in 1.22; they were pretty interesting. One is a pod startup latency regression. I was actually showing you this regression in Perfdash; the difference was around half a second. This regression was really interesting because by itself it didn't break our tests, but thanks to Perfdash we were able to find it. What happened was that the number of goroutines in the API server doubled. Normally we have around 500,000 goroutines running in the API server at scale, but after this change, which was related to some improvements to priority and fairness, the number of goroutines jumped to one million. All of the SLOs were still satisfied, but we saw that the 99th percentile of pod startup latency significantly increased. So we debugged it and fixed the issue.

Besides that, we had another really interesting regression, which was tricky because the feature that triggered it was also introduced in priority and fairness, but the bug turned out to be in a totally different place; priority and fairness was only the trigger. The idea is that you have some periodic calls to the API server, like updating node leases, and by default they happen every 10 seconds. If you have a 5,000-node cluster, then on average you have around 500 updates per second. If those updates are evenly distributed across the whole second, then it's fine. But some changes in priority and fairness caused those 500 or even more calls to kind of synchronize, and this synchronization is really bad because the load was no longer spread evenly. So this was very interesting, and that's one of the regressions that we fixed recently.
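To illustrate why that synchronization hurts, here is a small, self-contained Go sketch of the general idea: periodic calls with a bit of random jitter stay spread out over time, while calls without jitter that start together keep arriving together. This is only a toy model of the problem, not the actual kubelet lease code or the actual fix.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// renewLeaseLoop simulates one node renewing its lease roughly every period,
// with random jitter so that thousands of nodes don't fire in lockstep.
func renewLeaseLoop(node string, period time.Duration, stop <-chan struct{}) {
	for {
		// Sleep for the period plus up to 10% random jitter; without jitter,
		// nodes that start in sync stay in sync and hit the API server together.
		jitter := time.Duration(rand.Int63n(int64(period) / 10))
		select {
		case <-time.After(period + jitter):
			fmt.Printf("%s: renewing lease at %s\n", node, time.Now().Format("15:04:05.000"))
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	for i := 0; i < 5; i++ {
		go renewLeaseLoop(fmt.Sprintf("node-%d", i), 10*time.Second, stop)
	}
	time.Sleep(35 * time.Second) // observe a few renewal rounds drifting apart
	close(stop)
}
```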
But we are also driving some scalability improvements, so let's go through three of them. The first one is efficient watch resumption. Efficient watch resumption is a really great new improvement that helps with upgrading your masters. As you might know, if, for example, the kubelet wants to get a secret or a config map, it makes a get call but then also a watch call, and this watch should basically be kept alive all the time. But when the API server restarts, or you upgrade it, then unfortunately this watch couldn't be resumed. This was a huge issue, because imagine you are upgrading your cluster and all those watches break, and then all 5,000 machines try to get what they want at the same time, which can quite easily overload the API server. So efficient watch resumption fixes this issue so that the watch can be resumed.

Besides that, we are working with different SIGs on priority and fairness. We are constantly improving it, also from the scalability point of view. For example, the first version of priority and fairness did not distinguish between get calls and list calls. Also, as you might expect, watch calls are important in Kubernetes, so we added support for watch initialization and things like that. Basically, we are working with the other SIG teams to make it better and better and to improve the reliability of the API server.

Last but not least, immutable secrets. Immutable secrets reduce the load, and the potential load, that the API server is receiving. Going back to the same example: if you have a pod and this pod is using some secret, what happens underneath in Kubernetes is that the kubelet first gets the secret from the API server, but then it also watches for any possible changes. In most cases, those changes never happen, and because they never happen, it doesn't make sense to keep this watch. So we introduced immutable secrets, which help with reducing the number of watches and reducing the load on the API server.

If you want to get involved, here are some links. You can find our homepage, but you can also join our Slack channel and our mailing list. If you want to get involved, you can just ping us on Slack, or you can check what kind of issues we have in our repositories with the "good first issue" or "help wanted" labels. We're looking forward to it if you are interested. So that would be all, and thank you for attending. Now it's time for Q&A. Thank you.
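One last note on the immutable secrets mentioned earlier: for anyone curious what using them looks like from client code, here is a minimal client-go sketch. It assumes a kubeconfig at the default location and the "default" namespace; the secret name and data are just placeholders.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location; adjust as needed.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	immutable := true
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-config"},
		StringData: map[string]string{"api-key": "not-a-real-key"},
		// Marking the secret immutable tells kubelets they don't need to keep
		// watching it for changes, which removes those watches from the API server.
		Immutable: &immutable,
	}

	created, err := clientset.CoreV1().Secrets("default").Create(context.TODO(), secret, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created immutable secret:", created.Name)
}
```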