Okay, hello everybody, welcome to the SIG Scalability talk. I'm so glad to see all of you here after such a long day. Today we'll be talking about SIG Scalability: an introduction and a deep dive. I'm Marcel Zięba, I work at Isovalent and I'm a SIG Scalability chair. Together with me is Wojtek Tyczyński, who works at Google and is a SIG Scalability tech lead.

First of all, let's start with what we do as SIG Scalability. There are basically five areas we are interested in. The first and most important is to define and drive what scalability means, and as we will see on later slides, it's not obvious what scalability means in terms of Kubernetes. Beyond that, we coordinate across different SIGs to improve the scalability of Kubernetes. We monitor and measure all of the scalability metrics we care about, to make sure we preserve the scalability properties of Kubernetes across releases. And last but not least, we consult with other SIGs to make sure the features they are trying to implement are scalable, and we collaborate with them during the design process. One more thing: please do not confuse SIG Scalability with SIG Autoscaling. Those are two totally different SIGs responsible for totally different things.

So what is Kubernetes scalability? If we ask our users and customers what they want, they always say they want scalable clusters. But then we ask: okay, what does that actually mean? They usually don't know what it means for a cluster to be scalable. One dimension that everyone cares about is the number of nodes, but that's not the only thing we should care about. Scalability is much more than the number of nodes; that is just one dimension. We like to think about the scalability of Kubernetes as a kind of hypercube with many different dimensions. Besides nodes, we also care about things like the pod churn in your cluster, how many pods you have per node, how many services you have, and, if you have services, how many backends those services have.

With that in mind, what we do is define a scalability envelope. You can think of it as a safe zone: if your cluster stays within the scalability envelope, we ensure that your cluster will be happy. But then the question becomes: what does it actually mean for the cluster to be happy? We have a bunch of scalability SLOs, and we ensure that if you configure your cluster properly and stay within the scalability envelope, those SLOs will be satisfied. Thank you.

So what are SLIs and SLOs really? You probably know all of this already: an SLI is a service level indicator, which you can conceptually think of as a metric of some kind. An SLO is a service level objective, which you can conceptually think of as a metric with a threshold. Those are the core concepts we use to say whether Kubernetes, or any system in general, meets the criteria it should meet. For Kubernetes, purely from a scalability perspective, we define these six SLIs and SLOs. Some of them are still work in progress; we still haven't graduated them. We measure them in our tests, but we don't necessarily block releases on them, so there's still a bunch of work to do in this area.
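To make the SLI versus SLO distinction a bit more concrete, here is a tiny, purely illustrative Go sketch (not SIG Scalability code): the SLI is the measured latency distribution, and the SLO is simply a threshold applied to a percentile of it. The sample latencies and the one-second threshold are made up for illustration.

```go
// Illustrative only: SLI = a measured signal, SLO = a threshold on that signal.
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the observed latencies (the SLI).
func percentile(latencies []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), latencies...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(float64(len(sorted)-1) * p / 100.0)
	return sorted[idx]
}

func main() {
	// Hypothetical API call latencies observed during a test run.
	observed := []time.Duration{
		40 * time.Millisecond, 80 * time.Millisecond, 120 * time.Millisecond,
		300 * time.Millisecond, 950 * time.Millisecond, 1200 * time.Millisecond,
	}

	// The SLO: the 99th percentile must stay below one second.
	p99 := percentile(observed, 99)
	fmt.Printf("p99 = %v, SLO met: %v\n", p99, p99 <= time.Second)
}
```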
If you look at the list, it doesn't cover many areas of Kubernetes well; storage, for example, isn't really covered, and there are many other dimensions that are missing. So there's still a lot of room for improvement in the coverage of these SLIs and SLOs, and we really need your help here.

Let's take a look at one of the examples. One of the main SLOs we have is API call latency, and this is how the SLO was defined in the first blog post that I published on the Kubernetes blog in 2015. I realized about an hour ago that if it were my kid, which it kind of is, I would already be driving it to school; it's more than eight years old at this point. Back then it read: 99% of all API calls return in less than one second. And there is a problem with this SLO, which is that my understanding of it and yours are probably very different. That is a problem in general, because SLOs are our contract with users. If the system meets my understanding of the SLO but not yours, then we believe the system behaves as it should while from your perspective it doesn't, and that is a big problem.

So over time we refined many of the SLOs, and here is how this one looks today. As you can see, it's much more precise now: it defines exactly what kind of API calls we are talking about, how we aggregate them, over what time period, and under what circumstances we actually guarantee something. Here we are explicitly talking about a default, vanilla Kubernetes installation, because things like webhooks obviously affect API call latency and we have no control over those. You can install a webhook that sleeps for five seconds in your cluster, and there's nothing we can do; there's no way we could meet the SLO. So there are explicit constraints here, and although it's still not perfect, we are in much better shape in terms of making sure that our users' understanding and the way we measure it are hopefully the same.

Now, taking a step back to what Marcel was saying: the scalability envelope is the multidimensional cube, or rather the multidimensional subspace of configurations, in which we believe Kubernetes actually scales. The problem is that defining this subspace precisely is pretty much impossible. Fortunately, we can approximate it reasonably well by a number of more or less independent dimensions. They are obviously not completely independent, but they are independent enough, for some definition of enough, that this is basically how we talk about the subspace: things like at most 5,000 nodes, at most 10,000 services, and so on. If you are interested, there's a much bigger list of these thresholds under the link here. Some of them are still not filled in; that's something that needs better test coverage, so there's still a lot of work to be done here.
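A sketch of what "staying inside the envelope" means in practice: approximate the envelope by a few independent dimensions, each checked against an upper bound. This is not SIG Scalability tooling; the bounds below are the commonly documented large-cluster limits and are only illustrative, the authoritative and much longer list is the one linked from the SIG's docs.

```go
// Toy "envelope check": a handful of independent dimensions, each with an
// upper bound. Bounds are illustrative only.
package main

import "fmt"

type clusterShape struct {
	Nodes       int
	PodsPerNode int
	TotalPods   int
	Services    int
}

type limit struct {
	name  string
	value int
	max   int
}

func withinEnvelope(c clusterShape) bool {
	limits := []limit{
		{"nodes", c.Nodes, 5000},
		{"pods per node", c.PodsPerNode, 110},
		{"total pods", c.TotalPods, 150000},
		{"services", c.Services, 10000},
	}
	ok := true
	for _, l := range limits {
		if l.value > l.max {
			fmt.Printf("outside envelope: %s = %d (max %d)\n", l.name, l.value, l.max)
			ok = false
		}
	}
	return ok
}

func main() {
	c := clusterShape{Nodes: 5000, PodsPerNode: 30, TotalPods: 150000, Services: 12000}
	fmt.Println("within envelope:", withinEnvelope(c))
}
```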
And with that, yes, I'm passing it to Marcel. Okay, so we know what SLOs we have, what SLIs we care about, and you saw some of the limits we are interested in. The question now becomes: how do we ensure that those SLOs are met for each Kubernetes release? So I will be talking about our testing infrastructure and the tools we use. Let's get started.

The first and most important tool we use for our scale testing is ClusterLoader2. You can think of it as a tool where you declaratively specify what states you want your cluster to be in. Imagine you start with an empty cluster, then you want to scale up to 100,000 pods, and then you want to update half of them, and so on. You can specify all those states you are interested in, and, even more importantly, you can also specify how you want to transition between those states. ClusterLoader2 then executes everything you specified, moving the cluster through all those states while also measuring all of the scalability SLOs we care about and making sure that none of them are violated. So let's say you are interested in running a cluster beyond the scalability limits we officially support: you could use ClusterLoader2 to test your own setup and make sure that nothing breaks at your scale. On top of that, when you run ClusterLoader2, it gathers quite a lot of information and metrics to help you debug in case some of the SLOs are not met. If you are interested in more features, go to our perf-tests repository on GitHub, where there are many more details about what ClusterLoader2 can measure and how it can help you debug scalability issues in a Kubernetes cluster.

Another thing is that running 5,000 nodes is quite expensive: at minimum, you need about 5,000 CPUs just to run the nodes themselves. So we were thinking about how to optimize that and make scalability testing more affordable, and we implemented Kubemark. The idea is that instead of running a real kubelet that does all the work, you use Kubemark to simulate the kubelet. With a single machine you can run many "hollow" nodes that put the same load on the Kubernetes control plane as real kubelets would, so instead of 5,000 CPUs you can go down to a few hundred CPUs simulating the real cluster, and then scale test a regular Kubernetes control plane with far fewer machines.

I mentioned that ClusterLoader2 exports quite a lot of data about the scalability SLOs we care about. During the development of Kubernetes we also want to keep track of how those SLIs change over time, so we have perf-dash, which lets us visualize all of the data from ClusterLoader2. As an example, here you can see pod startup latency, because that's one of the SLOs we care about: the 99th percentile of pod startup latency changed, and actually improved, over time during one of the Kubernetes releases. Then we obviously have Grafana: we have pretty good Grafana dashboards you can use if you are interested in debugging your run. ClusterLoader2 exports quite a lot of Prometheus metrics, and you can just plug in our Grafana dashboards to visualize what was going on during the test execution.
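To give a feel for what measuring one of these SLIs involves, here is a minimal client-go sketch, not ClusterLoader2's actual code, that watches pods and reports the time from creation to the Ready condition. The namespace and label selector are hypothetical, and the real pod startup latency SLI is defined more carefully (stateless pods only, excluding image pull time, and so on).

```go
// Minimal sketch: watch pods and report creation-to-Ready latency.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Hypothetical namespace and label used by the test pods.
	w, err := client.CoreV1().Pods("scale-test").Watch(context.TODO(), metav1.ListOptions{
		LabelSelector: "group=latency-pods",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	seen := map[string]bool{}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue && !seen[pod.Name] {
				seen[pod.Name] = true
				startup := cond.LastTransitionTime.Sub(pod.CreationTimestamp.Time)
				fmt.Printf("%s became ready in %v\n", pod.Name, startup)
			}
		}
	}
}
```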
On top of that, we also gather profiling data. Since we want to drive various improvements to the Kubernetes control plane, we try to grab as much profiling data as possible during test execution, so that later we can investigate which areas of the control plane can be improved.

So, once we have all this testing infrastructure, how do we actually protect Kubernetes scalability? We have periodic tests that we run every second day right now, I think: a performance test with 5,000 nodes where we try to stretch all the limits we mentioned before and ensure that all the SLOs are met. On top of that, we run a performance test with 100 nodes much more often, to catch significant regressions more easily; that definitely helps us pinpoint much sooner which PRs merged into Kubernetes caused them. We also test correctness: even though we care about the SLOs, we make sure that if you are using regular Kubernetes features, they all still work even when you have 5,000 nodes. Those three test suites are release blocking, so we ensure that all of them pass before each Kubernetes release.

We also have some non-release-blocking tests that help us debug various scalability issues, for example benchmarks. One of my favorites is a benchmark where we take a fixed Kubernetes version and compile it with newer and newer Go compilers, because we have observed that the Go compiler can affect Kubernetes scalability quite a lot; so we developed a benchmark that essentially benchmarks the Go compiler itself. Besides that, whenever someone opens a PR against Kubernetes, we have the option to manually run the scalability tests at 100-node or 5,000-node scale and compare. If we think a PR might affect, or might improve, scalability, we can verify that simply by running the tests for that specific PR.

Once all those tests are running, we use Testgrid to keep track of whether they are passing and make sure everything is fine, especially before a release. And as mentioned before, we've seen scalability regressions coming not only from Kubernetes itself but also from a bunch of other components: the Go compiler, the operating system, the scheduler, different etcd versions. I would say scalability is pretty sensitive, because there are so many pieces that together make up Kubernetes, and regressions can come from anywhere. Once we identify which part caused a regression in our SLOs, we coordinate with the relevant SIGs to get it fixed, or sometimes we contribute the fix ourselves.

Okay, thank you. And now the last part of the presentation, which is probably the most interesting for many people: the actual improvements we are making to the system. There are two main groups of improvements. The first is improving reliability at scale. Why reliability? Because you can think of scalability effectively as reliability at scale.
So those are the kinds of improvements that are not really pushing the limits further, but ensuring that within the existing limits the system is really rock solid: there are no hard cliffs, the system degrades gracefully, and so on. The second category are the improvements that actually push the limits of the system.

Given that SIG Scalability doesn't really own any non-test-related code, pretty much everything we do is done jointly with other SIGs, or is owned by them and contributed or even driven by us; those other SIGs are the ones who eventually own and maintain everything, so it all needs to be agreed with them. Most of what we do is done together with SIG API Machinery, so if you attend their talks, some of these things may appear there too, but these are the efforts we are heavily involved in.

So let's look at examples, starting with improving reliability. Probably the biggest thing here is API Priority and Fairness. For those who haven't heard about it, it's the rate-limiting mechanism in the API server that ensures that no load coming from clients talking to the API server will actually overload it. I'm super excited that we finally made it happen: we are graduating it to GA with the 1.29 release. It's been a long journey. We had alpha in 1.19, which is about three and a half years old, and we started working on it even before that; I think the first code we committed for APF was around 1.17. So it's been about four years since we started, and it's a great achievement.

The second thing is improving the upgrade experience at scale. We all know that upgrades are painful with Kubernetes, and scale just adds additional pain, so there are things we are doing to reduce it. Graceful shutdown in particular is something we significantly improved over the last couple of releases. Graceful shutdown of watches is probably the most important part, because we observe clusters in production with hundreds of thousands of watches. In the past, those watches were all shut down at the same time and all tried to reconnect at the same time, which overloaded the control plane many times. Now we have a mechanism that spreads the closing of those watches over a somewhat longer time, which makes the whole thing much less invasive.

The third thing I wanted to mention, and these are just examples, there are other things too, is API streaming of lists. Why are we doing it? It's not that hard to cause your API server, or your API server and etcd, to OOM with a relatively small number of large list requests, like "list all pods in the cluster". So we are introducing a mechanism that allows the response of a list to be served not as one big chunk, but in a streaming way, reusing all the machinery we already have for the watch API. We made an alpha in 1.28 that had a bunch of rough edges, which I think are all fixed now in 1.29.
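For context on what "opting in on the client side" means here, with the alpha a client asks the server to stream the initial state over the watch channel instead of issuing one big LIST, roughly as in the sketch below. This is not code from the talk; it assumes the WatchList feature gate is enabled on the API server, and the option and annotation names are those from the alpha, so they may change.

```go
// Rough sketch of a client opting in to streaming lists (WatchList).
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	sendInitialEvents := true
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{
		// Ask the server to stream the current state as ADDED events
		// before switching to regular watch notifications.
		SendInitialEvents:    &sendInitialEvents,
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		AllowWatchBookmarks:  true,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	count := 0
	for event := range w.ResultChan() {
		if event.Type == watch.Bookmark {
			pod := event.Object.(*corev1.Pod)
			// In the alpha, a bookmark carrying this annotation marks the end
			// of the initial state; afterwards the stream is a normal watch.
			if pod.Annotations["k8s.io/initial-events-end"] == "true" {
				fmt.Printf("initial state streamed: %d pods\n", count)
				break
			}
			continue
		}
		count++
	}
}
```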
API streaming of lists remained in alpha, though, for an interesting reason. As part of reviewing all the regressions we've had over the last three years in Kubernetes itself, one of the two main categories of regressions that were painful for us, and even more so for users, were changes that enabled new behavior on the client side, because we didn't have a reasonable way to easily disable them on the client side. So one of the action items from that exercise in the production readiness review subproject was to develop a mechanism for better enablement of features on the client side. Since we want to enable this feature on the client side as well, not just on the server side, because clients need to opt in to the new API, and since that mechanism is not ready yet, we decided to postpone beta; we are committed to making it happen in 1.30.

Okay, so the second category is pushing the limits of the system, and again these are just examples. The first example is improving CRD scalability in general. It's one of the well-known problems that the built-in APIs scale much better than APIs backed by CRDs. One of the things we are currently working on is introducing a binary protocol for serializing and encoding custom resources. We also wanted to make it happen in 1.29, but it didn't, primarily for the same reason as API streaming of lists: we don't yet have a reasonable way to enable it on the client side. But it's coming, and the preliminary benchmarks show at least a 2x improvement in how fast we can serialize and deserialize, so it will be a significant improvement.

From the same category, we improved how we serialize watch events. A couple of releases ago, or actually more like a dozen releases ago, we introduced a mechanism where, if multiple watchers are watching the same object, we serialize that object once rather than independently for every single watcher. But we missed one simple thing: we did that for the object itself, but not for the event, which is the wrapper around the object. The event contains the object plus the type: added, modified, deleted, or error. That part we were still serializing independently for every single watcher, and that is something we fixed this release, literally a week ago. It brings roughly another 2x improvement if you have a big cluster with a number of watchers watching the same objects.
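The gain from that kind of fix is easier to see with a toy comparison, not the actual API server code: encoding an object once and reusing the bytes for every watcher versus re-encoding it per watcher. The object, the JSON encoding, and the watcher count below are all made up; in the API server this happens in the watch cache with Kubernetes' own serializers.

```go
// Toy illustration: serialize once and reuse, instead of once per watcher.
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type object struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
	Data   []string          `json:"data"`
}

func main() {
	obj := object{Name: "pod-0", Labels: map[string]string{"app": "demo"}, Data: make([]string, 1000)}
	const watchers = 5000

	// Naive: serialize independently for every watcher.
	start := time.Now()
	for i := 0; i < watchers; i++ {
		if _, err := json.Marshal(obj); err != nil {
			panic(err)
		}
	}
	perWatcher := time.Since(start)

	// Cached: serialize once, hand the same bytes to every watcher.
	start = time.Now()
	encoded, err := json.Marshal(obj)
	if err != nil {
		panic(err)
	}
	for i := 0; i < watchers; i++ {
		_ = encoded // each watcher just writes the cached bytes
	}
	cached := time.Since(start)

	fmt.Printf("per-watcher encoding: %v, cached encoding: %v\n", perWatcher, cached)
}
```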
The second subcategory is a number of local improvements to the API server. We improved compression, we improved the handling of list requests with selectors, and so on. Each of them is a relatively small and very localized change on its own, but they fix concrete pains of our users and concrete issues they face in production.

The third subcategory is a number of improvements, primarily in the scheduler but also in API machinery and the API server, and potentially the CRD work I mentioned contributes here too: things that improve the throughput of the system, in particular pod throughput. With Kubernetes being used more and more for batch workloads, one of the important requirements in the batch world is how fast we can schedule, or how fast we can start, a large number of pods: jobs with hundreds, thousands, tens of thousands, or sometimes even hundreds of thousands of pods. So throughput really matters here.

There's one more thing I wanted to mention, especially for the category of pushing the limits further: we are not blindly optimizing the system for the sake of optimizing it. We should always be thinking about some particular use case, some user-originated problem, and optimize the system only if there is a real need for it. You might have seen me rejecting certain performance optimizations, even though they were actually improving the system, because they were introducing too much complexity relative to the gain we were getting from them. I'm not sure who came up with the idea, but I first heard it from Tim Hockin and I really like it: the idea of a complexity budget. We have a certain budget of complexity we can consume in the system, so if we introduce complexity in a certain area, for example to improve its performance, we won't have the budget left to do other things. So we need to be really careful not to do optimizations that our users don't really need.

And with that, I wanted to reiterate that we really need your help. If you are interested in any of this work, we have a ton of work in all the areas we were both talking about. Please reach out on Slack, please reach out on our mailing list, please join our scalability meetings: we have a bi-weekly meeting on Thursdays. Or reach out to us here, we are both around, and talk to us, and we will find something that you would like to help with and that will be useful for us. And with that, thank you very much for coming and joining the talk. I think we still have five minutes, or a little less, if you have any questions. There's a question there.

My question is regarding the ClusterLoader2 tests that you run based on the SLIs and SLOs. I wanted to understand whether, when you're getting those SLOs or SLIs, you use explicit mechanisms to extract those values or something built in. I'll take the example of pod startup latency. I think from 1.27 there is a new kubelet metric that directly measures the pod startup latency SLI as defined upstream, but I was also wondering about the fidelity of those numbers. If we measure it per pod individually and remove, for example, image pull times or the time to run init containers, we'd get an accurate value, but when we're using the built-in metric, what's its fidelity? The SLI is for stateless pods, but does it automatically exclude from the 99th percentile calculation pods that have PV attachments, or pods that have liveness and readiness probes, things like that? So I'm just trying to understand the fidelity of those metrics and how you get those numbers.
So that's a good question. I think it depends. I'm not sure exactly which metric you are referring to, but there is a metric in the kubelet that we added for this, and that metric doesn't do that on its own; there is either a TODO in the code or an issue open to actually incorporate that information, because we have it: we know which pod it is, so we can easily say what kind of pod it is. In ClusterLoader2 itself we are still measuring it slightly differently; we are not relying on the kubelet metrics, because they were added very recently and we haven't migrated to them yet. In ClusterLoader2 we do distinguish those pods, so the numbers there are much more accurate. They are still not perfect, I think there are some corner cases that we don't attribute to the right bucket, but on the other hand, those are tests, so you can easily check it, or we can easily fix it if you find an issue. So I guess the short answer is that on the kubelet side it's still to be done, although that part is relatively easy, so we just need to prioritize it.

All right, thank you. Anyone else? Three, two, one. Okay, thank you very much and enjoy the rest of the conference. Thank you.