Okay, hello everyone. My name is Wojtek Tyczyński. I'm a TL of SIG Scalability, and together with Shyam Jeedigunta, who is one of our chairs, we are going to give the SIG Scalability update today.

So let's start with clarifying what exactly we focus on in this SIG. There are four main things we are primarily doing, plus one additional one. The first is defining what Kubernetes scalability really means: what dimensions it has, what we really care about, and what our goals are. The second is measuring where we are with respect to those goals. The third is ensuring that we actually meet the goals by contributing improvements to the system. The fourth is ensuring that we don't regress from that point over time. And the fifth thing, which is a little bit orthogonal, is coaching and consulting the overall community on how to put scalability into their designs, how to think about scalability when implementing their features, and so on.

One thing worth mentioning is that you shouldn't confuse SIG Scalability with SIG Autoscaling; that happens a lot. SIG Scalability is about how far you can go in certain dimensions: how big a cluster you can have, how many pods per second you can schedule, how many load balancers you can have, and so on. Autoscaling is about dynamically adjusting the cluster, or what is running in the cluster, to accommodate the load. For example, Cluster Autoscaler changes the size of the cluster by adding or removing nodes, and the Horizontal Pod Autoscaler adds pods based on traffic. That is a different SIG; here we are focusing on scalability.

Okay, so let's go over those areas one by one, starting with the scalability definition: what is Kubernetes scalability? The important thing to keep in mind for any scalability- or performance-related improvement is that we shouldn't be optimizing the system or pushing the limits just for the sake of doing so. Every such change makes the system a little bit more complex, so we should always anchor in actual user requirements and actual user needs. If we ask users what they really want in terms of scalability, they want scalable clusters, but what does that really mean? They don't really know, and in many cases they don't want to know, because Kubernetes is the tool on top of which they build their businesses. They don't want to understand every single detail of the underlying infrastructure; they just want it to work.

Historically we thought about the scalability of Kubernetes as the number of nodes in the cluster. While that is a very important dimension, critical for many users, it's only a small part of the actual truth. Scalability of Kubernetes is in fact a multi-dimensional problem. There are many dimensions that really matter for how well your cluster behaves and whether it will be able to run your workloads: things like pod churn, number of pods per node, number of services, number of load balancers, and so on. There are dozens of dimensions that matter here. Together they form the scalability envelope, which is effectively a zone, or a subspace, in which your cluster stays happy.

What does it mean for the cluster to be happy? It means that the scalability SLOs, or in general any SLOs, are satisfied. Kubernetes isn't doing a perfect job of defining SLOs; in fact, I think the scalability-related ones are the only ones that we as a community have defined. But in general, we should be talking about any SLOs defined for the project. So what are SLIs and SLOs? I hope you already know these terms, but you can think of a service level indicator as a metric, and a service level objective as that metric plus a threshold below which the system is healthy. For Kubernetes we actually have two really mature SLOs: pod startup latency and API call latency. Those two are things that we measure and that are well tested; regressions on them block releases. We have a few more, primarily in the networking area, that we are already measuring and have defined SLIs for, but we haven't graduated them into real SLOs: we are not closely monitoring them, not blocking releases on them, and so on. Given that at least some of those, for example the DNS ones, are super important for AI/ML workloads, which Kubernetes is looking at more and more, there's a bunch of work we need to do here, and the first step would be graduating and maturing those SLOs and starting to really rely on them.

Maybe just a quick case study: defining SLOs is not easy. At the top of the slide you can see the first definition of the API call latency SLO, from a blog post of mine in 2015, so almost 10 years ago, and below it is how it currently looks. We don't have time to go into details here, but we clarified a bunch of things: we split the SLI and the SLO explicitly, we clarified which API calls matter, how they are aggregated, and over what time window, and so on.
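To make the SLI/SLO distinction concrete, here is a tiny, purely illustrative Go sketch that treats a set of observed API call latencies as the SLI and checks their 99th percentile against a one-second threshold as the SLO. This is just in the spirit of the published definition; the real SLO is evaluated per cluster-day from apiserver metrics, not from raw samples like this, and the sample values here are made up.

```go
// Illustrative only: evaluating an API-call-latency SLI against an SLO
// threshold. The real Kubernetes SLO is computed from apiserver metrics.
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples,
// using the simple nearest-rank method.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(p/100*float64(len(sorted))+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// Hypothetical latencies of mutating API calls collected during a run.
	latencies := []time.Duration{
		120 * time.Millisecond, 250 * time.Millisecond, 80 * time.Millisecond,
		900 * time.Millisecond, 300 * time.Millisecond, 150 * time.Millisecond,
	}
	const sloThreshold = 1 * time.Second // e.g. mutating calls on single objects
	p99 := percentile(latencies, 99)    // the SLI
	fmt.Printf("p99=%v, SLO<=%v, satisfied=%v\n", p99, sloThreshold, p99 <= sloThreshold)
}
```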
So in the end, what matters for users are two things. The first is what we guarantee, and those are exactly the SLOs. The second is what the limits of the system are: how exactly the scalability envelope looks. The problem is that defining it precisely is pretty much impossible. Fortunately, we have a reasonable approximation: things like number of nodes no greater than 5,000, number of services no greater than 10,000, and many more; you can take a look at the link on the slides. There are still many TBDs there; some we just need to document better, some we still need to understand better. So there's still a lot of work here, and if you are interested in this kind of work, there are a lot of places where you can help. And with that, we move to the second category, so I'm passing it to Shyam.

All right, thanks, Wojtek. As many of you have probably seen by now, it's a pretty ambitious charter that we've got for the SIG. It has a pretty broad scope, covering multiple aspects of the cluster, because scalability is not really about a specific feature or a specific component; it's really a property of the system. So how do we actually execute on this charter? The answer is: through a combination of tools, test frameworks, and processes that we've built and evolved in-house in the SIG over the years. Let's look at what those are; some of them might be pretty useful for you as well if you're trying to run scalability tests or evaluate the performance of your cluster.

I'll start with what is probably the most important tool we use, our primary tool, built completely in-house within the Kubernetes project and in this SIG: ClusterLoader2. Essentially, it makes it possible to write scale tests in a declarative form. This is something we learned the hard way over the years. Scalability tests are mostly end-to-end or integration tests, and they can be pretty hard and detail-oriented to write, especially the way you simulate creation of workloads, the rate at which you do it, and the kinds of interactions you emulate. It got complex enough that maintaining imperative tests, which were basically Go files, became unmanageable for us. So we came up with this tool, which models your test as a sequence of well-known operations, such as "create a set of deployments in this namespace at this given rate". You can define the rates at which you want to induce load and what kinds of objects you want to create. And the best part is that you can leverage built-in measurement functions that we already use heavily within the SIG, or extend them, because the tool is pretty modular. All of this makes it easy, if you were to use this tool, to define your own tests and extend it to new use cases. There's a link at the bottom of the slide; you should be able to open it from the PDF.
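To give a feel for what a single declarative step replaces, here is a hand-rolled Go sketch using client-go that creates a batch of deployments at a fixed rate. This is roughly the kind of imperative code ClusterLoader2 lets you avoid writing; the namespace, counts, rate, and image below are made-up illustration values, and ClusterLoader2 itself layers measurements, phases, and tuning sets on top of operations like this.

```go
// Sketch: imperative equivalent of one declarative scale-test step,
// "create N deployments in a namespace at a fixed QPS".
package main

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	const (
		namespace = "cl2-demo" // hypothetical; must already exist
		count     = 50         // how many deployments to create
		qps       = 5          // creation rate per second
	)
	ticker := time.NewTicker(time.Second / qps)
	defer ticker.Stop()

	for i := 0; i < count; i++ {
		<-ticker.C // pace creations at the configured rate
		name := fmt.Sprintf("load-%d", i)
		labels := map[string]string{"app": name}
		d := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Spec: appsv1.DeploymentSpec{
				Replicas: int32Ptr(1),
				Selector: &metav1.LabelSelector{MatchLabels: labels},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: labels},
					Spec: corev1.PodSpec{Containers: []corev1.Container{
						{Name: "pause", Image: "registry.k8s.io/pause:3.9"},
					}},
				},
			},
		}
		if _, err := client.AppsV1().Deployments(namespace).Create(context.TODO(), d, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
		}
	}
}
```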
All right, so the next tool up our sleeve is a pretty powerful one too. It's called Kubemark, and essentially it's a cluster simulator. What it does is bring scalability testing within the grasp of the common man, because the thing that deters most teams, companies, or even individual developers is not being able to afford the huge amounts of compute, and access to the kinds of resources, that we use within the SIG for running the project's scale tests. Essentially, it lets you simulate the behavior of a real cluster, specifically in terms of the traffic pattern received by the control plane, by creating fake implementations of the kubelet, kube-proxy, and similar components, which run as pods. The whole blue octagon that you see on the slide represents the base cluster, which is where you run a real Kubernetes control plane and real Kubernetes nodes. You then launch these "hollow nodes" as pods on top of those nodes; they act as the nodes of an overlay cluster, the Kubemark cluster, and talk to a new Kubernetes control plane. If you've ever played around with it, you probably know the power this paradigm gives you, as well as some of the limitations that come with it.
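As a toy illustration of the idea, and emphatically not the actual Kubemark implementation (a real hollow node runs genuine kubelet and kube-proxy code against mocked runtimes), here is a minimal client-go sketch that registers a fake Node and keeps heartbeating a Ready condition. That control-plane-facing behavior is, very roughly, what Kubemark multiplies by the thousands; the node name and reason string are made up.

```go
// Toy sketch of the Kubemark idea: a process that registers a fake Node
// and keeps reporting it Ready, so the control plane sees a "node" that
// consumes no real machine. Real hollow nodes do far more than this.
package main

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Register the fake node; "hollow-node-0" is a hypothetical name.
	node := &corev1.Node{ObjectMeta: metav1.ObjectMeta{Name: "hollow-node-0"}}
	if _, err := client.CoreV1().Nodes().Create(context.TODO(), node, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	for {
		// Report a Ready condition, like a kubelet heartbeat would.
		n, err := client.CoreV1().Nodes().Get(context.TODO(), "hollow-node-0", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		n.Status.Conditions = []corev1.NodeCondition{{
			Type:              corev1.NodeReady,
			Status:            corev1.ConditionTrue,
			LastHeartbeatTime: metav1.Now(),
			Reason:            "FakeKubeletReady",
		}}
		if _, err := client.CoreV1().Nodes().UpdateStatus(context.TODO(), n, metav1.UpdateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(10 * time.Second)
	}
}
```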
Okay, so we've talked about the tools we use to run these tests and create these scale clusters. As the next step, how do we actually make use of them? Great, we have a cluster and we have these tests running, but how are we going to measure performance and monitor things like memory usage and CPU usage? For that too, we've built a whole observability and debugging toolkit within the SIG.

The first piece is Perfdash; I've put a screenshot here, and there's a link at the bottom with a public URL where you can access it. It essentially shows various kinds of metrics for individual components and systems within the cluster, as well as end-to-end metrics like pod startup latency, and maybe DNS latency, I'm not sure, but things like that. The X axis is essentially time, as we keep running these jobs over and over again, and the plotted values are what we measured for those metrics. It is pretty useful for seeing how the patterns of various API metrics, and the performance of various components, change over time as we optimize certain things or make larger, maybe architectural, changes.

The next piece: for each run of these large-cluster test jobs, we dump the whole set of Prometheus metrics from the control plane components, and a bunch of Kubernetes components on the nodes too, as a Prometheus snapshot covering the metrics captured during the run. If you want access to them, you can essentially bring up your own Grafana instance, point it at the metrics snapshot, and you'll be able to see how these things behaved.

And the final one is profiling. This is for the more savvy scalability or performance experts. We use these when things go south: let's say we've run into unusual API latencies that we cannot quite explain with logs and metrics, or we see that processes get OOM-killed because memory kept growing. For these we also capture CPU profiles and memory profiles, and I believe we also have the ability to get lock profiles for measuring how bad lock contention is. These are, again, part of the output that these test runs produce.
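As a sketch of the profiling workflow, assuming the component's profiling endpoints are enabled and reachable (for the apiserver, for instance, via kubectl proxy), here is a small Go program that pulls a 30-second CPU profile over HTTP for offline analysis with go tool pprof. The URL is an example for a local kubectl proxy, not the setup our CI uses.

```go
// Grab a 30-second CPU profile from a component's pprof endpoint.
// Assumes profiling is enabled on the component and the endpoint is
// reachable, e.g. `kubectl proxy` exposing the apiserver locally.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Example URL; adjust host/port/path for your own setup.
	resp, err := http.Get("http://127.0.0.1:8001/debug/pprof/profile?seconds=30")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("apiserver-cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Save the raw profile; then analyze with: go tool pprof apiserver-cpu.pprof
	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```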
Okay, so now we've spoken about the toolkit and about how we go about observing and debugging scalability. What does the SIG actually do to ensure the Kubernetes project continues to scale to 5,000 nodes, and along all the other dimensions that Wojtek mentioned earlier? The first thing is that we capture the various scalability properties of a Kubernetes cluster through a series of test jobs. There are two dimensions to these. The first is the scale at which we run them. We don't just run the 5,000-node cluster test, because it takes a while, it is expensive, and the feedback loop is slower. So we complement it with a bunch of smaller-scale tests that we run more aggressively to catch regressions sooner. Some of those even run as presubmits: if you submit a PR to the Kubernetes project, it is evaluated against at least a 100-node cluster, for instance. The other dimension is what the tests actually exercise and what they make sure scales. This includes a whole slew of test suites that come from different SIGs: things like storage and network, and various benchmarks, for example for the scheduler and for DNS. We'll talk about that in a second too.

So we've got this whole battery of tests, and this is TestGrid; again, there's a link at the bottom. If you go to the sig-scalability tab, it'll show you the whole slew of test suites that we run. It's pretty diverse, in the sense that the tests run on different kinds of compute: they run on EC2, they also run on GCP, and there are Kubemark tests as well. They also test different kinds of things. For instance, we run tests against the tip of the Go version that we use, to the point that we've actually caught regressions in Go itself with the Kubernetes scale tests in the past and had changes made in response. Similarly, we have a bunch of benchmarks and experiments; you can see many of the experimental tests are failing, which I guess makes sense since they're experiments. I'll definitely encourage anyone who's interested in understanding more deeply what we cover here to take a look at that dashboard.

And with that, I'm really, really excited and happy to announce, on behalf of the SIG, that we have recently started running the 5,000-node scale test upstream on AWS too. This is huge news for the SIG for a few different reasons. First and foremost, it finally helps us decentralize these tests: have them run on multiple providers, have multiple people develop expertise in running the tests and the infrastructure and in debugging issues, and it also allows us to vet different kinds of infrastructure. There are some differences; for instance, the AWS test runs using kOps, while the GCP tests run using different bootstrap scripts. The idea is to use these tests fungibly: both provide signal on how well we are doing, and being able to use them interchangeably is really powerful, because when we run into an issue with one particular setup, we can keep the other one as a hot standby, increase and decrease their frequencies, and along the way gather all the benefits I just mentioned.
Cool. Okay, so now we also have CI tests for scalability actually running. How do we make use of them? How do we leverage them? There are two high-level themes. One is that we use them to catch regressions, which ensures we ship releases of high quality. The other, which I mentioned earlier, is catching general performance bottlenecks in the system. Scalability is a pretty broad, horizontal property of the system, as we discussed earlier, and it is indeed sensitive to all sorts of things: a scalability regression can come from pretty much anywhere in your cluster, up and down the stack. For instance, we've seen some really interesting performance regressions with Go releases, which explains why we added them as tests later on. Quite often these are changes that don't alter the semantics or API behavior of the programming language, but change the underlying implementation of something. In the net/http library, in the default HTTP/2 serve implementation, a new case was added to a switch statement; it seemed pretty harmless at the time, but it ended up blowing up the whole cluster, and we went back to the Go team and got it fixed.

Similarly, regressions can come from the operating system, things like your sysctl settings; from your provider; from how your controllers are written and the kind of traffic they induce on the control plane; and then there's the most common category, which is API machinery. There are a few examples of interesting issues we recently caught, which I'll talk about on the next slide. We end up catching these regressions at quite a good success rate in these tests, and many times we debug them ourselves and fix them, or we triage them out to other SIGs, because it really spans the whole stack.

Okay, so these are issues from roughly the last six months to a year, and they highlight different flavors and nuances of the issues we run into. It's a super hyperlinkified slide, so I'm sorry if it's all too blue for you, but I kept the links in there in case anyone wants to go in and see what those issues were about. The first one is a rather newer issue. We saw that under certain load patterns, on newer versions of Kubernetes, when a client made a watch request to the API server without specifying a resource version, some of the watch events were being lost. We did a whole drill around this to get to a repro, and essentially what we found was this: when many such watch requests arrive at the API server at once on a high-churn cluster, say with many pod mutations happening, and the resource version is not provided, the API server forwards those watch requests to etcd, and all the watches get multiplexed onto a single gRPC stream. The way etcd handles these watches has certain inefficiencies that were highlighted by this issue. So it exposed a whole bunch of things. It is technically a regression, because it came up as a consequence of us fixing a bug in the API server, but it exposed a whole bunch of latent issues in etcd. Similarly, there are other issues that may not technically have been regressions, and may have existed for a while, but we caught them through tests or experiments and used them to improve performance for Kubernetes users.
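For context on what "specifying a resource version" means on the client side, here is a minimal client-go sketch of the common list-then-watch pattern, where the watch is anchored at the resource version returned by the list; leaving ResourceVersion empty is the "unset" case the issue above concerns. How the server actually serves each case (watch cache versus etcd) depends on the Kubernetes version and configuration, so take the routing description as a simplification.

```go
// Sketch: starting a watch from an explicit resourceVersion obtained
// from a prior list, versus the "unset" case discussed above.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	pods := client.CoreV1().Pods("default")

	// List first to obtain a resourceVersion to anchor the watch on.
	list, err := pods.List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Watch from that version. An empty ResourceVersion here would be
	// the "unset" case the issue above was about.
	w, err := pods.Watch(context.TODO(), metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		fmt.Println("event:", ev.Type)
	}
}
```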
All right, with that I think I'll hand it over. Oh, sorry, I think I have one more slide. So, quickly, another thing we keep doing is adding more and more to the test suite and improving coverage, especially as new use cases evolve, like the AI/ML use cases Wojtek mentioned. They require us to handle high throughput, a high steady-state number of nodes running in the cluster, and so on, as well as things like HPA and DNS that Wojtek mentioned; I believe he said some of those are work in progress. (Yes, we have DNS covered; we just aren't blocking releases on regressions there.)

Okay, so going into the last category, driving scalability improvements, there are basically two main buckets of things we are doing in this area. The first is improving reliability at scale without really pushing the limits; effectively, you can think about scalability as reliability at scale. This is about ensuring your cluster keeps running and stays healthy within the existing limits, but under new circumstances, for example while upgrades are happening. The second area is pushing the limits, which is pretty self-explanatory. Given that SIG Scalability doesn't own any production code, pretty much all of these improvements are joint with some other SIG; most of them, currently maybe even all or almost all of them, are joint with SIG API Machinery and SIG etcd.

Looking at some of the most notable examples of what is happening now and what happened over the last couple of months: we finally GA'd API Priority and Fairness in December, after four years; I think the first alpha release was 1.17, so it was a long journey. That doesn't mean it's perfect, but it certainly is in reasonable shape and we have a stable API. There were a bunch of improvements to the upgrade experience, and more is coming; there are some KEPs being worked on now that will help with other parts of the upgrade experience, and graceful shutdown is one of the biggest examples of what we've visibly improved over the past couple of releases.

There are two more things worth mentioning here. One is consistent lists from cache. We know etcd is one of the biggest bottlenecks in the system for pushing the limits farther, so the more load we can offload from etcd and serve from API server caches, the farther the system can go, and consistent lists from cache is one of the biggest improvements in this area. Currently we can serve list requests from the cache, but we don't guarantee any consistency for them; with this improvement, we will be able to ensure they are consistent. We were planning to have it beta in 1.30, but unfortunately, due to some issues on the etcd side itself, namely with the progress-notify feature of etcd that we heavily rely on here, our testing uncovered some corner cases that are not handled well, so we need to postpone it and fix those issues first. The second thing worth mentioning is streaming lists, and it's a very similar situation: we were planning to go beta in 1.30, and it slipped because we rely on exactly the same etcd feature. What it does is let us mitigate, or work around, the limits on how large a single response can be, by using the watch protocol to stream the contents of the list. From the user's perspective they get exactly the same data; we just use a different protocol underneath.
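As a sketch of what the streaming-list request looks like from a client, here is the request shape as it exists behind the WatchList feature gate: a watch that asks the server to replay the current state as initial events, ending with a bookmark. The field names come from metav1.ListOptions, but the feature was still alpha at the time of this talk, so treat the details as subject to change.

```go
// Sketch: a "streaming list" request, i.e. a watch that first replays
// the current state as ADDED events and then a bookmark marking the end
// of the initial list. Requires the WatchList feature gate on the server.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	sendInitialEvents := true
	w, err := client.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		SendInitialEvents:    &sendInitialEvents, // replay current state first
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		AllowWatchBookmarks:  true, // the end of the replay is a bookmark
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			fmt.Println("initial list complete")
			continue
		}
		fmt.Println("event:", ev.Type)
	}
}
```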
And from the pushing-the-limits category, the most important items are around improving CRD scalability. The scalability of CRDs compared to built-in resources is one of the biggest problems we have, and we still need to take it into account. We are working on introducing binary encoding for CRDs: currently CRDs only support the JSON protocol, and CBOR-based encoding is happening. We were planning alpha for 1.30, but unfortunately it missed that release. There are a couple more localized improvements, both for CRD scalability and other, more generic things like faster compression. One thing that is not on the slide but is probably also worth mentioning is the OpenAPI improvements; those aren't necessarily focused on the largest clusters, but they help a lot with how many resources OpenAPI generation uses. And finally, there are a bunch of improvements toward high pod throughput, which is a very important use case, for AI/ML training in particular.

And I think that's all we have. If you are interested in this area, we have bi-weekly public meetings, we have a Slack channel, and we have a mailing list; please come and join and help us make all of this happen. Thank you. And I think we still have two minutes, so are there any questions?

Q: Yuan Chen from NVIDIA, two questions. First, do you think the current definitions, scope, or scale tests can apply to running emerging AI workloads in GPU clusters? In particular regarding reliability: we know GPU clusters are not reliable at all. Any plans for GPU-cluster, AI/ML-workload scale testing or tuning support?

A: We would love to. I think the problem here is the obtainability of GPUs; they are super expensive, so those tests would also be super expensive. If we can get those resources, sure, we can definitely do that. The problem is having the resources to test on, or simulating them somewhere else; if someone has time and ideas for how to do that, we would definitely appreciate the help.

Q: That's my follow-up question. The hollow node, to my best knowledge, has some kind of fault-injection feature; any updated plan there? That's something we could probably use, and a simulated GPU is another thing. Is that something you think could be valuable?

A: Yes, I think there are many places where we can improve things; it's a matter of having the capacity to do it. I would love those to happen, and if there is anyone who can help, we can definitely help make it happen by guiding, helping with reviews, and so on.

Q: Okay, great. Finally, how do you compare Kubemark with KWOK, which is definitely seeing increasing adoption? Do you think you could potentially integrate KWOK into the scalability testing, and even replace Kubemark?

A: I think we should think about how to converge those two into a single thing. I didn't have time to look into that in the past few months, but I would definitely like to converge and have a single thing, instead of evolving two things in parallel that effectively serve the same purpose.

Q: Okay, great. Thank you very much.

Okay, I think we are out of time.
So thank you very much once again, and if you have any other questions, we are both here; feel free to catch us during the conference, somewhere in the corridor. Thank you.