Okay, thank you. So hello everyone. My name is Wojciech Tyczyński and I'm a SIG Scalability TL, almost forever at this point, at least that's my feeling; I've been with this community for about eight and a half years. Today I'm going to give a presentation about SIG Scalability and what we are doing there, both as an introduction and a little bit of a deep dive.

Let's start with what we do as part of SIG Scalability. There are a couple of different categories of things. First, we define and drive what scalability of Kubernetes really is, which, as you will see in a moment, is not really obvious, and what our goals are in terms of where we would like the system to get. Based on that, we coordinate and contribute the actual improvements to reach those goals. We monitor and measure the system and its performance, to verify from actual system behavior that our goals were really reached. We protect the system from scalability regressions, which is probably one of the most important things. And finally, we coach the community and consult on the many different improvements and features happening across the whole Kubernetes area.

One more note here: we are at least sometimes confused with SIG Autoscaling. Those are two different things. In SIG Scalability we focus on the overall performance of the system: how far you can go, how far you can push certain limits of Kubernetes. SIG Autoscaling is focused on how to horizontally or vertically scale certain aspects of the system; for example, Horizontal Pod Autoscaling and cluster autoscaling are part of SIG Autoscaling, and people sometimes confuse the two. That's not what we do in SIG Scalability.

Okay, so let's start with the first thing: defining and driving. What actually is Kubernetes scalability? I think the most important point, which is not specific to Kubernetes itself but is important to keep in mind, is that we shouldn't be optimizing the system for the sake of optimizing. Every single optimization makes the system a little more complex, harder to debug, harder to reason about, and so on. It's important to always anchor optimizations in actual user needs. If we ask users what they really want, they often say they want scalable clusters, they want Kubernetes to be as scalable as possible. But if we ask them further what that really means, it is not at all obvious to them.
In many cases it's not even that they don't know; they don't want to know. They want to focus on their own business, and they want the system to just work for them, without having to understand all the details of how it works and how its pieces interact with each other.

Historically we thought about Kubernetes scalability in terms of the size of the cluster, the number of nodes. But that's not really true, or at least it's not the full truth. Kubernetes scalability is a multi-dimensional problem, and the number of nodes, the size of the cluster, is actually only one of those many dimensions: the number of secrets, the number of load balancers, the number of persistent volumes, the number of pods in the cluster, and so on. All of those are dimensions that affect scalability, and they are the things our users are asking about. One dimension that is interesting to mention is pod churn: with a lot of work currently happening in the community towards better support for batch workloads on top of Kubernetes, pod churn is one of the questions we get most frequently from users. They are basically asking: how many pods per second can we create in the cluster? This is probably the most frequently requested dimension to improve right now.

Okay, so what is this scalability envelope I mentioned? The scalability envelope is a zone, a subset of this multi-dimensional space, within which your cluster is supposed to be happy. What does it really mean for the cluster to be happy? It basically means that the scalability SLOs are satisfied: if you are within this subspace, within this scalability envelope, the scalability SLOs will be, or at least should be, satisfied.

I hope you are familiar with this terminology, but very quickly: SLI is a service level indicator, and you can conceptually think about it as a metric. SLO is a service level objective, and you can conceptually think about it as that metric plus a threshold that needs to be met in order for the SLO to be satisfied. We have a couple of SLOs defined for the scalability of Kubernetes, but as you can see they are definitely not covering the whole surface of Kubernetes.
In fact, they probably cover significantly less than half of it, and that is definitely not the desired state. We would like to have much bigger coverage, but it requires a lot of work, and it's one of the areas where many of you can hopefully help. If you are working on a feature, it would be good to reach out to SIG Scalability and think about how we can measure it, how we can define the scalability limits of your area, and so on. Please reach out, and let's work together on extending this coverage and making it better for our users.

Okay, so let's look at one example of an SLO: the API call latency SLO. The first mention of this SLO comes from my blog post from 2015, where it was formulated as "99% of all API calls return in less than one second." That's something, but it has many problems. In particular, I bet that my understanding of it and your understanding of it are slightly different. I would even say more: my understanding of this SLO now and my understanding when I was writing it, even those are different. One of the core principles when defining SLOs is that they need to be very precise, because they are effectively our contract with our users. We need to ensure that what we want to guarantee for them is exactly how they understand those guarantees.

So here is how this SLO looks today. We finally split it into an actual SLI and SLO. The SLI is currently the latency of processing a request; the "processing" part was added a couple of weeks ago, to explicitly exclude waiting time. One of the things happening in Kubernetes currently is adding API Priority and Fairness to the API server, which is basically our overload protection, and one consequence of it is that, in case the control plane is overloaded by load coming from different clients, some requests may hang in a queue before they start being processed. Given that this depends entirely on the load on the control plane, we can't really provide any guarantees about it, so we exclude the time requests spend waiting in those queues.

Basically every single word in both the SLI and SLO definitions actually matters. The "mutating" part: we have a sibling SLO for read-only calls, just with slightly different thresholds. The "single object" part is also important because it excludes delete-collection; we can't guarantee low enough thresholds for a delete-collection call that may want to delete, say, 100,000 objects in one request. "For every resource/verb pair" shows that we are not putting everything into a single bag, but treating every request type, like POST pods or PUT endpointslices, separately. And finally, "measured as the 99th percentile over the last five minutes" shows exactly how we measure it, over what period, and so on. This is super important to make precise, to ensure that the way we measure it is the way everyone understands it.

And the SLO: similarly to the "processing" part of the SLI, we scope it to a default Kubernetes installation because we want to exclude webhooks, which any user or operator can install as they like and which we have no control over. They could install a webhook that sleeps for 10 seconds and does nothing else, and we wouldn't be able to guarantee anything at that point. So we exclude those, and then we say that for every such resource/verb pair, excluding virtual and aggregated resources, the 99th percentile per cluster-day is less than one second. I think the SLOs are super important because they are exactly how we define scalability; they are part of the scalability definition.
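To make this concrete, here is a minimal sketch of how one could check an SLI like this against a Prometheus instance that scrapes the API server. The metric apiserver_request_duration_seconds is the one kube-apiserver really exposes; the Prometheus address and the exact label filtering below are assumptions for illustration, not the queries the official tests use.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed Prometheus address; adjust for your environment.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// 99th percentile of mutating, single-object API call latency over the
	// last 5 minutes, broken down per resource/verb pair -- mirroring the
	// SLI wording from the talk. Delete-collection and watches are not
	// matched by this verb/scope filter.
	query := `histogram_quantile(0.99, sum(rate(
	    apiserver_request_duration_seconds_bucket{
	        verb=~"POST|PUT|PATCH|DELETE", scope="resource"
	    }[5m])) by (resource, verb, le))`

	result, warnings, err := promAPI.Query(context.Background(), query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Each per-resource/verb sample should stay below the 1s SLO threshold.
	fmt.Println(result)
}
```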
If you are interested in a bit more on this topic, I gave a talk, in Barcelona if I remember correctly, that was purely focused on SLIs and SLOs in Kubernetes, so you may want to watch it. But let's move on.

We have those SLOs, and they define the scalability envelope for us, but that is a very implicit definition, and it's not exactly what our users want. Unfortunately, precisely defining the scalability envelope is almost impossible, and it's not what our users want anyway; what they really want is to understand whether their setup is within the envelope or not. Fortunately, we can approximate it relatively simply by providing thresholds for the individual dimensions: number of nodes less than 5,000, number of services less than 10,000, and so on. There are many of those limits; you can look into the link in the presentation. Many of them are still marked as TODO, and it's again something we would like to fill in. If you want to help with filling that in, that would be great; reach out to us and let's make it better for our users.

Okay, that's mostly it for defining scalability. Let's look a bit more at how we actually measure those SLOs, what we are testing, and how. Over the last couple of years we built a bunch of scalability testing infrastructure. The first and probably most important piece is our testing framework, called ClusterLoader2. It is effectively a bring-your-own-YAML test framework where, as a user or test creator, you semi-declaratively describe the desired state of the cluster, and ClusterLoader2 behind the scenes brings your cluster to that state and verifies whether all those SLOs I mentioned before are actually satisfied. Why semi-declaratively? Because in addition to defining the actual desired state of the cluster, for example that I want 10,000 pods across five hundred deployments and another thousand stateful sets and so on, you can also define a little bit of how the creations or updates should be spread over time; for example, that I want those deployments to be created evenly across five minutes. It was designed for easy extensibility, so adding new SLOs or new helper functionality should be relatively simple to do, and it already provides a bunch of extra observability features. We don't have enough time to go over all of those today, so you can read a little bit more about it in our documentation.
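Real ClusterLoader2 test definitions are YAML files, so the sketch below is not how you drive the tool itself; it is only an illustration of the pacing idea, spreading a batch of creations evenly over a time window, written with plain client-go. The counts, namespace, and image here are made up for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	const (
		deployments = 500             // desired state: 500 deployments...
		window      = 5 * time.Minute // ...created evenly over 5 minutes
	)
	interval := window / deployments

	for i := 0; i < deployments; i++ {
		d := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("load-%d", i)},
			Spec: appsv1.DeploymentSpec{
				Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": "load"}},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "load"}},
					Spec: corev1.PodSpec{Containers: []corev1.Container{
						{Name: "pause", Image: "registry.k8s.io/pause:3.9"},
					}},
				},
			},
		}
		if _, err := client.AppsV1().Deployments("default").Create(
			context.Background(), d, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
		}
		time.Sleep(interval) // even spread over the window
	}
}
```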
I'll go to the next thing, which is cluster simulation, a tool we call Kubemark. We built it a couple of years ago, and it allows us to simulate large Kubernetes clusters with much less capacity. It is primarily focused on validating the scalability and performance of the control plane itself. What we are doing is running a regular Kubernetes control plane, and just faking the nodes of the cluster. We call them hollow nodes, and they run almost regular Kubernetes node components, like the kubelet and kube-proxy, except they fake some of their work. For example, the hollow kubelet is the actual kubelet code, but underneath, the CRI is faked: it doesn't really start any pods or anything like that, it just pretends it started them. Similarly for kube-proxy: it watches all the Services and EndpointSlices and computes what the iptables rules should look like, but it doesn't update the real iptables. We run those hollow nodes as pods in some other cluster, and thanks to that we can simulate the actual large cluster with roughly 10% of the capacity that would be needed to really run it.
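The real hollow kubelet reuses the genuine kubelet code with a fake CRI client injected underneath; the toy sketch below only illustrates that concept, a runtime stub that records pods as "running" without doing any real work. All names here are hypothetical, not Kubemark code.

```go
package main

import (
	"fmt"
	"sync"
)

// FakeRuntime illustrates the hollow-node idea: it satisfies whatever
// pod-lifecycle interface the caller expects, but only pretends to act.
type FakeRuntime struct {
	mu      sync.Mutex
	running map[string]bool
}

func NewFakeRuntime() *FakeRuntime {
	return &FakeRuntime{running: map[string]bool{}}
}

// StartPod does no real work: no containers, no cgroups, no network.
// It just records that the pod is "running", so status reporting upstream
// (and therefore the control plane under test) behaves as if it were real.
func (f *FakeRuntime) StartPod(name string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.running[name] = true
	return nil
}

func (f *FakeRuntime) PodRunning(name string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.running[name]
}

func main() {
	rt := NewFakeRuntime()
	_ = rt.StartPod("web-0")
	fmt.Println(rt.PodRunning("web-0")) // true, yet nothing actually ran
}
```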
So It's we are running it at scale like of 100 nodes and 5000 nodes periodically It's something that we like release team is also paying for paying attention to so It's an important thing to keep in mind We used to also have like pre-submit tests at the scale of 100 nodes You might if you are contributing you might be probably familiar with them They are no longer required at this point They are only optional if you if you explicitly trigger them, which is good to good thing to do with if you are not Entirely sure like how your change is affecting the system but due to The cost cuttings that we had to do that you might have heard like we are We're on track to be out of the budget for like second half of the year or second Last quarter of the year We we had to do a bunch of cuttings across the whole project to reduce the cost of our infrastructure And like that was one of the areas that was hit by by by all of that So they are they still exist They are just like optional and not not run or not running by default and we have a bunch of like other tests like cubemark test benchmark and so on and If you are interested a little to learning to learn a little bit more about them like there is like we have a whole 6k ability dashboard when Where all of those tests are actually shown they are fortunately fairly stable at this point so Some regressions are happening from time to time, but I think we didn't have freely any in the last release for example So it's it's much better than it used to be like couple years ago where we were facing like multiple regressions per release usually Okay, so about the regressions it's it's worth mentioning that They are happening across all the system like obviously many of them were in the API server or Scheduler or like the core control core Control plane components, but we had a bunch of regressions. We've been cubelet or cube proxy or and so on because in it in a like In the large clusters that the number of doses is like the multiplying factor of how many of those is Super important, but we also have that like regressions in in go itself like the goal like the the newer versions of go were making Our components behave differently and introducing regressions. We had like we we've seen Regressions and operating system themselves. So even if like from the first glance like the thing that you are working on Doesn't look like like scalability related like it's it's always important to keep the Scalability in the back of your mind when you are working on whatever change in in in the system Okay, and the the last thing which is usually the most interesting for for many people We're just like driving the scalability improvement. 
Streaming lists are another thing, again a little more on the reliability side than the scalability side, but those are tightly coupled. The idea here is that instead of getting all the resources the usual way, sending a request and getting the whole response back, we use the watch protocol for getting the list data. This went alpha in 1.27, which was released a couple of weeks ago, last week I think. It helps hugely with the memory consumption of the API server, but it also helps a lot with very large collections. If you have gigabytes of data, say you have 100,000 pods, and they can actually be relatively large, then downloading them in a single API call is simply time consuming, and it can even exceed the one-minute limit we have for API requests. By switching that to watch, we are working around that limitation.
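In client-go terms, the streaming approach looks roughly like the sketch below: instead of a one-shot List call, the client opens a watch that asks the server to send the current state as initial events first. The SendInitialEvents field matches my understanding of the alpha API in 1.27; treat the details as a sketch rather than gospel (in particular, a real client should check the bookmark's annotation rather than stopping at the first bookmark, which is simplified here).

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Ask the server to stream the current state as ADDED watch events
	// instead of answering a one-shot LIST. Requires the (alpha in 1.27)
	// watch-list feature to be enabled server-side.
	sendInitialEvents := true
	w, err := client.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{
		SendInitialEvents:    &sendInitialEvents,
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
		AllowWatchBookmarks:  true,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		// A bookmark marks the end of the initial state; after it, the
		// stream continues as a normal watch.
		if ev.Type == watch.Bookmark {
			fmt.Println("initial state fully streamed")
			break
		}
		fmt.Println(ev.Type) // ADDED events carrying the existing pods
	}
}
```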
Graceful shutdown is another interesting thing we did. We have seen a bunch of cases where, once the cluster is up and stable, it works fine, but even without any changes to the load, if you try to recreate or upgrade the control plane, that was blowing up the cluster. The reason was primarily watches. In the biggest clusters we observe even hundreds of thousands of watches, so if we bring down one API server, or a couple of API servers at the same time, suddenly all those watches that were established and just working in a stable state all want to re-establish themselves again at the same time. That was blowing up the control plane completely, in some cases even to the extent that it couldn't recreate itself correctly at all. That was also improved in the 1.27 release, along with a bunch of other smaller things. Unfortunately, we don't have much time to talk about them.
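The 1.27 fix was on the server side, but the failure mode itself, thousands of clients re-opening watches at the same instant, is a classic thundering herd. As a generic illustration (not the actual change that was made), this is what jittered exponential backoff looks like with apimachinery's wait package, which spreads reconnects out in time:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay
		Factor:   2.0,                    // exponential growth
		Jitter:   0.5,                    // randomize delays so clients spread out
		Steps:    5,                      // give up after five attempts
	}

	// Retry "re-establish watch" with jittered exponential backoff, so a
	// control-plane restart doesn't see every client reconnect at once.
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := reestablishWatch(); err != nil {
			fmt.Println("watch failed, backing off:", err)
			return false, nil // not done; retry after the next delay
		}
		return true, nil
	})
	if err != nil {
		fmt.Println("gave up:", err)
	}
}

// reestablishWatch is a hypothetical stand-in for re-opening a watch
// connection against the API server.
func reestablishWatch() error { return nil }
```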
One thing I want to mention at the end, getting back to what I said at the beginning: you might have seen me or someone else rejecting some proposed optimizations. We shouldn't be optimizing the system for the sake of optimizing. I've seen optimizations that improved some random function somewhere by 1% while introducing hundreds of lines of code; that's not really something that's worth it. It's important to keep the complexity-versus-return-on-investment trade-off in the back of your mind when you want to optimize something.

And yes, if you are interested in any of this: as you have seen, we need help, and we have a bunch of work that it would be great to get done in all those areas. Please reach out to us, join our biweekly meetings, reach out on the Slack channel or mailing list or whatever, and help us make the system better for our users. With that, I think we have five more minutes, and I'm happy to take any questions. That's all I have prepared for today.

Q: Excellent presentation, thanks for that. From the user perspective, we know that many organizations tend to have more than one cluster. Can you share some tips for those who want to divide workloads among many clusters? For example, which kinds of workloads are so intensive in terms of performance that they could impact other workloads?

A: Sure, that's a great question, and for it to be done correctly it's something we would need to work on more deeply with SIG Multicluster, so it's somewhere in between. I don't think there is a strict rule that these things should run together and those shouldn't. In general, the networking area is the most stressing for the control plane; it's where we have seen the biggest number of issues. How your whole networking stack works is also very different between cloud providers, or whatever technology you use, whether you run on bare metal, and so on; those have very different characteristics, so it's important to keep that in mind. And whether you are using a kube-proxy-based stack or a Cilium-based stack or whatever, those are all quite heavy for the control plane. So understanding how much churn you will observe in your services, and so on, can be a significant factor in how to split your workloads. A thousand services, especially a thousand stable services that are just there, are usually fine; it's a big churn across all of them that is often problematic.

Q: The current scalability limit of 5,000 nodes: I know that came about quite a few years back, and there's been loads and loads of work going on since then. Are there any plans to revisit that limit, or to test beyond it? And what are the current bottlenecks that you're seeing?

A: To be honest, this is what we officially support in the open-source stack. Internally at Google, where I work, GKE supports 15,000-node clusters, and underneath we are using nothing but open-source Kubernetes, just tuned and configured differently; it's not that we have a gazillion patches improving that. So 5,000 nodes is definitely not a hard limit; it's more that a lot of the ecosystem around the cluster would also need to be improved. We don't have any plans to push it further in open source, because first, it's expensive to test for regressions at that scale, and second, there isn't much desire from actual users for it, and we get back to the principle that if people don't need it, let's not complicate the system. At least for now, we are focusing more on pushing those other aspects, like system throughput in general, and we don't have any plans to push the size of the cluster further, at least in the next year or two.

Q: Makes sense. So it's largely what you're seeing with other external factors, like third-party controllers, as well. And is etcd one of the issues there? Is it just the sheer number of watches, and is etcd becoming a bottleneck?

A: Etcd can become a bottleneck, but we run GKE with etcd underneath at that scale too, so it's possible to do.

Q: Hi. If you hit a scalability issue, what is the first step you would take to diagnose where the problem is? Because I'm a bit blind when I see it coming: everything is hanging or has stopped working, and I don't know where the problem is.
A: This is a good question, and I don't have a good answer to it, to be honest. Scalability problems are usually the toughest ones to debug, and even in our team, and the GKE team, there aren't that many people who can debug them. It's a matter of having expertise, understanding the whole of Kubernetes, looking at what was happening in the cluster, looking into metrics, and there are a lot of different metrics, and trying to build some intuition. So I don't really have a good answer.

Q: I saw a dashboard there on the slides. Can that be used to see where the bottlenecks are, on a live cluster?

A: Yes, at least in some cases they are helpful. The Grafana dashboards we have are open source, and they can be used on a live cluster: if you are using Prometheus or something like that, you can just open them for your cluster and look at what is going on. So yes, I recommend looking into them; in many cases they might help you see what is happening.

I think we are running out of time, so maybe one last question; I'm happy to answer some questions offline after the presentation.

Q: Is there some subset of the current tests that someone developing an operator could run, to check how scalable it is, where it breaks, and so on?

A: Yes and no. All the tests are available, so you can run them; it's a matter of whether you have the infrastructure to run them. You can even run them as a presubmit for your PR once you have something working, so that is one possibility. The caveat is that they don't really cover all aspects of the system, so if you are working on, say, some random feature on the node side, our tests probably don't exercise it at all. It really depends on what exactly you are doing.

Okay, I think we are out of time. I'm around here until the end of the week, so feel free to grab me in the corridor or wherever, and I'm happy to chat more with you. Thank you very much.