Hello, KubeCon. My name is Wojciech Tyczyński. I work at Google, and I've been part of this amazing Kubernetes community for almost six years at this point. Among other things, I'm a TL of SIG Scalability, and in connection to that, during this presentation I'm going to talk about example features that allow us to run Kubernetes clusters with 15,000 nodes.

But let's start by making this explicit: Kubernetes clusters with 15,000 nodes are already a thing. As you might have heard, earlier this year Bayer Crop Science and Google together published an interesting blog post. You can read there how, thanks to running on GKE clusters with 15,000 nodes, with almost a quarter of a million cores and over 1.5 petabytes of RAM, Bayer Crop Science was able to process roughly 15 billion genotypes per hour as part of the pipeline responsible for deciding which seeds (which are the final product) should be advanced for further experiments in their R&D department. A couple of months later, during a joint presentation, Twitter and Google together described how we approached validating that the applications they are currently running on their on-prem Mesos/Aurora clusters can be split and run on GKE clusters with 15,000 nodes.

But the goal of this presentation is to show that this work matters for all of you, even if your clusters are an order, or maybe even orders, of magnitude smaller. However, we need to start with understanding what scalability really means for Kubernetes. Even though we tend to use node count, or cluster size, as a proxy for the overall scale, it's actually much more complex than that. Scalability is a multi-dimensional problem with dozens of dimensions, such as the number of services, the number of volumes, pod churn, and so on. Even though in many use cases they may scale together with cluster size, you need to acknowledge that node count is not the only thing that matters.

So how did we approach scaling Kubernetes to the next level, which 15,000 nodes certainly is? The core principle for any scalability or performance work is: don't optimize blindly, and always solve real-life problems. Almost every optimization makes a system a little bit more complex, so it's super important to keep that in mind. Following this rule, we first found users for whom running such massive clusters would have actual benefits. We started with understanding their use cases, but also their motivations. And as you may suspect, those are the ones that I already mentioned.

The first one was Bayer Crop Science, who is using Kubernetes to run embarrassingly parallelizable batch computations. For them, larger clusters immediately translate to having results faster. In addition to that, given that the users are generally data scientists who don't want to understand the underlying infrastructure, they want to make it as simple as possible, ideally a single cluster.

The second one was Twitter, who would like to migrate their microservice apps to Kubernetes. As I already mentioned, they are currently running them on Mesos/Aurora clusters, which can handle even 40,000 to 60,000 nodes in a single cluster. So while migrating to Kubernetes, they would like to avoid the need to suddenly manage an order of magnitude more clusters. In addition to that, they would also like to unify their setup and run applications that they are currently running differently, like some stateful apps, on Kubernetes as well.

At this point, you are probably thinking that you are wasting your time: I'm saying that we focused on just two users.
Doesn't that sound narrow? And your workloads probably aren't even similar to what they are running. However, just like landing on the moon pushed the frontiers of technology, it's the same with our scalability work. The improvements we made not only push the scalability limits; they are primarily making all clusters more reliable and more performant. So now I'm going to describe a couple of improvements to show how they make your life better.

Let's start with etcd. The main etcd improvement worth mentioning is concurrent reads. Before this change, every read operation on etcd was blocking all other operations, both reads and writes, for the entire time of its processing. Thanks to this improvement, we now block other operations only for a very short period of time, just to grab a copy-on-write pointer to the current state. While it's crucial for large clusters, it's not the only place where it helps. Imagine that you have 10,000 custom resources of your CRD, and I've seen real production clusters like that with just 10 or 20 nodes. Thanks to this improvement, when you are listing custom resources, you will no longer observe spikes in API call latencies, even if your cluster is small. And it's already available in etcd 3.4.

So let's now make a step up and talk about the API server and API machinery. As you know, watch is a crucial part of our API. But no request runs forever, so what happens when it times out, especially if it's a very selective watch that selects only a small fraction of all objects of a given type? To resume the watch, we use the resource version of the last object that we received via it. But many other changes might have happened in the meantime, and now we need to process all of them again, simply because we didn't have a way to signal to the client that they have already been processed. So we introduced the concept of a watch bookmark. It's basically a new event type that you can receive via watch that tells you: we processed everything up to resource version X. So if you didn't receive anything else, it simply means nothing matched your selector. (I'll show a small client-go sketch of this in a moment.) As you can see, there is nothing specific to 15,000-node clusters; it helps for clusters with 1,000 or even 100 nodes. And it just happens out of the box, because the kubelet, which is watching its own pods, is a perfect example of a very selective watch that benefits a lot from this improvement. It's already GA in Kubernetes 1.17.

However, improvements were needed across the whole stack, so let's take a look at an example from the networking area. For those not familiar with the Endpoints API, the Endpoints object contains information about all backends of a given service, so its size is proportional to the number of pods behind that service. That has many consequences, but let's think about kube-proxy, which is an agent running on every single node in the cluster, responsible for programming in-cluster networking. In order to do that, it's watching for changes of every single Endpoints object in the cluster. So imagine that you have a service with 5,000 pods. The corresponding Endpoints object will be, optimistically, around 1 megabyte. So in a 5,000-node cluster, and I'm not even talking about 15,000 nodes at this point, that means the API server has to send 5 gigabytes of data for a single change of that object. And for a rolling upgrade of that service, it would be 25 terabytes of data.
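Here is the promised sketch of the watch bookmark in action: a minimal client-go program opening a very selective pod watch, roughly what the kubelet does for its own pods. The node name is a hypothetical placeholder and the error handling is deliberately simplistic; treat this as an illustration under those assumptions, not production code.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A very selective watch: only pods bound to a single node.
	w, err := client.CoreV1().Pods("").Watch(context.TODO(), metav1.ListOptions{
		FieldSelector:       "spec.nodeName=node-1", // hypothetical node name
		AllowWatchBookmarks: true,                   // opt in to bookmark events
	})
	if err != nil {
		panic(err)
	}

	lastRV := ""
	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			// Nothing matching the selector changed, but the server tells us
			// everything up to this resource version has been processed, so a
			// restarted watch can resume here instead of replaying history.
			lastRV = ev.Object.(metav1.Object).GetResourceVersion()
			fmt.Println("bookmark at resource version", lastRV)
			continue
		}
		// Handle Added/Modified/Deleted events as usual.
	}
}

If the watch times out, reopening it with ResourceVersion set to lastRV avoids reprocessing all the unrelated changes that happened in the meantime.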
To mitigate that, we introduced the concept of EndpointSlice, which allows us to shard the information about the endpoints of a given service into multiple EndpointSlice objects. Thanks to that, we can significantly reduce the load on the control plane and the amount of data that the API server has to serialize and send. The pieces of that solution graduated to beta in Kubernetes 1.19. (There is a short sketch of the consumer side below, right after the summary.)

So let's take a look at one more example, this time from the storage area, and talk about secrets and config maps. As you probably know, whenever any of those changes, kubelets reflect those changes in all pods that are mounting them. But in order to do that, they open a watch for every single secret and config map that the pods are using. That may translate to tens or even hundreds of thousands of additional watches in the system. Optimizing them would bring a lot of complexity to the API machinery, and as I mentioned before, we should always be solving real-life problems. So we started talking to users, and it turned out that they don't mutate the majority of their secrets and config maps at all. So we introduced the concept of immutability to the Secret and ConfigMap APIs. When explicitly marked as immutable by the user, their contents cannot be changed, but kubelets also don't need to watch them, which vastly reduces the load on the control plane. (This one is also sketched below.) As in the previous examples, there is nothing specific to large clusters, and you can take advantage of it even if your cluster is small. It's already beta in Kubernetes 1.19.

However, we were also looking outside of core Kubernetes. We worked closely with the Go community on optimizing its memory allocator. It may be surprising to many of you, but lock contention at the level of the Go memory allocator is actually one of the bottlenecks that we really suffer from. While some optimizations have already landed in newer versions of Go, even more are coming. And this benefits not just Kubernetes; it benefits everyone who is writing their applications in Go.

I described a couple of improvements, and all of them, as well as tens or maybe even hundreds of others, were done in upstream Kubernetes. However, that doesn't immediately mean that every Kubernetes distribution will scale to 15,000-node clusters. For that to work, your whole ecosystem has to work at that scale too. That includes the underlying infrastructure, both compute and networking: you need 15,000 VMs or machines, and networking between them. It includes autoscaling, logging and monitoring, control plane upgrades, and many other things. Based on GKE experience, I can say that it's a huge effort to make all of them work. So Kubernetes improvements are necessary, but they don't solve all the problems for you.

There is one more question that we should answer here, which is: how do we know when we can stop? Fortunately, the answer to this question is fairly simple: as soon as we meet our SLOs, Service Level Objectives. You can think about them as system-level metrics with thresholds. While the concepts they cover in Kubernetes are still fairly basic, like API call latencies or pod startup latency, they correlate well with user experience.

So to summarize: scalability work matters for almost everyone, because scalability is much more than just cluster size. The improvements we did to push the scalability limits are also, or maybe even primarily, making smaller clusters more reliable and more performant.
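As promised, here are the two sketches. First, EndpointSlices: instead of watching one potentially huge Endpoints object, a consumer can list or watch the slices linked to a service via a well-known label. This is a minimal sketch using the discovery.k8s.io/v1 client as it looks in current client-go; the namespace and service name are hypothetical.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// All slices for one service are linked to it via a well-known label,
	// so a consumer only fetches the shards it cares about.
	slices, err := client.DiscoveryV1().EndpointSlices("default").List(
		context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=my-service"},
	)
	if err != nil {
		panic(err)
	}

	total := 0
	for _, s := range slices.Items {
		total += len(s.Endpoints)
	}
	// A change to one backend now re-sends only the slice that contains it
	// (by default up to 100 endpoints), not one giant multi-megabyte object.
	fmt.Printf("%d slices, %d endpoints\n", len(slices.Items), total)
}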
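And second, the immutability marker. A minimal sketch of creating an immutable ConfigMap with client-go; the name and data are made up for illustration.

package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	immutable := true
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "app-config"}, // hypothetical name
		Data:       map[string]string{"mode": "production"},
		// Once created with Immutable set, the data can never be updated,
		// so kubelets don't need to keep a watch open for it.
		Immutable: &immutable,
	}
	if _, err := client.CoreV1().ConfigMaps("default").Create(
		context.TODO(), cm, metav1.CreateOptions{},
	); err != nil {
		panic(err)
	}
}

Changing such a config map later means deleting it and creating a new one, typically under a new name, and rolling the workloads that mount it.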
So if your cluster didn't work because of some scalability or performance issues in the past, with Kubernetes 1.18 and higher versions it's probably time to reevaluate it. Unfortunately, we don't have more time, so I will just mention that with upcoming releases you can expect even more improvements, extending the portfolio of use cases supported by clusters with 15,000 nodes. And with that, thank you very much for staying with me.