Thank you for coming to listen to my presentation. So let's start.

In this talk, I'd like to give a brief introduction to how we encountered these issues. Upbound is one of the main maintainers of the Crossplane project, and I will talk a little bit about Crossplane and how we hit these issues. Then I'd like to briefly cover the client-side throttling issues we hit, as well as the server-side issues we identified and how we identified them: specifically, the high resource utilization we observed in the API server, how we profiled the API server, and what our findings were. I'd also like to briefly show how you can profile the API server yourself. It's really straightforward, and it helped us a lot in this respect. Finally, I'd like to talk about the Kubernetes scalability dimensions and where CRD scaling fits among them, and that will conclude the talk.

So, Upbound is one of the core maintainers of the Crossplane project. In Crossplane, we work with, and sometimes generate, custom resource definitions that correspond to actual cloud provider resources. As you would imagine, via these custom resources you can manage your infrastructure. There are hundreds of different resource types in each cloud provider, and Crossplane is a multi-platform project: whether your resources are in AWS, Azure, or GCP, there is a Crossplane provider to manage them. This also means that in a single Kubernetes cluster, you may want to install multiple Crossplane providers.

Recently, we introduced Terrajet, a code generation framework that generates Crossplane providers on top of Terraform providers. With it we have generated providers such as Jet AWS, Jet Azure, and Jet GCP, and on top of these you can have hundreds of custom resources. This is where we first observed our issues: in provider-jet-aws alone we have over 700 custom resource definitions, and the Azure and GCP providers again have hundreds each. If you install one or more of them in a single Kubernetes cluster, this means you will have thousands of CRDs installed in that cluster. Summed up, we have about 2,000 custom resource definitions. This is how we entered this domain, and this is where we started experiencing issues.

We can categorize the issues into two broad groups: client-side issues and server-side issues. The first issue we observed was very high CPU utilization, which led us to profile the API server to understand what was going on. The upstream Crossplane community was already aware of high memory utilization, and I will talk about that later in the slides. I will also briefly mention our experience with managed Kubernetes services.

Let's start with the client-side throttling issues we ran into. I think most of you, if you have tried to install hundreds of custom resource definitions, might have seen these error messages. When you run a kubectl command on a cold cache (I will explain what I mean by a cold cache later, so please bear with me), you can sometimes observe warning messages saying that the request was throttled.
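For example, here is an illustrative sketch of reproducing the warning; the exact timings, process IDs, and request URLs are made up and will differ on your cluster:

```sh
# Clear kubectl's discovery cache so the next invocation starts "cold":
rm -rf ~/.kube/cache/discovery/

# On a cluster with many CRDs, a simple command then emits warnings like:
kubectl get nodes
# I0601 10:42:17.123456   12345 request.go:601] Waited for 1.140352693s due to
# client-side throttling, not priority and fairness, request:
# GET:https://127.0.0.1:6443/apis/ec2.aws.jet.crossplane.io/v1alpha2?timeout=32s
```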
You will see that the throttling is caused by the client side. It is not related to the API server's priority and fairness flow control mechanisms; it happens purely in the client. If you run the same command again within a short period of time, you will most probably not observe it. It depends on the kubectl command you run, but for example `kubectl get nodes`, run shortly after the initial command, will just return the nodes.

To explain this, we need to talk a little bit about the discovery client. The discovery client is responsible for discovering the available APIs in the API server. It discovers the API group list first, that is, which group versions are available at the API server. This includes, as you would imagine, the resources shipped with core Kubernetes, but also the custom resource definitions. Then, for each discovered group version, the discovery client has to discover the resources, the actual kinds, under that group version.

So we have the discovery client running, for example, as part of kubectl or any other API server client, and on the right we have the API server. The first request goes to the /api endpoint, and we get an API group list response listing the group versions available at that legacy endpoint. We also need to hit the /apis endpoint, and similarly we get an API group list giving us the available group versions there. Then the discovery client needs to discover the kinds available under these group versions. This is done in parallel: from the discovery client you see parallel requests, one for each group version that needs to be discovered; the first one discovers, for example, batch/v1, the second one the next group version, and so on. The API server is expected to return the APIResourceList documents available at those endpoints.
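If you want to see these discovery requests yourself, here is a minimal sketch using plain kubectl; batch/v1 is just an example, so substitute any group version your cluster serves:

```sh
# The two top-level discovery requests:
kubectl get --raw /api   # legacy endpoint, lists the core group versions
kubectl get --raw /apis  # lists all other available group versions

# Then one request per group version to fetch its APIResourceList,
# i.e. the kinds available under that group version:
kubectl get --raw /apis/batch/v1
```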
What has been implemented is that the discovery client throttles itself, without any feedback from the API server. This is currently done via a token bucket rate limiter, and I'd like to briefly explain how that works to give some technical detail.

Let's assume we have a number of requests to make to the API server, with the token bucket rate limiter acting as a throttler. You can imagine it as a bucket initially filled to some capacity, where the unit of capacity is represented by tokens; let's say our tiny bucket has an initial capacity of three tokens. There is another parameter, which we call the fill rate. As requests are admitted, each request takes a token from the bucket, so the bucket has to be refilled, and the rate parameter R denotes the rate at which this happens.

Now assume we have four requests, each represented as a cube in the diagram, with time flowing along the vertical axis, meaning the requests arrive in parallel. Initially there are three tokens in the bucket, so three of the requests are admitted, that is, sent to the API server. The remaining request, because there is no token left in the bucket, is not admitted, and it is exponentially backed off to be retried later. As time passes, the bucket is refilled at the fill rate and a token becomes available, so when the timer for the blocked request fires, the request is admitted this time. The warning message you saw earlier denotes how long a request was delayed because of this rate limiting.

The discovery client uses a token bucket rate limiter with an initial burst of 300 tokens and a rate limit of 50.0 queries per second; these are the parameters in Kubernetes 1.24. When we were tackling these issues, we realized that in Kubernetes 1.23 the burst parameter was 100 queries and the rate was just 5.0 queries per second. This was due to a bug: the intended parameters were burst = 300 and rate = 50.0 QPS, but a configuration bug made the effective parameters 100 and 5.0, which were quite low compared to the number of CRDs we were trying to install, and we saw lots of throttling because of this.

I'd also like to briefly mention the cached discovery client. When I showed the client-side throttling warning messages, I mentioned running on a cold cache, and here we will see what that cache means. kubectl uses a disk-cached discovery client which, as you would imagine, maintains a cache of the API group list and APIResourceList responses that the discovery client gets from the API server. This cache is maintained under the user's home directory by default, and it is specific to the API server host and port: if you are running an API server on localhost, you will see its IP address as the directory name, and if your kubeconfig declares an API server with a hostname, the cache will live under that hostname, underscore, the port number. At the root level of the cache directory there is a servergroups.json file, which holds the group versions available at the API server. For each API group you will see the specific versions available under it, and under each group version folder there is a serverresources.json file containing the available resources under that group version.
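On disk, the cache looks roughly like this. The listing is illustrative, assuming a local API server at 127.0.0.1:6443; the directory name and group versions will match your own cluster:

```sh
# Illustrative layout of kubectl's discovery cache:
find ~/.kube/cache/discovery/127.0.0.1_6443 -name '*.json'
# .../discovery/127.0.0.1_6443/servergroups.json
# .../discovery/127.0.0.1_6443/batch/v1/serverresources.json
# .../discovery/127.0.0.1_6443/apps/v1/serverresources.json
# .../discovery/127.0.0.1_6443/ec2.aws.jet.crossplane.io/v1alpha2/serverresources.json
```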
Running the kubectl examples I showed on a cold cache means running those commands against an empty cache. In that example, I had just run an rm command to clean the cache, so that the cached discovery client would not serve responses from disk. This cache also has a time-to-live parameter, as you would imagine. Up to Kubernetes 1.23 it was just 10 minutes, meaning that every 10 minutes, even if the cache had already been populated, the discovery client would need to talk to the API server again to fetch the latest versions of these documents. With Kubernetes 1.24, the default time-to-live for the cached discovery client has been increased to six hours.

To summarize the throttling issues: the discovery client needs to make at least 2 + (group version count) requests to discover the available APIs. The initial two come from the /api and /apis endpoints, and then for each group version we request the available resources under it. Also remember that if you are running, say, a `kubectl get nodes` command, the actual request is a separate one that comes after the discovery phase. So if we install all three Terrajet-based Crossplane providers, we will have some 370 group versions, meaning upwards of 372 discovery requests hitting the API server. The important parameter here is not the CRD count but the group version count.

This table shows the results of running the discovery client with different parameters. You cannot tune these with stock kubectl, but we have a custom version of kubectl through which you can explore them. As you would expect, since we are talking about 370 group versions and the default burst parameter is just 300 tokens, kubectl is throttled when running on a cold cache, but if you increase the initial burst parameter to 400, it is no longer throttled. With this tool, if you pass the rate parameter as -1, the client-side throttler is completely disabled; yet we see no improvement beyond a burst of 400, because 400 already exceeds the roughly 372 requests we need to make.

Now let's discuss the high API server CPU and memory utilization issues we encountered. Initially, we were installing about 700 custom resource definitions on a single cluster, and as you can see here (apologies if the chart is not readable), the API server immediately started consuming about two CPU cores. These numbers are collected from Prometheus metrics, as you would imagine. We also saw high memory utilization; although we did not do memory heap profiling ourselves, we initially concentrated on CPU, and I will first present the results of our CPU profiling.

The immediate observation from CPU profiling was that over 40% of the sampled CPU time was spent in aggregated OpenAPI v2 spec serialization; I will briefly explain what that means. When you examine the profiling data, the API server spends a lot of time in JSON marshaling and unmarshaling operations, and also in protobuf serialization. Another interesting point: because lots of objects are created and destroyed during this aggregation, we had very high heap churn, so the garbage collector also accounted for a large percentage of the sampled CPU time.

Let me explain what this OpenAPI v2 spec marshaling means. With v1 of the apiextensions API, each CRD must specify a validation schema expressed as an OpenAPI v3 structural schema. Using this schema, custom resources belonging to the custom resource definition are validated during creation and updates, as you would imagine, and unknown fields that are not declared in the schema can be pruned by the API server. The schema is also used for client-side validation: when you run a kubectl command without specifying --validate=false, kubectl validates the CR manifest, and if the manifest does not conform to the schema, kubectl rejects it.

The API server serves an aggregated OpenAPI v2 spec at the /openapi/v2 endpoint, and this spec documents the complete API available at the API server. By default it is served serialized as JSON: when you hit this endpoint without specifying a different serialization format, such as protobuf, via an Accept header, the API server returns JSON. But protobuf serialization has also been implemented in the API server, especially for in-cluster communication.
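You can inspect this aggregated document yourself. Here is a minimal sketch; the protobuf request assumes kubectl proxy is running on its default port 8001:

```sh
# Fetch the aggregated OpenAPI v2 spec as JSON (the default serialization):
kubectl get --raw /openapi/v2 > openapi-v2.json

# With thousands of CRDs this document can grow to tens of megabytes; its
# size is a rough proxy for the (de)serialization work the API server does:
ls -lh openapi-v2.json

# With `kubectl proxy` running, request the protobuf serialization instead:
curl -H 'Accept: application/com.github.proto-openapi.spec.v2@v1.0+protobuf' \
     -o openapi-v2.pb http://127.0.0.1:8001/openapi/v2
```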
Here you see an example aggregated spec, and under paths you see, for example, the OpenAPI schema of a custom resource definition. This also means that when you register a custom resource definition, the API server has to recompute the aggregated spec, so that the custom resource definition itself becomes part of the API available at the API server: the aggregated OpenAPI v2 spec served at the /openapi/v2 endpoint has to be updated with the new CRD. If you are registering a bunch of CRDs, the API server will keep updating the aggregated spec in the background. The profiling data showed that this is where the time was spent, so this was the root cause of the high CPU utilization we had observed.

We got in touch with the upstream Kubernetes community and, luckily, learned that they were already aware of high memory utilization issues, also related to OpenAPI v2 spec processing, and that a fix was on its way. I will briefly describe the fix. Instead of recomputing the aggregated OpenAPI v2 spec each time a CRD is registered, the idea is to defer the computation until a request actually hits the /openapi/v2 endpoint. The computation is lazy: if you register a bunch of CRDs, the spec is not immediately recomputed, and whenever a request arrives at /openapi/v2, the API server computes the aggregated spec once. This saves a lot of CPU cycles.
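A rough way to observe this deferred computation, assuming a server that contains the fix, is to time two consecutive requests to the endpoint right after registering a batch of CRDs. This is an illustration rather than a benchmark:

```sh
# The first request after registering many CRDs pays the aggregation cost:
time kubectl get --raw /openapi/v2 > /dev/null

# An immediate second request should return noticeably faster, since the
# already-computed spec can be reused until the next CRD change:
time kubectl get --raw /openapi/v2 > /dev/null
```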
We wanted to see how this affected our profiling data and whether we could quantify the improvement. Again, apologies if it is not that readable, but here you see the CPU utilization when running with this fix. Compare it with the previous chart: depending on the number of custom resource definitions installed with the provider, the API server was previously spending about two cores for a prolonged period, around 30 minutes. With lazy marshaling of the OpenAPI v2 spec, things are in a much better situation: the API server handles the computation of the aggregated OpenAPI v2 spec over a shorter period and with a smaller peak.

We were hoping that this fix, which is already available in 1.23, would resolve our issues, but we could not validate this in the managed Kubernetes offerings, because we could not run a modified kube-apiserver there. After the fix became available, we revisited the situation in EKS, AKS, and GKE regional clusters, and this table shows our current situation. In EKS, whose version, by the way, does not yet contain the lazy-marshaling fix, we could install all three providers successfully, though you will see client-side throttling warnings and possibly even failures, because requests may time out due to client-side throttling; but whatever the case, from a Crossplane perspective, we could install all three providers. In AKS clusters, we were fine with provider-jet-aws and provider-jet-azure, but it was not possible to install a third provider. And in GKE regional clusters, we had problems installing even a single provider. So these issues are not resolved, and one of the purposes of this talk, as the Crossplane community, is to share our findings and to encourage the upstream community, and other solution providers that depend on large numbers of custom resource definitions, to work with upstream to resolve these issues.

Here is how we profiled the API server. We used custom builds of kube-apiserver, as you see: builds including the fix, builds not including it, and builds to test some other ideas. The profiling data was collected on kind clusters. We first used `kind build node-image`, running it against the Kubernetes source tree containing the kube-apiserver to be built. Using this node image, you can then create a kind cluster. After running `kubectl proxy`, you can hit the /debug/pprof/profile endpoint to collect profiling data; in this example, we collect CPU profiling data for 300 seconds. Then, using the excellent pprof tool, you can examine the profiling data. It requires some experience to interpret, but in our case we were able to accurately pinpoint OpenAPI v2 spec marshaling as the root cause of the high CPU utilization.
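Concretely, the workflow looked roughly like this. This is a sketch, assuming kind's default kindest/node:latest tag for locally built images and kubectl proxy's default port 8001:

```sh
# Build a kind node image from a Kubernetes source tree containing the
# kube-apiserver under test (run from the kubernetes repository checkout):
kind build node-image

# Create a cluster from that locally built image:
kind create cluster --image kindest/node:latest

# Expose the API server locally, then collect a 300-second CPU profile:
kubectl proxy &
curl -o cpu.pprof 'http://127.0.0.1:8001/debug/pprof/profile?seconds=300'

# Examine the collected profile interactively:
go tool pprof cpu.pprof
```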
Kubernetes already has a scalability thresholds document, maintained, I believe, by SIG Scalability, but it does not consider the number of CRDs per cluster as a scalability dimension. The only guidance we could find on limiting the number of custom resource definitions per cluster was in the CRD general availability document (I have put references to both documents here), and it suggested 500 as a maximum for the scalability target. But as I mentioned, this is not part of the official scalability thresholds document. We have also had some chats with the upstream maintainers, and they also believe this has to become part of the scalability documents, but it needs some time, and they need to be made aware of use cases similar to ours.

So I'm concluding my talk. Thank you very much for listening. Here you can find some further pointers; especially in the top two, from the Crossplane project, there are really detailed discussions and technical analyses that you might be interested in if you want to dig deeper into this topic. The one-pager also lists some good tools you can use to generate custom resource definitions for your tests and for testing possible solutions. Thank you very much for listening. Are there any questions?

[Host] Any questions? Hands? Not all at once. It's been a pretty long day, hasn't it? Lots and lots of action. I guess, is there anything else you'd like to add regarding CRDs: your experience with them, and what we might expect from them in the future?

[Speaker] Yeah, as the Crossplane community, we really depend on them, so we would like to drive some effort here with the upstream Kubernetes community to make the Kubernetes control plane more scalable. But this has to be a community effort, and people from other open source communities or other vendors joining these efforts would be really appreciated.

[Host] Will you be here for the rest of the week?

[Speaker] Yeah, if folks want to get in touch, I'm pretty easy to find.

[Host] Let's give another round of applause for Albert. Excellent talk.

[Speaker] Thank you very much. Thank you.