Hello everyone, and welcome. Today we are going to talk about how we scaled Cluster API, and more specifically about how we scaled it up to 2,000 clusters. I'm Fabrizio Pandini, I'm a tech lead and a Cluster API maintainer. And I'm Stefan Büringer, also a Cluster API maintainer. Before we start: everything we are talking about today is based on Cluster API 1.5.0, so it means more or less this summer. And what is interesting about this work is that we are talking about a series of concepts, experiences, and lessons learned that not only apply to Cluster API or Cluster API providers, but can be useful for everyone developing controllers on Kubernetes. The presentation today is divided into two parts. The first part will be a very brief introduction to the tools that you need to do performance or scalability optimization. The second part will be a deep dive into how we managed to scale Cluster API up to 2,000 clusters.

OK, so let's get started. The first step is that you have to get the right tools for the job. We're going relatively quickly over some of those so that we can focus on the more important parts later on. The first things you need are metrics, profiling, tracing, and logs, ideally in the priority order you see on the right side. The first thing is definitely metrics, because otherwise you can't measure anything. But what you also need is automation, and what we mean by automation here is essentially that you will have to run some sort of scale test, and ideally you don't do that manually. So you need automation for your scale tests. And if possible, you should also use mocks. In Cluster API we usually create clusters and machines in some real cloud, and it's a lot better if you just use a mock, because then you don't actually have to pay for the infrastructure, et cetera.

So why do we need those tools? The first four, metrics, profiling, tracing, and logs, we mostly need to analyze performance and to investigate bottlenecks. The first two, metrics and profiling, are more for getting a rough overview across a lot of reconciles, just to get data like the average reconcile duration, et cetera, while we use tracing and logs more to dig into specific slow reconciles and figure out where we're losing the time. Then automation: the important part is that we run our scale tests automated, so that we can observe the performance, optimize it, run the test again, and see how much we improved. And of course, once we did our optimizations, we just run them periodically so that we don't run into any regressions once we implement new features and other stuff. And mocks, I already mentioned them before, but the most important point is to increase speed: you don't actually have to wait for real infrastructure to come up, you can just decide in your mock how long it takes to create a machine or whatever you want to do. And of course they reduce cost, because we're talking about a really huge number of clusters and machines, and you don't actually want to pay for that infrastructure. Thanks.

Okay, let's take a closer look at metrics. We basically made our own categories here: we separated them into user-facing metrics and internal, or system, metrics.
The idea is basically that user-facing metrics are metrics describing what a user cares about: things like how long it takes to create a machine, to delete a machine, to create a cluster. We use them to define goals and to measure our success. The really important part is that these are the metrics describing what a user cares about, and those are the ones we're trying to optimize. Then we have internal metrics, and we use internal metrics to take a deeper look into the system and try to figure out why the user-facing metrics look the way they do. As an example, we're looking at the average reconcile durations of our controllers, the work queue length, memory usage, CPU usage, the number of goroutines, that sort of stuff. So basically we use them to understand a little bit better how the system works and to pinpoint what we have to optimize.

OK, so how can you get all of this? The good news is that you get almost all of it for free, at least with Cluster API, because we already did a lot of that work: we have a lot of stuff upstream in the repository and you can mostly just use it. For user-facing metrics, the idea in our case is that we essentially infer those metrics from CRDs. We have our Cluster, our Machine, et cetera, our custom resources, and we just give kube-state-metrics a config so that it can infer certain metrics from those CRDs. Basically our Cluster object already contains the information about how long it took to create it, and that sort of stuff, and we just infer the metrics from there. For internal metrics we're using the metrics server included in controller-runtime, and we're getting the controller-runtime metrics, client-go metrics, and Go metrics essentially out of the box. For all of those we have a configuration upstream and you can just use it, and one bonus on top is that we also already have dashboards. For example, we have a controller-runtime dashboard which should work with every controller-runtime-based controller, and you don't have to do anything: just deploy our stuff, scrape the metrics, and that's it.

Then for profiling we're also using the server included in controller-runtime, and we have Parca running; Parca regularly scrapes the profiles from our controllers' pprof endpoints and stores them, so that we can take a look at how our profiles evolve over time. Then for tracing you actually have to do some work: you have to instrument your controller. But you can focus on the slow reconcilers first and just add more instrumentation over time.
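As a rough sketch of what the profiling plumbing can look like (this is an illustration, not necessarily the exact Cluster API setup): recent controller-runtime versions let the manager expose a pprof endpoint, which an agent like Parca can then scrape on a schedule. The port here is made up.

```go
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	// Expose the Go pprof endpoints on the manager so that a profiling
	// agent (Parca in the setup described above) can scrape CPU and
	// memory profiles periodically.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		PprofBindAddress: ":8082", // illustrative port
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```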
Then for logs: you hopefully already have logs, and if you see during your investigation that you need more, add more, but in general what you already have is probably good enough. We're using Promtail to ship the logs to Loki, and then we essentially look at the logs via Grafana. The nice thing is that because we're showing logs and traces via Grafana, we can basically cross-correlate them: we can see which traces match which logs and investigate why a reconcile was slow.

Then automation: similar as above, you probably already have some sort of end-to-end test, and you can basically extend it into a scale test by creating a lot of clusters, in our case a lot of clusters at the same time, and then you have your scale test. For mocks, that's a bit more work. What we did is implement an entire fake infrastructure provider, which creates clusters and machines just in memory and simulates an entire workload cluster.

So now we can enter the core of the presentation and talk about how we scaled Cluster API up to 2,000 clusters. The first step, which is more important than people usually think, is to define a goal. We used the metric cluster provisioning time, and we defined our goal this way: the cluster provisioning time must remain almost constant from the first cluster to the last cluster. We want this time to stay constant while scaling up. And, guess what, we did it. At the end, between the average provisioning time of the first 100 machines and the last 100 machines there is a negligible increase of 2%, and as you can see, during the entire test the provisioning time was almost constant. Remember that we were using the in-memory provider, so cluster provisioning takes just above one minute; we were going very, very fast, really being aggressive. This is great, we managed to do it, but I think the most interesting part, and this is what we want to talk about in the next slides, is how we managed to keep Cluster API performant and the system responsive, both while scaling up and after scaling up, while running at scale. In order to do so, it is important to have a common understanding of how a controller works, and that is what we are talking about in the next slide.

So, starting with controller basics, let's first take a look at how a controller actually works; this is really simplified, just to be clear. Basically we have events coming from the API server: objects are getting created, updated, or deleted, and they are enqueued in a queue. We then take objects from the queue and reconcile them. We can have multiple workers, which is what we call concurrency here, and that's the basic model. One more thing: the main performance characteristic of a controller is essentially the latency between an object getting created, updated, or deleted and the moment it is successfully reconciled. We set up a formula for this, which is basically the number of objects in the queue, divided by the concurrency, multiplied by the reconcile duration. A very simple example: let's say we have 2,000 clusters and 10 workers, which means every worker has to reconcile 200 clusters, and if every one of those reconciles takes 3 seconds, then we need 10 minutes just to reconcile through the 2,000 clusters.
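Written out, the back-of-the-envelope formula from this part of the talk (a rough estimate, not an exact queueing model) is:

```latex
\text{latency} \;\approx\; \frac{\text{objects in queue}}{\text{concurrency}} \times \text{avg. reconcile duration},
\qquad \text{e.g.}\quad \frac{2000}{10} \times 3\,\mathrm{s} = 600\,\mathrm{s} = 10\,\mathrm{min}
```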
You might ask yourself: is it a realistic scenario that we have all of them in the queue at the same time? Yes, unfortunately, because controller-runtime has something called a periodic resync, which essentially means that when this resync happens, every single existing object, in our case every single cluster, is enqueued again. On top of that we also sometimes have the case that our reconcile code decides at the end: oh, this cluster is not actually finished, we have to reconcile it again. So the important thing is that whenever a periodic resync happens, we have to reconcile all the objects in the queue as fast as we can, so that the queue is empty again. Jumping back for a moment: if you look at this diagram, you can see that the peaks are the periodic resyncs, and within 15 or 30 seconds we are getting the queue back down to zero. You can imagine that if you create a new cluster at the peak of such a spike, you basically depend on how fast this queue goes back down to zero until your cluster is actually reconciled, and if that takes 10 or 20 minutes you won't get anywhere, because of course one single reconcile is not enough to create an entire cluster.

OK, so now the question is: what can we do to improve the situation? It's sort of obvious: those three variables go into the formula, so we have three options. We can increase the concurrency of our workers, we can reduce the reconcile duration, and we can reduce the number of objects in the queue. That's what a large part of this talk is about.

OK, so let's look at how we worked on option one, which is to increase concurrency. As Stefan just explained, a controller has many workers running in parallel, and the number of workers is called the concurrency. As you can imagine, having only one worker is not good for performance, because your queue gets emptied in a sequential manner. So why not simply increase the number of workers? Well, it turns out that this is not a silver bullet, because the more workers you have running in parallel, the more queries are running, and the more you are hitting the API server of your cluster. The risk, if you have too many workers, is that your queue gets empty fast, but another component of the cluster starts suffering. So you have to find a good balance. The default number of workers in Cluster API is 10, and we learned that 10 is quite a good value that covers most of the cases: with 2,000 clusters it worked very well, with reconcile loops taking around 250 milliseconds, the queue was empty most of the time, and when there was a resync, things went back down in 10 to 15 seconds. So this is a good value. We increased this value for only one of our reconcilers, which is the KubeadmControlPlane (KCP). KubeadmControlPlane is a very complex reconciler that connects to the workload cluster, so it takes a little bit longer than 200 to 250 milliseconds. We added more workers, and we really did a lot of work on the KCP controller to make sure that even if we are running with 50 KCP reconcilers in parallel, we are not creating problems for the API server; we will explain later how we did this. So option one is not a silver bullet.
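For reference, here is a minimal sketch of how concurrency is set per controller with controller-runtime; the reconciler wiring is illustrative, but MaxConcurrentReconciles is the knob being discussed.

```go
package controllers

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// SetupWithConcurrency wires a reconciler for Cluster objects with a
// configurable number of parallel workers.
func SetupWithConcurrency(mgr ctrl.Manager, r reconcile.Reconciler, workers int) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Cluster{}).
		// MaxConcurrentReconciles is the "concurrency" from the formula
		// above; Cluster API defaults to 10, and only KCP got more workers.
		WithOptions(controller.Options{MaxConcurrentReconciles: workers}).
		Complete(r)
}
```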
Then we have to move to option two. Option two is basically to reduce the reconcile duration, and as you can imagine, this is a little bit more complex than simply increasing the number of controllers running in parallel. The good news is that we learned that if you follow the leads that your metrics are giving you, the leads that profiling and tracing are giving you, you can improve performance by doing very small, surgical changes. And this is really effective when you combine it with mocks and automation, because you measure, you find a bottleneck, you address it, and you repeat. The next slides explain how we did this in Cluster API.

The first step was to remove noise. What was happening when we started scaling Cluster API above 300 or 400 clusters was that we were getting a lot of noise: the performance of our controllers was not deterministic. We dug into what was going on, and we traced the source of this lack of determinism to the client-side rate limiting of client-go. What is client-side rate limiting? Every client-go client, like our controllers, has a mechanism called client-side rate limiting. This mechanism is a safeguard that protects the API server from clients that are too aggressive; it basically ensures the stability of the entire system, so it is a good safeguard to have. In Cluster API the default rate limit was 20 queries per second with a burst of 30 queries, which means that for a small window of time you can run somewhat more queries, but if the number of queries keeps growing, the rate limiting starts slowing down your queries. And this impacts your reconcilers: a reconciler randomly gets picked to be slowed down, so your system is noisy and you don't understand where the issues are. We played a little bit with those numbers, and we found out that for working with 2,000 clusters it is enough to increase queries per second to 100 and the burst to 200, which is OK in a Cluster API management cluster, where Cluster API itself is the main process running in the cluster.

OK, good. As soon as we got rid of the noise, the next step was to find the first bottleneck, the first reconciler we had to work on. This was pretty easy, because without noise you simply look at the reconcile duration metrics that you get from controller-runtime and you pick the slowest one; in this case it is the yellow one, which is clearly not performing like the others. The next slides show how, once we know which reconciler we have to improve, we can actually improve it. The first pattern of problems we found is that when looking at profiles, we were seeing slow operations repeated many times. If you think about it, this is quite common: as software engineers, we like to develop utilities and then reuse them all around the code. That means that if you have a utility that does not perform well, you start seeing patterns like the one shown in this graph, where a time-consuming operation is repeated four times. In this case it is again the KCP controller, which is connecting to etcd: first to list the members of the cluster, and then to connect to the three members, so four calls to etcd. And what was happening, which is kind of hard to read here, is that for each connection we were creating a private key, which is a time-consuming operation. How did we solve this? Simply by creating one key and reusing it for many calls. After this simple optimization, we found similar problems when creating Kubernetes clients and with other kinds of expensive operations, but with some simple caching it was possible to reduce the reconcile time of this controller by 75%.
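As an illustration of the key-reuse fix, here is a minimal sketch (the real Cluster API code differs, and whether a process-lifetime key is acceptable is a security decision for your own controller):

```go
package etcdclient

import (
	"crypto/rand"
	"crypto/rsa"
	"sync"
)

var (
	keyOnce sync.Once
	key     *rsa.PrivateKey
	keyErr  error
)

// clientKey generates the private key once and then reuses it, so the four
// etcd connections per reconcile no longer pay the generation cost each time.
func clientKey() (*rsa.PrivateKey, error) {
	keyOnce.Do(func() {
		key, keyErr = rsa.GenerateKey(rand.Reader, 2048)
	})
	return key, keyErr
}
```

And for completeness, the client-side rate-limit bump mentioned a moment ago comes down to two fields on the client-go rest.Config; the values are the ones from the talk:

```go
package config

import (
	"k8s.io/client-go/rest"
	ctrl "sigs.k8s.io/controller-runtime"
)

func restConfig() *rest.Config {
	cfg := ctrl.GetConfigOrDie()
	// Cluster API's defaults were 20 QPS with a burst of 30; they were raised
	// for the 2,000-cluster test. Only do this if your controller is the main
	// client of the management cluster, as described above.
	cfg.QPS = 100
	cfg.Burst = 200
	return cfg
}
```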
The next thing we found were API calls. Obviously, controllers use clients to talk to the API server to read and write data, and the more complex the controller gets, the more calls we get, basically automatically, and that has a huge impact on reconcile durations. One reason is network latency, but also the API server simply needs time to answer our requests. If you look at this picture, which is a trace from the KCP controller, and only a subset of what can happen: the bar at the top of the trace is an entire reconcile, and basically every bar below it that is a little bit wider is an API call. Looking at this, you can say that 80 to 90 percent of the reconcile duration of this KCP controller is just doing API calls against the API server. So the question is essentially: how can we improve this, and ideally also take load off the API server while doing so?

First we have to take a step back and look at how the client usually works. If you take a regular client-go client, or if you create a controller-runtime client by yourself, it does the same thing: if you call this client to, in this case, read or write a cluster, the call goes directly to the API server, which is obviously not great. But what controller-runtime does, if you just take the default client that controller-runtime gives you, is use a cache for read calls. The write calls still go directly to the API server, but all the read calls just hit a local cache and return whatever is in that cache. To fill that cache, the controller-runtime cache runs informers and reflectors, which are just listing and watching objects and then storing them in a local store.

In Cluster API we basically use this default controller-runtime client almost everywhere, which means theoretically almost all our read calls should already be cached. But it's not that easy. First of all, as I already mentioned, write calls are never cached, so the only thing we can do about write calls is to make sure that within a single reconcile we don't write the same object five times when that's not really necessary. And for read calls there is one caveat: if you use the regular typed objects, like the Cluster type, they are cached by default, but if you use unstructured, they are not cached by default. You can fix this by configuring controller-runtime accordingly, but the default behavior is that they are not cached. And the tricky thing about Cluster API, at least core Cluster API, is that for most of the objects we're using, we don't know the concrete type, whether it's an Azure machine or a vSphere machine, so we use unstructured everywhere, and that was a major, major problem for performance. So the biggest improvement we made was just making sure that we are using caches for read calls everywhere.

Of course there's a tradeoff, and the tradeoff is that your memory usage goes up. But it still seems reasonable to us: we had something between 2 and 4 gigabytes of memory usage for a core controller, which seems fine. Of course, it depends a little bit on how big your clusters are and on the exact topology, but in general it seems like a worthy tradeoff. And then, because we're using a cache, the standard problem is that the objects in the cache can be stale. We took a closer look at all the places where we changed this, and I would say in general it's fine, because what usually happens is that when your cache is stale, at some point you will get an update event for the stale object, and then you will get another reconcile which eventually reconciles with the up-to-date state of your object. So that wasn't really a problem for us, and we just use the cache everywhere.
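With controller-runtime v0.15 and newer, opting unstructured reads into the cache is a client option; here is a minimal sketch, assuming a recent controller-runtime version (the exact mechanics in Cluster API differ, and older releases expose this differently):

```go
package setup

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newManager() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Client: client.Options{
			Cache: &client.CacheOptions{
				// Typed objects are cached by default; this also routes
				// unstructured reads through the informer cache instead
				// of hitting the API server on every Get/List.
				Unstructured: true,
			},
		},
	})
}
```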
OK, so we talked about increasing concurrency, and we talked about how to improve the reconcile duration. The third option is to reduce the number of objects in the queue, and as you remember from the previous slides, we don't have full control over this one. We certainly cannot control how many clusters the user is changing, upgrading, creating, or deleting. We also cannot control the controller-runtime resync; we can make the resync less frequent, but sooner or later a resync has to happen. So the only thing we can control is how our controllers add objects back to the queue. It turns out that when you develop a controller and the reconcile finishes, you can basically return three types of answers. One is success: my object is at the desired state, forget about it. The other two are requeue with backoff and requeue after. How do we use them? We use requeue with backoff for errors, and this is fine, because on an error you want the system to recover fast, but if the error is persistent, you want your system to reduce the frequency of retries, because with a persistent error it does not make sense to keep hammering the system with new queries. The last option is requeue after, and we use it, for instance, in the machine controller: when you create a machine, behind the scenes a VM gets created, and one way to wait for the machine to be created is to go and check every 10 seconds whether the machine is there. But if you keep doing this polling at scale, what happens is that your system continuously runs reconcile loops that are just checking for something to happen, and at scale this creates noise, this creates load on your system. The better way to do this is to use watches. You can think of a watch like a notification: it's like instructing the VM that whenever it gets ready, it sends a notification back to your machine controller, and this notification is just another event in the queue. This allows you to keep your system idle, doing nothing while waiting, and when something happens in your system, you just react. This is actually a good way to reduce what we have in the queue.
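A minimal sketch of the watch-instead-of-polling idea with controller-runtime; infrav1.FooMachine and vmv1.VirtualMachine are hypothetical types standing in for an infrastructure provider's objects:

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	infrav1 "example.com/provider/api/v1" // hypothetical provider API group
	vmv1 "example.com/vm/api/v1"          // hypothetical VM API group
)

func SetupMachineController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&infrav1.FooMachine{}).
		// Instead of returning ctrl.Result{RequeueAfter: 10 * time.Second}
		// and polling for the VM, watch the VM objects (owned by the machine
		// via an owner reference): when the VM becomes ready, its update
		// event enqueues the owning machine, and the controller stays idle
		// in between.
		Owns(&vmv1.VirtualMachine{}).
		Complete(r)
}
```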
OK, so, reconcile best practices. This is basically a simplified view of how our reconcile loops usually work: at the beginning we read the current state and also the desired state, then we compare the current and the desired state, and we align to the desired state by writing something in some way. In the first phase, what I would recommend is to try to read all the objects that you usually need for most of your code paths, so that you have them available and you can pass them through to the functions where you need them. If you don't do that, you usually end up reading them in various places multiple times, and that's not great, especially if you're not using client caching. Then, definitely try to use client caching, because it's the difference between microseconds and many seconds when just reading from the local cache, but be aware of the caching caveats, and ideally really run the scale test, look at the traces, and see if you're actually using the cache; that really helps. In the second phase, try to avoid duplicate expensive operations, like the example we had before: generating private keys is definitely something you should not repeat if you can avoid it. Take a look at the profiles; they should tell you if you're wasting time on something repeated in there. Then, try to write only once per reconcile. What we usually do in Cluster API for the current object that we reconcile is, at the beginning of the reconcile function, set up a defer, so that whenever that reconcile finishes in some way, it doesn't matter if it's with an error or not, we do just one patch call to this object; see the sketch below. For the other objects that you potentially write, just try to minimize the write calls. And then, if applicable, try to use watches instead of constantly requeuing with something like 10 seconds. It doesn't always work, but in a lot of cases you can just avoid a lot of unnecessary reconciles.
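Here is what the write-once pattern looks like with Cluster API's patch helper, roughly (error handling trimmed and the reconciler struct is illustrative):

```go
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/patch"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type ClusterReconciler struct {
	client.Client
}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (_ ctrl.Result, reterr error) {
	cluster := &clusterv1.Cluster{}
	if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	patchHelper, err := patch.NewHelper(cluster, r.Client)
	if err != nil {
		return ctrl.Result{}, err
	}
	// One deferred patch per reconcile: every in-memory change made below
	// is written back in a single call, whether we return an error or not.
	defer func() {
		if err := patchHelper.Patch(ctx, cluster); err != nil && reterr == nil {
			reterr = err
		}
	}()

	// ... compare current and desired state, mutate `cluster` in memory ...
	return ctrl.Result{}, nil
}
```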
OK, great. If you follow these best practices, your controllers are already in pretty good shape, but if you really have to do optimization work, there is a list of dos and don'ts. The first do is: get the right tools for the job. Without metrics it is not possible to do performance improvements, and without automation, doing performance improvements becomes very slow, not effective, and it costs money. Define measurable goals, because then at every iteration you can check whether you are making progress, and you can also define when you are done. Invest time in learning how controller-runtime and client-go work, because they provide the tools to improve performance. And the last do is: iterate fast. Do small changes and then look at the system, because sometimes small changes completely change the behavior of your system, and it does not make sense to make big plans; just make a small change, remove one bottleneck, measure again, and repeat, as fast as you can. The don't is: don't do optimizations if you don't have a metric that points clearly to a bottleneck, because if you change code based on guessing, you risk that your system becomes less performant, or you risk introducing unnecessary bugs.

And with that, we are done. Before opening up for questions, I would like to thank a couple of contributors who really helped in this effort: Christian Schlotter, who is basically driving all the work for generating metrics out of CRDs, work that can be reused by everyone for every CRD; Lennart, because he triggered a lot of discussions about how to optimize controllers; and Killian and Yuvaraj, because they really helped in building the automation and the mock provider that made it possible to do all this work in three weeks. Thank you everyone again for attending. If you have a question, we have a microphone for you; I think you have to walk to the microphone.

I have a question. I think that one of the topology controllers was one of the first to use server-side apply. I saw in one of the last slides that one of the ways you control how many times you write per reconcile loop is using a deferred patch. We do the same thing, and we have in various cases run into some awkward interactions between a deferred patch and server-side apply. Do you know if you ran into anything similar?

I can take that one. So, we didn't really mention what we did with server-side apply, because that happened half a year ago or so. The topology controller is a really special case in Cluster API, so we did a totally different, custom solution to make it performant. Essentially, what we're doing is getting the current state, computing the desired state, and then doing some sort of hashing to figure out if we even have to do a server-side apply, and we avoid it at all costs. That is basically the thing that makes the topology controller performant: in a lot of cases we just don't do anything, because we know that nothing changed, so there is no reason to apply again. So there's definitely no deferred patch in there; what we built there instead tries to figure out if we even have to do a server-side apply. I think the summary is that if you can avoid an operation, that is the best optimization you can do, and so we avoid a server-side apply whenever we can.

Thank you for the talk; a lot of useful learnings about how to optimize controllers in general. My question is more towards the Cluster API project. There is an open source project inside the CNCF called Open Cluster Management, OCM. I would really like to know if you have looked at that, and what the difference is between OCM and Cluster API, because both say that they manage and maintain Kubernetes clusters.

To be honest, I asked the exact same question at the last KubeCon, and if I remember correctly, Open Cluster Management is used in some Red Hat products, potentially in combination with Cluster API. But maybe just ask after the talk; there are folks sitting around here who hopefully know the answer.

So, OCM itself is the open source project inside the CNCF, the Cloud Native Computing Foundation, and the product Red Hat brought out of it is Advanced Cluster Management. Normally Red Hat is really good at that: they do open source and then build products out of it. But I was always confused, and I wanted to see if you were more into understanding what the difference is between the two. And I know Cluster API was backed by Tanzu and VMware, so, rivalry aside, where do these two open source projects meet and where do they differentiate?

Unfortunately, we just don't know. Just one clarification: Cluster API is a community project, and currently there are 80 companies contributing to it, including Red Hat, Microsoft, and many others. I don't have context about the product, but what I want to make clear is that it is a community project and everyone is helping to make it possible.

OK, we have another 30 seconds if there is another question. If not: great, thank you everyone for attending, and enjoy the rest of the conference.