Off we go. I want to thank everyone for joining us. Welcome to today's CNCF live webinar, "How we manage thousands of clusters with minimal effort using Gardener." I'm Libby Schultz and I'll be moderating today's webinar. I want to introduce our speakers today: Samarth, a software engineer, and Hardik, a software developer, both at SAP. A few housekeeping items before we get started. During the webinar you're not able to speak as an attendee. There is a chat box on the right-hand side of your screen; please feel free to drop your questions there and we'll get to as many as we can at the end, or throughout, depending on our flow. In addition, please join our CNCF public Slack channel #cncf-online-programs to continue the conversation later and address any questions we didn't get to. This is an official webinar of the CNCF and as such is subject to the CNCF code of conduct. Please do not add any questions that would be in violation of that code of conduct, and please be respectful to all of your fellow participants and presenters. Please also note that the recording and slides will be posted later today to the CNCF online programs page at community.cncf.io under Online Programs. They're also available via your registration link, and the recording will also be available on our online programs YouTube playlist on the CNCF channel. With that, I will hand it over to our speakers to kick off today's presentation. Thanks so much.

Thank you, Libby. Thanks a lot, everyone, for joining, and let me first introduce myself. I'm Hardik, a software developer on Gardener. I have been working on Gardener mainly on the machine management and autoscaling side, and otherwise I am once in a while active in the Cluster API community and the autoscaling community.

Hello, everyone. Thanks for tuning in. This is Samarth. I'm a developer on Gardener and I primarily work on a component called machine controller manager.

Okay, so let's get started. First things first: what's the motivation? This webinar is about Gardener. In brief, it is an open source initiative by SAP: a fully managed control plane as a service that offers homogeneous clusters potentially on any cloud provider and is fully customizable and scalable. We have been managing thousands of clusters, for real, for a while now, and this webinar is about giving a glimpse into the what, why, and how, and we will be doing something interesting along the way.

Yes. Managing thousands of Kubernetes clusters at scale is not a cakewalk, and over these three to four years Gardener has evolved to be so robust and scalable that it has actually made managing thousands of clusters a cakewalk. Gardener runs everywhere: on our own infrastructure and on other cloud providers. The experience it gives with respect to the versions and features offered is pretty homogeneous, even though it supports various cloud providers as well as our own infrastructure. It is fully automated and fully managed, with practically zero manual ops. It is highly scalable, even beyond a single cluster, and it is highly customizable; that is, we expose all the configuration knobs you need.

Yes. And to communicate the idea a bit more effectively, this is what we are going to do: we are going to host a hypothetical, highly scalable application.
We are going to call it Botanist Quest, and host it on a platform which would really need something as robust as Gardener. We will be assuming roles and doing a role play, where Hardik will be the founder of this hypothetical application, Botanist Quest, and I will be its product manager. This webinar is basically going to be a set of arguments and brainstorming between us to design Gardener from scratch, and to convince you that such a robust platform can practically exist for applications like Botanist Quest, or for other critical applications of yours that need such a platform with these foresighted features.

Let's get started then. So, hey, Samarth, shall we start planning on taking Botanist Quest to new heights already?

Hello, Hardik. Yes, that's why I'm here. What do you have so far?

So this is what I have: one very nice Kubernetes cluster with three dedicated control plane machines. It's serving a bunch of beta users, and we have already got really good feedback for the application. We are going to launch the general release very soon, and our initial set of target users would be around 500 or so.

Pretty good. And what does the platform look like today?

As I said, we have a beta app hosted on one cluster, and the plan is that we simply scale this setup to five. We would have five clusters with dedicated control plane machines, and all of them would be hosting Botanist Quest in a hybrid model. They would also be running on different cloud providers. And yeah, that's the situation at the moment.

That is good. But five dedicated clusters might not be enough scaling, because as per my research, Botanist Quest is getting a lot of traction in the market, and I see that something similar to Pokemon Go might potentially happen with Botanist Quest too: you might have planned for an expected influx of 5x in the worst case, but your traffic might hit 50x, which is what happened with Pokemon Go, right? So you might have to scale massively and across geographical locations. You mentioned multi-cloud: are you planning to take the managed clusters across different cloud providers? Because if you are doing so, there probably won't be a homogeneous experience when it comes to managing the clusters, and as for transparency of the control plane, I doubt we would get that. I also want you to employ our own infrastructure that we have at different locations. So, can you align the strategy along these lines?

Yeah, yeah, I get it. So basically, my team is then going to replicate the installation. We would have 30 Kubernetes clusters; we'll use the same awesome tooling that we have been using. It will be multi-cloud, yes. And all of them will again have three dedicated control plane machines, so it will be super reliable, and I think it should work like a charm. What I would also do is divide the clusters across different regions so that the customers in different regions are better served. And that should, I think, be good enough.

Yeah, that does look good, but 30 clusters with three dedicated control plane nodes for each cluster isn't really as cool as it appears to be. You know why? The first reason is that the control plane nodes are never fully utilized; they're always underutilized.
So for 30 clusters, you will end up having 90 control plane nodes, and these will incur cost faster than your team can scale up your clusters. And not just that: have you even considered the operational complexities that your team might face? What if some cluster runs into volume mount issues, some cluster has an unreachable API server, and some other cluster has some other issue? If these things start happening simultaneously, the team will go haywire. More than that, how have you planned to manage the tracking of these clusters, their config files, cloud credentials, etc.? Manually managing 30-plus clusters, that too with dedicated control plane nodes, is probably not how a cloud provider or a software-as-a-service should operate. So to summarize: with the proposed solution of 30 clusters with dedicated control plane nodes for each cluster, the excessive underutilized control plane machines will only incur cost, and there will be operational complexities as well as cluster management complexities.

Yeah, those are actually good points. Let's take a first look, I would say, at Kubernetes itself, or let's ponder what we already know. First, we know that dedicated control plane machines are usually underutilized in most setups, as you say. The second point, which is well known and also the beauty of Kubernetes, is that the control plane and the workload are decoupled; they don't necessarily have to run together. And the third and more important point is that the control plane components themselves are actually full-fledged workload applications; they can probably be treated as workloads themselves. Okay, so what can we infer from this information, and maybe innovate, to address your concerns? Maybe we can experiment with treating the control plane of a Kubernetes cluster as yet another workload. Maybe we actually host this workload on some other Kubernetes cluster; we basically do Kubernetes inside Kubernetes. What do you think about that?

Wow. So you basically want to migrate the control planes of these clusters as workload onto another cluster. This essentially is Kubernetes inside Kubernetes. Isn't this like Kubeception?

Yes, it is Kubeception. So in order to improve the resource utilization of the control plane nodes, I think we will spawn one Kubernetes cluster manually and call it a management cluster, and then we use that management cluster to host the control planes of the other clusters. For the visualization, what you see on the screen is what I would like to propose: we have clusters across different locations, and we simply move the control planes as containerized applications into a single nice management cluster. That sounds like a good idea to me as of now.

Okay, this looks good. Maybe I want to take a closer look into the management cluster. Can you take me around it?

Yeah, sure. Let's double down and take a closer look. Essentially, the control plane of each — let's call them child clusters — would have its own dedicated namespace. That's the first level of isolation. Of course, we don't want them to mess with each other, so we can also isolate them using network policies. That would be the baseline idea; a minimal sketch of such a policy follows.
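As a minimal sketch of the isolation just described — assuming one namespace per hosted control plane, with an illustrative namespace name — a standard Kubernetes NetworkPolicy would keep each child cluster's control plane pods reachable only from within their own namespace:

```yaml
# Illustrative only: one namespace per hosted control plane, with a
# default-deny policy so control planes of different child clusters
# cannot talk to each other.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace
  namespace: child-cluster-a   # hypothetical namespace of one hosted control plane
spec:
  podSelector: {}              # applies to every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}      # allow ingress only from pods in the same namespace
```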
Another thing I would really want to consider is that we actually use Kubernetes here and do not reinvent the wheel. So I would use Deployments and StatefulSets and such battle-tested, built-in controllers to deploy the API server, etcd, kube-scheduler, and components like that. This should also reduce the blast radius, by effectively having to manage only one management cluster instead of the 30 clusters of the previous case.

That appears to be pretty efficient. But I think it only addresses the cost issue, right? The excessive underutilized control plane machines were migrated as workload onto one single management cluster. But when it comes to the lifecycle of these control planes and the lifecycle of the underlying machines, I think we are back to square one. In my opinion, you must take care of the lifecycle of these hosted control planes and of the workload machines of the child clusters more efficiently, because vanilla Kubernetes does not have the domain knowledge it would need to manage these.

Right. Agreed on that; I think we are circling back to the main issue, so let's take a step back and look at it again. First of all, this actually looks like a natural candidate for the controller, or operator, pattern. Just to reiterate: an operator is basically a Kubernetes controller which also comes with additional domain knowledge to manage its own resources. And what we have here is the abstracted control plane in the form of pods. What we could do is represent this control plane with dedicated CRDs. So let's do it this way: we have the control plane pods, and we represent them using a Cluster CRD. This Cluster CRD would have all the knobs and necessary configuration options that decide the whole lifecycle of a given cluster. This will help us get cluster creation, rollouts, updates, patching, hibernation, deletion, and whatnot. But this should not be it, in my opinion. On top of the Cluster CRD, we would also need to take care of another very important and very dynamic infrastructure component: the workload machines. So we can also introduce CRDs for machines, and it could look like MachineDeployment, MachineSet, and Machine, in the same way that we have Deployment, ReplicaSet, and Pod. The way the deployment controller always ensures that a certain number of replicas of a pod are running, and does very fine-grained rolling updates of the pods, we could implement similar functionality for the machines. So we have a MachineDeployment which would help us do the right kind of rolling updates and so on. And yes, with this kind of abstraction — call it the machine API — we get seamless autoscaling as well, because with such abstracted, dedicated CRDs, the higher-level functionality and automation become really easy. Just imagine the cluster autoscaler making use of this, and we get autoscaling for free on all cloud providers, even on-premises setups, and so on. So this helps a lot.

That is really good. So essentially you're telling me that the machines and the clusters we are trying to deal with will now be treated as first-class citizens of Kubernetes, with a cluster controller manager in place.
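To make the idea concrete, here is a purely hypothetical shape such a Cluster resource could take; the API group and every field name are made up for illustration and are not Gardener's actual API:

```yaml
# Hypothetical Cluster CRD for the design being sketched in this discussion.
apiVersion: clusters.example.io/v1alpha1
kind: Cluster
metadata:
  name: botanist-quest-eu
spec:
  kubernetes:
    version: "1.20.2"        # desired control plane version
  region: eu-west-1
  hibernation:
    enabled: false           # lifecycle knob: scale everything down when true
  workers:                   # the worker section the controller turns into MachineDeployments
    - name: pool-1
      machineType: m5.xlarge
      minimum: 3
      maximum: 5
      maxUnavailable: 0      # rolling-update constraints for the machines
      maxSurge: 1
```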
Yeah, and here is also a detailed realization. Essentially we would have the cluster controller manager: a controller which takes care of both kinds of CRDs. This controller would run in the management cluster, and upon creation of a Cluster CRD, the control plane of the child cluster is deployed first by the controller. Then, in turn, the same controller deploys the machine CRDs: it pulls the information from the cluster's worker section, and the rest is taken care of by the machine deployment.

Nice, enlightening. This design proposal looks good to me, so shall we prototype it, and maybe you can showcase a demo for me?

Yeah, sure. Cool, then let's move to the demo. You might be surprised that I already prepared it — I'm that fast. We have three terminals here: two to the management cluster and one to my workload cluster. I want to quickly look at the shoot — I'm going to call it a shoot, because it's my cluster, it's my Botanist Quest. For the one shoot cluster, which is called the demo CNCF cluster, I should have a dedicated namespace: the namespace which designates this particular shoot cluster. I would expect this namespace to host all of the necessary control plane components. Let's take a look, and we already see that, on top of the essential control plane components like the API server, scheduler, and controller manager, I also have a few other controllers. I have introduced a separate controller for the machine API, which we call the machine controller manager, and also the autoscaler as a separate controller on top — the autoscaler already talks to our machine API now.

Let's also take a very quick look inside our cluster, or shoot, and look at its spec. Here is a glimpse where we see that there are multiple sections in the spec. Hibernation is one of my favorites; it saves so much cost for us. We can also configure all kinds of Kubernetes-related settings fully transparently via the spec. And here we go: the worker section. The worker section is basically what we just discussed. Based on the information that I give here, I say minimum is three, maximum is five, and I keep very fine-grained settings like maxUnavailable and maxSurge, which should be respected during a rolling update. The controller then fetches this information and prepares the right kind of machine deployments.

Let's take a look at what we have in terms of MachineDeployments, MachineSets, and Machines. Sure. We see that we have one MachineDeployment with three replicas, one MachineSet representing it with three replicas, and then three actual Machine objects. Okay, nice. Let's also take a quick look inside the MachineDeployment spec and see what the API actually contains. Here we see again, for consistency, the replicas. It allows us the rolling update strategy; it also allows the recreate strategy. In the case of a rolling update, it deletes the machines one by one, the way the deployment controller does. We have a reference to the machine class, and a node template to sync labels and other metadata back and forth between the node object and the machine object, because essentially the machines are really dynamic. Right.
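For reference, a MachineDeployment in the spirit of Gardener's machine-controller-manager looks roughly like the sketch below; treat the field layout as an approximation, and the object names and namespace as invented for this example:

```yaml
# Approximate sketch of a machine-controller-manager MachineDeployment.
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
  name: shoot-pool-1
  namespace: shoot--demo--cncf     # the shoot's control plane namespace (illustrative)
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # create one new machine first...
      maxUnavailable: 0            # ...never drop below the desired count
  selector:
    matchLabels:
      pool: pool-1
  template:
    metadata:
      labels:
        pool: pool-1
    spec:
      class:
        kind: MachineClass         # provider-specific machine template (image, size, etc.)
        name: shoot-pool-1-class
```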
And yeah, what more could we do? I think we could quickly make a change, because I claimed it takes care of the lifecycle. Let me actually make a very small change in the shoot. Before that, I will watch the machine deployments, and I will also watch the nodes of my workload cluster — the three nodes correspond to the three machine objects in my management cluster. Let's edit our shoot object, or the cluster object. Here I make a small change: I change the machine type from, let's say, xlarge to 2xlarge. With this minimal change — and this is actually the power of the declarative approach — my controller, which is running in the backend, is going to reconcile it. During reconciliation it will update the machine deployments, machine sets, and so on, reflecting the change in the worker section: the machines previously running on xlarge should now run on 2xlarge. But the catch, or the magic, is that this should not happen abruptly, because we don't want to handle only the infrastructure here; we also want to take care of the pods running on it. Because I set maxUnavailable to zero and maxSurge to one, it created a new machine and will wait for that new machine to join; until then it will not delete any machine from the previous set. So one machine is in the pending state, it waits until this new machine joins, and only after the new machine joins does it go ahead and delete one of the old machines.

Okay, this looks good. So essentially every machine is backing a node object that is actually attached, or registered, to the cluster, correct?

Yes, that's true; behind the node object is basically the virtual machine, or the real machine. And one more thing: what we see here is the infrastructure-related part, but if you have a pod running on one of the machines, and that pod has a disruption budget saying "this particular machine I am running on should not be drained unless there are enough replicas elsewhere," then this controller is smart enough — it uses its drain logic in such a way that such pod disruption budgets are also properly taken care of. And in the GIF, with a bit of fast-forwarding in between, we already have all three brand-new nodes available, placed one by one. So that's what I have at a very initial stage to show you. How do you like it?

This already looks good to me. It is honoring the infrastructure SLAs, and it is also honoring the SLAs of the applications running as pods within that infrastructure. Seeing the infrastructure handled as custom resource objects in a pretty declarative way is really nice.
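For a concrete example of the disruption budgets being honored during such a drain, this is a standard Kubernetes PodDisruptionBudget placed in the workload cluster; the app name and labels are illustrative:

```yaml
# With this PDB in the workload cluster, a node drain triggered by a
# machine rolling update will not evict pods below the stated floor.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: botanist-quest
spec:
  minAvailable: 2          # keep at least two replicas running during a drain
  selector:
    matchLabels:
      app: botanist-quest  # hypothetical label of the application's pods
```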
Okay, this is all good, but I see that all the control planes of these shoots — of your Botanist Quest — are hosted on a single management cluster as workload, while the clusters themselves are distributed across regions. So I see a potential issue of cross-region latency here. Alongside this: if I have one management cluster, and this guy is hosting several control planes of several shoots, it should hit an upper limit, correct? After a certain number of hosted control planes, maybe it cannot accommodate more, so you might want to scale the management cluster, right? Can you explain how you're handling these two aspects?

Those are again good points, and honestly I see a very straightforward solution. See, we have one management cluster, and if latency is the problem, then we could simply replicate this management cluster across the regions, and that should basically solve the problem. Although it looks like instead of one management cluster we now have more management clusters, all of them would basically be autoscaled: each management cluster has the cluster autoscaler, which will make sure they do not carry an excess number of worker machines over time.

Okay, okay. If I have to rephrase what you just told me: we are going to replicate the management clusters and host the control planes of the shoots in the geographic vicinity of the distributed management clusters, correct?

Yes, that's true.

Good, this idea is nice, so the cross-region latency is probably handled here. But with time, with increasing workloads and an increasing number of shoots, the density of shoots will increase, and we might have to scale out the management clusters in pretty large numbers too. So how do you plan to manage these management clusters? Is there an elegant mechanism you have already thought about?

Okay, yeah, that's also a valid argument, and you have already caught me entangled in it. I don't want to fall back to square one, so let's look at it again from what we have discussed so far. In phase one we had plenty of clusters with dedicated control planes in one location. Then we decided to move a few of the clusters to different locations in different regions, and this worked pretty well, but this situation has its own set of problems: again we have plenty of control planes standing at plenty of locations. So we went to phase two and introduced a management cluster. We said: having plenty of clusters is fine, but let's move the control planes from them into one single management cluster, and that solved the problem to a certain extent. But this in turn introduced the issue of latency. Latency is again a bit of trouble, so we replicated the management clusters into the geographic vicinity; we moved them out to the different regions. This worked well, but again we fall into the same problem: we could possibly have plenty of management clusters. The way we had to manage plenty of shoot clusters, we now also have to manage plenty of management clusters. To be honest, looking at the Kubeception, and looking at the recursive approach, I would go bold and introduce another cluster. Let's call it a super management cluster, and I would migrate the control planes of these management clusters to this super management cluster, as its workload. And to make our lives a bit easier, I would introduce another CRD and call it a ManagementCluster CRD, roughly sketched below.
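Continuing the purely hypothetical API from before — again, the group and all field names are invented for illustration — such a ManagementCluster resource could look like this, reusing the same machine abstraction recursively:

```yaml
# Hypothetical ManagementCluster CRD for the recursive design being discussed.
apiVersion: clusters.example.io/v1alpha1
kind: ManagementCluster
metadata:
  name: mgmt-eu-west
spec:
  provider: aws
  region: eu-west-1
  workers:                   # the same machine CRDs manage these workers recursively
    - name: seed-pool
      machineType: m5.4xlarge
      minimum: 3
      maximum: 10
```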
So essentially now I have a ManagementCluster CRD which takes care of my management clusters, and then I also have the shoot Cluster CRDs and the machine CRDs. The machine CRDs can also be used for the management clusters in general, because it's completely recursive, and our cluster controller manager runs at the top level, in the super management cluster.

Okay, I think this proposal is also pretty good. So if I understand correctly, the Cluster CRD and the machine CRD we were speaking about on the previous slide will now be applied, or created, in the super management cluster, because the cluster controller manager is also running there. And even these management clusters will be represented as ManagementCluster CRDs in our super management cluster, correct?

Yes, and that should, I think, solve the raised issue. Hopefully.

Yep. I think this looks like a sophisticated design that kind of convinces me we can now manage thousands of clusters. But to actually answer that question — can we now already manage thousands of clusters? — we probably want to look at the flow of adding one new cluster to this ecosystem and see if there are any unknowns, right?

Yeah, certainly; I would not be so quick to judge. Let's take a quick look at what we have so far and what we can do. Currently, we create a cluster object in the super management cluster. Okay, and that will be processed by the cluster controller manager. Then we manually assign this cluster to one of the management clusters. Okay, that sounds fuzzy. The third step: the cluster controller manager reconciles this cluster object and creates the control plane in the dedicated namespace. That's good. And at the end, the cluster controller manager of course takes care of the rest of the lifecycle of this control plane, maintaining it. Looking at this flow, do you also see what I see?

Yeah. There seems to be a similarity with what Kubernetes does with pods; Kubernetes actually does something very similar at a very fundamental level.

I would say then let's compare and find the design parity; maybe we are aligned. Let's take a step back and look at what Kubernetes does with the pods. Essentially we have the kube-apiserver, yes, and then we have the scheduler and the controller manager. The scheduler's job, although really important, is in a sense just to assign a node: it updates the nodeName field on the pod, and that's the job. The kube-controller-manager, of course, takes care of certain other lifecycle aspects. I also know that there is a kubelet on each of the nodes, so when a pod is introduced, it is assigned to a node, and then the respective kubelet fetches the definition and creates the pod's containers.

That's actually a déjà vu moment. So let's see what we have. We have a cluster API server — let's say an aggregated API server — which is going to host our Cluster, or shoot, CRDs. We already have the cluster controller manager, and this controller manager is creating control planes; okay, that's something that could be improved. On the other hand, I also see that we already have the control planes running on the management clusters. So what's actually missing here?
I think there are maybe two components which are really at the core of Kubernetes and could also be really helpful to us. I can already think of a cluster scheduler: a cluster scheduler which assigns a cluster to a particular management cluster, the way the kube-scheduler does for pods. And a "clusterlet," where the respective clusterlet fetches the definition of the cluster object, spawns the control plane, and then does the rest of the automation. This is the business logic we need. And the interesting phenomenon — something I would really like to call out explicitly — is that with the introduction of the clusterlet, we are actually separating the business logic of deploying control planes out of the cluster controller manager, and this really, really helps us with scaling: I can now think of plenty of management clusters, with plenty of clusterlets doing the job independently.

That is nice. And now let me stretch a bit and name it. For Botanist Quest, let's name this little design that we have prepared: let's name it Gardener. So let me introduce Gardener here and talk about its design. Gardener's design is exactly what we saw on the previous slide: we have a gardener-apiserver, then we have a gardener-scheduler and a gardener-controller-manager, which do what we just discussed, and then the gardenlet, which is the essence of the high scale. On each of the seed clusters we have one gardenlet hosted, which is responsible for managing the control planes that run on that particular seed cluster. So if you look at the mapping: the gardener-scheduler maps to the kube-scheduler, the gardener-apiserver maps to the kube-apiserver, and the gardener-controller-manager maps to the kube-controller-manager. The seed cluster is the management cluster, and it maps to the node object in Kubernetes. The gardenlet is of course the kubelet, and the shoot control plane is the pod.

This looks fascinating. And you know what, just to add: this is the core of Gardener's design, and it maps to the design pattern of Kubernetes itself, so we can really reuse our skills. In effect, a Kubeception model of turtles all the way down, together with the requirement of delivering a fully managed Kubernetes as a service, step by step led us to this architecture.
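To make the scheduling analogy concrete: just as the kube-scheduler writes nodeName onto a pod, the gardener-scheduler binds a shoot to a seed. An abbreviated Shoot object — a real one carries many more required fields, and the names here are illustrative — looks roughly like this:

```yaml
# Abbreviated sketch of a Gardener Shoot; spec.seedName plays the role
# that spec.nodeName plays on a pod.
apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: botanist
  namespace: garden-demo
spec:
  seedName: aws-seed-1   # written by the gardener-scheduler, like nodeName on a pod
  region: eu-east-1
  provider:
    type: aws            # which provider extension handles this shoot
```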
Initially, our requirement of running the Botanist Quest application on Kubernetes motivated the whole platform, but now it seems this platform is not only for us; it is really for everyone to build their applications on top of as well.

That is good. So what we started for ourselves looks like it has become valuable for all other potential users too. Going forward, I really like the design already, but we don't want to miss one important aspect, because now we are making it available to the many customers that might find potential usage for it. With increasing adoption we might want to support even more cloud providers, and that may force us to support different operating systems, different network plugins, and different other aspects of cluster management. With the ever-evolving cloud native ecosystem, our system also has to be completely extensible — a system where, you know, the batteries are included, but swappable. So in essence, I just want to bring thorough extensibility into this Gardener that you have built.

That's really a great point, and I can't agree more with you. I would say extensibility should be at the very core of any good design, and Gardener supports a very neat, Kubernetes-native extension model. The extension point is essentially a provider-specific controller, very similar to how extensibility is designed for the cloud controller manager in Kubernetes, for example. A very simple example is the cloud providers themselves: Gardener declares a neat Golang interface for a provider, and the provider has to implement that interface. The interface contains the bare-minimum functions needed for Gardener to support a full-fledged Kubernetes cluster on that particular provider. One simple example is gardener-extension-provider-aws, which targets AWS. And this approach, of course, recursively builds on Kubernetes' own support for the various providers as well. Such an extension announces the resources it handles to Gardener declaratively, as sketched below.
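Roughly, a provider extension registers itself with Gardener through a ControllerRegistration resource like the following; the shape matches Gardener's extension model in spirit, but the exact fields may differ by version:

```yaml
# Approximate sketch: declares which extension resource kinds this
# controller handles, so Gardener delegates AWS-specific work to it.
apiVersion: core.gardener.cloud/v1beta1
kind: ControllerRegistration
metadata:
  name: extension-provider-aws
spec:
  resources:
    - kind: Infrastructure   # VPCs, subnets, and other cloud plumbing
      type: aws
    - kind: ControlPlane     # provider-specific control plane pieces
      type: aws
    - kind: Worker           # worker machine provisioning
      type: aws
```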
So that's the theory, but let's actually look at a beautiful outcome of a well-defined extension model — I would say the power of a well-defined extension model. This is a single Gardener installation where a large number of clusters are being managed on different cloud providers. Gardener itself runs inside the super management cluster, which hosts the control planes of the seed clusters — the green dots in the picture. The workload machines of these seed clusters are deployed on different cloud providers, in different regions, as fits best. And then, to have the least latency, the workload machines of the actual end-user clusters — let's call them the shoot clusters — are deployed in the same region, with their control planes hosted as workload on those management clusters. This is what I have, and it looks as if it can actually handle a large number of clusters — if it actually works, and not only on paper.

Yes. And this picture of the Gardener ecosystem looks so beautiful that I'm not able to take my eyes off it, but I'm forcing myself to, because I want to see it in action. So let's go to a demo.

Okay, let me show you the demo of what I just talked about. What we see on the screen is the Gardener dashboard, for a better user experience; of course, everything can also be done from the terminal. What we see is basically clusters, members, and some utilities. There is the demo CNCF cluster we already saw, with the different fields and sections that can be configured for a given cluster, and you have a chance to jump directly into a terminal from there. Everything you just saw in the overview is, in essence, also in the YAML file here, so you can declare everything in YAML as well. Now let's try to create a new cluster and see what the flow looks like. I'm going to create a cluster on AWS and call it botanist. The version is 1.20. I can set different purposes; let's mark it as an evaluation cluster. I'm going to use the standard AWS secret for that — just the access key and so on. For the worker pools, I can choose among different worker sizes; let's use m5.large, with Garden Linux as the operating system and Docker as the container runtime. I would keep min and max as 1 and 2. Maintenance I would keep as it is: only within this maintenance window would the cluster be updated, not at any random time. And then, of course, my personal favorite: hibernation, where I say that every day at 5 p.m. my cluster should be hibernated, because this is just an evaluation cluster, and I save a lot of cost by bringing down all the machines and control planes every day at 5 p.m. Of course, the same can be done via the YAML. Let's not wait long and go ahead and create the cluster.

I also see a tracker. It says "create processing"; the tracker keeps you up to date on what's happening right now, with some detailed messages, and it says it's deploying the external domain. Let's now look at the backend. At the Gardener API server, I would expect to see another shoot cluster, the one we just created via the dashboard. I already see botanist as a new shoot cluster, and it says the creation is processing. I would also like to see the seed clusters, specifically the AWS seed cluster, because we created the cluster on AWS. I have one seed object which says it's ready, and it is in eu-east-1. Now I am going to watch my shoot clusters to see how the progress goes.

In the next terminal, let's relate our terminals back to the diagram. We have the gardener-scheduler, which plays the scheduler's role here, and I would expect it to have already done something with the shoot object I just registered. Let's see: I see a message which clearly states that it has been scheduled to the seed, which is AWS. It can also be plugged with different kinds of scheduling strategies if you want, the way we can with the kube-scheduler. Perfect. If it's assigned to AWS, then next is the gardenlet running on the AWS seed cluster, or the AWS management cluster. Here I would expect this gardenlet to have at least started doing something: it should have fetched the definition and started creating the control plane pods. And I already see our botanist shoot there, and the gardenlet has started processing it; I think it's doing something in the background, which we'll get to know in the next terminal. I can also take a quick look at the gardener-controller-manager now. From the diagram, I know the controller manager is responsible for taking care of other lifecycle aspects of my shoot cluster, and I see that hibernation and maintenance are basically sub-controllers of it, already taking care of those aspects of my cluster.

And now the most important, or most interesting, aspect of the whole system: the control plane, which should now be running in one of the seed, or management, clusters — AWS in our case. Let me zoom in and search for the namespace which is dedicated to our cluster. I already see one namespace; let's get inside it and see what's already deployed. Okay, I see there are already the API server and etcd, and I think more pods are coming. We also have a nice logging and monitoring setup with Loki.
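As an aside, the 5 p.m. hibernation picked earlier in the demo would look roughly like this excerpt of the shoot spec; the cron-style fields follow Gardener's hibernation schedule API, but treat the exact names and the timezone as illustrative:

```yaml
# Excerpt of a shoot spec: hibernate at 5 p.m. on weekdays to save cost.
spec:
  hibernation:
    schedules:
      - start: "00 17 * * 1,2,3,4,5"   # cron: 17:00, Monday through Friday
        location: "Europe/Berlin"      # timezone for the schedule (illustrative)
```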
On the other hand, I would later also be interested in looking at the workload cluster. But essentially, if you recall the diagram: from the API server it has reached the gardener-scheduler, from the scheduler to the gardenlet, and from the gardenlet, in parallel, to the gardener-controller-manager, and now on to the management cluster. So we see things happening, and the dashboard clearly shows that the creation is ongoing. I think it should take five to seven minutes or so, depending on the infrastructure. I would suggest we take a look at some other key features meanwhile, while the cluster is being created.

Sure, sounds good. So, to add to the whole thing: what we saw was day one. Creating a cluster, even creating thousands of clusters, is still okay; what's really more fascinating is what happens on day two, day three, and so on. We already have customers, people who will create lots and lots of workloads on those clusters, and in such cases we don't want our API server to die; we don't want the other control plane components to be exhausted. What's there to save us is the horizontal and vertical autoscaling of the control plane. What it does is really fascinating: it autoscales the control plane pods both vertically and horizontally as the situation demands.

Then we have etcd backup and restore. This is our savior for disaster recovery: a sidecar container keeps taking snapshots of etcd — let's say a full snapshot every hour; this is perfectly configurable — and then it takes delta snapshots every few seconds. At any point in time, if things go south, it can restore the entire cluster using the snapshots taken previously, giving us a kind of point-in-time recovery with a loss of only a couple of seconds of data.

The next one is also my favorite, where Gardener goes one step beyond and does automatic seed provisioning. If you look at the design, we assumed that there are always X number of management clusters available. But what if we have a sudden increase in the number of clusters — someone just creates 1,000 more clusters — and we don't have enough capacity in the existing management clusters? Gardener also offers a feature where a new management cluster is automatically added, and it actually load-balances the control planes across the different management clusters.

That is very thoughtful.

Yes, that comes with experience, with learning things the hard way. And then the last one is, of course, the autoscaling of all of the clusters at all of the layers we just talked about, and this autoscaling is the key feature behind an enormous amount of resource savings. The cluster autoscaler scales our super management cluster, the management clusters, and the actual shoot clusters, and it works in a cloud-agnostic fashion, as we discussed, because it talks to the machine API, which is Gardener's API. It basically only has a common-denominator requirement: if a cloud provider has the create-machine and delete-machine APIs implemented, which is the bare minimum, then the autoscaler is able to do its job of scaling all the machines as required.
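For the backup cadence just described — hourly full snapshots plus frequent deltas — here is a sketch modeled on Gardener's etcd-druid API; the field names are approximate and the object name and namespace are illustrative:

```yaml
# Approximate sketch of an etcd-druid Etcd resource: the backup sidecar
# takes a full snapshot every hour and delta snapshots in between,
# enabling point-in-time restore after a disaster.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-main
  namespace: shoot--demo--botanist
spec:
  replicas: 1
  backup:
    fullSnapshotSchedule: "0 * * * *"  # cron: full snapshot every hour
    deltaSnapshotPeriod: 30s           # incremental snapshots every few seconds
```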
That is pretty cool. We probably want to look at the cluster creation state again — looks like it is done.

That's beautiful. Yeah, it's already created, and it took only a few minutes. Okay, perfect.

Wow, that was really good. And we already have a set of adopters who are running Gardener and managing thousands of clusters with ease. Obviously, at SAP we use Gardener internally for development purposes and also for production workloads. It is utilized by software developers across all lines of business around the globe; Gardener creates, hibernates, scales, and deletes hundreds of clusters on a daily basis. Gardener is operated by a central platform team, and its premier usage within SAP leads to synergies in the total cost of development and reduced cost of operations in a multi-cloud environment. Gardener is also in use by other cloud providers, such as Fi-Ts, who have extended Gardener with their metal-stack, as well as StackIT and T-Systems. 23 Technologies applies Gardener's multi-provider feature for Gaia-X, a federated European sovereign cloud initiative. PingCAP, the makers of TiKV and TiDB, run their commercial database-as-a-service offering on top of Gardener landscapes, and many of the same reasons apply: running critical applications and systems of record requires you to have complete access to your control plane. Each component of Gardener is independently consumable, which is why it has also witnessed some nice innovations, such as powering Raspberry Pis with Kubernetes using Gardener's machine API implementation. Gardener also often sees external contributions from adopters to support next-generation use cases, such as spot instances for Kubernetes nodes.

And interestingly, the innovation does not stop in the infrastructure domain. In particular, the Gardener-managed seed clusters around the globe can be thought of as standard Kubernetes infrastructure that can host platform services outside the end cluster but near the control plane. Gardener also ships with multi-tenant, multi-cluster DNS and certificate services: as a user, you just need to annotate, or apply a custom resource in, your cluster to consume these value-added but managed services. So think about it: Gardener has the minimal architecture needed to provide all kinds of related services. For us and our community, Gardener is more than just Kubernetes clusters as a service.

Hey, and of course, towards the end, the bullseye question, or the million-dollar question: what's the relation with Cluster API? We know what Cluster API is: a great community project with a very similar purpose, and we are often asked about it. In general, with the latest Cluster API specification, it is possible to delegate the specifics of cluster management to a separate control plane provider — that's the extension model of the Cluster API. There is already a battery included with the kubeadm control plane provider, and that works pretty well with dedicated control plane machines. But with the whole concept of control plane controllers being an extension, what we are planning to do is to have another control plane provider, which is going to be the Gardener control plane provider. Of course, it is not yet implemented.
It will be implemented, and we would be really, really interested if there is any traction externally — anyone who would be willing to help us, to consume it, or to give feedback on this.

Wow. So great things in place and great things on the way. Nice. And I just wanted to show a funny meme which one of our fellow developers created, because he could not really resist when the whole team was creating thousands of clusters and Gardener managed it nicely. Thanks to the team for that. Yes. So yep, that is it. Gardener already has a significant community; these are the social media places to join us. We'd love to hear your feedback, suggestions, contributions, complaints, or otherwise just a hello.

Are there any other questions from anyone? Sounds like you did a good job. I see one question from Tucker.

I can probably answer that question. The question is: doesn't the replication of management clusters across regions again increase the cost factor? Yes, certainly, and I think we answered that already in the flow. Having more management clusters basically has two effects: one is the complexity of managing more clusters, and the other is the cost. The first is, of course, addressed by the super management cluster, and the second is addressed by having very well-defined autoscaling of the management clusters themselves. So having one management cluster with a large number of machines would not be very different from having a few management clusters whose machines are basically divided across different regions. Right.

Last chance — does anyone have anything else? There you go. Thank you. I think there was one more question.

There is one more question, from Mandar: does Gardener facilitate GitOps and deployment of workloads to the cluster? This is a very, very interesting question. It falls slightly outside the bucket of Gardener itself, but we do have an ecosystem where we want to take care not only of the cluster management but also of the lifecycle of the applications which are going to run on top of the clusters. You can probably find it under the Gardener organization; I can share the link later. But we do have a whole facility which takes care of the whole ecosystem of the applications themselves as well.

Thank you all very much, everyone. Thanks for joining us. Again, the recording and slides will be up later today on the website. Thank you again for attending another CNCF live webinar, thank you to our presenters, and everyone have a great day; we'll see you next time. Thanks everyone. Thank you everyone.