All right, thanks for joining this session after lunch. I'm really excited that so many people are in the room. I actually moderated a couple of sessions yesterday, and some rooms were not quite full, so I'm really excited about this topic. My name is Daniel O., I'm a CNCF ambassador, and I'm happy to be the moderator for this session today. Thanks for joining us. We're going to talk about how to migrate 700 Kubernetes clusters, not 7, not 70, but 700 Kubernetes clusters, to Cluster API with zero downtime. I'm really happy to see so many people here. Please welcome our next great speakers, Sean and Tobias from Mercedes-Benz Tech Innovation. Please welcome the speakers. Thank you.

So hello, everyone, and welcome to our talk on how we migrated 700 Kubernetes clusters to the cluster lifecycle management software Cluster API. Today we will present how we replaced our legacy Kubernetes fleet management, formerly based on Terraform, with Cluster API. The most important requirements were zero downtime and no effort for the cluster users. I'm Sean Schneweis, a software engineer at Mercedes-Benz Tech Innovation and a maintainer of Cluster API Provider OpenStack. This is my first KubeCon, and I'm so happy to meet all of you.

Hi, my name is Tobias Giesel. I am also a software engineer at Mercedes-Benz Tech Innovation and a maintainer of Cluster API Provider OpenStack, and I have been working with Kubernetes for five years now.

This technical presentation targets software and operations engineers and everyone who is into cluster management. It may be a bit hard to follow in the beginning, but it's definitely worth the effort, so stick with it. Mercedes-Benz Tech Innovation is a subsidiary of the German car manufacturer Mercedes-Benz, headquartered in Ulm. At Mercedes-Benz Tech Innovation we don't build cars, we build software. We were formerly known as Daimler TSS, just in case you want to check the commit history or already know us under that name. Our platform team develops and operates a huge cluster fleet: 900 clusters all over the world in four data centers, in Atlanta, Beijing, Frankfurt, and Stuttgart. Just a note: when we submitted this talk, we only had 700 clusters. 200 more clusters in half a year is not too bad, I think.

So, the agenda: first we set the stage so everybody knows what we are talking about. Then we will talk about the legacy provisioning architecture, after that the target picture with the migration to Cluster API, and in the end the lessons learned and our next steps.

OK, we want to get to know you a bit better, so please raise your hand. Who of you knows Cluster API? OK, nice. Who is using Cluster API in production? Still a few, nice. Who manages, and it doesn't have to be with Cluster API, more than 10 clusters? 100 clusters? 1,000 clusters? Oh, OK. Quite some experience in this room, I really like it.

Cluster API provides essential cluster management for the complete lifecycle of a cluster: creating, deleting, upgrading to a new Kubernetes version, and scaling, so adding additional nodes to your cluster. Cluster API is a project maintained by the Kubernetes Special Interest Group Cluster Lifecycle, and it works with multiple cloud providers such as AWS, Google, Azure, OpenStack, and many, many more.
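To make that lifecycle management concrete, here is a minimal, hypothetical MachineDeployment sketch, not taken from our setup: scaling a cluster means changing `replicas`, and a Kubernetes upgrade means bumping `version` together with new templates. All names are illustrative, and the API versions depend on the Cluster API and provider releases you run.

```yaml
# Hypothetical MachineDeployment for an OpenStack-backed workload cluster.
# Scaling the cluster = editing .spec.replicas; upgrading Kubernetes =
# bumping .spec.template.spec.version (plus new templates).
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: demo-cluster-md-0          # illustrative name
  namespace: demo-tenant
spec:
  clusterName: demo-cluster
  replicas: 3                      # scale by changing this field
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: demo-cluster
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: demo-cluster
    spec:
      clusterName: demo-cluster
      version: v1.21.0             # upgrade by bumping this version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: demo-cluster-md-0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
        kind: OpenStackMachineTemplate
        name: demo-cluster-md-0
```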
Actually, right now there's a separate talk teaching how to create your own cloud provider plug-in for Cluster API. This support for many cloud providers is why we chose Cluster API: we can offer not only private cloud with OpenStack, but also public cloud to our business partners. Here you see the Cluster API turtles that are about to start their journey. But before that, I would like to clarify some basics, the key aspects of the title of our talk.

First off, how to migrate. The starting point is any existing cluster. It doesn't matter if it was created with some kind of automation or if you created it on your own; in our case, it's a legacy pipeline. We then want to manage the infrastructure and the cluster itself the Kubernetes way, which normally means custom resources and controllers. The architecture of Cluster API allows an incremental transition path, so you don't have to do everything at once; you can split it up and adapt only those parts that you really need.

Each of those three turtles stands for a workload cluster. A workload cluster is what we usually hand to our users; it consists of a dedicated control plane and multiple worker nodes. We want to migrate 700 of those workload clusters to be controlled by Cluster API. Our largest management cluster, that is the cluster that hosts all the Cluster API resources, controls up to 200 of those workload clusters. The outer shell of the turtle is the infrastructure: the networking, the firewall, the router, the load balancer. That is also managed by Cluster API.

Now, when migrating, one really important requirement for us was zero downtime. We have business-critical applications, and downtime is just not acceptable for our users. So we definitely want to make sure that the Kubernetes API and the applications are available throughout the migration. This is not a greenfield approach where you get a new, empty cluster and the users migrate from one to the other. No, we will not touch the user's workload. This is really important. So, as we have now clarified the basics, Tobias, would you like to tell us where we started from?

OK. This is the picture of our legacy cluster provisioning architecture. We have a custom user interface where our users can order a cluster, and a custom API that triggers several Jenkins pipelines. These pipelines trigger Terraform runs for the cluster provisioning: the infrastructure, control plane, worker nodes, et cetera. Once the cluster is provisioned successfully, Ansible deploys the runtime deployments. Runtime deployments in this case are the Kubernetes core deployments like CNI, CSI, CoreDNS, and so on. After this, the cluster add-ons, like metrics exporters or custom controllers, are also deployed with Ansible. So this is the legacy provisioning.

Now that you know where we are coming from, let's take a look at the target picture. The user can order a cluster like before, but in the background there are some changes. The main change is that Jenkins triggers a custom binary that deploys resources into our Cluster API management cluster, and Cluster API reconciles the infrastructure, the control planes, et cetera. After the cluster is provisioned, our custom controller hooks in at the right time and deploys the runtime deployments, and Flux deploys our cluster add-ons. That's it.
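To make the target picture a bit more concrete, this is a hedged sketch of the kind of resource pair such a binary could apply to the management cluster so that Cluster API starts reconciling a cluster; the names, namespace, CIDR, and API versions are illustrative assumptions, not our exact manifests.

```yaml
# Hypothetical resources the ordering pipeline could apply to the management
# cluster; once they exist, Cluster API reconciles the infrastructure.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo-cluster
  namespace: demo-tenant
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]   # illustrative CIDR
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: demo-cluster-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
    kind: OpenStackCluster
    name: demo-cluster
---
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
kind: OpenStackCluster
metadata:
  name: demo-cluster
  namespace: demo-tenant
spec:
  cloudName: mycloud                   # entry in the clouds.yaml secret, assumed
  identityRef:
    kind: Secret
    name: demo-cluster-cloud-config    # assumed credentials secret
```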
OK, perfect. Let's start with the preparation. We are using OpenStack as the infrastructure layer, and Cluster API Provider OpenStack identifies resources by their names. So we first had to rename the resources to match the naming scheme the provider expects. This is the first step, and it can be done with Terraform; just make sure everything is renamed correctly. The Cluster API provider for AWS, for example, uses IDs instead of names, which makes this much easier. So take a look at how your provider identifies resources.

There are multiple ways to migrate to Cluster API. We decided to do it in three steps. The first step, after the resources are renamed, is to migrate the infrastructure so that Cluster API is able to reconcile it. For this we need the Cluster resource and the infra cluster resource, in our case the OpenStackCluster. After this, we can go on to the second step. We chose the worker machines as the second step because they see more changes at runtime, and we want Cluster API to reconcile the MachineDeployments when a user changes the replicas, the image, the flavor, and so on. That's why we do the MachineDeployment, or worker node, migration before the control plane. The third step is the migration of the control plane. This one is really critical, because if something goes wrong here, etcd can suffer data loss and your cluster will be broken. So take care on this step.

OK, let's move to the first step, the cluster and infra cluster migration. This is really, really easy: you take the specs from your legacy provisioning, put them into the infra cluster resource, and deploy it. Then you deploy the Cluster resource, referencing the infra cluster object, for Cluster API to reconcile. And that's it: your infrastructure will be reconciled and Cluster API is able to manage it. Perfect. Sean?

Thank you. OK, so we have now migrated the infrastructure, and we have continuous reconciliation by the Cluster API controllers, with metrics and status easily available from our management cluster. The next step is to migrate the existing worker nodes. A node in Cluster API is typically managed by a Machine, and therefore we will create three resources, the MachineDeployment, the MachineSet, and the Machine itself, plus the provider-specific implementation, the OpenStackMachine.

Now, the problem is the existing control plane, which is not part of the second step; it's part of the third step. Cluster API is not yet aware of this control plane. To overcome this problem, we create a fake KubeadmControlPlane. The KubeadmControlPlane is how Cluster API manages a control plane, and this fake KubeadmControlPlane of course needs a Machine. So we just create a dummy Machine that is not connected to any real infrastructure. We only create the resource and add the control plane label, which identifies it as a control plane machine. On the right side, we also create a dummy KubeadmControlPlane. And please be aware: we add the paused annotation to both of those resources, which prevents Cluster API from doing anything with them. The last step to get your fake KubeadmControlPlane in place is to patch the status fields and add the control plane ready condition to the Cluster, telling Cluster API: OK, you are ready now. And that's it for this part.
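As a rough sketch, such a paused dummy pair could look like the following; the names, label values, and API versions are assumptions, and the status patch described above is not shown here.

```yaml
# Hypothetical dummy control plane Machine; the paused annotation keeps
# Cluster API from acting on it while the real control plane is still
# managed by the legacy tooling.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: demo-cluster-control-plane-dummy
  namespace: demo-tenant
  labels:
    cluster.x-k8s.io/cluster-name: demo-cluster
    cluster.x-k8s.io/control-plane: ""           # marks it as a control plane machine
  annotations:
    cluster.x-k8s.io/paused: "true"              # prevent any reconciliation
spec:
  clusterName: demo-cluster
  version: v1.20.0                               # illustrative
  bootstrap:
    dataSecretName: demo-cluster-dummy-bootstrap # empty placeholder secret
  infrastructureRef:                             # placeholder, never reconciled
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
    kind: OpenStackMachine
    name: demo-cluster-control-plane-dummy
---
# Hypothetical dummy KubeadmControlPlane, also paused.
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: demo-cluster-control-plane-dummy
  namespace: demo-tenant
  annotations:
    cluster.x-k8s.io/paused: "true"
spec:
  replicas: 3
  version: v1.20.0
  kubeadmConfigSpec: {}                          # empty, it is never used
  machineTemplate:
    infrastructureRef:
      apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
      kind: OpenStackMachineTemplate
      name: demo-cluster-control-plane-dummy     # placeholder template name
```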
Now we can actually migrate the existing worker nodes to be represented in Cluster API. For each worker node in your cluster, we create an OpenStackMachine. It is connected to the existing server through the provider ID, in our case provided by OpenStack. The name of this OpenStackMachine is then referenced in the Machine. The other fields, such as the bootstrap secret, don't really need correct values; you can just create an empty secret, put in its name, and you're good to go. So we now have all worker nodes represented by an OpenStackMachine and a Machine.

Next, we want to create the MachineSet. The MachineSet is an immutable abstraction over Machines. There are labels, specifically the deployment name, that should be added to all resources on this slide; the MachineDeployment needs this label to identify the resources that belong to it, so make sure to add them. Then create the MachineDeployment, whose name matches the deployment-name label, and set the replicas to match the number of worker nodes in your existing cluster. Also, before creating the MachineDeployment, create the KubeadmConfigTemplate and the OpenStackMachineTemplate, reference them in the MachineDeployment, add the Kubernetes version you are using, and that's about it. Pretty easy, I guess.

If you look at it from the customer's perspective, it looks as follows. At the top of the slide, you see the state right after we migrated the worker nodes to be managed by Cluster API: the node names stay the same. When we then perform a rolling update to a new Kubernetes version, new worker nodes are added to the cluster and get new node names, because they were bootstrapped and provisioned by Cluster API.
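Here is a hedged sketch of adopting one existing worker node as described above; the provider ID, names, labels, and the empty bootstrap secret name are illustrative, and the exact OpenStackMachine fields depend on the provider API version you run.

```yaml
# Hypothetical adoption of one existing worker node. The provider ID points at
# the already running OpenStack instance; the bootstrap secret is an empty
# placeholder, as described above.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
kind: OpenStackMachine
metadata:
  name: demo-cluster-worker-0
  namespace: demo-tenant
  labels:
    cluster.x-k8s.io/cluster-name: demo-cluster
    cluster.x-k8s.io/deployment-name: demo-cluster-md-0   # ties it to the MachineDeployment
spec:
  providerID: openstack:///11111111-2222-3333-4444-555555555555  # existing VM, illustrative
  flavor: m1.large                                               # illustrative
  image: ubuntu-k8s-1.20                                         # illustrative
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: demo-cluster-worker-0
  namespace: demo-tenant
  labels:
    cluster.x-k8s.io/cluster-name: demo-cluster
    cluster.x-k8s.io/deployment-name: demo-cluster-md-0
spec:
  clusterName: demo-cluster
  version: v1.20.0
  providerID: openstack:///11111111-2222-3333-4444-555555555555
  bootstrap:
    dataSecretName: demo-cluster-worker-0-bootstrap   # empty placeholder secret
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha6
    kind: OpenStackMachine
    name: demo-cluster-worker-0
```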
OK, so will you continue with step three then? Of course. Thank you, Sean. So after we have migrated our worker machines, we come to the last step, the migration of the control plane and its KubeadmControlPlane Cluster API resource. Let's take a look. First, we create a new KubeadmControlPlane with its real data. Here we define the version, the kubeadm config spec with the cluster join configuration, the machine template with the infrastructure reference, et cetera. The reason we create a new KubeadmControlPlane is that Cluster API's webhook denies some changes to the KCP spec; creating a new one with a new name is simply the easiest way.

After we have created the KubeadmControlPlane, we can adopt all the legacy-provisioned control plane machines. We create a secret and a KubeadmConfig, like in the MachineDeployment step, with empty values, just as dummies, and we create a Machine resource for every legacy-provisioned control plane machine. We add the provider ID of the OpenStack instance to reference the infrastructure itself. Once we have the Machines, we create the OpenStackMachines; this is the resource that enables Cluster API Provider OpenStack to reconcile the virtual machines themselves. Here we reference the Machine we created and the KubeadmControlPlane we created. And also, sorry, one step back: we have to pause the new KubeadmControlPlane as well as the Machines, just to make sure they are reconciled at the correct time.

So we now have the OpenStackMachines created, and that's mostly it. We only have to change the KubeadmControlPlane name in the Cluster's control plane reference, unpause the KubeadmControlPlane, unpause the Machines, and that's it. In the screenshot below you can see that the new control plane, the one from Cluster API, joins the cluster. It's not ready yet, but it will get ready. And once it's ready, Cluster API removes the legacy-provisioned control plane and we are done. So easy. After this step, we can clean up our dummy objects, which is simple, and we are done: we have migrated our first cluster, and eventually our 700th cluster, and the migration is complete.

OK, so how does it look? Are you still following? It's an easy one-two-three approach, I hope. Now we want to share some best practices, tips and tricks, and lessons learned to sum this up.

The first one is not really Cluster API specific; it's a general recommendation. Whenever we talk about zero downtime, we have to ensure that we can safely drain the nodes, and that requires action from the users. We ask our users to add PodDisruptionBudgets to their deployments so that we can safely drain their nodes. Sometimes one of those PodDisruptionBudgets is not working, or the Kubernetes upgrade gets stuck because of one of those PDBs. In that case we contact the customer, ask them to fix it, and the upgrade then continues by itself through the Cluster API reconciliation.

The next two things are really nice functionalities of the Cluster API controllers: the pre-drain hook annotation and the pre-terminate hook annotation. The pre-drain hook is executed before a node is drained. We use this pre-drain annotation to mark all nodes that will be deleted during an update anyway as unschedulable, so whenever a pod gets rescheduled, it usually lands on a new node and not on one of the old ones. This speeds up the update a bit. The second one is the pre-terminate hook, which is executed directly before the node is deleted. We use it mostly for infrastructure-related cleanup, such as detaching volumes and removing the node from the load balancer. I highly encourage you to make use of this functionality.
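For illustration, both hooks are plain annotations on the Machine object. This is a hypothetical example showing only the annotations we would add; the hook names after the slash and the owner values are freely chosen by whoever sets them, and the rest of the Machine stays unchanged.

```yaml
# Hypothetical use of the machine deletion phase hooks. As long as such an
# annotation is present, Cluster API waits before draining or terminating
# the machine; the owning controller removes it when its work is done.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: demo-cluster-worker-0
  namespace: demo-tenant
  annotations:
    # Runs before the node is drained, e.g. to mark old nodes unschedulable.
    pre-drain.delete.hook.machine.cluster.x-k8s.io/cordon-old-nodes: platform-team
    # Runs before the instance is deleted, e.g. to detach volumes and
    # remove the node from the load balancer.
    pre-terminate.delete.hook.machine.cluster.x-k8s.io/cleanup-infra: platform-team
  # spec omitted; the remainder of the Machine stays as it is
```

The controller that owns a hook removes the annotation once its work is done, and the Machine controller then continues with draining or deleting the node.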
The next one is nightly builds and testing. Testing is key. We highly recommend that you test the migration, and not only the migration itself from legacy to Cluster API managed clusters; also test creating new clusters, and deleting and upgrading clusters. This really helped us to identify nasty bugs and issues. So yeah, testing is always important.

And the last thing is actually a problem we faced: a caching problem. The infrastructure is not always as stable as it should be, as you probably know, and sometimes OpenStack machines failed or ended up in an error state. Most of the time we then forced the deletion of such an OpenStackMachine, for example by removing the finalizer. The problem is that the controller watching those resources doesn't notice that the resource was actually deleted, because its cache then holds stale data. To mitigate the problem, we simply restarted the controller pods. But luckily, one of our colleagues created a pull request for the controller-runtime project and fixed that small but nasty bug. OK, that's about it. Tobias, where are we going next?

Thanks, Sean. With the migration to Cluster API, we are not done. Our platform is permanently under construction, with feature improvements, bug fixes, user feature requests, et cetera. The replacement of Ansible with Flux v2 is our next topic and already in progress. Another point: during the initial planning of the migration, Cluster API features like ClusterClass, the Runtime SDK, or MachineHealthChecks were simply not there yet, so we are also planning to adopt these features. Something we are really proud of is that we developed cluster-api-state-metrics, a metrics exporter for Cluster API, and we are going to merge this code into the core Cluster API repository, making it a donation to the CNCF. A big thank you to all of you. Last but not least, we are moving forward to also offer public clouds to our users. With Cluster API this is really easy: we can just use another provider, and that's it. Yes.
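Since Ansible is being replaced with Flux v2 for the cluster add-ons, here is a minimal, hypothetical sketch of how one add-on could be wired up with Flux; the repository URL, path, namespaces, and intervals are assumptions, not our actual configuration.

```yaml
# Hypothetical Flux v2 wiring for one cluster add-on; repository URL, path,
# namespaces and intervals are assumptions.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: cluster-addons
  namespace: flux-system
spec:
  interval: 5m
  url: https://example.com/platform/cluster-addons.git   # assumed repository
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: metrics-exporters
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: cluster-addons
  path: ./addons/metrics-exporters       # assumed path in the repository
  targetNamespace: monitoring
```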
If you want to get involved, just ping us on Slack or follow us on GitHub. And by the way, there's a Cluster API intro and deep dive in the next room in a few minutes from Yuba Rush and Vince, so just join, maybe you can learn something. Thanks. And if you have questions, now is the best time.

Hello, this is Anderson here from Brazil. I'd like to know the time frame from the start of the planning until the migration of cluster number 700. OK, so the question was the time frame from starting the planning to migrating the 700 clusters, right? Yes, thank you. OK, I'll just answer. The initial planning was around, let's say, the end of 2020, and the whole year 2021 was used for migrating those clusters. Usually we wait for a Kubernetes release for which we have to redeploy all the worker machines anyway, and we use that window to also do the migration to Cluster API. Maybe you saw it on the slides: with 1.20 we migrated the worker nodes, and with 1.21 we migrated the control planes. So that's about it. But actually migrating the 700 clusters took maybe two weeks, because we were doing batch updates to those worker and control plane machines.

That was also part of the question, whether we do the update batch-wise or cluster by cluster. Each migration step was coupled to a Kubernetes release update: the Terraform renaming was done with one release, the infrastructure migration with another, and so on. So it took four releases to complete all the steps, to be honest. But when updating the clusters themselves, it doesn't matter anymore with Cluster API. The Cluster API controllers support a machine concurrency, a cluster concurrency, and in fact a concurrency setting for all objects, and for us it is usually set to 10. So if we have 200 clusters in our largest management cluster, we just trigger all 200 at once, and Cluster API works on 10 objects at the same time.

Here's another question: one of the slides at the beginning showed that there are five platform teams. How many engineers in total are taking care of the platform and these 900 clusters? So in general, as far as we know, the five platform teams are about 30 or 40 people. The core platform team for the Cluster API migration is about six to eight people. So the migration of the 700 clusters to Cluster API was done by roughly eight people.

Any further questions? Any other questions? Oh yeah, there we go. I have a question about the initial provisioning. You say you manage your clusters using Cluster API, but what about the management clusters? Let's imagine a situation where everything has burned down and you have to deploy the management cluster. How are you going to deploy the first one? To provision a management cluster, we use kind. Then we provision a new cluster and do a clusterctl move to move all the management resources to the new cluster. That solves the initial chicken-and-egg problem. And we do backups with Velero, so we can restore the management cluster with all its resources, and then it's done. Does that answer your question? OK, nice. Perfect.

Which technique did you use while migrating the clusters: canary, blue-green, or something else? As far as I know, we are using canary updates, canary testing. Yeah, canary, I would say. Thank you. Yeah, this is not blue-green. OK.

OK, we've got a couple more questions. Hi. Did you by any chance have bare metal servers anywhere, bare metal workloads, or any plans for that? We are not using bare metal servers. We have OpenStack as the layer we talk to, beneath it are virtual machines, and the next step is public clouds like Azure, AWS, Google, and so on. OK, I was thinking about Ironic on OpenStack. Did you ever try it? We haven't yet, but it's definitely interesting. Thank you.

All right, one question in the back. OK. I think you mentioned 200 workload clusters per management cluster. Why is there a limit? Is there a limit on how many clusters can be managed by a management cluster? It's not really a limit, but we have multiple private cloud regions or zones, and because of this we have a separate management cluster per region or zone. That's the reason. We simply don't have more users in a single zone right now. I think it will scale pretty well up to a certain point, but we haven't reached that point yet.

OK, any more questions? Yeah, we've got one more here. OK. Hi. Was this an unattended migration? And if yes, could the user decide when to migrate the cluster? It was an attended migration, because we wanted to use upstream features from Cluster API. And what was the last question? Sorry, did you get that? OK, perfect. Thanks.

All right, any other questions? OK, I think we are done. Thanks for a great presentation. Thank you so much. Thank you. Thank you, Daniel. And thanks for attending once again, and hopefully you'll be able to do more.