Hello, and welcome to our talk on how to migrate 100 clusters between clouds without downtime. My name is Manuel Strasburg. I'm a systems architect and tech lead at Kubermatic. We do a whole bunch of Kubernetes and cloud-native consulting, and we develop the Kubermatic Kubernetes Platform, as well as KubeOne, our Kubernetes management tools. With me today is Tobias Schneck, head of professional services at Kubermatic, and he's going to say a few words about himself now. Yes, thanks Manuel. My name is Tobias. I've been working for two and a half years at Loodse, now Kubermatic, mostly responsible for professional services, which is where we discovered this way to migrate clusters. It started as a fancy idea, and now we want to present how far we've come on that journey. Back to Manuel. All right, thanks. So why would you actually want to migrate clusters between cloud providers? There are a couple of reasons. On the business side of things, you might have better contract conditions at another cloud provider, so you would be able to save costs. There could be the need to migrate data centers to a different hosting or cloud provider, for logistical or legal reasons, or you're driven by a multi-cloud strategy and want to decrease your dependency on one single cloud provider and expand to others. There are also technical reasons. Again, a more logistical one might be the physical relocation of a data center, or you might want to migrate to another network segment for separation of concerns or other reasons. Or you might want to adopt improvements available at a new provider — new features or different infrastructure technologies you want to use there.
Or you're bound to constraints around data location for certain services you're running, maybe on cloud-offered services — for example, where you run your machine learning workloads and where that data lives, or GDPR compliance requirements you need to fulfill. So what are the main challenges around moving to another cloud provider? Kubernetes itself abstracts infrastructure, but it has several kinds of dependencies nonetheless. It consumes infrastructure resources: for example, the virtual machines the cluster itself runs on. It uses and consumes the network provided — the IP address space, routing and firewall rules, the management of ingress and egress traffic, and also DNS, as always — and external storage systems. Then there are Kubernetes components that actually depend on a certain cloud provider. Most of that sits in the cloud controller manager, which contains the node controller for updating Kubernetes nodes; the service controller, which translates a Service of type LoadBalancer within Kubernetes into an actual load balancer in the cloud provider's environment; and the route controller, which is responsible for setting up network routes in the cloud provider's network. But there are also things like storage classes that map to cloud-provider-specific storage offerings, and sometimes the overlay network you use has a dependency on a cloud provider as well. And just to remind ourselves, a quick overview of the components of Kubernetes: the central one is the API server, which handles all changes of state in the cluster itself, and the worker nodes run the kubelet, which runs the actual workloads that users deploy on top of the cluster. Between the API server and the kubelet there has to be two-way communication, and between the kubelets — I mean, between the worker nodes — as well.
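One concrete example of such a provider dependency is the StorageClass: the same logical class has to be re-created with a different provisioner on each cloud. A minimal sketch — the class name and parameters are illustrative, using the in-tree provisioners of that Kubernetes era:

```shell
# Hedged sketch: the same logical storage class must exist per cloud,
# because the provisioner is provider-specific.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast                         # same name on both clouds so PVC specs stay portable
provisioner: kubernetes.io/aws-ebs   # on GCP this would be kubernetes.io/gce-pd
parameters:
  type: gp2                          # AWS EBS volume type; GCP would use type: pd-ssd
EOF
```

Keeping the class *name* identical on both sides means application PVCs don't have to change during the migration.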
And that is one of the main challenges we're solving today: keeping these paths of communication open. That leads us directly to our actual dependencies when we migrate. For us, the application workload has the highest priority. But for that to be the case, we need to ensure the fundamental networking rules that Kubernetes expects are in place. Those rules are: all containers within a pod can communicate unimpeded on layer four, so TCP and UDP; all pods can communicate with all other pods within the cluster without NAT; all nodes can communicate with all pods, and vice versa, also without NAT; and finally, the IP that a pod sees for itself is the same IP that others see for that pod. This is really important to keep up so that the actual networking between pods and applications works. We also have external dependencies that need to stay reachable, like the externally routed IPs for LoadBalancer and NodePort services within the cluster, and DNS names need to remain reachable and resolvable. And storage: some applications have state that needs to be migrated without data loss. When we look at this at a scale of maybe hundreds of clusters, we're looking at large organizations running a whole bunch of clusters in different locations, for different organizational units, in different time zones. For those cluster users, the cluster itself is just a service that they consume. That means the cluster connection and secrets — the actual interface the user has to the cluster — are not allowed to change. Otherwise, it would disrupt the service those users consume. So how do we solve that? The status quo is a multi-cloud setup with the Kubermatic Kubernetes Platform, our open source Kubernetes management platform.
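These pod-network rules can be spot-checked with two throwaway pods — a hedged sketch; the pod names and images are arbitrary choices, not from the talk:

```shell
# Spot-check pod-to-pod reachability and the "no NAT" rule:
# the receiver must log the sender's own pod IP as the source address.
kubectl run receiver --image=nginx --restart=Never
kubectl run sender --image=busybox --restart=Never -- sleep 3600
kubectl wait pod/receiver pod/sender --for=condition=Ready --timeout=120s
RECEIVER_IP=$(kubectl get pod receiver -o jsonpath='{.status.podIP}')
SENDER_IP=$(kubectl get pod sender -o jsonpath='{.status.podIP}')
kubectl exec sender -- wget -qO- "http://$RECEIVER_IP/" >/dev/null   # pod-to-pod works
kubectl logs receiver | grep -q "$SENDER_IP" && echo "no NAT: source IP preserved"
```

Running the same check across the old and new cloud's nodes during the migration verifies that the VPN overlay really preserves these invariants.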
It has the concept of a seed cluster that holds the containerized control planes of the user clusters — the user clusters being the Kubernetes clusters managed by KKP. The worker nodes themselves are provisioned via the Kubermatic machine-controller, a Cluster-API-conformant operator that translates MachineDeployment objects into actual machines, i.e. VMs at cloud providers. And we use Canal as our default overlay networking, which is effectively flannel VXLAN overlay networking combined with the Calico network policy plugin. The target is to migrate the user and seed cluster control planes and worker nodes to a different cloud provider. We keep all external cluster endpoints stable — that means the control plane's Kubernetes API server endpoints and the actual application endpoints, i.e. the DNS and ingress routing. Out of scope for now is storage replication. The assumption is that the application layer manages storage replication itself, like etcd does — a feature we will use to migrate the user cluster control planes. So how does it look? We have a Kubermatic installation with a seed cluster, which is also just a Kubernetes cluster, running in Google Cloud. It hosts a couple of user clusters, and we'll look at a vSphere user cluster that runs its worker nodes on a vSphere cluster. We want to move all of that over to AWS, because for whatever reason we want to run on AWS. Some recommended prerequisites for doing this in production: you have to announce a maintenance window and block cluster updates so they don't interfere with the actual migration process. You have to ensure that the backup and recovery procedure for the seed and user clusters, and also for the application workloads, is tested and proven to be working. You should create a target cloud cluster as a reference — in our case an AWS cluster — just so you can copy some settings over.
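A MachineDeployment as consumed by the machine-controller looks roughly like this — a hedged sketch following the machine-controller CRD layout; all concrete values (names, sizes, replica count) are illustrative:

```shell
# Hedged sketch of a machine-controller MachineDeployment: the controller
# translates this spec into actual VMs at the configured cloud provider.
kubectl apply -f - <<'EOF'
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineDeployment
metadata:
  name: worker-vsphere            # illustrative name
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      workerset: worker-vsphere
  template:
    metadata:
      labels:
        workerset: worker-vsphere
    spec:
      providerSpec:
        value:
          cloudProvider: vsphere
          cloudProviderSpec:       # the provider-specific part that becomes VM settings
            cpus: 2
            memoryMB: 4096
          operatingSystem: ubuntu
EOF
```

The migration idea later in the talk boils down to pausing one of these and creating a second one whose `cloudProvider` points at the target cloud.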
And you have to ensure that you actually control the DNS entries, and are able to switch them over to the new cloud endpoints once the workload has been migrated to the new provider. Now we'll look at the actual solution approach, and Toby is going to give us a quick demo of how it works. OK then, over to me. Thanks, Manuel. Let me share my screen — hopefully you can see it now. So, what is the solution approach? First of all, we want to migrate the user cluster workers; we want to have new workers in the target cloud. How can we achieve that? We use the so-called machine-controller in Kubermatic and KubeOne. This controller can create workers based on CRDs called MachineDeployments. A MachineDeployment is similar to a Deployment of pods — it's a deployment of machines. If we reconfigure the machine-controller and give it a new specification, it can create the machines in the new cloud. To ensure the traffic stays reachable, we need a way for pods to talk to pods and nodes to nodes across clouds. For that, we create a VPN overlay via a DaemonSet and route the traffic of the CNI — in our case Canal — through the VPN network. That ensures that, even with different network segmentation, everything can still talk to each other as long as this client-to-client VPN traffic is enabled. And finally, we have to ensure reachability: we keep the ingress endpoint stable for as long as possible, then transfer the workload to the new cloud, and after the workload is in the new cloud, we delete the old connectivity. OK, how can this look? We have the seed cluster, which hosts the containerized control plane. Here we have some controllers, and we have the containerized Kubernetes control plane. We also have a VPN server there, which is used for VPN traffic between the control plane and the workers.
So the workers can connect to the control plane, and the control plane can tunnel back for kubectl logs and exec calls through this VPN tunnel. We also have a machine-controller configured to place machines on the vSphere cloud, and between the vSphere workers we have an overlay based on Canal. And we have a MetalLB service, which creates the inbound-traffic load balancer for the dedicated application endpoints. The first migration step is to deploy a VPN DaemonSet. This DaemonSet ensures we have a VPN client on every worker node. It opens a new interface on each worker node, and we route the traffic of our Canal overlay through that VPN interface. For that, we need to pause the cluster, because the cluster is controlled by a cluster controller, and this controller would otherwise reconcile the machine deployments and the VPN server. To make sure that doesn't happen, we pause the cluster while we make our patches. The next step, once we have the credentials for the new cloud in place, is to update the cluster spec and specify the new AWS cloud. The machine-controller then gets updated, and we get a new machine-controller instance that can talk to AWS. New nodes get created, and they join our VPN network and therefore also the Canal overlay routing. The cloud controller now also ensures that a new AWS load balancer gets created. For the moment this LB receives no traffic — traffic still flows through the MetalLB service to the workload. After this happens, we can remove the workload from the old workers and move it to the new nodes. Once that's done, we can also switch the DNS names over to the new cloud load balancer and ensure that traffic is now migrated to the new cloud.
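The per-node VPN client described here can be sketched as a privileged DaemonSet — hedged: the image name and config secret are hypothetical stand-ins, not the talk's actual artifacts:

```shell
# Hedged sketch: one VPN client pod per node opens a tunnel interface in the
# node's network namespace, which Canal then routes its overlay traffic over.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vpn-client
  namespace: kube-system
spec:
  selector:
    matchLabels: {app: vpn-client}
  template:
    metadata:
      labels: {app: vpn-client}
    spec:
      hostNetwork: true                        # the tunnel must live in the node netns
      tolerations:
      - operator: Exists                       # run on every node, even tainted ones
      containers:
      - name: openvpn
        image: example/openvpn-client:latest   # hypothetical image
        securityContext:
          privileged: true                     # needed to create the tun interface
        volumeMounts:
        - {name: config, mountPath: /etc/openvpn}
      volumes:
      - name: config
        secret: {secretName: vpn-client-config}  # hypothetical secret with client config
EOF
```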
Finally, we clean up the old resources: we remove the VPN overlay, which is no longer needed because the workers can now talk to each other over their normal eth0 interfaces — and that's how we migrate. To show you how far this already works, we prepared a demo. Be aware that this project is not fully finished yet, so it's more of a proof-of-concept state. As we see here, we have an app — our reference echo service — deployed on the cluster. Let's take a look at our Kubermatic control plane. In this control plane, we have the clusters. First we have this kubecon-migrate cluster, which we want to migrate. We see that this cluster contains the containerized control plane, a machine deployment of two nodes, and the so-called echo service, nginx, and MetalLB. MetalLB points to the IP address on the vSphere side and exposes our echo service. If we go to vSphere, we see, under the cluster ID — here, KBHC — the machines running. These are what we now want to move to AWS. The first step, which we already did, is deploying the VPN — we patched our VPN server in. We have a cluster spec, and everything you see in the UI is also represented as a Cluster CRD in the seed cluster. In the cluster namespace — this namespace here — we see the control plane components: the VPN server running, the API server, and all the other components. If we connect to the user cluster — let's take this shortcut; yes, I want to go to that cluster — we see a bunch of containers: our echo service, our nginx, our Canal, our VPN client. That's what we now want to migrate. So, the first step, the VPN, is already in place. We can now move on to patching the control plane and migrating it — which is, I think, the most interesting part of the whole thing.
So we run our update-target-cloud script. What does it do? First we need the cluster ID and the project ID. The project ID we find here in our Kubermatic URL — we can copy it and paste it in. The first step is to create a backup — a backup of the specification of this cluster. So yes, I want to create it, and as the next step, I want to pause the cluster. After I pause the cluster, the controller doesn't interfere anymore, so I can safely update my cloud provider. Yes, I want to patch the cloud provider — this creates a new cluster spec, which we can now take a look at. If we go to my files, we see a new file under control-plane. Here we are: we have one backup-cluster YAML and the patch YAML. Let's see what the difference is. So currently, that's our Cluster CRD. On the left we have the backed-up cluster spec: we have our API server token and so on, and we have the finalizers, which clean up the cloud resources when we delete the cluster. And here is the important part: the vSphere configuration, which references a credential, the folder, and so on. That's what we now remove, along with the status fields, and we apply this change to our cluster. Removing the credentials first is needed. So let's apply that and see what happens. We see it's now configured, and to start reconciling, we need to unpause the cluster so that the cluster controller can take care of the change. Let's see what happens now: the cloud spec gets recreated by the cloud controller. Hopefully everything works well, and we see the cluster getting reconciled. Yes — we get the new object, the vSphere section is empty — okay, this looks good.
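The backup-and-pause step can be sketched as plain kubectl against the seed cluster — hedged: the `spec.pause` field follows the KKP Cluster CRD convention, and the cluster ID is a placeholder:

```shell
# Hedged sketch of the backup-and-pause step from the demo script.
CLUSTER_ID=kbhc-example   # placeholder for the real cluster ID from the URL

# 1. back up the full cluster spec before touching anything
kubectl get cluster "$CLUSTER_ID" -o yaml > "backup-cluster-$CLUSTER_ID.yaml"

# 2. pause the cluster so the cluster controller stops reconciling our patches away
kubectl patch cluster "$CLUSTER_ID" --type=merge -p '{"spec":{"pause":true}}'
```

With the controller paused, the subsequent edits to the cloud spec stick until the cluster is explicitly unpaused.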
And we can now watch — yes, we see the API server reconciling to what is now an empty cloud provider. Okay, cool, that was the first step. As the next step, we want to change over to AWS. Let's leave this view and pause the cluster again. Currently, with the way the cluster controller works, we need these stepped updates, because we are just a kubectl client and not an operator. Now we can patch the cloud provider. What happens? Yes, I want to patch it, and yes, I want the new secrets — this creates a new Secret reference for the AWS credentials. Then we create a new cluster patch file again. Let's see what's different this time. We have basically removed the finalizers, because they are not valid anymore; we added an annotation for the Kubermatic AWS region; we patched the AWS cloud spec with the credentials; and we have a new migration datacenter — the datacenter we want to migrate to. After we unpause, the controller tries to reconcile what a new AWS cluster should look like. And the nice thing is that Kubermatic then creates a new security group, new roles, and so on — that's hopefully what's happening now. So let's patch the cluster — configured — and now let's unpause the cluster and see what happens. Good. What we see now is the reconcile: we get AWS data back from the Kubermatic controller, and we see it created a new security group and an instance profile. Cool. Let's see what happens with my components. We see the API server restarting again — it starts now with the new cloud provider credentials. Let's try to figure out a bit of what's happening here. We have a deployment — let's take a look into it. Wrong pane; let's see what's placed there. Can I describe the deployment? For sure, I need the right kubeconfig.
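The AWS switch-over patch can be sketched like this — heavily hedged: the exact `spec.cloud` field names depend on the KKP version, and the datacenter, secret, and cluster ID are placeholders:

```shell
# Hedged sketch of switching the cluster's cloud spec over to AWS.
CLUSTER_ID=kbhc-example   # placeholder cluster ID

# point the cluster at the AWS datacenter and the new credentials secret
kubectl patch cluster "$CLUSTER_ID" --type=merge -p '{
  "spec": {
    "cloud": {
      "dc": "aws-migration-dc",
      "aws": {"credentialsReference": {"name": "credential-aws", "namespace": "kubermatic"}}
    }
  }
}'

# unpause so the controller reconciles the new provider
# (creating the security group, instance profile, roles, ...)
kubectl patch cluster "$CLUSTER_ID" --type=merge -p '{"spec":{"pause":false}}'
```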
Now we go to the API server deployment of our cluster and see that it is specified with our new cloud provider, AWS. Reconciliation takes place, and we also get a new machine-controller that is able to talk to AWS. So what we can create now are the new AWS workers. Let's go back into the user cluster — here, to this kubecon-migrate cluster — and first pause all machine deployments, to be safe. We pause the worker machine deployment so that we don't accidentally roll all the old machines. Pause — done. Then we can create our new AWS workers. For that, we first need a few inputs: the cluster ID, plus the instance profile and the security group that Kubermatic created automatically. Here in the metadata we have the cluster ID, so let's set that. We have the AWS instance profile that was created, and we have the AWS security group. Okay, let's fill those in, save the file, and run the deploy script. Now, fingers crossed, we create new AWS workers. Yes, I want to create them. Here we see what gets rendered into the machine deployment: we see our security group, we see that we want t3.medium instances, and the availability zone they will run in. Okay, then let's deploy it — fingers crossed. Yes, it has been created, and I want to watch the creation. We now see a new machine deployment with the boot target AWS, which creates two new machines that get provisioned in AWS. Our workload is still running, as we can see. We have our target cluster, and we see the new AWS node group booting up. Let's go to AWS and see what has been created — hopefully the AWS console is fast enough.
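The pause-old, create-new worker step can be sketched like this — hedged: the field layout follows the machine-controller CRD, and every concrete value (names, profile, security group ID) is a placeholder for what KKP generated:

```shell
# Hedged sketch: pause the old vSphere MachineDeployment, then add an AWS one.
kubectl --kubeconfig user-cluster.kubeconfig -n kube-system \
  patch machinedeployment worker-vsphere --type=merge -p '{"spec":{"paused":true}}'

kubectl --kubeconfig user-cluster.kubeconfig apply -f - <<'EOF'
apiVersion: cluster.k8s.io/v1alpha1
kind: MachineDeployment
metadata:
  name: worker-aws
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels: {workerset: worker-aws}
  template:
    metadata:
      labels: {workerset: worker-aws}
    spec:
      providerSpec:
        value:
          cloudProvider: aws
          cloudProviderSpec:
            instanceType: t3.medium
            instanceProfile: kubernetes-kbhc-example    # created by KKP; placeholder
            securityGroupIDs: [sg-0123456789abcdef0]    # created by KKP; placeholder
          operatingSystem: ubuntu
EOF
```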
We can now go to the EC2 instances, and we should see that two new instances have booted. And yes — we see them initializing. These are the two new machines we created. If we take a look here, we also see the tag with the cluster they belong to. So we have new machines; let's wait until the nodes have booted. In the meantime, we can take a look at the load balancers. Hopefully a load balancer has been created as well — that's what the AWS cloud controller manager does, because we already have a Service of type LoadBalancer in the cluster, and AWS picks that up and creates a load balancer on its side. Let's see — not that one — there it is. Here we see it's for our cluster, but it has no instances yet, because the instances are still booting. So let's go back and see how fast they come up. Okay, we now have a new node. It's not Ready yet, but we can already connect to it, so let's try to SSH into one AWS node. Let's go there and see what's happening. Yes, I want to connect. We see that we have a flannel route, and we will soon have the VPN tunnel interface as soon as the VPN client has started. Here we go — there's the interface, and we can now test the interconnection from one cloud to the other. I connect to the on-premise node down here, and let's test whether we can now talk between clouds. Here as well, I have the IP address of the VPN interface. And let's see if we can ping it from our AWS node. Okay — we get a response from the cloud node to the on-premise node. Good, cool. The next step is to migrate our workload. Okay, then let's go back here and try to migrate the workload.
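The cross-cloud check shown here can be sketched as — hedged; the host addresses, interface name, and VPN IP are placeholders:

```shell
# Hedged sketch: from the new AWS node, verify the tunnel interface exists,
# then ping an on-prem node's VPN address across the clouds.
ssh ubuntu@aws-node.example.com 'ip addr show tun0'      # tunnel interface is up
ssh ubuntu@aws-node.example.com 'ping -c3 10.20.0.42'    # on-prem node's VPN IP (placeholder)
```

A successful ping here is the proof that node-to-node traffic works across the two clouds, which is what the pod overlay rides on.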
We have now switched to the user cluster to see what's happening, and we take a look at the echo-server namespace. We see the echo server is deployed on the vSphere migration node, and we want to roll the workload out to the new cloud. So let's cordon the old nodes, because new workload should not land on them anymore. Good — we cordon them, and the old node is cordoned. That means the node is now marked as unschedulable — yes, SchedulingDisabled, perfect. We can now use the kubectl rollout restart feature to restart our deployment of the echo server in the echo-server namespace — this triggers a rolling release of the echo server without any change to the application. Let's trigger it and watch: down here we see the application is still up and running. And yes — now we see a container getting created in the new cloud, so first success. Hopefully this will work. We see a new container running on the new AWS workers, and we can now terminate the old one. This proceeds step by step, with the new workers coming in, and you can see the service stays reachable the whole time. And if we go to the browser, we can see in the details that responses now come back from the AWS nodes — we see from the hostname that the workers are running in the new cloud and are still reachable through our old endpoint. Cool. So the next step would be to migrate all the other workloads to the new cloud, remove the old load balancer, and start using the new DNS name. We can quickly check whether the new load balancer is already listening: we have the new instances, and maybe the DNS name has already propagated. Let's change to the new DNS name and — no, DNS is not there yet.
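Condensed, this drain-to-the-new-cloud step is just two kubectl features — a sketch; the node label and namespace/deployment names are illustrative:

```shell
# Stop scheduling onto the old vSphere nodes (label is illustrative).
for n in $(kubectl get nodes -l workerset=worker-vsphere -o name); do
  kubectl cordon "$n"
done

# Rolling restart: the replacement pods can only land on the uncordoned AWS nodes,
# while the rolling-update strategy keeps the service available throughout.
kubectl -n echo-server rollout restart deployment/echo-server
kubectl -n echo-server rollout status  deployment/echo-server
```

Using `rollout restart` instead of deleting pods by hand means the Deployment's surge/unavailability settings guarantee the zero-downtime property.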
So, good — that part is now migrated to the new cloud. What do the next steps look like? That's how we do the migration for the worker nodes of user clusters; to move completely to another cloud, we now need to migrate the seed cluster as well. We can achieve that the same way, reusing the same principle. At this point the workers are in the new cloud, and we need to migrate the control plane. The good thing is that the workers' workload is safe in any case, because it's already in the new cloud — we can attempt the migration even if we might break something. The only thing that might not work during that window is an upgrade of Kubernetes, but the workload stays safe. So how do we migrate the seed cluster? First we create new master nodes for the seed. That means creating new nodes, putting a new Kubernetes API load balancer in front of them, and updating the API endpoints, because these API endpoints have to stay stable for the Kubernetes cluster. Then, for sure, we block the seed cluster from upgrades so that nothing unexpected gets triggered during the migration. And then we migrate the user-cluster control planes the same way we just migrated the workload on the user clusters: we move the etcds to the new cloud, using the etcd quorum for data migration. And of course we should have backups and recovery for all the components. What does this look like? For example, here we have our seed DNS for the API server pointing to three masters. We have the AWS cluster that will host the control plane, and we have the worker nodes hosting the containerized user-cluster control planes, everything on GCP. As the first step we again deploy the VPN, this time in the seed cluster, connecting all worker nodes and master nodes. Then we create a quorum of five etcd members across the master nodes, which ensures that the two new etcds get the data replicated from the GCP nodes to the AWS nodes.
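In KKP the user-cluster etcd runs as a StatefulSet in the cluster namespace, so growing the quorum can be sketched as follows — heavily hedged: the namespace name is a placeholder, and the assumption that a plain scale-up makes the new pods join as members is illustrative of the idea, not the exact mechanism:

```shell
# Hedged sketch: grow etcd from 3 to 5 members so the members scheduled onto
# the new cloud's nodes replicate the data before old members are removed.
kubectl -n cluster-kbhc-example scale statefulset etcd --replicas=5

# Verify all five members joined and are healthy before removing anything.
kubectl -n cluster-kbhc-example exec etcd-0 -- etcdctl member list -w table
```

With five members, losing the two old-cloud members still leaves a majority of three, which is what makes the cut-over safe.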
As the next step, once we've achieved that, we can remove two GCP members and still have a quorum — three out of five is enough. We can then also change the DNS name to point to the new AWS load balancer. Then we shrink the quorum from five back to three, and we have a stable control plane. Of course, in this procedure we also need to clean up: remove the old GCP node, create one more new AWS master node, and then we have three healthy etcds migrated to the new cloud. In the same way we migrated the user clusters' workers, we now need to migrate the seed cluster workers. For that, we again deploy the VPN, create new worker nodes, and change the cloud provider here as well so that it points to AWS. After that has happened, we can increase, in the Kubermatic settings, the default value of etcd replicas to five. That means the Kubermatic user-cluster controller manager creates two new etcds for every user cluster. This ensures, in the same way we migrated the etcd at the seed level, that it gets migrated for the user clusters too. So we have a quorum of five etcds running, still pointing to the old DNS, but spanning both clouds. As the next step, we create a new cloud load balancer and rename the wildcard DNS record — that's the connection from the already-migrated user-cluster workers to the new cloud load balancer. And then we switch over: we add one new AWS node and replace all the GCP nodes, so we end up with a quorum of three etcds per user cluster in the new cloud — and that's the target. Now we are safe, and we can remove the old remaining worker nodes and clean up the old cloud resources. That means we scale down to an etcd level of three, we remove the VPN overlay, and we have the Canal overlay routing between the workers over eth0 again. And then we have successfully migrated every user cluster and the seed cluster to the new cloud.
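The wildcard switch-over can be verified from outside before cleaning anything up — a sketch with placeholder names:

```shell
# Hedged sketch: verify the wildcard record before and after the switch.
# Before the switch it resolves to the old GCP load balancer:
dig +short app.seed.example.com

# After updating the wildcard record at the DNS provider, confirm it now
# resolves to the new AWS load balancer before removing the VPN overlay:
dig +short app.seed.example.com | grep -i 'elb.amazonaws.com'
```

Only once the record demonstrably resolves to the new endpoint is it safe to tear down the old load balancer and the VPN.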
So that's the approach that we think makes it possible to migrate really a hundred clusters without downtime, in a stable way. For sure we're not yet at the point where we just press a button, but that's our future target: we want to automate this whole procedure, and to really make it scalable we need to write an operator. Currently we did everything with hacky bash scripts, and to avoid doing the health checks by hand, we need an operator that checks the conditions and also has repair and retry options. The key we need for that is the reconciling pattern: in the same way the Kubermatic controller reconciles the new cloud provider and so on, we can use reconciliation for the migration itself — we say, okay, we have a new target cloud, operator, take care of converging the old state to the new state. Technically, we also need to stabilize the VPN connection. Right now we have only one VPN server, which could become a bottleneck on bigger clusters; maybe we should deploy multiple VPN servers, and maybe we need a softer switch between the VPN and the host-network overlay — that's something we need to explore more. And maybe WireGuard could be an alternative VPN connection that would help with the setup. Okay, one more detail: currently we tested this on a 1.17 seed cluster, where the managed-fields feature is not included. That feature causes a bit of trouble when you have patches applied by kubectl and an operator reconciling the same fields — that's why I'm currently limited to 1.17 seed clusters; as seen in the demo, the user cluster can be on 1.18. Cool — then, thanks for your attention. We're happy to answer any questions, so feel free to reach out now. Thanks for listening, and Manuel, a few last words from your side.
That was amazing, thank you so much — and we're open to questions now. Okay, thanks a lot, and looking forward to your questions.