Welcome to KubeCon North America. Today, Kong and David will be going through their migration from a single-node Kubernetes control plane to a highly available control plane in production. Over to you, David and Kong. For housekeeping, Q&A will be at the end, so please raise your hand and we'll come back to you. Thank you.

Hi, we're from Databricks, and we're going to be talking about our experience migrating from a single-node Kubernetes control plane to a highly available control plane in production at Databricks. Briefly, the outline of the talk: first I'll talk about how we use Kubernetes at Databricks. I'll talk about the non-HA control plane architecture that we used for many years. Then I'll discuss the HA control plane architecture that we moved to, and how it handles different kinds of failures. Then Kong is going to talk about the migration process we used to move from the non-HA control plane to the HA control plane. And then we'll wrap up with a discussion of some of the modifications we made to our day-two processes to accommodate the HA control plane.

As a brief overview of what Databricks is: the Databricks product is a SaaS data platform that runs on the public clouds. We call it the Databricks Lakehouse Platform, and it's a unified platform that serves many enterprise data use cases, such as data warehousing, data engineering, data science, streaming, and machine learning. The service operates at a very large scale. We have many thousands of customers, and the aggregate workload that we manage is very large: we launch more than 10 million VMs per day across Azure, AWS, and GCP, and our customers use the platform to process many exabytes of data.

This slide shows the high-level architecture of the Databricks platform. It consists of a control plane that runs in Databricks-owned cloud accounts and a per-customer data plane that runs in the customer's cloud account. You can see in the left-hand box some of the services that constitute the Databricks control plane. These are multi-tenant services that run on Kubernetes clusters. The right-hand box shows the data plane for one customer, which consists of cloud storage and the compute that does all of the data processing. For historical reasons, on GCP the data plane runs on Kubernetes, but on AWS and Azure it runs on VMs that are not managed by Kubernetes. But that's the data plane; this talk is about the control plane, all of which runs on Kubernetes.

The Databricks product gives customers the experience of a single unified system spanning the three clouds, but under the covers, the control plane is built from per-cloud-region Kubernetes clusters. Databricks is currently available in more than 60 regions across three clouds, and we have at least one Kubernetes cluster per region to host the Databricks control plane services.

Now, Databricks was a very early adopter of Kubernetes. The company adopted Kubernetes in 2015, which was before all three cloud providers had managed Kubernetes services, so we built our own tools to deploy and manage clusters directly on top of cloud provider VMs. We've recently also started using cloud provider managed Kubernetes clusters for certain services, but this talk is about our self-managed Kubernetes clusters in these 60 regions, and we're still primarily using the self-managed clusters even though we're starting to adopt cloud provider managed Kubernetes.

So this slide shows the non-HA cluster architecture that we had been using.
It's a very standard architecture that most people who run Kubernetes on the cloud are using, whether self-managed or with cloud provider Kubernetes. The control plane pods run on a single VM, and we use cloud provider block storage for the etcd storage. There's also a boot disk that's not shown here. The only part of this architecture that's maybe a little different from what you normally see is that we have two separate IP addresses for the VM, and therefore for the API server: a public IP and a private IP. The private IP for the API server is used by the kubelets and the workloads, meaning the pods running on the worker nodes, to talk to the API server. And the public IP is used by external clients like kubectl and our CI/CD system to talk to the API server.

The problem with this architecture is that if the control plane VM fails, the workloads will continue running, but the cluster essentially becomes static. The cluster autoscaler can't scale nodes up and down, because the cluster autoscaler runs on the worker nodes and needs to talk to the control plane. Same thing with horizontal pod autoscaling: the horizontal pod autoscaler runs on the worker nodes and needs to talk to the control plane, so if the control plane's down, it can't do anything. If a pod fails or a node fails, the pod can't be rescheduled to another node. And if a pod or node that's part of a ClusterIP service fails, the endpoints won't be updated for any of the services it's a member of. So this is not good.

The solution to that is to use an HA control plane, a highly available control plane. The idea is to replicate the control plane VMs across multiple cloud zones so the control plane can continue functioning when there's a VM or cloud provider zone failure. This is a picture of the architecture that we use. It's similar to the approach used by kubeadm and some other Kubernetes tools, except that we have two load balancers. I'm going to walk through the components here, and at some point we'll come to the load balancers and I'll describe what we're doing with those.

As I mentioned, the foundation of the HA control plane is three replicas of the single-node control plane: three VMs spread across three cloud zones instead of just one VM in one zone. The three API servers run as a stateless load-balanced service where any replica can handle any API request, and they're all active at the same time. Each API server replica reads and writes its local etcd replica, which brings us to etcd. There are three etcd replicas, one in each VM, and these three etcd replicas form a cluster with one leader and two followers. Reads and writes are always served by the leader. So if an API server's local etcd is a follower and that API server tries to write to etcd, then that etcd replica will forward the write to the leader, which commits it and sends the acknowledgement back to the etcd replica on the API server's node and then back to the API server. Same thing for reads: if the API server is running on a node with an etcd follower, then when the API server does a read, that follower will forward the read to the leader and then forward the response back to the API server. This is to ensure consistency for reads.

The scheduler and controller manager in this architecture are leader-elected.
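To make the lease mechanism concrete before going further, here's a minimal sketch of the idea in Python against the Kubernetes coordination API. This is a simplification for illustration, not the actual scheduler or controller manager implementation (those use client-go's leader-election library), and the identity string is an assumption:

```python
import datetime
from kubernetes import client, config

# Illustrative names; the real components use well-known leases such as
# "kube-controller-manager" in the kube-system namespace.
LEASE_NAME = "kube-controller-manager"
NAMESPACE = "kube-system"
MY_IDENTITY = "controller-manager-zone-b"

config.load_incluster_config()
coord = client.CoordinationV1Api()

def try_acquire_leadership() -> bool:
    """Return True if this replica holds (or just took over) the lease."""
    # Assumes the Lease object already exists; a 404 would need handling.
    lease = coord.read_namespaced_lease(LEASE_NAME, NAMESPACE)
    spec = lease.spec
    now = datetime.datetime.now(datetime.timezone.utc)
    expired = (
        spec.renew_time is None
        or (now - spec.renew_time).total_seconds()
        > (spec.lease_duration_seconds or 15)
    )
    if spec.holder_identity != MY_IDENTITY and not expired:
        return False  # a healthy leader exists; stay on standby
    spec.holder_identity = MY_IDENTITY
    spec.renew_time = now
    # replace() sends back the resourceVersion we read, so if another
    # standby grabbed the lease first, this write fails with a 409 conflict.
    coord.replace_namespaced_lease(LEASE_NAME, NAMESPACE, lease)
    return True
```

Every replica runs a loop like this; whoever holds the lease does the work, and a standby can take over only after the holder stops renewing.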
So all of the replicas are always running, but only one is doing work at any given time. This is a little different from how the API server works: if you remember, I mentioned that all three API server replicas are always able to handle requests, and they're essentially identical to one another in their behavior. Whereas for the scheduler and controller manager, only one is actively doing work at any given time, and there's a lease mechanism that allows one of the standby schedulers or controller managers to take over if the active leaseholder fails.

And then lastly, we replaced the public and private IP interfaces from the single control plane VM in that non-HA setup that I showed you at the beginning with public and internal multi-zone load balancers. But the philosophy is still the same: the worker node kubelets and the services running on the worker nodes talk to the API server through the internal load balancer, and external clients like the CI/CD system or an engineer using kubectl talk to the API server through the public load balancer.

The main failure modes this architecture addresses are a single VM failing or a single cloud zone failing, so let's talk about what happens in those scenarios. The load balancers will notice that the VM has become unreachable and will stop routing requests to that API server, but they'll continue to route requests to the API servers in the zones that are still up. Those are the green arrows showing the load balancers routing requests to the VMs or zones that are still up. If the active scheduler was in the failed zone, then one of the other two schedulers will become the active scheduler, and same thing with the controller manager: if the active controller manager was in the failed zone or on the failed VM, then one of the other two controller managers will become the active controller manager. And lastly, if the etcd leader was in the failed zone, then one of the other two etcd replicas will become the leader. This setup brings the theoretical availability of the control plane from two and a half nines with a single-VM control plane, which is the typical VM SLA from a cloud provider, to four nines with the HA control plane.

Unfortunately, although the system can tolerate one VM failure or one cloud zone failure, it can't tolerate two simultaneous zone or VM failures, because the etcd cluster requires a quorum of two healthy replicas. If you have two simultaneous failures, the clients can still reach the API server in the one healthy zone at the network level, because the load balancers will forward requests to that one remaining node. But that API server won't respond to read or write requests, because etcd won't process them without a quorum. Now, you could tolerate two simultaneous failures by running five control plane replicas instead of three, but cloud providers generally don't have five zones in each region, so that wouldn't help tolerate multiple zone failures, and we decided it really wasn't worth the cost to run five replicas.

The last components we haven't covered are the load balancers, so what happens if they fail? If the public load balancer fails, then external clients won't be able to connect to the API servers. So, for example, a client can't create new workloads, or roll out a new version of a workload, or anything else that external clients typically do when they talk to the API server.
But already-running workloads will continue to run, and all the dynamic behaviors that I talked about earlier, like pod rescheduling and autoscaling, will continue to work. If the internal load balancer fails, then the opposite is true. In that case, external clients can still do operations on the API server, like creating and updating workloads, but the API server will stop seeing the heartbeats from the nodes; the communication with the nodes is cut off. The node controller actually won't evict pods in this scenario: there's a special case in the node controller where, if it sees all the nodes die at the same time, it won't try to move pods around. So the workloads will continue to run even though the nodes have stopped heartbeating, but the dynamic behaviors like pod rescheduling and autoscaling won't happen.

And then lastly, if the entire cloud region fails, the HA control plane architecture can't help, because the control plane VMs all run in a single region. In theory, you could run the same architecture with the three control plane nodes spread across regions and a multi-region cloud load balancer, and then tolerate a region failure. But because our control planes are per-region, we set this up as replicated within a region rather than across regions.

So now that I've described the HA control plane architecture, Kong is going to talk about how we migrated our production clusters from the single-node architecture to the HA architecture, all while the clusters continued to serve user traffic.

Thanks, David. So, in the context of the migration: in the non-HA control plane, the cluster state is stored in etcd, and the interface is the two IPs on the single control plane VM, which serve the API used to access and mutate the cluster state. We want to change the architecture from non-HA to HA. The cluster state needs to remain the original cluster state from the non-HA setup, and the interface used to access the control plane should not change. Basically, we want to migrate the control plane out from under the workloads, as David mentioned. During the whole migration, we wanted the workloads to stay up and running; we didn't want the workloads to be affected. In our case, we have production clusters across 60 regions, so the high-level requirement is to migrate the control plane without affecting production workloads.

We defined the requirements as follows: all workloads should keep running during the migration, with no client reconfiguration. Also, the migration across the fleet should be automated, and it should support both roll forward and roll back.

With these high-level requirements, before I share how we designed our migration process, I want to step back a little to think about what matters most when we do the migration: what we want to protect with the first priority, and what we are migrating. From the architecture we discussed before, you can see that a Kubernetes control plane keeps its cluster state in etcd, so going from non-HA to HA basically means moving from one etcd node to a three-node etcd quorum. The cluster state is served by the API server, and that interface doesn't change; it just changes from the two IPs to the two load balancers. And the cluster state keeps mutating as long as the control plane is serving. For example, the controller manager can mutate the cluster state while reconciling resources.
Also, the kubelets on the workers report node status, so the cluster state for the nodes gets mutated. Meanwhile, the pods, the workloads, can also talk to the API server, like operators that change the cluster state as well. So you can see that during the migration we have the control plane, we have workers, and we have workloads. Basically, we want to protect the workloads, and that means the cluster state needs to be safe. We want to migrate it to the HA control plane, but we don't want to break the workloads.

With this in mind, we designed our migration in three phases, and each phase includes multiple steps. I want to echo again what we want to protect most: the workloads. That means that during the multi-step migration process, any single step can fail, and even if it fails, we want the workloads to be protected. If a step fails, we just roll back, or fix it, move forward, and do the migration again.

So here we designed the three phases. The first phase is to get the cluster state out of the non-HA control plane; in particular, it lives in etcd. Meanwhile, because we want to protect the workloads, we shut down all traffic to the control plane. This is similar to the downtime you'd have with the non-HA control plane anyway. In the next phase, we migrate the cluster state to the HA control plane. I'll cover the details, but the high-level idea is to replicate one etcd to three and then make it work for the HA control plane. This follows the same principle as phase one: it's multiple steps, any single step can fail, and we want to protect the workloads, so we keep traffic to the control plane shut down to make sure the cluster state doesn't get mutated and stays consistent. The workloads just keep running, degraded a little: as David mentioned, horizontal and vertical scaling can't work, but the workloads keep running. And in the last phase, when we've confirmed everything works fine, we reopen the traffic. That means traffic from the public interface, so CI/CD can do deployments, as well as the internal interface, so the kubelets can report node status and the controller manager can start reconciling resources again. After the last phase, we have an HA control plane fully up and running.

So now let's talk about each phase. For the first phase, the high-level idea is to get the cluster state out of the non-HA control plane in a safe way. In the non-HA control plane, as the diagram shows, a single VM runs etcd plus the other control plane components. Before we actually grab the state, we want to make sure the state is static at that point, so the first step is to shut down all the control plane pods. The second step of the first phase is to build a snapshot, the input we'll use for the second phase. At this moment the workloads aren't affected; they're just degraded in terms of dynamic changes. Here we use an etcd utility, etcdctl, to take a snapshot of the etcd data (a sketch follows below). etcd actually holds two types of information: one is the data, and the other is the quorum membership information. When we go from the non-HA etcd to the HA etcd, the data is what we want to keep, but the membership information has to change from one node to multiple nodes.
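As a sketch of that snapshot step, under the assumption that etcd itself is still reachable locally while the rest of the control plane is stopped, the etcdctl invocation looks roughly like this; the paths, endpoint, and certificate locations are illustrative, not our actual tooling:

```python
import subprocess

# Take a point-in-time snapshot of the single-node etcd. Paths, endpoint,
# and certificate locations here are illustrative assumptions.
subprocess.run(
    [
        "etcdctl", "snapshot", "save", "/backup/etcd-snapshot.db",
        "--endpoints=https://127.0.0.1:2379",
        "--cacert=/etc/etcd/pki/ca.crt",
        "--cert=/etc/etcd/pki/client.crt",
        "--key=/etc/etcd/pki/client.key",
    ],
    check=True,  # fail loudly so the migration can stop and roll back
)
```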
The next step is to shut down the single-node control plane VM to release its IPs. As we mentioned, the cluster state in etcd is definitely the thing we want to keep and reuse for the HA control plane, but so are the two interface IPs: we'll reuse those for the load balancers. The snapshot file we produced with etcdctl lives on the etcd disk, so the last step is to use the cloud provider's disk snapshot tooling to take a snapshot of that disk. That way, when we create the new data disks for the HA control plane, we can simply create them from the snapshot. With this, we now have everything prepared to be reused for the HA control plane. You can see this is a multi-step process, and every step can fail: it can fail due to an implementation bug, it can fail due to a transient cloud provider failure, and so on.

In the second phase, we now have the cluster state from the non-HA control plane as input, and we have the released IPs we can use. At this moment, the workloads are still up and running, just degraded. The second phase is to bootstrap the HA control plane, but I want to emphasize again: we do not start traffic yet. The whole idea is to keep the cluster state consistent for the entire migration process, because it takes from 10 to 30 minutes and every single step can fail. The first step is to build the three control plane VMs, with the data disks created from the snapshot we took of the non-HA control plane. The second step in this phase is to build the two load balancers, where the IPs for the two load balancers are the original IPs from the non-HA single-node control plane. But at this moment we don't want to serve traffic, so we create the load balancers but don't open traffic to the API servers. As the last step, we now have all the cloud resources deployed, but as I mentioned before, etcd has two sides: one is the data, and the other is the quorum membership. The data is there, but even though we used the snapshot to create the new etcd disks, the membership can't be built that way. So we use etcdctl snapshot restore to statically rebuild the new membership for the three nodes (sketched below). Up to here, we actually do have an HA control plane running, but because we didn't restart the controller manager and didn't open traffic through the two load balancers, the workers and workloads still can't talk to the control plane; the control plane itself is up and running in isolation.

In the last phase, we just reopen the traffic, because we've confirmed the HA control plane came up as expected. Across the whole migration process you can see that each phase has multiple steps, and the principle for each phase is to correctly capture the snapshot and to protect the workloads by reopening traffic only once we're sure the whole control plane is up and running. This actually makes rollback much easier. Before phase three, we can roll back just by redeploying the non-HA cloud resources with the etcd disk exactly as it was, since the cluster state hasn't been mutated at all. After phase three, say we've been on the HA control plane for a couple of days, the cluster state will have been mutated. In that case, we follow the same principle: get the cluster state from the HA control plane, take a snapshot, build a new etcd disk for the non-HA control plane from that snapshot, and rebuild a single-node etcd membership.
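Here's a rough sketch of that etcdctl restore step, run once per new control plane node with that node's name and peer URL. All names, addresses, and paths are illustrative; the key point is that --initial-cluster lists all three members, which is how the new quorum membership gets rebuilt statically from the single-node snapshot:

```python
import subprocess

# Rebuild the etcd data directory and quorum membership on one of the three
# new control plane nodes; the equivalent runs on cp-b and cp-c with their
# own names and peer URLs. All names and addresses are illustrative.
PEERS = ",".join([
    "cp-a=https://10.0.1.10:2380",
    "cp-b=https://10.0.2.10:2380",
    "cp-c=https://10.0.3.10:2380",
])

subprocess.run(
    [
        "etcdctl", "snapshot", "restore", "/backup/etcd-snapshot.db",
        "--name", "cp-a",
        "--initial-cluster", PEERS,
        "--initial-cluster-token", "ha-migration",
        "--initial-advertise-peer-urls", "https://10.0.1.10:2380",
        "--data-dir", "/var/lib/etcd",
    ],
    check=True,
)
```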
So the same principle applies to both rollback and roll forward.

Lastly, I want to share one lesson we learned from an outage we caused through inconsistent cluster state during the migration. As I explained in the migration process, protecting the cluster state is the most important thing in the whole migration, especially keeping the cluster state consistent. In phase three, we mentioned that we restart the cloud controller manager and then reopen the traffic. In that last phase of the migration, we had a bug in the code. We tried to restart the cloud controller manager and reopen the load balancer traffic at the same time, but for some reason it took much longer than expected to reopen the load balancer traffic. So there was a window where the cloud controller manager was up and running but the load balancer was still closed. What happens then? The cloud controller manager can talk to the API server and start reconciling resources, but the workers use the load balancer to talk to the API server to report their node status. So the cloud controller manager saw that some nodes weren't healthy, because they hadn't reported for a while. The service controller inside the cloud controller manager kicked in and started removing some of the nodes from the load balancers for the services. Luckily we got alerted immediately; if we had stayed in that state longer, some nodes would even have been removed from the cluster. This is one of the lessons we learned: keeping the cluster state consistent is really important for the whole migration process. When you run the migration across many regions, differing behaviors and transient cloud provider failures can cause outages like this. So from this lesson you can see it's really important to keep the cluster state consistent during the whole process.

Finally, if I had to use one sentence to summarize the migration, I would say that HA control plane migration is really about cluster state migration, and keeping the cluster state consistent is critical to making sure workloads aren't affected. Now let me hand it back to David to talk about our day-two processes after the migration to the HA control plane.

Thanks, Kong. So the last topic we're going to cover before we wrap up is how we adapted some of our day-two processes for the HA control plane. One process that we modified is how we upgrade the control plane to new Kubernetes versions. Previously, the first step of this process was to run a set of cluster-level precondition checks, like verifying that all the API objects in the API server are compatible with the new Kubernetes version and checking that the control plane is at the same Kubernetes version as the worker nodes. For the HA control plane, we extended the control plane version check to make sure all of the control plane nodes are at the same Kubernetes version as each other before we start the upgrade, because it would be bad if they started off on different versions. If the cluster-level precondition checks pass, then we start the upgrade of the first control plane node. We run some per-node precondition checks, like making sure all three control plane nodes are up.
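As a sketch of what such precondition checks can look like, assuming the control plane nodes register with the API server and carry the standard control-plane role label (a self-managed setup may label them differently):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Label selector is an assumption; a self-managed cluster may label its
# control plane nodes differently.
nodes = v1.list_node(
    label_selector="node-role.kubernetes.io/control-plane"
).items

versions = {n.status.node_info.kubelet_version for n in nodes}
all_ready = all(
    any(c.type == "Ready" and c.status == "True" for c in n.status.conditions)
    for n in nodes
)

# Cluster-level check: no mixed control plane versions before upgrading.
assert len(versions) == 1, f"mixed control plane versions: {versions}"
# Per-node check: all three replicas must be up before taking one down.
assert len(nodes) == 3 and all_ready, "not all control plane nodes are up"
```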
Making sure all three nodes are up is important because one of the benefits of the HA control plane architecture is that you can take down one of the nodes for a Kubernetes version upgrade and still continue to serve user traffic, since you only need two of the nodes up to serve users. But if you don't start off with all three control plane nodes up, then when you take one down you lose that property, because there will only be zero or one nodes up. So, to prevent users from experiencing downtime, we make sure all three are up before we do a version upgrade on any one of them. We then shut down the VM we're going to upgrade, create a new VM with the new Kubernetes control plane version, and deploy everything on it. Lastly, we run some validation tests to make sure the upgrade was successful. We already had validation tests from the single-node control plane architecture; the one we added for the HA control plane ensures that traffic is flowing through all three API servers successfully and is approximately evenly balanced. If any of the validation tests fail, we automatically roll back the upgrade. But assuming they pass, we repeat the same process for the second control plane node in the cluster, and then finally for the third. If the validation tests pass there, we move on to the next cluster and repeat the same process, upgrading one node at a time and running the precondition checks and validation tests for each node, and so on through all of the clusters in the fleet.

The second day-two process that we modified when we moved to the HA control plane is how we measure and monitor API server availability. The reason quantifying API server availability is a little more complicated with the HA control plane than with a single-node control plane is that with the HA control plane, if one API server is down, there's no user-visible impact. The first approach we considered is shown on this slide, and the idea was to use the success rate of actual API requests to compute availability. We have a Prometheus agent running in the cluster that talks directly to the API servers and scrapes the apiserver_request_total Prometheus metric from each API server's metrics endpoint. For those who aren't familiar with it, this metric records the total number of API requests processed by the API server, broken down by various categories, including the HTTP response code, so we can know which requests succeeded and which returned errors. We then compute availability as the number of requests that returned a successful HTTP response code across all API servers, divided by the total number of requests processed by all the API servers. The downside of this approach is that if two API servers are down, or the load balancer is down, this metric will give you a non-zero availability even though the end user sees zero availability, because the cluster isn't accessible when two or more of the API servers are down or the load balancer is down.
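In code, the first approach boils down to a success-rate fraction over apiserver_request_total. Here's a minimal sketch, where the counts-by-code input stands in for the result of a Prometheus query; the helper itself is hypothetical, and only the metric name and its code label are real:

```python
# Availability as the fraction of API requests, aggregated across all three
# API servers, that did not return a 5xx code. Roughly equivalent to:
#   sum(rate(apiserver_request_total{code!~"5.."}[5m]))
#     / sum(rate(apiserver_request_total[5m]))
def success_rate(counts_by_code: dict[str, float]) -> float:
    total = sum(counts_by_code.values())
    errors = sum(n for code, n in counts_by_code.items() if code.startswith("5"))
    return 1.0 if total == 0 else (total - errors) / total
```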
The second approach we considered was to have the Prometheus agent periodically probe the API servers' metrics endpoints to see whether the API servers are alive, and then define availability as the number of times the probe gets a response divided by the total number of probes. The probing is done through the load balancer, so it reflects user-visible downtime, but this approach has the downside of not reflecting internal API server errors like the first approach did. The approach we ultimately chose is a combination of the two. We use the same fraction from the first approach, the total number of successful API requests divided by the total number of API requests processed, but we also do the probing I described in the second approach, and we multiply the fraction from the first approach by zero if the probing shows that the user is not getting a response, meaning two or more API servers are down or the load balancer is down. This way we end up with an availability number that reflects API server errors and also scenarios where the API servers are down from the user's perspective.

To wrap up, I'll briefly summarize how the various aspects of our overall design allowed us to meet the requirements that Kong mentioned earlier. We had the requirement that all the workloads running on the cluster must continue to run during the migration, and the way we accomplished that was to snapshot the single-node etcd and clone the disk to create the two new replicas, so that the cluster state after the migration was exactly the same as the cluster state before the migration. We had the requirement that there should be no client reconfiguration: we didn't want to have to change any of the configuration of the kubelets, the workloads, or the external clients that talk to the API server. To accomplish that, we reused the IP addresses of the VM in the single-node control plane architecture as the load balancer IP addresses, so the internal and external VM IP addresses became the internal and external load balancer IP addresses. We had the third requirement that the per-cluster migration and rollback should be automated and safe. We accomplished that with a migration script that automatically rolls back upon failure, by not making any state mutations until the HA control plane is up and running (Kong talked a lot about that, and it's what makes safe rollback possible), and by ensuring that the kubelets have heartbeated before the controller manager is re-enabled. That way the controller manager doesn't come up after this 10-to-30-minute process that Kong mentioned, see that the nodes haven't heartbeated, and then try to remove them as load balancer backends or otherwise consider them down. And lastly, we had the requirement that the migration across the fleet should be automated and safe. To accomplish that, we used a pipeline for the migration that checks preconditions and postconditions for each cluster before moving on to the next cluster.

I'll skip this slide since we're running low on time; it just summarizes the availability metric we talked about a minute ago. And there's an additional benefit: the HA control plane doesn't just give you high availability, it also lets you continue to serve user traffic unaffected during a Kubernetes control plane version upgrade, because you can tolerate one node being down at a time.
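For reference, the combined metric can be summarized in a few lines; this is a sketch, with the probe counts standing in for the periodic probing through the load balancer described above:

```python
def combined_availability(success_rate: float,
                          probes_answered: int,
                          probes_sent: int) -> float:
    # The probes go through the load balancer, so they capture user-visible
    # outages: two or more API servers down, or the load balancer itself
    # down. If the user can't get any response, availability is zero no
    # matter how the surviving API server is doing internally.
    if probes_sent > 0 and probes_answered == 0:
        return 0.0
    return success_rate
```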
So thanks, I appreciate your attention, and we're happy to take questions in the few minutes that are left.

Thanks, David and Kong. We are on time, so if you have questions you can come up and ask. Yep.

Thanks for the talk. I have two questions. First, did you have to overprovision your clusters to account for the 10 to 30 minutes during which you were migrating and the clusters were statically stable, so that if more traffic came in you could handle the load? Second, did you change the instance types as you moved from the single node to the HA cluster?

Yeah, for your first question: before the migration, our clusters already had headroom pods, because we use the cluster autoscaler, even before the HA control plane architecture. So we didn't specially add new nodes when we migrated from non-HA to HA, and once automated, the process usually takes about 10 minutes. The outage I mentioned was one case where the controller manager kicked in during that window. Sorry, what was your second question? Oh, yes. For the control plane, the load is mostly CPU spikes, and the non-HA control plane actually sees bigger CPU spikes than the HA control plane. In the first phase we didn't change the instance type, and now that the HA control plane has been rolled out to all production clusters, we're looking at tuning the instance type down a little.

We'll take our last question. Thank you both for sharing that information. You mentioned reusing IP addresses on both the public side and the private side, but there are load balancers that put three IP addresses in place instead of one. How did you manage scenarios where the load balancer requires an IP address per zone, which means it's not one IP but three? Did you have such a situation, and if you did, how did you handle it?

Yeah, I think it's slightly different for each cloud provider, in terms of whether a load balancer has multiple IPs, one per zone, or a single IP across different zones. For GCP and Azure it's a single IP, and you can deploy a load balancer across different zones. AWS requires a different IP in each zone, but for that load balancer it's an implementation detail: it has zonal DNS and regional DNS. For us, since cross-zone availability is the actual benefit of going from non-HA to HA, we used the zonal DNS first. The extra per-zone IPs we aren't using yet, but after we've fully migrated to HA, we will use them.

Okay, thanks, and if you have other questions, feel free to come up and ask us.