Okay, hello everyone, welcome, and thank you for coming. We are excited to be here at QCon; this is our first time speaking in Paris. And lastly, thank you to the CNCF for hosting, and a shout-out to all the staff and volunteers here. Let's get started. Today we are going to talk about architecting resilience, and what we learned from managing over 7,000 Kubernetes clusters at scale. I am Kwang-Won Choi; people call me Hooni. Beside me is Kyu-Tae Baek. We are cloud engineers at Kakao. We develop and maintain a private Kubernetes-clusters-as-a-service offering which runs on OpenStack-based infrastructure. Hi, I'm Kyu-Tae. I'm the Kubernetes-as-a-service team leader at Kakao. Thank you for attending our session; we hope our experience can help you. So before we start, let me introduce our team. We provide Kubernetes as a service inside the company, which is called DKOS. DKOS stands for Data Center of Kakao Operating System. As the private Kubernetes service team of Kakao, we manage more than 7,000 clusters consisting of more than 120,000 nodes. The clusters are deployed in different zones; at Kakao, we treat a single data center as a single zone. And lastly, we handle a lot of on-call issues every week, helping developers get Kubernetes up and running properly. By the way, we only have seven people on the team, so every week is very tough. So this is the sad story. Back in October 2022, there was a data center fire, and it had a significant economic and social impact. Because of the variety of services that Koreans use every day, such as the messenger, maps, comics, online shopping, banking, and taxis, the impact was significant. Since all of these services run on Kakao's many Kubernetes clusters, the failure in a single data center affected all of them. So here's what really happened that day. When the data center suddenly lost power, all of our Kubernetes nodes in that data center shut down.
Developers needed to migrate their Kubernetes clusters to a new zone to restore services. However, they didn't have time to do so, because it happened so suddenly; and even with enough time, it takes a while to set up new clusters, deploy applications, set up load balancers, security checks, proxies, and so on. There are a lot of things to take care of. In addition, as the Kubernetes nodes that went down in the data center came back up, it was difficult to determine the impact of bringing services back, because we distribute Kubernetes nodes across different hypervisors. When the hypervisors powered up, the Kubernetes VM nodes powered up in random order, so it was hard to predict when cluster manipulation would become available. Some data plane nodes started booting before the control plane nodes, and so they started processing service requests before developers could manage their applications through the control plane. As you all know, control plane nodes must be available before you can adjust application replicas or ingress rules. This caused confusion for developers. To understand why this happened, let's first explain the structure of DKOS. DKOS creates a Kubernetes cluster in a single zone by default. Because of this, most developers didn't think about running Kubernetes clusters in multiple zones for their services. As a result, when the data center power went down, the services connected to the failed zone were affected. For developers who are unfamiliar with Kubernetes, it's hard to think about running clusters in multiple zones, and most developers are busy enough developing their own services. Of course, even with the single-zone structure, if applications are spread across multiple clusters located in different zones, with each load balancer connected to a GSLB, the setup is more resilient to failures.
In fact, many developers do build clusters with this structure. But that means more clusters for developers to manage, more clusters to monitor, and more human resources to deploy and manage the apps they are serving. This is why many service teams have their own separate cloud engineers. We will explain what went wrong in more detail, with examples, in the following slides. Kyu-Tae will take over. Thank you. Let's discuss the old structure in detail. Each cluster uses only a single zone: the control plane and data plane are in the same zone. Within a zone, the nodes are scheduled to run on different hypervisors, so even with just a single cluster we get fault tolerance against hypervisor- or rack-level failures. To survive a data center failure, where a whole zone goes down, we had to use multiple clusters, as shown in the figure. As you can see on the slide, there are three clusters, so even if two zones go down at the same time, the service will not experience any problem. Why single zone? We now have three or more zones, but nine years ago, in the early days, our private cloud started very small, with just one zone. After that, we scaled out by adding zones, and we adopted zone duplication as the way to scale out quickly. As a result, as shown on the slide, we have a load balancer and a Kubernetes cluster per zone. You might assume that this structure is prone to failures or operational mistakes, but structurally there is no issue. Even if zone A becomes unavailable due to a data center failure, the GSLB will detect zone A's failure. It can take as long as the health check interval, but the GSLB then immediately excludes zone A from the service. The services can continue to operate through the Kubernetes clusters located in zones B and C. And yet, we learned some lessons from the data center fire.
In theory and in structure, this architecture is robust. Still, unexpected issues can arise in the real world. From here, I would like to share the issues we ran into. First, there is the omission of module redundancy. Our clusters serve diverse workloads of widely varying sizes. For example, developers frequently run a proof of concept of a new feature in zone A; from here on I'll call a proof of concept a POC. When you look at the left side of the slide, you can see the red icons; those are POCs. The issue is that developers may forget to deploy them in zone B as well, because a POC module is something run quickly for immediate execution, or a rarely used workload. We have two clusters on the slide, but the likelihood of omission grows as the number of clusters increases. Tools like Argo CD can help reduce omissions, but in the end, a multi-cluster setup will lead to omissions of module redundancy, simply because there is more resource management to take care of. Second, conflicting jobs. When you run multiple clusters, you have jobs that shouldn't run in parallel on each cluster. One example is a job that performs post-processing on raw data. In the figure, this job is handled by the cluster in zone A, the green icon. Typically, the developers manage the scheduling of such a post-processing job themselves; idempotency is considered, but concurrency is not. In addition, this job is deployed in only a single cluster. Operating the job within a single cluster is fine in the general case. However, if zone A goes down, it becomes an issue. The developer detects the failure and migrates the job from the zone A cluster to the zone B cluster. The deployment to zone B may take some time, but since it's just applying YAML, the delay is not significant. The main issues occur when zone A comes back up, that is to say, when it is recovered and operating again.
After the cluster in zone A is recovered, jobs that should not run concurrently end up running in both clusters at the same time. In this situation, the jobs can be stopped with a kubectl command, but they can't be stopped automatically. By the time you can run kubectl, the Kubernetes API server has recovered, and once it has recovered, the Kubernetes Job controller will reconcile the jobs. At that moment the jobs can conflict, which can then turn into a data conflict. So we experienced how a multi-cluster structure causes both the omission of module redundancy and conflicting jobs. The third case is the omission of GSLB health check configuration. Even if you avoid the two prior cases, developers must configure GSLB health checks appropriately. On this slide we have an active GSLB, and we are monitoring service A and service B. The services consist of sub-directories. Here we have added a workload for a new service C; the green circles are the workload we added. Of course, the Ingress and Service configuration would have been deployed too, but we show only the Deployment to keep the figure simple. Now, a health check for this service C workload is supposed to be configured in the GSLB, but developers may forget to configure it. This is because service C is a sub-directory of example.com, so DNS queries will work fine even without the health check configured. And the more services developers run, the more configuration they might miss. As mentioned earlier, when zone A goes down and recovers, nodes recover in random order. A similar thing happens when the data center goes down: nodes go down randomly, because the power goes out floor by floor or section by section. So it can happen that the nodes running service A and service B are all alive, but only the nodes running service C are down. This becomes even more likely when pods are pinned to specific nodes.
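As an aside, the conflicting-jobs pattern described above can be partly mitigated at the manifest level. This is only a hedged sketch with invented names, not a DKOS recommendation from the talk:

```yaml
# Hypothetical post-processing job of the kind the talk describes.
# concurrencyPolicy: Forbid prevents overlapping runs within one
# cluster, and suspend: true lets the standby cluster hold an inert
# copy that does nothing until an operator flips it on, instead of
# starting automatically when a failed zone recovers.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: raw-data-postprocess
spec:
  schedule: "0 * * * *"
  concurrencyPolicy: Forbid   # within a single cluster only
  suspend: true               # deploy suspended in the standby zone
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: postprocess
              image: registry.example.com/postprocess:latest
```

Note that Kubernetes itself has no cross-cluster lock: `suspend` only prevents an automatic restart in one cluster, and true mutual exclusion across clusters would need an external lease or coordination layer.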
The GSLB still sends 50% of its requests to zone A, because service A and service B are healthy there. So in zone A, requests for service C will fail. These are the three major issues we discovered, and I'll turn it over to Kwang-Won to discuss the solutions. So for our new design we had a single, simple goal: make it easy for developers to deploy multi-zone clusters. DKOS also requires users to manage the infrastructure of the control plane, such as OS updates and monitoring, so consolidating into a single multi-zone cluster removes much of the inconvenience of managing and monitoring multiple clusters. In addition, since an application lives on a single cluster, it can be managed with a single Deployment or StatefulSet, enabling automatic failover even if a zone goes down. This is a recent image of our DKOS web console. It shows a single cluster deployed across multiple zones. You can see there are five control plane nodes, evenly distributed across zones A, B, and C. As mentioned before, the control plane is exposed in the web console because the user manages the control plane directly. So the final structure looks like this. In the new DKOS structure, the cluster is still accessible in the event of a failure, and it is easy to identify dead nodes. An application is deployed as a single workload, so if one zone fails, new pods will pop up on nodes in other zones. Services that should not be restarted upon failover can be controlled in advance with settings such as node cordon or drain. We therefore expect the new DKOS structure to be more resilient to a single-zone failure, and developers can now safely use Kubernetes without worrying about it. In the following sections, we will describe what we considered for multi-zone provisioning.
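As a hedged sketch of the "single workload with automatic failover" idea just described (the Deployment name and image are hypothetical, not from the talk), spreading replicas across the zones of one multi-zone cluster can be expressed with a topology spread constraint:

```yaml
# Hypothetical Deployment spread across the zones of a single
# multi-zone cluster. maxSkew: 1 keeps replica counts per zone nearly
# even; if one zone fails, the scheduler places replacement pods on
# nodes in the surviving zones automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: registry.example.com/my-app:latest
```

`ScheduleAnyway` is a soft preference, so pods can still land somewhere when an entire zone is gone; `DoNotSchedule` would instead block scheduling when the spread cannot be satisfied.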
Since a DKOS cluster uses a stacked highly-available topology, meaning the etcd members are co-located with the control plane nodes, we needed to verify that inter-zone network throughput and performance met etcd's requirements. Also, because the worker nodes use an nginx proxy to reach the control plane nodes, and the worker nodes are deployed across multiple zones, the cost of cross-zone traffic between workloads was an important consideration. And finally, we'll talk a little about the considerations for the GSLB controller we will be building later. For etcd first: since traffic between zones carries a real cost, we looked at how much cross-zone etcd traffic there is. The table shows etcd traffic between zones. This is the result from four sampled clusters with different etcd storage sizes. We measured the traffic going in and out of ports 2379 and 2380 on the etcd leader's control plane node. As the storage size increases, so does the network traffic on average. However, etcd traffic usage was not excessive for a multi-zone setup. Also, etcd uses a 100-millisecond heartbeat interval by default, and a 100-millisecond interval was not a problem between Kakao's zones. The numbers may vary from cluster to cluster depending on the volume of requests made to the API server, but we found that going multi-zone shouldn't be a big problem. In the following part, Kyu-Tae will explain more about cross-zone network traffic. With multi-zone, cross-zone network traffic will increase. This figure shows a client sending packets to a Kubernetes Service. The Service is handled by kube-proxy, and kube-proxy load-balances across the existing pods in the cluster. If the selected endpoint pods are located in a different zone, there will be cross-zone traffic. Of course, you can isolate the endpoints selected by the Service, or use topology aware hints, to avoid cross-zone traffic.
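As an illustration of the topology-aware option just mentioned (the Service name and ports are hypothetical; the exact mechanism depends on your Kubernetes version), zone-local routing can be requested on a Service with a single annotation:

```yaml
# Hypothetical Service. With the topology-mode annotation set to Auto
# (Kubernetes 1.27+; earlier releases used the
# service.kubernetes.io/topology-aware-hints annotation), kube-proxy
# prefers endpoints in the client's own zone when EndpointSlice hints
# are available, cutting cross-zone hops.
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```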
However, developers who are concerned about the extra resource management, or who are not used to this structure, may choose not to isolate their endpoints. So what is the most significant issue with cross-zone traffic? Latency. Even if the latency of cross-zone traffic is just one millisecond, the latency of the service can be delayed much more, because in a microservice architecture the call chain can be long. In the worst case, suppose the chain length is 10: the request is then delayed by 10 milliseconds. Furthermore, latency becomes even harder to estimate if other teams' APIs are called within the chain. As a concrete example, advertisement services are particularly latency sensitive; since the nature of advertising is to deliver messages to users quickly, even one millisecond matters. How can we reduce the latency? We could choose to revamp the physical infrastructure layer and the virtual network layer, but that is not the only option. As you can see on this slide, we can instead improve software performance in the container network layer, because the Kubernetes layer is what we mainly manage. This is the container network before the reorganization. At the time, we used Cilium among the CNI plugins, together with eBPF and IPVS. These figures show how a packet is forwarded from the NodePort to the pod. It comes in on eth0 and is translated by IPVS: the destination address is changed to the endpoint pod address, starting with 192 as seen on the slide. After this, it goes through the routing decision, and the result is forwarded to the Cilium host interface. From there, packets are forwarded by eBPF either to another node or to a local pod. In the case of forwarding to another node, there can be cross-zone traffic. To reduce the latency, we changed the structure as follows.
As can be seen on the right, we removed IPVS and replaced it with eBPF. The eBPF program on eth0 handles packets from the NodePort to local pods or to other nodes. The simpler structure reduces the number of CPU instructions spent on kube-proxy work; fewer instructions mean better performance, which means lower latency. As a result, we got the chance to improve the performance of both intra-zone and cross-zone traffic. Let's see how much the latency improved. This is the test environment, as shown on the slide. We used a two-node cluster of the general performance type in Kakao Cloud, with each node in a different zone, and we measured the round-trip latency of one byte over TCP using netperf. With full eBPF, we reduced the kube-proxy latency by 0.25 milliseconds, from 0.35 milliseconds down to 0.1 milliseconds. Typically, our cross-zone latency is between 0.25 milliseconds and one millisecond; the difference comes from the physical placement of the zones. In the figure, we use an example cross-zone latency of 0.5 milliseconds. In this case, we reduce the latency by 0.25 milliseconds, so it can cut the cross-zone latency by 50%. And finally, the GSLB custom resource. I mentioned that a missed GSLB health check configuration can become an issue. So we're going to provide a custom resource, as shown on the slide, and then develop a controller so that GSLB can be configured easily. In addition, we keep the health check mandatory for GSLB. This is a real example of the custom resource: it lets the load balancer be connected to the GSLB with health checks configured. We allow developers to configure GSLB through the Kubernetes API they are most familiar with. So far, we've talked about the challenges and difficulties of single-zone multi-cluster setups, our new goals, and our considerations. Thank you for listening, and let's open the question-and-answer session. Thank you.
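The manifest from the slide isn't reproduced in the transcript. Purely as an illustration, a custom resource of this kind might look something like the following; the API group, kind, and every field name here are hypothetical, not the actual DKOS schema:

```yaml
# Hypothetical GSLB custom resource sketch; none of these names come
# from the talk. The idea is that a controller watches objects like
# this and registers each zone's load balancer with the GSLB, with a
# mandatory health check so it cannot be forgotten.
apiVersion: gslb.example.com/v1alpha1
kind: GSLBEntry
metadata:
  name: service-c
spec:
  domain: example.com
  path: /service-c          # the sub-directory served by this workload
  healthCheck:              # mandatory in the talk's design
    protocol: HTTP
    path: /healthz
    intervalSeconds: 10
  loadBalancers:            # one LB per zone behind the GSLB
    - zone: zone-a
    - zone: zone-b
```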
If you have any questions, there are two microphones in the aisle. I would have a question. Okay, yeah. My question would be: in this multi-cluster setup, with quick failovers of applications, how did, or do, you manage stateful applications and their storage if applications run in multiple clusters in parallel? We guide developers to use a storage class per zone; if you deploy apps in zone A, you use the zone A storage class. Okay, but then the storage is not highly available for stateful applications, so the application stops when the zone fails? In that case, you have to use other types of volumes, like S3 or similar. Yes. Thanks. Hello. My first question was actually the same as this gentleman's, but my second question is: how do you provision all of those clusters? What technologies do you use underneath? I'm sorry, could you say that again? How do you provision those 7,000 clusters? What automation do you use? Or is it a company secret? That's also fine. It's not a secret; we have a blog post on how we provision them. For static data, we have a base image, and for the various kinds of dynamic info we use cloud-init for the VM nodes. We inject that kind of info through cloud-init when the VM boots up. Is that enough? Okay, thank you. Hi. I really liked seeing the CRD for the GSLB, but I was wondering, why are you not using annotations? Wouldn't that be easier for developers, instead of learning a complete CRD and managing the life cycle of the CRD, to just add annotations to the Ingress? We actually do support what you said, but the CRD is our option for automated usage. So there are two ways to use it. The first is, as you said, via the Ingress: you can add an annotation for which zone you want to use, and we are trying to make that work with the GSLB as well. So I should say there are two options. Okay, nice. Thank you. Hello.
I just want to ask about the GSLB. You said that in the GSLB you can define your services and the types of services, et cetera. So from the infrastructure perspective, where exactly will these GSLB CRDs be applied? Do we need a separate cluster for the GSLB? And another thing: if we really do need a separate cluster for the GSLB, what about internal services that are not exposed to the internet? How exactly will the GSLB connect to an internal service of type LoadBalancer? One second, let me hand the GSLB part over to my colleague. So, for the first question: we do have a multi-zone GSLB cluster, but we don't manage it ourselves; there is a separate GSLB team. And for the second question: yes, we know about that problem. We are planning for it; by the way, we may also get a new region and use GSLB for it. To add to that, we currently only use load balancers in a single zone, which is why we are building the GSLB integration, so that we can connect all the LBs. But for private load balancers, as you said, it's hard to use GSLB, so we are trying to make load balancers that have members across all the different zones, so that users won't have to use GSLB in the near future. Okay, thank you so much. Thank you so much. Hi. If I understood correctly, you have one control plane per cluster, and there are nodes, for example etcd nodes, in every zone. When you introduced this change, did you have to tune anything in etcd or the kube-apiserver, or any other configuration, to account for the fact that there is now more latency between the nodes? Is there any tweak you made as part of this migration? We looked at all the options etcd has, but we didn't change any of them. It just worked. Wow, fantastic. Thank you. Hello.
So, how do you totally avoid inter-zone traffic, some heavy traffic passing between zones, without creating any additional services and without any additional labels? Would you please say that again? I couldn't catch it. Yes: we have multiple zones, and we don't want inter-zone traffic; for example, the ingress controller in zone A should not reach pods in zone B. How do you avoid that case without creating a service with different labels for each zone, which would add management overhead? Topology aware hints. Right, right, right. Yes, we recommend using topology aware hints; then you can separate the endpoints by zone, so it is possible to avoid cross-zone traffic. With topology aware hints, we can also avoid the extra latency from long microservice call chains. Okay, thanks. Okay, thank you.
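As a footnote to the storage question from the Q&A, a per-zone storage class of the kind the speakers describe might look like the following sketch. The provisioner and names are assumptions for an OpenStack Cinder backed setup, not DKOS's actual configuration:

```yaml
# Hypothetical per-zone StorageClass for an OpenStack/Cinder backend.
# Pods in zone A use "zone-a-storage" so their volumes are provisioned
# in the same zone; WaitForFirstConsumer delays volume binding until
# the pod is scheduled, keeping the volume and the pod co-located.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: zone-a-storage
provisioner: cinder.csi.openstack.org
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
  - matchLabelExpressions:
      - key: topology.kubernetes.io/zone
        values:
          - zone-a
```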