Hi everyone. Thank you for attending our talk, "Surviving Endless Issues Coming from 7,000+ Kubernetes Clusters." First of all, I'm very excited to be here, because it's my first time abroad in the three years since the COVID-19 pandemic. My Korean name is Seogyong Hong, but you can just call me Dennis. I'm a cloud native cell leader at Kakao, developing a private cloud.

And hello, I'm also from the south of South Korea, working as a cloud engineer at Kakao Corporation. It's my first time in North America, and also my first KubeCon as a speaker. Hello.

Despite the enormous size of our Kubernetes clusters, I believe many of you here do not know what company this is. Kakao is a mobile life platform company that serves a messenger, a portal, a bank, mobility, commerce, webtoons, and more. We cannot cover everything here, so if you're curious, visit kakao.com for more information. One thing they all have in common is that they are internet services that need servers, as you can guess.

Most of the services in Kakao are running on Kubernetes now. We started our migration from Apache Mesos to Kubernetes in 2018, and it was 99% complete this year. When we decided to use Kubernetes, there were many concerns about how to provide Kubernetes to in-house developers. The biggest consideration in adopting Kubernetes was whether to provide a single large cluster with a namespace as a tenant, or to provide separate clusters for each organization or service. We considered various factors such as cluster management, resource efficiency, security, freedom of development, and so on. In conclusion, we decided to provide lots of small clusters, because isolation and security guarantees are essential requirements for compliance.

So we developed our own private Kubernetes service to provide Kubernetes clusters to in-house developers in an automated way. It's called DKOS. It only takes three steps to get a new Kubernetes cluster: log in, set a cluster name, select a region, and done. And this year, we will create Kubernetes
clusters not only in Kakao's private cloud, but also in public clouds such as AWS.

Delivering Kubernetes clusters in an automated way, while ensuring high isolation and freedom through self-service, brings both good news and bad news. The good news is that the number of Kubernetes clusters grew successfully: more than 7,000 clusters are now in operation. The bad news is that so did our on-call issues. We moved to the cloud successfully by delivering an automated Kubernetes service. This approach has the advantage of easily creating Kubernetes clusters and meeting individual requirements, but it also has the disadvantage of being too easy to make one, causing various edge cases, and the growth of operational cost is barely manageable.

Let's talk about the bad things further. First, too many clusters are not being used. Developers made their clusters for production services, development, testing, or whatever reason. The problem is that they do not delete them even when their purpose is done. It would be okay if we had countless servers in a data center with infinite space, but sadly we do not. Some of you may think: then why not just find them and delete all of them? First, we did not expect our service to grow this far, so we did not have firm metadata for this task. Second, determining whether a cluster is in use programmatically is not an easy task. For example, a cluster using less than 10% of its CPU could be a cluster not being used, but it could also be a production CDN service that just uses little CPU. A cluster with no Deployments could be an unused cluster, but it could also be a cluster used by a bunch of custom resources. The point is that we have exceptions for every criterion, so we have to consider multiple factors.

The second problem is that we've got on-call issues that are barely manageable. This is an example of our on-call requests. It basically said: good morning,
some of our pods are restarted at 5 a.m. We've got notifications for that, but we don't know why, so can you help us? That's the request we received, and the reason can be anything, from an application bug to a known Linux kernel issue, and we have to find out the reason from scratch every time.

What we did with half a year of 1,000+ on-call records was conduct research. To be more specific, we conducted research based on grounded theory, including open, axial, and selective coding techniques. One thing I want to share from this research is the fact that not all developers know everything about Kubernetes. For example, we all know that using the container image tag `latest` can cause an unexpected version of a container to run, but this happened frequently, and they asked us why some of their pods were running unexpectedly. Or using hostPath to store critical data, something like a MySQL database, then creating a new VM to scale up and asking us why their data is gone. All these issues could be fixed just by saying: hey, you have a potential problem here, you'd better fix it now. But the problem is that we have 7K+ clusters, and it's never a good idea to do it manually. So what we need is someone, or something, other than us.

The last problem is that we forget what we know after a few years. This describes how we handle our issues. When an issue can be handled by users themselves, we just show them the manuals or documents so they can handle it themselves. If it is an administrative request, such as increasing a quota or allowing a public IP, we just allow or disallow it, and the rest is processed automatically. If the thing is a known issue, we label it an incident; if not, we mark it as a problem. When we find the reason for a problem, we write some documents and share them with the rest of our on-call members. But sharing knowledge, or documenting, is always a challenging task, so at best, after a few years, we forget it and try to find the reasons
for once-known issues again. Even though we have operational tools like chatbots and monitoring tools, they are not enough for these situations. What we needed was detection as code: to examine multiple factors for deleting clusters, to let users know what could be a problem without human intervention, and to find known issues automatically without wasting time.

So we made Detect for detecting Kubernetes known issues: an extensible problem-detecting command-line interface tool for reliable Kubernetes cluster operations and rapid problem detection, made with Go. Thanks for taking a photo of this, but I can show the link: yes, we open sourced it, and you can find it on github.com/kakao/detect. And this is how we use it.

Before starting the demo: this is our in-house service, so this demo will show a lot of Korean. But don't worry, the open-source version of Detect does not have a single letter of Korean, only the alphabet. So, this is the list of clusters that I have access to, and we will use the cluster named scott, which is my English name. Requesting a new analysis... the diagnosis is running... and you can see the result of the diagnosis. As you can see, there are fatal, error, and warning items. There's a pod with no Cilium endpoints, and a service with no available pods, which is the Detect server, you know. And we can see a pod that restarted frequently, 12,000 times, oh God. And we have outdated TLS certificates, which expired 53 days ago. That's what Detect does: it finds the known issues that happen frequently and tells the user to act, or, with further automation, we can automatically handle what can be handled automatically. Sorry.
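To give a feel for the machine-readable side of a diagnosis like the one in the demo, a report for those findings might look something along these lines in YAML. The structure and field names below are purely illustrative guesses, not Detect's actual schema:

```yaml
# Hypothetical report shape -- illustrative only, not Detect's real schema.
cluster: scott
results:
  - id: pod-restarting-frequently
    level: error
    message: "pod restarted 12000 times"
  - id: tls-certificate-outdated
    level: warn
    message: "certificate expired 53 days ago"
```

A machine-readable report like this is what makes further automation possible: a pipeline can parse the levels and, for example, open tickets or page on-call without a human reading the HTML view.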
No worries. Yeah. Detect has a pretty simple internal structure. From a variety of sources, like Kubernetes, Prometheus, SSH, or whatever it can access, components called collectors in Detect try to get things like pod manifests, Deployment manifests, Prometheus metrics, and infra status, and save them to an in-memory key-value storage inside Detect. Then a series of components called detectors try to find useful insights for us, something like: a TLS certificate will expire soon, or some pod restarted more than a million times. Lastly, a final report is generated in the form of JSON or YAML for further automation, or it can be viewed as an interactive HTML page. Both collector and detector components are extensible, so you can add or remove whatever rules are suitable for your environment.

Now I'll show you a demo of creating new rules for Detect. Let's say you are in a situation where deploying more than 10,000 pods will kill your control plane, and you want Detect to catch such a situation. We are going to show how to add a new detector called "no more than ten thousand pods". This is my base code; here I've cloned the repository of Detect, and as you can see there are directories called cases and detector, which I just mentioned. I'll make a detector rule called "too many pods", starting here: type TooManyPods, some type hinting for VS Code, implementing the Detector interface. The Detector interface consists of two methods, GetMeta and Do. In GetMeta you define the informative things: the ID, "too-many-pods"; the description, "more than 10,000 pods"; the level it gets if it fires, which is warn; and the data we need from a collector, the pod list, which is of the Kubernetes PodList type. That's all you need in GetMeta. Then, to detect whether there actually are more than 10,000 pods, in Do you type:
I want to get the pod list, so from the context, get the collector for the pod list, and you will get it; if the error is not nil, handle it and just return. Then return a report, which has passed if the length of podList.Items is less than 10,000, and return no error. Yeah, that's all the things you need to define a new detector, and I will show it. Currently my laptop is connected to a Kubernetes cluster which has two worker nodes, and I run... sorry... I run and export the report as HTML. And this is what you get as a result report from Detect. As we just defined it, there's a rule saying "too many pods", and the current state is that everything is normal, but there's no information. Oh, sorry, I forgot something. To add some detailed information, I generate the description from the pod list items when generating the report. Now there's a rule called "too many pods" that detected nothing, and there is a total of 56 pods in this cluster. If there were more than 10,000 pods, its level would be warning, asking the user to delete some unused pods immediately. Hopefully. Please. Whatever.

Detect has a pretty simple structure, as you saw: just write some meta and some rules to define what the problem is. You can do things like connecting to each node via SSH and checking kernel parameters, checking for container images from unknown registries, or checking OOM-killed pods, things like that.

And yes, that's our hope... sorry, I missed my script. So I'll just say: we hope this helps Kubernetes users use Kubernetes more effectively. Thanks for coming and listening. Yeah, that is done. Any questions?

Q: Yeah, hi, thanks. Really good talk, you guys. I was wondering, I took a look at the repo and there are maybe half a dozen detection rules or something in there. Are you planning on publishing more detection rules? Because I'm sure you've got a lot from your operational experience.

A: Yeah, in-house we currently have more than 50 rules.
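The detector walkthrough from the demo can be sketched as a small Go program. This is a minimal, self-contained approximation: the `Detector` interface, `Meta`, `Context`, and `Report` types here are guesses at the shape of Detect's API based only on what was said in the talk, not its actual code (which represents the pod list with the real Kubernetes PodList type).

```go
package main

import "fmt"

// Report is a simplified stand-in for Detect's result type (assumed shape).
type Report struct {
	Passed      bool
	Description string
}

// Context is a stand-in for Detect's collector context: detectors ask it
// for data that collectors gathered earlier (here, just a pod count).
type Context struct {
	collected map[string]interface{}
}

func (c *Context) GetCollector(name string) (interface{}, error) {
	v, ok := c.collected[name]
	if !ok {
		return nil, fmt.Errorf("no collector data for %q", name)
	}
	return v, nil
}

// Meta mirrors what the talk's GetMeta returns: ID, description, severity,
// and the collectors this detector depends on.
type Meta struct {
	ID          string
	Description string
	Level       string
	Collectors  []string
}

// Detector is the two-method interface described in the talk.
type Detector interface {
	GetMeta() Meta
	Do(ctx *Context) (Report, error)
}

// TooManyPods warns when a cluster runs 10,000 pods or more.
type TooManyPods struct{}

func (TooManyPods) GetMeta() Meta {
	return Meta{
		ID:          "too-many-pods",
		Description: "more than 10,000 pods may overload the control plane",
		Level:       "warn",
		Collectors:  []string{"pod-list"},
	}
}

func (TooManyPods) Do(ctx *Context) (Report, error) {
	v, err := ctx.GetCollector("pod-list")
	if err != nil {
		return Report{}, err
	}
	podCount := v.(int) // real Detect would hand back a Kubernetes PodList
	return Report{
		Passed:      podCount < 10000,
		Description: fmt.Sprintf("total of %d pods in this cluster", podCount),
	}, nil
}

func main() {
	// 56 pods, like the demo cluster: the rule passes.
	ctx := &Context{collected: map[string]interface{}{"pod-list": 56}}
	var d Detector = TooManyPods{}
	report, err := d.Do(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(report.Passed, report.Description)
}
```

Running this prints `true total of 56 pods in this cluster`; with a fake count above 10,000 the report would come back failed, which a report generator could then render at the warn level from GetMeta.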
Oh, repeating the question: he said there are currently six detector rules in our public GitHub repository, and he asked whether we have any plan to expand it by adding more rules. My answer is yes. In-house we have more than 50 detector rules, but we haven't had time to evaluate whether each one is fine to open source or has to be held back. That is the decision problem. And yes, we plan to add more rules this year.

Q: Hey, quick question. Since your Go programs are accessing some of those Kubernetes resources, are they deployed externally, outside your cluster? And if so, how do you go about deploying them?

A: Oh, sorry, my English is very poor, so can you speak more explicitly?

Q: I was just wondering if your Go programs are deployed outside your cluster or inside your cluster. And if they're deployed outside, where are you deploying them?

Host: I guess the question is how you execute this tool: does it need to be deployed, or is it a command line? Have you deployed this externally, or is it just a command line?

A: All we need is a kubeconfig to access the cluster from the command line. You can run it both internally and externally. We just showed the demo, which was out of cluster, and we have also deployed it as a container, so you can deploy it in your pod; there's an example in our GitHub repository.

Q: One of the challenges we have with many clusters is handling upgrades, like Kubernetes upgrades. How do you manage that with so many clusters?

A: We also made an API for that, and we already built a live upgrade solution for this. We shared it in Korean, but we have not yet published it in English, sorry. (Follow-up from the audience, inaudible.) No, we cannot handle it that way. We cannot even use Cluster API, since we have more than 7K clusters.

Q: Question here. How do you run those scripts?
Is it ops that runs the scripts, or can it be a developer who knows to go to that site and run them? Do you have a website to run those scripts?

Host: Who executes these scripts: developers, ops, or is it on a schedule?

A: We do not explicitly separate those roles. Developers deploy their manifests, and we just check whether there is a problem; before the problem, we don't know anything. And that could be your answer: I don't want to wake up at 5 a.m. That is its purpose: before asking me, just try this. That is its purpose.

Q: Hey, thank you for that great talk. I have a few questions about the seven thousand clusters. Are all those clusters created by your developers for developing and testing? How many developers do you have using those seven thousand clusters, and how many clusters do you actually use for production services?

A: The only thing I can say here is that there are more clusters than developers. That's the only thing I can say now, sorry.

Host: I think he said that there are more clusters than developers. He cannot disclose the numbers; that's what he's saying.

Q: So I have a question. To execute this, what kind of permission is required for a developer or an operations person? Because you are seriously reading inside information from the pods as well as the system, so it requires admin access. How can a developer use this, whether they have sufficient access to the cluster or not?

A: So you're asking me about the policy to access the cluster?

Q: No, no. What access is required for executing this?

A: We have about three thousand developers in the company, and roughly ten developers have access to each cluster.

Q: No, my question is, sorry, let me repeat: to execute this, does a developer require admin access, or does the developer work in their own namespace?

A: We separate everything into clusters, so a cluster user gains access to all namespaces.

Q: Are you using bare metal or virtual machines for on-premise?
A: We have our own virtual machine solution based on OpenStack.

Q: Okay. Any bare metal, or just all virtual machines?

A: Combined.

Q: Okay. What was it like working the other week during that outage, two weeks ago, with your data center issue? You guys were down for a couple of days, no? Do you have outages on your data centers or on your clusters?

A: We pretty much do not want to say... yeah, we had the fire at our data center, maybe last week. We were very busy last week, and I'm pretty sure I'm still sleepy; I want some sleep.

Thank you everyone for coming. Thanks.