Hi everyone, thanks for joining our talk today. Today we are going to share our thoughts on maintaining Kubernetes cluster reliability with an SLO-driven approach. My name is Chen, and I work at Ant Group as an infrastructure SRE. We have another SRE joining us on the call today. Welcome, Tom. Hi everyone. All right, so our primary interest is running large-scale Kubernetes clusters while maintaining high reliability. Today's talk is divided into three parts. First, I will briefly talk about our motivations and the problems we encountered while doing reliability engineering on Kubernetes clusters. Then my colleague Tom will do a deep dive on SLO design for each Kubernetes component. And then I will share some thoughts on SLO management. Specifically, I will introduce our SLO-based alerting mechanism, which helps us monitor our systems better.

So, motivations. Cluster management has become increasingly challenging for us because our clusters have grown significantly. For example, at Ant Group we have thousands of nodes in each cluster and millions of pods. And as SREs, our philosophy is that SRE headcount should grow only sublinearly, or even remain constant, while the services we support keep growing. So we have to deal with the rapid growth of Kubernetes pods with only a small team.

Fleet management is really challenging, especially when we are doing releases. As SREs, we always do releases, but sometimes we have to roll out new versions of the master components in Kubernetes, say the API server and the scheduler. And when you are doing releases, you have to consider reliability, right? If you release too frequently, then in the early days you may not actually know what is happening in the cluster, because all of the components interact with each other and it is really complicated. Another story is about node-level component upgrades, say rolling out a new kubelet version. Doing a release for a 10,000-node cluster is a totally different story from doing a release for a 10-node cluster, right? For example, let's say the kubelet needs to relist all pods. In a 10-node cluster, your API server may feel nothing because the load is really low. However, in a 10,000-node cluster, if all the kubelets list pods and send requests to the API server at the same time, trust me, your API server will shout out loudly. You may encounter out-of-memory situations or very high CPU load, which makes your cluster unusable. Well, you may argue that the recommended way to do a kubelet upgrade is to first drain all the pods from the node and then do the upgrade, right? However, this is not always possible in the real world because you may not have enough capacity.

Okay, so now let's talk about how we do reliability engineering. The general approach consists of three stages. The first stage is the production stage, which represents your service's current status.
If your production hits some issue or runs into trouble, maybe because some alerts were triggered or an incident happened, then you go to the design and improvement stage, where you sit down with your dev team and your customers to figure out what happened, and maybe you do a postmortem review to figure out what to improve. Maybe it is some code updates, or maybe it is alert management or monitoring changes. Then you go to the change management stage. In this stage, you may follow your company's change management policies. Maybe you can only do new releases on business days during business hours, right? Or maybe it is around the Thanksgiving holidays and you have to do a production freeze, where no commit or release can be pushed except for emergency patches.

So an SLO-based approach is just a specific implementation of this general reliability approach: you use the SLO to lead you through all the different stages. From the production stage to the improvement stage, you use SLO-based alerting to trigger your incident response. Your SREs focus more on SLO alerts and then use them to drive your dev team to do improvements and new designs. In the transition from the design stage to the change management stage, you may redefine your SLOs if, after a careful review, you believe your SLO does not represent or align with your users' interests. Then you go through a refinement process to redesign your SLO, maybe along with some code changes, and move into the change management stage. In this stage, you rely on SLO documentation and agreements. For example, you and your dev team agree that if the SLO is broken, no new releases will be pushed until the SLO is back to a normal level. So this is the general SLO-based approach and the philosophy behind it. Now I will hand over to my colleague to do a deeper dive on SLO design for Kubernetes itself.

Hi everyone, I'm Tom, and I'm going to talk about SLO design for Kubernetes. First, let's do a quick recap on the definitions of SLI, SLO, and SLA. An SLI is a metric that you use to measure your service health. For example, for an HTTP server, it can be the request error ratio or the average request latency. An SLO is the quantitative target that you and your team try to achieve; for example, one can target a 99% success ratio for the HTTP requests you serve. And finally, an SLA is the promise you make to your users: what the user can expect to receive when you break the SLO. For example, if somebody paid to use your HTTP service and you fail to deliver the 99% SLO target, you may compensate your customer with cash. Internally, the words SLO and SLA are sometimes used interchangeably.

So, back to our day-to-day job as infrastructure SREs, what do we care about for Kubernetes itself? As Chen mentioned before, at Ant Group we run dozens of clusters globally, and we have internal users who use Kubernetes for different purposes, for example provisioning the payment and transaction system databases, as well as running batch jobs like machine learning. When we design our SLOs, the first principle is user first. That means an SLO should represent real user behavior and experience. Each user may have a different purpose when using Kubernetes; however, we can summarize them into four categories. The first and most important SLO is about resource delivery.
Resource here often means computing resources in the shape of a Deployment or StatefulSet. We care about the delivery speed and the success rate when users ask to create deployments in our clusters. Related to this SLO, we believe we need to measure the pod lifecycle precisely. This SLO targets a single pod and its behavior during creation, upgrade, and deletion, and we use it to figure out the overall health status of our system. The third SLO we believe we should care about is whether all existing pods in the cluster are running as expected. If the cluster is having a network issue and pods get disconnected from the rest of the cluster, we may say those pods enter an unhealthy state; then we can sum up the healthy time of all pods to calculate a cluster-level pod uptime. Finally, we know users often interact with the API server directly for queries and other operations, so we care about the API server request success ratio and latency.

Let's take a closer look into one of the end-to-end SLOs. We would like to come up with a metric that captures the end-to-end pod creation behavior, and users are happy to see a promise like: 99% of pods should be created successfully and reach the Ready state within five minutes, measured over a month. As you can see, the pod creation workflow in Kubernetes is really complex. In a real production Kubernetes cluster, your request will pass through various customized webhooks and maybe a secondary scheduler before it reaches the kubelet. Sometimes you have multiple user-defined controllers that register your pod with different external dependencies, such as the CMDB, the network, or the metadata center. What a long journey for a pod to be created.

Our solution for measuring the pod creation SLO is to collect the various related audit logs and events and put them into a centralized log analysis system, with the pod name as the unique index. The log analysis system can then identify the latency of each pod status transition: for example, a pod spent three seconds in webhook admission, ten seconds in scheduling, and a certain number of seconds in image pulling. The centralized log analysis system can also identify whether the pod is in the desired state, since users may have different requirements for creation latency: some can tolerate 10 minutes, but some can only tolerate 3 minutes. Finally, we produce time series metrics to record the end-to-end SLO and let our monitoring system scrape the metrics and produce meaningful SLO-based alerts. We will talk about alerting later.
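As an illustration, a Prometheus recording rule over such end-to-end time series might look like the following sketch. The metric and rule names here are hypothetical placeholders, not the actual names exported by the pipeline described above.

```yaml
groups:
- name: pod-creation-e2e-slo
  rules:
  # Fraction of newly created pods that reached Ready within the 5-minute target,
  # measured over the past hour. The two counters are hypothetical metrics
  # exported by the centralized log analysis pipeline.
  - record: slo:pod_create_ready_in_5m:ratio_rate1h
    expr: |
      sum(rate(pod_create_e2e_ready_within_target_total{target="5m"}[1h]))
      /
      sum(rate(pod_create_e2e_total[1h]))
```

The monthly SLO target then amounts to keeping this ratio at or above 0.99 when evaluated over a 30-day window.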
After implementing several end-to-end SLOs, we faced a new issue. Although we can leverage pod events for log analysis and individual tracing, this is less useful for identifying the cause of some component issues and failures. When an SRE receives an alert about SLO failures, we often need to check different component dashboards to see whether the API server is returning bad responses, or a custom controller is not working correctly to patch a required label for the pod to run. The entire process is a top-down approach: you first know something bad happened, and then you start to drill down into each individual component to find the root cause. The good thing about end-to-end SLOs is that they are less noisy once your metrics are well designed. However, the root cause search may take a longer time and may rely on past experience.

To take a step back, let's revisit the flowchart of pod creation. Although we have ten components in this graph, we can actually model them as two types, namely synchronous components and asynchronous components. For synchronous components, like the API server, etcd, and the mutating and validating webhooks, the interaction model is more straightforward: a request comes in, is processed internally, and a response is sent out. For asynchronous components, like the scheduler, controllers, or custom operators, the interaction model is more like the classic list-and-watch pattern and is intent-driven. They need to watch resource status changes, put them into the reconcile queue, and do the reconciliation repeatedly.

To design SLOs for synchronous components, we believe we need to pay attention to the request error rate, the latency, and also the component availability. According to the Google SRE workbook, the other two golden signals are saturation and traffic. But we believe these two signals are less important here, and we can find another way to include them in the SLO design. For example, traffic can actually be measured from the client side. Almost every component talks to the API server, and each of them has metrics indicating the success rate of talking to the API server. Client-side monitoring and aggregation of the request error ratio can somehow reflect the saturation and traffic conditions. To design the asynchronous component SLOs, similarly, we care about the reconcile error ratio and latency. For component availability, since most of these components run in a leader-follower pattern, we can define SLOs for the leader election time and the no-leader time. The reconcile queueing system is another interesting aspect of asynchronous components, so we can design SLOs for the queue depth and the item queuing time to help us measure component health.

Finally, to summarize, our overall SLO design is a combination of the top-down approach and the bottom-up approach. We first separate the SLOs into two layers. The upper layer is about end-to-end SLOs, which we believe both users and SREs care about, including resource delivery, pod lifecycle, API server interaction, and existing pod health. The bottom layer focuses more on individual components, and we believe SREs and devs care more about this layer. Each individual component SLO helps us identify bottlenecks and limits in our Kubernetes clusters, and then we can take concrete actions to improve each of them. Okay, next I will hand it back to Chen to talk about SLO management. Thank you.

All right, thanks, Tom, for sharing. Now I'll talk about how we do SLO management. Specifically, I'll use an example of doing SLO-based alerting. Suppose we have an SLO, say 99.9% of the requests to the API server are successful every month. A successful request here means one that returns an HTTP status code less than 500. This is simple and straightforward, so as an SRE you can actually set up a very simple alert based on the ratio rate. As listed in the following Prometheus code snippet, you first define two recording rules, one to sum up all the SLI error counts in a minute and the other to sum up the total count in a minute. Now that we have the per-minute error count and total count, we simply do a division and compare it with 0.1%. And if the error ratio is greater than 0.1%, we alert.
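The snippet Chen refers to is not reproduced in this transcript. A minimal sketch of what such ratio-rate recording rules and alert could look like, assuming the standard apiserver_request_total counter with a code label and treating 5xx responses as SLI errors (the rule and alert names are illustrative), might be:

```yaml
groups:
- name: apiserver-slo-ratio-rate
  rules:
  # Per-minute SLI error count (requests that returned HTTP 5xx).
  - record: sli:apiserver_request_errors:count1m
    expr: sum(increase(apiserver_request_total{code=~"5.."}[1m]))
  # Per-minute SLI total count.
  - record: sli:apiserver_requests:count1m
    expr: sum(increase(apiserver_request_total[1m]))
  # Alert whenever the per-minute error ratio exceeds 0.1%.
  - alert: APIServerErrorRatioTooHigh
    expr: |
      sli:apiserver_request_errors:count1m
        / sli:apiserver_requests:count1m
      > 0.001
    labels:
      severity: page
```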
If you have set up this type of alert in the past, you may already have found that it is very problematic, because you will receive a lot of noise. As I demonstrate in this hypothetical error ratio graph, if you use the ratio-rate alert setup, then every spike above this horizontal line will trigger the alert. And it will cost you and the SRE team a bunch of operational hours to deal with the noise, and maybe to deal with a real outage as well.

Then you may think about how to reduce the noise. A very straightforward idea is to add a holding period: let's say the alert only fires if the error ratio stays high for five minutes. However, this is also problematic, because I can easily bypass this type of alert and still break my SLO. Think about this situation: in the first three minutes, your API server sees a 100% error rate. Then in the next two minutes there is no traffic, so the error rate is 0%. Then the traffic comes in again and the error rate goes back to 100%, and then for two minutes there is no traffic again. This repeated pattern gives you a 60% error rate over time, or effectively 100% of the real requests failing, yet this type of alert fails to capture this kind of outage.

So what can we do? Let's reconsider our alerting philosophy. We know our SLO target should never be 100%, otherwise it is meaningless. That means we can divide the total requests into two parts: an expected portion that we want to serve reliably, say 99.9% of the total requests, and a 0.1% error portion whose risk we think we can manage. These risks can come from dependency failures, or from unexpected network hiccups. This red portion is what we call the error budget that we can tolerate.

Then we can come up with our alerting strategy. First, we set our SLO target, say 99.9% of the requests to the API server are successful every month. Then we implement our SLI metrics to monitor the SLO; typically we have two metrics, the error metric and the total metric. Then we can figure out the maximum tolerable number of failed requests, which is our error budget. And we only alert if a predefined portion of the error budget is consumed in a short time window.

To demonstrate this philosophy, I set up two levels of alerts. The first level is the page level, which means SREs need to come up with a solution very quickly because it is really an emergency. The page alert says: if the number of SLI failures in an hour is greater than 3% of the entire error budget of the month, we alert, and we need to react to these failures. The error budget calculation is straightforward math: (1 - 0.999) times the total number of requests. We have another level, the ticket level, to deal with milder situations: the number of SLI failures in a day is greater than 5% of the entire error budget. Why do we choose 3% and 5% here? Burning 3% per hour means that in about 34 hours we would burn up all of our error budget and miss our SLO target for the month. And a 5% portion per day means that in 20 days, if we do nothing, the SLO will be broken.

To implement these alerts in practice, let's use Prometheus as the example in this code snippet. Say I want to implement the page-level alert. Other than the per-minute SLI error count and SLI total count, which we already used in the ratio-rate alert, we need two more recording rules. One sums up all the error counts in the past hour, say a sum_over_time of the per-minute error count; this is also very straightforward. The other recording rule gets the monthly total count, which makes the error budget calculation very straightforward. Then we can come up with the final alert: the left-hand side is the hourly SLI error count, and the right-hand side is 3% of the error budget.
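Again, the referenced snippet is not reproduced in the transcript. A sketch of the page-level burn alert along these lines, reusing the illustrative rule names from the earlier sketch and assuming a one-minute rule evaluation interval, might look like this:

```yaml
groups:
- name: apiserver-slo-error-budget
  rules:
  # SLI errors over the past hour, summed from the per-minute error counts
  # (assumes the per-minute rules are evaluated every minute).
  - record: sli:apiserver_request_errors:count1h
    expr: sum_over_time(sli:apiserver_request_errors:count1m[1h])
  # Total requests over the past 30 days, used to size the monthly error budget.
  - record: sli:apiserver_requests:count30d
    expr: sum_over_time(sli:apiserver_requests:count1m[30d])
  # Page if more than 3% of the monthly error budget is burned within one hour.
  # Monthly error budget = (1 - 0.999) * total monthly requests.
  - alert: APIServerErrorBudgetBurnTooFast
    expr: |
      sli:apiserver_request_errors:count1h
      > 0.03 * (1 - 0.999) * sli:apiserver_requests:count30d
    labels:
      severity: page
```

The day-level ticket alert would follow the same shape, with a one-day error window and a 5% threshold.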
Anyway, due to the time limit, I can only share this much about SLO-based alerting. To do a quick recap of today's session, we basically talked about three different things. First, how we define fine-grained SLOs for Kubernetes clusters: we do this at the component level and at the user level, where users may care more about the end-to-end SLOs and we care more about the component level. And we showed how to use SLO-based metrics to do our monitoring and alerting, because they are less noisy and more actionable. I don't have enough time to go through how we do SLO documentation and how we hold discussions with our partners and dev teams to reach agreements, but we can discuss that later. In the future, as our Kubernetes clusters continue to grow, we will need to continuously review our SLOs to keep them aligned with our users' interests, and maybe we will do monthly or quarterly SLO reports for the entire fleet. We have also already started another very interesting project on automation: we want to automate the recovery process based on our SLO-based alerts and other SLO-based metrics. Anyway, thanks for listening, and now we can take some questions.