So, welcome to our talk on using cluster golden signals to avoid alert fatigue at scale. My name is Anusha Raghunathan, and my co-presenter is Sahil Badla; we are both software engineers working at Intuit.

Let's look at the agenda for the talk. We're going to give you a background on Intuit's Kubernetes-based platform infrastructure and introduce you to some of the problems that platform engineers like us deal with while building, managing, and observing it. We'll then introduce the concept of cluster golden signals: why we did it and how we did it. We'll follow that up with a demo, which highlights the power of using cluster golden signals in lowering your MTTD and MTTR during a service incident. We'll then talk about anomaly detection in Kubernetes clusters and what we have done with it. And finally, we'll finish by talking about what the future holds for this project, and the takeaways.

For those who don't know what Intuit does: Intuit is a fintech platform company popularly known for building financial products and services such as TurboTax for tax prep and tax filing, QuickBooks for accounting and payroll, Credit Karma for credit score analysis, and Mailchimp for small and medium business marketing needs. All of these products and services run on our Kubernetes-based platform infrastructure.

Now let's look at the numbers at a glance. The fleet that runs all of these products and services comprises 275-plus Kubernetes clusters. These are medium to large clusters, hosting about 20,000 namespaces and 2,500 production services. Note that these are just production services; we have a whole bunch of pre-prod services running in E2E, QAL, and Perf environments that number just as many or even more. All of this supports about 900 developer teams comprising 6,000-plus developers. So you get an idea of the large scale that we operate at.

Having said that, let's look at a day in the life of a platform engineer who manages and observes this fleet. There are several components to a Kubernetes cluster, as we all know: every node that makes up the cluster, the Kubernetes components, the pods, and then synthetic monitoring on top. For a node, we monitor CPU, memory, disk, network, and processes. For Kubernetes components, we monitor all of the native Kubernetes components as well as the cluster add-ons we install. Then we also monitor pod state. And finally, we have synthetic monitoring, where we launch on-demand workflows to check the health of a particular subsystem.

Now, what are the metric sources for these components? For node-level metrics we go with Telegraf; for Kubernetes components we use Prometheus, like pretty much everyone else; for pod state we use kube-state-metrics; and for synthetic monitoring we use something called Active Monitor, which uses Argo Workflows to launch on-demand workflows that check that a particular subsystem within the cluster is working as expected.

All of these metrics generate alerts, and they go to our platform engineer, typically the platform engineer on call, who is frantically looking at a bunch of dashboards, doing runbook lookups, and trying to mitigate the problem to reduce the MTTD and MTTR of whatever issue is at hand.
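To give a concrete flavor of the pod-state monitoring just mentioned: kube-state-metrics exposes per-pod phase gauges, so a basic pod-state alert can be built from a query like the one below. This is an illustrative sketch only, not Intuit's actual rule; the alert name and the 15-minute threshold are made up.

```yaml
# Illustrative sketch only -- not the actual Intuit rule.
# kube-state-metrics exposes kube_pod_status_phase{phase=...}
# as a 0/1 gauge per pod.
groups:
  - name: pod-state-example
    rules:
      - alert: PodStuckPending
        expr: kube_pod_status_phase{phase="Pending"} == 1
        for: 15m            # hypothetical threshold
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} Pending for 15m"
```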
And note that they have to do this for 275-plus clusters, with 100-plus alerts per cluster. So you do the math; it's already pretty overwhelming for our platform engineer to be on call.

Now let's make things a little more interesting and throw in an incident. Who in the audience likes to be on an incident call? No one. Neither does our platform engineer, but she has no choice. Now she's pulled into an incident call, because whether it's a platform problem or a service problem, platform engineers are always involved. And she has to answer questions like: hey, there are one or more services impacted by this incident; what is the health of the clusters running these services? How healthy are they? Which alerts in these clusters should I be looking at? How can I quickly tell whether it's a service issue versus a platform issue? And how do I determine the blast radius of the impact? When I know that a service in a particular cluster is being impacted, how do I know whether services in other clusters are affected or not? What's my blast radius?

So the platform engineer is getting very overwhelmed. In fact, this is how they feel: they are not just overwhelmed, they are drowning. They are drowning in a sea of alerts. Now, I don't know if any of you have had this experience being in platform engineering or being an SRE in your organization, but some of us do. In fact, whenever I go on call, that's how I feel. And it's no fun.

What the platform engineer truly needs is to reduce the MTTD and MTTR for an issue. They want fewer false positives and fewer false negatives: alert me only if I have a problem, and always alert me if I have a problem, nothing in between. And I want a few good-quality signals from my clusters; I don't want to be alerted on everything. Just filter the signal from the noise. And I might also like some tulips.

So having said that, in the next section we'll talk about how we focused on solving all of these platform engineer problems by defining the concept of cluster golden signals. For that, I'd like to hand it over to Sahil.

Thanks, Anusha. Let's talk about cluster golden signals. The term "golden signals" is not new; it has been around for a while in the services world. To quote the Google SRE handbook, golden signals are a reduced set of metrics that give a wide view of the application that is running. The handbook suggests that if you want to start monitoring something, these are the basic signals you should start with. And we're talking about latency, traffic, errors, and saturation. The good news is these golden signals can be applied to the platform as well, and that's what we're going to talk about.

But before we dive deeper into cluster golden signals, we want to look through the perspective of service owners and figure out what exactly they care about. They care about the top-level requirements: availability of the service, the scale of the service, and correctness. And to approach this problem and build this product the right way, we started mapping these core capabilities to the components in the Kubernetes platform, as follows. For availability, we have the cluster control plane and cluster networking that power it, and the way the cluster components work with each other.
That is the core of the cluster. For scale, we have the cluster autoscaler, which scales the cluster up, and the horizontal pod autoscaler, which handles pod scaling needs. And for correctness, we have a few components: cluster authentication, the cluster networking that powers connectivity between pods, and the critical cluster add-ons that each provide a certain piece of functionality. These add-ons run in the cluster, each taking care of some functionality individually. One of them is the AWS VPC CNI, the pod networking add-on that we run, and there are others handling functionality in different areas, like monitoring and observability. At Intuit, we have about 30 of these add-ons providing critical cluster functionality, and all of that is part of the application needs covered under correctness.

From here, let's talk about the cluster signals themselves. We have these concepts from the application side; can we use them on the platform? Yes, we can, and they are based on the traditional golden signals as is. Errors: there are components in Kubernetes that can get into an error state and cause applications to fail, so we want to track them. Latency: there are components dealing with APIs, or providing the baseline for the APIs in the services to run, so we want to look at the latency of those components as well. Capacity: these platforms run on infrastructure, so we want to look at all the components that power the infrastructure itself and ask whether we're hitting limits there. And traffic: the types of workloads running on these clusters; we want to measure the patterns, see how saturated they are, and define what a good, right, or normal pattern is for a certain type of cluster.

Once we have all of these signals, it's really easy to get the overall health of a cluster by aggregating the four golden signals over all components. That gives us the cluster health, which can be in one of three states. Healthy means all the components in the cluster are healthy. Degraded: when at least one component is degraded, we change the cluster state to degraded. And critical: if at least one component is critical, we want to show that something is really wrong in the cluster; that's the third and most severe state.

Taking a closer look at the error golden signal, the first signal we talked about: right here we have a query, a Prometheus rule, that combines all of the components' error signals into one signal, the overall error golden signal of the cluster. As you can see in the picture, we have components from autoscaling, the control plane, authentication, a couple from networking, and the add-ons, all aggregated together to give out one error signal as a golden signal for that cluster. This can be mapped into the three states using a number range: for us, when this sums to zero, all the cluster's components are healthy; if the sum is greater than 10, we change the state to critical; and anywhere in between is the degraded state.
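To make that concrete, here is a minimal sketch of what such an aggregation rule could look like. All metric and rule names here are hypothetical, not Intuit's actual ones; it assumes each component rule records a numeric error-state score (0 healthy, 1 degraded, 2 critical) under a shared metric name.

```yaml
# Minimal sketch, assuming each component records a score of
# 0 (healthy), 1 (degraded), or 2 (critical) as
# component:error_state:score. Names are hypothetical.
groups:
  - name: cluster-golden-signals-errors-example
    rules:
      # Sum every component's error-state score into one error signal.
      - record: cluster:golden_signal:errors
        expr: sum(component:error_state:score)
      # Map the sum onto the three states:
      # 0 => healthy (0), 1-10 => degraded (1), > 10 => critical (2).
      - record: cluster:golden_signal:errors:state
        expr: |
          clamp_max(cluster:golden_signal:errors, 1)
          + (cluster:golden_signal:errors > bool 10)
```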
Before we go into individual components, we want to talk about error states. A component can be in an error state, and there are two good ways to measure it. The first is a success rate. Let's look at one of the components here, node-local DNS. It's really easy to measure a success-rate SLA over a preset window of time, because this component gives us a response code, and by filtering on the response code we can figure out how many calls have failed. It's a simple calculation from there, which is what this Prometheus rule does: we filter all the calls made through this component that returned a SERVFAIL error, and subtract that percentage from 100, which gives the success rate. We compute the success rate over a preset window of time — for us, a five-minute window — and map it onto a range to decide whether the component state is healthy, degraded, or critical. As defined in this Prometheus rule, if the success rate drops below 99%, we change the state to degraded; if it drops below 95%, we change the component's state to critical. So that's the first type of error metric that can feed this golden signal: success rate.

Moving to the next component, the AWS CNI, there's a different way to look at its errors: an error-count SLA. Some components don't deal with APIs per se, but they provide essential functionality, and if there's an error in that functionality, they have error counters that increment with every error. For those components, we can use a static threshold, a number; if we go beyond it over a preset window of time, the component state changes. For us, as shown in this picture, if we have fewer than two errors in a five-minute window from the AWS CNI, the component is healthy; if it grows beyond five errors in a five-minute window, the component is critical, because that exceeds our threshold.

So we've shown how we came up with the formula for overall cluster health, digging deeper into one golden signal, errors, and how different components aggregate together into that one signal for one vertical of the cluster golden signals.

By adding this, we've built a system, but we ultimately want to get rid of the alert fatigue, and there's no secret sauce: to get there, we had to build a toolkit, a mechanism for it to alert us when the systems are not healthy. How can we achieve that? Recall the platform owner's woes. We want to be alerting on the right signal, so we assign the right priorities based on the type of cluster and filter out the noise by looking at the right signals at the right time. That improves the mean time to detect, by acting on critical clusters faster. And to improve the mean time to resolution, we built an automated incident creation system based on these alerts: when critical systems are in an unhealthy or critical state, we want to page the platform engineers as soon as we detect the error, so we start fixing those issues before our application teams start seeing them. With all that in place, we're actually making the system better by adding these new signals, not getting overwhelmed by new metrics.

This is one simple Prometheus alerting rule: once you have the overall status of the cluster, when the health status of a cluster is greater than zero — that is, not healthy, so either degraded or critical — for at least two minutes, we alert with the right summary and description of the cluster, so the platform owners can take a look and jump on the issue immediately. A combined sketch of the two component rule styles plus this alert is shown below.
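This is a hedged sketch, not the production rules. The metric names are approximations: CoreDNS-based node-local DNS exposes response counters by rcode (exact names vary by version), and the VPC CNI's ipamd exposes AWS API error counters; all recording-rule names follow the hypothetical convention from the earlier sketch.

```yaml
groups:
  - name: component-error-states-example
    rules:
      # --- Success-rate SLA: node-local DNS over a 5m window ---
      - record: nodelocaldns:success_rate:5m
        expr: |
          100 * (1 - (
            sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
            /
            sum(rate(coredns_dns_responses_total[5m]))
          ))
      # >= 99% healthy (0), 95-99% degraded (1), < 95% critical (2)
      - record: nodelocaldns:error_state:score
        expr: |
          (nodelocaldns:success_rate:5m < bool 99)
          + (nodelocaldns:success_rate:5m < bool 95)
      # --- Error-count SLA: AWS VPC CNI over a 5m window ---
      # < 2 errors healthy (0), 2-5 degraded (1), > 5 critical (2).
      - record: awscni:error_state:score
        expr: |
          (sum(increase(awscni_aws_api_error_count[5m])) >= bool 2)
          + (sum(increase(awscni_aws_api_error_count[5m])) > bool 5)
  - name: cluster-status-alert-example
    rules:
      - alert: ClusterGoldenSignalUnhealthy
        # state > 0 means degraded (1) or critical (2).
        expr: cluster:golden_signal:errors:state > 0
        for: 2m
        annotations:
          summary: "Cluster error golden signal is degraded or critical"
          description: "The cluster error golden signal has been non-healthy for 2 minutes."
```

With that said, we have the bigger picture, and I want to dive into a demo of a simulated incident. It recreates a situation that happened in one of our test clusters: we have an actual anomaly in the cluster, services running on it, and we see how soon we figure that out using the new system we've built.

I'm going to pause right here to give a quick summary of what's going on. As you see, there are three screens. On the left we have the state of the cluster, looking at one component, the AWS VPC CNI. This add-on provides the pod networking in the cluster — it's the CNI layer, responsible for assigning a new IP every time a new pod comes up. That's the basic functionality of this add-on, and we've injected a failure that stops this component from talking to the AWS APIs, which means it won't be able to provide new IPs to new pods that come up. On the right side, I have another Prometheus dashboard showing the state of an application running in a namespace; the query right now shows the number of running pods. The pods are running and healthy, and we have three of them. Having injected this anomaly, the state of the application looks good, but the cluster is not, because this component is already failing. In the demo, my window is stretched to one hour: the anomaly was injected almost an hour before we start looking at the application, so the application looks perfectly fine and the application owners do not know about the problem that already exists in the platform.

Let me continue the video. We're going to look at the pods and their status — I'll do a quick describe over the pods to see that they're in the Running condition — and then increase the number of replicas to see when they start failing. You can see the overall status of the component, and the overall status of the cluster: it's all critical already, and it's been almost an hour in that state, but the application owners don't know, because their pods are all running fine and healthy and, as of now, it's not affecting them. We added this anomaly by simply changing the IAM role permissions of the AWS component. To make the application fail, we'll edit the deployment of the application and add more replicas — this is where the component will stop providing new IPs to the new pods that come up. But on an instance type there is still a warm pool of IPs that may be available, so by bumping the replicas up to five, let's see if this application goes into a bad state right away or not.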
I'll do a quick time jump here so the Prometheus chart loads. We see the application pods are still not affected by this anomaly: the number of running pods is now five, which means we still had some warm IPs available on that node. So the application owners still do not know about this anomaly on the cluster, which was injected an hour ago.

To take it up a notch, let's increase the number of replicas for this application to 15. We want the system to fail, to prove for this demo that the AWS CNI component is failing and will soon run out of IPs. Changing the deployment, let's look at the status of the new pods. We see a couple of pods in the ContainerCreating phase, which means some of the pods are pending. Let me do another time jump to let the Prometheus chart load. The chart now reflects the current state of the pods: 12 in the Running phase and three Pending. Something is wrong, and now the application teams have detected the issue. Let's describe one of the pods to see why it's failing and look at the error message: right here we see the pod's status and its error state, which complains that the AWS CNI plugin failed to assign an IP address to the container.

This is a really interesting example of an anomaly going on in the system. By building this new system, the platform owners will be paged right away about this, even before the application owners know about it. In this scenario it took about an hour for the application owners to find out, because that's when their application scaled up and experienced the problem in the cluster; depending on the type of cluster, that could be hours or even days. The point I'm making is that by building the system right and getting the right metrics out, we're making the platform owner's woes go away and getting into a state where we look at the right metrics, alert at the right time, and get to the real issue at the right time.

With this system built, we rolled out the first phase of this project, cluster golden signals, implemented on all our clusters, and started taking learnings from the rollout. I'll hand the mic back to Anusha to talk about what the future looks like.

Thank you, Sahil. That was a powerful demo. In short, what we saw was: if you relied only on your service golden signals, you would take a lot longer to detect an issue with your platform. In fact, Sahil had to introduce a bunch of scale-up events in order for the real issue to be caught on the service side, but with cluster golden signals we were able to detect it within five minutes. In this demo it was five minutes versus 60 minutes, and in a real-life situation that scale-up event can happen hours, days, or weeks after the real problem occurred in the platform. So it's much better to have golden signals on the cluster side alongside the services as well.

Now, we rolled out the cluster golden signals for the error vertical, the error pillar, and we ran into a few issues. We had a specific set of static Prometheus thresholds for all of the golden signals, and we assumed that would work fine across our Kubernetes fleet. What we noticed was that we had assumed they were all similar clusters, when in reality we had a heterogeneous mix of clusters, coming in various shapes and sizes and hosting a
variety of workloads. For example, our TurboTax clusters are very seasonal: they are very busy from January to April, when it's tax filing season in the US, and not so busy the rest of the year. Then we have our QuickBooks clusters, which see heavy traffic during US daytime, 9 a.m. to 5 p.m. — everyone's logging in, doing their payroll and accounting, and then logging off — and not so busy the rest of the day. And then we have platform clusters, such as build clusters and ML and stream processing clusters, that run a very high volume of workloads: a whole bunch of jobs come up at once, finish, and go down. They're not long-lived pods; they're very short-lived jobs at very high volume.

So what we realized was: we're applying a static, one-size-fits-all set of thresholds to a dynamic set of clusters with very different workloads and behaviors. How can we detect anomalies within a cluster that are specific to that cluster — maybe have a baseline for every single cluster and detect any abnormal behavior against it, rather than using static thresholds? That's when we started exploring anomaly detection and how to detect anomalous behavior specific to a cluster.

At this point, I want to call out a Sysdig blog; I like its definition of what an anomaly is: an outlier in a given data set, specific to an environment — a deviation from a confirmed pattern. Anomaly detection is about identifying these anomalous observations from a set of data points. If any of you have been doing this in your infrastructure, you know that anomaly detection is very useful for these use cases.

Among the tools we explored was the z-score. Z-scores are a pretty popular statistical measure for finding outliers in a normal distribution: you take historical data and determine how new data falls within that pattern, and whether it's an outlier or not. The general calculation is: the z-score of the current metric value is the value minus the average over time, divided by the standard deviation — it's a mean-based approach. You can then map the score onto ranges: anything between -1 and 1 is regular, non-anomalous behavior; outside of -2 to 2 is considered degraded, or slightly anomalous; and anything beyond the -3 to 3 range is considered really anomalous behavior.

So we tried this out, we tested it, and it turns out there are several pros and cons to this approach. Pros: people really understood what z-scores were; they're very well understood, with a lot of supporting documentation. They provided what we wanted, which is cluster-specific anomaly detection. And they're simple to implement with built-in Prometheus functions, so writing alerting rules was pretty straightforward using the z-score for anomaly detection — roughly as sketched below.
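Here's a minimal sketch of a z-score in plain PromQL, assuming a hypothetical series `mymetric` and a one-week baseline window; the recording-rule name, the window, and the alert threshold are illustrative, not our actual values.

```yaml
# z = (value - avg over window) / stddev over window
groups:
  - name: zscore-example
    rules:
      - record: mymetric:zscore:1w
        expr: |
          (mymetric - avg_over_time(mymetric[1w]))
          / stddev_over_time(mymetric[1w])
      - alert: MetricAnomalous
        # Per the mapping above: |z| > 3 is treated as truly anomalous.
        expr: abs(mymetric:zscore:1w) > 3
        for: 5m
```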
However, there were cons. The z-score is, like I mentioned, a mean-based approach, so detecting outliers on, say, a downward spike was difficult: because it averages over a period of time, it would normalize the data, decide your data is just fine, and not flag the anomaly. It also assumes your input data follows a Gaussian distribution — that bell curve — and real-time data is not a bell curve all the time, so there were cases where it did not work. And it's very sensitive to outliers: a point anomaly within a huge spike makes the detection much less accurate. To compensate, we had to keep weeks and weeks worth of data and use it to get a reasonably good z-score for anomaly detection. Overall, this approach did not work, so we looked further.

And that brings us to the future of this project: we landed on Numaproj. Numaproj is an open source project that was incubated and open-sourced at Intuit. It's a collection of open source projects for real-time analytics and AIOps on Kubernetes. There are several projects under this umbrella, but the two main ones we're going to talk about are Numaflow and Numalogic. Numaflow is for massively parallel, real-time data and stream processing — in this case, the streams we're talking about are the metric streams we're getting from Prometheus. Numalogic is a collection of ML models that can do anomaly detection. We used this framework to see whether it would work for identifying anomalies in the Prometheus metrics emanating from our Kubernetes clusters, and we experimented with a few development clusters.

So let me walk you through the AIOps pipeline and the architecture we went with. At the top you see all of the Prometheus metrics and the aggregate rules we talked about in the previous section that Sahil mentioned: node-local DNS, the CNI API error counts, all over a five-minute aggregation window, plus a whole bunch of other critical Kubernetes components emitting aggregate error-count metrics. Prometheus is installed on all of our clusters and scrapes all of these metrics every 30 seconds; that's your data stream. Then, in each cluster, we install an AIOps namespace. In this namespace we have a Numaflow controller, and this controller installs several CRDs to manage the AIOps pipeline and perform the anomaly detection for us. Everything inside the purple box is a pipeline step that helps with this anomaly detection.

The first phase is a windowing phase. It takes as input the metrics ingested from Prometheus; the reason we window here is that each metric arrives as a single data point from Prometheus, but our ML models require a sliding window of data, so the windowing phase collects and windows the data before it's sent to the ML model. Then there's a pre-process phase, where transformations are applied so the data is consumable by the ML models. Then we have inference, where predictions are calculated from the model, and the threshold phase, which is responsible for calculating the raw anomaly score, using an autoencoder ML model to do the calculation. Then, in post-process, the anomaly scores are transformed into a human-readable anomaly score from zero to 10. The mapping is: anything between zero and three is non-anomalous behavior, three to seven is slightly anomalous, and seven to 10 is really highly anomalous — that's when you actually alert and say, hey, maybe this is something you need to be looking at; it's critical, maybe convert it to an incident, and so on. This is also the phase where we push the anomaly metric back to Prometheus, so that you can do your alerting and incident management on it. And of course, from threshold we have the training phase, where the models are trained: for each metric, the pre-process, main neural network, and threshold models are stored in a model storage, which is used back by inference to calculate the anomaly score for the next set of iterations.
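Here is a rough, simplified sketch of a Numaflow Pipeline wiring up the phases just described. The CRD schema is abbreviated and the UDF container images are hypothetical placeholders; consult the numaproj/numaflow docs (and numalogic-prometheus, which ships a real version of this pipeline) for the actual spec.

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: anomaly-detection
  namespace: aiops
spec:
  vertices:
    - name: input              # metric stream fed from Prometheus
      source:
        http: {}
    - name: window             # build the sliding window the model needs
      udf:
        container:
          image: example/windowing:latest      # hypothetical image
    - name: preprocess         # transform data for the ML model
      udf:
        container:
          image: example/preprocess:latest     # hypothetical image
    - name: inference          # autoencoder prediction
      udf:
        container:
          image: example/inference:latest      # hypothetical image
    - name: threshold          # raw anomaly score from model output
      udf:
        container:
          image: example/threshold:latest      # hypothetical image
    - name: postprocess        # map to 0-10 score, push back to Prometheus
      udf:
        container:
          image: example/postprocess:latest    # hypothetical image
    - name: output
      sink:
        log: {}
  edges:
    - { from: input, to: window }
    - { from: window, to: preprocess }
    - { from: preprocess, to: inference }
    - { from: inference, to: threshold }
    - { from: threshold, to: postprocess }
    - { from: postprocess, to: output }
```

Since the post-process phase writes the 0-10 anomaly score back to Prometheus, ordinary Prometheus alerting rules (like the earlier sketches) can then fire on, say, a score of seven or above.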
And with that, here is where the future stands for our cluster golden signals project: we want to integrate AIOps into the Kubernetes cluster golden signals, be able to very smartly determine whether a particular Kubernetes cluster is healthy or not, and avoid alert fatigue. The main takeaways: implementing cluster golden signals helps reduce alert fatigue — in fact, we're seeing some promising results within Intuit already — and anomaly detection using Numaproj is promising. Go check out our GitHub page for Numaproj; the main three projects we'd recommend are Numaflow, Numalogic, and a special one called numalogic-prometheus, which has the Prometheus integration for Kubernetes clusters. And with that, thank you very much for your time. We'll take some questions if you have any.

So the question is: how did we end up with 275-plus clusters? On average, our clusters have about 15 to 40 nodes. The reason we came up with this kind of cluster-level isolation is that we had a lot of business units that needed more autonomy in how their clusters were managed. The bigger BUs, like I said — TurboTax, Mint, Credit Karma, QuickBooks — all have different products and services that they offer, so they needed isolation for each of the bigger services, and they wanted more control over them. There were also platform teams — for example, our build team has a couple dozen clusters that manage Jenkins and beyond — that needed more control over how they ran their clusters. So in general, we have a lot of services and use cases being managed on Kubernetes.

If your question is more like, how about having a smaller number of larger clusters: we have done performance testing whenever there's a new Kubernetes release and whatnot, but we've noticed that medium-sized clusters are more manageable. We used to run our own control plane, and we had to manage the HA API servers and all of the control plane components, upgrading them on a very regular basis. And because we're a fintech company, we have to rotate out our nodes every 30 days to keep up with the security and compliance requirements that we go through, via an AMI rotation. That process would take a lot longer if we had, say, 1,000 or 2,000-node clusters. It's also harder to provide isolation to the end-user business units if you have a 2,000-node cluster where you're still trying to find
some sort of isolation. So in some sense, yes, it is RBAC, but at the BU or use-case level. Good question.

Good question — so the question was: for all of the ML processing that's happening on the Prometheus metrics, how much human resources... human resources? Oh, okay, cluster resources. So yes: how much additional cluster resources do you need to do this? Good question. The main resources being consumed are, like I mentioned: there's the separate namespace; then the Numaflow controller, which is basically a pod, with three instances of the pod for HA; and then the pipeline. The pipeline autoscales depending on your stream, so if you have a whole bunch of metrics coming in at the exact same time, it will use HPA to scale up and then scale back down. The pipeline has roughly six to eight different steps, and each of those can scale up. In our test environment we did not create a separate instance group — we use instance groups to provide some isolation between nodes within a cluster — but that's one thing we're considering: having all of this processing run on its own instance group, which can scale up and down as needed. So I'd say anywhere between three and five nodes can be allocated. I don't know if you use AWS; we use AWS, and I think m5.2xlarge by default, so three to five m5.2xlarge nodes in a separate instance group for this processing would be good.

Good question — the question is: how do we respond to the alerts, how do we make actionable steps out of each of those alerts, and basically, what is the strategy to optimize the alerting? As we mentioned in the slides, we assign priorities to the alerts based on the type of cluster they come from, with production clusters as the top-level priority. So definitely, yes: for those clusters, if we're getting critical signals, we take a look at them first, and then the degraded ones. I think it's a simple strategy: look at the most critical workloads running in the clusters, assign them the right priority, and respond based on that. This has been a learning for us as well, and we're improving on it, because it is an additional level of alerting being added. But as you saw in the demo, is it benefiting us, and is it the right metric to alert on? That is really the problem we're trying to solve, gradually improving day by day, so that we're alerting at the right time, it's the right alert for the engineer to take a look at, and we're not wasting time there.

If there are no more questions, we actually have some t-shirt giveaways. If you're interested, you can come and get some — or we can ask you questions to see if you were paying attention. You guys want to do a trivia?
Sure, okay, all right. So, we talked about two Numaproj project names — can you tell us one of those? Yay, we have a winner. Sahil's demo had two different signals: one was a service signal and one was a cluster golden signal. What was the MTTD on the service golden signal? Roughly, it was about 60 minutes — about an hour. Does anyone remember what it was for the cluster golden signal? Yeah — good observation, but we were actually showing that we evaluate the cluster golden signal every so many minutes. How many minutes was it? Yay, you get another t-shirt; it's a plus one. Yeah, we talked about four Intuit products — can we name three? That's one. Yay, okay, we have a winner. Thank you all for joining us. Thank you.