Hello everyone. Welcome to one of the last sessions of this KubeCon. Thanks for sticking it out. Today's session is going to be on Smarter Golden Signals. My name is Anusha Raghunathan and my co-presenter is Venkat Gunapati. Both of us are principal engineers working on the Kubernetes platform infrastructure at Intuit. Today we'll be talking about what we are here for, which is smarter golden signals: why we went about exploring this project, what it means to have cluster golden signals, anomaly detection for a Kubernetes cluster, and the tools that we explored. We'll introduce you to Numaproj, a new open source project that has been incubated at Intuit. And finally we'll finish it off with a demo and takeaways.

For those who don't know about Intuit: Intuit is a fintech platform company that's popularly known for building financial products and services around tax prep and filing, accounting and payroll with QuickBooks, credit score analysis with Credit Karma, and small and medium business marketing tools with Mailchimp. All of these financial services run on our Kubernetes-based infrastructure. Now, if you want to take a look at the numbers at a glance: we run about 275 Kubernetes clusters. They are mid-sized clusters with about 20,000-plus namespaces, and they serve about 2,500 production services. I want to call out that these are just production services; we also have a lot of pre-prod environments for QA, perf testing and whatnot. That's about 900 developer teams and more than 6,000 developers. And some of this traffic is seasonal, so it will go up beyond these numbers.

Now let's look at what a platform engineer does to observe the Kubernetes fleet I was just mentioning. There are several different components in a Kubernetes cluster that we monitor. At a pretty high level, we have the node-level components for CPU, memory, disk, network, and processes. Then we have the Kubernetes components themselves, followed by pod state information, and finally some synthetic monitoring that we run. Synthetic monitoring is mainly for launching on-demand tests so that we can make sure that, say, a particular workload is running, a particular type of networking workload is running, and so on. So what are the metric sources for all of these? For node information, we use Telegraf. For Kubernetes, we use Prometheus. For pod state, we use kube-state-metrics. And for synthetic monitoring, we use a tool called Active Monitor, which was incubated at Intuit under the Keiko umbrella of projects.

All of these generate alerts. And when they do, the platform engineer or SRE that's on call gets an overdose of all of these alerts. They're frantically looking at a bunch of runbooks, looking at dashboards, making sure things are working OK, trying to mitigate and remediate as they go. Now note that, as I mentioned earlier, the scale is quite high: there are 275-plus clusters, and each cluster generates about 100-plus alerts. So you do the math; it's not a happy number. Our platform engineer is getting a little overwhelmed here. Now, to make things a little interesting, let's throw in an incident. Who here likes to be on an incident call? OK, no worries. You do? You work for PagerDuty?
So now the platform engineer not only has to worry about mitigation, they also have to ask: are the impacted services running on healthy or unhealthy clusters? They need to understand whether this incident is a service issue or a platform issue. And they need to understand what the blast radius of the incident is: if this service is down in one cluster, is it going to affect the rest of the clusters as well? Is it just a matter of time? That's how they're feeling right now. They're not just a little overwhelmed; they're literally drowning in alerts. I'm sure many of you can relate, because that's how I feel when I go on call. What the platform engineer really wants is very simple: ways to reduce MTTD and MTTR. They want fewer false positives and fewer false negatives. When I get an alert, I want to be alerted only when there is a real problem. I want a few good-quality signals, I want to filter signal from noise, and I might also like some tulips.

So let's talk about cluster golden signals; that's the motivation behind us exploring this concept. Now, golden signals is not a new concept. When you have a service, whether it's a microservice running on Kubernetes or a service running elsewhere, the health of that service can be determined using four metrics, four high-level signals. Google's SRE book laid this out a few years ago, as some of you might be familiar with. It talks about four golden pillars: error rate, latency, saturation, and requests or traffic. With a few good signals from these four pillars, you can determine whether a service is healthy or not. So we realized, hey, if this applies to services, it can be mapped to Kubernetes clusters as well.

As a service owner running on the Kubernetes-powered infrastructure, yes, we offer a lot of bells and whistles, but at the end of the day service owners care about three very fundamental things: availability, scale, and correctness. We were looking to see if we could map these cleanly to Kubernetes components, and we realized that, yes, we can. Availability maps to the control plane, where you can probe a few things, along with networking, which is pretty fundamental to availability. Scale maps pretty cleanly to cluster autoscaler, horizontal pod autoscaler, and VPA if you have implemented VPA. And finally, correctness maps to cluster authentication, cluster networking again in terms of latency and packet loss, and maybe some very custom cluster add-ons that contribute to correctness. In our case, we slap about two dozen cluster add-ons onto all of our Kubernetes clusters, for day-two operations, security compliance, and so on.

So let's take a look at what these golden signals translate to for us. Every golden signal that emanates from a Kubernetes cluster, the top uber golden signals, can have three distinct states. They can be healthy, which means all the components we are probing are healthy. They can be degraded, which means that if even one of the components turns degraded, we mark the whole signal as degraded.
And finally, they can be critical: when even one of these components turns critical, the entire uber signal is marked critical.

Now, let's take a look at what the error golden signal looks like for us. This is a sample Prometheus rule that we've written, which is basically an aggregation of all of the component signals we care about, for example autoscaler, networking, and so on. You can customize the aggregation rule to your needs; in our case, we aggregated and normalized it over a particular number range.

Now, let's take a closer look at the individual golden signal for a particular component, in this case node-local DNS. Node-local DNS, as we all know, is crucial for DNS lookups from your services, and if it fails, you can rest assured that your availability and your correctness are going to go down. So we have Prometheus rules that make sure node-local DNS's success SLA is being met. How do we calculate the success SLA? We look at the response total and calculate the SERVFAIL errors over a preset period of time. For example: in the last five minutes, how many SERVFAIL errors did we get over the total number of responses? That's your error rate. Then we subtract that from 100, and that's your success-rate SLA. A simple way to bucket this would be: if your success rate is over 99, you're healthy; if you're below 95, you're unhealthy; anywhere in between, you're in a degraded state.

Another way to look at error rates is using error counts. Some of the components we monitor, add-ons and whatnot, expose error counts instead of error rates. For example, we have a CNI component from AWS, and there are three distinct pieces that make up CNI health: the AWS API error count, the IPAMD error count, and the pod ENI error count. So we look at the error counts for all of these, and we say: if the error count for this particular component over the last five minutes is less than two, it reconciled eventually, so it's going to be OK. If it's over five, it's been retrying but never managed to get out of that error state, so it's probably unhealthy. Anywhere between two and five is degraded. So we came up with these rules, and that's what the CNI signal looks like. We did this for pretty much all of the critical components that we thought contributed to the error rate.
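To make that concrete, here's roughly what such a rule pair can look like. This is a sketch, not our production rule: node-local DNS is CoreDNS under the hood, so we're assuming the CoreDNS-style metric `coredns_dns_responses_total` with an `rcode` label; your exporter's metric names may differ.

```yaml
groups:
  - name: cluster-golden-signals-dns
    rules:
      # Success-rate SLA: SERVFAILs as a percentage of all responses over the
      # last 5 minutes, subtracted from 100. The `or vector(0)` guards the
      # case where no SERVFAILs were seen at all.
      - record: golden_signal:nodelocaldns:success_rate
        expr: |
          100 - 100 *
            (sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m])) or vector(0))
          /
            sum(rate(coredns_dns_responses_total[5m]))
      # Bucket the SLA into the three states:
      # 0 = healthy (>= 99), 1 = degraded (95-99), 2 = unhealthy (< 95).
      - record: golden_signal:component:state
        labels:
          component: nodelocaldns
        expr: |
          (golden_signal:nodelocaldns:success_rate < bool 99)
          + (golden_signal:nodelocaldns:success_rate < bool 95)
```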
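The error-count flavor looks similar. The AWS CNI metric name below is illustrative (check what your cni-metrics-helper version actually exposes), and the same pattern would repeat for the IPAMD and pod ENI error counts:

```yaml
groups:
  - name: cluster-golden-signals-cni
    rules:
      # Error-count variant: fewer than 2 AWS API errors in the last 5 minutes
      # means the CNI reconciled on its own (state 0); more than 5 means it
      # never recovered (state 2); in between is degraded (state 1).
      - record: golden_signal:component:state
        labels:
          component: cni_aws_api
        expr: |
          (sum(increase(awscni_aws_api_error_count[5m])) >= bool 2)
          + (sum(increase(awscni_aws_api_error_count[5m])) > bool 5)
```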
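And the uber error signal is then just the worst state across all component signals, so one critical component marks the whole signal critical. Our actual rule also normalizes over a number range, as mentioned above; `max()` is the minimal version of the idea:

```yaml
groups:
  - name: cluster-golden-signals-aggregate
    rules:
      # The cluster-level "error" golden signal: the worst state across every
      # component that records into golden_signal:component:state, so a single
      # critical component marks the whole uber signal critical.
      - record: golden_signal:error:state
        expr: max(golden_signal:component:state)
```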
Then we rolled it out, and we found a few things. We thought all of our clusters were one-size-fits-all as far as error rates were concerned, and we found they weren't. In fact, they come in all sorts of shapes and sizes and workloads. Some clusters, like the TurboTax clusters, are very seasonal: they are super busy from January through April of every year, when tax season peaks in the U.S. Some have variable workloads throughout the day. For example, QuickBooks is our accounting software, so people log in at 9 a.m. and log off at 5 p.m.; it's super busy during U.S. daytime and much quieter in the evenings. And then there are specific platform clusters that run our stream processing, build processing, and machine learning workloads, which have super high volume, but they're not long-lived pods; they're more like jobs that get scheduled and then go away. So every cluster is unique in its workloads. How are our static Prometheus thresholds going to work for these dynamically changing workloads? What we ended up with was: the concept of cluster golden signals worked out fine, but it's not going to work with static thresholds, because every cluster is unique.

So then we asked, how are we going to do this without static thresholds? That's when we started exploring anomaly detection. I'm going to quote a Sysdig blog here, because I like this definition: an anomaly refers to an outlier in a given data set that's pulled specific to an environment, a deviation from a confirmed pattern. And anomaly detection is about identifying these anomalous observations; a set of data points collectively helps detect anomalies. So we said, OK, let's explore anomaly detection, where we can look at signals per cluster and say, hey, this works for this cluster and this doesn't work for another cluster. Also note that the 275 clusters I'm talking about are a mix of prod and pre-prod environments. We all know prod clusters are not very tolerant of errors or latency issues, whereas pre-prod environments are more forgiving, so we wanted that error tolerance to be treated differently too.

So we explored the concept of z-scores for anomaly detection. Z-scores, as you probably know, are a pretty standard statistical method for detecting outliers. It's a mean-based approach: you assume a normal distribution of data, figure out the average, and compute a z-score for a particular point in time, based on the data set that already exists. You get a baseline, and then you ask: does this new data point look like my old data or not? The z-score for the current data point comes from a pretty standard formula: the current metric value minus the average over time, divided by the standard deviation, that is, z = (x − μ) / σ. And it works pretty well: an anomaly is detected if the z-score is beyond the ±3 range, it's normal if it's within ±1, and it's slightly anomalous beyond ±2. That's the standard mapping you get from z-scores; you can look up any statistics text for more details.

So we experimented with this on some of the error metrics that we wanted to propagate as cluster golden signals, and here's what happened. There were several pros. Z-scores are very well understood, and they do provide cluster-specific anomaly detection. And guess what: the math is expressible with native Prometheus primitives, so you can write alert rules for it pretty straightforwardly.
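Here's roughly what that native z-score rule can look like; the one-hour baseline window and the reuse of the DNS success-rate metric from earlier are illustrative choices:

```yaml
groups:
  - name: zscore-anomaly
    rules:
      # z = (x - mean) / stddev, computed against a trailing one-hour baseline.
      # (A real rule would also guard against a zero standard deviation.)
      - record: golden_signal:nodelocaldns:success_rate:zscore
        expr: |
          (
            golden_signal:nodelocaldns:success_rate
              - avg_over_time(golden_signal:nodelocaldns:success_rate[1h])
          )
          / stddev_over_time(golden_signal:nodelocaldns:success_rate[1h])
      # Alert only when the metric is a clear outlier against its own history.
      - alert: NodeLocalDNSSuccessRateAnomaly
        expr: abs(golden_signal:nodelocaldns:success_rate:zscore) > 3
        for: 5m
        labels:
          severity: warning
```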
Well, what are the cons? The z-score assumes a normal distribution; it assumes there's going to be a bell curve. But in real life, bell-curve data doesn't exist all the time; real-world data is very different from a bell curve. So what happens when there's a slope? When your data is trending downward and there's an anomalous spike along that slope, the z-score just averages everything and says, oh, you don't have an anomaly, because your average over the last five minutes isn't that different from the spike. So it doesn't detect the anomaly there. It's perfectly good for a normal distribution, but for anything else it's going to have issues identifying anomalies. This is the Gaussian-distribution issue I was talking about. And if you want to overcome it, you have to collect weeks and weeks worth of data so that you can average over longer baselines, store all of that data, and understand it a lot better. Honestly, this was just not working for us.

So we were looking for an anomaly detection tool for cluster golden signals on Kubernetes that was reliable and a good technical fit, and that's how we landed on Numaproj. Numaproj is an open source project that has been incubated at Intuit, and it's meant for real-time analytics and AIOps on Kubernetes. Since the Numaproj team is part of the larger platform organization that we're also part of, we worked closely with them to get cluster golden signals and anomaly detection on top of them. Numaproj has a few different projects, but the core ones are Numaflow, a stream processing engine that handles massive real-time workloads, and Numalogic, a collection of ML models and libraries that help with anomaly detection.

Now let me walk you through what our cluster install and AIOps pipeline looks like. Over here we can see the metrics we're interested in; in the examples we were just looking at, that's the node-local DNS success rate, the CNI error counts, and so on. You can bucket all of those as part of your aggregation rule and push them down to Prometheus. Our Prometheus scrape interval is 30 seconds, so it scrapes one data point every 30 seconds. Then we have an AIOps namespace that we create in every cluster, and we install the Numaflow controller in it. The controller installs a few special CRDs: one for the pipeline itself, one for the pipeline vertices, and buffers for each step of the pipeline to make sure there's flow control between them. We'll see how the pipeline looks in a moment.

This thing in purple consists of all of the pipeline stages. Prometheus ingests the metrics into our pipeline. The first stage of the pipeline is a windowing step. Each metric comes into our pipeline as one data point; however, the ML models doing the anomaly detection require a sliding window of data, a data set, not just a single point. So this step gathers those data points and does the windowing. Then we have the preprocess step, which makes the input metrics consumable by the ML models; this is where any required transformations are done. Then there's the inference stage, which does the prediction for the anomaly detection. The threshold stage does the raw anomaly-score calculation based on previously trained threshold models.
And from threshold, we go into post-processing, which normalizes the raw anomaly scores onto a range from 0 to 10. The general idea behind the 0-to-10 mapping is: from 0 to 3, your data points are behaving normally, nothing anomalous. From 3 to 7, there's slightly anomalous behavior. And from 7 to 10, it's really anomalous, and it's better to start alerting the right people. In fact, we have AIOps systems at Intuit for services that automatically create an incident if the anomaly score goes over 7. This is also the stage where we push the anomaly score back to Prometheus; we have a Prometheus pusher that does exactly that. And then we have the training stage, which trains all of the ML models. The preprocess artifacts, the threshold models, and the actual neural network trained on the data are all stored in MLflow storage and used again for inference. So that's the overall AIOps pipeline and how it works.
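To give you an idea of what wiring these stages together looks like, here's a rough sketch of a Numaflow Pipeline spec with the stages we just walked through. The UDF images are hypothetical placeholders, not our actual pipeline; see the Numaflow GitHub page for the real spec and examples.

```yaml
apiVersion: numaflow.numaproj.io/v1alpha1
kind: Pipeline
metadata:
  name: anomaly-detection
  namespace: aiops
spec:
  vertices:
    # Metrics arrive from Prometheus one data point at a time.
    - name: input
      source:
        http: {}
    # Each stage below is a user-defined function (UDF) container;
    # the images are hypothetical placeholders.
    - name: window
      udf:
        container:
          image: example.com/aiops/window-udf:v1       # accumulate a sliding window
    - name: preprocess
      udf:
        container:
          image: example.com/aiops/preprocess-udf:v1   # transform for the model
    - name: inference
      udf:
        container:
          image: example.com/aiops/inference-udf:v1    # run the trained model
    - name: threshold
      udf:
        container:
          image: example.com/aiops/threshold-udf:v1    # raw anomaly score
    - name: postprocess
      udf:
        container:
          image: example.com/aiops/postprocess-udf:v1  # normalize to 0-10
    # Normalized scores leave the pipeline here, e.g. toward Prometheus.
    - name: output
      sink:
        log: {}
  edges:
    - { from: input, to: window }
    - { from: window, to: preprocess }
    - { from: preprocess, to: inference }
    - { from: inference, to: threshold }
    - { from: threshold, to: postprocess }
    - { from: postprocess, to: output }
```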
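And once those normalized 0-to-10 scores land back in Prometheus, the alerting side collapses into one plain threshold rule. The `aiops_anomaly_score` metric and its `signal` label are hypothetical names for whatever your pusher writes:

```yaml
groups:
  - name: anomaly-alerts
    rules:
      # 0-3 is normal, 3-7 slightly anomalous, 7-10 alert-worthy.
      - alert: ClusterGoldenSignalAnomaly
        expr: max by (signal) (aiops_anomaly_score) > 7
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Golden signal {{ $labels.signal }} looks anomalous (score > 7)"
```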
Now, talking a bit more about Numaflow: you can find more details on the GitHub page. Like I mentioned, there are a bunch of CRs installed in the cluster, plus some other stateful pieces; basically there's a Redis deployment and an inter-step buffer service installed as part of the AIOps namespace that supports the different pipeline operations. It all installs in a few minutes, less than three or four. And so far the results have been pretty effective. We've been getting good per-cluster results, and we'll see this in the demo: we don't need to worry about static thresholds. We just push the metrics we care about to our Numaproj namespace, and it detects anomalies based on the cluster's baseline behavior. And there's a UI available with a simple port-forward in your cluster.

Now, one thing you might be wondering is: hey, I'm a Kubernetes engineer and I don't have much ML experience. Do I need to understand ML? When I started using this project, I had the same set of questions, so I put together a FAQ in case you're like me.

Do you need ML experience to use this? No. But it's a good idea to look at the GitHub repo to get an understanding, check out our Slack channels, and engage with the community.

Tell me a little bit about the ML model; how does it work? The model is an autoencoder. The way it does anomaly detection is that the data points are transformed through an encoder and a decoder, and when the encoder-decoder pair cannot reconstruct a data point within what is expected, that point is flagged as an anomaly.

What is the purpose of retaining models? We retain about five different ML models at any point in time, and here is why. Say you have an incident that affects a few of your clusters, and the incident isn't resolved for a few hours. Your ML models are getting trained on this bad data, so they'll assume the bad data isn't anomalous anymore; in fact, they'll think it's normal behavior. So you want to be able to throw away the model trained on bad data and use a previous model as a backup.

And what is the model training frequency? The AIOps systems at Intuit use about an eight-hour training frequency. Every eight hours, we assume there will be some data drift, so the model gets retrained on new data. This is just to make sure we don't assume all data is going to stay the same, even for a specific cluster, over time.

Is there a UI to observe MLflow and the machine-learning side of things? Yes, and this is what it looks like. It's exposed as a service, so you can port-forward to it in your cluster, and you get a nice UI with a lot of details about the ML models. This is a snapshot I took for the node-local DNS metric I was talking about. You can see there are five models retained for this particular metric, and only one of them is in production; the others are all archived. The way to re-trigger training is to change the stage of the production model to archived, which triggers a new ML model creation.

So having said that, let's take a look at a demo from Venkat. Sorry about the little snafu there while switching screens. I work on the Kubernetes platform, and I'm going to quickly show you a demo of how we're leveraging Numaflow to detect DNS anomalies. Let me share my screen and quickly walk through the setup. On this screen, on the left side there is cluster 1, and on the right side there is cluster 2. What we're looking at here is the DNS metric showing the success rate we're seeing in each cluster. This data is backed by the number of DNS requests coming into the cluster and the number of SERVFAILs we're seeing. SERVFAILs are super critical in DNS; they happen when one of the backends of the DNS infrastructure isn't performing well or is down.

What we see here is that one cluster shows a success rate between 83 and 84%, whereas the other shows 100%. Now, generating an alert for this across a large fleet of clusters gets really cumbersome and hard. Let's say this 84% is acceptable in a pre-prod environment or a test cluster where you're constantly experimenting. But in a production environment, like the cluster on the right side here, you can't even take a 5% impact; that's a really big deal. So how do we come up with a single alert that satisfies both? What you generally end up doing is creating a separate alert with manual thresholds for every single cluster. Is there a way to avoid that and instead create a single alert that gives me insight, a kind of golden signal? What I'm showing in this demo is that the data here, generated from node-local DNS, is sent to Numaflow, and Numaflow takes this data and generates an anomaly metric from it.
It's using an ML model behind the scenes: based on the data we pump in, it generates an anomaly score for us, which is what you see here as well. What we're trying to illustrate is that if the number is below 3, that's good; if we ever breach 3, that means something is wrong and somebody needs to take a look at it quickly.

Now, I'm going to introduce some failures into the right-side cluster, which is cluster 2, and see what happens to the anomaly metric. What we expect is for it to go up from where it is now, around 1, to more than 3, so that an alert triggers saying, hey, something is wrong, this cluster is no longer at 100%. Whereas the other cluster is already not at 100%, and that behavior is OK for that cluster. And you can always retrain these models: say there was a bug, the behavior wasn't actually correct, and we've since addressed it and the cluster is now at 100%; you can retrain the Numaflow models and get accurate alerts again. So let me quickly induce some failures here. I've just started the failures now; it will take about 30 seconds to reflect. After some time, you can clearly see that cluster 2 is no longer at 100%; it now has a 6% error rate. Most importantly, you can also see that the anomaly score went up to 10. Whereas in cluster 1, we always had about a 17% failure rate and we never went beyond 3. We already concluded that going beyond 3 is bad and somebody needs to take a look. And clearly for cluster 2 in this case, even though the failure is only 6%, you would generate an alert. This kind of shows you how you can leverage Numaflow to generate alerts across a large fleet of clusters. This concludes the demo. Thank you very much.

So, just to summarize what Venkat was showing: we had two clusters with different workloads. The one on the left was more prone to errors, let's say it's a pre-prod testing or perf environment, so its error rate was always hovering around 13% to 17%; the one on the right was a prod environment with no errors. He introduced the exact same number of errors for the exact same amount of time in both clusters, and we saw that the error-prone cluster wasn't showing any anomalous behavior, its anomaly score stayed between 1 and 3, while the prod cluster immediately shot up to an anomaly score of 10. That shows that in a production environment we're not tolerant of errors; this was not baseline behavior there. So you don't need to worry about static thresholds; you can implement golden signals using Numaproj.

That's the main takeaway: go check out Numaproj. It's a pretty cool project; engage with the community and start implementing cluster golden signals if your platform engineers are getting overwhelmed with alert fatigue and burdened by on-call. Thank you very much.