Hello everyone, my name is Karan and I work as a data scientist at Red Hat. In this video, I want to share some ideas on how you can apply machine learning to Kubernetes data and extract actionable insights. Over the next 20 minutes, we'll start with a preamble on what monitoring in production might look like and why we felt the need for data science. Then I'll talk about the type of data we use and the ML methods we've explored, show the results and impact we've had so far, and finally show how you can get involved and contribute to this project.

Let's start with a little background. As many of you already know, a typical Kubernetes deployment involves several components: the kube-apiserver, etcd, the worker nodes, and, one abstraction level above that, the operators and applications running on your cluster. At that level of complexity and interdependency, whenever one component has an issue, there's a good chance it propagates to other components. So as a developer, how do you get to the root of the problem? How do you debug the cluster? That, I think, is one of the main reasons to have monitoring set up. Tools like Prometheus collect usage data from your cluster and store it, so that whenever you have a problem, you can refer to the data and debug it.

That works great when you have a couple of clusters. But as a company, you're going to have hundreds, if not thousands, of clusters in your fleet, and it's not feasible to go through every single piece of data to diagnose a situation. So how do you diagnose issues at scale? One idea we had was to create simplifying rules: if you see these two or three things happening together, then the problem is this particular known problem. For example, if authentication and networking are down and the number of nodes is less than three, that condition maps to known problem X. Rules like these can dramatically simplify your debugging process.
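To make that concrete, here's a minimal sketch of what one such rule might look like in code. The symptom names, the cluster record layout, and the threshold are all made up for illustration; the real rules live in an internal rule dictionary and are more involved than this.

```python
# A toy "simplifying rule": hypothetical symptom names and a made-up
# node-count threshold, purely to illustrate the idea.
def matches_problem_x(cluster):
    """Return True if this cluster's symptoms point to known problem X."""
    symptoms = cluster["symptoms"]  # e.g. a set of symptom IDs
    return (
        "authentication-operator-down" in symptoms
        and "network-operator-down" in symptoms
        and cluster["node_count"] < 3
    )

cluster = {
    "symptoms": {"authentication-operator-down", "network-operator-down"},
    "node_count": 2,
}
print(matches_problem_x(cluster))  # True -> flag as known problem X
```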
But of course, the catch is: how do you actually find these rules? There are thousands of clusters and tens of metrics coming from each one, so there are at least hundreds of thousands of possible combinations. This was one of the main reasons we turned to data science. Our goal with machine learning is to comb through the mountain of data we've collected and find recurring patterns that point to underlying problems. Doing this saves engineering resources, and, more importantly, it's a proactive way to identify and address issues. That's what we're trying to achieve in this project.

In any data science project, the first part is the data. In this analysis, the data we used was operational and health metrics collected from customers' clusters through services like Telemetry, Insights, and must-gather. These are services that Red Hat provides and that you can opt into, and they collect information like how many nodes a cluster has, how much memory, how many CPU cores, and so on.

Internally at Red Hat, the CCX team works together with the Data Hub team to consolidate the data coming from all of these different sources into one place, under one umbrella term: symptom. A symptom refers to any problem with the cluster. For example, say there's a Prometheus metric like the one on the slide, which tells you the conditions of the operators deployed on your cluster. The CCX pipeline parses that metric and stores it as a symptom: a single string that captures the main parts of the original data source. This particular string shows that it's a failing operator, along with the operator name, the condition, the message associated with the condition, and so on. In effect, you've distilled the important bits of the original source into one string. Another example of a symptom: the Insights operator running on your cluster realizes that there aren't enough nodes in the control plane. That information gets parsed and stored under a symptom ID, which again shows that it's a rule-type symptom and which specific rule is being broken.

Putting all of this together gives us our main data set: one table listing which cluster is showing which symptoms. This is our starting point for the project. The first thing we do is change how the data is laid out. Instead of multiple rows per cluster with a single symptom column, we pivot it so that there is one row per cluster and one column per symptom, and a zero or one in a cell indicates whether that cluster showed that symptom.

Now that the data is in a numerical, tabular format, we're ready to answer the million-dollar question: how do you actually find patterns in the data? For this task, we looked into frequent pattern mining, an area of study in data mining that is exactly what it sounds like. It's best understood with an example. Say you have a transaction table showing which grocery items customers purchased at the store. Frequent pattern mining goes through that table and extracts the items that are purchased together most frequently. It might find, for instance, that milk and bread are purchased together very often, and so are beer and diapers. Our use case is similar: instead of grocery items we have symptoms, and we want to find the symptoms that occur together most frequently. That's what we use in our analysis, and specifically, the two main methods we explored were Apriori and FP-Growth.
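Here's a rough sketch of the pivot and mining steps, using pandas and the mlxtend library, which implements both Apriori and FP-Growth. The symptom names and the tiny table are made up, and the `min_support` value is illustrative; this isn't the exact code from our notebooks.

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth  # pip install mlxtend

# Toy stand-in for our (cluster, symptom) table -- real symptom IDs come
# from customer data, so these names are invented for illustration.
rows = pd.DataFrame({
    "cluster_id": ["c1", "c1", "c2", "c2", "c3"],
    "symptom":    ["A",  "B",  "A",  "B",  "C"],
})

# Pivot: one row per cluster, one boolean column per symptom.
onehot = pd.crosstab(rows["cluster_id"], rows["symptom"]).astype(bool)

# Mine symptom sets that co-occur in at least 50% of clusters.
patterns = fpgrowth(onehot, min_support=0.5, use_colnames=True)
print(patterns)  # e.g. {A}, {B}, {A, B} with their supports
```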
Now that we have our method and our data set, are we good to go? Can we just apply it to the whole fleet? Let's pause for a moment and think about what might happen if we do. Say your organization has 20 clusters. Four of them are from customer X, who's running one kind of workload where the most frequent symptoms are A, B, and C. The other 16 are from customer Y, who's running a different kind of workload where the most frequently seen symptoms are D and E. What happens if you apply the algorithm to the entire data set as one unit? If you guessed that you won't surface all of the patterns, you guessed right. In this case, you'll only see D and E, simply because there are more clusters from customer Y than from customer X. The thing is, we don't want just the globally frequent patterns; we want the patterns that are frequent within some narrower domain.

So what do we do? We go back and split our fleet into meaningful groups before applying frequent pattern mining. The approach we now take is: first, roughly bucket the clusters into groups such that within each group the Kubernetes clusters behave somewhat similarly. In data science terms, this is clustering. Then, once we have this grouping, we apply pattern mining to each group separately. Let's walk through these steps one by one.

For the clustering step, recall our data set from a couple of slides ago: it had a lot of columns, maybe even a thousand. In general, clustering algorithms don't perform well with that much dimensionality, so the first thing we do is reduce the dimensions with a method like UMAP or PCA. Once we have a lower-dimensional representation, we apply clustering. After the dimensionality reduction step, we got a representation like the one on the slide, a snippet of what the algorithm learned, where each point is a Kubernetes cluster. The point in the middle, for example, is cluster ID 2336, which is showing these three symptoms in particular.

Already we started to see some interesting things. One thing we observed was that during dimensionality reduction, clusters that behaved similarly got mapped relatively close to each other. For example, the green dots here are the Kubernetes clusters that we knew had some kind of SDN issue, and as you can see, they all get mapped into the same region. That gave us some confidence that if we applied clustering on top of this representation, we'd get meaningful groups out of it. So we did exactly that: we applied clustering, specifically DBSCAN, and the result was these groups of Kubernetes clusters. Finally, we apply pattern mining to each group separately. From the first group you might get one particular pattern, and from the second group another, and this way you're able to surface both of the patterns that are relevant to you and your customers.
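Here's a sketch of this reduce-then-cluster pipeline using the umap-learn and scikit-learn libraries, which are one reasonable choice for these steps. The random data and the `eps`/`min_samples` values are illustrative, not what we actually tuned on fleet data.

```python
import numpy as np
import umap                        # pip install umap-learn
from sklearn.cluster import DBSCAN

# 'onehot' is the cluster-by-symptom 0/1 matrix from before; here we
# fake a small one so the sketch runs on its own.
rng = np.random.default_rng(42)
onehot = rng.integers(0, 2, size=(200, 50))

# Step 1: squash hundreds of symptom columns down to a few dimensions.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(onehot)

# Step 2: group Kubernetes clusters that landed close together
# in the embedding; a label of -1 means "noise", i.e. no group.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)

# Step 3: run frequent pattern mining on each group separately.
for group in sorted(set(labels) - {-1}):
    members = onehot[labels == group]
    print(f"group {group}: {len(members)} clusters")
    # ... fpgrowth(pd.DataFrame(members).astype(bool), ...) per group
```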
And we actually go one step beyond this and refine the results even further. One thing we observed was that some of the patterns surfaced by these algorithms occurred everywhere, no matter what; they weren't really informative, and they weren't representative of any underlying issue. Say, for example, there's a pattern RST that occurs 90% of the time within one group but only 2% of the time elsewhere, and another pattern BC that occurs 40% of the time all over the place. Which one is more likely to be a good candidate for a rule? Our thinking was that RST is the better candidate, precisely because it's specific to this group of clusters and therefore more likely to represent some underlying problem.

We took this idea and converted it into mathematical form: when we list our results and present them to the SMEs, we prioritize by the difference in percent occurrence across groups. For example, if this is what you extracted from group zero, the pattern ABC occurs 85% of the time in it but only 10% of the time outside of it, whereas the pattern ABF occurs 95% of the time in this group but also 92% of the time no matter what. So we either won't show ABF, or at least bump it down, and show ABC and DE as the more probable candidates for rule defining. We apply this procedure to every group created by the clustering algorithm.
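Here's a toy version of that prioritization idea, with hypothetical one-hot tables and symptom names; the exact scoring we use may differ, but the gist is "frequency inside the group minus frequency outside it".

```python
import pandas as pd

# Hypothetical one-hot tables: symptoms for clusters inside one group
# versus the rest of the fleet (invented data for illustration).
group = pd.DataFrame({"A": [1, 1, 1, 1], "B": [1, 1, 1, 0],
                      "F": [1, 1, 1, 1]}).astype(bool)
rest  = pd.DataFrame({"A": [1, 0, 1, 0], "B": [0, 0, 1, 0],
                      "F": [1, 1, 1, 1]}).astype(bool)

def specificity(pattern, inside, outside):
    """How much more often a symptom set occurs inside the group than
    outside it. Near 1.0 = group-specific; near 0 = occurs everywhere."""
    cols = list(pattern)
    return inside[cols].all(axis=1).mean() - outside[cols].all(axis=1).mean()

# {A, B} is frequent inside and rare outside -> good rule candidate.
# {F} is frequent everywhere -> bump it down the list.
for p in [{"A", "B"}, {"F"}]:
    print(sorted(p), round(specificity(p, group, rest), 2))
```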
At the end of this whole process, we've extracted all of the symptom patterns that are likely to be relevant to us, along with a rough grouping of the clusters affected by them. So there you have it: that's our process from start to finish. We're now able to surface the symptom patterns that occur very frequently and are likely to be characteristic of some problem.

So far we've been talking in terms of toy examples like A, B, C, and D, but now let's talk about some of the real results we've had. We applied our analysis to customer clusters, and from one of the groups, these are the suggested symptoms we were able to extract. One thing we noticed was that a lot of these symptoms were already being used by engineers to define an existing rule. A similar thing happened when we extracted symptoms from another group: many of them were already part of engineer-defined rules. So we thought: let's take the results from some of the other groups we're seeing, show them to engineers, and see whether they correspond to some problem that can be made into a rule. We went ahead and worked with a team called OT18 internally, and using our analysis, they were able to find six new rules and add them to their dictionary. Given the results we've seen so far, we think this is very valuable to SMEs and engineers, and definitely a promising avenue for exploration.

That being said, how do you contribute and get involved? We've made all the source code available as open source, but we cannot open-source the data, which is customer-sensitive. So we've created environments where you can play around with our notebooks and run the code. If you're not a Red Hatter, you won't have access to customer data; you can go to the link on the slide and select OpenShift Anomaly Detection. If you are a Red Hatter and do have access to some customer data, you can visit JupyterHub and select Anomaly Detection Internal. Once you're on the page, click on notebooks, go to the diagnosis discovery demo, open the notebook, and click "run all cells". Ideally, with four or five clicks, you should be up and running, able to run all of our code and see the results.

Once you're familiar with the code, I would highly encourage you to open issues and PRs on the repo. If you're a subject matter expert, you can tell us whether the patterns we're suggesting are useful for your use case: do you find them helpful for defining rules or not? You can also tell us about any domain knowledge you have that might help us improve our process. And if you're a data scientist, you can help us improve in many places, for example, in learning the lower-dimensional representation. For now we just did one-hot encoding plus dimensionality reduction, but there are obviously many other ways of doing that, and different ways of refining the results, for example using TF-IDF. There's a lot that can be improved upon, so please feel free to open issues, submit PRs, go crazy.

And I think that's all I wanted to say in this talk. If you have any unanswered questions or concerns, please feel free to slide into my DMs; all my information is up there. And if you want to email the entire team, the team email is also listed there. Thank you very much for listening.