So, what I work on at Red Hat is a team called Customer Experience, where, if our customers are OK with it, we collect data from their OpenShift clusters. Then we do, or try to do, some magic with that data and give the value back in the form of a better user experience for the customers. That might be services they can use that are built on this data, it might be improving the product based on the analytics, and so on. We are focusing mostly on phase two of this strategy, and that's where I spend most of my time.

In detail, what I do is look at problematic clusters, and by clusters I mean OpenShift instances. If an OpenShift instance is happy, I don't look at it; that's boring, right? We are interested in the clusters that have some problems. So basically we look at problematic clusters and try to reason about them, and the rest of this presentation is tightly coupled with that. Am I a data scientist? I'm not a data scientist by training; I do data science by accident, or maybe by necessity, because without it you can't do much in this business. So I have an interesting relationship with data science; a love-hate relationship, I would say. I love doing it, data science doesn't always love me back, but we get along.

One of the problems with OpenShift, when it comes to spotting or understanding problems, is the underlying Prometheus telemetry and alerting, and the distributed nature of the cluster, where each component basically has its own definition of what a problem is. The trouble is that when there is some central issue or root cause in the cluster, multiple components start complaining about it. What you end up with is a timeline where multiple alerts trigger at the same time. So there may be 20 different alerts that fired around 9 a.m., but the cluster did not see 20 problems.
It might be one, two, maybe three, but the problems are different from the signals that we are getting. So the question is: can we do something to reason better and get closer to the root cause? That is basically what I describe on this next slide. We try to group these alerts, the signals we have about the cluster, into related things, and ideally also be able to reason about the cause and the consequence of all these problems. That's why this talk is about correlation and causation; it's tightly coupled with this particular problem.

We'll now move away a bit from OpenShift itself, and I'll talk about this problem in broader terms. When it comes to grouping, many people start thinking about clustering; for anyone who has tried or seen some machine learning, that's the first thing they suggest: of course, clustering algorithms just do that. And the question is really, should I go down that path or not? We tried it. We tried clustering algorithms: we did our embeddings thing, we did principal component analysis, we did dimensionality reduction, we did clustering algorithms of all sorts. And it kind of works. The idea is that these are different symptoms, and if they end up close together, that somehow means they are related or happened at the same time. The problem with this approach is that it's hard to interpret. You can feel that they are probably related, but there are limits to how much you can reason about it. That's one of the reasons why, in machine learning, we talk about the problem of explainability: it does something, but you can't tell much about how it does it. So we can also try a different approach, and that's what this talk will mostly be about. You have all this fancy machine learning, deep learning, artificial intelligence stuff.
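As a rough sketch of the clustering route just described: the real pipeline used embeddings and dimensionality reduction, but the core idea, grouping alerts whose firing times overlap, can be shown much more simply. All alert names, timings, and the Jaccard threshold below are invented for illustration:

```python
from itertools import combinations

# Hypothetical alert firings: alert name -> set of 10-minute windows it fired in.
# These names and timings are made up for illustration.
firings = {
    "EtcdSlowRequests":     {51, 52, 53},
    "APIServerLatencyHigh": {51, 52, 53, 54},
    "OperatorDegraded":     {52, 53},
    "NodeDiskPressure":     {10, 11},
    "ImagePullBackOff":     {10, 11, 12},
}

def jaccard(a, b):
    """Overlap of two sets of time windows (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

# Union-find: merge alerts whose firing windows overlap enough.
parent = {name: name for name in firings}

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

for a, b in combinations(firings, 2):
    if jaccard(firings[a], firings[b]) >= 0.5:
        parent[find(a)] = find(b)

groups = {}
for name in firings:
    groups.setdefault(find(name), []).append(name)

for members in groups.values():
    print(sorted(members))
```

With this toy data the etcd-related alerts land in one group and the disk-related alerts in another, but, as the talk points out, the grouping gives no reason you can point to beyond "they fired together."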
You know, in statistics there are techniques that have been around for a long time, and I don't think we should forget about them. So we'll take a bit more of a statistical approach to finding these relations.

And I believe that's the end of my slides; now I'll switch to this other deck, and that's how you know there is some data science going on: all of a sudden there is a Jupyter notebook. So you can be sure there is some data science going on. There might be other things going on in a Jupyter notebook, but we also have data, so there is something with data, and in the next cell we also have some LaTeX equations, so that's the science, right? We have data and we have science, so we are doing data science.

Before we jump into the data, I prepared a small magic trick. I never rehearsed it, so it might not work at all. I have some toys here in this bag, and I have a special capability called a chromatic ear, which means I can hear color. I've never tried it, I just have a feeling that I can do it. So I randomly choose some item and listen. OK, I think it's red. It's red, right? I haven't looked at it, I just heard it. So how did I do that? [Audience: are there only red objects in the bag?] Oh, I forgot to show you the rest. That's all right, it's the first time I'm doing this. Of course there's a trick, and it's a really silly one. [Audience: is it that the cube is the only red one?] Yes, of course: you can't find a cube of any other color. But you would not be able to do this trick if I just handed you the bag, because you would have no information about its statistics. That's what differentiates me and lets me do this kind of thing. And it's a nice segue into conditional probability, which is all about these things. So when you see this notation with the bar in it, that's conditional probability, and we'll see a lot of it.
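Written out, the bar notation used for the rest of the talk is the standard definition of conditional probability, together with Bayes' theorem relating the two directions (textbook formulas, not from the speaker's slides):

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)},
\qquad
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Note that swapping $A$ and $B$ generally changes the value, which is exactly the point of the toy example that follows.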
This notation means the probability of one thing given that another thing is true. So in this particular case: what's the probability of an object being red, given that the object is a cube? It's one, 100%. But we can also do it the other way around: if I have a red object, what's the probability of it being a cube? I have three cubes and one sphere, so the probability is three quarters. So one misleading thing about Bayesian reasoning is that the probability in one direction is different from the probability in the other direction when you switch them around. I will keep using the toys as a visual representation of this, because it makes things easier.

But we're not really talking about objects and colors; we're talking about health data from OpenShift, where there is some property a cluster has or doesn't have, which we call a symptom. And I won't talk about OpenShift directly, I have to say, because it doesn't tell you much; instead I'll talk about something closer to humans, which is human disease. Sorry, I was not able to come up with anything more positive.
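A minimal sketch of those two conditional probabilities, with a toy bag reconstructed to match the numbers in the talk (three red cubes and one red sphere, plus some non-red spheres; the exact contents of the real bag were not shown, so this composition is an assumption):

```python
# Hypothetical bag matching the talk's numbers: every cube is red,
# and among the red objects, 3 out of 4 are cubes.
bag = (
    [("cube", "red")] * 3
    + [("sphere", "red")]
    + [("sphere", "blue")] * 2
    + [("sphere", "green")] * 2
)

def p(event, given=None):
    """P(event | given), estimated by counting objects in the bag."""
    pool = [o for o in bag if given is None or given(o)]
    return sum(event(o) for o in pool) / len(pool)

is_red = lambda o: o[1] == "red"
is_cube = lambda o: o[0] == "cube"

print(p(is_red, given=is_cube))   # every cube is red
print(p(is_cube, given=is_red))   # three of the four red objects are cubes
```

The two directions give 1.0 and 0.75 respectively, which is the asymmetry the trick relies on.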
Well, other than being tested positive. Anyway, we have a data set that I created just for this occasion. The way to interpret it is that I have, for example, 10 patients that have a positive flu test and this set of symptoms: fever, cough, headache, you know it. We also have some lucky patients that have flu but no symptoms; we just happened to test them positive, but they are completely fine. And I think I included one unfortunate person who had chicken pox with all the chicken pox symptoms but also broke his leg, double bad luck. I included him just to show that we can have multiple causes for the symptoms, and the question is whether we can untangle them. So here we can reason about the consequences of flu, such as fever.

How does this relate to OpenShift? In the OpenShift world, when you have just a stream of alerts, you don't have any metadata saying "DNS is down, so you can expect this component or that container to be broken." We don't have that metadata. The task is: can we find these relations in the data that we have?

But back to our data set with the patients: how can we find, or start reasoning about, relations between different symptoms? We'll use conditional probability here, looking at the probability of one thing given that another thing is happening. First we'll calculate the probability of having a fever given that we have flu. It's probably not that surprising that if you have flu, there's a high chance you have a fever. To compute that on our data set, we basically need the number of patients that have both fever and flu. In the toy analogy, the patients with flu are the red objects, and the patients with both fever and flu are the subset that is red and a cube. So the numerator is the count of patients with both fever and flu, the denominator is the count of all patients with flu, and it's nothing more than counting patients. We get about 60% on this synthetic data set.

So it seems they are kind of related, but we can't tell for sure yet. Why? We don't know the probability of fever in the whole population. In this case it would probably be intuitive to say that only a few percent of the population has a fever, but if the fever threshold were set at 36 degrees (I won't do Fahrenheit, I'm sorry), then basically everyone would have it, and it would be useless for our analysis. So it's still good to calculate the probability of fever by itself, and then we can compare the two values. What we compute here is called the likelihood ratio: on top we still have the probability of fever given flu, and we compare it to the probability of fever in the whole population. If the resulting number is higher than one, that's a sign of some correlation, and the higher the number, the more these things are correlated. So that's good.

But as I mentioned, in the OpenShift world we don't really have information about what's causing what; we just see the states of individual components. Is there some way to get at this direction? So far we looked at the probability of fever given flu compared to the whole population; we can also try switching it around, taking the probability of flu given fever, just for fun. Again, nothing new, it's almost the same as the first calculation; we just switch the order. We get different numbers, and the results are different as well: there is still a high probability of having flu given fever, nothing surprising about that. And we can again compare it to the probability of flu in the whole population as a ratio. So we have 0.6 over 0.23 in one direction, and 0.47 over 0.18 in the other. Now, what's the result? It's almost the same number; it's not exactly the same only because I was rounding the intermediate values while doing this, so that it doesn't produce too much output. But basically these two ratios are the same. In the two different directions, flu versus fever and fever versus flu, we even have different conditional probabilities, but when we calculate the ratio we end up with the same number. That surprised me at first, but then I worked through the Bayesian algebra, and it has to be this way. So the conclusion is that these ratios are symmetric and give us no indication of causality. When people say "correlation is not causation," one of the reasons is that many of the correlation measures we calculate are symmetric, so you can't reason any further about direction.

[Audience: Just to be sure: is the fact that the numbers are the same always going to be true when you switch them, or is it because of the way this data happens to be?] It's always true; you can derive it from Bayes' theorem alone. I don't have enough time to do it here, but it's fun, for sure.

So this won't work for us, for that reason, but we can compare different things. To repeat: what we compared before was the probability of flu given fever against the probability of flu with no other conditions. We can also formulate it differently and use a different comparison mechanism: the probability of flu given fever compared to the probability of flu given no fever. There's a difference: I don't compare against the whole population, I compare against the patients without fever. This is something I learned later is called relative risk; at first I was making up my own name for it, so if you want to search for it, it's relative risk. So we calculate the probability of flu given fever over the probability of flu given no fever. Note that there are many patients with other symptoms, neither flu nor fever, so having flu without a fever reflects how common the other diseases are too, and we ended up with 0.9 there. And we can do the same the other way around, comparing the probability of fever given flu to fever given no flu, and we end up with a different number. So this is the equivalent of our first comparison of two probabilities, just with a different baseline: flu given fever versus flu given no fever, and the same again for fever versus flu.

Again I have different ratios, and the question is whether they'll be the same; if they were, I would probably not be giving this presentation, right? So they are different. Now we see that flu compared against fever gives about 5, and fever compared against flu gives about 4, and that tells us something. This is basically where statistical thinking can be used to indicate the causality of the problem. That's pretty much the theory, and the working assumption is: the higher the score, the more likely the thing in the first position is the cause.

I also prepared a function that calculates these values as tables, so you have all these numbers in one place, and we see that flu has a higher relative risk with respect to fever than fever has with respect to flu, so we would say flu is causing the fever. I can also show this table for the likelihood ratio; you've seen this 2.58 number, it's this one, and you can really see here that this table is symmetrical, so you can't tell much from it about direction.

So how does this actually work? It's easy to have a synthetic data set where it just works. If you want to understand how it works, I have here a really simplified data set: 10 patients with flu and fever, 10 patients with flu and nothing else, 10 patients with fever and nothing else, and no other patients. Would it work in this case? It doesn't, because it can't. Why not? We even get a lower relative risk between flu and fever, and between fever and flu, just because of how the data are constructed: the symptoms are called flu and fever, but the data aren't realistic, so it won't give us the correct answers.

So one thing this approach really needs is quite good data. First of all, you need some negative examples: patients that have neither flu nor fever. Once I include some patients without these symptoms, we start to see some correlation between flu and fever, because otherwise the method can't reason about the base probabilities; it knows nothing about how common fever or flu is. The other thing we also need is some imbalance. With 10 patients with flu and fever and 10 patients with flu and nothing at all, it still shows similar ratios in both directions, so we can't reason much. But we can increase the number of patients with flu and fever, and we can also include other causes of fever besides flu, say 50 of them, and now it starts showing up: now we see that the relative risk of flu with respect to fever is much higher than fever with respect to flu. You need other causes of fever in the data set to be able to reason about direction. It's not that different from how all these fancy large language models work: they need a lot of data, a lot of examples, to be able to deduce the probabilities. This is much more simplified, but you still need data. On the other hand, you can also use this as a test of whether the data you have are really useful or not: garbage in, garbage out, and you can't do much about it.

So that was some experimentation with the data. Now, can we do it for the whole data set? I was comparing just flu and fever, but I also had the chicken pox and the broken leg, and I want to see the holistic picture. So I have prepared this table where you can see all the relations, right?
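A sketch of those calculations on a small invented patient table. The patient counts below are made up (the talk's actual data set wasn't shared), but the two formulas are exactly the ones described: the symmetric likelihood ratio and the direction-dependent relative risk, including the "other causes of fever" and "negative examples" that the method needs:

```python
# Invented records: each patient is represented as the set of things true about them.
patients = (
    [{"flu", "fever", "cough"}] * 30         # flu with symptoms
    + [{"flu"}] * 5                           # flu, asymptomatic
    + [{"chickenpox", "fever", "rash"}] * 8   # another cause of fever
    + [{"broken_leg", "pain"}] * 7
    + [set()] * 50                            # negative examples: healthy patients
)

def p(a, given=None):
    """P(a | given) by counting patients; given=None means the whole population."""
    pool = [s for s in patients if given is None or given in s]
    return sum(a in s for s in pool) / len(pool)

def likelihood_ratio(a, b):
    """P(a|b) / P(a): symmetric, so it indicates correlation only."""
    return p(a, given=b) / p(a)

def relative_risk(a, b):
    """P(a|b) / P(a|not b): asymmetric, used here as a hint of direction."""
    pool_not_b = [s for s in patients if b not in s]
    p_a_not_b = sum(a in s for s in pool_not_b) / len(pool_not_b)
    return p(a, given=b) / p_a_not_b

# The likelihood ratio comes out the same in both directions...
print(round(likelihood_ratio("fever", "flu"), 2),
      round(likelihood_ratio("flu", "fever"), 2))
# ...while the relative risk differs: flu -> fever scores higher than fever -> flu.
print(round(relative_risk("flu", "fever"), 2),
      round(relative_risk("fever", "flu"), 2))
```

On this invented data the likelihood ratio is identical both ways, while the relative risk is clearly higher in the flu-given-fever direction, matching the talk's conclusion that flu is the cause.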
"It's clear you are a data scientist," someone says, "so let's do something with it." OK, so this is basically the graph representation of that table. We ignore the low relative risks: we just don't draw a line between, say, headache and broken leg, because the relative risk there was below one, or below a threshold, and we do draw a line where it's above. And you already see some components in this graph, so you can apply some graph theory. Rash, chicken pox, and fever could be considered one connected component; in graph theory I learned it's called a clique, or something like that, and you can search for these components. Similarly, fever, flu, cough, and sore throat, even headache, are somehow related, along with tiredness. And on the broken-leg-and-bones side we have this whole component related to pain, because I set the data up that way.

You see no arrows there yet, because so far I was still ignoring direction. But we can also draw the arrows pointing in a direction, where a direction means we think one thing is the cause of the other: the symptom points to its cause. And you can see that chicken pox and flu are the only nodes here that have only incoming arrows; everything points to them. Now, I'm not saying this is perfect. It still relies on how good your data are, and the other approaches are not simply worse: machine learning can combine multiple symptoms and weigh them in ways this can't. But this approach has some properties that are much nicer. It doesn't require any GPUs; no GPUs were harmed, though you can still use one if you can't wait five minutes and want the result in one. It's much easier to do. Compared with machine learning algorithms, it has the benefit that you can really reason about it: these are numbers you can put meaning behind. You can iterate fast, and you can understand how good your data are. And I mentioned the cons already.

So the conclusion, I would say, is: use both. Don't be shy about using stats just because machine learning is cool; notice that on this slide the boring guy is proposing machine learning and the cool guy is proposing stats. That's pretty much the message I wanted to give here. And the last thing: if you hear "data science," don't think this is a job only for data scientists or other experts. Complement this whole area with some knowledge about the field itself; you really don't need to be a magician to do this kind of thing, and you can look at me as an example of that. With that, I guess we are almost running out of time, so thank you very much for your attention, and we still have maybe three minutes for questions. Do we have time for questions? Do you want more tricks? Because if so, I can't help you there.

[Audience: What are some real examples where you were able to use this and get to a root cause? Were you able to show us some examples?] Great question, and we actually weren't planning this, but I have a real example here about root cause. The root cause part is still something I need to apply to the real world, though we have seen some indications of it. But when it comes to clustering, to grouping things together: this is some real data from a cluster that had an issue, where you can see it has multiple different alerts. We don't really want to reason about each of them individually. We can do some time grouping, which is just grouping things together based on the time they appeared, but then we can apply this additional contextual grouping on top. Let me highlight this particular group. When we did the time grouping, we had the etcdMembersDown alert and a bazillion other alerts in one group, but there was some noise introduced there, because some other incident was happening at the same time. When I include the additional contextual grouping, it is now split into two things: we still have the etcdMembersDown part, but then there is a pod security violation part, which is not related, and purely from the statistics we have about them, we can actually separate those two. That is basically the point of applying this kind of approach to the real world, not just playing around. Thank you for the question.
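As a closing sketch, the two-stage grouping described in the answer could look something like this. The alert names, timestamps, and relatedness scores are all invented; "contextual grouping" is approximated by splitting a time group wherever the pairwise relatedness (e.g. a relative risk computed from historical co-occurrence) falls below a threshold:

```python
# Invented alert stream: (minute, alert name). Two unrelated incidents overlap in time.
stream = [
    (0, "etcdMembersDown"), (1, "APIServerDown"), (2, "KubeletDown"),
    (3, "PodSecurityViolation"),          # unrelated incident, same time window
    (30, "NodeDiskPressure"),
]

# Hypothetical relatedness scores (e.g. relative risks from historical data);
# pairs not listed are considered unrelated.
related = {
    frozenset({"etcdMembersDown", "APIServerDown"}): 6.0,
    frozenset({"etcdMembersDown", "KubeletDown"}): 4.5,
    frozenset({"APIServerDown", "KubeletDown"}): 5.2,
}

def time_groups(stream, gap=10):
    """Stage 1: start a new group whenever the gap to the previous alert exceeds `gap` minutes."""
    groups, last = [], None
    for t, name in sorted(stream):
        if last is None or t - last > gap:
            groups.append([])
        groups[-1].append(name)
        last = t
    return groups

def contextual_split(group, threshold=1.0):
    """Stage 2: within a time group, keep alerts together only if statistically related."""
    subgroups = []
    for name in group:
        for sub in subgroups:
            if any(related.get(frozenset({name, m}), 0) > threshold for m in sub):
                sub.append(name)
                break
        else:
            subgroups.append([name])
    return subgroups

for g in time_groups(stream):
    print(contextual_split(g))
```

On this toy stream, time grouping alone lumps the pod security alert in with the etcd incident, and the contextual pass splits it back out, which mirrors the real-data example described in the answer.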