Yeah, thanks a lot, Paul, for the introduction, and thanks to the organizers for inviting me. I'm glad to speak here. Let me briefly share my screen. Can you see my screen now? Yes. So this talk is on the one hand about applications, but first and foremost about methods and also about concepts, because it turns out that when you work on applications, often the concepts you need have not been invented yet. And if you try to define the right quantities and the right formal concepts for root cause analysis, there can be two sorts of reactions from reviewers: either they tell you everything is obvious from the beginning, or they tell you it's too sophisticated and nobody needs it. It's sometimes hard to get something in between, so I hope you will consider this something in between. So what is this about: root cause analysis in complex systems, or more generally, understanding complex systems. I work for AWS, and cloud computing is, for instance, a very complex system with a lot of dependent services calling each other, and there's a lot of behavior you want to understand, and there are these distribution changes and so on. Let me start with the outline. I first want to explain the general idea in an informal way: the idea of attributing the behavior of a target to different mechanisms, from a modular perspective. Then I introduce the formalism of causal attribution and root cause analysis of outliers and of distribution changes, with examples. I will talk about the practical implementation and challenges: the implementation in the open source library DoWhy, how we can infer all that from data, the quantities we consider crucial, and how the methods scale. First, I should advertise a bit the open source library DoWhy, to which we contributed several of our methods; all the methods I'm talking about today are meanwhile in DoWhy.
So I encourage the audience to further contribute and also to improve on the methods, but let me talk about that later. From a modular perspective, I'm talking about this based on a graphical model, a causal Bayesian network, in the following sense. You have nodes that stand for variables, for some measurements; in our applications these can be latencies of services, or quantities in the logistics chain, or whatsoever. We have a causal DAG, a causal directed acyclic graph, or causal graph, that shows cause-effect relations between the variables, so an arrow just means that this variable influences the other variable. And if this graph is the causal structure, then the joint distribution factorizes according to the graph, in the sense that the joint distribution is just the product of the conditional distributions of each variable given its direct causes. Actually, this formula is the most important one to keep in mind in this talk, because it comes with an important interpretation for us: each of these conditional distributions of a variable given its direct causes formalizes a mechanism, in this modular perspective of the world. And if you want a more fine-grained description, you can also describe it in a deterministic way, where every variable is a function of its parents and an unobserved noise variable that is not explicitly shown in the model. This is just a structural causal model, as used in econometrics, of course nonlinear in general. We can either think of these conditional distributions as one mechanism, or of a structural equation as one mechanism. Here, just as an example for visualization purposes, is some node of interest Xj, and the red ones are its parents; the corresponding mechanism describes how this quantity is generated from its parents.
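As a minimal sketch (not from the talk, with a made-up three-node graph and made-up coefficients), the modular view can be written down directly: each line of the sampler is one mechanism, a variable as a function of its parents plus an independent noise term, and the factorization p(x1) p(x2|x1) p(x3|x1,x2) is implicit in the sampling order.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    # One line per mechanism: each variable reads only its direct causes
    # and its own independent noise term.
    n1 = rng.normal(size=n)
    n2 = rng.normal(size=n)
    n3 = rng.normal(size=n)
    x1 = n1                      # X1 := N1
    x2 = 0.5 * x1 + n2           # X2 := f2(X1, N2)
    x3 = x1 + 2.0 * x2 + n3      # X3 := f3(X1, X2, N3)
    return x1, x2, x3

x1, x2, x3 = sample_scm(100_000)
# Substituting the equations: X3 = 2*X1 + 2*N2 + N3, so Var(X3) = 4 + 4 + 1 = 9.
print(np.var(x3))
```

Changing one mechanism (one line) leaves the other lines untouched, which is exactly the modularity the talk relies on.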
Okay, why are these mechanisms so interesting? Because they come, at least for us and for many other people, with a lot of philosophical preconceptions about the world, for instance phrased in terms of the principle of independence of mechanisms. The independence of mechanisms has been phrased in many different ways, and all of them are somehow related but not identical. First, the idea that we can change one of the mechanisms without affecting the others. Second, that they don't share information, in the sense that if you learn something about some of these mechanisms, you don't learn anything about the others; this is maybe the most abstract part of the independence of mechanisms, and I don't want to go much into that. The first one is the most interesting one for this talk: the idea that mechanisms change independently across data sets, or that if we manipulate something in the system, it's quite likely that we change only one of them, or only a few of them, but not all. And there is the independence of noise: in the structural equations, the noise variables are assumed to be independent, unless we have some hidden common causes; this is also part of the independence of mechanisms in a certain sense. This is discussed in our book; we devoted one section to it, with a historical overview and a lot of background connections. Let me start with toy examples with three variables. For instance, if we have two parents of a common child, then the joint distribution is just the product of the distributions of the two parents and a term that describes how the child is generated from its parents. Or take a causal chain, where each variable is generated from the one before, so the joint distribution is p(x1) times p(x2 given x1) times p(x3 given x2). And here is a complex system that people may think of when they think of Amazon: a supply chain.
This is a highly simplified version, I don't want to go into the details. For inventory planning, for instance, you have a forecast, some simulations, and the bidding process, and this is how these quantities influence the inventory level. And then you can factorize the joint distribution of all these quantities according to the graph. Here you can see that these mechanisms really represent different parts of the company, so it's rather intuitive. A postulate that is tightly connected with this independence-of-mechanisms idea is the so-called sparse mechanism shift hypothesis, which in this form has been phrased by Schölkopf and collaborators in the paper on causal representation learning. Phrased in my words, it says: try to explain changes of the joint statistics by changes of as few mechanisms as possible. Let's say we observe that the joint statistics of all these quantities in the supply chain shows week-over-week changes; here is the old distribution, the blue one, and the new one, the red one with the tilde. Then we try to explain these changes by the change of one of the mechanisms. It could be that among these four mechanisms only the one that generates X3 from X1 and X2 has changed, so only here the red factor carries the tilde. This is what we mean by explaining the changes, because then we know which mechanism we have to blame for the change. Now, on the limitations of the sparse mechanism shift hypothesis: should we always believe that only a small number of mechanisms changed? The short answer is no, it's just a working hypothesis, a kind of Occam's razor: try to find simple explanations if possible. The longer answer is that a simultaneous change of many mechanisms may indicate that the model missed a common cause. Just to show you an example, here is a model with three variables.
A simple complete DAG, as we call it, where we observe that in the new distribution all the mechanisms changed, violating the sparse mechanism shift hypothesis. But it could be because we missed a common cause X0: if just the distribution of this common cause changed, then we can also observe that in the factorization above all the conditionals change. So when too many conditionals change simultaneously, we have some reason to believe that there is a common cause behind it. Nothing happens simultaneously without a common cause; that is also a kind of principle in philosophy. Just a side note on the relation to causal representation learning, because, as I said, the hypothesis was taken from a paper on causal representation learning: try to find representations such that the sparse mechanism shift hypothesis holds. First, the condition of causal sufficiency: don't miss common causes. You are allowed to drop causes that influence only one variable, because this is just a less detailed description, but in the graphical model framework you are not allowed to drop variables that influence two or more. The second, a different question, is what the right coarse-graining is: define the right aggregations of variables. Of course, in complex systems you have lots of different levels of aggregation. Think of thermodynamics: there is the level of atoms, but there is also the level of, let's say, the gas theory of thermodynamics, where you just describe a gas by some macro variables instead of looking at the single atoms. There are similar levels of description for modern complex systems, supply chains or whatsoever, and constructing these representations is of course a big challenge for artificial and human intelligence. But I want to continue with what we do with this description of mechanism changes.
Think of root cause analysis of anomalies. You are given an observation (lower-case letters are always specific observations and upper-case letters are variables) that doesn't look like those drawn from the usual distribution. We apply the sparse mechanism shift hypothesis and assume that for this anomaly only one or few of the mechanisms are corrupted. So if you observe anomalous behavior in, let's say, a target quantity of the supply chain, then you would, as a first working hypothesis, assume that only one of the components failed. If several of them fail together, you would again ask what the common cause of all of them is. And if several of them failed, you want to quantitatively attribute the anomaly to all contributing mechanisms. Actually, this is a special case of the distribution change, where only one data point from the new distribution is given: you have this anomaly, and this is the one data point from the new distribution. For instance, in this supply chain you could try to explain the anomalous value x5 by the corruption of just one of these mechanisms, for instance the one generating X3 from X1 and X2. We also have a slightly different question, a concept that we called intrinsic causal contribution, ICC. It asks which mechanisms we should blame for the fluctuations, for instance the variance, of a target variable. The idea is that the contribution of each mechanism to the variance of the target is the variance reduction after replacing that mechanism with an appropriate deterministic one. We want to define this contribution in a way that deterministic mechanisms don't get any contribution at all: we ask which of the components introduced the indeterminism, the statistical fluctuations that you observe in the end, and those that are deterministic don't contribute, by definition. Just to make it a bit more concrete, here is how this relates to my everyday work.
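The variance-reduction idea behind ICC can be sketched numerically for a hypothetical two-node chain X1 := N1, X2 := 2*X1 + N2 (model and coefficients are mine, for illustration). Replacing a mechanism with a deterministic one amounts to freezing its noise at a fixed value; in this linear example with independent noises, the variance drops are order-independent, so the Shapley symmetrization used in general is not needed here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
a = 2.0

def sample_target(freeze_n1=False, freeze_n2=False):
    # Replacing a mechanism by a deterministic one = freezing its noise.
    n1 = np.zeros(n) if freeze_n1 else rng.normal(size=n)
    n2 = np.zeros(n) if freeze_n2 else rng.normal(size=n)
    x1 = n1              # X1 := N1
    x2 = a * x1 + n2     # X2 := a*X1 + N2
    return x2

v_full = np.var(sample_target())                       # ≈ a^2 + 1 = 5
# Contribution of each node = variance reduction when its mechanism
# is made deterministic:
c1 = v_full - np.var(sample_target(freeze_n1=True))    # ≈ a^2 = 4
c2 = v_full - np.var(sample_target(freeze_n2=True))    # ≈ 1
print(c1, c2)
```

Freezing both noises makes the target fully deterministic, with zero remaining variance, which matches the requirement that deterministic mechanisms get no contribution.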
This is causal contribution analysis for dependent services in cloud computing. A target quantity could be the user-experienced latency of the entire service. And here we have a causal structure, because if a service A calls a service B, then the latencies propagate from B to A. This is at least a first approximation, a working hypothesis; there are many violations of this principle, but as a first approximation we state that the causal graph is the service graph with the arrows inverted, if we are talking about how latency propagates. What kind of questions could we ask here? Root cause analysis of anomalies would ask: identify services that caused a sudden increase of latency. Root cause analysis of distribution change would ask: identify services that caused an increase of latency in a day-to-day comparison, so if we monitored the service over several days. Intrinsic causal contribution asks: identify the services that contribute the most to the fluctuations of latencies. And there is a very interesting semi-real data set that colleagues of mine have now published. It's basically a toy application, an online pet shop that is just hypothetical, but it really runs; you can see the pets and you can play around with it. It really consists of AWS services that interact, you can observe how latency propagates, and they analyzed this data set; this is to appear on arXiv soon. Okay. When I started my talk, I emphasized that these are concepts that were missing. To explain that a bit more, let me also explain why the well-known concepts are not sufficient. I think the most popular quantification of causal influence is the average causal effect. It quantifies the impact of one variable on another, of Xi on Xj here. I'm using Pearl's notation; I know that in this community other notations may be more common.
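The "call graph with arrows inverted" picture can be made concrete with a toy simulation (service names and latency distributions are invented, purely for illustration): each service's observed latency is its own processing time plus the latencies of the services it calls, so latency flows causally from callee to caller.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Hypothetical call graph: www -> api -> {db, cache}.
# Causally, latency propagates in the opposite direction.
lat_db = rng.exponential(5.0, n)     # leaf service, mean 5
lat_cache = rng.exponential(1.0, n)  # leaf service, mean 1
# "api" calls db and cache sequentially, so their latencies add up:
lat_api = rng.exponential(2.0, n) + lat_db + lat_cache
# user-facing "www" calls api:
lat_www = rng.exponential(1.0, n) + lat_api

# Mean user-experienced latency = 1 + 2 + 5 + 1 = 9
print(lat_www.mean())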
What this says is just: take the mean of Xj given that I set Xi to the value lower-case xi, and compare that to the mean you get if you set Xi to a different value. If Xi is binary, for instance, this is the usual average causal effect or average treatment effect. Why is this substantially different from the causal contribution I'm talking about? Take, for instance, root cause analysis of anomalies: the causal effect from Xi to Xj doesn't tell us anything about whether the mechanism at Xi worked properly. Likewise for distribution change: it doesn't tell us whether this mechanism changed, and it doesn't tell us either whether this mechanism contributes to the variance. If we have a chain of events like: thunderstorm causes power outage, this causes server outage, and this causes revenue drop, then the treatment effect of interventions on server outage on revenue may be large. But the variable server outage was not the root cause of the revenue drop, because this node just showed its usual behavior; the root cause was further upstream. This example also shows that root cause analysis is relative to the set of variables you have observed: it depends on how far you go back in this backtracking process. Let me be a bit more concrete about root cause analysis of distribution change and how it is defined in a quantitative manner. We observe that the distribution of some target quantity of interest Xn changed from P to P tilde, and we want to quantitatively attribute this change to the different mechanisms. A priori, we allow that all of these mechanisms changed, although we believe, or hope for the analysis, that it's only a few, because this is the more interesting case.
Then we do a computation where we replace, step by step, the old mechanisms with the new ones, according to some random order, and observe how the distribution changes step by step. For instance, when we consider the difference of the means, we get step by step from the old mean to the final mean, and this defines the contribution of each mechanism to the change of means. These contributions sum up to the observed difference, but unfortunately they depend on the order of replacing the mechanisms. To get rid of this ambiguity and ill-definedness, we use Shapley values and just average over all potential orderings. Computationally this is a heavy step, but it can be approximated; I'll talk about that later. Conceptually, we average over the orderings in order to get rid of the ambiguity. Here is a toy example, a very simplified model of a supply chain that can be found in DoWhy as a toy data set, where we have the quantities demand forecast and capacity constraint; these result in submitted purchase orders, these in confirmed orders, and these in received orders. This is the causal structure, with just five variables. When you look at the week-over-week changes, you see that most of the variables, all but one, have changed; constraint was the only one that didn't change week over week. A simple ad hoc analysis just tells us that almost everything changed from week to week, but the contribution analysis shows that it's just two mechanisms that have changed: confirmed and demand. To show that this can be easily implemented in DoWhy, let me just briefly mention the API. This is how you define the graph; the causal structure needs to be provided, which is also something I will talk about later. It's a bottleneck, one has to be honest about that.
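The mechanism-replacement game and its Shapley symmetrization can be sketched exactly for a tiny hand-made example (not the DoWhy supply-chain data): a chain X1 -> X2 where, in the new week, both the mean of X1 (0 to 1) and the slope of X2 (1 to 2) changed, so E[X2] moved from 0 to 2. We switch mechanisms from old to new in every order and average the marginal contributions.

```python
import itertools

def mean_x2(new_mechs):
    # E[X2] = slope * mean, under whichever mechanisms are switched to "new".
    mu = 1.0 if 1 in new_mechs else 0.0   # mechanism of X1: mean 0 -> 1
    b = 2.0 if 2 in new_mechs else 1.0    # mechanism of X2: slope 1 -> 2
    return b * mu

players = [1, 2]
shapley = {p: 0.0 for p in players}
perms = list(itertools.permutations(players))
for order in perms:
    switched = set()
    for p in order:
        before = mean_x2(switched)
        switched.add(p)
        # Marginal contribution of switching mechanism p in this order:
        shapley[p] += (mean_x2(switched) - before) / len(perms)

print(shapley)  # {1: 1.5, 2: 0.5}; the values sum to the total change 2.0
```

Note the order dependence the talk mentions: switching mechanism 1 first contributes 1, switching it last contributes 2; the Shapley value averages these to 1.5, and the contributions still sum to the observed change.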
Where this graph comes from, I will talk about later. And then, this is how you call the distribution-change attribution: you just have this one call, gcm.distribution_change, with the data of week one, the data of week two, and the target quantity 'received', and it produces the above plot. It's pretty easy to use; just to mention that. But I also want to talk a bit more about the details of root cause analysis of outliers and what is done mathematically. We assume that we are given the causal graph, but we also assume a bit more: we assume that we are given the structural equations. If we recursively insert the structural equations into each other, we can write the target variable just in terms of the noise variables, as some function capital F that takes all the noise variables as inputs. Then we want to assess the importance of each noise variable for the observed anomaly by setting it to normal baseline values and checking whether that results in Xn being normal. The function F is computed from the structural equations, and it gives us the counterfactual values of Xn that would have been observed had we set some of the noise values to normal ones. Of course, the crucial question for root cause analysis is which of the noise values we need to change in order to get a normal value for the target. But I told you that we want quantitative root cause analysis, in the sense of quantitatively attributing the anomaly to the mechanisms. To do that, we first define an anomaly score: basically just a minus log probability, which quantifies how unlikely the event is. Here g is some feature map. If you read this formula, it tells you how unlikely it is that the function g of X is even larger than, or of the same size as, the observed value g of x; you take minus log to quantify the unlikeliness of that event.
For instance, you can have a one-sided tail probability, where the anomaly score is just minus log of the probability of observing even larger values, or a larger distance from the mean. This is a way to phrase it in a very modular and general fashion, because it also captures non-numeric data, multivariate quantities, and so on; it's a quite general formalism. Okay, I don't want to go too much into the details, just to give you an impression. I want to quantify to what extent a certain mechanism makes the outlier less likely or more likely. To do that, we consider the quotient of two probabilities of the outlier event. I should have mentioned the random order of the nodes first: we start with some random order of the nodes, then we randomize all noise variables that are earlier according to this ordering and evaluate the probability of the outlier event, and compare that to the probability we obtain when we also randomize Nj itself in addition to the earlier ones. The quotient describes the factor by which randomizing Nj decreases the probability of the anomaly. We again have the problem that this contribution depends on an arbitrary ordering of the nodes, so we symmetrize over all orderings, which makes it a Shapley-value-based concept. For these Shapley values we also have a decomposition formula: the total score decomposes into the contributions of the mechanisms, so you can quantitatively say how many percent of the anomaly has been generated by a particular mechanism. Just to see this concept at work: if you have a logical AND gate, where Y is just the AND of the different inputs, then the outlier score is the sum of the input outlier scores.
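The one-sided tail-probability score can be estimated directly from reference samples; this is a minimal sketch of the definition above (a standard normal reference and the identity feature map are my choices for illustration, not the talk's).

```python
import numpy as np

rng = np.random.default_rng(2)
# Reference samples of g(X) under the usual distribution:
samples = rng.normal(size=1_000_000)

def outlier_score(value, reference):
    # S(x) = -log P(g(X) >= g(x)): how unlikely is an event at least
    # as extreme as the observed one? Estimated by the empirical tail.
    tail = np.mean(reference >= value)
    return -np.log(tail)

print(outlier_score(0.0, samples))  # median value: score ≈ -log(1/2) ≈ 0.69
print(outlier_score(3.0, samples))  # three-sigma event: score ≈ 6.6
```

Because the score is a minus log probability, it is scale independent: monotone reparametrizations of g(X) leave it unchanged, which is what makes scores comparable across measures with different units.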
Let's say X1, X2, ..., Xd are all binary variables that stand for rare events. Then minus log of the probability of the target being one is minus the sum of the logarithms of the probabilities of each of the inputs being one, and it turns out that the contribution of Xj is just minus log of its probability. Interpreting this: only rare events can get a high contribution, which makes sense, because frequent events don't make good explanations. Even if their logical role is equivalent, so if Y is just an AND of several inputs, we would still attribute Y being one more to the rare events, because those are the unexpected ones; those that are one almost all the time don't get a high contribution, by construction. This dependence on rarity is not put in by hand; it is just a result of the concept. And it aligns with our intuition: a historically strong drop of the Dow Jones cannot be explained by an event that happens twice a week; we would not really call that an explanation. This is also a bit about psychology or philosophy, what we would consider an explanation, and I think it aligns with our intuition. Here is a toy data example with two dice, one with four different values and one with 100 different values. The outlier score of the event (1, 1) is the logarithm of four times 100, and the contribution of each of the dice to this outlier event is log four or log 100, respectively; that's 23% versus 77% of the outlier score. In summary, the root cause analysis of anomalies: it gives a scale-independent quantification via probabilistic scores; it explains anomalies in terms of ancestor anomalies via structural causal models; it quantifies contributions via counterfactual changes of log probabilities, which achieves comparability across measures with different units; and the nodes with the highest contributions are called root causes here. I think I have to cut some parts a bit.
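The two-dice numbers above can be verified in a couple of lines: the joint log-probability is additive for independent events, so each die's contribution is its own minus log probability.

```python
import math

# Two independent fair dice: one with 4 faces, one with 100 faces.
p1, p2 = 1 / 4, 1 / 100

# Outlier score of jointly observing face "1" on both dice:
total = -math.log(p1 * p2)          # = log(400)

# Additivity of -log for independent events gives the per-die contributions:
c1 = -math.log(p1)                  # = log(4)
c2 = -math.log(p2)                  # = log(100)

print(c1 / total, c2 / total)       # ≈ 0.23 and 0.77
```

The rarer die (100 faces) gets the larger share even though both dice play the same logical role in producing the event.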
Because I also want to give you a user guide for what I said. Estimating these contributions is challenging: there is the uncertainty about the structural equations (where do the structural equations come from?), and a high computational load if we really want to evaluate this combinatorial expression properly. But if there is a unique root cause, we can work with rather crude approximations, and finding good simplifications and proxies for the above will be heavily use-case specific. This is a sort of explanation of how you should understand this talk. I presented this definition in order to show that, first, we should discuss what we mean by root cause analysis, and I think the quantity we define makes sense. Even if it's hard to estimate in practice, it's a gold standard, maybe; if we accept that, we have to find good proxies. It's better to work with proxies for a clear concept than to start with something that is undefined in the first place. The above is an attempt at defining what root cause analysis is supposed to mean in the first place, not necessarily the way to go if you implement it at large scale; this is a question I want to come back to later. Just some remarks on the practical implementation. How would you estimate the structural equations? For instance, you can go for an additive noise model as a simplification, where each node is a function of its parents plus noise. That's a parametric restriction of the structural equation. Then this function f tilde can be learned via standard regression methods, and the noise values can be computed via the residual terms. This is how we can reconstruct the noise values for every single instance and talk about counterfactual changes: what would have happened had the noise values been different. And we can average over a few random permutations instead of averaging over all permutations, as an approximation.
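The regression-plus-residuals recipe can be sketched for a hypothetical linear additive noise model (ground truth Y = 2X + N, chosen by me for illustration; DoWhy supports more flexible regressors): fit f by least squares, then read off each instance's noise as its residual.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Ground-truth additive noise model: Y = 2*X + N, with N independent of X.
x = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)
y = 2.0 * x + noise

# Learn f via standard regression (here: least-squares line fit).
slope, intercept = np.polyfit(x, y, 1)

# Reconstruct the noise of every single instance as the residual:
noise_hat = y - (slope * x + intercept)

# Counterfactual: the value Y would have taken had the noise been at
# its baseline (here: zero) is simply slope*x + intercept.
print(slope, np.corrcoef(noise, noise_hat)[0, 1])
```

With the per-instance noise in hand, the counterfactual "what if this node's noise had been normal" is just a matter of re-evaluating the fitted equations with some residuals swapped for baseline values.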
If there is a unique root cause, that shouldn't make a difference, although it's hard to make very general statements. Just as a real-world example, here are time series of water levels at four different cities in the UK. We know the causal structure because we know the river topology. There we did a root cause analysis of the anomalies of the high water level and found that the downstream water levels were just a result of the upstream anomalies, so we could counterfactually attribute the downstream anomaly to the upstream water levels. Maybe I want to go back again to the general spirit, the principles of causal contribution analysis. The common structure behind all the above contribution methods is: we replace the mechanism at each node with a baseline mechanism and look at the impact of the replacement on the target quantity, which defines the contribution of this mechanism. Replacing the mechanisms one after another yields contributions that sum up to the joint contribution, the change from red to black or vice versa. In general, the problems that arise here are always the same, namely: what is the baseline, and that the contribution depends on the order of replacements. The second one is solved via the symmetrization, the Shapley value. The baseline is something that we need to decide for each of these attribution methods separately; I don't want to go into the details there. I would rather like to talk about a problem that we are asked about all the time: how do we know the causal graph? In DoWhy we need to specify it. In our use cases there is a lot of domain knowledge. The domain knowledge is derived from time order (causes precede their effects) and from system topology, for instance the dependent services in cloud computing; and of course a company knows its own supply chain.
So there is a lot of knowledge collected, but it's not always clear to what extent the graph can be constructed from domain knowledge, because my experience showed that in order to define a causal graph, you need both domain knowledge and a lot of causal thinking; sometimes it's even a slightly philosophical question whether something is a cause or not. This domain knowledge may be incomplete, and we should have a way to check it from data. The question is to what extent we can test hypothetical graphs. One minimum requirement is the causal Markov condition: every node must be conditionally statistically independent of its non-descendants, given its parents. Here is some node Xj, here are the parents, the red ones; the descendants are here, and these are the non-descendants. Given the parents, this node needs to be statistically independent of the non-descendants: for fixed values of the parents, the non-descendants no longer contain information about Xj. The parents screen off the non-descendants. This is a minimum requirement in textbook accounts of causality. How do we test that in practice? Conditional independence testing is hard; there is even a sense in which it's impossible, depending on what kind of statistical guarantees you demand, so conditional independence testing requires assumptions. And the tests have type-one errors: even if the graph is right, you will have violations of the Markov condition just by the usual statistical errors. So how do you decide whether it's still a good graph? A very skeptical view that we started with: is the test result even better than for a random graph? That would be a minimum requirement; it should be better than a random graph. We are modest. Of course, often it is. But in order to test whether there are many violations of the Markov condition or not, we need a fair comparison when we ask whether a graph is better than a random graph.
We also need to generate a good random graph. To get a fair comparison, we want to compare against graphs with comparable connectivity. We don't want to compare a graph with a lot of connections to one that is very sparse, with almost no connections, because the sparse one entails way more conditional independences than the dense one and will therefore show many more violations of the Markov condition; that would not be a fair comparison. A simple method to get a fair comparison is to generate random graphs via permuting the node labels; there are also more sophisticated arguments in favor of that principle. We have implemented such a test in DoWhy as a first check for how good your causal graph is. But what can we do when domain experts don't know much about the causal graph? There is the big field of causal discovery, inferring the causal graph from data only, with different approaches, meanwhile, all of them very interesting. The big question is how much we can trust them on real data. If they are tested on simulated data, then the simulated data could just reflect the authors' preconceptions about generating processes in nature. And with real data we have the problem that ground truth is rarely known. One way we tried to get around that serious problem is what we call self-benchmarking: we apply an algorithm to different subsets of variables from a large causal structure and quantify how much it contradicts itself. At first glance this is just a sanity check, but we argue that it is a bit more, because it can be a quite strong test. And it really turns out that there are a lot of contradictions; algorithms often contradict themselves. Here is just a toy example where the right one is the true graph: if you apply an algorithm to subsets, you may end up with graphs that have a different direction, X causing Y or Y causing X.
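The Markov-condition check against label-permuted graphs can be sketched on a toy chain X -> Y -> Z (data-generating model and the linear partial-correlation proxy for conditional independence are my simplifications; DoWhy's falsification test is more general).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
data = {"X": x, "Y": y, "Z": z}

def partial_corr(a, b, c):
    # Correlation of a and b after linearly regressing out c
    # (a simple proxy for a conditional independence test).
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

def markov_violation(chain):
    # A chain A -> B -> C entails A independent of C given B;
    # report |partial correlation| as the degree of violation.
    a, b, c = (data[v] for v in chain)
    return abs(partial_corr(a, c, b))

true_graph = ("X", "Y", "Z")
permuted = ("Y", "X", "Z")   # a "random graph" from permuting node labels
print(markov_violation(true_graph))  # close to zero
print(markov_violation(permuted))    # clearly nonzero
```

Permuting labels keeps the connectivity identical, so both graphs entail the same number of conditional independences; the hypothesized graph passes only if its violations are smaller than those of its permuted competitors.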
If you demand that algorithms should at least not contradict themselves, then you have a method where you can train algorithms without ground truth. But this is further research; it's not something that I am suggesting here for direct application in industry. The main references I want to give you are the papers about the different contribution methods, and again DoWhy, an end-to-end library for causal inference, to which my colleagues contributed a lot. Feel free to contribute improvements, also of the methods I was explaining. The above is meant to introduce a way of thinking about contribution analysis rather than recommending specific methods. As a disclaimer: I think the specific method needs to be adapted to the specific use case, but the way of thinking, this modular perspective of decomposing the world into mechanisms, this is the general spirit that I'm convinced of. Okay, let me close with these remarks. Thank you very much, Dominik. Thanks. Great talk, it had everything: new methods, applications (there was already a lot of interest in root cause analysis), but also the user guide, as you called it, directly applicable in DoWhy, and a part on causal discovery, which is very relevant for our audience too. So thanks for that.