All right. Now for the moment, we don't need a computer; we need a piece of paper, if you still have that. I know that everybody has a computer, but some of you still have a piece of paper. What I will do today is introduce the concept of the Bayesian network and, if there is still time, influence diagrams; if not, that will continue tomorrow morning. The introduction today is going to be very general. I will also address the topic of value of information, but I'm going to show you more generally what we can do with these tools. The idea is that from tomorrow on, you can also use these tools for value of information analysis. I should mention that sometimes there is some confusion: the Bayesian network is both a modeling tool and a computational tool. I'm going to focus mostly on the modeling aspect: how can I use it to actually understand, model, and communicate my problem? I'm not going to look much at how to use these tools to compute things. On the last day, in the morning, there is a lecture on sequential decision making, and there I will speak a little bit more about the challenges of actually doing computations with these tools. So let's start with a piece of paper, and I'm jumping right into an example. This example does not have much to do with most of the applications that you deal with, but there is an information aspect to it. I had one or two students working on warning systems for natural hazards, and that is the inspiration for this little problem. A strong earthquake is possible, with a certain probability; this is a highly simplified problem, of course. The strong earthquake can cause a tsunami, again with a certain probability. And you have your own house at the beach. Here, the risk of a tsunami is, I guess, not that high, but in that place there could be the potential for a tsunami.
And the house has a certain vulnerability. Now there is also the aspect that the earthquake can be measured and a warning can be given; the warning is issued when the earthquake is measured. The assumption is that, at this point, the warning is based entirely on measuring the earthquake and not the tsunami itself (in reality, of course, we could also measure the tsunami). The warning, just like a health monitoring system, is not perfect. In principle, it should give a warning whenever there is a strong earthquake, but sometimes it fails, and sometimes it gives a false alarm. So there are a number of probabilities, and if you know Probability 101, you can solve this problem. But it's not super straightforward; you have to think a little bit about it. The question is basically: what is the probability that your house is destroyed, conditional on a warning being issued? So you lie on the beach... no, better not. You're not on the beach; you're at home, wherever you come from. I'm in Munich, and I hear on the radio that there is a tsunami warning, and I want to know the probability that my house at the beach will be damaged. Just think about this for a second, or maybe for two minutes; if you are a genius, think for a second, otherwise think for two minutes. OK, does somebody have a solution? Well, as I said, I'm not expecting you to solve this in your head in two minutes. It's a very simplified problem, capturing just a few of the aspects associated with the real problem: we have a multi-hazard type of event, we have a warning that is indirectly connected to what we want to know, and uncertainty described by probability. Still, it's highly simplified, and nevertheless it's a bit challenging. Now, to address this problem, I'm trying to make a kind of mind map, and that mind map can be our Bayesian network.
Who of you has already come across Bayesian networks before? Some of you. So at least some of you are probably able to directly draw a Bayesian network on your sheet of paper. We'll see how that helps. Let's start with that, and then we will come to the solution of this problem. We're just trying to create a graph here, which is actually not yet the Bayesian network but a more general one we call a causal network, which has nodes and directed links, arrows. The nodes represent the different variables of the problem, random variables or parameters, and the links represent the dependence or relation between them. In a causal network, those links should represent the causal effect or causal interaction between them. So let's start. What do we have here? We have an earthquake, so I'm just starting with an earthquake here: E. What else do we have? The tsunami, T. And how will that be connected? The earthquake causes the tsunami. What else do we have? OK, the warning. I'm making that here somewhere; I'm not sure if I called it alarm or warning, but let's make it W. How is that connected? The earthquake causes the warning. And what else? The house, the condition of the house, whether it's damaged or not: D. So, voilà, that's our first Bayesian network. And actually, I sent around the link to the lecture notes; you find this nicely there, so you don't need to write down the details. So, this is a causal network. And personally, maybe because I'm a graphical person, I find this very helpful in understanding the problem. In particular, we have correctly figured out here that the relation between what we observe, in this case the warning, and what we want to know, D, is not simply a chain from W through T to D.
But actually, we have this kind of link where the earthquake causes both the tsunami and the warning. Tomorrow I will speak about how to model information, how to represent my information from monitoring, from inspections, and so on. And we'll see that, in a probabilistic sense, this model of information is a likelihood function: a conditional probability of observing something given a certain state of my system. And that's exactly what we have here. Behind this, and I'll come back to it, is that each node will be represented by a conditional probability distribution, always conditional on the random variables that have arrows going toward the node. So the warning will be defined conditional on whether or not we have an earthquake, and that's exactly the likelihood function that is generally used to describe information. And this is the same thing, just slightly differently arranged. More generally, mathematically, if I program something like this in the computer, the computer sees maybe something like this: x2, x3, and so on (x1 is missing here); those are just random variables. To us they have a meaning; we have this story. But to the computer, to your code, a priori there is no meaning. Nevertheless, the computer will know from the way those arrows are defined how those random variables interact with each other. This is the first thing about the Bayesian network: the graphical structure. And the graphical structure defines the statistical dependence among those random variables. Now, what I'm teaching you now I would normally teach in maybe three days altogether, so I have to rush through all these things. If you want to know a bit more about what is behind it, there is more in my lecture notes, and there are also a large number of books specifically on Bayesian networks.
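To make this concrete, here is a minimal sketch in plain Python of how such a graph and its node-wise conditional distributions might be stored: each node keeps a list of parents, and its table is indexed by the parents' states. The CPT numbers here are my own illustrative assumptions, not the lecture's values.

```python
# Minimal sketch of the earthquake example: each node lists its parents;
# its table gives P(node = 1 | parent states). States are binary 0/1.
# All probability values are illustrative assumptions.

parents = {
    "E": [],        # earthquake has no parent
    "T": ["E"],     # tsunami depends on the earthquake
    "W": ["E"],     # warning also depends on the earthquake
    "D": ["T"],     # damage depends on the tsunami
}

# P(node = 1 | parent states); the key is the tuple of parent states.
p_one = {
    "E": {(): 0.10},
    "T": {(1,): 0.10, (0,): 0.00},
    "W": {(1,): 0.95, (0,): 0.10},  # the likelihood: P(warning | earthquake)
    "D": {(1,): 0.10, (0,): 0.00},
}

def node_prob(node, state, parent_states):
    """P(node = state | parents = parent_states)."""
    p1 = p_one[node][tuple(parent_states)]
    return p1 if state == 1 else 1.0 - p1

print(node_prob("W", 1, [1]))  # detection probability
print(node_prob("W", 1, [0]))  # false-alarm probability
```

Note that the warning node's table, read column by column, is exactly the likelihood function mentioned above: the probability of observing a warning given each possible state of the earthquake.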
But basically, from the causal network, we can derive rules that make sense. And I go back to this here: earthquake causes tsunami, tsunami causes damage, earthquake causes alarm. This is a causal network. Maybe some of you have read this book; for those who have looked at Bayesian networks, I recommend the book by Pearl from 1988. He is in Los Angeles, not San Diego, but almost; inspired by the sun. It's a very nice book, and there is a second book of his, which is called Causality. It's one of the first works that really derives the philosophical basis for Bayesian networks, and it starts from causal reasoning. Causal reasoning is the motivation for why we have rules that then have meaning about dependencies. Now, afterwards, once I give the computer this structure, it does not need to be a causal dependence. There is a difference, and this is a very important difference, between causality and correlation. So ultimately, the network does not have to be causal. For example, here it's a bit difficult to show, so let's assume I have only two variables. Let's say I have some kind of illness; this is my health status, my health. I go to the doctor, and the doctor measures my blood pressure; this is my blood pressure. So what is the causal relation between those two things? Health causes blood pressure, like this. But without all this introduction, if you ask people to draw a network linking what the doctor tells you about blood pressure with your health, most people will come up with the opposite arrow, which says: blood pressure is high, therefore something is wrong with my health. The first one is what we call a causal network. Does somebody know what we call the other one? Yes, a diagnostic network. I see that some of you have a good background. And people think in a diagnostic way.
So the doctor will think... meaning, intuitively he knows, of course, how it really is. But typically the thinking is: OK, blood pressure 150, something is wrong with your health. That is the diagnostic way of thinking. Now both are valid Bayesian networks, and both are correct. It's not that the second one is wrong. In this case, because you have only two random variables, the arrow just says that they are dependent, and the fact that one is causal and the other is not doesn't matter: two variables that are dependent simply cannot be treated as independent, and there is nothing further that can be wrong with this. Maybe I'll try to make that clearer later. But both are correct Bayesian networks: this one is a causal network, this one is not, but both are correct. I will argue later and show you why, in many cases, the diagnostic one is a very bad model: because it's difficult to construct, because you can easily make mistakes, and because when we extend it later, or tomorrow, to an influence diagram, we will realize that in that case the model becomes seriously wrong. This model is not what we call consistent under intervention. So, let's say you do an intervention; now I'm jumping a bit forward to what we're going to do later. Think of it: OK, I have a problem with my health, I want to improve my health. I can take an action and change something in my system. According to the diagnostic model, you could take something that lowers your blood pressure, and you would be more healthy. To some degree that's also not completely wrong. But let's say you have a problem because you eat badly, or you have another type of health problem. Just lowering your blood pressure will not solve that actual problem. But this model implies that it would: it implies that by changing the blood pressure, you can actually change your health. No. So in that case, the model becomes wrong.
So that's what we mean when we say the model is not consistent under intervention, whereas this one is: you improve your general condition, you move more, you eat better, you improve your health, and your blood pressure will go down. That makes sense. Yes, just interrupt. Choice nodes, if you want? For me, action nodes and choice nodes are more or less the same; or what do you mean by choice node? Yes, to modify blood pressure effects; are there other types of intervention that would make a diagnostic Bayesian network go wrong? I think I've not understood. For me, an intervention is basically always something that changes the system. Now, when I have this diagnostic type of model and I add additional random variables, additional nodes, it will also become very messy. If I add these decision nodes or action nodes, it doesn't work. But also if I add additional nodes, let's say I do additional measurements, and I have an example later on this, I have to be very careful with this model, whereas here it's more or less straightforward. I'll make an example later, an example that is more closely related to what you do. However, you see that whether we are a medical doctor inspecting a patient or a civil engineer inspecting a bridge, it's pretty similar from a fundamental point of view. OK, I'll jump ahead, but keep this in mind. So far, this is just a mathematical structure, and unfortunately I don't have the time to go into depth. If you want to know more about it, this is taken from my lecture notes, so you can read about it there in more detail. What I want to highlight is that this Bayesian network structure tells us something about independence and dependence between random variables. And these are some statements about independence. For example, the first statement says: the probability of x1 given x2 is equal to the probability of x1.
So that implies an independence between x1 and x2, because the main way of defining independence between two random variables is to say that the conditional distribution is the same as the unconditional one: knowing x2 does not change the probability of x1. And I can look at this structure and make a statement about whether this is true or not. So is this one true? Are x1 and x2 independent? OK, they are independent; maybe we can discuss later. Now, statement two is a bit more tricky, so let's look at statement three first. Are x3 and x4 independent? What do you think? The statement says that x4 is independent of x3. Is that true or not? x2 is not observed. So you say they are dependent. Correct, they are dependent; the statement is wrong. Why are they dependent? Well, they have a common parent. Bayesian network people use these kinds of family terms: this is a parent, this is a child, this is a spouse. So they have a common parent, and that means they are dependent. And now we go back: if this is too abstract, think of our initial example. The earthquake is the parent of both the warning and the tsunami. What this says is that the alarm and the tsunami are somehow dependent on each other, which they should be, because what would be the point of the alarm if it were not somehow dependent on the tsunami? So there is dependence. And again, we just need to know the structure. These are the so-called d-separation properties of the graph, which say, for example, that if two nodes have a common parent, or actually any common ancestor, they will be dependent, as long as this parent is not known. If the parent is known, that is what you said: once you know this node, these two become independent. So once you know an earthquake has happened, the warning no longer tells you anything about the tsunami in this case. All right.
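We can check this common-parent behavior numerically by brute-force enumeration over the earthquake example. A sketch, with illustrative CPT numbers of my own choosing: marginally, tsunami and warning are dependent; once the earthquake state is known, the warning carries no further information about the tsunami.

```python
from itertools import product

# Illustrative CPTs (assumed numbers, not the lecture's): binary E, T, W.
P_E = {1: 0.1, 0: 0.9}
P_T_given_E = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.0, 0: 1.0}}    # P(T | E)
P_W_given_E = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.1, 0: 0.9}}  # P(W | E)

def joint(e, t, w):
    # Bayesian-network factorization: product of node-given-parents terms
    return P_E[e] * P_T_given_E[e][t] * P_W_given_E[e][w]

def prob(pred):
    # sum the joint over all states satisfying the predicate
    return sum(joint(e, t, w) for e, t, w in product((0, 1), repeat=3)
               if pred(e, t, w))

# Marginally, T and W are dependent (common parent E):
p_t = prob(lambda e, t, w: t == 1)
p_t_w = prob(lambda e, t, w: t == 1 and w == 1) / prob(lambda e, t, w: w == 1)
print(p_t, p_t_w)  # hearing the warning raises the tsunami probability

# Given E, they become independent (d-separation):
p_t_e = prob(lambda e, t, w: t == 1 and e == 1) / prob(lambda e, t, w: e == 1)
p_t_ew = (prob(lambda e, t, w: t == 1 and e == 1 and w == 1)
          / prob(lambda e, t, w: e == 1 and w == 1))
print(p_t_e, p_t_ew)  # identical once E is known
```

With these numbers, conditioning on the warning alone shifts the tsunami probability, while conditioning on the warning in addition to the earthquake changes nothing.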
So, the last thing to look at, just to finish here: statement number two. What does it tell us? It is a bit more tricky to read: X1 is independent of X2 given that we know X5. Remember, here we said that X1 is independent of X2, and we saw that they have no common parents, so it makes sense that they are independent. But now we additionally say: however, we have observed X5. And we are still asking whether X1 and X2 are independent once we have observed X5. They are now dependent. Yes. So interestingly, the two that we said were independent become dependent when I observe X5. Now, in an abstract graph, maybe that's a bit difficult to understand, so let's make an example that represents something like this, similar to before but not exactly the same. Let's assume that the alarm can be caused by the earthquake or by what we call a check. Once every two months, there is a check of the alarm. I remember from my childhood there was always this alarm, and once every two months they would test this very loud signal, always Wednesday at 1 o'clock. So let's say the alarm could be caused either by the actual event or by just a check. Now, those two things are obviously independent: the check happens the first Wednesday of the month, the earthquake is completely random, no effect of one on the other. So those two are independent. But now you are on the beach, and you hear the alarm. Maybe you haven't felt the earthquake, because the earthquake might be far away, but you hear the alarm. Now, suddenly, the earthquake and the check become dependent. You are on the beach, you hear the alarm, and the first thing you think is: OK, is it a check or is it an earthquake?
And if somebody tells you that today is Wednesday and they always check these things on Wednesday, it suddenly becomes much less likely that you have an earthquake. Or the other way around: if somebody tells you that there was a strong earthquake somewhere off the coast, it suddenly becomes much less likely that it's actually just a check. We call this the explaining-away effect. So if you have observed a common child of multiple variables, or in fact any descendant of that child, then those parents suddenly become statistically dependent. So you see, in the causal graph it's kind of easy to understand, and those are rules that can be translated here. Ultimately, they boil down to three rules, which I summarize here, but there is not enough time; if you really want to study this, you have to look a bit more into the literature. But anyway, this is the thing. And from this, just the last point about this concept of dependence: we can construct something called a Markov blanket. Each node in this Bayesian network has a Markov blanket; for this node A, those are the gray nodes here. The Markov blanket is basically saying that if I have knowledge of those gray nodes, then A becomes independent of all the other random variables in the network. Those are always the parents, the children, and the other parents of my children, that is, my spouses (and there can be multiple spouses). So if I know all my parents, all my children, and all my spouses, then I am independent of the rest of the network. This is still a very small network, but in a large network it means that for each of the nodes individually, I just have to know a few of the other nodes, and then I become independent of the rest of the network.
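The Markov blanket can be read off the graph structure alone. A small sketch, using a hypothetical example graph of my own (node names are arbitrary): the blanket of A is its parents, its children, and its children's other parents, while more distant descendants stay outside.

```python
# Markov blanket of a node from the DAG structure alone:
# parents + children + the children's other parents ("spouses").
# The example graph below is an arbitrary illustrative assumption.

parents = {                 # node -> list of its parents
    "A":  ["P1", "P2"],
    "C1": ["A", "S1"],
    "C2": ["A"],
    "P1": [], "P2": [], "S1": [],
    "X":  ["C2"],           # grandchild of A: NOT in A's blanket
}

def markov_blanket(node):
    children = [n for n, ps in parents.items() if node in ps]
    spouses = {p for c in children for p in parents[c]} - {node}
    return set(parents[node]) | set(children) | spouses

print(sorted(markov_blanket("A")))  # P1, P2 (parents), C1, C2 (children), S1 (spouse)
```

Given the states of these five nodes, A is independent of everything else in the network, including the grandchild X.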
This is very helpful for modeling purposes, because even if I have hundreds and hundreds of random variables in my problem, I might need to specify my dependence only very locally. And it can also be very useful for computation. For many problems in stochastics, the only solutions we have available are for problems with a very restricted dependence structure; the Markov chain is the most typical example. So all these models that have been available and used for many years are tractable either because they are Gaussian and linear, or because we have a Markov structure. And the Bayesian network extends that Markov structure to a more general model. So much for this; read more if you want to know. Now we come to adding probability to the problem. So we have a graph that represents a dependence structure. Based on those rules, and interpreting those rules as statistical independence, one can now show, and again I refer to the lecture notes, that the joint probability distribution over the set of random variables turns out to be just a product of the conditional distributions of all the nodes given their parents. Not given the Markov blanket, but actually just given their parents. So in this example here, I need to know the probability of an earthquake (there is no parent), the probability of a tsunami given the earthquake, the probability of damage given the tsunami, and the probability of an alarm given the earthquake. And if a node has two parents, I need to know the probability of that node given both parents. And I'm arguing that these are probabilities that are relatively easy to obtain, because behind them there are always relationships that make physical sense, and then it is easy to get those conditional probabilities. So, just a product. Let me take this simple example, just to show you how this looks.
So if I have such a network, again slightly different from this one, the joint probability can be written as a product of six conditional probabilities. And that's basically all there is to it. If you know probability, you know that if you have the joint probability distribution, you know everything about your problem. And you also know that the joint probability distribution is something very difficult to describe: unless everything is jointly normal or follows some other simple copula, you have trouble constructing a joint distribution. But here it's easy. And if this is still too abstract, let's look at this example here. What do we need for this? This is the information that was already given earlier. If I want to define this problem, this is the information I need to give, which is exactly the information given earlier on the slide. And this is what we call conditional probability tables: if we have discrete random variables, and these are discrete random variables, we can give the information in so-called conditional probability tables. Alarm equal to one given there is an earthquake: 95%. Alarm given there is no earthquake: 10%. And then no alarm, obviously, 90%; each column has to sum up to one. These here are even simpler: no earthquake, no tsunami. OK, I'll show you this. What I'll do now is quickly finish two more things about why this is actually very efficient, then quickly show you that it's very easy to implement in software, and then we have a break. So why is it efficient? Again, two examples: these are two Bayesian networks where all nodes have just one parent, except for the first node, which has no parent. This is what we call a hierarchical structure; you could think, for example, of a set of components that have one joint influencing factor.
And here is what we call a Markov chain. But you see, in both of them, X2 to Xn each have exactly one parent, and X1 has no parent. Now, if you have n random variables, assume for simplicity discrete random variables with five possible states each. We need 5 to the power of n, minus one, parameters to describe the joint distribution, because there are 5 to the power of n possible combinations of states those random variables can take. If you wanted to specify that explicitly with 20 random variables, you would need on the order of 10 to the power of 14 numbers. With continuous random variables, unless again you can assume something simple like a Gaussian copula, you will have similar or even worse problems. So describing the joint probability distribution in the general case becomes very hard, or impossible. However, if we follow one of those two structures from before, representing each node, because it has just one parent, requires only five times five, 25 parameters; think here of the conditional probability tables. So the conditional probability of X2 given X1 is 25 values, of which five can be derived from the columns summing to one, so you actually need to specify only 20. But 20 or 25 doesn't really matter: you need a limited number, and for every additional node, you just need 25 parameters more. So you end up with a linear increase in the number of parameters needed to specify your model, as opposed to the exponential increase here. That's a very efficient representation, and that's obviously why Markov chains, for example, have been used for a long time, long before there were Bayesian networks. Those models were around and have been used long, long before Bayesian networks; the Bayesian network just has the advantage of being much more flexible.
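The counting argument can be checked in a couple of lines. A sketch for the lecture's example of 20 variables with 5 states each, counting free parameters (each CPT column sums to one, so one value per column is derived):

```python
# Free parameters: explicit joint distribution vs. chain-structured model.
n, m = 20, 5  # 20 random variables, 5 states each (the lecture's example)

full_joint = m**n - 1                     # explicit joint: 5^20 - 1 values
chain = (m - 1) + (n - 1) * m * (m - 1)   # root CPT + one 5x5 CPT per further node

print(full_joint)  # about 9.5e13, i.e. on the order of 10^14
print(chain)       # 4 + 19 * 20 = 384: linear growth in n
```

Every additional node in the chain adds 20 free parameters (25 table entries minus 5 normalization constraints), while the explicit joint multiplies by 5.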
So I can add nodes here and here and still have a similar advantage of being very efficient in representing our problem. This efficiency then carries over to advantages in computation; I'll discuss that after the break. But again, for me the main advantage is in modeling, and after the break I will focus mostly on how the Bayesian network is useful in actually modeling the problems at hand, because for computations you can also use software tools. Of course, some of you, if you want to use these things, may also need to program some things yourselves. But I think the important aspect to understand in this course is the modeling aspect. So I'll send around the link to a software; there are a number of software packages you can use. This one used to be completely free; now it's only free for academic purposes, but that's still okay for us. It's the GeNIe software. If you have downloaded it, you can also play around with it yourself. Here is the example I showed earlier, just to demonstrate that it really works like this. When you look at how you define those nodes, you have those tables: you define your states, and you add the probability of each state. And when you do that for a conditional variable, here the tsunami, it's a conditional table: given an earthquake, the tsunami probability is 0.1. So the only thing you have to do to solve the initial problem I posed is to construct the graph and then add those conditional probabilities. Then you can answer the question that was posed initially. First, what is the probability of damage? Well, it's very small; here you can hardly read it, but it's one in a thousand. And now let's assume there is an alarm, the alarm goes off.
Then I can fix the value here, set the evidence to "there is an alarm", and recalculate, and the probability goes up from one in a thousand to seven in a thousand. What this actually tells us is that even though there is an alarm, you can still be pretty relaxed about your house: the probability went up, but not by that much. Why? Because the information is very indirect. There is information, the probability went up by a factor of seven, but it's not as if we directly observed damage. You could also condition on the tsunami, or give information on the damage and ask: assuming there is damage, what is the probability that there was an alarm? You can do all kinds of computations like that. But in principle, if you have a problem that you can represent in such software, the only things you need to do are to construct the correct graph and to identify those conditional probabilities. Unfortunately, in real life it turns out to be quite challenging for most people to construct the correct graph. And it is, of course, also challenging to identify the conditional probabilities. But typically, once you have identified the correct graph, identifying the conditional probabilities becomes much easier. So in my experience, when people have problems, it's mostly because they don't identify the correct graph. Do you want a break now or later? No, it's okay, so we'll make a 20-minute break and continue at three. Or is there any question immediately? No, I thought so. OK. Sorry, yes? Now everybody hates you, you know? Yeah, that's a great point. Can we go back a couple of slides? Yes. We have the conditional probability table here; do you want me to move back? OK. This one? Yes, this one. In this case, why do we have X3 dependent on X1 alone?
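For those who prefer code over clicking through GeNIe, the same query can be solved by brute-force enumeration of the joint. A sketch; the CPT numbers are illustrative assumptions chosen so that the prior damage probability comes out at one in a thousand, and the exact posterior will differ slightly from the lecture's seven in a thousand:

```python
from itertools import product

# Enumeration solution of the warning problem (all numbers assumed).
P_E = {1: 0.1, 0: 0.9}                                # strong earthquake
P_T = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.0, 0: 1.0}}      # P(T | E)
P_W = {1: {1: 0.95, 0: 0.05}, 0: {1: 0.1, 0: 0.9}}    # P(W | E)
P_D = {1: {1: 0.1, 0: 0.9}, 0: {1: 0.0, 0: 1.0}}      # P(D | T)

def joint(e, t, w, d):
    # the Bayesian-network factorization: product of node-given-parents terms
    return P_E[e] * P_T[e][t] * P_W[e][w] * P_D[t][d]

def query(pred):
    return sum(joint(*s) for s in product((0, 1), repeat=4) if pred(*s))

prior = query(lambda e, t, w, d: d == 1)
posterior = (query(lambda e, t, w, d: d == 1 and w == 1)
             / query(lambda e, t, w, d: w == 1))
print(prior, posterior)  # the warning raises the probability, but only modestly
```

The structure mirrors what the software does: build the joint as a product of the conditional probability tables, then condition on the evidence by the definition of conditional probability.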
Why not X3 dependent on X2, as in our equation? Because it's a mistake; thank you. You're right, it's wrong: it should be X1 and X2, and this is also wrong. Actually, I think what happened is that I copied this together very fast, and this is the equation for a different network. So I'll leave this here, and after the break you can write down the correct expression yourself, OK? It looked similar because there were six random variables, but it's completely wrong; it's from a different network. Thank you. I'll give you enough time to solve your homework afterwards. Well, let's leave that for a moment. This was actually the right graph, so now the equation is correct. So if you see the equation, you can construct the graph, no? And if you see the graph, you can construct the equation. Now, I said that I don't want to focus on how to make computations, but I still want to give at least five minutes to a slight idea of how one can do computations, and also what the advantages and limitations are. It's called inference, or statistical inference. In general, what we try to do with the Bayesian network model is to solve a problem of the following form. Very generally, we have a joint distribution, p of x, described in the way I just showed you before. And then we are trying to learn something about one or more variables: for example, the probability of failure. The failure is not a random variable itself, but the condition of the system is a random variable, which is binary: either it survived or it failed. But more generally, maybe we are interested in a set of random variables.
Maybe it's the failure of multiple components, or performance; it doesn't have to be failure, it can also be whether the system performs well or at only half its potential, and so on. So we have a set of random variables that we are interested in, and a set of random variables that we can observe. Y are the random variables of interest, and Z are the variables that we observe. So, for example, in the example before, Z would be the alarm and Y would be the condition of my house. In the example of Jochen, Y would be the failure event, a binary failure/no-failure random variable, and Z would be the measurements taken of the steel strength. So Z are the observed variables. And everything we want to compute is always a conditional probability of Y given Z. If we have no observations, Z is just an empty set, and we have a marginal, unconditional distribution. But every query or inference we want to do is of this form; that's the very general thing we want to do. And in the Bayesian network, that is computed, I would say, quite naively, by looking at the definition of the conditional probability. And that's equal to (sorry, I'm used to having very big blackboards, so this is a bit challenging here for me) the probability distribution of Y and Z, the joint distribution, divided by the probability of Z; I'm not being strict with notation here. That's the definition of the conditional distribution. And again, in the general case, both the numerator and the denominator are joint probabilities over a subset of X.
So what we have to compute are joint probability distributions of sets of random variables that are subsets of all random variables X. And that's what is done by all these algorithms. I'm not going to explain it in detail because again there's no time, but let me make an example of what this amounts to. In this network, with the formula that we know now, let's assume I want to know the distribution of X5 given that I have observed X4. That translates to computing the joint distribution of X5 and X4, and the marginal distribution of X4. Now, how do we get that? This is very basic probability: if we want the marginal distribution, we take the joint distribution and sum over all the random variables we are not interested in. Think of a continuous two-dimensional random vector, where you are given the joint distribution and you want only the distribution of one of them — you have to integrate over the other. In the discrete case, you sum over the other. So we just sum over all those random variables: sum over X1, sum over X2, sum over X3, of the complete joint distribution. Now, think back to where I said that if we have, for example, 20 nodes in total, the joint distribution has on the order of 10 to the power of 14 states. Constructing that and then summing over it would not be computationally efficient. However, what these algorithms do — there are different algorithms; variable elimination is the most basic one, and there are many variants — is essentially this: we have to do these summations over all these random variables, and that would not be efficient to do directly.
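As a concrete illustration of this brute-force approach — the small binary chain network and all CPT numbers below are invented for the sketch, not the network on the slide — here is the query P(X5 | X4 = 1) computed by summing the full joint:

```python
import itertools

# Hypothetical chain network X1 -> X2 -> ... -> X6 with binary states;
# the CPT numbers are made up purely for illustration.
p_x1 = {0: 0.7, 1: 0.3}
# p_next[parent][child]: the same CPT is reused for every edge, for brevity
p_next = {0: {0: 0.9, 1: 0.1},
          1: {0: 0.2, 1: 0.8}}

def joint(x):
    """Chain-rule factorisation: P(x1) * prod P(x_{i+1} | x_i)."""
    p = p_x1[x[0]]
    for parent, child in zip(x, x[1:]):
        p *= p_next[parent][child]
    return p

# Query P(X5 | X4 = 1): sum the joint over all variables we are not
# interested in, then divide by the marginal P(X4 = 1).
num = {0: 0.0, 1: 0.0}                # joint P(X5 = x5, X4 = 1)
for x in itertools.product([0, 1], repeat=6):
    if x[3] == 1:                     # evidence: X4 = 1
        num[x[4]] += joint(x)

p_x4 = num[0] + num[1]                # marginal P(X4 = 1)
posterior = {k: v / p_x4 for k, v in num.items()}
print(posterior)  # in a chain this equals p_next[1], since X4 is X5's parent
```

Note the cost: the loop visits all 2^6 states, which is exactly the exponential blow-up the lecture warns about for 20 nodes.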
However, we can use basic mathematics and move the summations to the right. For example, the term for X6 appears only in the very last expression here, so I can move the summation over X6 to the end, the summation over X5 to here, and so on. And then, in this case, because this is the summation over all possible states of X6 of the conditional probability of X6, it ends up being one, no matter what the distribution is. So this term is just one, and you end up being able to simplify the expression. But even if you cannot simplify, this reordering ensures that you don't have to work with the entire joint probability distribution at once — you can localize the computations, and so you can deal with possibly very large models by doing local computations. If you have worked with Markov chains, the classical algorithms for solving Markov chains do essentially the same thing. The tricky challenge is to find the right way of ordering these summations and these terms, and the different algorithms that exist, like the junction tree algorithm, have different strategies for identifying the optimal way of ordering those summations and products. Okay, again, this is too fast, but what you should remember is this: this is what we need to compute, and you could basically write a Bayesian network software that doesn't have too many lines of code, because the basic operations are just that. Alternatively, we can do approximate inference, meaning sampling techniques, and that's particularly useful if you have continuous random variables. Most problems that we deal with are not discrete but continuous, and with continuous variables this idea of just doing summations obviously doesn't work.
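A minimal sketch of this idea, again on an invented binary chain (not the network from the slide): the naive route builds the full joint, while variable elimination pushes each summation inward and eliminates one variable at a time.

```python
import numpy as np

# Hypothetical binary chain X1 -> X2 -> X3 -> X4 with made-up numbers.
p_x1 = np.array([0.7, 0.3])
T = np.array([[0.9, 0.1],    # row = parent state, column = child state
              [0.2, 0.8]])

# Naive: build the full joint (2**4 entries) and sum -- exponential in n.
joint = np.einsum('a,ab,bc,cd->abcd', p_x1, T, T, T)
p_x4_naive = joint.sum(axis=(0, 1, 2))

# Variable elimination: eliminate X1, then X2, then X3 -- each step is
# just a small vector-matrix product, never the full joint.
msg = p_x1                 # start with P(x1)
for _ in range(3):
    msg = msg @ T          # sum_x  msg(x) * P(child | x)
print(p_x4_naive, msg)     # identical results
```

For a chain this is exactly the classical forward recursion for Markov chains mentioned above; the junction tree and related algorithms generalize the same trick to arbitrary DAGs.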
So we can discretize — which has its own challenges — or we can use sampling algorithms. There are many, many sampling algorithms as well. One of the simplest is likelihood weighting: ten lines, and you can implement it. It's also in the lecture notes. I'm not explaining it here, but just to show you that with very few lines of code you can run this. It's not a very efficient algorithm, but you can try it if you're interested — these are some of the lecture notes, so you can try it out yourself. This is the continuous version of the tsunami problem. It's more realistic now, because we actually have different types of earthquakes, and the warning depends on the magnitude of the earthquake. Okay. One class of approximate sampling algorithms that probably everybody has heard of is Markov chain Monte Carlo, MCMC — here is just an example. As you might be aware, those are techniques that try to approximate the posterior distribution — this here — by sampling from a Markov chain whose stationary distribution is the posterior distribution. So it starts off, and if you have a good MCMC algorithm, eventually it converges, and here we have samples from the posterior. This can be straightforwardly applied to Bayesian networks. And there is a special version of MCMC called Gibbs sampling — again, there are many versions of Gibbs sampling. In Gibbs sampling you sample one random variable after the other, conditional on all the others. The advantage of Gibbs sampling in a Bayesian network is that you can use the conditional independence properties: you just have to condition on the Markov blanket, not on all the others. So Gibbs sampling actually turns out to be quite efficient in Bayesian networks, if you ever want to use these things.
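To give a flavour of how short likelihood weighting really is, here is a hedged sketch on a toy version of the tsunami warning network (earthquake E causes both the warning W and the tsunami T; all probabilities are invented): non-evidence nodes are sampled from their conditional distributions, and each sample is weighted by the likelihood of the evidence.

```python
import random

# Toy network E -> W (warning), E -> T (tsunami); made-up probabilities.
p_e = 0.1                        # P(earthquake)
p_w = {0: 0.05, 1: 0.95}         # P(warning | earthquake state)
p_t = {0: 0.01, 1: 0.40}         # P(tsunami | earthquake state)

random.seed(0)
num = den = 0.0
for _ in range(100_000):
    e = 1 if random.random() < p_e else 0      # sample non-evidence nodes
    t = 1 if random.random() < p_t[e] else 0   # ...from their CPTs
    w = p_w[e]                                  # evidence W = 1: use its
    num += w * t                                # likelihood as the weight
    den += w
print(num / den)   # weighted estimate of P(T = 1 | W = 1)
```

The exact answer here is sum_e P(e) P(W=1|e) P(T=1|e) divided by sum_e P(e) P(W=1|e), which the weighted average approaches as the sample count grows.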
Okay. But MCMC, as you might know if you have used it, is also challenging: if you don't tune it well, you can get poor results. So all these algorithms have their advantages and disadvantages. The exact one has the advantage that, if it works, it is exact and can be very fast. But for continuous random variables you have to discretize, and discretization adds a lot of states, which might lead to problems you cannot really solve. The sampling algorithms are not limited very much by the size of the network or the number of random variables, but the more information you add, the worse they perform. And with MCMC it is a bit difficult to tell whether the chain has converged or not. So they all have their advantages and disadvantages, and if there are specific problems you want to know about, I am happy to answer your specific questions. All right, that basically concludes the first part of this lecture on Bayesian networks. To summarize — this slide is not from today, but from when I take more time to introduce this, so it's maybe good to repeat it quickly here. If you are new to this and want to study it in a bit more detail, what you need to understand are just four basic rules of probability theory: the concept of conditional probability, Bayes' rule, the chain rule, and the total probability theorem. Everything else follows from these. You also sometimes see the term DAG, D-A-G, which means directed acyclic graph — a concept from graph theory — and that is the causal network that we draw.
The nodes represent random variables, the links represent the dependencies between the random variables, and each random variable is described by a conditional distribution. Keep in mind that these nodes are random variables; sometimes it's a bit confusing because failure is an event — the corresponding random variable is a Bernoulli random variable that represents survival or failure. So sometimes it gets a bit mixed up, but in the Bayesian network we have random variables. The d-separation properties follow from the directed acyclic graph, and if you really want to go into details, or you have challenging problems, you need to understand those independence properties. If you have simple problems, you can maybe skip that. And this is something I didn't go into detail on. Now? Maybe, yes? Just to connect to what was discussed before: maybe we can represent the structural reliability problem from my example as a Bayesian network? Yes, we can do that now — it's a good idea; we would probably get to these hyper-parameters. Maybe I will show a few more slides first, if you're not falling asleep yet, and then we do that. No, I understand — just so that it's obvious to people that we are always talking about the same thing. No, no, I agree. So yes, it's a good point: I will show a few more slides that relate to these modeling aspects, and then you can actually make a Bayesian network for that problem. This comes back to the problem of causality, which is something I already touched on. These statistics here — it's a bit hard to see — are data showing a correlation between the number of storks, you know, the bird that brings the babies, and the number of births. This is the number of storks, and this is the number of births. So you see a strong correlation.
Probably some of you have seen this, because it's a very common example used to show that correlation and causality are not the same thing. So maybe some of you can explain what happens here: why is there a correlation between the number of storks and the birth rate? Many different countries of different sizes? Yes, data from 17 European countries. So there we naturally get a correlation: a big country has many storks and many births, a small country has fewer of both. Exactly. Now, this is a fairly obvious case, because we know that storks don't bring babies. I guess. I hope my children were not brought by a stork — I was there. So here we have what is called a confounding variable. As Jochen said, we have the land size: the number of storks is related to the size of the country, and the number of births is also related to the size of the country. And in the Bayesian network, you could just draw land size with a link to the number of storks and a link to the number of births — that's the causal model. I already drew this before, but I feel bad for all these trees, so I'm just redrawing it like this. So again: if you look at the first graph and say, okay, my Bayesian network is storks here and births there, with an arrow between them, and now the German economics minister comes to me and says, you know, we have a problem in Germany, we don't have enough children, we need more children — then you would suggest increasing the number of storks. And it's wrong, because if this were the causal model, that would be correct: increase the number of storks and we get more births. But the actual model is the one with land size — and land size is not the only confounding variable, but it is the biggest one here.
So with land size as a parent of both storks and births, this model tells us that they are dependent — yes — but changing the storks variable will not have any effect on the number of births. So it's important that we have the causal model. That's just to repeat this point, and it's what we should always keep in mind, even though the network doesn't have to be causal. Maybe I'll make one example here, and then we try Jochen's example afterwards. It's a bit small here, I'm sorry for that. But now we come closer to what most of us are dealing with. Let's assume we do an inspection of a reinforced concrete bridge, and we want to figure out whether there is a corrosion problem. I guess most of you are familiar with this: we have the reinforcement bars, and they can potentially corrode. At the beginning you don't see anything, because the corrosion happens under a layer of concrete. But if you use a so-called half-cell potential measurement device, you can pick up where corrosion has started even before you see something, because it measures the current — it measures the electrochemical activity. So the idea is that we have the condition of the structure, and we can do this half-cell potential measurement and/or just a visual inspection. Because if we wait long enough, if the corrosion has been going on for a while, we will also see it on the surface. You have of course all observed that on concrete structures in bad shape you see either stains of rust, or directly exposed bars, because the concrete has spalled off due to the corrosion. So if you wait long enough, you will see something; but earlier than that, you can already detect something with the half-cell potential measurement. Now let's make a simple Bayesian network on paper that represents this problem. Maybe just take one minute.
I think three nodes are all that is needed here, I hope, to model this particular problem: C, the condition; V, the outcome of a visual inspection; and M, the outcome of a half-cell potential measurement. Okay, so how does it look? What do you think? Corrosion — the condition — is the parent of V and M, okay? Do you agree? Okay, everybody agrees. I show this example because I am also working with the people in the materials group; they develop these half-cell potential measurements, and other types of inspection that can be done on these problems, and they struggle with how to combine the different sources of information. They work mostly in a more deterministic setting, but they even did some probabilistic assessment. The problem was that, while it seems straightforward and trivial to us, what they had in mind was exactly the opposite model: for them, the input is the inspection outcome, and from that they determine the corrosion. So the diagnostic model is what they had in mind. Now, I want to say two things with this example. The first is that the causal model is much easier to specify than the diagnostic one. But is the diagnostic model actually correct? Who thinks it is wrong? Nobody? Who thinks it is correct? The majority. Okay, those who think it's wrong — why do you think it's wrong? Yes, exactly. So this is one way of seeing it. Following what we had before: in the diagnostic model, if I have information about C, then the two inspection outcomes become dependent. But actually, if I know that it's corroded, the two tests should be independent. So for that reason the model is not correct. But you can also see it the other way around.
Let's say I go to the structure. A visual inspection I always do, in a way: I have to go to the structure, so I'm looking at it, and unless there's a coating, I can see the surface. Now say I'm going there to do a half-cell potential measurement, and as I arrive I see that it's heavily corroded. I will not even do the measurement anymore — but even if I did, I would expect it to show me that it's corroded. So information about the visual outcome should influence the measurement outcome. But in the diagnostic model it does not: as long as I have no information about C, the two are independent. And that's not correct. Now, I can make this into a correct model, but then I have to introduce an arrow between the two inspection outcomes, for example like this. Now it's a correct model, but it just means that there is no conditional independence at all — everything is dependent on everything — so this would be a correct model, but not an efficient one. If you link every node to every other, it's always a correct model, but not efficient; we see that the causal model is more efficient. But also, let's assume that for some reason you want the diagnostic model: how are you going to specify those arrows? How are you going to quantify them? This is a topic for tomorrow, because it concerns the quality of inspection outcomes. In the causal model, here we need the probability of corrosion. That's something we get from the probabilistic models we have of this deterioration process; it's independent of inspection, it's just the prior model of the corrosion. My colleagues in the materials department have these models — some of them developed these models — so that's available. What do we need here? The conditional probability of a visual inspection outcome given corrosion.
Now, we could do some tests, but actually that's pretty intuitive. What is the probability that you see corrosion given that there is no corrosion? Zero? Well, I wouldn't say zero, because it has happened that people go to the wrong bridge — or not necessarily the wrong bridge, but they report that something happened on one side when it actually happened on the other side. There were some studies done on bridge inspectors, and the quality sometimes turns out to be pretty bad; some actually don't go to the bridge at all, but just report something. So there is some probability, but let's assume that they do a good job — then it's zero. So even without doing tests, it's intuitive. Given that there is corrosion, it's a bit more tricky, because then we need to know the time it takes for the corrosion to actually become visible. But that's something they can also get from their models, because they have models of the ongoing corrosion process and how long it takes to become visible. So they can construct this conditional probability distribution. And here, the probability of the half-cell potential measurement outcome given corrosion — I'll actually speak a bit about this particular example tomorrow — is something they get from doing multiple tests: they first do the half-cell potential measurements and make a prediction, and then they open the concrete to see what the actual condition is. The point is that this is conditional on the condition of the bridge, so it can be learned from, let's say, five or six existing structures and then transferred to other bridges.
So it does not depend on the bridge; it depends only on the technique I'm using and how it's calibrated. This conditional probability I can learn on some bridges and then transfer to my new bridge where I want to make a prediction. And with that I have completely defined the causal model. If I wanted to define the diagnostic model instead — besides the question of how I would possibly define the probability of a measurement outcome without knowing anything else — I would have to define the conditional probability of corrosion given my half-cell potential measurement outcome and given my visual inspection outcome. But that's actually what I want to predict; it's not what I want to input to the model, it's what I want to get out of the model. So that shows that the diagnostic model, even if made correct, doesn't make sense. Keep that in mind, also for tomorrow when I introduce these models of inspection quality: we want to work in this causal context. Once you explain it to people, it's obvious to them, but before you actually show it, it's often not clean in their minds. Okay, so if you do that — all the conditional probabilities are specified — you can then compute the result. This is the outcome you can get: conditional on my different results from the inspections, I get different probabilities of corrosion. And this table would be the required input for the diagnostic model. Okay, maybe now we can go to Jochen's example, if it's still in your mind. We have not introduced the decision yet, so just assume we have a given design, but we have the information from the measurements, and we want to predict the probability of failure of the system. Try to just draw a Bayesian network for that problem. Maybe five minutes should be enough, I think.
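As an aside, the "diagnostic table" just described can be reproduced in a few lines. The prior and the two inspection CPTs below are invented numbers, not the ones from the actual deterioration and inspection-quality models:

```python
from itertools import product

# Toy version of the corrosion network: C -> V, C -> M.
# All probabilities are made up for illustration.
p_c = 0.2                      # prior P(corrosion)
p_v = {0: 0.00, 1: 0.30}       # P(visible signs = 1 | corrosion state)
p_m = {0: 0.10, 1: 0.85}       # P(measurement indicates = 1 | corrosion state)

def bernoulli(p, x):
    """P(X = x) for a Bernoulli variable with success probability p."""
    return p if x == 1 else 1.0 - p

# The "diagnostic table": P(corrosion | V = v, M = m) for every outcome,
# obtained by Bayes' rule from the causal (prior + likelihood) inputs.
for v, m in product([0, 1], repeat=2):
    num = p_c * bernoulli(p_v[1], v) * bernoulli(p_m[1], m)
    den = num + (1 - p_c) * bernoulli(p_v[0], v) * bernoulli(p_m[0], m)
    print(f"P(C=1 | V={v}, M={m}) = {num / den:.3f}")
```

Note how the assumed zero false-positive rate of the visual inspection makes any V = 1 observation conclusive: the posterior jumps to one, exactly as discussed above.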
And then we can discuss that problem together — maybe there are different solutions. Okay, I've seen that some of you are still arguing, some of you have come to some conclusion, and some of you are trying to solve the homework. So let's start and see what different options we can come up with. Let's forget about the measurements first; we just have a basic, simple structural reliability problem. What are the components that we have? We have the load, Q. We have the resistance, R. And we have W, if you want to include it. So how do we connect those? Maybe I'll do it here as well. Q and R are parents of the failure — of the condition, yes? Yes, okay. Actually, I like what you had as well. We also have this W, and I'm going to make it a box, because W is not a random variable — it's a decision, a design variable in this case. So we're already making a kind of influence diagram, but I want to include it now because I like what you did. One of your colleagues did this: we can either, as you just said, have those three as parents of the condition, or we can first introduce an intermediate node — call it the capacity — and then the condition. And I'm a bit imprecise, as I said, because F is the event, not a random variable. Yes, but F itself was defined as the event earlier, so — anyway, this is the condition. The reason I like to introduce this capacity variable is the following. In principle, you can forget about it and just draw arrows from W, R, and Q to F, and it's the equation we had — beta equals mu R and so on; it just tells me F is a function where these three enter. The reason I like to do it this way is a technique which in Bayesian networks we could call divorcing: we divorce this parent from these two parents.
In principle, F has three parents, but we divorce them, and now it has only two. And remember from before: the number of parameters we have to specify in the conditional distribution increases exponentially with the number of parents, so if you can reduce the number of parents, that is more efficient — plus it can help to clarify the network. So those two, W and R, build the capacity, and then we have Q. Okay. Now, how do we introduce the measurements? It's an observation of the resistance, yes? So I would have an observation node — and would it have an arrow to R, or from R? If you drew the arrow to R, you would again end up with the diagnostic model and all the problems we discussed before. So there should be a link from R to the observation. So let's say: R, and then something called the observation. However, I want to be a bit more specific, because the observations are really five quantities. In this model, R is normally distributed; this is just a parameter; this one is also normally distributed, conditional on those; and this, if you think of it as the safety margin, would also be normally distributed — or it is a Bernoulli for the failure event. These are all well defined. But if I just say "observation", that is not well defined: the observation is either a mean value, as you had in your example, or actually five individual values. So how do we include it? How is the mean value linked to R? Now we have to think about the problem. R is my strength — here it represents the yield strength of the material multiplied by the area — and the observations are samples taken from the same batch of steel. I'm looking at you because that's the way I remember the example.
So we have the mean value of five samples taken from the same batch of material as the particular specimen that we're installing in the bridge. How are those samples linked to R? Is it the prior and only the observation? Yes and no. It does depend on the observations, but if you introduce the observation as a parent of R, we are again in the same situation — it's the same suggestion as before. Well, you have to keep in mind that in the Bayesian updating context, R itself has a prior distribution and a posterior distribution, but we do not have a prior parameter and a posterior parameter. The parameter itself is the same, because the steel is the same — it hasn't changed. Our knowledge changes, and it changes by introducing these observations, but we cannot have a prior and a posterior random variable; that does not exist. We just have a prior model and a posterior model. Yes. So in this case, the way the model that Jochen presented works is that we have a mean value that is common to the whole batch, and that mean value itself is a random variable — that's the one we actually learn about. Was it M or was it R? M-R: the mean value of this particular batch. And the samples are taken from this batch. So how do I introduce those samples? M-R should cause them — yes, M-R causes these observations. What are the observations called? X bar. Okay. We could use X bar, but I prefer to say that we actually observe X1, X2, and so on. X bar is okay because we assume normal distributions, and then X bar is a sufficient statistic, basically.
But to make it more general — so that it works also with other types of distributions — I would introduce the observations separately, because then it is not sufficient to just know the mean value. If everything is normal, as Jochen had it, then it is sufficient to take just the mean value. More generally, we have these individual observations, and we assume them to be independent conditional on M-R — that's an implicit assumption that you had. Yes, and that's basically it, I think. And you see: at least for me, when Jochen presented the example and I had to understand what the meaning of these observations is, a graph like this helps me a lot to understand what the information actually is, how I interpret it, and what it means. Of course, we can just use the nice equations that you give and implement them, but in order for me to really understand what's going on, I would always try to make a graph like this. Now, going back to the computation: in your example, everything is normal, so we can just use the fact that all of these variables — except the failure node — are jointly normal. And if you take not the failure but the safety margin, even that is normal: failure is a function of the safety margin, so everything is jointly normal, and you can forget about all these Bayesian network algorithms, because you can do the analysis directly like this. But if it happens not to be normal — because in reality these things are not normally distributed — there are no easy analytical solutions and we have to do numerical computations; then something like this can really help. Yes, yes, of course. The influence diagram will come; I'll probably leave you time for the homework later.
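For the all-normal special case just described, here is a hedged numerical sketch — all numbers are invented, so this is not Jochen's actual example: the conjugate update of the batch mean M_R from n samples, followed by the failure probability from the normal safety margin M = R - Q.

```python
import math

# Invented numbers for illustration only.
mu0, sig0 = 300.0, 30.0     # prior on M_R (batch mean of yield strength)
sig_w = 20.0                # within-batch std. dev. of individual samples
x = [315.0, 290.0, 310.0, 305.0, 320.0]   # five measured samples
n, xbar = len(x), sum(x) / len(x)

# Posterior of M_R: precision-weighted combination of prior and data
# (the sample mean xbar is a sufficient statistic in the normal case).
prec = 1 / sig0**2 + n / sig_w**2
mu_n = (mu0 / sig0**2 + n * xbar / sig_w**2) / prec
sig_n = math.sqrt(1 / prec)

# Posterior predictive R ~ N(mu_n, sig_n^2 + sig_w^2); load Q is normal too,
# so the safety margin M = R - Q is normal and P(F) = Phi(-beta).
mu_q, sig_q = 200.0, 40.0
mu_m = mu_n - mu_q
sig_m = math.sqrt(sig_n**2 + sig_w**2 + sig_q**2)
pf = 0.5 * math.erfc((mu_m / sig_m) / math.sqrt(2))
print(f"posterior M_R: N({mu_n:.1f}, {sig_n:.1f}^2), P(F) = {pf:.2e}")
```

The posterior mean lands between the prior mean and the sample mean, weighted by the respective precisions — exactly the analytical shortcut that makes the Bayesian network machinery unnecessary in the jointly normal case.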
So I will introduce the influence diagram tomorrow. We have to add the decision node here, and we also have to add the consequences, the costs. Yes, this one. So you are asking how your computation relates to what is shown here — is that your question? Basically, what we have now in your case is that we have observations here. Thank you, it's a good question, because it hopefully clarifies something. In the posterior case, we actually have observations, and sometimes you see these E's, indicating that there is evidence — that's the term used. So we have evidence here, and we fix those nodes. A priori we don't have them; that would be the case before we actually have the measurements. But then we get the measurements, and we fix those. Now, if I go back to an earlier slide — here. X are all those variables; Y is this here, because I want to know the probability of F; and my Z are these observed nodes. So in the general case, if I want to compute the probability of failure conditional on my observations, I have to compute the joint probability of F and this data, and divide that by the probability of getting this data. Now, that looks very different from the solution that Jochen showed, because his is a special case. But if I think of this as a generic case — let's think of the discrete algorithms — I would discretize all these variables. In this case that would actually be possible, because this is relatively straightforward to discretize. So I could discretize all the random variables — then I'm not bound by normality or anything — and I would have the joint distribution, and I could put it into GeNIe or any other software.
I would fix those observation nodes as evidence, and I would check what the new posterior probability of F is. The algorithms I showed would then basically eliminate the intermediate variables to compute the joint distribution of F and these X's. Jochen's solution is to use the fact that in this particular case there is an analytical solution for the posterior distribution of M-R; then there is a distribution of R, and then we can directly compute this. Yes? Because — yes, I guess this is how you can interpret it: you have different batches, and even the same supplier might produce better steel on some days and worse steel on other days. I'm not a big expert in steel, but it's the same with all types of materials. So the idea, the way I understand it, is that this is the variability from one producer to another, or from one day to another, and then you have a within-batch variability. That's basically what is represented here. But then, with another decision node on M-R, I think it would be sufficient to represent the information from the data as just one node, X bar, and then we would have a decision node N here that goes into it, and conditional on N we would have different distributions for M-R, if you get evidence. A priori we had no evidence on X bar, so we used the prior, but the variability is reduced due to N. So conditional on the choice of N, you have these different distributions: if N equals zero, you have the entire a priori case, but as soon as N is at least one, even if you don't yet have the evidence, the variability goes down.
And then later on, when we come to the influence diagram, we can construct the value of information — at least the preposterior analysis — from there. But here, of course, only if you fix N to five can we get evidence on all five. Basically we can just skip the evidence here, or give evidence to only one, two or three of them, or give evidence to all five — these are just hypothetical. As long as I don't give evidence on those, they have no effect on the remaining variables, which is exactly as it should be. So, okay, yes. X bar — X bar could be affected like that, because this N is connected directly to the parameter MR. Yeah, and that is related to X bar. Yeah, that's another arrow that you should also add to X bar. Yes — I think that somehow makes this impossible. But I suggest, sorry, I suggest we discuss the influence diagram tomorrow, and we'll continue with this when I bring the influence diagram. We'll have to be careful, because it's not consistent — I'm also not fully agreeing with your solution. So we do that tomorrow. Exciting. All right. Okay, so I want to finish the part about Bayesian networks and show another example. I like this one — I've worked a lot with Bayesian networks in different applications, and I want to show this one because it's kind of different and still actually very much related to ECC. This was work done with colleagues at the DLR, the German Aerospace Center. They have a lot of satellite images and try to do something with those images. The task here was to come up with an algorithm for automatic detection of flooded roads after a flood event. This is more for remote areas. The idea is that they provide maps for emergency response: they want up-to-date maps of where emergency teams can actually go and where they cannot go.
Now, they used to do this mostly manually, and they want to replace that with automatic algorithms. So they want something like this, where green means okay, not flooded; yellow means we don't really know; and red means you cannot go. The blue one is where there has been a change — it used to be flooded, now it's not flooded anymore. So there is a temporal component to the whole thing. That's what they want. And what do they have? They have their satellite images — optical images. This here is just a grey scale, but of course they have basically four channels, four grey channels for different colors. But there can also be clouds. This is not a real cloud, as you might realize, but there can be clouds, and this is optical, so if there's a cloud you can't really see anything. So this is one information source that they have. The second one is a digital elevation model. And if you know that a point is the top of a mountain, then whatever your satellite tells you, it's not very likely to be flooded. So you should somehow combine that information. They had various algorithms to combine that information, but it was more like giving scores: the satellite gives me some score, the digital elevation model gives me a score, and they somehow combine those scores — it was not a very consistent way of doing it. So I was speaking with one of these guys, and he made this into his PhD topic. He developed a Bayesian network model for this problem. I will just show you the result, otherwise I have to explain too much about the problem. It's a classical problem of combining different sources of information, and that is really where the Bayesian network can be very helpful. So how is the information that they have connected? You see the information is here and here.
And this is a simple representation of how that information is connected. The real model is a bit bigger, but this is the concept. Essentially you have a flood, and whether or not a particular point is flooded depends on the elevation. It depends on other factors as well, of course, but those are factors that are not available and are not considered in this analysis. But for sure the elevation has a causal effect on whether it's flooded or not. Then, whether it's flooded or not has an effect on what the satellite can see, because water has a certain reflection — we are interested in roads, so if it's a road it has a certain reflection, and if it's water it's a different type of reflection. So what we can see depends on that. But it also depends, for example, on clouds — to keep this example simple. It would also depend on trees if you are in a forest, and on other things, but the cloud here is just an example. So what is visible depends both on whether it's flooded and on the cloud. Then what I can see translates into what I actually measure, and these are the grey channels — what is physically measured by the satellite, by the image. This is what I observe, and what I want to infer — my Y, if you want — is whether or not it's flooded. The nice thing about this is that I now have a simple model that consistently combines the two types of information: one, as you see, is a parent of what I'm interested in, and one is a descendant. The scores that they used before were not too bad, but they were obviously not able to be completely consistent. Here we have a completely consistent model, and that is what is now implemented.
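The structure just described can be sketched as a tiny per-pixel network — Elevation → Flooded → Visible ← Cloud, and Visible → measured channel. All states, names, and probabilities below are invented for illustration; the DLR model is larger and uses real data. The posterior for "flooded" given the elevation and a dark channel reading is obtained by full enumeration:

```python
from itertools import product

P_CLOUD = {0: 0.7, 1: 0.3}                      # P(Cloud)
P_FLOOD = {('low', 1): 0.6, ('low', 0): 0.4,    # P(Flooded | Elevation)
           ('high', 1): 0.05, ('high', 0): 0.95}

def p_visible(v, flooded, cloud):
    """P(Visible | Flooded, Cloud); Visible in {road, water, obscured}."""
    if cloud:
        return 1.0 if v == 'obscured' else 0.0  # cloud blocks the optics
    p_water = 0.9 if flooded else 0.1
    return {'water': p_water, 'road': 1 - p_water, 'obscured': 0.0}[v]

P_DARK = {'road': 0.2, 'water': 0.85, 'obscured': 0.5}  # P(channel=dark | V)

def posterior_flooded(elevation):
    """P(Flooded = 1 | Elevation, channel = dark) by enumerating the joint."""
    num = den = 0.0
    for flooded, cloud, v in product((0, 1), (0, 1), P_DARK):
        w = (P_FLOOD[(elevation, flooded)] * P_CLOUD[cloud]
             * p_visible(v, flooded, cloud) * P_DARK[v])
        den += w
        if flooded:
            num += w
    return num / den
```

With these invented numbers, a dark reading at low elevation pushes the flood probability from the 0.6 prior up to roughly 0.76, while at high elevation the same reading leaves it small — the two information sources (parent and descendant) are combined consistently in one model.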
So I like to show this because, if you have worked a bit with these models, they can really be used in different contexts, and they can help you understand problems that seem trivial once you have the solution, but are actually not so trivial before you come up with the model. The actual implementation was a bit more involved — the guy had to do a PhD, so this alone was not sufficient; he had to do some more things. The actual implementation is more complicated because there is also a temporal and a spatial component. What I showed is just for one pixel at one point in time. But there is dependence between different pixels and between different points in time: if a point is flooded now, it's likely to be flooded also in one hour; if this point here is flooded, then 100 meters down the road is also likely to be flooded, and vice versa. So there is spatial and temporal dependence, and that is not really a causal dependence. You could think of the temporal dependence as some kind of causal dependence, but the spatial dependence is not really causal, no? You would not say that the flooding here is causing the flooding there, or vice versa. So the spatial dependence is not naturally represented by a Bayesian network. He therefore ended up translating this Bayesian network into an undirected graph, a Markov random field, which is a different type of graphical model. But the Bayesian network was the basis, and it was then translated into this type of model for computation. And of course you can imagine that the whole image, over multiple time steps, gives a very large number of random variables. Anyway, okay, so it was implemented, and the last point I want to make — which also connects to tomorrow — is: how do we evaluate the model in this example?
One example is this one from Namibia, because they had data from a particular event there. We also introduced artificial clouds, because at that point in time it was not cloudy. In most real flood events we actually have clouds; here there were none, which was good because we had better information, but on the other hand we wanted to have some clouds, so they were introduced artificially. So here we have a ground truth, meaning we know the truth more or less, and we have the images, so we can compare the truth with what we predict. We consider different cases — not clouded, more clouded, and so on. And then it comes to the comparison. Now, who knows what ROC means? Nobody? Okay — if I tell you that this here is the false positive rate and this is the true positive rate, who has seen this type of curve before? Nobody? Really? Okay, good. So tomorrow I will explain it in more detail — something useful, you should listen tomorrow. But very briefly now, so that you can understand. First of all, it's a classification problem. We reduce it to binary: flooded or not flooded, and we make a prediction that says flooded or not flooded. True positive rate means: do I correctly identify a flooded area as flooded? Given that it is flooded, what is the probability that I predict flooding? And the false positive rate: given that it is not flooded, what is the probability that I nevertheless predict it as flooded? For emergency response purposes a lot of false positives can be very bad: if I think everything is flooded when it is not, that will not help my emergency response. So I want to minimize both errors: ideally I want a 100% true positive rate and a 0% false positive rate.
Now, there are curves here. I'll explain this in more detail tomorrow, but basically I can change the setting of my decision rule. The algorithm predicts a certain probability that a point is flooded, but then I have to make a decision: do I show it as red or green? For example, we might say that if the predicted probability of flooding is more than 50%, we show it as red, and otherwise as green. If I change this probability threshold, I change my prediction, and I get different points along this curve — more details tomorrow. Basically, different calibrations of my algorithm give me different points here. I can have a higher true positive rate, so I'm better at identifying the areas that are actually flooded, but it always comes at the cost of also increasing the number of false positives: I'm also predicting more places as flooded that are not actually flooded. At this extreme, I can always classify everything that is flooded correctly by simply saying that everything is flooded. Then I have a 100% true positive rate — my POD, my probability of detection, is 100% — but my false positive rate is also 100%, so I end up at this point here. The same happens at the other end: if I say nothing is flooded, my true positive rate is zero, but my false positive rate is also zero. In between, I try to get as close as possible to this corner point. And these are four algorithms that are compared. The original one is the black one — the one that used the scoring function — and then there are three versions of the Bayesian network. You see that they actually improve the performance quite significantly. We don't get perfect accuracy — we can't predict everything perfectly — but we get closer to the optimal case. Okay.
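The threshold sweep just described can be sketched on synthetic data (invented scores, not the DLR results): each threshold on the predicted flood probability yields one (false positive rate, true positive rate) point, and sweeping it traces the ROC curve from (1, 1) down to (0, 0).

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic ground truth and predicted flood probabilities:
# flooded pixels tend to receive higher scores than dry ones.
y = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.concatenate([rng.beta(4, 2, 500),   # flooded pixels
                         rng.beta(2, 4, 500)])  # dry pixels

thresholds = np.linspace(0.0, 1.0, 101)
tpr = np.array([(scores[y == 1] >= t).mean() for t in thresholds])
fpr = np.array([(scores[y == 0] >= t).mean() for t in thresholds])

# Threshold 0: declare everything flooded -> (FPR, TPR) = (1, 1).
# Threshold 1: declare (almost) nothing flooded -> near (0, 0).
# Empirical area under the curve, via the rank statistic:
auc = (scores[y == 1][:, None] > scores[y == 0][None, :]).mean()
```

Note that the trade-off is monotone: raising the threshold can only lower both rates together, so a better algorithm is one whose whole curve sits closer to the ideal corner (FPR = 0, TPR = 1), summarized by a larger AUC.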
Tomorrow I will show you that the ROC is exactly what we also need to quantify the quality of any type of monitoring. Here the monitoring can be the actual device, or, as in this case, it is not just a device: the monitoring is a satellite plus an algorithm, and the prediction made by this algorithm is in a way my monitoring. Okay. So this was just a nice example of a Bayesian network. I think I will skip most of the rest and just show one more thing. There are some additional examples in the lecture notes — say, here's a pipeline, and I inspect different parts of the pipeline, and what is the effect on the remaining elements, similar to what we had before. I'm skipping all this. Just one last point: there is something called a dynamic Bayesian network, and the example I showed you before is already a dynamic Bayesian network. In a way this is an extension of a Markov chain. Typically, dynamic Bayesian network means that I have a process in time or in space — but mostly in time — and it can be used to model deterioration quite naturally. What we have to do is specify just one time slice; we call this a time slice, and for each discretized time step we add one more slice. Computation time increases linearly with the number of time steps. There is maybe one more thing that should be said here, which is about the Markov property. The assumption here — if you interpret these arrows correctly — is that if I know my random variables representing, for example, time two, then the past becomes independent of the future. This follows directly from the d-separation properties, and it is also what we know from a Markov chain.
In a Markov chain — this is the nice thing about it — when I know the present, the past and the future are independent of each other. In many real problems that is not the case. That is, however, handled here by a trick: we have what are called time-invariant random variables. For example, one reason why Markov chains do not actually represent the deterioration processes in many real systems is that some of the uncertain parameters do not change over time. Let's take the corrosion example from before. The development of corrosion depends on the amount of chlorides on the surface, and I have uncertainty about that amount. Let's assume there is some variability in time, but there is also a mean value that is uncertain and does not change over time — it depends on the general exposure, which I could measure if I wanted to, but before I do that, I have uncertainty. That uncertainty is constant over time: if I have a higher exposure to chlorides at the beginning, I will also have a higher exposure at the end. And that is in contradiction to the Markovian assumption, because the Markovian assumption tells us that once I know today's state, whatever happened in the past does not affect the future. But if that factor is actually constant over time, then if it was low at the beginning of the lifetime, it will also be low at the end — so there is a dependence in terms of information. The trick is to introduce those time-invariant random variables as if they were time-variant, where the conditional probability distribution given here — in a discrete setting — is just the identity matrix.
So if the exposure here is 10, it will also be 10 here with probability one; if the exposure here is three, it will be three there with probability one. This trick is called state augmentation. I pay a price, because I increase the size of my state space, but I can make the whole thing Markovian, which it otherwise would not be. Then we have the time-variant random variables — those are the things that do change over time, like the fluctuating part of the chloride concentration or the fluctuating part of the humidity. And finally I have my damage variables: the amount of corrosion, the fatigue damage, and so on. This model here is introduced not so much for modeling reasons — even though it could be — because this is a relatively straightforward problem where you don't really need a Bayesian network to model the whole thing. Here we introduced it for computational reasons, to do Bayesian updating. At each point in time we introduce observables. Here the assumption is that we observe, or perform inspections of, the damage variables: we try to measure whether it is corroded, whether there are cracks from fatigue, or the wear of the material — how much material is left. But of course we could also measure some of the time-invariant random variables, or some of the loads — some of the time-variant random variables. We could measure the weather, for example. So we could also add evidence here and here and here. And the good thing about some of the inference algorithms I mentioned — here we use discrete variable elimination — is that however much evidence you give, the computational cost is not going to increase. So if we have a fast algorithm, we can include as much information as we want; the computational requirements are the same. So here we can compute things like this.
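As a rough illustration of state augmentation — and of why extra evidence does not increase the cost — here is a toy forward filter over the augmented state (Θ, D), where Θ is a time-invariant exposure level and D a growing damage state. All states and numbers are invented, not the lecture's model:

```python
import numpy as np

n_theta, n_d = 2, 3
prior_theta = np.array([0.5, 0.5])            # P(Theta): low / high exposure
prior_d = np.array([1.0, 0.0, 0.0])           # start undamaged

# P(D_{t+1} | D_t, Theta): damage only grows, faster under high exposure.
grow = [0.1, 0.4]
T = np.zeros((n_theta, n_d, n_d))
for k in range(n_theta):
    for d in range(n_d):
        if d < n_d - 1:
            T[k, d, d] = 1 - grow[k]
            T[k, d, d + 1] = grow[k]
        else:
            T[k, d, d] = 1.0                  # worst state is absorbing
# Theta's own "transition" is the identity matrix: that is the augmentation.

# Inspection likelihood P(Y | D): 80% correct, errors split evenly.
L = np.full((n_d, n_d), 0.1)
np.fill_diagonal(L, 0.8)

belief = prior_theta[:, None] * prior_d[None, :]   # joint over (Theta, D)
for y in [0, 0, 0]:                                # inspect, see "no damage" 3x
    belief = np.einsum('kd,kde->ke', belief, T)    # predict one time step
    belief = belief * L[:, y][None, :]             # condition on inspection
    belief /= belief.sum()

p_theta_high = belief.sum(axis=1)[1]
```

Each step costs one matrix contraction plus one reweighting, regardless of how much evidence is conditioned on. And the time-invariant variable is learned along the way: repeated "no damage" findings argue for the low-exposure value of Θ, so its posterior probability of high exposure drops below the 0.5 prior — exactly the information dependence that the plain Markov chain would miss.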
The probability of failure or the reliability index. Here we hypothetically assume, in this fatigue problem, that we inspect every one million cycles, and we have good inspection results, so the reliability always goes back up. You could also use crude Monte Carlo, other more advanced sampling techniques, or even FORM. But the reason we like the Bayesian network here is that the accuracy and the speed of the computation are not affected by what type of information I give. And a value of information analysis, as we will see, requires us to repeatedly compute the probability of failure or the reliability under different potential future inspection or monitoring outcomes. There it is very helpful to have such a fast and robust — particularly robust — algorithm. So we use this here for computational reasons, not so much for modeling reasons; that can also be the case. But I would still say that for me the main purpose of using Bayesian networks is the modeling, not so much the computation. It helps us to understand how things are related — in particular, when we speak of information, what depends on what — because it is very easy to be tricked into double counting information in Bayesian analysis, or into thinking that things are independent while they are actually strongly dependent. You can make a lot of errors, as we saw before, and the Bayesian network really helps us to understand that. And it helps us to communicate, also with people who have little or no background in probability. When I speak with the people who develop the health monitoring techniques, or the materials people, it helps me to communicate with them. So that's the thing. Oh — and you can do this whole thing object oriented.
You can think of objects like a bridge: in each bridge object you would have a different Bayesian network, and those are connected by input and output variables, so you can build a large Bayesian network using these object-oriented concepts. You can do this in GeNIe, I think, and other software also allows it: you create objects and you can reuse those objects. And the last thing I want to mention — really the last thing now. It's not so relevant for what we do, but Bayesian networks were developed mostly in the artificial intelligence community, and in that community — machine learning is now a big topic — the Bayesian network is often used as a tool to learn something from data. Not the way we do it, where we construct the Bayesian network using causal reasoning; instead, they have a lot of data — and you typically need a lot of data for this. So it's not so relevant for us. But if you have a lot of data, you can learn the structure of the Bayesian network from the data. This is an example where we learned a Bayesian network for wildfire prediction. What you do is look at the independence properties in the data: okay, if I know this variable, these two become nearly independent, so there must be d-separation properties like this. It's actually quite complex, and it's also very challenging — you really need a lot of data to be able to do it. But it is something that is also done, just to mention it. In some communities, when people speak of Bayesian network learning, they actually mean learning the graphical structure of the network from the data. If you're interested in that, I can give you some references.
I have not done much work on this myself, with some exceptions, but I can give you some good references if you're interested. As I said, it's not so relevant for what we do, because we typically have quite strong phenomenological or causal knowledge — we really know what affects what — and we typically do not have a lot of data, which means we typically cannot learn just from data. What we can learn from data, of course, are the conditional probabilities. Once we have the structure, learning the conditional probabilities is relatively straightforward; learning the structure is more difficult. Okay, so tomorrow there will be a short lecture on influence diagrams — short because we have actually covered most of the ingredients already. It will be more interactive: we'll try to construct the influence diagram for the example we have been dealing with here, and I'll bring another example. That will be tomorrow morning, the first thing we do. What is on the schedule as modeling the quality of information will be postponed a bit, but we will catch up — it's okay. Are there any questions or comments? Yes? Yes. Oh, it depends — but yes, in general. At that point, the probability distribution — yeah, well, okay. If you can discretize, then that's what I would do. "If you can discretize" means that your state space does not explode, because the problem with discretization is that these are vectors here: if you end up with too many random variables, this will explode. The size of the joint table increases exponentially with the number of random variables, so you're actually quite limited in what you can do — unless you have additional conditional independence structure within those. So there is a limitation. If you can do it and it works, then it is surely the best thing. However, if you actually cannot discretize, because there are too many —
— I mean, if you have a realistic problem, you can often reduce the problem size by first doing a sensitivity analysis and finding that maybe just three or four random variables are actually relevant. In many cases people come with hundreds of random variables, but you can actually forget about 95% of those, because they do not strongly affect the uncertainty. But if you still end up with too many random variables, so that discretization would lead to too large a problem, then you have to use approximate algorithms. In that case there is still some gain from this structure, but maybe not so much. You can also just use brute-force Monte Carlo, for example, if you have no evidence; or, if you have evidence, you can try a more advanced sampling technique that can also do Bayesian analysis. So: if you have a limited number of random variables, discretization is for sure the best. If the problem becomes too large, you have to switch to approximate algorithms, and in that case there is only a limited benefit from this structure, I would say. It can still help, but if you want more details, we can also discuss separately. Which variables do we need to put into the network? You mean in general? Yeah, okay. Typically we first start with the modeling process. We sit here and say, okay, these are the variables of interest, and on a piece of paper you just include everything — that's the first thing I do. Then, if you can do Monte Carlo, I would just run a Monte Carlo analysis — without any evidence — if it's cheap. If it's expensive, run some sort of advanced reliability method; FORM is also not a bad choice in many cases, actually.
So we run a straightforward reliability assessment and figure out the sensitivity of the prediction — if it's reliability, the sensitivity of your probability of failure with respect to the input random variables — and see whether the uncertainty in some of those input random variables actually has very little effect. Typically I recommend the FORM sensitivity measures, but there are other sensitivity measures you can use to understand the effect of the uncertainty. For example, say you have not only Q but multiple loads, and one of them is the dead weight: typically we know it has a limited effect, because the uncertainty in the dead weight is usually quite small. So maybe we figure out that the uncertainty in G has a very small effect on the probability of failure, and then we can say, okay, G can be represented by a deterministic variable, because its effect is limited. Now, there is no single answer, because if you are going to use an approximate sampling-based inference algorithm, it doesn't really matter how many variables you have inside — then you just leave all of them in and compute everything. But if you want something that is fast and efficient and you want to discretize your random variables, the number of random variables makes a big difference in computational speed, so you really want to reduce it first. In that case you would exclude this G variable based on an initial sensitivity analysis. And just to clarify: when you have no evidence, running a Monte Carlo analysis in such a model is very straightforward, because the way you sample from it is to follow the hierarchy of the model.
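A minimal sketch of this "follow the hierarchy" (ancestral) sampling, with invented distributions — the names m_r, r, q are only loosely inspired by the slide and the numbers are made up: sample the parents first, then each child conditional on its sampled parents, and estimate the probability of failure from the samples.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Ancestral sampling: roots first, then children given the sampled parents.
m_r = rng.normal(5.0, 0.5, n)          # uncertain mean resistance (parent)
r = rng.normal(m_r, 1.0)               # resistance, conditional on m_r
q = rng.lognormal(0.5, 0.3, n)         # load, an independent root node

failure = r < q                        # deterministic limit state g = r - q
pf_mc = failure.mean()                 # crude Monte Carlo estimate of P(F)
```

This works for any Bayesian network, discrete or continuous, as long as no evidence is introduced — with evidence, plain Monte Carlo becomes inefficient, which is exactly the point made above.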
You sample this one first, then you sample this one conditional on your sample here, then this one conditional on that sample, and so on, and you get one sample of this. So a Monte Carlo analysis is always straightforward — also with discrete variables, no problem. Monte Carlo only becomes infeasible, or at least inefficient, once you start introducing evidence. But to do the sensitivity analysis, you can typically just do the case without evidence first, and that is what I would do. That also helps you to verify your model, because it is easy to make a mistake: program the whole thing in Python, or whatever you use, run a Monte Carlo simulation to verify your results, do a sensitivity analysis, check the effects. This also helps you to see whether your model makes sense, whether maybe there is a mistake — it is cheap and can almost always be done. And what I didn't say yet: some of these relations can also be purely deterministic; it doesn't have to be probabilistic. For example, this here is a deterministic relation: once you fix R and W, the capacity is given. That is not a problem. But you could also have a finite element code in here — you have, for example, some material properties, and as a function of those you predict some deformation. Maybe that finite element model, or a complete fluid-structure interaction model, takes a lot of time to run; then Monte Carlo may not be an option. But in almost any other case, Monte Carlo is a good option. No more questions? Then thank you for your patience.
And we have 15 extra minutes to work on Jochen's assignment, and we'll meet again tomorrow morning at 8:30. Don't drink too much, and see you tomorrow.