Life's problems are, for the most part, problems of probability, of making statements about things whose truth you don't fully know. This is a quote that goes back to Pierre-Simon, the Marquis de Laplace, at one time, I think, Chancellor of the Republic of France during the revolution. He was probably the father of probability theory. Actually, he is. He wrote the book on the subject, published about two years after Gauss. And this is the first book that actually contains the theorem that we now call Bayes' theorem, which we discussed last Monday, and which Kolmogorov formalized into a proper mathematical theory. So we worked out on Monday that this theory reduces to just two very simple rules. If you make statements about variables A and B and other variables, then there is one rule which allows us to get rid of the variables we don't want to reason about, so that we only make statements about a subset of the variables. That's called the sum rule. And for that, we just have to sum over all the possible values that the other variables might have. And then there is a rule that effectively allows us to make a statement about one variable based on observations of another variable. That is the product rule, which says that we can write the joint probability of two variables as the probability of one variable times the conditional probability of the other variable given the first. And that can be rearranged to give us Bayes' theorem. This mechanism seems extremely simple if you write it down like this. But first of all, it's actually a genuine extension of propositional logic, in the way that is briefly sketched out here again, and which you may already have investigated in your exercises.
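As a small illustration of the two rules just described, here is a minimal sketch for two binary variables. The numbers in the joint table are made up for the example, and plain NumPy is assumed rather than any code from the lecture.

```python
import numpy as np

# Hypothetical joint distribution p(A, B) over two binary variables.
# Rows index A (0 or 1), columns index B (0 or 1); the values are illustrative.
p_ab = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

# Sum rule: get rid of a variable by summing over all its values.
p_a = p_ab.sum(axis=1)  # p(A), summing out B
p_b = p_ab.sum(axis=0)  # p(B), summing out A

# Product rule: p(A, B) = p(B | A) p(A), rearranged for the conditional.
p_b_given_a = p_ab / p_a[:, None]  # entry [a, b] is p(B=b | A=a)

# Bayes' theorem: p(A | B) = p(B | A) p(A) / p(B).
p_a_given_b = p_b_given_a * p_a[:, None] / p_b[None, :]  # entry [a, b] is p(A=a | B=b)

print(p_a_given_b)
```

Each column of `p_a_given_b` is a distribution over A for one fixed value of B, so each column sums to 1.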
And just to show that it's genuinely a proper way to think about logic, here is the modern father of logic, George Boole, whom you all know from your undergraduate computer science lectures, talking about probabilistic reasoning in his treatise on the laws of thought. So the person who wrote the book on logic actually already knew about Bayes' theorem. He writes down these principles of probability, which basically come down to our rules of probability. If the probability of the occurrence of an event is p, then 1 minus p will be the probability of its non-occurrence. So that's the sum rule, essentially. Or maybe it's an axiom of probabilities. Then the probability of the concurrence of two independent events is the product of the probabilities of those two events. We'll have to talk about this today. And then the probability of the concurrence of two dependent events is equal to the product of the probability of one of them by the probability that if that event occur, the other will happen also. So this is 1854. People didn't really know how to write math properly back then, so he had to write a sentence. Let's think about this again. The probability of the concurrence of two dependent events, so p of A and B, is equal to the product of the probability of one of them, p of A, by, so times, the probability that if that event occur, the other will happen also, so p of B given A. Or the other way around. And then he says the probability that if an event E take place, an event F will also take place, is equal to the probability of the concurrence of the events E and F, divided by the probability of the occurrence of E. So that seems almost like the same statement, right? But Boole apparently thought it was necessary to write both of them down.
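Boole's four sentences can be transcribed into modern notation as follows. This is a transcription made here for readability, not Boole's own notation, and the last identity assumes p(E) > 0.

```latex
\begin{align*}
  p(\neg A) &= 1 - p(A)
    && \text{(non-occurrence)} \\
  p(A, B) &= p(A)\, p(B)
    && \text{(two independent events)} \\
  p(A, B) &= p(A)\, p(B \mid A)
    && \text{(two dependent events: the product rule)} \\
  p(F \mid E) &= \frac{p(E, F)}{p(E)}
    && \text{(the product rule, solved for the conditional)}
\end{align*}
```

Written this way, the last two lines are visibly the same identity, rearranged.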
So what we're going to do today, and just for today, because this is the last time we'll talk about discrete random variables, is to think about how powerful this notion is, but also why it might make things a little bit hard. And the reason why it makes things hard is maybe easiest to point out with a slide like this, which I have, again, taken from an earlier version of this course, which I used to teach with Stefan Harmeling. This slide is directly from him. So if you've done statistical machine learning so far, if you've trained neural networks, then you know that what you have to store at the end is the set of the weights of the network. That's just a vector of numbers. If you wanted to store every possible vector of weights that the network might have, that's a much, much harder task. But for probability theory, we kind of have to do that. We have to keep track of all of the possible hypotheses. So let's keep this simple. Let's think of a situation where we have 26 variables that we'd like to keep track of, 26 because the alphabet has 26 characters, so that's easy to think of: a, b, c, d, e, all the way to z. Some of those might be observations, some of those might be variables you want to reason about. In general, if you want to write down a joint probability distribution over those 26 variables, then that's a function of 26 things. And let's keep it simple and say they are binary, so they are all either 0 or 1. Then we need to assign entries to all but the very last of the elements of this array. You can think of this as an array, basically, right? An array of dimension 26, but you can also ravel it into one long array that contains 2 to the 26, which is 67,108,864, numbers. They are all going to be between 0 and 1. We know that they sum to 1, so the very last one we don't need to store, actually, but that doesn't help us at all. It's just one less.
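The count quoted above is easy to check. This is a small sketch, where the choice of 8 bytes per entry (a 64-bit float) is an assumption made only for the memory estimate.

```python
# Size of a full joint table over n binary variables.
n_variables = 26
n_entries = 2 ** n_variables  # one entry per joint configuration
n_free = n_entries - 1        # the entries sum to 1, so the last one is implied

# Memory estimate, assuming each probability is stored as a 64-bit float.
bytes_per_entry = 8
mebibytes = n_entries * bytes_per_entry / 1024**2

print(n_entries, n_free, mebibytes)
```

Every additional binary variable doubles the table, which is why the rest of the lecture looks for structure that avoids storing it in full.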
So the fundamental problem of probability theory is that if you want to keep track of every possible hypothesis about the world, you're going to have to store all of them individually. And 26 is really not that much, right? And it's binary. So it's like the easiest possible situation you could think of. So fundamentally, the problem here is that if we try to hedge our bets, if we try to keep track of lots of possible answers, then we're going to have to store all of these. And that, as a first insight, will typically mean, if we talk about discrete variables, that we have to keep track of massive arrays of possible answers. Once those variables are not discrete anymore, but maybe real-valued, things get even more complicated. We'll have to talk about this next Monday. And that's why a large part of working with probabilities involves thinking about the structure of probability distributions and about algebraic notions for keeping track of them elegantly or efficiently. Today, we'll mostly think about algebraic structure in those distributions, which is related to the notion of independence, which Boole already mentioned in his treatise. And then from Monday onwards, we'll think about what to do with continuous random variables. You can probably imagine that if you want to make a statement about a continuous object, you can't possibly write down a probability for every single value it might have, because there might even be uncountably many such options. So we'll need to find some parameterized form of constructing distributions over random variables. But for discrete ones, we can still think more about the algebraic object. And we'll do that for the entirety of today's lecture. What this basically boils down to is making statements about arrays and structure in those arrays, so that we can write those arrays efficiently somehow. OK. And actually, I don't yet know how much time I'm going to need for that.
So it might be that today is one of the shorter lectures, or one of the easier ones. Maybe I'll also lose a lot of time by writing stuff on the whiteboard. The first thing I need to point out is that today is the first time I'm going to start messing with notation. I'm going to mess with notation for the entire duration of the course. And I know that notation is often a problem for this particular course, because probabilities are hard to work with. So people have come up with all sorts of convenient notation that can be confusing at first. So the first change we're going to make concerns the elements that go into a probability distribution, so into this function p, which operates on the sigma algebra. The variable inside of the brackets is now going to be, at least for the moment, a binary random variable that can take the values 0 and 1. On Monday, we made statements about propositional formulae A and B and C. And they were all true or false. And then I sometimes wrote p of non-A and p of A, because that makes the connection to Boolean propositional logic particularly clear. From now on, I'm going to think of p as a function, just like in the programming language sense. So it's a function that takes in an input, a variable that we call A. And if that variable is a binary random variable, it can take the values 0 and 1. But when we write A, it just remains this abstract variable that can take on values. So in particular, this means that when we make statements about this function p of A and B, and say, for example, that it factorizes in a certain way, that we can write p of A and B as p of A times p of B, then that means this is true for all possible values of A and B. Not just for A equal to 1 and B equal to 1, but for all of them. So for binary random variables, this means all four of these combinations.
And if those variables could take more than two values, 0 and 1, if the array has more than two entries per axis, then this is supposed to still hold as well. With this, I've actually already made the definition of something you've probably all seen before. So this is one of these bits where we are still catching up on things that you've probably seen in previous lectures. This algebraic structure, of course, makes things particularly easy to think about. Because if you wanted to know what the marginal distribution of A was in this particular setting, a first observation you can make, and one we're going to talk about for the entire lecture, is that you don't need to know about B. You can just look at p of A. You don't even need to evaluate p of B at all. A formal way of thinking about this is: if you want to sum out B according to the sum rule, then you put a sum over B in front of this p of A times p of B, and you can move the sum inside, onto p of B. But we know what the value of that sum is. It's 1, because p of B is a probability distribution, so it sums to 1. So we just multiply by 1 and we're done. And we never even touch this other array. This situation is called, of course, independence. I could have asked, but it's 8 AM. So we are going to call two random variables independent if, and only if, their joint distribution can be written in this factorized form, or if this distribution factorizes into two so-called marginal distributions, p of A and p of B. And then I'm going to use this notation, which is not standard. This one is actually due to Stefan Harmeling. But you just need a symbol for it. There isn't a generally accepted symbol that says A is independent of B, or that this definition holds. So the classic example for this is the statistician's most favorite toy, two coins that you flip independently, and "independently" already means that they are independent. So this bit is boring. I'm not going to waste much time on it.
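The factorization above says that an independent joint table is an outer product of its marginals. Here is a minimal sketch of that statement, with made-up marginals and a hypothetical helper `is_independent` introduced only for this illustration.

```python
import numpy as np

# Illustrative marginals for two binary variables (values chosen arbitrarily).
p_a = np.array([0.7, 0.3])
p_b = np.array([0.2, 0.8])

# Independence: p(A, B) = p(A) p(B) for ALL values, i.e. an outer product.
p_ab = np.outer(p_a, p_b)  # p_ab[a, b] == p_a[a] * p_b[b]

# Summing out B just multiplies p(A) by sum_b p(B=b) = 1.
marginal_a = p_ab.sum(axis=1)

def is_independent(joint):
    """Hypothetical helper: does a 2D joint equal the outer product of its marginals?"""
    return bool(np.allclose(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0))))

print(marginal_a, is_independent(p_ab))
```

Note that the check compares every entry of the table, not only the entry for A = 1 and B = 1. A table such as `[[0.5, 0.0], [0.0, 0.5]]` has uniform marginals but fails the check.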
We noticed on Monday that conditional distributions are also probability distributions, because they also satisfy the axioms of probability. So we can construct a similar definition, which is a little bit more interesting, for what's called conditional independence. We're going to say that two variables A and B are conditionally independent given the variable C if, and only if, their conditional distribution factorizes. So not p of A and B, but p of A and B given C. Let me look at my watch. Maybe a first question to you, to make sure you're paying attention: you can wonder for yourself whether this is a weaker or a stronger statement than independence between A and B, or whether one of them implies the other. Is it weaker? In which sense? So which implies which? So I'm going to repeat the answer. We cannot actually infer one from the other. Independence between A and B does not imply conditional independence. And conditional independence does not imply independence. And I guess what you're saying is that this feels like a slightly weaker statement, because you first need to know this other thing, and then they become independent. And I guess this intuition is good, although, of course, it depends on what C actually is, whether it's something we can expect to know or not. So an interesting situation to think about is the following. Let me just think about how I do this, ideally. OK, I'll first show you a picture. It's the following setup. Let's imagine there was some fancy mechanism that someone set up in a box that you can't see. Let me imagine this is actually a locked box that I can't open either. Let's say there's a thing inside here, a little machine that does the following. There are two coins that get thrown independently of each other, and now they've fallen. And then some camera looks at the coins and checks which side they show. Now there's also a bell in this box.
If the two coins show the same side, if they're both heads or both tails, then the bell rings. And if they're not the same, the bell doesn't ring. But we know when this happens, because we can hear the coins being tossed. So I asked Stable Diffusion to make a little image of a bell. OK, and then what I'm often going to do, and I'm going to talk about what this formally means in a moment, or at the end of this lecture, is draw a little picture like this. For the moment it's just a picture, but you can see that it's a graph which contains three variables, A and B and C. A and B are the states of the coins. And C is the outcome of the bell, whether it rings or not. So let's first, without doing math, think about our intuition for this situation. First of all, are A and B independent of each other? Yes, because I said they are thrown independently. And actually, it says so on the slide, so yeah. So is C independent of A? No? It feels like it shouldn't be, right? It should actually depend on A, because we're looking at the coin to decide whether we ring the bell or not, right? OK. And the other way around, is C independent of B? It's the same situation, right? It's totally symmetric in A and B. Is B independent of C? So maybe this is the first point to note. If we decide that one variable is independent of another, so if C is independent of B, then this has to work the other way around as well, right? Then B also has to be independent of C, because of the definition of independence. It's just a product, right? A product between two random variables. Sorry, it's a product between two real numbers. This is the first time I make this mistake. I'm going to make it many more times. So the right-hand side is the product of two real numbers. The entries inside are random variables. But these are two numbers that get multiplied with each other, and real numbers commute.
So you can multiply them this way around, or that way around, so both statements hold. OK. So what I'm taking away is: the intuition was that A and B are independent of each other for sure. A and C somehow feel like they should not be independent of each other, because we make C by looking at A and at B. And now let's think about conditional independence. Is A independent of C given B? Some people are shaking their heads, and that's good, right? Let me just check what I said, and I'm going to write it down so I don't get confused myself. I'm asking, is A independent of C given B? Question mark. We've got the time, so let's make sure everyone understands. What that means is, if I tell you what B was, so if I told you the second coin was heads, can you then make statements about A and C independently of each other? No, right? Because you now know for sure what C is going to be if you know A. So if B is heads, then if A is heads, the bell rings. And if A is tails, the bell does not ring, for sure. So clearly, this is not true. What about the other way around? Is C independent of A given B? Just a sanity check that everyone's paying attention. Let's make sure everyone really got it. No, of course not. Because this is the same statement. But this is the moment in the semester to make this clear. Why? Because what does this statement mean? It means this, which is on the slide. It means that we can factorize these distributions like this. And because this factorization is a statement about a product of real numbers, we can exchange the elements in that product. So this is the exact same thing. I've asked questions like this on exams, and a surprising number of people get it wrong. So this is the same thing. They imply each other. OK. Now maybe a more interesting final question. We said that A is independent of B, right? Because we throw these two coins independently. What if I condition on C?
So if I tell you the bell has rung, what about A and B? They become perfectly dependent on each other. So by revealing information, you can make things dependent on each other that were previously independent. And this is such a profound statement. It seems really simple, but it makes sense to think for a moment about what this means. First of all, intuition. Sure, if I tell you, oh, the bell has rung, now you know the two coins show the same side. They're either both heads or both tails. You still don't know whether they're heads or tails, but there's only one random variable left, because they're both the same now. Does this make sense? I have an actual coin here, a big one, two euros. A similar experiment is if I throw this coin twice, like on the ground, and you don't see what the outcome is. Let me check for myself. If I were the bell, I would not ring. Now what do you know? You know that they show different sides, right? You know something about them. And if I told you what the first one was, you would now know what the second one is. Similarly, if I take this coin and let it fall so that you don't see it, but I let it fall from this high, and I do it again, so twice, what do you know about the outcome of these two throws? They're the same, because there was no turning around, right? But you don't know whether they're heads or tails. So you're still uncertain. But your uncertainty is somehow reduced, because I did the same thing twice, pretty much. This is a really profound thing, actually, as simple as it seems, because it says something about the nature of uncertainty. It's not about randomness. What I just did was completely non-random, right? It's about your state of knowledge, about what you know, about how much information you have. The second observation about this is that it's not about causality. This is not a statement of causality.
So what we just observed here is that two things can become dependent on each other by revealing information, even though they were originally causally independent. The causal process was these two coins. They get thrown independently of each other. And we assume that they're fully independent of each other. So the causal structure is that there are two things that happen that have nothing to do with each other. Now if I give you a certain kind of information afterwards, whether the bell has rung or not, then they become dependent on each other in the language of probabilities. So the word dependent does not mean causally dependent. It just means this, right? Or actually the negation of this, right? The two sides are not equal to each other. Is this clear? Any confusion? So now let's see if we can actually do this in code. First of all, can you read this? Or is it too small? OK, I'll try and zoom in a bit more. The problem is, of course, that then we see less and less of the code and I'll have to start scrolling around. So I tried to recreate this. It's a bit silly to do this, because we're just talking about these eight different numbers. So it's a lot of Python code for very little. But I tried to be very, very clear about it to get the structure across, even though the computation is completely trivial. So I'm very carefully using jax.numpy properly, because it's 2023 and we're going to use JAX a lot over the course of this term. This is not the kind of computation for which you need JAX at all. You could just use NumPy. But I want to be consistent from the start and kind of give you advance warning that there'll be more JAX coming. So what I did here, let me see if I can show you the whole thing, and I have to admit I didn't look at this code again before the lecture, so I need to check myself. I'm defining the two random variables A and B, and then the conditional distribution for C given A and B, and then a joint distribution. Why?
Well, because by the product rule I can write every joint distribution over three variables, and this is true for every probability distribution, as p of A times p of B given A times p of C given A and B. So what I've done here is I've just given an arbitrary ordering to these three things, A, B, C. I decided that this one is going to be the zeroth one, this is the first one, this is the second one. Then I start from the left with a marginal, times a conditional on the previous variable, times a conditional on all of the previous variables. And I could keep doing this with more and more random variables. This works for every joint probability distribution because of the axioms of probability theory. So that means I can write this joint thing by writing this down. Now in this case, I know something about this distribution. I know that B doesn't depend on A. And am I going to show you this, or am I going to ask you to think about it for yourself? OK, so think about it for yourself, maybe. Independence means p of A and B is equal to p of A times p of B, and then this implies, and maybe you can convince yourself in like two minutes, that the conditional is equal to the marginal. It's a one-line proof. So p of B given A is equal to p of B. Thanks. OK, there's no proper chalk here. So in this case, I know something about this distribution. I know that I can get rid of this "given A". That's nice. Why? Because to write down this thing, p of B given A, how many numbers would I need to define in general? Four, or actually three, because the fourth one is like one minus all the other ones? Careful, actually two. I need to say how likely B is if A is true, and how likely B is if non-A is true. And then the other ones are just one minus these. Does that make sense? So instead of writing down two numbers, I only have to write down one number, p of B. And then p of non-B is one minus p of B.
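Written out, the factorization and the one-line proof left as an exercise above look as follows. This is a sketch in standard notation, assuming p(A) > 0 so that the conditional is defined.

```latex
\begin{align*}
  p(A, B, C) &= p(A)\, p(B \mid A)\, p(C \mid A, B)
    && \text{(product rule, applied twice)} \\
  p(B \mid A) &= \frac{p(A, B)}{p(A)} = \frac{p(A)\, p(B)}{p(A)} = p(B)
    && \text{(using } p(A, B) = p(A)\, p(B) \text{)} \\
  p(A, B, C) &= p(A)\, p(B)\, p(C \mid A, B)
    && \text{(the general factorization, simplified)}
\end{align*}
```

For binary variables, the general factorization needs 1 + 2 + 4 = 7 free numbers, which matches the 2^3 - 1 free entries of the joint table. With A and B independent, it needs 1 + 1 + 4 = 6.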
So this actually is equal to p of B. And then the other one is still the full thing. So I do this here. We have a function, actually, for pA and pB. Why a function? Because probability distributions are functions. So I want to make this clear and write down that this is actually a function. And this function depends on one parameter, which I put inside in this case. And it's a coin, so it's like 50-50. And this here is just the NumPy way of defining a function that acts on an array. Or it's one of the possible ways of doing it. Then I write the conditional distribution. I decided to use this notation for this particular case. I'm not going to make much use of this notation over the course of the term, it's just for today. So this is going to be the function that defines the conditional distribution for C given A and B. And it, well, does the right thing. And why is this the right thing? You can think about it for yourself. And then, after that, I've defined the whole thing. This is what we call a generative model, because it defines the joint distribution. I can write down the joint distribution, p of A and B and C, by multiplying those functions with each other. So far, I've created functions that can be evaluated, not actually an array. Now I fill this array. And to be honest, this is probably not the smartest way of doing this, because it's completely unreadable. This is also probably one of the very first pieces of JAX code I ever wrote. So it's probably bad. There'll be nicer JAX code later. So what I do here now is, I create an array, p, which has three dimensions. First of all, they are just called dimension zero, dimension one, dimension two. But in a moment, I'm going to think of them as dimension A, dimension B, dimension C. And then all of these variables will take the value zero or one.
So now I'm broadcasting those across all three of these axes and evaluating this function, which defines the joint distribution. So now I have a 2 by 2 by 2 array that contains eight numbers that get filled in by these functions. Now to make things really clear, and this is sort of array-centric programming, if you like, to use a fancy word, I'm defining names for the axes, because then I can refer to them in code that looks kind of neat. So I'm just saying the zeroth axis is called A, the first axis is called B, and the second axis is called C. We're doing this so carefully to make clear what this notation actually means that we use to denote p of A, B, C. p of A, B, C actually means there's an array that has three axes called A, B, and C, in the discrete case. And then I can, for example, compute marginals, p of A and B, p of A and C, p of B and C, by using the sum rule. And the sum rule now actually is a function. It's called sum out C, sum out B, sum out A. And then I call some fancy function that creates an output that you can look at, because otherwise looking at three-dimensional arrays is a big mess, so I'm going to make some plots. I can also compute the marginals over individual variables, A, B, and C, by summing out two variables. The marginal of A is the sum over B and C. And the same for the others. And finally, I can compute conditional distributions by using the product rule. So p of A and B, given C, is the joint distribution, p of A, B, C, divided by p of C. That's what the product rule says, right? p of A, B, given C times p of C is p of A, B, C. So we divide by p of C to get the conditional. And now what you may see here is that I actually evaluate individual entries here at the end. Can someone guess why that is? Sanity check to see whether you're still following. Probably not. So let me remind you, we're keeping track of every possible value of these variables, not just of whether A is true, but also of whether it might be false.
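The construction described above can be sketched as follows. This is a reconstruction under stated assumptions, not the lecturer's notebook: it uses plain NumPy instead of jax.numpy, and the names `coin`, `bell_given_coins`, and `AXIS_A`, `AXIS_B`, `AXIS_C` are hypothetical, chosen here for readability.

```python
import numpy as np

def coin(x, theta=0.5):
    """p(x) for a binary variable: theta if x == 1, else 1 - theta."""
    return np.where(x == 1, theta, 1.0 - theta)

def bell_given_coins(c, a, b):
    """p(C=c | A=a, B=b): the bell rings (c == 1) exactly when a == b."""
    rings = (a == b)
    return np.where(c == 1, rings, ~rings).astype(float)

# Names for the three axes of the joint array p(A, B, C).
AXIS_A, AXIS_B, AXIS_C = 0, 1, 2

# Broadcast the values {0, 1} along each named axis and evaluate the joint
# p(A, B, C) = p(A) p(B) p(C | A, B) at all eight configurations.
values = np.arange(2)
A = values[:, None, None]
B = values[None, :, None]
C = values[None, None, :]
joint = coin(A) * coin(B) * bell_given_coins(C, A, B)  # shape (2, 2, 2)

# Sum rule: pairwise marginals and single-variable marginals.
p_ab = joint.sum(axis=AXIS_C)
p_ac = joint.sum(axis=AXIS_B)
p_bc = joint.sum(axis=AXIS_A)
p_a = joint.sum(axis=(AXIS_B, AXIS_C))
p_c = joint.sum(axis=(AXIS_A, AXIS_B))

# Product rule: p(A, B | C) = p(A, B, C) / p(C); index 1 selects "bell rang".
p_ab_given_c = joint / p_c[None, None, :]
p_ab_given_c1 = p_ab_given_c[:, :, 1]

print(p_ab, p_ac, p_bc, p_ab_given_c1, sep="\n")
```

In this sketch all three pairwise marginals come out as constant 0.25 tables, while the slice for C = 1 is a diagonal table with 0.5 on the diagonal.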
So these arrays have, in each dimension, two entries, the zeroth entry and the first entry. And the zeroth entry is the case where A is false. So in the very first dimension, A might be false or true, then B might be false or true, and C might be false or true. But I actually only care about what happens if C is true. So I'm evaluating the first element, because that gives me the probability for A and B given that C is true. There's some nodding. And the same here and here. Otherwise, I would still get big arrays that are actually functions, which I could probe, but then I can't plot them nicely. OK, and now I just actually run this code. And let me hope that it actually runs today. I haven't run this Jupyter notebook yet. Here we go. And it produces these outputs, because I made this fancy plotting function. Actually, I'm going to upload this Jupyter notebook later. I forgot to upload it. So what we can see is that indeed the joint distribution for A and B looks like this. Does this look like what you expected it to look like? So this is the outer product of the vector (one half, one half) with itself. That's what an independent distribution looks like. The point is not that it's the same everywhere. The point is that it's an outer product. And the same holds for A and C, and for B and C. So why is that? Well, it's because if A is true, so if A is heads, let's say, then we still don't know what B is. And the outcome of C is then just the outcome of B, or the reverse of it if A is tails. So 50-50. The marginal distributions for A and B look exactly like you would want them to look. And so does C, of course, because we sum over those arrays. And that is also what you would think. So it's the coin flip, 50-50. And now we have conditional distributions. A and B given C, is this an independent distribution? This diagonal matrix? No. But there is only one person going, no. That's not independent? That looks independent. It's diagonal. Diagonals are cool, no? Diagonal matrices are good?
No. That's not independent. That's perfectly dependent. One way to think about this is to ask what happens if I know which row I'm in of this matrix. A is the row and B is the column, because the rows come first. So if I know what A is, then I know that I'm in the first or the second row. And then I can clearly read off what the other variable is going to be. So I get some structure by deciding where I am in this matrix. So this is not an independent distribution. It's a dependent distribution. It's actually a perfectly dependent distribution, even though there are 50% entries in there. Why? This might be a good way to check whether you're confused or not. If I told you that A is 1, then how do I read off the probability for B given C? I need to divide by something. You might be confused that there is a 50%, because that row doesn't sum to 1. How is that? Why does it not sum to 1? Well, because that's not yet a conditional distribution for B. For that, you need to divide by the probability of what you were told, p of A given C. And that is 50%. So if you divide by it, we get 1. OK. So this maybe is a good point to go back to our slide and check whether our intuition was right. So first of all, A and B are independent of each other. A and C are independent of each other. And B and C are independent of each other. Weird, no? That's against the intuition. So even though causally, A actually contributes to C, because it's one half of the causal mechanism that makes the bell ring or not, they are independent of each other. Because you can't predict C if you know A and just A. And again, it seems so simple, because the example is so simple, but that's a fundamental statement. Statistical dependence and independence is not causal dependence and independence. And then we also noticed that A and B are not independent of each other anymore when we condition on C.
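The normalization step just discussed can be made concrete. This is a minimal sketch in plain NumPy that starts from the diagonal table for p(A, B | C = 1) described above, with variable names chosen here for illustration.

```python
import numpy as np

# p(A, B | C=1) for the two coins and the bell: rows index A, columns index B.
p_ab_given_c1 = np.array([[0.5, 0.0],
                          [0.0, 0.5]])

# Sum rule inside the conditional: p(A | C=1) and p(B | C=1).
p_a_given_c1 = p_ab_given_c1.sum(axis=1)
p_b_given_c1 = p_ab_given_c1.sum(axis=0)

# Product rule inside the conditional: divide each row by p(A=a | C=1) = 0.5
# to turn it into a proper distribution over B.
p_b_given_a_c1 = p_ab_given_c1 / p_a_given_c1[:, None]  # entry [a, b] is p(B=b | A=a, C=1)

# What the table would have to be if A and B were independent given C=1.
factorized = np.outer(p_a_given_c1, p_b_given_c1)

print(p_b_given_a_c1)
print(factorized)
```

After normalization each row puts probability 1 on B = A, which is the "perfectly dependent" reading. The factorized table would instead be 0.25 everywhere, so the conditional distribution does not factorize.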
So two things that are causally independent of each other become statistically dependent on each other once we observe additional information. Fundamentally, this is, again, the statement that probabilities are not statements of randomness. They are not about dice and coins. They are about states of knowledge, about how much information you have collected. And information can make things dependent on each other, even though they were originally independent of each other. And then the other ones hold as well. Actually, that's maybe also interesting. A is dependent on C given B, right? Because if I told you that the second coin was heads, and then I told you whether the bell rang or not, now you know what A is, and vice versa. So simple and yet so complicated. So probabilistic reasoning is an extension of propositional logic. So you maybe expect it to be a bit complicated, because logic is complicated. And in fact, it's actually more complicated than logic. Because in logic, we only keep track of the truth values of a bunch of variables in a deductive fashion. If every variable is either true or false, then by the laws of propositional logic, we can derive statements about the final variables, and then we know what their values are. But if we allow the incoming variables to be true or false with a certain probability, then we have to keep track of all possible outcomes. And that might make things much harder. But we also saw in examples like this that we can save some computation if we know something about independence or conditional independence in a joint distribution. And maybe that's the moment where we take this five-minute break that we talked about. It's now 5 past 9. I'll start again at 10 past 9. So don't run out. You can stretch. If you want to ask a question, now is a good moment.
OK, now what we're going to do for the rest of today's lecture is to look at one other example, an additional, almost trivial example, beyond two coins and a bell, which is also taken from someone else's book. But in this case, not from Stefan Harmeling. It's a very, very famous example by Judea Pearl, which was reused by David MacKay in his book, which some of you may have seen before. It's really good to go through because it highlights the differences between causal and probabilistic reasoning, between what knowledge does to you and what causality does to you. And here is the story. It's a very California story because it comes from Judea Pearl. So there's a guy who lives in the San Francisco Bay Area. And what that means is, A, he has to drive to work for a really long time in his car and listen to some stupid public radio station on the way. And B, there's a lot of crime because, I don't know, maybe he lives in Oakland. So now he goes to work at some fancy tech company. But Judea Pearl wrote this text in 1988. There weren't really fancy tech companies there yet. But let's imagine. And while he sits at his flex desk, he gets a text message on his phone telling him that his alarm at home went off. Back in 1988, the story actually goes, his neighbor calls him to tell him that his alarm went off. But these days, of course, you just get a notification. The story actually gets better by assuming you get a notification on your phone. Oh, your Nest alarm has gone off. Might be a break-in. So, concerned, this guy jumps into his car, starts the way back, two hours across the Bay Bridge towards Oakland, and hears on the radio that there's been an earthquake, because there are a lot of earthquakes in that region. He goes, maybe that's why the alarm went off, because earthquakes tend to trigger alarms, and gets a little bit calmer. So should he continue to drive towards his house? Or should he think, ah, earthquake, whatever. It happens all the time.
The alarm always goes off. So this is a very nice story, because it neatly lays out the kind of reasoning process that humans do all the time. Because it relates quantities to each other that are partly causally linked and partly not. And it gives a good insight into the conditional independence structure that arises from these kinds of situations. So we're going to describe this in one of these graphs. So far, I haven't drawn the graph yet. We're just creating it. And these graphs work like this. We first identify all the variables in the problem. Here there are four variables. There is the alarm that goes off or doesn't go off. We call that A for alarm. There is the earthquake, which happens or doesn't happen. We call that E for earthquake. Then there is whether there was a burglar in the house or not. Call that B for burglar. And then there's the radio announcement, which informs him that there was an earthquake. And actually, that radio announcement is a bit unnecessary, because it just provides information that there was an earthquake. And the radio is pretty reliable. It's not really a random process. If there's an earthquake, they'll say something on the radio. And if not, then they won't lie about it. But still, we introduce a variable called R for radio. So that means we now need to write down, in a second step, the probability of everything, of these four random variables. That's our second step in defining a probabilistic model. So we need a function that operates over binary variables. So there's going to be a four-dimensional array with two entries along each axis. So 2 to the 4 is 16. But we actually know that they sum to 1. So one of them doesn't matter. We only have 15 degrees of freedom. And we're going to write down P of A and E and B and R. And now we can actually write down this distribution in an arbitrary fashion using the product rule.
So if we use this mechanism here, then we can write things like this. And that's just a way of writing down an array that contains 16 numbers. And maybe one intuition for why this works, why the product rule works this way, is if you think of populating an array in a piece of code. One way to do that, like classically with old-style Python, is in list comprehensions, where you write some expression for A, R, E, and B, each running over its own range. So it doesn't matter how you nest those indices next to each other in general. You can nest your for loops by going first for A in 0, 1, for B in 0, 1, for E in 0, 1, for R in 0, 1, or any other order in which you nest those for loops. You'll get to every element in the array. That's maybe a much more intuitive thing to say for computer scientists. And that's exactly why the product rule works the way it does. And it gets you to every point in this array, and there are just 16 of those. But in this case, we know something about the numbers that go into this array, because we know that not everything depends on everything. So first of all, we know, or at least we assume, for the moment, that burglaries and earthquakes happen independently of each other. So we're not thinking of a catastrophic earthquake with looting afterwards, just a little shaking. Let's say they are independent of each other. They don't affect each other. It's not like someone wakes up because there is a little shaking and then goes and burgles houses. So that means we can write this term as just P of E. And we get rid of two numbers, or one number, actually, that we have to define. Secondly, the radio announcement has nothing to do with the burglary in one particular house. They are not going to say on the radio that there was an earthquake because there was some break-in. It has absolutely nothing to do with whether there was a break-in or not. So if you know whether there was an earthquake or not, you know what the radio is going to say.
So that saves us some degrees of freedom. And then finally, the alarm has nothing to do with the radio. The alarm only goes off because of the earthquake or because of a burglary, or potentially both. But a radio announcement does not trigger alarms. So we can get rid of R. So by doing all of this, we've reduced the degrees of freedom to eight. Why? Because this has one degree of freedom. If I tell you the probability, then you know the other one, one minus that one. The same here. Two here, as we just discussed, for conditional distributions like this. And here, it's four, because there are these two variables we condition on, and we need one number for each of the four possible combinations of values that they might have. So that's four plus two plus one plus one is eight. Actually, if you look at the slides again later on, you might discover that this story isn't quite as easy as that. Because it seems like, OK, actually, I'll tell you now. So here's a quick break from the main program. So in a moment, I'm going to tell you that the radio announcement and the earthquake are pretty much completely determined by each other. If there's an earthquake, there's a radio announcement. If there's no earthquake, there's no radio announcement. So actually, this distribution doesn't matter at all. R is like a completely irrelevant variable. There are just three of them in the end, A and E and B. And now it might seem like, yeah, OK, with three binary random variables, of course, the array has eight entries anyway. That's not particularly interesting. But actually, the point still holds. It's just a little bit weaker. So if you forget about R, then even for a three-dimensional probability distribution, we would in general need 2 to the 3 minus 1, so seven, free numbers. And because of this independence structure, we need one fewer, only six. It's just not as impressive because we go from seven to six. But the point still holds.
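The count of eight for the four-variable model can be reproduced mechanically. The following is a minimal sketch, assuming all variables are binary and using a hypothetical `parents` dictionary that encodes the factorization P(A | E, B) P(R | E) P(E) P(B) described above.

```python
# For binary variables, a conditional table needs one free number per
# configuration of the variables it is conditioned on, because the two
# entries for each configuration sum to 1.
def free_parameters(num_parents):
    return 2 ** num_parents

# Factorization discussed above: P(A | E, B) P(R | E) P(E) P(B).
# Each key is a variable, each value the list it is conditioned on.
parents = {"A": ["E", "B"], "R": ["E"], "E": [], "B": []}

# Structured model: 4 + 2 + 1 + 1 free numbers.
factorised = sum(free_parameters(len(p)) for p in parents.values())

# Unstructured joint: 2^4 entries minus one for the sum-to-one constraint.
full_joint = 2 ** len(parents) - 1

print(factorised, full_joint)
```

The same helper works for any other factorization over binary variables, so it is easy to see how removing a conditioning variable halves the size of that factor's table.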
But in general, the point here is just if you know something about the structure of the problem, then you can make things easier. Now, we come to the point where we draw a little picture like this. I'm going to, in a moment, define what this is. This is called a directed graphical model, or sometimes called a Bayesian network. But we don't use that term anymore because it sounds like neural network. And it sounds like if you don't draw it like this, it's not a neural network or it's not Bayesian or whatever. So it's just misleading. This is called a directed graphical model. Why? Because it's a graph and it has arrows, so it's directed. And the way we do it is we write down all the variables we care about. There are four variables. Then we write down the joint probability distribution over everything. Use all the domain knowledge we have as we just did. So we write down this equation up here. And now we go and draw arrows in this graph. For every term in this factorization, if it's just a marginal distribution, so it has no conditioning, we don't do anything. We just leave it. And if it's a conditional distribution, we draw an arrow from the right-hand side, from the thing that we condition on, to the left-hand side, the thing that is the random variable. And if there are two such variables on the right-hand side, we draw two arrows from both variables to A. And that's how we get to this graph. And we call these variables that are at the tail end of the arrow, the parents, and the variables that are at the front end of the arrow, the children in this relationship. That comes from graph theory. OK. And now there's a final thing, as you notice. What do you notice about this graph that I haven't talked about yet? Oh, it's not fully connected, yeah? OK. That's true. So it's not a dense graph. We'll discover that that's a good thing. Anything else about this graph? Some variables are black and some are white. Yes. So we're going to fill in those variables which we know. 
So as the story progresses, there are these two variables, E and B, earthquake and burglary, which we care about, which we'd like to know about. And two variables, A and R, which we get to observe. They become data. So the alarm has gone off. We get a signal. And then later on, there's the radio announcement. There's no temporal structure yet to this graph. And what we want to know is whether there was an earthquake, or a burglary, or potentially both. So that's the setup. This is what in machine learning we call modeling. And now it's the second lecture of the term, so it's just four variables. And they're all binary. But don't worry, they'll very quickly become much more complicated. So that's a model. A model consists of a structural equation that defines a probability distribution. It's a function in Python code, which we can use to fill in a big array. And then we can ask questions of this model. We can do inference, learning, by conditioning on data. So the probabilistic mechanism for learning and inference, they're the same thing, actually, is to just condition on observed variables. And we can do that by first conditioning on A, so we observe the alarm, to see what the effect of that is on E and B. And how do we do that? What does conditioning mean? It means computing something on a computer, because that's what machine learning is. It's like training our model, training on the data. And here, it's just binary variables, so training is going to be trivial. But that's what learning is. It's conditioning on data. So we do that. First on a slide, then in code. So I'm going to define what the function actually is. For that, I need to put numbers to the algebraic structure. I'm going to say that the probability of a burglary is something like 1 in 1,000. That means if you're living in Oakland, your house gets burgled every three years.
Which is maybe true for Oakland, not for Tübingen, thankfully, but yeah. Also, those numbers are in the book. And of course, I'm going to give you code so you can play with it and change it. Then the probability of an earthquake is also once every three years. Let's put it like that. It's maybe a little bit low, should be a bit higher. You can change this in the code if you like. That means the probability of not getting burgled today is 1 minus that and the probability of not having an earthquake today is 1 minus that. Now I also need to fill in all these tables. I need to define the terms in the equation. So there is this term here, which says what is the probability of a radio announcement if there is an earthquake? Let's say it's 100%. It's just the same thing. If there's an earthquake, the radio says something. And if there's no earthquake, the radio doesn't say anything. By defining those two numbers, I've also defined, obviously, the probability for R equal to 0 given E equal to 1 and R equal to 0 given E equal to 0. Right? If you think this is all very boring, you can start nodding very vigorously and I'm going to speed up. If you think it's completely obvious, you can keep staring as you do. Sorry, if you think it's not obvious, you can keep staring as you do, and then I'll be slow. So good. OK, people find this obvious. And then finally, I need to fill in the final part of the equation, which is tedious because there are so many of them. Right? So this is the probability for the alarm to go off if there is either a burglary or an earthquake or both. So we have four possible cases. There could be an earthquake or not, there could be a burglary or not, or both. And that means I have four possible values here for the alarm not to go off. And then the corresponding probability to go off, which is just 1 minus the probability not to go off. OK, so I'll start on the left.
The probability for no alarm, if there is no burglary and no earthquake, is determined by some kind of false alarm probability. Right? What's the probability for your alarm to just go off on its own? Let's call that probability F, false positive rate or something. And let's say it's also 1 in 1,000. And you get to change it in the code if you like. Every three years, your alarm randomly goes off. Not a particularly reliable alarm. So the probability for no alarm in that case is 1 minus F. And what's the probability for the alarm not to go off if there is a burglary? That's the probability, first of all, that it doesn't go off on its own, burglary or not, because that might happen, times the probability that the burglar doesn't trigger it, which comes from some kind of detection probability. Right? So how good is your alarm? Does it go off when someone breaks into your home? Let's say it's a good alarm. So it goes off in 99% of the cases someone breaks into your house. Call that alpha B. Then the probability for it not to be triggered is 1 minus that 99%. What's the probability for the alarm not to go off if there's no burglary but there is an earthquake? Same thing, but there is a corresponding sort of trigger probability for earthquakes, alpha E. How likely is your alarm to go off if there's an earthquake? Let's say it's 1% likely to be triggered by an earthquake. And finally, what's the probability for the alarm not to go off if there is simultaneously an earthquake and a burglary? Well, it's not going off by itself, 1 minus F. It's not getting triggered by the burglar, 1 minus alpha B, and it's not getting triggered by the earthquake, 1 minus alpha E. So we can multiply those. And with that, we also have the other side. So then we plug those in. We actually multiply all those numbers. Now they are actual numbers. And that's what's stored in our array. That's our 2 by 2 by 2, sorry, 2 by 2 by 2 by 2 array. You can plug in all these numbers. Now they're all there. And now we can do Bayesian inference.
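The alarm table described above can be written as one small function. This is a minimal sketch using the numbers quoted in the lecture (F = 0.001, alpha B = 0.99, alpha E = 0.01); the Python names are illustrative and are not taken from the lecture's own code.

```python
# Numbers from the lecture; the variable names are illustrative.
f = 0.001        # probability that the alarm goes off on its own
alpha_b = 0.99   # probability that a burglar triggers the alarm
alpha_e = 0.01   # probability that an earthquake triggers the alarm

def p_alarm_given(b, e):
    """P(A = 1 | B = b, E = e) for binary b and e.

    The alarm stays silent only if it does not go off by itself and
    every cause that is present fails to trigger it.
    """
    p_silent = (1 - f) * (1 - alpha_b) ** b * (1 - alpha_e) ** e
    return 1 - p_silent

# Print the four rows of the conditional table.
for b in (0, 1):
    for e in (0, 1):
        print(f"P(A=1 | B={b}, E={e}) = {p_alarm_given(b, e):.5f}")
```

Writing the exponents `** b` and `** e` is just a compact way of switching each factor on or off; four explicit cases would give the same table.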
So one way to do that would be to ask you to break out your pen and paper and do it. And then you write down Bayes' theorem. And this was my opportunity to ask you how you do inference, and for someone to shout out Bayes' theorem. I'm going to do this a few more times over the course. So what's the probability for the two things we care about, given that there was an alarm? That's the question we're asking. Let's learn from the data A. Then that's this conditional distribution. And, well, Bayes' theorem tells us how to compute this probability distribution. It's the prior probability for B and E, times the likelihood, the conditional for the data given B and E, divided by the probability for the data, the evidence. Lots of numbers to be plugged in, to be summed out over all possible values. And for the evidence we end up with about 0.002. Actually, I wrote code for this as well. I'm going to show it to you. So here, I've defined all those numbers that we just talked about, so that you can change them outside of the code and play around with them, and plug them in. And now, I've done the same thing as before. So I've defined functions, which you can evaluate for the break-in, for the earthquake, the conditional distribution for the alarm given earthquake and/or break-in, and the joint probability. And as before, I can write down the joint probability once I have defined those functions using the product rule. Then I can fill in all this stuff with a bunch of vmaps in JAX, or you could write three nested for loops. So for those of you who haven't seen JAX yet, because you've worked in other dialects of Python, I guess, this is just a fancy way of writing three nested for loops. We fill in, for every possible value, 0 or 1. Basically, this is like a list comprehension. This is like for A in 0, 1, for E in 0, 1, for B in 0, 1. And fill those all in by evaluating this function. Now we have a joint, and now we can do all this stuff.
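The "three nested for loops" can be spelled out directly. The lecture's own code uses `vmap` in JAX; the sketch below swaps that for plain Python with `itertools.product`, and all function and constant names are illustrative assumptions rather than the names in the lecture's notebook.

```python
from itertools import product

# Numbers quoted in the lecture; names are illustrative.
P_B, P_E = 0.001, 0.001
F, ALPHA_B, ALPHA_E = 0.001, 0.99, 0.01

def p_b(b):
    return P_B if b else 1 - P_B

def p_e(e):
    return P_E if e else 1 - P_E

def p_a_given_be(a, b, e):
    p_on = 1 - (1 - F) * (1 - ALPHA_B) ** b * (1 - ALPHA_E) ** e
    return p_on if a else 1 - p_on

def p_joint(a, e, b):
    """Product rule with the independence assumptions already applied."""
    return p_a_given_be(a, b, e) * p_e(e) * p_b(b)

# Fill the 2 x 2 x 2 table: equivalent to three nested for loops.
joint = {(a, e, b): p_joint(a, e, b) for a, e, b in product((0, 1), repeat=3)}

# Nesting the loops in a different order visits the same eight entries.
joint_reordered = {}
for b in (0, 1):
    for a in (0, 1):
        for e in (0, 1):
            joint_reordered[(a, e, b)] = p_joint(a, e, b)

total = sum(joint.values())
print(len(joint), total)
```

A dictionary keyed by `(a, e, b)` stands in for the array here; with NumPy or JAX the same values would sit in an array of shape `(2, 2, 2)`.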
So we can compute sums, compute marginal distributions over the alarm. How likely is it for an alarm to go off? We sum out E and B. And then the big question. So what's the probability for the two things we care about, earthquake and/or burglar, given alarm? For that, we use Bayes' theorem. So we take the joint. The joint is the product of prior and likelihood. Divide by the marginal for the data, so the sum over E and B. And then only evaluate the first entry because that's the probability given that the alarm went off, not that it didn't go off. We could also set that to 0. And then we find out how likely earthquakes and burglaries are given that we didn't receive a message about an alarm. And we can compute conditional distributions for B given A and E given A. Those are interesting. This is kind of, what's your actual belief about a burglary irrespective of earthquake, given that you've seen the alarm, and vice versa? And also, finally, a question is kind of, once you know that there was an earthquake, once the radio has told you there actually was an earthquake, how does your knowledge about the world change? What do you now believe about burglaries given that there was an earthquake? And our intuition tells us, if I find out that there was an earthquake, my sort of worry about the burglary should go down somehow, it should become less likely again, which is an interesting feature, by the way, of probabilistic reasoning, that more information can make something less likely. And now I just make a plot and I just create Markdown fields to display the results. And now what this tells us, what Python tells us, because we've got computers and we don't need to do the computation ourselves on a piece of paper like in 1988, is that before we know anything, the probability of an alarm is 2 in 1,000. Why? Because it could go off randomly, or it could be triggered by an earthquake, or by a burglary.
And those all sum to 0.2%, which also tells you something about how often you should expect an alarm from your system. Now, once we get informed that there was an alarm, the marginal distribution, sorry, the conditional distribution for B and E given A is this. And what we notice here is, does this look like an independent distribution? No, it's one more of these cases where you get to see some information that makes things dependent on each other. And finally, we can ask marginal questions. Given that there was an alarm, what do we now believe about break-in or earthquake? Well, we believe with almost 50% probability that there is a break-in, and, just a moment, 0.6% that there was an earthquake. Why? Earthquakes and burglaries are equally likely a priori, but earthquakes are much less likely to trigger the alarm than break-ins are. And therefore, the alarm is much better explained by a break-in. Is that your question? What's your question? Yeah, so this is a problem with this notation, that we should probably write P of B and E given A equal to 1. Yeah, so I'm evaluating it at the, yeah. So maybe it's not a problem with the code, it's a problem with this notation that I had a slide about at the beginning, that we're talking about B both as kind of a Boolean statement and as the variable that could have value 0 or 1. And this will actually be even more of a problem over the rest of the course because of this notation, but I think it's possible to read this like this, right? What's the probability of a break-in given an alarm? So it's the probability of B equal to 1 given A equal to 1. The probability of B equal to 0 given A equal to 1 is just 1 minus that. And the probability of B equal to 1 given A equal to 0, for that we need to do another computation. I need to actually look it up. OK. And then finally, something interesting happens, and that's actually the point of this exercise.
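The numbers quoted above (an alarm probability of about 0.2%, a break-in probability of almost 50% and an earthquake probability of about 0.6% given the alarm) can be reproduced in a few lines. This is a sketch in plain Python under the lecture's stated parameter values, not the lecture's JAX code, and the names are illustrative.

```python
from itertools import product

# Numbers quoted in the lecture; names are illustrative.
P_B, P_E = 0.001, 0.001
F, ALPHA_B, ALPHA_E = 0.001, 0.99, 0.01

def p_alarm_on(b, e):
    """P(A = 1 | B = b, E = e)."""
    return 1 - (1 - F) * (1 - ALPHA_B) ** b * (1 - ALPHA_E) ** e

# Joint restricted to A = 1: prior times likelihood for each (b, e).
joint_a1 = {
    (b, e): (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E) * p_alarm_on(b, e)
    for b, e in product((0, 1), repeat=2)
}

# Evidence P(A = 1): sum out B and E.
p_alarm = sum(joint_a1.values())

# Bayes' theorem: posterior P(B, E | A = 1).
posterior = {be: p / p_alarm for be, p in joint_a1.items()}

# Posterior marginals: sum out the other variable.
p_b_given_a = posterior[(1, 0)] + posterior[(1, 1)]
p_e_given_a = posterior[(0, 1)] + posterior[(1, 1)]

# If B and E were independent given A = 1, this product would equal
# posterior[(1, 1)]. It does not.
product_of_marginals = p_b_given_a * p_e_given_a

print(p_alarm, p_b_given_a, p_e_given_a)
print(posterior[(1, 1)], product_of_marginals)
```

The second printed line is the independence check raised in the lecture: the joint posterior entry and the product of the posterior marginals disagree, so B and E are dependent once the alarm is observed.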
Once you find out that there is an earthquake, if you condition on the earthquake, so basically the radio announcement, the probability of a burglary drops to 8%. So this guy runs out of the office having heard about the alarm in his house, roughly 50% confident that there was a break-in. Then he gets additional information. He's informed that there was an earthquake and suddenly becomes much calmer. Judea Pearl calls this explaining away. So if I get additional information that makes the earthquake a more likely explanation of the alarm, then that explanation explains away the worry about the burglary. So this is nice because it shows that with probability theory we can do these kinds of computations. Actually, here are the numbers as well. It also kind of explains why it's hard to do those computations, because you have to keep track of these arrays and keep summing things all the time, which is harder than proving the truth value of a Boolean formula where everything is just either 0 or 1 and you never have to keep track of downstream stuff. So if you want to write what are called probabilistic programs, so computer programs that can keep track of possible values of variables in your code, you need to have some kind of functionality to keep track of all these possible values. If they are discrete, then we talk about arrays. If they are continuous valued, things get even harder. With that, I want to briefly use the last few minutes to summarize a few things that we've seen and finally define a few things. First of all, what we've just done is we've gone through a general recipe of machine learning. Didn't sound like it, but that was machine learning. We defined a model. That's what David MacKay used to say: always write down the probability of everything. If you have an inference question, someone tells you some complicated story and you ask a question about some complicated stuff, then what you do is you write down the probability of all the variables.
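The drop from roughly 50% to roughly 8% described above as explaining away can be reproduced with a small generic conditioning helper. This is a sketch under the lecture's parameter values; the `conditional` function and its dictionary-based interface are hypothetical conveniences, not part of the lecture's code. Because the radio is assumed perfectly reliable, conditioning on R is treated here as conditioning on E.

```python
from itertools import product

# Numbers quoted in the lecture; names are illustrative.
P_B, P_E = 0.001, 0.001
F, ALPHA_B, ALPHA_E = 0.001, 0.99, 0.01

def joint(a, b, e):
    """P(A=a, B=b, E=e) = P(A | B, E) P(B) P(E)."""
    p_on = 1 - (1 - F) * (1 - ALPHA_B) ** b * (1 - ALPHA_E) ** e
    p_a = p_on if a else 1 - p_on
    return p_a * (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)

def conditional(query, evidence):
    """P(query | evidence); both are dicts such as {"b": 1}."""
    num = den = 0.0
    for a, b, e in product((0, 1), repeat=3):
        world = {"a": a, "b": b, "e": e}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(a, b, e)
            den += p  # everything consistent with the evidence
            if all(world[k] == v for k, v in query.items()):
                num += p  # ... that also matches the query
    return num / den

before = conditional({"b": 1}, {"a": 1})         # alarm only
after = conditional({"b": 1}, {"a": 1, "e": 1})  # alarm and earthquake

print(f"P(B=1 | A=1) = {before:.3f}, P(B=1 | A=1, E=1) = {after:.3f}")
```

Conditioning is nothing more than restricting the sum to the worlds that agree with the evidence and renormalizing, which is why one helper covers both questions.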
For that, you first identify the variables, then you write down the generative model. So typically you either write the graph or you write this factorization, which are equivalent to each other because you can construct the factorization from the graph and the graph from the factorization. And then you do Bayesian inference. You just condition. That's it. And for that, we used this notion of what's called a directed graphical model, or sometimes called a Bayesian network, which looks like this. So here's a formal definition, but it's actually just what I said. So it's a probability distribution over variables which can be written as some factorization, where the factorization has this particular property that the graph is acyclic, that it sort of goes from left to right. We have parents that feed into children. And in fact, you can do this with every probability distribution. Every possible probability distribution can be written as a graph. And in general, it's gonna be a dense graph. Why? Because the product rule tells you you can factorize distributions like this. Or if you're a computer scientist, you know that to fill in all the elements in an array, you have to write nested for loops that just run over all of the possible values. And those nested for loops, you can permute their order as long as they reach every element in the array once. But that's not particularly useful. A first observation is that you can do this in arbitrary order, so the direction of the arrows doesn't mean anything. They are not causal. They're not saying A causes E. It's not that the alarm causes the earthquake. It's just a way of writing down the equation, of nesting your for loops with each other. But if you know something about your problem, then you can make it much, much more compact. You can remove edges from your graph. And removing edges from the graph is what we want to do because it makes the problem easier. It means that there are fewer terms to take care of.
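The correspondence between factorization and graph can be made concrete for the alarm example. The sketch below stores one entry per factor, derives the children of each node, and checks acyclicity with the standard library's `graphlib` module (Python 3.9 or later). The dictionary layout is an illustrative choice, not a notation from the lecture.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# One entry per factor of P(A | E, B) P(R | E) P(E) P(B):
# variable -> the variables it is conditioned on (its parents).
parents = {"A": ["E", "B"], "R": ["E"], "E": [], "B": []}

# Invert the map: an arrow runs from each parent to its child.
children = {v: [] for v in parents}
for child, pars in parents.items():
    for p in pars:
        children[p].append(child)

# TopologicalSorter takes a mapping from node to predecessors.
# static_order() raises graphlib.CycleError if there is a directed cycle,
# and otherwise yields an order in which parents precede their children.
order = list(TopologicalSorter(parents).static_order())

print(children)
print(order)
```

Any order returned by `static_order()` is a valid nesting order for the for loops that fill in the joint array: each factor's conditioning variables are already fixed when that factor is evaluated.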
However, it's not as easy as just saying that with fewer and fewer edges, things get really easy, because we discovered that in such distributions, there can still be this interesting phenomenon of emergent dependence, conditional dependence. So even though two variables can be independent a priori, like in this case, the burglary and the earthquake, they can become dependent on each other once we have information. And actually, a natural question to ask might be, can you read off such conditional dependence structure from the graph? And of course you can, because otherwise, why would we write such graphs? If they were just neat pictures that you can't do anything with, it would be a bit silly. So in fact, you can actually read off conditional independence structure from these graphs by effectively separating them into some kind of atomic structure. How to do this formally in an algorithm we'll talk about much later in the course, but today we can actually already do this somewhat intuitively by inductively building up graphs. So imagine, what's the simplest possible graph? What's the empty graph? Okay, so empty is trivial. The next easiest graph is one with just one node, right? That's also boring because the corresponding distribution just looks like this. Okay, every variable is just always independent of the other variables that aren't there. So the next most complicated graph is this, with two nodes. There we only have two possible cases. Either there's no edge, then they are independent of each other, clearly, because there's no term combining them, or there is an edge, but then it doesn't matter which direction the edge goes. Either we could write this as P of B given A times P of A, or we could draw the edge the other way and then this would be P of A given B times P of B. Doesn't matter, right?
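That the direction of a single edge carries no information can be checked on any two-variable table. The sketch below uses arbitrary illustrative numbers, builds both conditional tables from the joint, and reconstructs the joint from each factorization.

```python
# Arbitrary illustrative joint over two binary variables; keys are (a, b).
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginals via the sum rule.
p_a = {a: sum(joint[(a, b)] for b in (0, 1)) for a in (0, 1)}
p_b = {b: sum(joint[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Conditionals via the product rule, one table per edge direction.
p_b_given_a = {(b, a): joint[(a, b)] / p_a[a] for a in (0, 1) for b in (0, 1)}
p_a_given_b = {(a, b): joint[(a, b)] / p_b[b] for a in (0, 1) for b in (0, 1)}

# Edge A -> B: P(B | A) P(A).  Edge B -> A: P(A | B) P(B).
forward = {(a, b): p_b_given_a[(b, a)] * p_a[a] for (a, b) in joint}
backward = {(a, b): p_a_given_b[(a, b)] * p_b[b] for (a, b) in joint}

same = all(
    abs(forward[k] - joint[k]) < 1e-12 and abs(backward[k] - joint[k]) < 1e-12
    for k in joint
)
print(same)
```

Both directions recover the same four numbers, so with only two nodes the arrow is purely a choice of how to write the product rule.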
So the interesting stuff actually only comes up once we have graphs that have three variables, because then the directions of the arrows actually matter, relative to each other. So if you have a graph with three variables, A, B, and C, what are the possible non-trivial arrows we could have between them? So the trivial ones are the ones where there's in some case no edge, right? And then independence is obvious. But the non-trivial cases are, first, the ones where the arrows go in one direction. This is called a chain graph. Or, there's no colorful chalk here, so I have to draw it three times. Where do I draw it here? Or the arrows point inwards. This is called a collider structure. Or the arrows point outwards. This doesn't really have a name, I think. At least not a commonly used one. Maybe it's a fan-out. And now we can look at these three and check what kind of conditional independence they imply. And I'll do this really quickly because we're running a little bit out of time. Intuitively, what kind of dependence structure, or maybe you've already heard about it, can you read off from this graph? Is A independent of B? Is B independent of A? Is A independent of C? Given B, yeah. So, to see that, you have to write down what the corresponding joint distribution is. What does it look like? So, we have a P of A, B, C, which is equal to, we can read it off, P of A times, what's the next term? P of B given A. Ah, you want to go through Bayes. Times P of C given B, but not A, because the graph has no arrow going from A to C. So we have P of C given B. By the way, there's of course a graph where they're all connected, and that's the trivial one again, where everything is connected. Okay, so looking at this expression, we have a hunch that maybe by observing B, A and C become independent of each other. Why? Well, if we condition on B, so if we want to know P of A and C given B, then we need to take this term and divide by what? By P of B, all right?
So, we divide this thing down, P of A times P of B given A times P of C given B, and then we divide by P of B. So what's P of B? Well, P of B is P of B given A times P of A plus P of B given not-A times P of not-A. Notice how there is no C in here, because we can sum over C directly. So now we can rearrange those terms and see that we get a P of C given B times just terms that contain A and B, but no C, right? What is that distribution that is here? It's the thing that you can sort of see outlined here. What is this? This is Bayes' theorem for what? It's P of A given B, of course it is, right? Because that's what we kind of expect. So the conditional factorizes into P of A given B times P of C given B. So when we condition on B, A and C become independent of each other. And you can actually do this very same derivation for the other structures. I'm not gonna do it here to save you time, but it's pretty much the exact same mechanism, right? You write down the graph, you translate the graph into an equation, and you compute conditional distributions and stare at them to see whether you can factorize or not. This could have been a homework. And you can see that those actually apply to these three types of graphs. So here's a table for them, which I'll quickly go through, which basically just says for chains, we can write the factorization like this. We notice that A and C become independent of each other when conditioned on B, but they are not independent in general. For fan-out structures, we get that A and C become independent of each other when conditioned on B, but not in general. And for collider structures, it's actually the other way around. A and C are independent of each other in the marginal, but they become dependent on each other when we condition on B. This is explaining away. And we could actually have read this off from our graph in our example with the burglar and the alarm by saying, look at this graph. First of all, we can ignore R because it's perfectly connected to E.
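The table of three-node structures just described can also be checked numerically instead of by algebra. The sketch below builds one joint distribution per structure, with B as the middle node, and tests whether A and C are independent, both marginally and given B. All conditional probabilities are arbitrary illustrative numbers, and the helper names are hypothetical.

```python
from itertools import product

def bern(p, x):
    """Probability that a binary variable with P(1) = p takes value x."""
    return p if x else 1 - p

# Illustrative numbers only; each function returns P(A=a, B=b, C=c).
def chain(a, b, c):     # A -> B -> C
    return bern(0.3, a) * bern(0.9 if a else 0.2, b) * bern(0.7 if b else 0.1, c)

def fork(a, b, c):      # A <- B -> C (fan-out)
    return bern(0.4, b) * bern(0.9 if b else 0.2, a) * bern(0.7 if b else 0.1, c)

def collider(a, b, c):  # A -> B <- C
    return bern(0.3, a) * bern(0.6, c) * bern(0.9 if (a or c) else 0.05, b)

def independent(joint, given_b=None, tol=1e-12):
    """Are A and C independent, marginally or given B = given_b?"""
    bs = (0, 1) if given_b is None else (given_b,)
    # Unnormalized table over (a, c), summing or fixing b.
    p = {(a, c): sum(joint(a, b, c) for b in bs)
         for a, c in product((0, 1), repeat=2)}
    z = sum(p.values())
    for a, c in p:
        pa = sum(p[(a, cc)] for cc in (0, 1)) / z
        pc = sum(p[(aa, c)] for aa in (0, 1)) / z
        if abs(p[(a, c)] / z - pa * pc) > tol:
            return False
    return True

for name, g in (("chain", chain), ("fork", fork), ("collider", collider)):
    print(name, independent(g), independent(g, given_b=1))
```

With these numbers, the chain and the fan-out are dependent marginally and independent given B, while the collider is the other way around, matching the table. A numerical check with specific numbers is only an illustration; the general statement follows from the factorization argument given in the lecture.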
So we can look at this atomic graph, E, B and A. It's a collider structure, the arrows point together. So therefore, when we condition on A, E and B become conditionally dependent on each other. Almost done, last sentence. Unfortunately, advance warning, and we'll talk about this again. This notation, directed graphical models, is imperfect in the sense that such a graph cannot capture every possible dependence structure that might be present in a probability distribution. And this is such a fundamental problem that you can even see it in this coin and bell example. So we discovered in this coin and bell example that everything was dependent on each other when you condition on one of the variables. So for A, B and C, right? Two coins and a bell. If you condition on B, A and C become dependent. If you condition on C, A and B become dependent. And if you condition on A, B and C become dependent on each other. But there is no single graph that can capture all of these. You would actually need to write down all three graphs. That's an annoying property of directed graphs as pictures. We need to talk about this halfway through the course. It's a fundamental problem. And now you might wonder whether there's a way to fix it. And we'll need one or two lectures to discover that it's actually very hard to fix. There's another notation which is not directed, which fixes this problem, but it has a separate downside. For now, though, we have this notation called directed graphs. It allows us to think about conditional independence structure. And we'll need to do that in the rest of the lecture course. So I want to introduce this notation so we can look at graphs and you don't have to go, huh, what is the graph? And then I can draw neural networks and you will realize that they are also these kinds of graphs. And so we can talk about neural networks without having an undefined notation.
And we've discovered that conditional independence and dependence are really the tricky part. But we are not the first ones to discover this. Kolmogorov already knew. So if you read the chapter right after the introduction of the axioms of probability theory, he has this section here, which I'm not gonna read out because it takes too long, but at the end he says: we therefore come to the conclusion that the concept of independence is, at least in principle, at the core of this complicated nature of probability theory. And therefore it's the most important task of the philosophy of science to figure out this sort of contested question about the nature of dependence and independence, to explain what probabilities actually are. What he means by this is that Bayes' theorem is not magic, and it's not intuitive reasoning or philosophy. Bayes' theorem is just measure theory. It's just keeping track of distributions. What is really tricky is to say what it means for two things to be independent of each other. And we discovered today that it has a very subtle nature. 
It is, first of all, a big problem, because if you want to keep track of every variable at the same time, then you need to keep track of every possible combination. And in a problem with n degrees of freedom, the number of elements to fill into this array is exponential in n, right? An array with d dimensions has some length along each axis. Let's say they are all the same length; then the number of entries in this array is that axis length raised to the d-th power. So unstructured probabilistic reasoning is exponentially expensive in the number of variables we're keeping track of. So we will now need to find algebraic tools to deal with this complexity. And of course there are such tools, and they are wonderful, and we're gonna spend a large part of the lecture course thinking about them. Really, really carefully. 
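To put rough numbers on that exponential cost, here is a small sketch. The comparison with a chain-structured model is an assumption added for illustration, not something computed in the lecture: it counts the entries of the full joint array against the entries of the conditional tables a chain factorization would need.

```python
def full_joint_entries(num_vars: int, axis_len: int) -> int:
    """Entries in an unstructured joint table: axis length to the power of
    the number of variables."""
    return axis_len ** num_vars

def chain_entries(num_vars: int, axis_len: int) -> int:
    """Entries needed if the variables form a chain X1 -> X2 -> ... -> Xn
    (an illustrative assumption): one table for P(X1), plus one
    axis_len-by-axis_len table for each P(X_i | X_{i-1})."""
    return axis_len + (num_vars - 1) * axis_len ** 2

for n in (5, 10, 20, 40):
    print(n, full_joint_entries(n, 2), chain_entries(n, 2))
```

For binary variables the full table doubles with every added variable, while the chain count grows by a constant four entries per variable. This is the kind of saving that conditional independence structure can buy.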
And that will allow us to keep track of infinite-dimensional objects, both in the sense of one variable taking infinitely many values, continuous-valued variables, and even in the sense of infinitely many variables, function spaces. And that's gonna be super interesting. We will then still represent relationships between variables with these directed graphs, which allow us to read off conditional independence structure. And that's it for today. From Monday onwards, we'll talk about actual real variables, continuous-valued variables, which are much more interesting. Thank you very much.