Hello, thank you very much for tuning in again, and welcome to the second lecture of this course on probabilistic machine learning, in which we are going to learn how to reason under uncertainty and, more importantly, how to teach computers to do so efficiently for us.

Here is where we are in the course: we are just past the very first lecture, and all of the rest is still ahead of us. What we saw last time is, first and foremost, that probability, and the associated notion of uncertainty, is not just a technical mathematical concept used in machine learning. It is a much broader notion that applies to most of our daily life, and it is an important part of the process we might describe as human intelligence. As Pierre-Simon Laplace put it, life's most important problems are, for the most part, problems of probability. This includes scientific reasoning, and it includes many societal and political decisions. It is therefore no surprise that a part of computer science, machine learning, which aims to endow computers with core aspects of human intelligence, has to deal with this notion of probability at its very heart.

In the last lecture we introduced a mathematical formalism for dealing with uncertainty, with probabilities. We did so by constructing a set of axioms that go back to Kolmogorov, which are at heart a construction of uncertainty using sets and set theory, or measure theory: instead of assigning a single binary truth value to a statement, truth is distributed over a spectrum of values between zero (totally false) and one (totally true). We saw that doing so primarily just requires thinking about sets, constructing a meaningful way to talk about intersections, unions and differences of sets, and then applying one, or actually two, key notions to distribute truth over them. The first one is relatively simple: there is a finite amount of truth; we might as well say there is a unit amount of truth. The other is the key master idea called sigma-additivity, which states that for two disjoint sets over which we have distributed truth, the probability assigned to their union has to equal the sum of their individual probabilities. This is really just saying that we want to keep track of uncertainty and not artificially add or remove uncertainty, or probability, purely through operations on sets; talking about derived events should not allow us to construct additional truth.

This gave rise to a first, relatively straightforward theorem, known as the sum rule. It states that if you want to get rid of one of the variables in your reasoning system, because you do not care so much about it, you can sum over all of its possible values, over all hypotheses for that variable, to obtain what is known as a marginal distribution.
Then we introduced another quantity, called the conditional probability of a statement A given a statement B. It was defined through a statement known as the product rule, which says that the so-called joint distribution, the probability of the intersection of two sets, can be written as the probability of either one of the two sets times this new notion, a conditional probability. Combining these two rules gives rise to the core theorem of probability theory, known as Bayes' theorem, which also has a philosophical interpretation: it yields the so-called posterior probability measure for a typically unknown, latent quantity X given some typically visible, observable data D, by multiplying the prior probability for this latent statement with the likelihood, a conditional probability for the data to arise, or to be true, if the hypothesis is correct, and normalizing by the evidence, the denominator in Bayes' theorem, which is the sum over all possible joint explanations of the hypotheses and the data.

What we did not actually see, but what I essentially assigned as homework, is a very interesting aspect of this framework: it extends propositional logic, or what you might call Boolean logic, depending on what your notion of Boolean actually is. In a classic propositional framework we are allowed to assign binary truth values, true or false, to statements. So we can make a statement like "from A follows B", meaning that if A is true, then B is also true. In our new notation with probabilities we can replace this statement with a very similar one, a conditional probability for B given A: we can say that B is true whenever A is true, so the probability for B given A is one. This is really a generalization of the propositional statement, in the sense that the two reasoning directions that propositional logic allows still work. The probability for B given A is one, so that is plain, standard, modus-ponens-style forward reasoning: if we know A to be true, then B is true; if it rains, then the street is wet. It also allows the converse, which is modus tollens: if B is false, then A is also false; if the street is dry, it cannot have rained.

But it is actually a true generalization, because there is a continuum in between, which is something you can show as your homework. Using Bayes' theorem and the rules of probability, we can find that, if we make this assumption, then the probability for B given not-A, the complement of A, is less than or equal to the probability for B. That means that if A is false, B becomes less plausible; note that it can also stay exactly as plausible as it is. So if it is not raining, it becomes less plausible that the street is wet. And the other way around: the probability for A given B, the conditional in the other direction, is greater than or equal to the probability for A. This means that if B is true, A becomes more plausible, or at most stays as plausible as it was before. So if you observe the street to be wet, it becomes more plausible that it has rained.

Now, as part of your homework you will show that there is actually an even stronger generalization, of which this statement is of course a limit case: the weaker assumption that the probability for B given A is merely greater than or equal to the probability for B.
That weaker statement means that if A is true, then B becomes more plausible, or at most stays as plausible as it was before. If we assume this inequality for our probabilities, these real numbers between 0 and 1, then we can again show, using the laws of probability that we derived in the last lecture, the following. If A is true, then B becomes more plausible: if it is raining, it becomes more plausible that the street is wet, even if we do not assume that rain directly implies a wet street, only that it makes it more plausible. That is basically just the assumption copied out; it is trivially true because we assumed it. But we can also see that the probability for B given not-A is at most the probability for B, so what we had above still holds even under this weaker requirement, and the statement above it also holds as before. And a weakened form of modus tollens also applies: if B is false, then A becomes less plausible. If the street is dry, then the probability that it has rained is reduced.

In this sense, probability theory does exactly what we want: it recovers classic propositional logic in the extreme corner case, but it allows us to make much more subtle statements about the relationship between propositions, between variables.

Now, just to be sure: you might be thinking that I am bashing Boole here, but Boole was actually a very smart man. In fact, if you read Boole's original texts, you can find, in his "An Investigation of the Laws of Thought", actual sections on probability. You might think of Boole as the guy who did binary truth values and propositional logic, but he knew about probabilities. This was well before Kolmogorov, but after Laplace, so probabilities had already been discussed, and Boole talks about them in his own works; he actually uses exactly the rules that we still use today. For example, he says that his first principle of probabilities is that if p is the probability of the occurrence of any event, then 1 - p will be the probability of its non-occurrence: that is our sum rule, together with normalization. Second, the probability of the concurrence of two independent events is the product of the probabilities of those events; this introduces the notion of independence, which we will actually talk about today. The third principle is that the probability of the concurrence of two dependent events is equal to the product of the probability of one of them by the probability that, if that event occurs, the other will occur as well; that is the definition of conditional probability, similar to how Kolmogorov would later define it. And his fourth statement, maybe most excitingly, is that the probability that, if an event E takes place, an event F will also take place, is equal to the probability of the concurrence of the events E and F divided by the probability of the occurrence of E.
What this is, essentially, is Bayes' theorem. So Boole, if you like, was actually a Bayesian, and if you are a fan of classic propositional logic, maybe you should be a fan of probability theory as well.

Now, unfortunately, there is also a problem with probability theory: not a conceptual one, but a computational one. Let's think about a situation in which we have more than two variables. Say, for argument's sake, that there are 26 variables, from A to Z, an entire alphabet, about which we would like to make statements. The way this works in a classic propositional framework is that you assign a binary value to these variables: A is true, B is false, C is true, D is true, E is false, F is false, and so on. Of course, that on its own is a bit trivial. What you are more likely doing is assigning truth values to a certain subset of these 26 variables, and then also using propositional statements like "from A follows B", or defining C as "A and B", or M as the truth value of "K or F". By this process you assign initial truth values to a certain part of the alphabet and then use the rules of propositional logic to automatically derive truth values for the other variables. At the end of this process you have assigned a truth value to all of the variables (there might be some variables to which you have assigned nothing, and then you might as well ignore them).

Doing so requires how much storage? Think about that for a moment. Normally this is the point where I would like to ask you a question, but unfortunately we can't do that here. Of course, it requires exactly 26 bits: we have to store binary values for all the variables from A to Z.

Now think about what happens if we extend this framework to the probabilistic setting, where we no longer just assign true or false to each of the 26 variables, but instead have to assign a probability to every possible combination of true and false for all of these 26 variables. How many of these are there? Well, there are 2^26 possible configurations of these 26 variables and their truth values. Storing these requires us to store real numbers between zero and one for 2^26 different states, and 2^26, for those of you who do not have all powers of two memorized, is 67,108,864. That is the number of values we now have to hold in memory, and that is not even getting into the fact that we have to store real numbers rather than binary ones, which complicates things even more.

By the way, we do not actually have to store 2^n numbers, only 2^n - 1, because probability measures sum to one over all hypotheses: the probability of the entire hypothesis space always has to be one, which saves us from storing one of these values. So we only have to store 67,108,863 numbers. Still, by moving from propositional logic to probabilistic reasoning, we have complicated our reasoning process, on the computational side, from storing 26 binary numbers, 26 bits, to storing about 67 million floating-point values between zero and one; something like 67 megabytes rather than 26 bits.
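If you want to check the arithmetic, here is a quick back-of-the-envelope sketch in Python (just the counting; the actual memory footprint of course depends on which floating-point precision you choose to use):

```python
# Propositional logic: one binary truth value per variable.
n_vars = 26
bits_propositional = n_vars                 # 26 bits

# Probabilistic reasoning: one real number per joint configuration of all
# 26 binary variables, minus one because the probabilities must sum to 1.
n_configurations = 2 ** n_vars              # 67,108,864 joint states
n_free_parameters = n_configurations - 1    # 67,108,863 numbers to store

print(bits_propositional, n_configurations, n_free_parameters)
```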
That is disastrous, right? Why do we have to pay this price? Well, because if we are uncertain, we have to keep track of every single hypothesis, because every single one of them might be the right one. There is no unique single answer anymore. Whether that is a good or a bad thing maybe depends on your outlook on life, but it is a fundamental aspect of being uncertain, at least in principle, if you want to keep track of every possible hypothesis.

So what we have just seen is that probabilistic reasoning extends classic propositional logic, but it also creates a massive new computational challenge. Instead of keeping track of a single binary string to store the single hypothesis we commit to, we now have to keep track of a combinatorially large space of hypotheses and assign a real-valued truth value between zero and one to every single one of them; well, except for one, which we can construct as one minus the sum of all the others. This is the key challenge of probability theory and probabilistic reasoning, and if you like, the entire rest of this course, and more or less all research on probabilistic machine learning, is at its core about dealing with this challenge: by making all sorts of simplifications and computational tricks, by using structure, and in particular by using one especially important kind of structure that we will deal with for the rest of this lecture.

This is one of our gray slides, where you have a chance to take a quick break, and then we will continue.

Having found this annoying aspect of probability theory, this combinatorial explosion of computational and memory cost, we obviously have to think of ways to simplify computations in the probabilistic setting, and we will now meet one of the most important, most essential ones. To do so, I am first going to introduce some new notation. Actually, I am not going to introduce new notation; I am going to change the way we use the existing notation. Up until now I have assumed that variables called A and B and C and so on are names of sets in the sigma-algebra. That means that when I write something like P(A), it is supposed to represent the probability that this formula is true, in the sense that the correct event, within the elementary set of atomic events, is a member of this derived set A in the sigma-algebra. And P(not A), which by the derivations of the previous lecture equals 1 - P(A), meant the probability that this statement is false, i.e. that the correct value lies in a part of the elementary set that belongs to the complement of A in the sigma-algebra.

Now I am going to make two changes. The first one is quite simple: I am just going to change what kind of values the variables like A can actually take. Instead of talking about a formula, I will talk about binary values, and I will assume that A, B, C and so on are binary variables which take a truth value: they are either true or false.
They are either one or zero. This is a little bit confusing at first, which is why I did not do it immediately, because it uses truth values for the very things we assign probabilities to; but I am sure you will be able to manage. So I will simply write P(A = 1) for the same statement as above, the probability that the formula A is true, and P(A = 0) for the probability that A is false. Why am I doing this? Because it simplifies the notation quite significantly: we can now think of P as a function that maps from the binary set {0, 1} to the real numbers.

But I am going to make another change in notation that is maybe more subtle: I am going to assume that P is a function that is aware of the name of its input. Quite often in mathematical notation we write something like f(x), where x is just a placeholder for some number; it does not actually matter that we have written f(x), we might as well write f(y), and as long as y and x have the same value, it is the same number. In the remainder of this course I am going to adopt a notation that is fairly standard in the literature on probabilities, in which P(a) and P(b) are different things: even if a and b have the same value, P(a) and P(b) remain separate objects.

In particular, this is going to allow me to write a statement like P(a, b) = P(a) · P(b) for the joint of a and b, the probability of the intersection of a and b. What I mean by this notation is that this function of two binary inputs, which corresponds to four possible configurations (a = 1 and b = 1, a = 0 and b = 1, a = 1 and b = 0, a = 0 and b = 0), can be written in terms of two other probability distributions which each take a single binary input: P(a = 1, b = 1) can be written as P(a = 1) times P(b = 1), and so on for the other configurations. In other parts of mathematical analysis, being able to do something like this would require me to introduce new functions: there would have to be a function p_a over the values a can take, zero and one, and another function p_b taking the inputs for b, and this quickly becomes very tedious. Therefore we adopt this convention. If you have coding experience, you can think of these as functions that are aware of the keywords that get passed as their arguments, similar to keyword arguments in Python, for example.
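If the Python analogy helps, here is a minimal sketch of what such a "name-aware" probability function could look like; the dictionary representation and the function name are just my illustration, not anything standard:

```python
# A toy probability function that knows which variable it is being asked about:
# p(a=1) and p(b=1) are different queries, even though both pass the value 1.

joint = {  # p(a, b) for two fair, independent coins
    (0, 0): 0.25, (0, 1): 0.25,
    (1, 0): 0.25, (1, 1): 0.25,
}

def p(a=None, b=None):
    """Sum the joint over all configurations consistent with the named arguments."""
    return sum(prob for (v_a, v_b), prob in joint.items()
               if (a is None or v_a == a) and (b is None or v_b == b))

print(p(a=1))       # 0.5  -> the marginal of a
print(p(b=1))       # 0.5  -> the marginal of b, a different query
print(p(a=1, b=1))  # 0.25 -> the joint
```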
Why is this notation helpful? Well, I have just mentioned that there is this problem of computational complexity in probabilistic reasoning, so we will have to find structure in probability distributions that simplifies reasoning, structure that allows us to do less than keep track of all the combinatorially many terms in our probability distributions. One way to do so is to use a structure like this one, which you could call a low-rank structure. For two variables, the joint P(a, b) is a matrix of size two by two, and what this statement says is that we can write this matrix as the outer product of two vectors. Of course, these two vectors potentially require fewer entries than the elements of the matrix. That is not true for a two-by-two matrix, but it is true for all matrices larger than two by two, because two plus two and two times two happen to be the same thing.

Using this structure we can define maybe the most important concept of probabilistic reasoning, and that is independence. We are going to say that a probability distribution with this structure, one that can be written as a product of two separate distributions, has the property of independence. Two variables a and b are called independent random variables if and only if their joint distribution factorizes into the so-called marginal distributions; that means we can write P(a, b) in this product form, and this string means exactly what I defined on the previous slide. In this case, as you can convince yourself quite easily using the definition of the conditional distribution, P(a | b) actually equals P(a). I am going to use a particular notation, which is not universal and which will only show up a few times in this course: when this holds, we write a ⊥ b, meaning a is independent of b. One way to think about this intuitively is that information about b cannot be used to learn something new about a: if I tell you something about b, it does not change what you believe about a, and vice versa of course. The reason for this is precisely this factorization.

There is a quite straightforward generalization of this to conditional distributions. As you know from the previous lecture, conditional probability distributions are also probability distributions, so we might as well define the same concept for them. But before we do, let me give you a simple example of independence. A very simple example is two coin tosses; that is every statistician's favorite example. If you have two coins and you throw them separately, then the probability distribution for each of them is completely separate from the other: the probability for the first coin, let's call it a, to show heads has nothing to do with whether the second coin, b, shows heads or tails. That means we can write down the joint probability distribution over these two variables a and b, each taking values zero or one.
Let's say both coins have the same probability of 50% of coming up heads. Then the probability for both of them to come up heads is a half times a half, for both of them to come up tails it is a half times a half, and so on. Of course, we can write this table as the outer product of the two vectors (1/2, 1/2) and (1/2, 1/2). Good, so that is easy, and clearly we can therefore write this joint distribution as the product of two separate marginal distributions.

Now there is a generalization of this notion, which is conditional independence, and it is a fairly direct extension. Two variables are called conditionally independent given a third variable, let's call it c, if and only if their conditional distribution factorizes: if we can write the conditional distribution P(a, b | c) as P(a | c) times P(b | c). Everything else works as before. In that case, in light of the information provided by c, learning about b does not provide any further information about a beyond what c already provided. We will use a similar notation, a ⊥ b | c, to say that a is independent of b given c.

An example of this kind of situation would be a different setup, where we again have our two coins, but there is a third variable, and that variable is a bell. The experiment is the following: someone, unbeknownst to you, throws two coins separately from each other. You cannot see the outcome of these coin tosses, but that person looks at the result, and if both coins show the same side, both heads or both tails, they ring a bell. If the coins are not the same, if one shows heads and the other shows tails, there is no ringing; the person simply leaves the bell alone. So what this bell does is provide information about the parity of the two coins: it tells you whether they are the same or different.

Let's think about this example for a second; maybe we can write down a joint distribution for it. It is going to be a little more complicated now, because we are no longer dealing with a matrix but with a three-dimensional object: we have three variables, so our table has eight entries, because two to the three is eight. One way to do this is to write the joint distribution P(a, b, c) as two two-by-two tables, one for each value of c (these are simply further entries of the same table, not a separate distribution). If c is zero, meaning the bell did not ring, then the two coins must show different faces, so the entries for equal faces are zero and the other two entries are each one quarter. If c is one, the bell has rung, so the two coins must show the same face, and only the two equal-face configurations are possible, again with probability one quarter each. All the non-zero entries are quarters, because each configuration of the two coins happens with probability one quarter; otherwise this would not be a probability distribution.
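To make this concrete, here is a small numpy sketch that builds exactly these two tables: the 2x2 joint for the two independent fair coins as an outer product, and the 2x2x2 joint for the coins-and-bell experiment (the variable names are mine, the numbers are the ones from the example):

```python
import numpy as np

# Two independent fair coins: the joint is the outer product of the marginals.
p_a = np.array([0.5, 0.5])         # p(a=0), p(a=1)
p_b = np.array([0.5, 0.5])         # p(b=0), p(b=1)
joint_ab = np.outer(p_a, p_b)      # 2x2 table, every entry 1/4
print(joint_ab)

# Coins plus bell: c = 1 exactly if both coins show the same face.
joint_abc = np.zeros((2, 2, 2))    # indexed [a, b, c]
for a in (0, 1):
    for b in (0, 1):
        joint_abc[a, b, int(a == b)] = 0.25   # each coin configuration has prob. 1/4
print(joint_abc.sum())             # 1.0 -- it really is a probability distribution
```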
Okay, so now let's think about independence, and then conditional independence, in this setting. First of all, there is still marginal independence: a and b are of course independent of each other. How do we see this? Well, that is just the generative process, it is literally what happens: you just throw two coins. But we can also check that it is true in our table of probabilities, by writing down the marginal over a and b, which by the sum rule is P(a, b, c = 1) plus P(a, b, c = 0), the sum of the two tables. And as you can see, this is just our old table back, with one quarter everywhere, so I do not even have to write it down.

Another example is marginal independence of a and c. This is a different kind of variable pair now, because c is the ringing of the bell. Before I write down the math, we can think about it: say you throw one coin, a, and you see heads. What is your prediction for whether the bell will ring or not? You do not know, because you have no idea what the other coin shows, so the probability is still fifty-fifty that the bell rings. That means your coin toss does not provide information about the value of the variable c, so the two are independent. We can also check this by computing the distribution over a and c: P(a, c) is, by the sum rule, the sum of P(a, c, b = 1) and P(a, c, b = 0). By the way, notice that I have just made use of the property of our notation that all variables have names: we can even change their order and nothing is affected, because the function P is aware of where all the a's, b's and c's are.

So what is this sum? We can first write down a table over a and c for b = 1. If a is 0 and b is 1, then c has to be 0; it cannot be 1, because the bell will not ring if a and b have different values. So that entry is one quarter; all we are doing here is selecting numbers from the joint table above, so we cannot get any number other than one quarter, and there has to be a 0 in the other cell. You can convince yourself that the same works in the other direction. Adding the b = 0 table, which contains the complementary entries 0, 1/4, 1/4, 0, we again get the full table of quarters everywhere, and of course that is independent. The same goes for b and c, but we do not have to work it out, because it is totally symmetric; you just swap the roles of a and b.

Now let's get to the point where conditional independence becomes a little more interesting, and where we actually talk about conditional independence: let's talk about a and b given c. Intuitively, if you observe the ringing of the bell, that provides information about the other coin. I throw my coin and see that it is heads; the other person throws their coin, which I cannot see; but the person watching both of them rings the bell. Then I know what the other coin is: I know that it is also heads.
So the bell actually provides information about the other coin. Let's make this formal; I need some space on my whiteboard for a moment, so let me get rid of this.

We want to look at P(a, b | c), to check whether it can be factorized. How do we write that down? We have just spoken about it intuitively, but we want to use the math to get it right. We use the definition of a conditional distribution (Bayes' theorem, if you like), because we can simply compute the normalization constant, the evidence. So we need P(a, b) for a particular value of c; let's say P(a, b) given that the bell rang. We could equally use the setting in which the bell did not ring, because conditional independence means that the factorization has to hold for all values of the conditioning variable, not just one of them; so if we find one value for which the table cannot be separated into a rank-one product, we are done.

We have to normalize by P(c = 1), and I hope you have convinced yourself that this is one half: the probability for the bell to ring can be read off by summing over the c = 1 part of the table. So P(a, b | c = 1) is one divided by one half, in other words two times the joint P(a, b, c = 1), and that is just the c = 1 table from above: one quarter, one quarter on the diagonal and zeros elsewhere, which becomes one half and one half after multiplying by two. This table, because it is a diagonal table, clearly cannot be written as the outer product of two vectors, as a rank-one product. So a and b are dependent on each other when we condition on c: the observation of the bell ringing gives us information about the value of the other coin.

Maybe just to conclude this thought: what about a and c given b? Let's say b = 1, so this is now the situation where I throw my coin, call it b, and I see that it is heads, and I now have to predict the other coin and the bell jointly. If I first predict what the other coin is, then I am forced to assign a particular value to the bell, and the other way around: if I have thrown my coin, seen that it is heads, and then observe the person ringing the bell, I know what the value of the other coin is. So the claim is that a and c are conditionally dependent on each other given b. Let's see that this is true. We use the same computation again: P(a, c | b = 1) is P(a, c, b = 1) divided by P(b = 1). That probability is again fifty percent, one half, so we multiply by two, and we need a new table, a table over a and c for b = 1. What does it look like? If a is zero (while b is one), the bell must not ring, so we get the probability one quarter in the (a = 0, c = 0) cell; and if a is one and b is one, then the bell has to ring, so we get the one-quarter entry at (a = 1, c = 1) and zeros elsewhere. You can check for yourself that these entries are exactly the corresponding ones picked from the joint table above, because that joint of course defines everything further down. And again, this is the same kind of diagonal table, so it does not factorize: a and c are dependent on each other once we condition on b.
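Here is a short numpy sketch of the checks we just did on the board: it rebuilds the same joint table and tests which of the marginal and conditional tables can be written as an outer product of their marginals (the helper function is just my illustration of the rank-one test):

```python
import numpy as np

# Joint p(a, b, c) for the coins-and-bell experiment: c = 1 iff a == b.
joint = np.zeros((2, 2, 2))        # indexed [a, b, c]
for a in (0, 1):
    for b in (0, 1):
        joint[a, b, int(a == b)] = 0.25

def factorizes(table):
    """True if a normalized 2x2 table equals the outer product of its marginals."""
    return np.allclose(table, np.outer(table.sum(axis=1), table.sum(axis=0)))

# Marginal independence: p(a, b) and p(a, c) are flat tables of quarters.
p_ab = joint.sum(axis=2)
p_ac = joint.sum(axis=1)
print(factorizes(p_ab), factorizes(p_ac))        # True True

# Conditional dependence: conditioning on the bell couples the coins, and
# conditioning on one coin couples the other coin with the bell.
p_ab_given_c1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_ac_given_b1 = joint[:, 1, :] / joint[:, 1, :].sum()
print(p_ab_given_c1)                             # diagonal table of halves
print(factorizes(p_ab_given_c1), factorizes(p_ac_given_b1))   # False False
```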
So we are going to use this concept of independence, or conditional independence, to simplify computations further down. So far you have not actually seen any reduction in computational complexity, because we have been talking about two-by-two matrices, and a two-by-two matrix written as the outer product of two two-by-one vectors contains the same number of entries; we always need four numbers. Things change when we have more than two variables, and also, of course, when the variables can take more than binary values, which we will get to next week.

So this was the second section of the lecture. We have introduced a concept which, in its most general form, we can call conditional independence: two variables a and b are called conditionally independent given a variable c if and only if their conditional distribution factorizes. A special case of this is plain independence, or you might call it marginal independence: two variables a and b are called independent if and only if their joint distribution factorizes. By the way, if you want to check for yourself whether you have really understood this example with the coins and the bell, try to think about the situation in which one of the coins is not fair; say the second coin has probability one quarter of coming up heads and three quarters of coming up tails. Think for yourself whether the conditional independence structure we have just constructed still holds.

Now that we have defined the notions of independence and conditional independence, we can try to see how this concept can help us reduce the computational cost of inference under uncertainty. To do so we will use an example that goes back to a wonderful book by Judea Pearl, written in 1988, called "Probabilistic Reasoning in Intelligent Systems". It was picked up by David MacKay in his book "Information Theory, Inference, and Learning Algorithms", which was published a little later. Both of these books, by the way, are really great, even though you do not have to read them to follow this course. I actually tried to come up with a better example, one that is more recent, maybe one motivated by the current societal situation, but it turns out this example is so well crafted, and the numbers work out so nicely, that it is the perfect example to make, and I am not going to change it at all.

It is based on the following story. There is this guy, let's call him Fred, who lives in downtown Oakland in the Bay Area in California. There are two important things to know about Oakland: one is that it is a high-crime area, at least it was in 1988, and the other is that, because it is in the Bay Area, there are regular earthquakes in this region. Fred drives to work in Silicon Valley every day, so he sits in his car for quite a long time to get from one side of the Bay to the other.
Because he is worried about break-ins in his house (he has had break-ins before), he has an alarm at home. Now he is sitting at work and he gets a phone call from his neighbor telling him that his alarm just went off. He is worried, of course, that once again someone has broken into his house, so he gets out of work, jumps into his car and starts driving home. Because it is quite a drive, he switches on the radio, and now he hears an announcement that a small, minor earthquake has just hit Oakland. As a rational man, Fred is relieved, because he thinks to himself: ah, of course, this must have been the source of the alarm; there was this small earthquake, and it set off my alarm.

This kind of reasoning process is typical of the way humans think about their daily lives, so let's see how a machine could reproduce it using the calculus of uncertainty, probability theory. To do so we will follow a basic cooking recipe that applies to all such problems, and it starts with creating all the ingredients we need for probabilistic reasoning, which we already outlined in the previous lecture.

First of all, we have to create our sigma-algebra: the set of variables with which we are going to reason. In this example there are four of them. There is a variable we will call A, a binary variable that says whether the alarm at Fred's home has been triggered or not. There is a variable E, which stands for earthquake, another binary variable that says whether there is an earthquake or not. There is a variable B, which indicates, again in a binary fashion, whether there has been a break-in at his house or not. And finally there is a variable R, which indicates whether an announcement has been made on the radio.
Notice that some of these variables, similar to the experiment with the cards at the beginning of the first lecture, are latent, in the sense that we do not know them; at least Fred, in his current state of mind, does not know their value. Others are observable: first we get the information that the alarm is on, and later we also get the information that there is an earthquake.

The second ingredient we need, once we have written down our sigma-algebra, is the function P that shows up in the definition of a probability space. We already have elementary events and a sigma-algebra: the elementary events are, essentially, these four variables and all their possible combinations, and the sigma-algebra is all possible combinations of such statements, sets of subsets. Now we need a probability function: we need to assign a probability to every possible state of these four binary variables.

We could do so by just creating a table for the four binary variables. That table has 16 entries, two to the four, only one of which is fixed by the fact that all probabilities have to sum to one; so there are 15 degrees of freedom, and we can fill this table in any way we want. Basically, we can start with any variable we care about, write it on the right-hand side, and then create conditional distributions for additional variables given everything that came before, until we have used up all the variables. That is what the product rule tells us to do. However, and this is where conditional independence becomes important, sometimes we get lucky: there may be additional information available when designing the model, domain knowledge about the problem, that simplifies the computation.

So let's look at the individual terms. The probability for the alarm to go off obviously depends only on whether there has been an earthquake or a burglary, because the radio announcer does not care about Fred's alarm. So this R can be dropped, and instead of a complicated term, a table with eight degrees of freedom, we only get four degrees of freedom. Why eight degrees of freedom, by the way? Because there are three conditioning variables, and two to the three is eight: we would have to enumerate all eight configurations, and for every single one of them write down the probability of an alarm (we do not need the probability of no alarm, because that is just one minus the probability of an alarm). So there would be eight degrees of freedom here, but since this term is independent of R, we only need to enumerate four possible states. The probability for the radio announcement also does not depend on whether there has been a burglary, again because the radio announcer does not care about Fred's home; so here, instead of four parameters, we only need two. And an earthquake is not triggered by a burglar, so we can drop the B there and are left with essentially just one parameter, the probability of an earthquake, because the probability of no earthquake is one minus that. Together with one parameter for the prior probability of a burglary, that is one way in which independence can drastically help us simplify computation: instead of having to write down 15 different real numbers, we now only have to write down eight. That is almost half the degrees of freedom, and all we have done is use domain knowledge.
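Just to spell out the counting, here is a trivial sketch (nothing deep, it only makes the bookkeeping explicit):

```python
# Degrees of freedom: full joint over (A, B, E, R) vs. the factorized model.
full_joint = 2 ** 4 - 1        # 15 free numbers for an unstructured table

p_a_given_b_e = 2 ** 2         # p(A=1 | B, E): one number per (B, E) configuration
p_r_given_e   = 2 ** 1         # p(R=1 | E)
p_e           = 1              # p(E=1)
p_b           = 1              # p(B=1)
factorized = p_a_given_b_e + p_r_given_e + p_e + p_b

print(full_joint, factorized)  # 15 8
```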
Now, the final part of our cooking recipe is going to be a computation, but before we get to that we can create a bit of a graphical representation of what we have just been doing. At the moment this is just a nice little picture; I will come back to it in a few minutes and tell you that it is actually a formal framework for creating such pictures, but it is best to start this thought process by simply looking at the picture. What we have done here is create a visualization of what is going on: we write down all four variables, each of them in a circle, and then I did two things. The first is that I drew arrows between the variables, in the following formal way: for every term in the factorization above, I drew arrows from all of the variables on the right-hand side of the conditioning bar to the variable on the left-hand side. So there are two arrows, from E and from B, into A; one arrow from E to R; and no arrows into E or B, because there is nothing on their right-hand sides. The other thing I did is to color, in a dark color, the variables that Fred gets to see, not at the same time but one after the other, though eventually he gets to see them, and to leave empty the latent variables, those to which Fred does not have direct access.

Now that we have this picture, we can do inference, and the beautiful thing about probabilistic reasoning is that the inference part is totally mechanical: we just have to write down numbers and use Bayes' theorem to compute posteriors. So let's start with some numbers; we need to actually say how likely everything is, not just provide variables. Let's assume, for the sake of argument, that the probability of a burglary at Fred's home is about ten to the minus three per day; that means his house is broken into roughly once every three years. The probability of no burglary is of course one minus that. The probability of an earthquake, to keep things simple, might also be ten to the minus three, so again an earthquake about every three years, and the probability of no earthquake is one minus that number. Now let's assume the radio announcer is faithful: this person only talks about earthquakes when there actually is one, and never talks about earthquakes when there is none.
That is very simple, and it will of course drastically simplify the computation: we basically do not have to worry about R anymore, because R is now identical to E. The most complicated part of our computation is the conditional probability table for the alarm to go off depending on whether there is or is not an earthquake and/or a burglary. This is a bit of a combinatorial problem, and combinatorics can be quite tedious, so let me simplify it for you by just walking through the reasoning; it is not something you have to be terribly excited about, we just need some numbers.

I will introduce three parameters. f is the probability of a false alarm: every now and then burglary alarms just go off without much reason, not because of an earthquake or a burglary; let's say that also happens once every three years, so ten to the minus three, to keep things simple. Then we need a probability for the alarm to go off if there actually is a burglary. Let's say this is a very reliable alarm, so it goes off in 99.9 percent of the cases in which there really is a burglary; call that alpha_b. And finally we need the probability for the alarm to go off if there is an earthquake, which should hopefully be a small number, because alarms should not all go off whenever there is a small earthquake. Let's say that in a small earthquake about one in a hundred such alarms actually goes off, so a probability alpha_e of 0.01 for an alarm given that there is an earthquake.

With these three generating parameters we can now populate the conditional probability table, and we can do it quite quickly; it is just plugging in numbers. What is the probability for the alarm not to go off if there is no burglary and no earthquake? Actually, it is easier to first write down the probability for the alarm to go off in that case: that is exactly what we just defined, the small false-alarm probability f, once every three years. The probability for there to be no alarm in this case is then just one minus that. This trick simplifies the computations quite a bit: you always look for the case that is easier to write down and then take the complement.

For the other three cases it is easier to think about how no alarm can come about. How could there be no alarm if there is a burglary but no earthquake? The probability for that is that, first, there is no false alarm, and second, there is a burglary but the alarm did not trigger on it; so that is (1 - f)(1 - alpha_b), and the probability for the alarm to go off is one minus that, which is actually a somewhat complicated number even though it sounds so simple. The probability for no alarm given an earthquake but no burglary is the same, just with alpha_e instead of alpha_b. I recommend you have a look at this afterwards yourself to understand where these numbers come from; it is sometimes easier to wrap your head around it once you have stared at it a bit on your own. And what is the probability for no alarm given that there is a burglary and an earthquake simultaneously?
Well, in that case the probability for there to be no alarm is that there is no false alarm, and despite the burglary no alarm, and despite the earthquake no alarm; so (1 - f)(1 - alpha_b)(1 - alpha_e), and the probability for the alarm given both is just one minus that. We can also compute the actual numbers; here I have done that for you. It is obviously trivial arithmetic, so let's save ourselves the effort: just a bunch of numbers, some very large, some very small.

Now let's actually do some inference. The first question one could ask is: what is the probability that something concerning happened, either an earthquake or a break-in, given that the alarm just went off? This is now just Bayes' theorem. We plug in the probability for the alarm to go off given earthquake and burglary (we have not yet said whether we want B or E to be 1 or 0; we can plug in those numbers later), multiply it by the prior probabilities for burglary and earthquake, and normalize by the evidence. The most complicated term in this expression is actually the evidence, the probability for there to be an alarm at all. To get it we just sum up lots of numbers that can all be found on the previous slide, multiply them together, and out comes a probability of about 0.2 percent for an alarm on any given day. We use this to compute the posterior probabilities for burglary and earthquake given an alarm. Here are all four of them: the probability, given an alarm, of no burglary and no earthquake, of no burglary but an earthquake, of a burglary but no earthquake, and of a burglary and an earthquake. We see that the probability that nothing happened is about 50 percent, and the probability of a burglary without an earthquake is also about 50 percent.

What is the probability of a burglary in the first place, regardless of earthquake? To get that we have to get rid of the earthquake variable, and how do we get rid of variables? We use the sum rule. Summing over the two earthquake cases gives a probability of just over 50 percent for no burglary, and just under 50 percent for a burglary. That is maybe enough concern to get into your car and drive back home.

Another interesting aspect of this computation is that if you look at this posterior table, the conditional probability for earthquakes and burglaries given an alarm, you will notice that it is not independent anymore. Although the model initially assumed that burglaries and earthquakes happen independently of each other, after we observe the alarm there is a conditional dependence between these variables. This is an interesting effect: once you observe a variable that can be produced by two different causes, two different generative processes, the two possible causes become correlated with each other; they now depend on each other. So what kind of effect is that going to have? We see it if we go forward and ask: once we get the additional information, via the radio, that there actually is an earthquake, what is the probability of a break-in given that there was an alarm and a radio announcement? For that we just compute the probability of a burglary given that there is an earthquake and an alarm. How do we do that?
Well, we use Bayes' theorem again, which tells us to compute this conditional by taking a joint and dividing by the marginal, the evidence. We can do that by taking the joint probabilities we just computed on the previous slide and simply recomputing the normalization constant, which is also just two numbers from the previous slide. That gives a probability of 92 percent that there is no burglary and 8 percent that there is a burglary.

This is an interesting effect, called explaining away. Initially we got the observation that there is an alarm, which could have come from essentially two different sources, either an earthquake or a burglary. Once we are informed that there is an earthquake, our posterior probability for the burglary actually goes down. This is of course in line with everyday reasoning: your concern drops once you find out that another explanation you were also considering is actually the right one. But notice what this is not saying: I am not saying that after we hear about the earthquake we are less concerned about the burglary than we were before we heard about the alarm. Actually, the posterior probability of the break-in is still significantly higher than it was a priori. It is just that this other explanation is so much more likely, now that we know it is actually true, and the only remaining hypothesis that still supports a burglary is that there was a burglary and an earthquake at the same time, independently of each other, which is quite unlikely, even though it is still more likely than the break-in was before we heard about the alarm.
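Here is a compact numpy sketch of this whole inference with the numbers we just used (f = 0.001, alpha_b = 0.999, alpha_e = 0.01, priors of 10^-3; because the radio is assumed perfectly faithful, hearing the announcement is treated as directly observing E = 1). It is only an illustration of the mechanical Bayes step by complete enumeration, not a general inference engine:

```python
import numpy as np

# Priors and alarm parameters from the example.
p_b = 0.001       # prior probability of a burglary on a given day
p_e = 0.001       # prior probability of an earthquake on a given day
f = 0.001         # probability of a false alarm
alpha_b = 0.999   # probability that a burglary triggers the alarm
alpha_e = 0.01    # probability that an earthquake triggers the alarm

def p_alarm(b, e):
    """p(A=1 | B=b, E=e): the alarm stays silent only if no trigger fires."""
    return 1.0 - (1.0 - f) * (1.0 - alpha_b) ** b * (1.0 - alpha_e) ** e

# Joint p(B=b, E=e, A=1) for all four latent configurations, indexed [b, e].
joint_a1 = np.array([[p_alarm(b, e)
                      * (p_b if b else 1 - p_b)
                      * (p_e if e else 1 - p_e)
                      for e in (0, 1)] for b in (0, 1)])

evidence = joint_a1.sum()            # p(A=1), roughly 0.002
posterior = joint_a1 / evidence      # p(B, E | A=1); rows b=0/1, columns e=0/1
print(evidence)
print(posterior)                     # ~0.50 and ~0.50 in the e=0 column
print(posterior[1].sum())            # p(B=1 | A=1): just under 0.5

# Explaining away: the radio report fixes E=1, so renormalize the e=1 column.
p_b1_given_a1_e1 = posterior[1, 1] / posterior[:, 1].sum()
print(p_b1_given_a1_e1)              # roughly 0.08
```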
So this example did not just serve to show structure in our reasoning process; it also gave you a simple worked case of the cooking recipe for solving inference problems that we are going to follow all through the term. It boils down, if you like, to a very concise sentence due to the wonderful David MacKay: to do inference, you always write down the probability of everything. That means you first define what you might call the sigma-algebra, the set of all relevant variables; then you define a joint probability, assigning a probability measure to the sigma-algebra to create a probability space, which is also known as writing down a generative model. Once we have the probability space, we are concerned with where our observations come from: observations fix certain variables to certain values, and then inference is an entirely mechanical process of applying Bayes' theorem. This is the key conceptual strength of probabilistic reasoning: once you have agreed on a generative model, there is basically no question anymore about what the correct answer to an inference problem should be. And if you remember how easy, how fundamentally clean, it was to write down a sigma-algebra, at least on mathematical grounds, this should be really exciting news, because it breaks down the supposedly complicated process of reasoning under uncertainty into a purely mathematical modelling step that ends with a simple mechanical computation.

Of course, in practice there are two different challenges here. The first one is a conceptual, philosophical one: people argue a lot about their sigma-algebras, and we will do that together over the entire term. The second is a computational one: it is easy to say "just apply Bayes' theorem", but as we just noticed, that can be a combinatorially hard problem. To solve these inference problems we have to rely on all sorts of computational simplifications, approximations and models, and sometimes, actually all the time, computational concerns will enter our modelling decisions. We will have to think about this for the rest of the term.

Now, this is not a gray slide, but you can still treat it like one and take a quick break.

For the final part of this lecture I would like to return to those pretty pictures I just happened to draw on one of the slides. They turn out to be not just pretty pictures; they are a mathematical formalism that is surprisingly helpful, and we will be using it across the rest of the term to deal with this very important issue of conditional independence. These pictures with circles and arrows are called graphical models, and you may have heard of the concept before. Here is a first definition of what I mean by such a model, more specifically called a Bayesian network or a directed graphical model. Directed graphical models are a visualization of a probability distribution over variables, let's give them names x_1 through x_D, that can be written as a factorization: the joint distribution over the variables x_1 to x_D is a product of terms, one per variable, in which each variable is conditioned on a certain set of other variables, which we call its parent nodes or parental variables. The requirement is that x_i is not a member of the parental set of any of the variables in its own parental set. That is why the word "parents" makes sense: if you think of the set of your forebears, you are certainly not a forebear of any of your own forebears. A directed graphical model can then be represented by the pictures I showed you, directed acyclic graphs, in which the propositional variables are nodes and arrows point from parents to children.

We saw an example of this with the radio, earthquake and alarm situation, so let's go back to that example and observe an interesting property of graphical models, or directed graphs. (I will quite often say "graphical model" when I mean a directed graphical model; there are other variants of graphical models, which we will get to later in the course, and then I will explicitly say so.) One interesting aspect of these directed graphical models, or DAGs, directed acyclic graphs, is that any probability distribution over a set of variables can be written as such a graphical model. Why is this true? It is true because of the product rule: we can take any joint probability distribution and simply decide to use the product rule to write it as a factorization of this kind.
In general, though, this means that there are arrows connecting every variable with every other variable, pointing from every possible variable to every other possible variable. That of course also means that you can change the order of these arrows around: you could simply decide to do the factorization in some other way, and then the arrows point in other directions. It is still a directed acyclic graph, and it is still fully connected. This, however, is not a useful application of directed graphs, precisely because it applies to every probability distribution. What is interesting is that every now and then we may be able to find factorizations, such as the one we just used, in which the graph contains far fewer arrows, and that simplifies inference for us. Notice that to do inference in the previous model we only really had to write down the eight entries of the conditional probability table for the alarm to go off given an earthquake and/or a burglary, and actually only four of them are free degrees of freedom, to be clear; that was all it took, and then we were essentially done and could do inference. If we had had to do this for the fully connected graph, we would have had to write down 15 different numbers rather than eight.

Now you might say: okay, that is nice, it is a pretty picture, and I understand how to generate these graphs, but how do you actually read off conditional independence from such a graph? How do you know which conditional probability tables to write down once you have seen the graph? It turns out that this is not an entirely trivial process. However, we can often look at local structure in these graphs to identify interesting conditional independence structure, and to do so we can look at individual elementary graphs. That is essentially what we will do for the rest of today's lecture.

If you have two variables, then either they are disconnected, in which case they are fully independent, or they are connected, in which case they are dependent on each other, and the graph cannot show us any more than that. If you have three variables, things get a bit more interesting, because we can now have three kinds of atomic graphs (there is a fourth one, where at least one of the variables is disconnected from the others, but then we just reduce to the case of two variables). There is the graph where the arrows always point in the same direction, from left to right; this is known as a chain. There is the graph where the arrows point from one variable out to the two other ones; there are different names for these subgraphs, but you might call it a fan-out if you like. And there is the graph where the arrows point inwards, towards one common variable; let's call this a collider. It is sometimes also called a v-structure, but that name is a bit dangerous, because if you move the middle node down, the fan-out looks like a V as well, just with the arrows pointing in the other direction.

So what are the conditional independence structures of these graphs? Let's go through them one by one. I have actually already given you the answers on the slide, but let's see if we can reconstruct them on the board. What is the conditional independence structure of a chain?
So, what is the conditional independence structure of a chain? Let me just draw this on the blackboard for the situation a → b → c. What this graph implies is the factorization you see on the right: the joint can be written as

p(a, b, c) = p(c | b) · p(b | a) · p(a).

I'll write that down, because otherwise it will be awkward in the video. If you do not see why you can read this off from the graph, go back and have a look at how we defined these directed acyclic graphs. Now, what does that mean? Let's first think about something that might be conditionally independent: p(a, c | b). We can work this out by just writing down Bayes' theorem: it is the joint p(a, b, c) divided by the normalization constant p(b). And what is p(b)? It is the sum over all possible values of a and c (I'll write it as one sum over a and c, or you could write it as two sums, whichever you like) of the factorization above:

p(a, c | b) = p(c | b) p(b | a) p(a) / Σ_{a, c} p(c | b) p(b | a) p(a).

Now notice that you can rearrange the sum in the denominator, and let me see if I can do this so that you can see it on the video. The important part is that c only shows up in one of the three factors and not in the other two, so we can pull the sum over c outside the sum over a: the denominator becomes the sum over all values of c of p(c | b), times the sum over all values of a of p(b | a) p(a). The sum over c of p(c | b) is one, because it is a normalized distribution, so we are left with

p(a, c | b) = p(c | b) · [ p(b | a) p(a) / Σ_a p(b | a) p(a) ] = p(c | b) · p(a | b).

By the way, notice that I am misusing notation a little: when I write "sum over a", I mean the sum over all possible values of a. In our notation so far these variables are binary, so it is the sum over a = 0 and a = 1; from next lecture onwards we will consider more general value sets for variables like this, but the factorization property carries through, because it does not really matter what the sum ranges over: a only shows up in the right-hand factor and c only in the left-hand one. So we have written p(a, c | b) as a probability over c times a probability over a. Given b, then, a and c become independent in this graph. That is an interesting property of which we will make very fundamental use for a significant part of the lecture, and you can already imagine why it might be useful: these chains describe time-structured processes, processes that evolve over time, that is, time series.

What about p(a, c), without conditioning on b? Just look at the joint again. Imagine you wanted to sum out b using the sum rule to get a marginal distribution over just a and c. You cannot do this in general, because a factor that depends on b appears both in the term involving c and in the term involving a: some bits depend on a and some bits depend on c, and if I write a sum over b in here, there is no general way, assuming no further specific structure on this joint, for a and c to come out independent of each other.

Okay. So in chain graphs we now know that conditioning on the thing in the middle, the variable b, the local current state if you like, makes the left-hand and the right-hand side of the chain conditionally independent.
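Here is a small numeric check of the chain result, with toy numbers of my own, done by brute-force summation of the joint p(a, b, c) = p(c | b) p(b | a) p(a) over all binary assignments.

```python
from itertools import product

p_a = {1: 0.4, 0: 0.6}
p_b_given_a = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.8}   # keyed (b, a)
p_c_given_b = {(1, 1): 0.7, (1, 0): 0.3, (0, 1): 0.3, (0, 0): 0.7}   # keyed (c, b)

def joint(a, b, c):
    # Chain factorization: p(a, b, c) = p(c | b) p(b | a) p(a)
    return p_c_given_b[(c, b)] * p_b_given_a[(b, a)] * p_a[a]

def prob(condition):
    """Brute-force probability of an event by summing the joint over all assignments."""
    return sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3)
               if condition(a, b, c))

# p(a=1, c=1 | b=1) versus p(a=1 | b=1) * p(c=1 | b=1): equal, so a is independent of c given b.
pb1 = prob(lambda a, b, c: b == 1)
lhs = prob(lambda a, b, c: a == 1 and c == 1 and b == 1) / pb1
rhs = (prob(lambda a, b, c: a == 1 and b == 1) / pb1) * \
      (prob(lambda a, b, c: c == 1 and b == 1) / pb1)
print(abs(lhs - rhs) < 1e-12)   # True

# Marginally, a and c are dependent for these numbers.
print(abs(prob(lambda a, b, c: a == 1 and c == 1)
          - prob(lambda a, b, c: a == 1) * prob(lambda a, b, c: c == 1)) < 1e-12)  # False
```
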
Let's look at the next graph, the fan-out type of structure. That is a joint distribution over a, b and c given by p(a | b) times p(c | b) times p(b). What is the conditional independence structure of this? Let me write it down again: b in the middle, with arrows to a and to c. Maybe just to point out along the side: where we put the nodes does not really matter, we just place them so that the picture is visually pleasing; sometimes it is a good idea to move them up and down, and we could also put them all in one line, whatever makes the structure easy to see. Also, these arrows have in general nothing to do with causality; they are just there to express conditional independence. We talk about generative processes, not about causal processes, to be precise. Nevertheless, in many cases causality will play an intricate role, and in fact the circumstance that causality is only weakly connected with generative processes is one of the key challenges in getting causality right, because people constantly confuse the two.

So this joint distribution, which I will write as

p(a, b, c) = p(a | b) · p(b) · p(c | b),

lets us ask again: what is the conditional independence or dependence of a and c? Think about the joint distribution over a and c given b. Bayes' theorem tells us to compute this by writing down the joint, p(a | b) p(b) p(c | b), divided by the double sum of this expression over a and c. We can actually already see what happens just by looking at it: the denominator splits into one sum over a, which only involves p(a | b), times p(b), which does not depend on a at all, times another sum over c of p(c | b). Both sums are one, and you can perhaps already see that p(b) will simply cancel between numerator and denominator, which is reassuring, because we are conditioning on b. So that part goes, and we are left with just two normalized probability distributions, p(a | b) and p(c | b); they are obviously independent of each other. But in general, if you look at the joint distribution alone, without the denominator, and you instead wanted to sum out b from this expression, then b shows up both in the term involving a and in the term involving c, and we will not be able to get two separate factors that depend only on a and only on c in a factorized way.
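The same brute-force check as for the chain can be repeated for the fan-out, again with invented numbers, now using the factorization p(a, b, c) = p(a | b) p(b) p(c | b).

```python
from itertools import product

p_b = {1: 0.3, 0: 0.7}
p_a_given_b = {(1, 1): 0.8, (1, 0): 0.1, (0, 1): 0.2, (0, 0): 0.9}   # keyed (a, b)
p_c_given_b = {(1, 1): 0.6, (1, 0): 0.2, (0, 1): 0.4, (0, 0): 0.8}   # keyed (c, b)

def joint(a, b, c):
    # Fan-out factorization: p(a, b, c) = p(a | b) p(b) p(c | b)
    return p_a_given_b[(a, b)] * p_b[b] * p_c_given_b[(c, b)]

def prob(condition):
    return sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3)
               if condition(a, b, c))

pb1 = prob(lambda a, b, c: b == 1)
lhs = prob(lambda a, b, c: a == 1 and c == 1 and b == 1) / pb1
rhs = (prob(lambda a, b, c: a == 1 and b == 1) / pb1) * \
      (prob(lambda a, b, c: c == 1 and b == 1) / pb1)
print(abs(lhs - rhs) < 1e-12)   # True: a is independent of c given b

# Without conditioning on b, a and c are coupled for these numbers.
print(abs(prob(lambda a, b, c: a == 1 and c == 1)
          - prob(lambda a, b, c: a == 1) * prob(lambda a, b, c: c == 1)) < 1e-12)  # False
```
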
So again: in these fan-out structures, a is in general not independent of c, but it becomes independent of c once we condition on b. The intuition that forces itself onto you is to think of this as a generative, causal kind of situation, one in which b causes both a and c. If you know the cause, then a and c differ only by some noise process that is local to each of them, and so they are conditionally independent once the cause is known; if you do not know the cause, then of course they depend on each other, because they are connected through the cause. Again, though, it is dangerous to think this way, because these graphs are not, by their construction, directly connected to causality. The way we draw the arrows is based solely on a mathematical property of the generative process, namely that it factorizes in a certain way. It is not necessarily motivated by causality; in many cases it will be, because of actual causal structure, but it does not have to be.

The final graph we can talk about is the collider type of structure, in which two parents a and c affect, and create, b. Let's draw this again. You notice how the language I just used, parents affecting their children and so on, already suggests causality; that is exactly the problem we will have to deal with when we talk about causality. So let's finally talk about this graph. The joint is given by

p(a, b, c) = p(b | a, c) · p(a) · p(c),

that is, the conditional for b given both parents and the two separate ancestral terms. In this expression we can see, just by looking at it, that the marginals of a and c are independent of each other. Why? Because b shows up in only one term, which we can simply sum out: it is a normalized probability distribution, and a and c only enter it as conditions, so the sum over b is one and that term just drops, leaving p(a) p(c). Evidently a and c are independent of each other. But when we condition on b, we get a term that complicates things. We can just write it down again; it is always the same kind of computation, and by now you should probably be able to do this proof in your head: the probability of a and c given b is the joint, p(b | a, c) p(a) p(c), divided by the sum over a and c of p(b | a, c) p(a) p(c). The annoying bit is the term p(b | a, c), which forces us to consider a and c jointly in this computation, because a and c both show up in it; when we sum them out, there is one term in which they have to be considered together.
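A final brute-force check, with numbers I made up, for the collider p(a, b, c) = p(b | a, c) p(a) p(c): marginally a and c are independent, but conditioning on b couples them (the effect often called "explaining away").

```python
from itertools import product

p_a = {1: 0.5, 0: 0.5}
p_c = {1: 0.5, 0: 0.5}
p_b1_given_ac = {(1, 1): 0.95, (1, 0): 0.6, (0, 1): 0.6, (0, 0): 0.05}  # P(b=1 | a, c), keyed (a, c)

def joint(a, b, c):
    # Collider factorization: p(a, b, c) = p(b | a, c) p(a) p(c)
    pb1 = p_b1_given_ac[(a, c)]
    return (pb1 if b == 1 else 1.0 - pb1) * p_a[a] * p_c[c]

def prob(condition):
    return sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3)
               if condition(a, b, c))

# Marginal independence: p(a=1, c=1) equals p(a=1) p(c=1).
print(abs(prob(lambda a, b, c: a == 1 and c == 1)
          - prob(lambda a, b, c: a == 1) * prob(lambda a, b, c: c == 1)) < 1e-12)  # True

# Conditioned on b = 1, a and c are no longer independent.
pb1 = prob(lambda a, b, c: b == 1)
lhs = prob(lambda a, b, c: a == 1 and c == 1 and b == 1) / pb1
rhs = (prob(lambda a, b, c: a == 1 and b == 1) / pb1) * \
      (prob(lambda a, b, c: c == 1 and b == 1) / pb1)
print(abs(lhs - rhs) < 1e-12)   # False
```
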
Okay, so these atomic structures sometimes allow us to just read off conditional independence from our graphs. Here is our earthquake, alarm, burglary and radio situation again. What can we read off from this graph? Actually, maybe you want to stop the video here for a second and think about it yourself. Once you have done so, I can tell you what we can read off. For example, we can see that the radio announcement becomes conditionally independent of the alarm once we hear about the earthquake. Why? Look at the relevant subgraph. First of all, why can we think in terms of a subgraph at all? Because we can treat the alarm and the burglary as a single joint variable, a set of two variables considered together; for this kind of question we do not even need to try to get rid of either of them, and we can still make the statement we want. Notice that what then remains is one of these fan-out type structures: the earthquake is the common parent of the radio announcement and of the joint (alarm, burglary) variable. So we can read off that when we condition on the parent, the children become independent of each other, and that remains true even though one "child" is really the pair of alarm and burglary rather than a single variable. In general, though, if we sum out the earthquake, the arrows at the top of the graph tell us that the radio announcement and the alarm are not independent of each other; they are dependent. Why is that? Well, if you do not yet know whether your alarm was raised by a burglar or by an earthquake, then hearing about the alarm should raise your posterior probability for a radio announcement: one possible explanation for the alarm is an earthquake, and if there is an earthquake, there will be a radio announcement. So if someone calls you on the phone and tells you that your alarm just went off, your posterior probability for a radio announcement about an earthquake has just risen as well.

There are other conditional independence structures that we cannot directly and trivially read off just by staring at the graph and looking for substructure. They can still be found if you think a little longer about them, for example by writing down the entire joint and laboriously searching for conditional independences, or again by introducing sets of variables that are considered jointly and then looking more closely at whether particular subsets can be separated out or not. In the interest of time I will not do that here and leave it to you as an exercise.

It is also interesting to point out a final aspect, maybe a bit of a flaw, of these graphical models, by looking again at our case of two coins and a bell. That was the situation we spoke about earlier today: two coins are thrown, of which we may or may not observe parts, and then there is a third process c, a bell that rings whenever the two coins show the same face, so if they are both heads or both tails. These conditional probability tables, and you can do this for yourself, imply various independences, as we discussed earlier in the lecture: a is independent of b, b is independent of c, and c is independent of a. Now, unfortunately, there is no single directed graph that can represent all of these independences at once. If we take each of these individual statements, we can generate different factorizations from this table: we could think of c, the bell, as the child generated by the two coins as its two parents; or we could think of one of the coins as the child of the other coin and the bell; or of the other coin as the child of the bell and the first coin. Notice that this is where causality breaks down. I described the situation to you as: someone throws two coins, looks at them, and then rings the bell. But this conditional independence structure does not encode that causal story. The same situation could also be described as: you throw one coin, then you randomly decide whether to ring the bell or not, and if the bell rings, you place the other coin so that it shows the same face as the first coin; otherwise you turn it around.
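Here is a quick check of the pairwise independences claimed above, assuming two fair coins (my assumption for the numbers) and a bell that rings exactly when the coins match. It also shows that conditioning on the bell couples the two coins, in line with the collider reading of this example.

```python
from itertools import product

def joint(a, b, c):
    # p(a) p(b) = 1/4 for two fair coins, times an indicator that the bell
    # state c equals "both coins show the same face".
    return 0.25 if c == (1 if a == b else 0) else 0.0

def prob(condition):
    return sum(joint(a, b, c) for a, b, c in product([0, 1], repeat=3)
               if condition(a, b, c))

getters = {"a": lambda a, b, c: a, "b": lambda a, b, c: b, "c": lambda a, b, c: c}
for x, y in [("a", "b"), ("a", "c"), ("b", "c")]:
    p_xy = prob(lambda a, b, c: getters[x](a, b, c) == 1 and getters[y](a, b, c) == 1)
    p_x = prob(lambda a, b, c: getters[x](a, b, c) == 1)
    p_y = prob(lambda a, b, c: getters[y](a, b, c) == 1)
    print(x, y, abs(p_xy - p_x * p_y) < 1e-12)   # True for all three pairs

# Given the bell, the two coins are perfectly coupled, hence not independent.
p_c1 = prob(lambda a, b, c: c == 1)
lhs = prob(lambda a, b, c: a == 1 and b == 1 and c == 1) / p_c1
rhs = (prob(lambda a, b, c: a == 1 and c == 1) / p_c1) * \
      (prob(lambda a, b, c: b == 1 and c == 1) / p_c1)
print(abs(lhs - rhs) < 1e-12)                    # False
```
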
That is not the causal process we imagined, but it implies the same conditional probability table and therefore the same conditional independence structure. Now, if you look at these three expressions, they correspond to three different graphs that you can write down, and I hope you will be able to convince yourself that these three factorizations correspond to these three graphs. Each of these graphs encodes only the conditional independences contained in its own representation, and, I will leave it to you to convince yourself of this, no single one of these graphs encodes all of the independences we listed above. So a problem with directed acyclic graphs, or rather with Bayesian networks, directed graphical models, as a representation of conditional independence structure is that they are an incomplete language: any single directed acyclic graph can encode only a certain set of conditional independence statements, and there are joint probability distributions for which it is impossible to encode all of their conditional independence structure in one directed graph in one go. That is a problem, and you might ask yourself how to fix it. If you can find a general solution, you are very much invited to write a wonderful paper about it, because it is essentially an unsolved problem. However, before you do so, maybe wait until the end of the term, because there is a much more complicated story to these graphs that we will talk about over the course of this term.

So now I just have to get rid of this stupid thing and finally show you our final slide. Thanks for taking part in the lecture. This second lecture was about the notion of independence. We saw that probability theory, even though it is a beautiful mechanical process, has a computational issue: if we keep track of uncertainty about several possible hypotheses, we may face a combinatorially hard problem. This is a fundamental part of reasoning under uncertainty; it is not solved by any other framework, it is simply caused by the fact that we want to keep track of several possible hypotheses. We will find many different ways of dealing with this computationally. One very important one is the notion of conditional independence, which reduces complexity and helps make things tractable by essentially separating sums from each other, so that they can be evaluated with far fewer degrees of freedom, because the combinatorial complexity drops out. We will think much more about this over the course of the term. We also encountered directed graphical models as a way to encode these generative processes. They provide a language that allows us to encode certain kinds of conditional independence visually, which is much easier to read off from a picture than from a factorization of a probability distribution. They are wonderful tools.
They are quite formal, but they also have some flaws. One of them is that it is not entirely trivial to read off conditional independence, and the other is that they are in general not able to encode all of the conditional independence structure of a particular joint probability distribution in just one graph. We will return to them later in the course, and they will show up as a pictorial view alongside the slides over and over again. Conditional independence is primarily going to be a computational tool for us, one that simplifies computational tasks in machine learning, and we will make lots of use of it throughout this course. But I want to leave you, as we end this second lecture, with a philosophical thought to ponder, which is that it is actually the very nature of independence that is maybe the biggest philosophical issue for probability theory: not the prior and the likelihood and what it means to be uncertain, but what it means to be independent. Kolmogorov, in his original 1933 book, already points out very early on that independence is really the key challenge in defining what probability is and how to use probabilities correctly. Here is the original text. He essentially says that we should realize that the notion of independence contains the real core of what is uniquely problematic about probabilistic reasoning, about computation with probabilities. It is not the definition of probabilities as such, because that is just set theory, as we have now discussed at length, but the fact that we use notions like independence to simplify computation. Kolmogorov thought that this was one of the biggest tasks for philosophers studying the notion of probability. Maybe that is something for you to think about as well. Thank you for your time.