Thank you all for being here. It's exciting to see a full lecture hall for this course in particular, in this building, because I have fond memories of this course in this building. The last time I taught it here was two years ago — well, actually the last time I taught it in this room was in 2018, and then, well, let's not talk about that. I want to start with an experiment that I've been doing for years, and which I didn't invent myself — I've taken it from my PhD advisor, David MacKay. I've got three cards here. One of these cards is red on both sides, one is red on one side and white on the other, and one is white on both sides. I'm going to put them into this nice little branded bag, shake the bag around so you can't see what's happening inside and all the cards get mixed up, and now I'm going to pull out just one of them — not cheating, just picking any one so that you don't know which one it is — and take it out such that you can only see the upper side of the card. The upper side, for those in the front, is red. Now the question to you is: what's the probability that the other side is also red? There are already answers; clearly many of you already have one. I'll give you a minute to think of an answer — you can talk to your neighbour as well if you want to. Okay, it seems most of you look like you have an answer, some of you are already looking very bored, but some are still discussing, so I'll give you four possible answers to choose from. The first one is one half. The second one is two thirds. The third one is something completely different — "I have my beautiful theory, it's one over pi." And the fourth one is "I don't know yet, you were too quick, I was still discussing with my neighbour." Okay, who would like to answer number four, "I'm not finished yet"? Good — that's very useful, because it means everyone is done, everyone now has an opinion, and you have to pick one of the other three. So who is for number one, one half? That's, I would say, more than a third but less than half — about two fifths of the room. Who's for two thirds? That's a little bit more, maybe just over half. And who is for something else? No one — very good. The outcome of this is always the same: it's great to see that all of you had an opinion, you were all clear about one of the possible answers, and you disagreed with each other — not quite half and half, but nearly. The percentages have actually shifted over the years, which I think is a good sign. In fact the correct answer is two thirds. One way to think about this: there are three red sides here, and what I've revealed to you is that we've picked one of those three red sides — clearly, because it's here. In two out of these three cases the other side is red, and in one out of three the other side is white. Okay, that's the informal, playground answer to this question — completely informal, and we can also check it with a quick simulation; see the sketch below. Quick check: who has seen me do this before, or seen someone else do it? So some of you have either watched the videos, or you've been here for a while, or — I don't know. It's good to know, though, that it's not everyone yet.
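Since we'll be writing a lot of code in this course anyway, here is a minimal simulation of the card experiment — not something from the slides, just a sketch to check the two-thirds answer empirically; the function names and the number of draws are my own choices.

```python
import random

# The three cards, each as a pair of face colours: red/red, red/white, white/white.
CARDS = [("red", "red"), ("red", "white"), ("white", "white")]

def draw_once(rng):
    """Draw a random card and show a random side; return (shown side, hidden side)."""
    card = rng.choice(CARDS)
    shown = rng.randrange(2)          # which of the two sides ends up facing you
    return card[shown], card[1 - shown]

def other_side_red_given_red(n_draws=100_000, seed=0):
    rng = random.Random(seed)
    red_shown = red_hidden = 0
    for _ in range(n_draws):
        shown, hidden = draw_once(rng)
        if shown == "red":            # condition on the observation: the visible side is red
            red_shown += 1
            red_hidden += hidden == "red"
    return red_hidden / red_shown

print(other_side_red_given_red())     # ~0.667, i.e. two thirds, not one half
```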
So that's good. This is a variant of the famous Monty Hall problem — the game-show problem where the host opens one of the doors that doesn't contain the prize. I could have equally asked a much more boring question: I could have shown you a roulette board like this one and asked something of the type "given that the wheel has just been spun and I tell you the number that came up is red, what's the probability that the number is, let's say, even?" Then you could have done the work yourself. It's a little tedious, because you would have gone through all the red numbers, checked how many of them are even, and basically computed the probability of an even number — I've sketched exactly that enumeration in code below. What this is is an inference problem, and this lecture course used to be called Probabilistic Inference and Learning, before we started the machine learning master's programme. It's a problem in which we're trying to reason about the correct value of some unobserved, latent quantity — here the latent quantity is the other side of the card — given some observed, related quantity — in this case, the side of the card that we got to see. In computer science, in machine learning, and in many other disciplines we call the observed stuff the data, and the unobserved stuff the hypothesis, or the model, or latent quantities, or whatever. And as you could clearly see, when we work with these kinds of inference problems we encounter a form of uncertainty, because you're not totally certain what the other side is. But you do have some information about the latent quantity, provided by the data, by the observation, and the data makes you more certain about it. This kind of problem is absolutely universal to the human condition: pretty much everything we do all day is of this type. When you look out of your window in the morning to decide what clothes to put on, you're collecting observations and then inferring what the temperature might be outside, even though you haven't actually measured it. If a scientist does an experiment, they quite often can't measure the thing they really care about; they have to measure some related quantity. If you'd like to know the mass of the Sun, you can't measure it directly, but you have various other measurements that you can combine to reason about what the mass of the Sun might be. If a medical doctor makes a diagnosis for a patient, they usually don't know exactly what the ailment is; instead they collect symptoms and combine them mentally to infer what the diagnosis should be. So this problem of inference — and that's actually the entire point of today's lecture — is much, much broader than machine learning and AI and computer science.
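To make that concrete, here is a tiny enumeration of the roulette version — a sketch rather than lecture material; I'm assuming the standard red numbers of a European wheel and a uniform distribution over the 37 pockets.

```python
from fractions import Fraction

# Elementary events: the pockets of a European roulette wheel.
wheel = set(range(37))                      # 0, 1, ..., 36

# A derived event: the usual set of red numbers on a standard wheel (my assumption).
red = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
even = {n for n in wheel if n != 0 and n % 2 == 0}   # 0 counts as neither even nor odd here

# Uniform probability over the pockets; condition on "the number is red".
p_even_given_red = Fraction(len(even & red), len(red))
print(p_even_given_red)                     # 4/9 -- count the red pockets, check how many are even
```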
It's a fundamental part of being human and living in this world, but it's also a very fundamental problem that in some sense generalizes what we would like from a machine that automates reasoning. If you've studied computer science before doing your machine learning or computer science master's, then you've encountered theoretical computer science, where you learned about discrete-state machines, about Turing machines, and about propositional logic. These are machines that do the classic kind of reasoning we know from Boolean logic, or propositional logic, or even Aristotelian logic. They handle statements of the type "from A follows B". If someone tells you there is a rule that says from A follows B, that's something you can implement, typically at the very least on a Turing machine, and it's associated with a truth table like the one up there: if A then B, so if A is true then B has to be true and not-B has to be false. These are the kinds of statements you can write down as a rule, and this is what computers were used for, for roughly the first half century of their existence: automating mathematical reasoning. But the kind of problem we just looked at — inference problems — is actually more general than that. It requires us to give answers for the two parts of this truth table that aren't filled in. For example, we have relationships like "if it rains, then the street gets wet" — I've taken that example, by the way, from Stefan Harmeling, with whom I taught this course the very first time. What the statement says is: if it rains, then the street gets wet — if A, then B. But if you look outside and you see that the street is wet, what does that tell you about the weather ten minutes ago? We all do this all the time, right? You look and say "huh, it must have rained." But this conclusion, this deduction from the observation "the street is wet" to "it must have rained", is not actually filled in in the truth table, because it's of the form "if B, then A becomes more plausible". We can't do that in classic propositional logic, because there could have been some other explanation for the wet street: maybe there was someone right outside your window with a garden hose who made the street wet.
So now it looks as if it must have rained, but actually it hasn't: there are other possible explanations for the observation, and we take them into account, mentally, by saying "well, I'm not totally certain, but it seems quite plausible that it has rained." And we do this all the time — so much that we get almost annoyed when people question it and get too hung up on the rules, saying "but the theorem says exactly this, and the other direction doesn't work." So: propositional logic has these two ways of reasoning. If A, then B — that's called modus ponens. And if B is false, then A must be false — if the street is dry, it cannot have rained — that's called modus tollens. But what we would like to do is also put numbers into the grey part up there: if the street is wet — so if B — then it becomes more plausible that it has rained; and if it has not rained — so if not A — then it's quite unlikely that the street is wet. Again, maybe there's someone out there with a garden hose, but typically the street is going to be dry. Those are exactly the kinds of generalizations that probabilistic reasoning — what this entire lecture course is going to be about — allows us to make. They generalize the ability of machines to reason about statements that do not follow discrete or deterministic laws. Yes? — ah, the colouring: white is zero, black is one, so "if A, then not-B" has probability zero, and "if A, then B" has probability one; it's a hundred percent probable, so it's certain. Up there, there is just no statement: grey indicates that propositional logic does not allow you to say anything about those cells. If you do not know that A is true, then you basically don't know anything about B. That's also sometimes stated as ex falso quodlibet — from the false, everything follows. So what we would like to do is fill in those cells, and what we'll do over the course of just today is construct the theory of probability, which allows us to do exactly that; in your homework this week you'll show that it actually works. The main point of all of this is that this is actually the more important way to reason about the world, and that's why a lecture like this has to show up in the context of AI and machine learning: the point of AI and machine learning is to build machines that reason about the world like humans do, and therefore they have to be able to do this — the little numerical sketch below shows the flavour of it.
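To make the grey cells concrete, here is a tiny numerical sketch — the numbers are entirely made up for illustration, they're not from the lecture — showing how conditioning fills in "if B, then A becomes more plausible" and "if not A, then B is unlikely".

```python
# Made-up numbers, purely for illustration: prior probability of rain, and how
# likely the street is wet with and without rain (garden hoses exist, but are rarer).
p_rain = 0.3
p_wet_given_rain = 1.0        # "if it rains, the street gets wet" -- the deterministic rule
p_wet_given_no_rain = 0.1     # other explanations are possible but less likely

# Sum rule / law of total probability: overall probability of a wet street.
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

# Bayes' theorem: seeing a wet street makes rain more plausible, without making it certain.
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(f"P(rain) = {p_rain:.2f} -> P(rain | street wet) = {p_rain_given_wet:.2f}")   # 0.30 -> 0.81

# The other grey cell: if it has not rained, a wet street is unlikely (here 10%).
print(f"P(street wet | no rain) = {p_wet_given_no_rain:.2f}")
```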
They have to be a true generalization of the deterministic Propositional reasoning systems that we have in classic Let's say to ring machines although that's a little bit too much actually and here's a quote from There's lots of famous people showing up in this course James clerk Maxwell who said the actual science of logic So that's what computers should be about is Conversant at present only with things either certain impossible or entirely doubtful none of which actually matter in the real world Therefore the true logic for this world is the calculus of probabilities Which takes into account the magnitude of the probability which is or ought to be in a reasonable today We would say humans mind or person's mind so This is what we're going to do in this course we're going to construct a Formal mathematical framework that allows us to reason about quantities that are not perfectly determined This is evidently a very very general thing to do That goes beyond computer science and anything you might want machines to do It's really a general description of reasoning reasoning about the world But this is a computer science class So we'll actually be done with this pretty quickly after about three lectures And then we'll start thinking about how to represent this reasoning system on a computer And that will lead us to construct quite Elegant mathematical frameworks that actually have been around for a long time But have only recently made their way into computers properly we'll discover starting from Thursday that Unfortunately reasoning with uncertainty can be computationally a little bit more challenging than just reasoning in a deterministic or propositional fashion or deductive fashion if you like and That will require us to really think about how we do the computations correctly and to do that will often actually write code to think exactly about computational complexity of inference and We'll also notice that we often have to make approximations and simplifications because otherwise the computation would be intractable and Then of course I Am and I'm going to come back to this several times even today going to make Connections to contemporary machine learning and artificial intelligence And I'm going to try and be as close as possible to the state of the art So that you actually learn something meaningfully that the meaningful that you can take along into your professional career But all of this is going to revolve around this equation Which I guess everyone knows So does someone want to shout out what this is called? There's rule-based theorem. Yeah, so actually we'll call it base theorem I'll try and call it base theorem because there's also something called base rule that sometimes statisticians use although we're not going to use it at all and And you've all seen this before and that's why today's lecture runs a bit of a risk of Boring some of you because of course you've been through your undergraduate degree. You've had some stats course You've had some stochastic scores. 
You've had some basic undergraduate machine learning class, you've had whatever else — maybe some of you have studied cognitive science — you've all seen Bayes' theorem everywhere, and you've seen slides like this one. Say we've done a test for COVID, and this test has a true positive rate of 93.8%: if someone has the disease, the test is going to be positive 93.8% of the time. And if someone doesn't have corona, the test is going to be negative most of the time, 96% of the time — so it comes back (falsely) positive for 4% of healthy people. Then you've plugged those numbers in — maybe the last time you did this was in your data literacy exam — and done the kind of computation you've probably all seen before, which gives these kinds of interesting results. Maybe let's actually do it for a moment, for those of you I don't want to lose. The way to do this is to look at the previous slide, which contains Bayes' theorem. What the theorem says — I'm going to say it out loud a few more times today — is that the posterior probability, the probability of having the disease given the positive test, is given by the probability of having the disease a priori, before the test — that's the prior probability — multiplied by the likelihood for the disease. The likelihood is a conditional probability: the probability of getting a positive test if you have the disease. That's maybe the only important thing to remember about these words: the likelihood is a function of the right-hand side and a probability of the left-hand side. All of that is divided by the probability of a positive test, which is the probability of a positive test given that you have the disease, plus the other possible explanation. What Bayes' theorem does is relate the probability of the observation under one hypothesis to all possible explanations of the data: in the numerator of the fraction we have one possible explanation of the data — the person has the disease, and we've seen a positive test — and in the denominator we have all possible explanations: you either have it or you don't, and you still get a positive test. We plug in the numbers. There is one number we don't know yet, which is how many people actually have the disease — that's the prior probability — and then we get an expression that we can plot as a function of this probability of having the disease. For some reasonable numbers — during the height of the pandemic maybe something like 1% of the population had the disease — you get a posterior probability of something like 20%, and the thing everyone is always supposed to be excited about is that this is a surprisingly low number, given that the true positive rate is so high and the false positive rate is so low. Why? Because the prior is low. If it's unlikely to have the disease in the first place, then an observation — a positive test — raises the probability, but only by a certain amount. And if the other explanation, not having the disease, is extremely likely — in this case 99% — then even that 4% false positive rate is a decent explanation of the observation. That's often the kind of argument people make for Bayes' theorem: that it has this prior in there, and the prior is important because it lets you include information you already have a priori. The little sketch below just replays this computation for a few different priors.
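Here is that computation as a few lines of code — a sketch, assuming the rates quoted above (93.8% true positive rate, a 4% positive rate among the healthy); the function name and the list of priors are my own.

```python
def p_disease_given_positive(prior, tpr=0.938, fpr=0.04):
    """Bayes' theorem: posterior probability of disease after a positive test.

    prior: P(disease) before the test; tpr: P(positive | disease);
    fpr: P(positive | no disease). Rates as quoted in the lecture.
    """
    evidence = tpr * prior + fpr * (1.0 - prior)      # all explanations of a positive test
    return tpr * prior / evidence

for prior in (0.001, 0.01, 0.1, 0.5):
    print(f"P(disease) = {prior:5.3f}  ->  "
          f"P(disease | positive) = {p_disease_given_positive(prior):.3f}")
# At a 1% prior the posterior is only about 0.19 -- the 'surprisingly low' ~20% from the slide.
```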
Now let's think back to the problem with the cards — I think I have the time to do that. How would we do Bayesian inference here? Does someone want to shout out a formal description of how this works? There are three cards, and the question is which card it is: the white/white one, the white/red one, or the red/red one. What we observe is the colour of the top side of the card. So we could say we have a card C, which is either one, two, or three, and then we have an observation — call it R, because I can't use C for colour, so R because I observe red. First, we can write down a probability for these cards. What's that probability? One third. It's one third if C is one, one third if C is two, and one third if C is three. The prior doesn't matter at all in this example: it's the same probability for every card, because I just pushed them into the bag and shuffled it. So it's not always about the prior — in this case it's entirely about the likelihood. What's the probability of seeing red if the card is the red/red one — call that card number one, so one is red/red, two is red/white, three is white/white? One. What's the probability of seeing red if the card is the red/white one? One half, exactly. And what's the probability of seeing red if the card is the white/white one? At this point I'm boring you: it's zero. All right, so now let's do the computation again. Our probability for the red/red card, given that we've seen red, is the probability of the red/red card, which is one third, times the probability of seeing red if it's the red/red card, which is one, divided by all possible explanations of seeing red: that's one third for each card, times one, plus one half, plus zero. Cancel the thirds, and we get one over three halves — that's two thirds. Okay. So Bayes' theorem isn't always about the prior; it's quite often about the likelihood as well. In fact, in this entire course, the main reason we will talk about priors at all is that it's not always so easy: we'll often deal with spaces in which it's not possible to assign one to everything a priori, or one over the number of possibilities, because we'll deal with infinite-dimensional spaces where that simply doesn't work. That, though, is the data-literacy version of all this. This is supposed to be probabilistic machine learning, so we'll start properly, with a bit of math, to get everyone up to speed on how to do probabilistic reasoning. So: does anyone know who invented probability theory? "Kolmogorov"? Okay, yeah — well, maybe the actual inventor of probability theory is Pierre-Simon Laplace; I'll show him in a few slides. But Laplace, that's around 1812, and if anyone has seen a math book from 1812 — it's not really formal, right?
They just write page after page of text, and sometimes an equation. Maybe the first person to properly formalize probability theory is indeed Andrey Nikolaevich Kolmogorov, the Russian mathematician, who wrote a wonderful book in 1933, published in German by Springer Verlag — he wrote it in German, because it was 1933 — called the Grundbegriffe der Wahrscheinlichkeitsrechnung. We're going to go through his derivation first in German, then in English — apologies to everyone expecting an entirely English slide; there will be a slide with all the proper definitions in English in a moment. This is the actual text, from page two, and the reason I'm showing it to you is, first, because I want everyone to have the same idea of what probability theory actually is, even though many of you have seen these definitions before, probably in a stochastics class or in your math or machine learning classes; but also because Kolmogorov did something very important for the development of this theory: he showed that it's not some big philosophical spiel. Depending on your undergraduate education, you may have learned that Bayes' theorem is this wonderful thing that is somehow philosophically derived, based on the laws of common sense — a derivation that maybe goes back to Laplace and was later strengthened by people like Richard Cox in the US, who is often cited as one of the foundations of Bayesian reasoning. They derive these rules by saying: what we would like is something that conforms with our everyday experience of inference, like the example of probabilistic reasoning about the wet street I showed you before. What Kolmogorov did instead was show that probabilistic reasoning is the only acceptable thing to do if all you want to do is measure stuff correctly. He relates probability theory — Wahrscheinlichkeitsrechnung — to measuring things correctly, and if you manage to really follow this argument, then there is no philosophical debate at all about why this is the right thing to do. So let's try to do that. Here's the original text again, and then I'll go through it properly in math. This is 1933 — so even ninety years ago, less than a hundred years ago, these things were still being haggled out. Kolmogorov writes: we consider a set E of elements, denoted by lowercase Greek letters, which we call elementary events. These days this set is often called the universal set, but he has a more direct approach: in the example of the roulette board, the set is the wheel — the thing the croupier puts the ball in — and its elementary events are the numbers; they're on the next slide: 0, 1, 2, 3, 4, 5, and so on up to 36. Now, what we actually care about is a set of derived events: a set of subsets of the elementary set, which we'll call Fraktur F — the set of random events. I'll just call it that for a moment.
It has a much fancier name these days. The only thing we actually want from this set is that it makes sense to distribute truth across it, and the way Kolmogorov phrases that is to say, first of all, that F should be something he calls a Mengenkörper — and that's actually where all the magic lies. Thankfully Kolmogorov doesn't have to define it himself; he just cites, in a footnote, Hausdorff, who says a Mengenkörper is a system of sets such that the intersections, differences, and unions of its members all belong to the system as well. So if I have two such sets in this collection of subsets of the elementary set, then their union is in F, their intersection is in F, and their difference is in F. At least that's Hausdorff's formulation as cited in 1933, and that's actually the important bit. After that we just require one special thing: the elementary set itself has to be in this collection — the whole set E is part of our F. And now we're going to do something to it: we're going to assign probability to events in this space, and that assignment is done by a function, which we call P of some element of this set of sets, the probability of the event. The assignment has to be such that, first, it's a non-negative real number — it could be zero, but it has to be greater than or equal to zero. Second, the probability of the entire set has to be one: the probability of the whole roulette wheel is one, one of the numbers is going to come up. And then there is this special, actually important rule, called additivity, which says that if you have two events A and B — two elements of F — which are mutually disjoint, meaning they do not intersect, then the probability of their union is the sum of their probabilities. Note that the "plus" on the left-hand side is a set union — combine the two sets into one — while the plus on the right is ordinary addition of real numbers; these are two different operations. What this means, back on the roulette board, is that if you take two disjoint subsets — say the red numbers and the black numbers — and we assign a probability over this space of possible events F, then the probability of the union of red and black has to be the sum of the probabilities of red and of black. And I hope everyone agrees that this is an absolutely non-contentious statement — you can check it in a couple of lines of code below, if you like.
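A quick check of that additivity requirement on the roulette example — again just a sketch with a uniform measure over the pockets and the standard red numbers as my assumption; the helper name is mine.

```python
from fractions import Fraction

# Uniform probability on the 37 pockets of the wheel (the elementary events).
wheel = set(range(37))

def prob(event):
    """Probability of a derived event: add up the atoms it contains (additivity at work)."""
    return Fraction(len(event & wheel), len(wheel))

red = {1, 3, 5, 7, 9, 12, 14, 16, 18, 19, 21, 23, 25, 27, 30, 32, 34, 36}
black = wheel - red - {0}            # the remaining 18 numbers; 0 is green

# red and black are disjoint, so Kolmogorov's additivity axiom demands:
assert red & black == set()
assert prob(red | black) == prob(red) + prob(black)    # 36/37 = 18/37 + 18/37
print(prob(red), prob(black), prob(red | black))
```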
This is something you would really want from anything you call a probability; there's basically no way to question it. So what we're doing is: we have this roulette wheel, which picks one of those numbers — that's the set E — and now we can construct all sorts of statements about the world. Instead of just saying "it was the number 21", we want to be able to say "it's a red number", or "a black number", or "an even number", or "an odd number", or one of these weird dozens, or one of the other groups — that's basically how roulette works: someone comes up with all these other ways of describing the number, which have certain marginal probabilities, and then the players feel like they're doing something smart by spreading their bets across the board, which actually they aren't, because the probabilities of these derived sets are directly implied by the probabilities of the underlying numbers. And that's exactly what we want from any kind of reasoning system that distributes truth. A system with these properties Kolmogorov calls a Wahrscheinlichkeitsfeld; these days we call it a probability distribution. These days we also use English for this, and people have been careful to clean things up a little, so we're going to do the derivation again, but you will see that it's basically the exact same thing. Now you've had enough time looking at these slides not to be entirely shocked by the statement; here is the 2023 version of 1933. Let E be a space of elementary events — that's the roulette wheel. We consider the power set, the set of all possible subsets of E, which I'm going to write as 2 to the E, because I don't want to use a curly P — that already looks like a probability. And we consider a collection of subsets of E — a subset of the power set. Such a collection F, whose elements we call random events, needs to have the following properties to be the kind of thing we want to consider, and that kind of thing is going to be called a sigma-algebra. Here is the reason I haven't shown you this slide first: I don't know about you, but when I first heard of a sigma-algebra my brain just shut off — what? That must be something very mathematical and complicated. So we first did it the nice 1933 way, and now you'll realize that this fancy word, sigma-algebra, is really just a word for the thing we just talked about.
So: a subset of the power set is called a sigma-algebra if it contains the set of elementary events, which is sometimes called the universal set; if it has the property that whenever a set is in F, its complement is also in F — you may remember that on the last slide Hausdorff actually talks about differences between two sets, and this is a somewhat weaker form of that, but it suffices, because you can construct the other statement from it; and, thirdly, if any sequence of sets A_i is in F — they don't even have to be pairwise disjoint for this axiom — then their countable union is also in F. Again, Hausdorff asks for the union and the intersection, but we don't actually need the intersection, because you've all had propositional logic and you know De Morgan's laws, which can be used to show that all intersections of such sets then also have to be in F — and therefore the empty set has to be in F as well. Why? Because if A is in F, its complement is in F, and the intersection of a set with its complement is the empty set, so that has to be in F. Okay, good. This is called a sigma-algebra, the elements of such an F are called measurable sets, or Borel sets, and the space is then called a measurable space, or a Borel space — fancy words for very, very simple stuff. So if you ever hear someone talk about sigma-algebras or Borel spaces or measurable spaces, now you can be like: yeah, whatever. The space is the combination of these two things: it's a construction on E, if you like — you take E and construct something on it, the sigma-algebra, built from subsets of E, and the combined thing, E together with the constructed sigma-algebra, is what we call a measurable space. So far what we've done is define measurable spaces; we have not defined probability distributions — that's what comes next, with Kolmogorov's remaining axioms. Before that, here's a quick brute-force version of the sigma-algebra conditions in code, just to make them concrete.
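A brute-force rendering of those three conditions for a finite universal set — a sketch, not anything from the slides; on a finite set, closure under countable unions reduces to closure under pairwise unions, and the example collections are my own.

```python
from itertools import combinations

def is_sigma_algebra(E, F):
    """Brute-force check of the sigma-algebra conditions for a finite universal set E.

    F is a collection of subsets of E. On a finite set, closure under countable
    unions reduces to closure under pairwise unions.
    """
    E, F = frozenset(E), {frozenset(A) for A in F}
    if E not in F:                                            # must contain the universal set
        return False
    if any(E - A not in F for A in F):                        # closed under complement
        return False
    if any(A | B not in F for A, B in combinations(F, 2)):    # closed under (finite) union
        return False
    return True

E = {1, 2, 3, 4}
trivial = [set(), E]
coarse  = [set(), {1, 2}, {3, 4}, E]             # generated by the 'event' {1, 2}
broken  = [set(), {1}, E]                        # missing the complement {2, 3, 4}

print(is_sigma_algebra(E, trivial), is_sigma_algebra(E, coarse), is_sigma_algebra(E, broken))
# True True False -- and the empty set and all intersections come along for free
```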
So, let's say we have such a measurable space — a combination of a universal set and a sigma-algebra; fancy words: a Borel space. Then consider a non-negative real function P — the first of Kolmogorov's probability axioms is already hidden in that simple little phrase: we're thinking of a function that maps into the non-negative reals — with the following properties. First, the empty set has probability zero. Second, for any countable sequence of pairwise disjoint sets — sets that do not overlap with each other — we have a very important property called sigma-additivity, the one I already pointed out on Kolmogorov's page: the probability of the union of this countable sequence equals the sum of the probabilities of the individual sets. This is a fancy equation, and what it says is simply that we want this rule of probability to work meaningfully on this space, so that you don't accidentally lose probability, or inject it without noticing, when you combine sets. If you take the roulette wheel, let the ball roll around it, and then construct the kinds of events you want to talk about — red and black and green, even and odd, and so on — you want to make sure that by talking about those derived events we are neither inventing probability nor losing it. A map from the sigma-algebra into the non-negative real numbers which assigns probability zero to the empty set and has sigma-additivity is called a measure, and it's called a probability measure if it has the additional property that the universal set has probability one. What that means is: something on the roulette wheel is going to happen, and nothing else — there is no probability left over for "the ball flies off the wheel" — if you have this additional property, which makes it a probability measure; and the space together with such a measure is called a probability space. In fact, we don't even need the first condition separately — the proof is down here: if there is any set A in the measurable space that is assigned a finite probability, then by set theory A is the disjoint union of A and the empty set, so sigma-additivity gives P(A) = P(A) + P(∅), and therefore the probability of the empty set has to be zero. So we could leave that axiom out once we've plugged in the assumption that P(E) = 1 — and indeed that's what Kolmogorov does. That's just page two; and now the guy kicks into action — this is what mathematicians do: you write down five axioms and then you can show some cool things. The first thing he shows is what's called the sum rule, which works as follows. The entire set E is, by definition, made up of A and its complement, for any set A. That means we can write the probability of the entire set, by sigma-additivity, as P(A) + P(¬A). But by the definition of a probability measure P(E) = 1, so we can write P(A) = 1 − P(¬A). Now, using set theory again: A is equal to A intersected with the entire set E, which, for any set B, can also be written as the union of B and ¬B. We write the probability of the intersection of A and B — which in probability theory we write as P(A, B), and I'm going to use this notation all over the place — as the joint probability, a fancy word for the probability of the intersection, of A and B. Plugging in those results, we see that P(A) can be written as P(A, B) plus P(A, ¬B) — P(A ∩ B) plus P(A ∩ ¬B). This sounds like a simple thing, but it's maybe the first and most important rule of probability theory. Why? Because it provides a mechanism for getting rid of a variable.
If you've written a program and you want to get rid of one of the variables in it, then what you need to do, to compute the probability of just one variable, is to sum over all the other variables. Without telling you too much about Thursday already: if you think of this as an array over the possible values of A and the possible values of B, then what this equation says is that if you want the probability of A, you call .sum over the axis you don't want to keep. That's also why this probability is often called the marginal probability: if you had the array on a piece of paper, you could sum over the rows and write the result down in the margin of the page. Now, on page six of the book, Kolmogorov does something that is a little bit shady, and that's why some philosophers don't like it: he just defines something. He says, we're going to define this thing, which we call the conditional probability — we call it that because that's what it is, but we just give it a name; it's not derived in any philosophical way whatsoever. (There are other derivations of probability theory, for example by Cox, that go through a long spiel of explaining why this is the right way to think of a conditional probability.) He simply says: P(B | A) is the probability of the intersection of A and B, divided by the probability of A. That's just what it is. Of course, for this to work we have to assume that P(A) is larger than zero, so in particular A has to be a non-empty set. From this definition we immediately see that we can write the joint P(A, B) as P(B | A) P(A), or as P(A | B) P(B) — you've all seen these rules before. The only maybe interesting thing to show about this defined object is that it itself defines a probability distribution — that's why we call it the conditional probability. I'm not going to do that here; you can do it for yourself if you want to. And we can use this definition to derive something Kolmogorov calls the law of total probability, which is a generalization of the sum rule, and the way we usually think of it. So far the sum rule just says P(A) = P(A, B) + P(A, ¬B). But what if ¬B can be split up into lots of other parts? Well, then this is still true: if A_1 to A_n is a sequence of disjoint sets that together cover E, then we can write the probability of any event X in the sigma-algebra as the sum over the joint probabilities of X and the A_i — by sigma-additivity — and we can use the definition of the conditional distribution to write those joints as conditionals times marginals. That's where the denominator in what comes next comes from, because of course that's the big reveal: using those two statements, we can write down Bayes' theorem. There it is. And because the entire construction so far is hopefully unambiguous, we have to agree that this is the one way to reason about quantities that do not have a binary truth value, but across which we distribute truth over several elements of a collection of events. The little array sketch below puts the sum rule, the product rule, and Bayes' theorem next to each other on the card example.
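Here is that "array view" for the card example from the beginning — a sketch in NumPy rather than anything from the slides — with the sum rule as .sum over an axis, the product rule building the joint, and Bayes' theorem as conditioning on the observed column.

```python
import numpy as np

# Joint distribution p(C, S) for the card experiment: rows are the card C
# (red/red, red/white, white/white), columns are the side you get to see S (red, white).
prior = np.array([1/3, 1/3, 1/3])                      # p(C)
likelihood = np.array([[1.0, 0.0],                      # p(S | C = red/red)
                       [0.5, 0.5],                      # p(S | C = red/white)
                       [0.0, 1.0]])                     # p(S | C = white/white)
joint = prior[:, None] * likelihood                     # product rule: p(C, S) = p(S | C) p(C)

# Sum rule / marginalisation: sum over the axis you do not want to keep.
p_S = joint.sum(axis=0)                                 # p(S), the number "on the margin of the page"
p_C = joint.sum(axis=1)                                 # recovers the prior, as it should

# Bayes' theorem: condition on the observation S = red (column 0).
posterior = joint[:, 0] / p_S[0]                        # p(C | S = red)
print(posterior)                                        # [2/3, 1/3, 0] -- the answer from the experiment
```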
That's it — that's the theory of probability. It consists of, first of all, these axioms, which are really uncontroversial: all they say is that we take truth, use the number one to stand for it, and instead of saying there is exactly one event that is true and everything else is false, we allow this truth to be distributed across a set E of possible events. Then we just have to make sure that when we talk about subsets of that set, we construct the corresponding probabilities correctly, in a measurable fashion. Once we do that, the sum rule has to hold, the product rule has to hold, and therefore, of course, Bayes' theorem has to hold. So that's the first-year-master's-level description of Bayes' theorem, which you've now seen many, many times over. The only thing you might still wonder about is: why is there a one in there? Has anyone wondered about that, or has everyone just accepted that one is the correct way to represent truth? There are really two different questions you could ask. One is: why one, and not some other number? Of course we could choose some other number and call that truth; one is just particularly convenient. In fact, one alternative is not to talk about probabilities between zero and one at all, but between minus infinity and plus infinity, by taking the range from zero to one and transforming it through a function — a logit — that maps it onto the whole real line. That's actually what we do a lot in machine learning. Why? Because it uses the entire floating-point range, which is just really convenient on a modern computer. But other than that it doesn't really matter: it just means that minus infinity stands for false, plus infinity stands for true, and zero means one half. There's a tiny illustration of this transformation below.
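A small illustration of that remark — purely for intuition, not lecture material: the logit maps probabilities to log-odds on the whole real line, and the sigmoid maps them back.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line as log-odds."""
    return math.log(p) - math.log(1.0 - p)

def sigmoid(x):
    """Inverse map: log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

for p in (0.001, 0.5, 0.999):
    x = logit(p)
    print(f"p = {p:5.3f}  ->  log-odds = {x:+8.3f}  ->  back to p = {sigmoid(x):.3f}")
# 'False' gets pushed towards -inf, 'true' towards +inf, and one half sits at zero --
# convenient for floating-point arithmetic, but it is the same probability underneath.
```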
Okay, fine. The other question you could ask is: why does truth have to sum up to just one number at all? That seems so constrained, and it's actually where most of the problems with probability come from. As I said before, you have to say what E is, and then assign one to it — and it will turn out that this is one of the major problems with the entire theory: quite often something happens that you didn't expect, some event of prior probability zero actually occurs. You know, the croupier flicks the ball, it flies out of the roulette wheel and lands outside — what happens then, what happens to the bets on the table? So maybe you'd like to be able to say, mathematically, that there is more than one truth. But we're not allowed to do that, because then reasoning breaks down. Here is my argument for why this is not a good thing to do. Let's assume, for the sake of argument, that we had decided there are actually two truths: we just set P(E) = 2, but by "one" we still mean "true". Then we could, for example, split our E into two parts, A and not-A, and say, again for the sake of argument, that both of them get assigned probability one — so they're both true at the same time, even though they're mutually exclusive. You see where this is headed: if two mutually exclusive things are both true, something stupid is going to come out, the kind of game you've seen in propositional logic a lot. So let's do it. Consider the probability of "A and not-A", which by definition is the probability of the intersection of A and not-A. That intersection is the empty set, and the empty set has probability zero, so "A and not-A" has to be false. But our assignment has given probability one to A and to not-A, so A is true and not-A is also true — and by the rules of propositional logic, if A is true and not-A is true, then "A and not-A" is also true. So the same statement is both false and true, and that's not good: we don't want statements that are false and true at the same time. Therefore we have to assign probability one, and we have to write down what E is before we start reasoning. And — to be honest, spoiler alert for the entire lecture course — we're going to use reasoning systems where that's not quite true: we're going to talk about real problems in the real world that contain events with probability zero under the prior, we're going to do it all the time, and we'll have to come back to what happens when we do. Okay, so that was the big deal for today: we've played around with this theorem that you've already seen all over the place many, many times. And there were people in history who didn't want you to learn about it — famous people like Ronald Fisher, who would really not have cared for a lecture like this. He would have been happy if this equation had been erased from human memory, because he thought it was wrong, and he derided it by attaching to it the name of a — from his perspective naive — Nonconformist minister, a Protestant clergyman, which was something of a slight in Anglican society: Thomas Bayes. "This is Bayesian stuff, this is bad" — that's what Fisher was arguing. But it's just measure theory: it's just assigning a finite amount of probability to a set of possible events and then making sure that you measure them correctly. Now, in your homework, you're going to see that when you apply these fundamental rules, you actually get a reasoning system that conforms to the properties you would want from something more general — one that allows you to make statements like "the street is wet, therefore it now seems more plausible that it has rained", or "it has not rained, therefore it seems more plausible that the street is dry", without being certain. That's the mathematical content. There are a few more minutes; of course I need to talk about admin, and I also want to put into context a little of what's to come for the rest of the semester. I'm hoping the lecture hall will stay as full as it is right now — even though we're going to be in a different lecture hall, by the way: just as a reminder, on Thursdays we're in a different hall, so don't come here on Thursday, we're over there. One first question you might have is how this course relates to the other lecture in the machine learning master's taught by Professor Hein, the statistical machine learning class. There actually used to be two separate classes, one called Statistical Learning Theory and one called Probabilistic Inference and Learning, and they were a bit of an uneasy pair.
They were just both there and didn't really talk to each other. Now, with the machine learning master's, we've unified things a bit and made sure there are two lectures, Probabilistic Machine Learning and Statistical Machine Learning, and this term the other one is taught by my colleague Professor Hein. So first of all, what are the connections between the two? I very, very strongly recommend that you take both. I know that's challenging — for the machine learning master's this is maybe the main challenge of the entire degree, taking both of these courses at the same time — but they are deliberately placed next to each other, to expose you to these two different sets of ideas simultaneously. Historically, these two views have formed the foundation of AI and machine learning: on one side there is statistical learning theory, and on the other side there is probabilistic inference and learning, or Bayesian machine learning if you like. They are different in many ways, and I'm sure Professor Hein is going to give you a different perspective — that's exactly the point. The two of us are not coordinating what we teach in detail; at some point I saw a list of the things he covers, and he has heard a few of the things I'm going to teach, but it's not as if every lecture is neatly aligned with the other — that would be nearly impossible. There will, however, be a few points where I make very clear connections to the other lecture, in particular about halfway through, in the section on Gaussian processes, where I'll talk about kernel machines and how the two types of objects relate to each other, because the connection is extremely close. There was a phase in machine learning — when I entered the field, a long time ago, in the early 2000s — when there were still these two camps, the Bayesians and what some people also divisively called the frequentists, the statistical learning theory people, who were constantly at each other's throats, bickering about who had the right theory, with one side demanding theorems from the other that the other couldn't provide. All of this has sort of died down because of deep learning, which is difficult to explain from either side.
So we've all sort of started to like each other again — in 2012 we even co-organized a Dagstuhl workshop together, to soothe each other's wounds. And from this, a genuinely new perspective on deep learning has emerged, which is really exciting, and we'll talk about it in both lectures. But to give you one high-level idea — my high-level idea — of how the two relate to each other: one way to phrase it quite succinctly is to say that statistical learning theory is Bayesian reasoning where you don't say out loud what the prior is. You just say that there is a space of events, and that there will be some prior, but you don't talk about what it is, and then you see how far you get and what kind of statements you can make. In some sense that's more general, because it allows us to talk about things without making hard decisions; but in some sense it's also much more restrictive, because it doesn't allow us to make concrete statements. A little more formally: in statistical learning theory you'll typically encounter models that are defined by some loss function on some model class, and that loss function will have some relatively abstract properties. It's often constructed such that the algorithm works well — it might be convex, for example, to make sure that you get a fast algorithm that converges quickly and doesn't require much computation. And that is, in some sense, totally ad hoc: you write down this thing that is convenient for your computer to work on, and then you run it. Because you've come up with it in this ad hoc way, you then have to spend a lot of time analyzing it to argue why it's a good thing: if you say "I'm going to minimize this convex loss function", you have to argue that this loss function is somehow very good — that its minimum has good properties, that you converge to it fast, and that once you're there the estimate is in some sense close to the truth, under some assumptions. Those assumptions are typically strongly connected to the ones a probabilistic formulation would also make, just expressed in different language. An advantage of this approach is that sometimes those assumptions can be weaker; the downside is that you often have to make asymptotic statements — you can only say "in the limit of very many data points, such-and-such" — and the analysis will typically be of a worst-case type: assuming the true function has certain properties (we need to constrain it somehow), our estimate is going to be at worst this far away after n points, or asymptotically. Conversely, in the probabilistic framework we write down what's called a generative model: a joint probability distribution, p(data, thing we care about) — a prior times a likelihood. And once we've done that, we're done, because the rest is just Bayes' theorem: we apply Bayes' theorem and it tells us what the correct answer is. So we typically start by writing down a generative model, and then we question it: are these good prior assumptions, what does this prior look like, does it actually describe what we believe about the real world?
And if, yes, we really believe in this prior, then we are duty-bound by Bayes' theorem to compute the posterior. What we will then typically encounter is that computing this posterior is difficult, because it requires complicated computations. So we'll approximate it somehow, and by doing that we actually deviate from the true formalism, but we try to stay close to it. The good thing is that if you believe in the prior, you don't have to question the algorithm anymore — you never have to go back and say "oh, I've changed something slightly about my prior, which in some sense means I've changed the loss function; do I now have to redo all the proofs, do I have to show everything again?" No, because it's still just Bayes' theorem. Another great thing is that we can actually look at the posterior, and it quantifies uncertainty for us automatically, out of the box. We don't need worst-case error estimates; we can just say: here's the measure, that's what we know, posterior, done. So we gain something by being very explicit about assumptions and relying on the mechanisms of Bayesian inference — of probabilistic reasoning — to compute the object of interest, and we typically pay for it with algorithmic cost. That's why this course will contain a lot of algorithms and code examples, and a lot of thinking about how expensive something actually is and how to approximate a complicated computation. So now, having sat here, you're looking at your watch and thinking: hey, this is all fine, but why do I have to take this course in 2023, in a machine learning master's, when there is ChatGPT out there, and Stable Diffusion, and everyone talks about everything being automated? Why do I still need to study this? Isn't this like 2005 — has nothing ever changed? Well, the next two slides are hopefully my way of convincing you — and we'll come back to this many times — that this course is as important as it ever was, and that it's as close as it can possibly be to these new models you've heard about. Everyone in the machine learning world at this point is totally stressed out, in various states of either excitement or doom, about the developments of the field on the product side, and the field is moving so fast that, as people teaching in a university setting, we constantly have to ask ourselves whether what we're teaching is still relevant. So what I've done is change the content of this course quite a bit from the last time I taught it, two years ago — which means a couple of things.
There will be stuff that is hopefully very close to the state of the art — although I ask you to bear with me while we derive and construct the proper foundations for a while, because they are going to be very important. It also means that, especially towards the second half of the course, things will be a little seat-of-the-pants, so I ask for your forgiveness if occasionally something was only finished five minutes ago. Here's a very rough outline of what I want to do in this course. We're going to spend the next two to three lectures — actually three lectures — thinking more about the foundations of probabilistic inference. We'll find out in which sense it is actually computationally harder to reason about entire measures than about one point estimate, as in statistical learning theory. We'll encounter interesting algebraic properties of these distributions that are useful for making things tractable again, of various types — some more algebraic and structured, some more analytic, using derivatives and geometric descriptions of functions. And then we will meet a very important class of distributions, which you all know already: Gaussian distributions, which you will get to know much more closely than you might want, because they are going to be our tool for almost the entire course; they are going to be our way of thinking about probabilities. We'll discover that they are intricately linked to linear algebra and to differentiation — and you know that machine learning lives on differentiation these days. So I'm going to argue quite strongly that Gaussian distributions are the fundamental object to think about, even the fundamental object for thinking about deep neural networks, and I really want to make that case in lots of detail. We will spend a lot of time — almost half of the course — on different aspects of Gaussian models, and we'll try to connect that to the deep learning literature in various ways that we'll get to when we get there. Then, towards the end of the class, depending on how it goes, I might take several detours and point out some interesting other relationships that aren't directly connected to deep learning — how much time we have for that depends a bit on how we move through the course over the next few weeks, and we'll see what's interesting then. In the course of this process we will, first of all, realize that the Bayesian perspective on machine learning provides very interesting functionality.
This posterior isn't just an error bar on your estimate. It's actually an algebraic object that allows you to do really interesting things with a model: to probe it, to query it, to understand it, to build better algorithms on top of it, and to explain to other people what the model does. We'll also find that the tools you already know from deep learning, like automatic differentiation, like array-based thinking and array-based programming, and linear algebra, are as important, if not more important, in probabilistic reasoning. So if you think that deep learning is about autodiff, then actually you should be even more interested in this course, because we'll talk about gradients and Jacobians all the time, everywhere. And in fact linear algebra will arise as something even more important, because it's not just about computing a gradient, but about what you then do with it to get a meaningful statement. We'll write lots of code, which is also kind of a new development in this class, and I made a real effort to write proper JAX code that looks decent. You'll get to see some of it starting from next week, with lots of visualizations and examples; today was actually maybe one of the classes with fewer examples. So I hope that there's going to be lots of interesting stuff for everyone in the room to see. I know that there are always people who want to see theorems, some who want to see visualizations, some who want to see examples, and some who really want to see the code, and we'll try to do all of these. There will be times where we spend a really long time thinking about some code and why something is implemented in a certain way, and other times we'll just look at some pretty pictures.
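To give a hedged taste of what "gradients and Jacobians everywhere" looks like in practice, here is a tiny sketch; the functions and numbers are arbitrary examples of my own, not course material.

# Minimal autodiff sketch (illustrative only): gradients and Jacobians in JAX.
import jax
import jax.numpy as jnp

Sigma = jnp.array([[1.0, 0.5], [0.5, 2.0]])

def log_density(x):
    # An unnormalized Gaussian log-density: a typical scalar object we differentiate.
    return -0.5 * x @ jnp.linalg.solve(Sigma, x)

def forward(x):
    # Some vector-valued map, e.g. the output of a tiny model.
    return jnp.tanh(jnp.array([[2.0, -1.0], [0.5, 0.3]]) @ x)

x = jnp.array([0.3, -0.7])
grad_logp = jax.grad(log_density)(x)   # gradient of a scalar function
J = jax.jacfwd(forward)(x)             # Jacobian of a vector-valued function

print(grad_logp)  # shape (2,)
print(J)          # shape (2, 2)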
The final argument that I want to make is that this is much broader than machine learning and AI, and that's why I started the lecture today the way I've been doing for the past ten years or so: because probabilistic inference is the generalization of propositional logic to statements that contain incomplete information, to inference problems. And it has been important over the course of the entire history of quantitative science, starting with the absolutely amazing book by Carl Friedrich Gauss from 1809, the theory of the motion of celestial bodies moving in conic sections, in which he derives the Gaussian distribution and the least-squares estimator; to Laplace's Théorie analytique des probabilités, which is actually the text that first writes down Bayes' theorem, because Bayes didn't actually write down his theorem, he just wrote down a prior and a likelihood and forgot to normalize; to all its applications and derived statements in the natural sciences. Thermodynamics is pretty much just probability theory: the theorems of Maxwell and Boltzmann and Gibbs really amount to writing down probability measures and then operating on them. Kolmogorov formalized it all properly in 1933, and a lot of other scientists followed him in using it. So, for example, quantum electrodynamics, or actually quantum field theory in general, the work that is often attributed to Feynman and people like Dirac, is all about manipulating probability distributions on some awkward spaces, where you have to be careful because things are infinite-dimensional and dynamic and moving through continuous time, and everything gets really complicated, but in the end it's just probability theory. If you've heard of a path integral: a path integral is a probabilistic model. And if you've seen the Feynman diagrams that some people have tattooed on their forearms: that's probability theory. All the way to computer science: if you've used a compression algorithm, if you've used your phone today, you've used a probabilistic algorithm, probably a Gallager code or a turbo code, so a linear probabilistic model to encode and decode signals over noisy channels. That's just probability theory. And much of contemporary machine learning was influenced very deeply by people like, of course I have to cite him, David MacKay, because he was my PhD advisor, but also many, many other people who describe themselves as Bayesians, myself included. It is still a central way of thinking, a fundamental paradigm, for machine learning to this day, despite large language models and no-code and whatnot. So if you really want to be an expert in machine learning and AI, and want to build these models rather than just, you know, become the next kind of web developer with ChatGPT, then you need to understand these theorems. It has been useful to know about this way of thinking for the last 200 years since Gauss, so it's quite unlikely that it will suddenly stop being useful within the next two years. In fact, I think the material we're going to see, at least on a high level, is content that will be useful for your entire career, despite all the craziness currently happening in the field. Here's the final slide. Those of you who've ever had a lecture by me before know that I always do this at the end: here's a QR code through which you can leave feedback for this lecture. Please do so, because this is my way of figuring out whether the course works or not; by the time the evaluation run by the department rolls around, it's way too late. So I want to know what you think of every individual lecture, and we will talk about this feedback at the beginning of every lecture. Please hold your phone up and look at the QR code. You can also find the feedback sheet on ILIAS without the QR code, but this is the easier way to get to it. So, here's a summary of what we did today. If you want to reason in a world that does not follow deductive, deterministic propositional logic, if, as with this card, you cannot see everything, then you have to reason under uncertainty. And there is one mathematically correct way to reason about latent quantities given observed quantities, and that is by assigning a finite amount of truth, summing to one, across the space of all possible events
you want to consider, and then making sure you construct derived statements from those elements in a measurable fashion, so basically by correctly adding things up when you construct them. That leads to the two rules of probability, called the sum rule and the product rule, which I'll write out once more below for reference; together they make up the fundamental theorem of inference called Bayes' theorem, even though Thomas Bayes didn't actually write it down, Laplace did. This provides an extremely powerful framework for inference that goes beyond computer science, but in this course we'll talk about how to realize it in computer science. That means, first of all, that we realize you can do pretty powerful things if you use a modern computer. We don't have to do what statisticians have been doing for the last hundred years, which is to work with pen and paper, where you have to make sure that you can solve integrals, which is really complicated in the Bayesian framework. Instead we can use computers, with the modern power of, in particular, the machine learning stack, automatic differentiation and linear algebra, to construct very powerful tools, which in fact provide a theoretical foundation for contemporary AI and machine learning. And how exactly that looks, I'm going to tell you over the course of this entire term. Thank you very much.
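For reference, here are the two rules and Bayes' theorem from the summary above, written in standard textbook notation (this is the usual notation, not copied from the slides), with h a latent hypothesis and d the observed data:

\[ P(x) = \sum_{y} P(x, y) \qquad \text{(sum rule: marginalize over what you don't care about)} \]
\[ P(x, y) = P(x \mid y)\, P(y) \qquad \text{(product rule: factorize joints into conditionals)} \]
\[ P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)} = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')} \qquad \text{(Bayes' theorem: posterior from likelihood, prior, and evidence)} \]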