Hello everyone, welcome to the very first online lecture of Probabilistic Machine Learning at the University of Tübingen. I'm not going to lose any time with introductions; instead I'll start the course right away with an experiment, an experiment that I've inherited, and by extension you and I are inheriting, from the late, great David MacKay.

What I have here is a bag and three cards: one card is red on either side, that's the first card; a second is white on either side; and the third is red on one side and white on the other. What I'm going to do now is put these cards into this nice little bag, shake it around a little so we no longer know which card is where, then randomly reach in, pull out one card, and put it down on my notebook. Now you can see that the visible side of the card is red, and the question for you is: given that you see this red side, what is the probability that the other side of this card, which you currently can't see, is also red?

Normally I would tell you to talk a little with your neighbor and discuss this, but because this is a video recording, you can just stop the recording for a moment, and once you've come to a conclusion, once you have an idea, you can switch it on again. Now that you've done so, I can provide a few possible answers. Maybe your answer is that the probability for the other side of this card to also be red is one half: after all, what's it going to be, either white or red. Maybe your answer is that the probability for the other side of the card to also be red is two thirds. Or maybe it's something completely different, because you have your own theory.
Now, a non-formal, non-mathematical way to answer this question is that the correct answer is actually two thirds, and that truly is the correct answer. Why? Because there are three red card sides involved in this problem, and since you don't know the orientation of the cards, I've effectively picked one of those red sides at random; of these three red sides, two have red on the other side and only one has white on the other side. In fact, if I actually turn this card around, you can see that it's white, but of course that doesn't invalidate the point.

The question I've just asked you is an extremely elementary form, maybe the most reduced possible case, of something that we call an inference problem. An inference problem is a question for which, given the information at hand, no answer can be given with certainty; it therefore requires an answer with uncertainty. This particular inference question might have sounded a little contrived, but it was deliberately constructed to be as transparent as possible: all the rules are available, there is no trickery involved, I did not play around with the cards inside the bag without you noticing, and yet there is still a notion of uncertainty in the end, even though you were aware of all of the rules. Questions of this kind, and they don't just apply to cards and bags, are actually the typical, most prominent questions that we face in our daily lives.
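If you want to convince yourself of the two-thirds answer, here is a quick check, both by exact enumeration of the six equally likely card sides and by brute-force simulation. This is a minimal sketch of my own, not part of the lecture; the function names are invented for illustration:

```python
import random
from fractions import Fraction

# The three cards, each as a (front, back) pair of side colours.
CARDS = [("red", "red"), ("white", "white"), ("red", "white")]

def exact():
    """Enumerate the six equally likely (visible, hidden) side pairs."""
    outcomes = [(v, h) for card in CARDS for v, h in (card, card[::-1])]
    red_up = [o for o in outcomes if o[0] == "red"]    # 3 outcomes show red
    both_red = [o for o in red_up if o[1] == "red"]    # 2 of them hide red
    return Fraction(len(both_red), len(red_up))

def simulate(trials=200_000, seed=0):
    """Monte Carlo estimate of P(hidden side red | visible side red)."""
    rng = random.Random(seed)
    seen_red = hidden_red = 0
    for _ in range(trials):
        card = rng.choice(CARDS)            # draw a card uniformly
        if rng.random() < 0.5:              # random orientation
            visible, hidden = card
        else:
            hidden, visible = card
        if visible == "red":
            seen_red += 1
            hidden_red += hidden == "red"
    return hidden_red / seen_red

print(exact())      # prints 2/3
print(simulate())   # close to 2/3, not 1/2
```

The enumeration makes the informal argument precise: conditioning on "the visible side is red" leaves three equally likely sides, two of which belong to the all-red card.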
When you go outside, if you're allowed to, and have a look at the sky, you can typically predict with relatively high certainty what kind of weather you'll be facing for the next few minutes or hours. That is a simple inference problem we face in daily life: it is clearly impossible to predict with perfect certainty what the weather is going to be, but it is possible to do so with a certain degree of imprecision.

That's a simple example, but many particularly hard tasks that humans deal with professionally are also associated with this notion of uncertainty, and often they are exactly the jobs we associate with particularly high human intelligence. When a judge, over the course of court proceedings, tries to decide whether a defendant is guilty or not, she is usually never able to take this decision with perfect certainty, because of course she wasn't at the scene of the crime. She can only collect evidence after the fact, which over time may accumulate to remove so much uncertainty from the process that she is able to decide with high certainty, with low uncertainty, whether the defendant is guilty; and only then, by law, is she allowed to actually pass judgment. When scientists try to decipher the way the world works, it is almost always impossible to build one experiment that perfectly captures all the information necessary to unravel a rule of nature, to perfectly understand a particular process; instead, scientists devise experiments that collect individual pieces of evidence which, taken together, form an increasingly certain picture of the world. And when a doctor, maybe the best example in these trying times, is faced with a patient who exhibits certain symptoms, no medical doctor is able to say with certainty, with the logical kind of certainty we are used to from computers, whether the patient has a certain disease or not and how they are going to respond to treatment. Instead, medics have access to various diagnostic tools, from just looking at the patient up to highly modern precision-medicine techniques, to collect evidence about what kind of ailment the patient has, and also to predict what the reaction to a particular treatment is going to be.

Now clearly, this ability to reason under uncertainty is one we would like to have for ourselves, and one we may want our computers, our machines, to replicate as well. However, classically a computer does not, at least not out of the box, allow this kind of uncertain reasoning. The formalism you learn in an undergraduate computer science or computational logic course is propositional logic, which allows us to map true statements into other true statements, the important point being that there are only two binary truth values, false and true. For example, imagine we have two variables, A and B: variable A stands for "it is raining outside" and variable B stands for "the street is wet". Classic formal logic allows us to devise statements like A ⇒ B, spelled out as "from A follows B" and represented by a truth table which says that if A is true, then B also has to be true: if it rains, then the street is wet. This process of reasoning, from A true to B true, is classically called modus ponens. The logic also allows, in a very specific sense, the inverse kind of reasoning, classically called modus tollens: if B is false, then A also has to be false. So if you look outside and notice that the street is dry, this implies that it cannot have rained recently. However, propositional logic does not allow us to reason about the other two possible combinations of true and false, even though these are conclusions you might well draw with your own human brain.

If you look outside and see that it's not raining, so A is false, you might infer that it's quite likely the street is dry, so B is false; but propositional logic doesn't allow this kind of statement. It refuses with good reason: there might be another explanation for the street being wet. Let's say there is a gardener outside with a hose, spraying water onto the street, so the street ends up wet even though it is not raining. The fourth possible combination doesn't work either: if you look outside and see that the street is wet, you might typically be willing to conclude that it has rained, but that also fails, because someone else might have wetted the street with water that didn't come from the sky, and you would have drawn a wrong conclusion.

So why does this not work in classic propositional logic? The reason is not the formalism itself; it is that the formalism is restricted to binary truth values, that it is only possible to say something is true or false. To remedy this problem, which is evidently an important practical problem, we need a formalism which extends binary truth values, or interpolates between them, to statements that are partly true, perhaps true, probably true. We would like to be able to do this kind of plausible reasoning. We want to be able to say that if A is true, then B becomes true, or at the very least more plausible; that if B is false, then A either becomes false or at least less plausible; that if B is true, which is something classic logic doesn't allow us to reason about at all, then A becomes more plausible (if the street is wet, it seems more plausible that it has rained; we're never quite certain, but we become more certain that it might have rained); and finally that if A is false, so if it's not raining, then B becomes less plausible: it's less likely that the street is wet, because the remaining potential explanations are ones we deem relatively unlikely.

We would like to write this with a notation that looks like this, which I'll introduce more formally in a moment: P(B | A), read as the probability for the statement B given that A is true, and P(B), the probability for the statement B to be true. A statement like this might replace the string A ⇒ B: instead of saying "if A is true, then B is true", we say "if we observe A to be true, then B becomes more plausible than if we do not observe A to be true". This is the kind of extension we want to make, and we have to make it, because everyday reasoning requires us to make these kinds of statements.

Perhaps a nicer way to put this is a quote by James Clerk Maxwell, one of the most seminal physicists of the 19th century, who said that the actual science of logic, so propositional logic, "is conversant at present only with things either certain, impossible, or entirely doubtful", so either true or false, "none of which, fortunately, we have to reason on", because this is not the situation we actually face in real life; life is never certain or impossible. "Therefore the true logic for this world is the calculus of probabilities", a notion of uncertainty "which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind". So if you are a reasonable human being, then I hope you are interested in this notion of uncertainty, and therefore in this course, because that is what we are going to do here. We will first establish a formal mathematical framework for probable reasoning, and we'll actually do that today in this lecture; then, over the course of the entire remaining term, we will use it to build a powerful collection of
mechanisms that apply to real-world problems. In doing so we will encounter many mathematical and computational challenges which have to be addressed by specific technical tools, so a large part of this course will deal with designing these algorithms, models, and tools, to get this inference off the ground and apply it to more than trivial problems like cards in a bag.

We're going to start right away with defining this formal framework, but maybe a word at this point: over the course of the term there will typically be, in every lecture, three or four of these gray slides. Each signifies an opportunity for you to stop the video, because it marks the end of a train of thought; maybe get up and get a glass of water, or walk around a little to collect your mind, and then, once you think you've understood this part, move on to the next one.

Okay, now it's time to construct our formal system of reasoning using probabilities, and before I write down some axioms, let's look at an example that gives an intuition for what we actually need: a roulette game. I'm sure everyone knows how roulette works. There is a wheel of numbers that rotates, and someone throws a ball around those numbers until it falls into one of the boxes. Notice that nobody actually interacts with the wheel other than the croupier; the players can watch the wheel, but the game actually takes place on the board next to it. This board lists all of the numbers on the wheel, but also a bunch of derived statements or variables: for example, whether the number we have seen is painted red or black on the wheel, whether it's even or odd, whether it's in the top or bottom half of the table, or whether it's in the top, middle, or lower twelve, so one third of the entire set of numbers. Players are actually allowed to construct combinations of these statements: because they have more than one chip, they might put a chip at the intersection of two numbers, or even at a four-corner intersection, and they might also bet on both "even" and a certain number, for example.

So imagine you are the designer of the roulette game, and your job is to construct a set of rules for this game such that it is fair. How is this connected to our notion of probability? Remember, what we want to construct is a set of reasoning rules which do not assign an indivisible chunk of "true" or "false" in a binary fashion to one particular statement, but instead take this chunk of truth and distribute it across possible statements, such that some statements can be more or less likely. What we need to construct here is, first, the wheel: the wheel is a mechanism that produces elementary outcomes, in this case 37 possible outcomes, the 36 numbers and the zero. The players don't necessarily make statements about those numbers directly; they make statements about derived quantities. The individual numbers are part of the possible statements, but there are also derived statements that are essentially subsets of the numbers: for example, the red subset is the subset marked in red on the board. And the third kind of quantity we need to make a fair game, after the wheel and the table and its layout, is the rule book of the game, the rules which ensure that the payout is fair; and we have to think about what it actually means for the payout to be fair.

We're now going to turn these statements, this intuition, into a mathematical formalism of axioms, and these axioms go back to a wonderful Russian mathematician, by all accounts a very interesting man, called Andrei Nikolayevich Kolmogorov. He was born in 1903, in the early 20th century, and lived until 1987, and he clearly had a very interesting life. His mother died in childbirth, his father didn't care for him, and he was raised by his aunt, who also afforded him a good mathematical education; there are all sorts of interesting stories connected to his name. Over the course of his life he became the father of modern probability theory. He wrote a book, which I have here; it's actually a very nice book, and I recommend it. It's very thin, as you can see, and it's written in German; there is also an English translation. It was written by Kolmogorov in German and published by Springer in 1933, which is why it's not so easy to Google for it: Springer doesn't advertise books it published before 1945. But I recommend you have a look, if you read German, or look at the English version, because it really is a wonderful text: very precise, very short, almost like a paper, and it essentially raises all of the issues that we have with probability theory to this day and discusses them very well.

So Kolmogorov provides one way of constructing probability theory, which I prefer because it is purely mathematical and extremely intuitive. There are other motivations for probability theory; one, for example, is connected with the American physicist Richard Cox and is more philosophically motivated, motivated from a notion of common sense. The advantage of Kolmogorov's formulation is that it is very precise and clean; the disadvantage is that there is one key aspect of probability theory, which I'll mention in a moment, that actually has to be defined by Kolmogorov rather than derived, as people like Cox derive it.

So how does Kolmogorov's system work? Remember our example of the roulette board: we needed three pieces, the wheel, the table, and the rule book for how the payout works, the rules of payout. Kolmogorov defines all three of these, and I'm going to show you the original text for a moment, which is in German; I literally copied this from page two of the book. I will read it out in English, don't worry, and then I will construct from it a slide with a modern version of these axioms. If you want to follow along really precisely, and you've already seen this before, you can check the minor differences between Kolmogorov's classic formulation and our more modern one.

So let's start with the wheel, the set of elementary outcomes. Kolmogorov says: we need an object which we will call E. E is a set of elements ξ, η, ζ, and so on, which we will call elementary events. That's the roulette wheel. Now we need the roulette board; for that we define this German calligraphic F, a set of subsets of E, whose elements will be called random events, and we require that this set of subsets fulfills the following axioms. The first is that it is what Kolmogorov calls a Mengenkörper, a field of sets, and if you don't know what that is, he handily provides the definition in a footnote by referring to Hausdorff, who says: a system of sets is called a field if the sum, the intersection, and the difference of two sets of the system are also part of the system. This is analogous to the notion of a field (Körper) in algebra, where you have operations with the property that applying them to elements of the field keeps you within the field; in classic algebraic fields the operations are called plus and times, while here the operations are called union, intersection, and difference. That's the first axiom we need. The second axiom is that our set of random events has to contain all the atomic events, so our roulette board has to contain all the
individual numbers on it. And actually, these are already the two axioms we need for this set of sets, so this is basically the definition of our roulette board. What we have basically done by this is set out a set of rules which says that our roulette players are allowed to place any bet by combining arbitrary elements that are already on the board. For example, they can construct the set of all red numbers and call it "red", they can define the set of the upper twelve numbers, and then they can define the set of the red numbers within the upper twelve numbers; that's an intersection.

Okay, so now we need the third part, the rule book. For the rule book we need to come up with a way of assigning probability to the atomic events: we will distribute a finite amount of truth, one, across the atomic events, and then come up with a rule for how those probabilities on the atomic events translate into probabilities on the derived events that are part of this set of possible events, this set of sets. The rules Kolmogorov sets out are these: for every set A in F we define a non-negative real number, written P(A); this function P, which maps each set to a non-negative real number, is what we will call the probability, and P(A) the probability of the event A. It is required that the probability of the entire set E, the collection of all atomic events, is one; that is what I meant by taking the block of truth of length one and distributing it across all the events. And if we have two disjoint events, for example the number 0 and the number 12, then it has to hold that the probability of the sum of these two events, the probability of "either 0 or 12", is the sum of the individual probabilities. That, of course, is going to be the mechanism by which we ensure that all of our derived events have meaningful probabilities. Such a system of sets with this function P, if it fulfills these five axioms, Kolmogorov says, we will call a Wahrscheinlichkeitsfeld, a field of probabilities.

Now, if you don't like reading this in German, I'll do it again in English, and the reason I'm going through this twice is that people often struggle with the modern introduction. If you're not a particularly mathematically inclined person, you might find the modern definition of probability theory a little cumbersome or difficult to understand, and if I had just thrown that definition at you, you might have found it quite hard. I hope that this way of getting to the rules has motivated a little better the somewhat weirdly named modern mathematical concepts that are now going to follow.

Here is how we would do this in modern English mathematics. It is essentially the same thing, but there is going to be a very subtle difference that we will notice in a moment. We first define our wheel and our roulette board, and think about the rules, the probabilities, later. To do that we actually do not use any notion of probability theory; we are just using the theory of sets, so you could call this set theory if you want; it is also called measure theory. We start with our wheel: it is again called E, the space of elementary events. Now consider, for this definition, the power set of E, the set of all subsets of E, and consider a subset of that power set, which we call F; these events are called random events. So far so good; that is exactly the same as in Kolmogorov. If F satisfies the following properties, we will call it, no longer a field of sets, but, by the somewhat annoyingly cryptic name, a sigma-algebra. A sigma-algebra is a collection of sets such that, first, the elementary events are all in the set (that was Kolmogorov's axiom number two; for us it is axiom number one), and such that, for any sets from F, F is closed under their differences, countable unions, and countable intersections. The difference to Kolmogorov's definition is the infinity up here: we allow countable but unbounded unions and intersections. That is why this is called a sigma-algebra and not a field of sets. It is, at least at first sight, a minor difference, and it requires a little bit of fixing later on; it is a post-hoc correction that later generations of mathematicians made to Kolmogorov's axioms to make them a bit easier to use, and we will see in the next lecture why it is actually important. These properties already imply that the empty set is part of F; this follows from the third axiom. By the way, for countable sets of atomic events E, you can simply think of the sigma-algebra as the power set, which has all these properties; this is not true for uncountable spaces, and we will have to think about that in lecture 3, but we will talk about it when we come to it, because it is a bit of a tricky bridge to cross. If F is a sigma-algebra, we call its elements measurable sets, and the space created by considering the atomic events together with all the subsets you can construct within the sigma-algebra is called a measurable space, or a Borel space.

Notice that so far I haven't defined probabilities at all. This entire part is just a statement about what we will consider an acceptable set of sets, about what is admissible to talk about: the ways in which players are allowed to place their bets on the roulette board to construct a permissible bet. Now we have to talk about the payout rules: how do we actually construct probabilities for individual events? By the way, if you are wondering why I talk about payout rules and probabilities almost interchangeably, think about what the inverse of a probability is.

Okay, good. That was our definition of the roulette wheel and the table; now comes the definition of the rule book, the probabilities. So let's take a measurable space, a Borel space, and define a function P, which we will call the probability of an event; Kolmogorov even has an axiom for the fact that this thing exists and that it maps to the non-negative reals, not the negative numbers. Such a P is called a measure; it is not yet a probability, just a measure, if the measure of the empty set is zero and if, for any countable sequence of elements of the sigma-algebra which are pairwise disjoint, so that they do not overlap or intersect, we require, and this is really the master axiom of probability theory, not a master theorem but a master axiom, that the measure of the countable union of these disjoint sets is equal to the sum of the measures of the individual sets. This is a very natural definition. It just says: imagine you have two separate possible events, on the roulette board this might be the top twelve numbers and the bottom twelve numbers, and we have assigned probability to both of these events; then the probability of the "or" of these two, their union, has to equal the sum of their probabilities. Why? Because otherwise we could construct what is called a Dutch book: a betting strategy which guarantees that we will win. So this is the definition of a measure, and now there is one final add-on, the only distinction between measure theory and probability theory, a single line which says: such a measure is called a probability measure if the measure of the union of all atomic events is one. All that says is that there is no infinite truth in the world; there is only a finite amount of
truth a statement is either true or it's partially true but it's never more than true okay that's actually really straightforward and that in itself is the entire TV need to move from sex theory to probability theory why is this important it's important because most people agree with sex theory but there are many people who disagree with probability theory so if you disagree with probability theory either you have to disagree with set theory for countable and even finite sets or you have to disagree with that one line that there is only that a statement cannot be more than true I challenge you to do one of them and not feel stupid in doing so why is this a great formalism well as everywhere in mathematics once you've written down a set of axioms you want to set put them to use by actually dividing something from them and come over to us that immediately actually on page four or five or so of his office book he starts showing interesting properties so first of all there is a wonderful result that is called the sum rule I'll just state it and then I'll tell you why it's useful so the sum rule is a very simple observation notice that the entirety of the atomic events E can be written as the sum of any element of the of the sigma algebra any a and its complement of course that's really just a definition of the complement right that's not even an axiom is just a definition now therefore we can write the probability of E which is 1 by definition of axiom 4 using sigma additivity 5 to get that 1 is equal to p of a plus p of the complement of a so therefore the probability for an event a is 1 minus the probability of its complement okay that's if you like the law of inverse probability not sorry not of inverse probability of complementary probability now and now we're going to make a definition so we'll just introduce a new notation we will talk about the joint probability p of e a and b as a short-hand for the probability of the intersection of a and b this notation will 
be much more convenient later on when these a's and b's are not always sets anymore but we will adapt the notation a little bit so this joint probability we can make a statement for that by noting that any a is equal to and this is again just that theory a is equal to the intersection of a and e and e can be written as b and non-b or b at the complement of b so therefore using sigma additivity p of a is equal to p of a and b plus p of a and the complement of b so the probability for a is the probability for a so how does how does how can we describe a well we can describe a in terms of the part that intersects with b and the part that doesn't intersect with b right that's pretty straightforward but we are going to use this sum rule very extensively in probability theory to for the operation of getting rid of one variable so what this rule says is if you have two variables in your reasoning system a and b and you'd like to get rid of b because you don't know anything about b you just uncertain about it then this rule tells us how we just sum out the possible values of a the second statement we'd like to make is unfortunately this is the point where gone McGorough with his strong mathematical bend has to actually define something rather than derive it from common sense is a definition rather than a theorem so it's a definition for what's called the conditional probability conditional probabilities will be written like this and we will call them read this line as the probability for b given a that's how you write this line which is defined as assuming that a has probability larger than zero as the probability of a and b divided by the probability of a how should we think about this well for me along if you think of the entire t of e as the circle then we can draw a nice little venn diagram so let's say this is b and this is a then p of the function p the probability assigns a probability of one to this entire circle and it assigns a certain amount of probability to be 
now to construct and to a of course now we assume that the entire probability in this circle is larger than zero and we can construct a new probability on just this domain by computing the probability of this subdomain and dividing it by the probability for a now why is this interesting it's interesting because and you can easily show this yourself I'll leave it to you as a very simple finger exercise it's really just plugging in definitions and you can show that the probability for b given a this conditional probability is always larger or equal than zero so this is the axiom 3 for probabilities of come on or off such as they are a map that maps to positive non-negative real numbers and it's easy to show that the probability of e given a is one literally just plug in the definition of e in up here and for disjoint sets we also have sigma additivity so therefore this function actually is itself a probability that's useful because it allows us to move to restricted probabilities so if you know that we are within a this operation provides a new probability within the domain of a and what are we going to use this for in reasoning well we'll use it to reason about one variable given the other so if you have information that comes from variable like you want to say something about b but you've made an observation about a then this rule tells you how to incorporate this in your reasoning process we're almost done though we're not quite done yet because there is this p of a down here and that's kind of often a problem to construct so we have to say what this actually means and for that we use what's called the law of total probability which is actually an extension if you like of the sum rule so if we again consider a set of these pairwise disjoint events a so that they do not overlap I also assume that together they span the entirety of e so this could be for example a1 a2 and the complement of a1 and a2 any then for any element of the sigma algebra for any event x the 
probability for X can be written, represented almost as in a basis, or sorry, not a basis, a generating set for X, as a sum over conditional probabilities times probabilities: the probability for X given A_i times P of A_i. Why is that? Well, here's the entire proof: we just notice that any X can be written as the intersection between X and the entire space, then use our assumptions about this set of disjoint sets to reconstruct E from the elements A_i, and then use sigma-additivity, our rulebook for how probabilities add up, to notice that this property is true. So we just applied the definition of the conditional probability and sigma-additivity, and that's it. So why is this useful, what are we going to use this law of total probability for? Well, we are going to use it to complete the construction of the conditional probability into a wonderful theorem that provides the basis for all inference processes using probabilities, and that is called Bayes' theorem. It states that the probability for A_i given X is the probability for A_i times the probability for X given A_i, divided by the sum over all such terms. This should be a j in the sum, I'm sorry, I'll correct that in the slides, there should be a j here of course, another summation index. So how do we get to that? Well, the proof is straightforward: it's just the definition of the conditional probability and plugging in the law of total probability. This Bayes' theorem provides the basis of all probabilistic reasoning, but I'll tell you more about that after the break. First I'll show you a summary of what we just did. These axioms I just wrote down are, if you think about them, hopefully very easy to buy, because they're really just set theory with a measure on sets, and the only constraint that makes that measure into a probability is that there is only a finite
amount of truth. That's it, so there's not much to question about that. If you accept this, then you can directly derive these wonderful theorems: the sum rule, which is a way of getting rid of variables in your reasoning problem; the product rule; and, arising directly from the product rule and essentially the sum rule, which is restated as the law of total probability, Bayes' theorem, which is a mechanism for making statements about one variable given that you've seen some other variable, that you have some other piece of information about where we are within this Venn diagram. In a moment I'll give names to these individual terms and then we'll talk about how we can use them to do something interesting, but for now you should take again a brief break, stop the video, think about what we've just done, and then let's come back and continue. All right, now that we've introduced this formal mechanism of reasoning under uncertainty, of distributing knowledge across several possible explanations, we can spend the rest of this first lecture getting a bit of a feel for how this mechanism works, appreciating its strengths, and also finding a few challenges that we will have to deal with over the rest of this semester. To begin with, I first want to give a few philosophical interpretations for the terms in Bayes' theorem, the ones that many of you will have heard about before. The way I introduced this framework so far didn't require me to give philosophical interpretations for the terms that show up in these expressions we've been looking at, but of course historically this framework is much older, much older in fact than Kolmogorov's measure-theoretic formulation, and the terms in Bayes' theorem have always immediately been assigned philosophical interpretations. So this is not going to be a big surprise for most
of you. So let me just move the mouse out of the way. This probability for a hypothesis X given the data D has always been interpreted as a posterior distribution, and in fact you already noticed me using words like data. Let's say we have two different statements, an observation and a latent quantity: data D and latent quantity X, something you see and something you want to reason about. Then this conditional distribution for the thing you'd like to know given the stuff you got to see is called a posterior probability, and it arises, by Bayes' theorem, by multiplying P of X with P of D given X, the conditional, and dividing by the probability for D, which, by the law of total probability, can be written as a sum over all possible values that X can take, all possible elementary events if you like, mutually disjoint elements of the sigma-algebra, by summing over all of these possible X's. So this term up here, P of X, is usually called the prior probability. P of D given X is called the likelihood for X; notice that when we use the word likelihood we talk about a function of X. P of D given X is a probability for D given X, it's not a probability for X, but it's treated as a function of X, and the word likelihood has been specifically reserved for this kind of use. And this denominator, P of D, which can also be written as this sum, is known as the evidence for the observation D: it's how likely this observation D is under any possible explanation for how it might have come about. Under this interpretation we can think of Bayes' theorem as computing a posterior distribution for X given the data by weighing the individual possible explanations for the data, the underlying hypotheses, relative to each other. So how likely is it that X is the correct explanation for the data? Well, for that we have to reason first about how likely X is in itself, and then multiply with the
probability to observe the data if X is indeed the correct explanation, and then normalize this probability distribution, because that's how we defined conditional distributions, as normalized probabilities, by dividing by the sum over all such possible explanations, by summing out over these individual terms. Many of you have heard these terms before and we're going to talk about them many times over the course of this semester, so I'm not going to dwell too much on this. One key point to keep in mind here, and I will get back to it later, is that, as we just derived this framework, you might have noticed that the key part of the construction was the construction of the sigma-algebra and the probability measure P. All of these together form the set of assumptions that we use when we do probabilistic reasoning; it's not just the prior distribution. To give you a feeling for that, I'd now like to go through a few example applications and problems and questions to see how we use Bayes' theorem in practice. Let's start with the very simple example from the beginning of this lecture. I showed you three different cards, one of which was red on either side, one of which was red and white, and the third and final one was white on both sides, and the question was: what's the probability, given that you see a red side, that the other side is also red? I gave a simple explanation for how you could answer this in an informal way: you could notice that there are three red sides and two of them have red on the other side, so the probability must be something like two-thirds. Let's see whether Bayes' theorem reconstructs this result. To do so we have to introduce a little bit of nomenclature. We're going to define a variable which we call card, the identity of the card. There are three of these cards, let's call them one, two, and three. Card number one
is the one with two red sides, card number two is the one with red and white, and card number three is the one with white on both sides. And there is another variable called color, which takes two possible values, not three: it is either white or red. Now we can write down the conditional, the posterior probability, for the individual cards one, two, three, given that you've observed color red, and we can just apply Bayes' theorem: this posterior is the prior times the likelihood divided by the evidence. So what are these terms? Well, first we think about the prior probability. When people talk critically about probability theory they often bring up the prior distribution, the prior probability, as a key issue, a key philosophical problem, but in this case, and I hope most of you will agree with me, the prior is actually totally unproblematic. I just picked a card out of the bag, and if you believe me that this is actually what I did, that I didn't cheat in doing so, then of course the prior probability should be, think of it for yourself, one-third, right? Because there are three cards. So the prior is very easy here. The more interesting object is actually the likelihood: what's the probability to observe the color red given that we have card one, two, or three? Well, for the first card this likelihood is evidently one, and I will write down this number here, because there is no other possibility than to see red. For the second card the likelihood is one half, and for the third card it's actually zero, because if it's the white card there's no way I'm going to see red. So this is the only source of structure that enters this problem; the prior is totally uniform. So what's the evidence, the normalization constant? This is also easy: it's one-third, because that number can be taken out of the sum, times the sum of one plus one half plus zero. That's three halves times a third, which is one half. The prior is the same for every
possible card, so these two numbers together give two-thirds, and the only thing that changes from one hypothesis to the other is this likelihood, which is either one, one half, or zero. So the question I asked at the beginning of this lecture, what's the probability, given that you see red, that the other side is also red, in our namespace here, where cards are numbered from one to three, essentially is: what's the probability for card one? Because that's the one that has red on the other side. And we can now see, if we take this likelihood and plug it in here, we get exactly two-thirds, which is what we wanted to see. Another thing you might notice is that the probability for the other card, the one with white and red, is one half times two-thirds, which is one-third, and two-thirds plus one-third is one, which is something our mechanism requires: the total sum of probabilities remains one. And of course this only works because the third card has probability zero, which it has been assigned by the likelihood. Here's a pictorial view of this; maybe this helps for some people. You could give names to these individual variables, call them card and color, and we get to see the color. We assign a prior probability to all three cards; that probability I've set to be uniform. Of course I don't have to do this, there's no mathematical reason I have to, I could distribute truth in a different way, but our assumption, and I hope you believe me that this is a good assumption, was that we draw cards with equal probability from the bag. Then we multiply by the likelihood: the likelihood is either one, if I take the red-red card and see the red side, one half for the red-white card, or zero for white-white. And then I divide by the evidence; the evidence is one half, so dividing by one half is like multiplying by two, and this gives us our posterior distribution. The next example we
should talk about is the observation I made at the beginning, actually after the first few minutes of this lecture, that propositional logic is too limited to represent everyday reasoning under uncertainty. Just to remind you, I made this example of two variables: it's raining outside, and the street has become wet. Classic propositional logic allows us to say: if it rains then the street gets wet, and if the street is dry it can't possibly have rained. What you'd like to do is to extend this to include weaker statements, so to be able to say: if it rains the street becomes more likely to be wet, or, if the street is dry it becomes less likely that it has rained; but also to add the two kinds of inverse reasoning processes that are not possible under propositional logic, which are: if the street is wet it becomes more plausible that it has rained, and, if it hasn't rained it becomes less plausible that the street is wet. It's actually possible to show that this is the case very simply by plugging real numbers into Bayes' theorem and checking that the inequalities hold. I'm not going to do this here because it's actually one of your homeworks; we're going to talk at the end of this recording very briefly about how homework exercises work for this lecture, and there will be more discussion of that in our first inverted classroom on Tuesday the 21st of April. Instead of spending too much time with this example, I would like to show you something else and give you a concrete and quite recent example of why it's useful to know about probability, and why reasoning under uncertainty is often not as intuitive as people think. To do so, let me see if I can briefly do this here, I want to use an example that is happening in real time just now. About a week ago, I think on the 6th of April, the US Food and Drug Administration,
the federal FDA, provided an emergency permission to a company called Cellex to introduce to the market an antibody test for the coronavirus disease, COVID-19. Antibody tests are different from virus tests: this is a test that checks whether it can detect in your bloodstream immune responses specific to the coronavirus, which are an indicator that you've probably gone through the infection and suggest that you're now likely immune to this disease. You can see a little drawing here; this document I've opened, by the way, is the official spec sheet for this test, provided by the company that built it. The device is a use-at-home device, if you like; it looks a little bit like a pregnancy test. You put a drop of blood that comes from a pinprick onto this part of the device, and then over time up to three of these test strips become red, and if they're all red, that means the test has detected antibodies in your blood. This test has gone through the news in particular because the British government has announced that they are ordering a very large number of these tests, as far as I know about 3 million, to distribute by mail to the British public, so that people can test at home whether they have gone through this coronavirus disease, because of course people are a bit unsure whether they've actually had the disease or not, since there are all these rumors, or also evidence, of asymptomatic courses of the disease. So if you look closer into this document, one of the interesting bits you can find is over here: there's this table that lists, it's a little bit complicated the way it's printed, but essentially it lists true positive and true negative rates, percent positive agreement and negative percent
agreement, and the corresponding numbers are 93.8 percent and 96 percent. That means: if you have been exposed to the disease and your body has produced the antibodies for it, then the probability that the strips in this test will actually show up is 93.8 percent. That's the true positive rate, and therefore the false negative rate is one minus this number. In the other case, if you have not gone through this disease, if your body has not produced antibodies, then the probability for this test not to produce all three stripes is 96.0 percent, which means the false positive rate is four percent, one minus 96 percent. Okay, so these are the numbers, and to a layman like me the sensitivity and specificity of this test sound like relatively convincing performance. They might convince you that this is actually something important to test at home, and I think a lot of people, particularly in the UK but also all across the world, are looking forward to having such a simple test at home, delivered by mail, to put a pinprick on their finger and see whether they are immune or not, because of course this has massive economic implications as well. So now let's think about what these numbers actually mean. If a random person, uniformly selected from the population, gets this test delivered by mail, tries it out at home, and let's say the test is actually positive, so all three stripes light up, then what is the probability that this person actually is immune to the disease? For that I have to go back to my slide and we just plug these numbers into Bayes' theorem. These are the numbers I just quoted to you from the spec sheet: the probability for a positive test, T being the outcome of the test, given coronavirus infection C, is 93.8 percent, and the probability of a negative test given that we
don't have antibodies in our bloodstream is 0.96. Those are the numbers we can now plug into Bayes' theorem using the formalism we've just constructed, and the question we would like to ask is: given that I get a positive test result, what's the probability that I actually have antibodies, that I'm actually immune to the disease? For that I just have to write down Bayes' theorem: P of C given T, the posterior, is equal to the likelihood times the prior, P of T given C times P of C, divided by the evidence, which consists of the two possible explanations for this positive observation. Either I am actually immune and then I get a positive result, or I'm actually not immune and this is a false positive. So we need the probability for this test to be positive given that we don't actually have the coronavirus. This is a number we don't have up here, but it's easily computed using the theorem of complementary probability, if you like: P of T given not-C can be written as one minus P of not-T given not-C. Why can we do that? Because conditional probabilities are themselves probabilities, as we saw earlier. And the prior probability not to have the coronavirus is of course just one minus the probability to have gone through the infection and be immune. So we know many of these numbers now: we can plug in the true positive rate down here in both places where it shows up, and one minus 96 percent up here to get 4 percent, and the only thing we don't know yet is the probability to actually have the virus. So a lot now depends on our prior probability to have the virus. Of course this is very debatable, but at the moment I think it's realistic to assume that there's still only a very small percentage of the population that has actually gone through the infection. In Germany the official numbers, at the time I'm recording this video, are that the Robert Koch
Institute lists about a hundred thousand cases of coronavirus infections in Germany. Let's assume that that's a massive underreporting and there are actually ten times more cases, which might be a realistic assumption. That means there are roughly a million people in Germany who have gone through this infection, which is about one percent, plus or minus a little bit, of the population, because there are about 80 million people in Germany and a million is something like one percent of 80 million, right? So let's say the probability to have had the virus is one percent. Then we can just plug this number, 0.01, into our computation, and that means the posterior probability to actually have immunity, given that you get a positive test, is barely 20 percent. For many people this is a surprising result. I'm sure you as experts have seen this kind of computation before and are not so surprised by it, but there is a large number of people out there who do not understand this kind of reasoning, and I believe that if this test is rolled out to an unprepared population, there is a very high probability that a large number of people will believe they are immune even though they're actually not, because of all the people who get a positive test only about 20 percent will actually be positive, and the remaining 80 percent will be false positives. So why is this? Again, Bayes' theorem provides us with an explanation of what's going on here. The problem lies in the denominator, where two products get added. The probability for a false positive is relatively small, it's essentially 4 percent, but it gets multiplied by one minus an even smaller number, so by almost one. So the two explanations for this positive result consist of a large number multiplied by a small number, plus a small number multiplied by an even
larger number, and the second term more than cancels the first, so we end up with this low probability. So, to conclude this lecture, let's take another look at the processes that happen when we as humans do our own internal reasoning with uncertainty, how they are reflected by this notion of probabilistic reasoning and inference, and also how some of the flaws of our own human reasoning show up in the probabilistic framework as well. Let's say you're sitting at home because you're in quarantine, and you're reminiscing a little bit, pondering, weak and weary, over many a quaint and curious volume of forgotten lore, and suddenly there's a tapping at your door. Now you might convince yourself that the only possible explanation for this is some visitor, tapping at my chamber door; only this and nothing more. So there's just a person outside who's knocking on your door: it must be a visitor. Well, let's say there are many possible visitors you could get at this time of the day, so you might represent these with different variables, from V1 to however many visitors you might actually expect, and let's assume every possible visitor has the same probability of knocking, because that's just what visitors do, right? They show up, they knock on your door; what else are you going to do if the door is locked from the outside? Then Bayes' theorem is going to turn out to be helpless, or useless, to you: if your likelihood is uniform, if the probability for tapping given that it's a visitor is the same for all possible terms in the evidence, then you can take that number outside of the sum down here, and it cancels out, and your posterior probability is going to be equal to the prior probability. Maybe that's not so surprising, because if someone knocks on your door you don't know who it is, right? Data that has a constant likelihood under all
hypotheses doesn't actually provide any information and doesn't change the posterior distribution. This actually happens more often than we'd like: you end up with likelihoods, or kinds of data, that are almost uninformative about the latent quantity we actually care about. But now let's say there is a special kind of visitor you might be particularly hoping to see again, maybe some long-lost love interest called Lenore, this rare and radiant maiden whom the angels named Lenore, whose loss you are sad about. Then your mind might play all sorts of tricks on you to convince you that this is actually the person tapping at your door, and that might happen, under the probabilistic framework, in two different ways. Maybe you believe that this particular person, if she happened to wander past your door, would of course be tapping, because she's that special person and she would never just walk past your door without paying you a visit. That would mean that your likelihood is different for this particular visitor than for all the other ones. Or your brain might convince you that it's much more likely that this person would walk past your door in the first place, or come and pay you a visit, so that this special person named Lenore gets a higher prior probability to show up. This is actually the kind of worry that many critics of probabilistic reasoning mention: that by changing the prior distribution you can more or less create any possible explanation for the data. If you just convince yourself that this person Lenore is so much more likely to visit you than everyone else, then it doesn't really matter what the likelihood is; as long as the likelihood for her tapping isn't zero, you will just always push yourself to believe that this is the person who's currently out there knocking on your door. So maybe then you actually realize that it's the other way around, that
you have lost this person, that she's never going to visit you again, for reasons that we might not fully understand; maybe she's dead, maybe she just doesn't like you anymore. Then the prior probability you have to assign to this hypothesis is actually zero, and by convincing yourself that of course this person is going to come visit me, you're just going to be wrong. You have to force yourself to put the prior probability to zero; only then will you maybe get the correct answer, which is that it's very, very unlikely that this special person is knocking at your door; the probability might even be zero if she's actually dead. So by changing the prior distribution you can convince yourself of more or less anything, and that's of course something people are often worried about when they think about probabilistic reasoning in practice. Over the course of this term we will find that this problem is often indeed present, but it is much more subtle: it's often created not so much by how we distribute truth across the hypothesis space, by what the prior distribution actually is, but by how we construct this space of hypotheses in the first place. So imagine that you actually throw open your door, right, your soul grows stronger, you hesitate no longer, and you open wide the door and find darkness there and nothing more. You've just observed that there isn't any visitor outside, no Lenore, nor anyone else; there's just no one standing outside your door in the corridor, so the probability for any of these hypotheses is actually zero. What this creates is an inconsistency in your reasoning system, because you've just observed that there was a tapping, so the probability for the tapping has to be larger than zero, because you observed it, right? And probability theory requires that the probability assigned to the entire hypothesis space is one, but we've just decided that all possible hypotheses have probability zero, so we can't sum up
numbers that are all zero and get back one, and we also can't compute a conditional distribution given the tapping if there is no explanation in the hypothesis space for the tapping. So the problem here is not probability theory; the problem is how we set up our sigma-algebra, or actually even our atomic space of events. To fix this, and I think this actually happens quite often in human everyday reasoning when we encounter a surprising result, we have to come up with another explanation for the observation, one that we previously considered to be almost impossible. A philosophical treatment would now maybe say: originally there were actually many, many other hypotheses that your brain just pushed down to such a small probability that, for reasons of bounded rationality, of bounds on its use of resources, it essentially set them to zero. But if you're honest, there is actually a large number of additional hypotheses to which we all, just for convenience's sake, assigned a probability of zero, but to which we should actually have assigned some probability epsilon larger than zero. So maybe you begin to think for yourself, turning back into the chamber, now that you hear another tapping: surely that is something at my window lattice; let me see, then, what thereat is, and this mystery explore. Maybe you've come up with another explanation of what could be the source of this sound: 'tis the wind and nothing more. So now you've added one more possible hypothesis, and because we've done that at a later point, our previous reasoning of course had to be flawed, because this hypothesis was never part of our reasoning process to begin with. This is going to be a very frequent problem in our inference process: we have to write down the correct variables before we even begin reasoning, because otherwise all of our results could be flawed, no matter what the prior distribution is. Now, of course, those of you who've seen this poem before know that even this
hypothesis is actually wrong: as you open your door, in comes, with many a flirt and flutter, a stately raven of the saintly days of yore, and lands on the pallid bust of Pallas just above your chamber door, and perches, and sits, and nothing more. So if your hypothesis space just never includes the correct explanation to begin with, then probability theory cannot help you at all; it will always assign zero probability to the correct hypothesis. This is a big worry, of course; it's the most fundamental worry anyone could have about any reasoning system. But this problem has nothing to do with the mechanism of probabilistic reasoning; it has to do with the set-theoretical part of probabilistic reasoning, which is that you have to set up your space of hypotheses before you even begin reasoning. Okay, another simple way to phrase this is a famous quote by one of the fathers of probability theory, whose name I will maybe mention even more often than Kolmogorov's, and who was around long before Kolmogorov: the French mathematician Pierre-Simon Laplace, who lived in the 18th and 19th century and provided some of the basic formalisms for probabilistic reasoning before there was a set-theoretic formulation for them. He seems to have been an extremely intelligent chap, and one of his famous quotes, among many, is that probability theory is nothing but common sense reduced to calculation. The examples I just showed you hopefully highlight this issue: it's down to you to define your hypothesis space, to assign prior probabilities to all the latent variables, and also to provide a likelihood function that gives a conditional probability for any possible observation given the hypothesis. All three of these objects, the hypothesis space, the sigma-algebra, and the joint probability measure over all of these variables, are part of what you have to design when you build your own reasoning system
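As a small illustration of this design task, here is a sketch in Python of the full pipeline we've been discussing: a finite hypothesis space, a prior, a likelihood, and the Bayes update, applied to the examples from this lecture. The function and variable names here are my own choices for illustration, not part of any library:

```python
def posterior(prior, likelihood):
    """Bayes' theorem on a finite hypothesis space.

    prior:      dict mapping hypothesis -> P(hypothesis)
    likelihood: dict mapping hypothesis -> P(data | hypothesis)
    """
    # Evidence via the law of total probability: sum_j P(data | h_j) P(h_j).
    evidence = sum(prior[h] * likelihood[h] for h in prior)
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Card example: uniform prior over the three cards, and the likelihood of
# seeing a red face given each card.
cards_post = posterior(
    prior={"red/red": 1 / 3, "red/white": 1 / 3, "white/white": 1 / 3},
    likelihood={"red/red": 1.0, "red/white": 0.5, "white/white": 0.0},
)
# cards_post["red/red"] comes out as 2/3, as derived on the slides.

# Antibody-test example: prior P(C) = 0.01, true positive rate
# P(T | C) = 0.938, false positive rate P(T | not C) = 1 - 0.96 = 0.04.
test_post = posterior(
    prior={"immune": 0.01, "not immune": 0.99},
    likelihood={"immune": 0.938, "not immune": 0.04},
)
# test_post["immune"] comes out just above 0.19, the "barely 20%" result.
```

Note that the only design freedom sits in the two dictionaries: the keys are the hypothesis space, and the values are the prior and the likelihood; the update itself is always the same mechanical rule.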
and the pitfalls you will encounter in the use of probability theory will typically, well, they will never arise from the use of Bayes' theorem; they will always arise from these kinds of flaws, that there are explanations we have not considered, or that there are mechanisms at play in the likelihood that we have not considered in constructing our reasoning system. But there will be time enough over the entire semester to think about this issue. With that I'm at the end of the content part of the lecture. I also want to very briefly use this opportunity to bring in two pieces of technical or administrative information for those of you taking this course and not just watching for fun on YouTube. We will begin the interactive part of this lecture with a flipped classroom on Tuesday the 21st of April, from 10:15 to 12 o'clock, and we will try to do this once every week, every Tuesday in this time slot. This will be an opportunity for you to ask me questions. We will also use the other time slot originally assigned to this lecture to do exercises; I'll say something about this in a moment. To get into this flipped classroom we need some form of entrance control, some identification, and to do that we will provide credentials to all Tübingen students who sign up on Ilias. For that you of course have to have an account on Ilias; if you don't have one and you think that you should be allowed to take this course because you are a qualified student in Tübingen, then please send me an email; you can find my email address on our website. If you already have an Ilias account, then of course you can sign up yourself by going to my teaching webpage, where there is a link to sign yourself up on Ilias directly. Many of you have already done so; I'm looking forward to hearing more from you. Another thing you might know, if you've seen some of my lectures before, is that I collect instant feedback on every single lecture that I give. Usually I do this
with pieces of paper that people fill out. Obviously that is not going to work this term, so instead there will be a poll on Ilias for this lecture, another one for the next, and one for every lecture following it. I hope you fill out this form; please do, because otherwise I have even less feedback about what you think of this course than I would usually have. I really cannot see your faces, and I don't even hear you moan if I say something stupid. Okay, the other thing I should briefly mention is that there are of course exercises associated with this course, which you have to take for credit. I will say more about how the exercise system and the tutorials are going to work in the flipped classroom, but as an advance notice it may suffice to say that there will be a plenary tutorial, hopefully, if it works out, every Monday from 10 c.t. to 12, that is, 10:15 to 12:00. This course comes with a number of exercises: there will be basic exercises, relatively simple mathematical or theoretical exercises, and then every week there will be a programming exercise that goes alongside the course and gives you an opportunity to really try your hand at the mechanisms, algorithms, models, and tools that we are going to encounter over the course of this term; they will quickly become much more hands-on. As a start-up exercise this week, we have given you something that I think is actually really important at this stage in the development of machine learning: an opportunity to test for yourself how useful deep learning, the standard tool that many people now think of as machine learning, actually is for generic data sets. Many of you might think, having taken a deep learning course last year, that all of machine learning is deep learning, and if not, then all of old machine learning is now outdated and everything should be deep learning. We should test whether that is actually true with this week's exercise,
which I'm not going to read out, but it basically works like this. Here is a data set many of you have seen before: the famous Keeling curve of CO2 concentration in the atmosphere, collected over the Mauna Loa volcano on Hawaii. Your task this week is to use any standard deep learning toolbox you like and construct a prediction of this data set into the future. I am very much looking forward to seeing the results of this modeling task, what you come up with. While you do it, maybe pay attention to a few interesting questions, one of which is: how easy is it to set up your model? I am actually hopeful that some of you will come up with quite interesting deep learning models. How much knowledge do you think it takes to build these kinds of models? Later in the course we will encounter probabilistic models that allow us to do the same thing, and we will have to compare how hard those models are to build. Also think for yourself about how much you trust this model; that has something to do with how much you understand it and whether you think it is the right model or not. And think for yourself about the uncertainty that this model might be associated with for you, that is, how much you actually trust it. With that I am at the end of the lecture. I have tried to use these first few minutes that we have together this term to introduce you to the notion of uncertainty, to point out that uncertainty is a fundamental part of our daily lives, of scientific, medical, and societal processes, even political processes, and that being able to deal with uncertainty is one of the most important parts of human intelligence. To address this we constructed a formal reasoning system, called probability theory, purely from a relatively basic axiomatic system, based only on the distribution of a finite amount of truth across sets of sets. And we noticed that there is actually very little philosophical motivation we have to provide for this system: we only have to
assume that we have a certain space of hypotheses over which we distribute a finite amount of truth, and then we just have to make sure that when we construct sets of subsets of this space, we add up probability in the correct way, so that we neither accidentally add extra probability nor lose some in the process, either of which would stop it summing to one. In doing so we also arrived at the two elementary rules of probability theory, the sum rule and the product rule. Actually, the product rule was not so much a theorem as an axiom; it was in fact a definition. I pointed out that there are also philosophical ways of motivating this choice of the definition of the conditional distribution; we did not go through them today, just to save some time. These two rules together yield an almost immediate corollary known as Bayes' theorem. Bayes' theorem provides the mechanism for reasoning under uncertainty: it relates the posterior probability of a hypothesis given some observed data to a prior probability and the likelihood of observing this data if that particular hypothesis is true, normalized by the so-called marginal probability of the data, which arises by summing over all possible explanations of the data under the various hypotheses. We noticed that this framework allows us to formalize simple reasoning processes, even those that might be quite important at this point in time for our society. And we also saw that this mechanistic process does not absolve us, the designers of algorithms, from the responsibility to actually put the right assumptions into our model. This does not just amount to distributing prior probability; in fact, distributing prior probability is often the easiest part of the process. The much more crucial aspects are that the hypothesis space has to be the right one, and that the likelihoods
have to actually faithfully capture the relationships between the latent variables. I hope you have enjoyed this lecture, and that I'll see more of you, or rather that you will see more of me, in the coming months. Thank you very much for your attention.
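As a compact written reference for the summary above, the sum rule, the product rule, and their corollary, Bayes' theorem, can be stated as follows, writing H for a hypothesis and D for observed data:

```latex
% Sum rule: the marginal probability of the data arises by
% summing out all possible hypotheses
P(D) \;=\; \sum_{H} P(D, H)

% Product rule: taken here as the definition of the
% conditional distribution, not as a theorem
P(D, H) \;=\; P(D \mid H)\, P(H)

% Bayes' theorem: posterior = likelihood x prior / marginal,
% an almost immediate corollary of the two rules above
P(H \mid D) \;=\; \frac{P(D \mid H)\, P(H)}{P(D)}
         \;=\; \frac{P(D \mid H)\, P(H)}{\sum_{H'} P(D \mid H')\, P(H')}
```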
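For the start-up exercise, a minimal sketch of the kind of setup involved might look as follows. This is not the official exercise code: it trains a tiny one-hidden-layer network written from scratch in NumPy rather than using a full deep learning toolbox, and it fits a synthetic stand-in for the Keeling curve (a rising trend plus a seasonal oscillation), since the real Mauna Loa CO2 record is not bundled here. All constants and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the Keeling curve: rising trend plus a
# seasonal cycle plus measurement noise (the real exercise uses
# the measured Mauna Loa CO2 record).
t = np.linspace(0.0, 1.0, 400)                       # normalized time
co2 = 315 + 60 * t + 25 * t**2 + 3 * np.sin(2 * np.pi * 40 * t)
co2 += rng.normal(0.0, 0.3, size=t.shape)

# Standardize the target so plain gradient descent behaves well.
y = (co2 - co2.mean()) / co2.std()
x = t.reshape(-1, 1)

# A minimal one-hidden-layer tanh MLP, trained by full-batch
# gradient descent on squared error.
H = 32
W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, (H, 1)); b2 = np.zeros(1)
lr = 0.05

for step in range(5000):
    h = np.tanh(x @ W1 + b1)          # hidden activations
    pred = (h @ W2 + b2).ravel()      # network output
    err = pred - y
    loss = np.mean(err**2)
    # Backpropagation of the mean-squared-error gradient.
    g_pred = (2.0 / len(y)) * err.reshape(-1, 1)
    gW2 = h.T @ g_pred; gb2 = g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (1 - h**2)
    gW1 = x.T @ g_h; gb1 = g_h.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

print(f"final training MSE (standardized units): {loss:.4f}")
```

Even when such a model fits the training period well, its extrapolation into the future, which is what the exercise actually asks for, is exactly where the questions raised above about trusting the model, and about the uncertainty you associate with it, become pressing.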