We still need to get to a little bit of theory that I just can't get away without in a course called probabilistic machine learning. And that's going to be the first 45 minutes of this lecture. It's going to be very theory laden. And I need to do it also for a very concrete reason: we need to get to one particular insight that we're going to use concretely across the course and that you also need to understand for your exercise sheet this week. That's why I do the first 45 minutes the way we do. And then, because I know that it's going to be tedious for some of you, the second half is going to be much, much more hands-on. So if you get annoyed by what's coming now, wait for the second half. And if you really enjoy the first half, my apologies for the second half. So we still need to talk about, so in the very first lecture I introduced the axioms of probability theory and I took them directly from Kolmogorov's original text. Now, Andrey Kolmogorov wrote this text in 1933, and back then measure theory, or actually set theory, was still in a phase that wasn't entirely completed yet. So some of you, well, I know actually, I know some of the faces in this room from my theory of computer science class about a year ago, or actually exactly a year ago. And if you were in that class, you may remember that there we also talked about history that happened around 1933, and sort of right before and after the Second World War, when people realized that certain things were much more complicated than they thought. There's a similar story for measure theory that we need to briefly touch upon, mainly to give us the mechanisms to do something interesting in 45 minutes, and also to highlight that there are a few dead bodies buried underneath the stuff that we do. But this is a computer science class, so we're not going to do the full-on math. Can you close the door when you come in? Okay, so here's the definition again that you got from Kolmogorov, and it's still true today. It actually still works, right? So we've defined this object that we call a sigma algebra, which is this construction of taking an elementary space, taking the set of all subsets of that space, and then saying this collection is called a sigma algebra if we're able to take, are you looking for your HDMI connector? Someone just took that. All right, sigma algebras. Let's go back. Sigma algebras are a collection of subsets of the power set, which have the property that they include the entire set and the empty set, and that they are closed under countable intersections, countable unions, and building complements, okay? So this is a direct construction of an object that allows us the direct construction of an object called probability distributions, which we then defined in this way, right? So this is a function that operates on this and is sigma additive; basically it makes sure that we're not accidentally inventing truth or dropping it while we are combining sets together. Now the first thing we need to talk about is what we do if we take such objects, such sets, and then map them onto a different space. So we take E, and by extension F, the sigma algebra, and map onto some other space, because we're constantly going to do this. Here is a really simple situation where something like this might happen.
Imagine I have N coins, or actually just one coin that I toss N times, ta-ta-ta-ta-ta-ta, and then I don't actually care about the outcome of the individual throws, I care about the sum of them. Let's say heads is one and tails is zero, and I just want to know what the sum of those binary numbers is, right? So that's a very, very simple form of a derived quantity, and it lives in a different space, right? So the original X's live in the zero-one space, and this derived quantity, let's call it R, is a sum, so it lives on the integers, on the non-negative integers from zero to N, right? So we'd like to be able to say what the probability distribution of this derived quantity R is. So we need to talk about functions of random events, and if you've done your theory of computer science you may remember that functions are kind of important. And actually the construction, the definition of what we do when we have functions operating on random events, is very, very similar to what we do in computability theory. So we define something called a measurable function, which is the following definition. Imagine that we have a function, let's call it X, that maps between two spaces, and we assume that both of these spaces are measurable. So we have sigma algebras on both of them. Then we call this function itself measurable if it has the property that for every set in the sigma algebra of the output space of the function X, the pre-image of that set under the function X, so X inverse of that set G, is itself a measurable set in the original space. That's one of these abstract nonsense sentences that you just have to write down and then stare at and then convince yourself that that's a meaningful thing to say, right? And if that's the case, if X is such a measurable function, then we call its output a random variable, and that's a word that you've often heard everywhere, so that's the definition of a random variable. And you may notice that I will keep constantly talking about random variables. Random variables is one of these words that people just use when they talk about objects that have a probability distribution assigned to them, even if they directly constructed the original distribution. Why? Because everything is a function of something, even if it's just under the identity function. So we actually don't really have to talk about random events anymore, we can just always talk about random variables, and assume that there is some underlying original random event and we've just taken some function of it, which just makes it easier. So we basically don't need the object random event anymore, it's sort of hidden away. Also, if you think about how we construct random variables on a computer, like variables in a computer program, then we take some random bits and transform them, and that's a random variable. We'll talk about how we do that, by the way, when we actually get to do it. And then there's a corresponding thing, of course, that happens to the probability distribution. If you have a probability distribution on the original space, on F, and then you transform it under X, then the random variable X has itself a probability distribution. And that distribution is called the distribution measure, or sometimes it's just called the law of X. And if this were a statistics class or a stochastics class, if this were in the math department, we would talk about laws a lot, but we won't actually.
But that's just a fancy word for saying that if you take an original random object and then you apply a function to it, the output of that function, if it's a measurable function, has itself a distribution, and that distribution also has a measure, and that's called the distribution measure or the law. And it's defined how? Well, here's a very, very fancy abstract nonsense sentence, and then we'll see the example and it's gonna be very clear. The probability for the transformed variable is the sort of collected-up probability for all of the pre-images under the function, right? So if I want to know what's the probability of the random variable having the value, let's say, little x, so what is the probability for capital X to have the value little x, then I have to go to the original space, where I have the probability distribution, and check for all of the original events that would, under the action of capital X, give this value little x, gobble up all of their probability, sum it up, and that gives me a probability, right? It's just another way of making sure that we're not dropping any probabilities along the way. What would that look like for our example with the coins and the sums of their outcomes? So at the top you see the original result again, just to be a little bit clear here, right? The original space is this sort of N copies of the zero-one space, and this is a simple space, a finite space, so we can define the power set on it and then define a sigma algebra on it, and now we can construct this random variable, which is the output of this function. This function is called .sum, right? It's a very concrete function. Now if you want to know the probability distribution of this derived quantity, we need this so-called distribution measure or the law of R. It's just a probability distribution, right? It's just constructed in this way, and how do we get it? Well, we get it actually in a way that you've done in high school for this very simple example, right? So let's think again about the definition that's up there at the top. We need to think of the pre-image of this complicated function. Well, the function is called sum, right? So you basically just have to think about a particular value of capital R, let's call it little r, let's say two. There's a plot over here. We have 10, we've thrown the coin 10 times. Let's say the original probability for the coin to land heads is one third. Now to think about how likely it is to see two, that's this one here by the way, that's a bad axis, this is two, we need to think about all the possible combinations of heads and tails that would make up two, right? So that could be 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, or 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, or 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, and so on, right? And all of these make up the possible value two. This is a little bit too easy actually, probably, right? You all look like, eh, okay, this is complicated. So how many ways are there to choose little r from capital N results? Well, there's this function for it, right? It's this binomial coefficient that I don't need to construct for you, because you spent some time in your high school thinking about it, right? It's N factorial over N minus r factorial times r factorial, and it's something that you have code for, and there's a nice little symbol for it, N choose r. Now notice that this actually matters, because there's an r in there, right? We care about this distribution of R, and this function clearly depends on r. It's not a constant.
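(If you want to see this pushforward business in code, here is a minimal sketch; it is my own illustration, not the code from the slides. It computes the law of R, the sum of N coin tosses, once by brute force, pushing the measure on {0,1}^N through the sum, and once with the binomial formula, and checks that the two agree.)

```python
import itertools
from math import comb

N = 10          # number of coin tosses
p = 1.0 / 3.0   # probability of heads, the value used in the lecture plot

# Probability of one particular sequence omega in {0,1}^N under independent tosses.
def p_omega(omega):
    heads = sum(omega)
    return p**heads * (1 - p)**(N - heads)

# Law of R by brute force: push the measure on {0,1}^N forward through R = sum.
law_bruteforce = {r: 0.0 for r in range(N + 1)}
for omega in itertools.product((0, 1), repeat=N):
    law_bruteforce[sum(omega)] += p_omega(omega)

# Law of R via the binomial formula: P(R = r) = (N choose r) p^r (1 - p)^(N - r).
law_binomial = {r: comb(N, r) * p**r * (1 - p)**(N - r) for r in range(N + 1)}

for r in range(N + 1):
    assert abs(law_bruteforce[r] - law_binomial[r]) < 1e-12
print(law_binomial[2])  # probability of seeing exactly two heads, about 0.195
```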
So if we care about R, we need to make sure this is correct. And then we get this distribution here on the right. That's actually how we operate with functions. And maybe you've noticed, if you've been to my theory of computer science class, or some other theory of computer science class because you've got a computer science bachelor's degree, that there is some similarity to how we define computable functions on Turing machines and how we show that certain languages are Turing computable, right? We take some computable language and then we apply a computable function to it and out comes something, and then we have shown that we can compute this thing, right, because we can do the two things: we can compute the original thing, and then we apply something to it. And here we have a similar thing. We have not a computable function but a measurable function, and then the output has a probability distribution if we have an original probability distribution on the original space. Now, in computability theory, there's a lot of talk about certain things not being computable. Unfortunately, and not entirely by accident, some of these problems translate onto measures as well. And so here at this point, I have to very, very briefly open up a Pandora's box and warn you about something, but then it won't actually matter. So the quantities we usually actually care about are not binary or integer-valued random variables, right? Okay, sometimes we actually care about how often something happened, something discrete, once, twice, three times, but most of the time we want to talk about real-valued things, because we can represent them decently with floating point numbers, right? Not perfectly, but decently, and we want to think in our heads about real numbers, right? Because we all just like real numbers. Now, the space of real numbers, as you probably know, is a continuous space. Anyway, I should probably check. So how many of you can do something with the word uncountable? That's not everyone, actually. It's like two-thirds. Okay. Okay, but just a sanity check. Who knows what the real numbers are? Let's see if you're all awake. The real numbers, the ones with the bold R. It's not everyone, actually. Okay, but I think it's mostly everyone, and the people who don't know probably just forgot what I mean, or they didn't understand because I was talking too fast. So the real numbers are continuous. They're dense, right? Between any two rational numbers there are uncountably many real numbers. Very bad. And this can be a problem. And as it actually turns out, this is a problem for the construction of measures as well. So this is the bit where I have to highlight something but then not even closely spend enough time on it. So if the original space that you're operating on, our E, if that's countable, then we can actually just construct the power set, the set of all subsets, use that as the sigma algebra, and everything's fine. And then we don't have to worry about anything. But once we talk about continuous spaces like the real numbers and want to define probabilities on them, constructing the sigma algebra is a little bit tricky. And if I would just say, ah, there's a sigma algebra, there will be some mathematicians here in the room, I know where you are, who will probably go like, oh, but this isn't quite true. There are some problems.
So I need to point out that for continuous spaces, there is a problem: not all sets that you can construct on continuous spaces are measurable. And if you're the kind of person who likes thinking about this, I'll give you a three-minute outlook. How does this work? Well, again, if you've had a theory of computer science class, you've learned about uncountable spaces and things like the diagonal language that isn't computable and the halting problem, which isn't decidable, right? And how do things work there? Well, you construct, typically by Cantor's diagonal argument, some kind of combination of machines and problems such that you can show a contradiction, that something can't possibly exist. And it works similarly for measurable sets. So the thing we care about in measurability is this thing called sigma additivity, right? That's the thing that really matters in the definition, that we have this property: if you have sets that are pairwise disjoint, then the measure of their union is supposed to be the sum of the measures. And for the continuum, you can show that that doesn't always work. So here's how such a construction would work. I'm not actually going to tell you exactly; if you're one of the five people in the room who care about this, then you can read up on it after the lecture by reading this example. And if you're the kind of person, and I know that you're here, whose immediate reaction is, oh, is this going to be in the exam? No, okay. But if you care, then these constructions always work by saying something like: let's take a continuous space, then construct a countable collection of disjoint sets, made pairwise disjoint by specifically picking objects from the underlying continuum. So at a very high level, if I'm just allowed to wave my hands around without reading out the text here, right: imagine you have a circle. The circle is a continuum. It's a subset of the real numbers, right? You can write it with complex numbers, that makes it easy and convenient. Now for any point on that circle, any real number on that circle, we can define the action of a group that rotates that number by some rational angle. The rationals are countable. And that means we can construct, for any real number, all the rotations of that real number by rational numbers. And then do something, and that's where the illegal stuff happens, and say: maybe we can pick from each of these families of rotations a single representative point, such that we get a set whose rational rotations are pairwise disjoint and together make up the whole circle. This construction is a little bit like the diagonal argument: consider the particular machine which has a particular property which I can only talk about by assuming that I can pick it. And then you can construct a contradiction by showing that that set, the set of these picked representatives, can't have a measure. Why? Because if it has measure zero, then we can construct the entire circle as a union of all of those rotated sets, a countable union of sets of measure zero, so therefore the entire circle has to have measure zero. That's not good. Or if it has non-zero measure, then we construct this entire circle by summing up these, uncountably many, sorry, countably many but infinitely many, non-zero values, and they would have to sum to something infinite.
So either the circle has zero length or it has infinite length. And neither of those seems correct, right? We want the circle to have circumference two pi. So something must be wrong. So there must be basically some problem with this argument, and some of you may have heard that there's a beautiful theory behind this called the Banach-Tarski Paradox, or Paradoxon. It's connected to these two wonderful mathematicians, and there's a beautiful story behind them that I unfortunately don't have time to talk about. But just very briefly, this is connected to basically the entire story of set theory and computability and undecidability. There's this beautiful story about Banach living in Lviv, which at some point was in Austria-Hungary, then it was in Poland, now it's in Ukraine, and they had this cafe there where they met and did this beautiful math in a book that they kept there. It's a wonderful, wonderful story. And it ends up with these two people and several others becoming very important for the formalization of set theory, and therefore the formalization of the axioms of logic and decidability. And they did that by showing exactly this kind of statement: that there is no way to define a volume in three dimensions unless you give up one of these five things. Either the volume of a set can change under rotation, that's not good, you don't want that. Or the volume of a union of disjoint sets can be different from the sum of their volumes; we absolutely don't want that, because we want to reason about truth and not forget about it. Or some sets have to be called non-measurable, annoying, but maybe that's what we're going to pick. Or the axiom of choice is not admissible, or actually the axioms of set theory don't work, which is also not so nice. Or, as I said, the volume of the continuum is either zero or infinite, that's also not good. So our way out is to say: there are going to be some sets that can't be measured. So how do we fix that? Well, there's actually a little way to fix it, which basically, well, okay, I'm going to show you the construction and then I'm going to tell you what it means. So in comes another person from history, Émile Borel, who points out that usually we want to construct probability densities, or sorry, probability distributions, on spaces that are essentially Euclidean vector spaces, R to the d, that's what we care about. And that's actually what we're going to do in this course mostly. So R, and by extension R to the d, are topological spaces. On them you can define a norm, a function like the one you see down here, right, we know how to write that down, it's actually a function we can compute, for example the square root of the sum of the squares, and with it we can define what are called open sets. And if you've ever heard a math lecture in this sort of direction, you might have heard about topologies and this kind of construction of open sets. So a topology, by definition, for the mathematicians, is also a collection of subsets of a set, with the property that the entire set is in the topology, the empty set is in the topology, any union of elements of the topology is in the topology, and any intersection of finitely many elements of the topology is in the topology. The elements of such a collection are then called open sets.
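(For reference, here are the two definitions he is about to put next to each other, roughly as they appear on the slide; this is my paraphrase in symbols.)

```latex
% Sigma-algebra F on E: contains E, closed under complements and countable unions
E \in \mathcal{F}, \qquad A \in \mathcal{F} \Rightarrow A^{\mathsf{c}} \in \mathcal{F}, \qquad
A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i \in \mathbb{N}} A_i \in \mathcal{F}

% Topology T on E: contains E and the empty set, closed under arbitrary unions
% and finite intersections
E \in \mathcal{T}, \qquad \emptyset \in \mathcal{T}, \qquad
\bigcup_{i \in I} U_i \in \mathcal{T} \ \text{for any family } (U_i)_{i \in I} \subseteq \mathcal{T}, \qquad
U_1 \cap \cdots \cap U_n \in \mathcal{T}
```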
And now we might notice that this construction looks a little bit like our sigma algebra. It's not quite the same. In some sense it's broader, because it allows any union of elements, while we only allow countable ones. Actually, I'll put them next to each other. Here's the definition of the sigma algebra, here's the definition of a topology. So sigma algebras are collections of sets such that the entire space is in it, and they're closed under any countable intersection, any countable union, and complements. And that by construction also means that the empty set is in there. So topologies are kind of close, all right? The empty set is in there, the full space is in there. Any union, which is actually even stronger than countable unions, but only finitely many intersections, not countably many, and nothing about complements. That's actually important, because for topologies you really don't want the complements in there; the complements of open sets are called closed sets. So it turns out that for a set that has this structure, and R to the d has that structure, we can actually use the fact that this exists and construct a sigma algebra from it. And it turns out that that actually works. It's not entirely trivial that this works. And we do it by waving our hands about and referring to Émile Borel, who says there is a sigma algebra generated by a topology, which we get by taking the topology and completing it. Completing means that we put in countably infinite intersections, so we allow going beyond finitely many intersections, and that's actually the trick of the construction. That's why it's hard. You need to do some kind of transfinite induction, and that gets very complicated very quickly. And we also include complements. Including the complements is actually not so bad. I mean, after that it's not a topology anymore, but that doesn't actually matter; we're not trying to construct a topology anymore. And that's just the sigma algebra we're going to use. So, what does all of this actually mean for you, my waving my hands about and saying that there are lots of complicated things? So first of all, there's an important concept called a random variable. A random variable is the output of a measurable function acting on random events. Random variables were originally defined on countable spaces, but the definition only uses the notion of a sigma algebra. So if we can extend sigma algebras to continuous domains, we can also define random variables on continuous domains, with, for example, continuous functions. On continuous spaces, there is something dangerous about measurability, and so we have to accept that we are operating with definitions of measures under which certain sets aren't measurable. So what does this mean for you, concretely? Well, it's a little bit like the theory of computer science. Those of you who have taken a theory class, you know that there's the halting problem and that there are some things that can't be computed. So if someone came along and said, I have this wonderful idea for a computer program that can take any Python code and check whether it's correct, you would go: maybe you want to Google the halting problem, right? Or maybe give me the code and then we'll see, we'll talk, right? And they'll be like, the code isn't quite ready, but I know exactly what it would look like. No, no, no, no, write down the code, because it's not going to work, right?
There are gonna be some problems that your code is not going to work on, because there's this thing called the halting problem. I know this, right? But usually when you write your code, you're not constantly thinking about the halting problem, right? You're not like, oh my God, let's make sure I'm not writing a function where I will never know whether it terminates or not, right? You just write your code. And here it's very similar. If someone came along and said, I have this wonderful idea of defining this universal intelligence system, and it starts by taking a uniform measure over all of the prime numbers: what? Show me the code, right? Write it down and then we can talk. Write a demo and then we can talk. So if you really like these kinds of theoretical questions of computability and undecidability and non-measurability, then okay, maybe take a course on mathematical stochastics, or ask Bob Williamson whether you can do your master's or PhD thesis with him here in Tübingen, because he's kind of the expert for these kinds of questions around here. From now on in this class, we're going to just assume that we've sorted out these problems. And we won't concern ourselves anymore with these problems. There will be a slight nod to them when we talk about doing inference on functions, because functions are nasty; we have to be careful what kind of spaces we define them on. We'll get there. The other reason I wanted to show you these constructions is because we'll need them in a moment to do something that actually matters in practice, but there's a question. No, no, no. If you have a topological space, you can construct a sigma algebra from the topology. Yeah, sure. I don't know of a good example of a continuous space that has no topology in which someone can construct a sigma algebra, but for example, if you have a countable space, you don't have to worry about a topology at all. So, now to something vaguely concrete: why do we need this? Well, the thing we would like to talk about is continuous random variables. So let's do that now. And it turns out that there's a very straightforward construction for probabilities of continuous random variables if you have a Borel sigma algebra, so a sigma algebra that is constructed from the open sets by including the closed sets, the complements of the open sets, and countably infinite intersections. And it works like this. It revolves around a fundamental object called the cumulative distribution function and a practical object called the probability density function. You've probably heard both of these terms before. Nod your head if you've heard these terms before. Good. You've heard them in the context of some statistics and things like that. So here's the formal definition of them. The cumulative distribution function is a thing, a function, that measures the volume of the space sort of to the left of a particular value. So in one D, it's this thing: F of x is the probability for the value of the random variable lying at x or to the left of it. So notice how this here is both an open and a closed set. On the right, it's closed. On the left, it's open. So we need this kind of construction of open and closed sets. But we can do that, because it's a Borel sigma algebra. And of course, if you switch off your math brain and just think of, okay, numbers, it's totally fine, right? It's just how much truth is in the space to the left.
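(A tiny numerical illustration of the cumulative distribution function, my own sketch using a standard normal purely as an example distribution: the CDF at x is the probability of the half-open interval up to x, probabilities of intervals are differences of CDF values, and, previewing the next step, the density is its derivative.)

```python
from scipy.stats import norm  # standard normal, just as a concrete example distribution

x = 0.5
print(norm.cdf(x))                # F(0.5) = P(X <= 0.5), about 0.69

a, b = -1.0, 1.0
print(norm.cdf(b) - norm.cdf(a))  # P(-1 < X <= 1) = F(1) - F(-1), about 0.68

# Previewing the next step: the density is the derivative of the CDF.
h = 1e-6
print((norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h), norm.pdf(x))  # both about 0.352
```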
And we can generalize this to multidimensional problems by just doing this in every dimension separately, right, taking the product of those one-sided intervals. These objects, though, unfortunately, are a bit cumbersome. They are fundamental, and you can construct them in this very, very straightforward way, but implementing them in practice is often very tedious, because, among other things, the rules of probability theory, sum rule, product rule, and so on, are awkward to realize on these objects. Why? Because they are counted-up volumes. Instead, for many, many such cumulative distribution functions, pretty much all the ones we care about, we hope that we're able to take the derivative of these objects. And if that derivative actually exists, then we can define something called a density function. So here's the definition of it. We take a probability measure on such a Borel space and say that it has a density if there is such a function, called a density, which has the property that we can write the measure in this way: as an integral over some function, right? Not every probability measure on a space can be written as an integral over a function. But if it can, if there is such an actual function that we can talk about, then we actually focus on this function inside. We call that the density, and we think of it as like a density: how densely does truth lie in this space? And if that's the case, then we can actually also relate it to the cumulative distribution function by taking derivatives. That's exactly what densities are: they are the derivatives of the cumulative distribution function. And in particular, we can compute probabilities by integrating up the density from the left end to the right end. The nice thing about those is, and I'm going fast because you've all nodded and said you've seen those before, that for these objects, the rules of probability theory sort of translate. They fulfill a continuous version of the axioms. So the integral of the probability density function over the entire domain is one. That's like the probability assigned to E is one, axiom four. There is a corresponding concept to the sum rule, which is that if you have a joint probability density over a bivariate collection of objects, over two random variables, then the marginal density over one of them is the integral of the joint density over the other one. And the conditional density is given by, well, the product rule. And we just have to be careful that this object down here has to be larger than zero. And we also have to be a little bit careful that this object down here isn't something like an isolated event of probability zero. So there are some hard problems for annoying spaces that we are not going to talk about. But again, there is one little body buried over here. So if you really want to know about this, then look up conditional densities on Wikipedia and you'll drop down a rabbit hole. This is one of these old stories that probability theorists used to fight about with frequentist statisticians in like the 1980s, but it's solved by now. It's just a problem of correctly writing things down. And that means we can do Bayes' theorem on continuous spaces. Here is what that looks like. Here is a two-dimensional probability density function in red. So the red thing is the bivariate probability density over X and Y. What is this thing here at the back? I can tell you how I've computed it. I've summed up all the values of this function along X.
Does everyone else agree with this? A little bit of nodding. So the projection of this distribution onto one of the coordinate axes is the marginal distribution of that coordinate, of Y. And this cut, this black line through the distribution, that's the conditional distribution. It's like: given that Y has this particular value, X is distributed like this. If we know that we are at this point, X is like this. This is one of these things where everyone nods and says, I've seen this before, and then you get it wrong in the data literacy exam. "I've heard about this", apparently. So my colleague Jakob Macke told me: show this picture again, because they always say they know it. Okay, so here is the one thing that is annoying about probability density functions, and it will actually matter in your homework and for the rest of this class, which is that they are densities. So if we transform them, we have to be really careful, because they are derivatives of functions, right? And if you transform the variable in an integral, then you have to be careful what you do to the integrand. When you do a change of variables in an integral, you remember that there is some Jacobian showing up somewhere, right? Some derivatives. And that's gonna be a problem for densities, but there's a question first. (The question is whether the conditional, the black line, still needs to be normalized.) Yeah, so we have to normalize again. Yeah, we have to divide by the thing on the right, basically by that number. You have to take this black line and divide it by that number. Good point, thanks. Okay, so if you change the variable that you integrate over, you have to make sure that you change the dx at the end. From dx to dy, you have to multiply by something like dx over dy, right? The derivative of the inverse of the transformation. And this is actually true for densities as well. So there's a theorem that I always have to go through, and then we kind of forget about it again, and then it shows up over and over again. So you actually have to understand it. And that's why you have an exercise this week to work on this. If X is a random variable that has some density, so this is a density function now, and I'm writing capital X in the index because it's the density of the random variable capital X evaluated at little x, so little x is a real number, then over some domain, c one to c two, that's this case here, we assume that there is a function we call u that maps from x to y. u is this line here. And let's assume that u is monotonically increasing, because then things are easy. Because if it's monotonically increasing, then it has an inverse on this domain from c one to c two, which we can think of as another function v, which is u inverse, which maps from y to x. Then the probability density function, the PDF of the random variable capital Y evaluated at the value little y, that's a real number, is given by the probability density of X evaluated at the inverse of y under u, so that's v of y, multiplied by the absolute value of the derivative of v of y with respect to y. Or, which is the same, multiplied by the inverse of the derivative of u with respect to x. And there's an absolute value around it, but in this case this doesn't actually matter, because it's a monotonic function, so the derivative is positive everywhere, and therefore we don't need the absolute values, right? But I leave them in there, because it's going to help us once we go to multivariate distributions. So the proof of this works as follows, and this is actually a good way to check whether you've paid attention in the first half hour of the lecture.
We care about the fundamental object, which is the cumulative distribution function of the random variable Y evaluated at little y. By definition, that's the probability for all the domain to the left of little y, right? So that's the probability, and here we look at this picture. If this is our little y, then the probability of Y being below d two is the probability of X being to the left of c two. And for that you just have to convince yourself by looking at this picture. That's just true for monotonic functions. That's why we needed it to be monotonic, otherwise things would get really hairy, right? Because otherwise we could jump around and have mass over there, which is difficult to track. So here we use first the definition of the cumulative distribution function, the definition of sort of the Borel sigma algebra, and then the definition of a random variable. Now for u of x, we can use the fact that we have an inverse transform, so we can explicitly talk about the pre-image under u of the quantity we care about, and then we write down the definition of this cumulative distribution function in x. That's this thing. And now we just apply, so now we are done with probability theory for this proof, and the rest is calculus. So we know that this thing is supposed to have a density; that density is the derivative of this cumulative distribution function. And now we look at the right-hand side of this expression and ask, what's the derivative of this thing with respect to y? Well, it's the chain rule, right? It's the derivative of the integral, that's p X evaluated at the inverse of y, so at v of y, times dx over dy, and dx over dy is just this, the derivative of v. So it's a fancy way of writing dx over dy. And the same thing actually works for monotonically decreasing functions. The only change in the proof, here I have the picture of a monotonically decreasing function, is that the orders of d two and d one kind of change, right? So the probability of Y being less than d two is the probability of X being larger than c two. And that's why there's a minus showing up, and now we have to be careful with minuses and absolute values. To bring this point home, actually, yeah, okay, I'll make one more point. There is a generalization of this to multivariate distributions, and that's really not entirely straightforward, but it turns out that if you have a multivariate distribution over many variables, and you have some continuously differentiable and injective function with non-vanishing Jacobian, what does that mean? It means that it's invertible on this domain, right? It's the multivariate version of what I just wrote in one dimension. Then something very similar holds, which is: to evaluate the density of your multivariate quantity, you have to compute the pre-image of your transformation, g inverse, evaluate the probability density of x at that point, and then multiply by the absolute value of the determinant of the Jacobian of the inverse of that function. So this sounds very complicated, but remember, we've got computers. And in 2023, computers can compute Jacobians and they can compute determinants. So what I've done is I've written a little demo for you to think about this, because at this point I want to switch from the nasty math to the more fun part of the course, but first there are a few questions. (A question about whether there's an intuition for why the Jacobian shows up.) Not really a proof, but to show why this is correct with the Jacobian, intuitively, okay, that's what I'm gonna do now.
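(Before the hand-waving picture, a quick numerical sanity check of that multivariate statement; this is my own sketch, not the app from the lecture. It pushes a 2D standard Gaussian through an invertible affine map y = A x + b, applies the formula p_Y(y) = p_X(g^{-1}(y)) |det J_{g^{-1}}(y)|, and compares against the known answer, a Gaussian with mean b and covariance A A^T.)

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Invertible transformation y = g(x) = A x + b; its inverse is x = A^{-1}(y - b),
# and the Jacobian of the inverse is the constant matrix A^{-1}.
A = np.array([[2.0, 0.5], [0.3, 1.0]])
b = np.array([1.0, -2.0])
A_inv = np.linalg.inv(A)

p_x = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))   # density of X
p_y_true = multivariate_normal(mean=b, cov=A @ A.T)          # known density of Y = A X + b

def p_y_transformed(y):
    """Change-of-variables formula: p_Y(y) = p_X(g^{-1}(y)) * |det J_{g^{-1}}(y)|."""
    x = A_inv @ (y - b)
    return p_x.pdf(x) * abs(np.linalg.det(A_inv))

for _ in range(5):
    y = rng.normal(size=2)
    assert np.isclose(p_y_transformed(y), p_y_true.pdf(y))
print("change-of-variables formula matches the known Gaussian density")
```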
For the Jacobian story, it's something about waving your hands about and saying that determinants measure the volume change of a transformation. And for this transformation, basically what we have to keep track of is, well, I'll show you the picture actually. And here is something that we're going to do across the course. Every now and then I'm gonna have an app like this. And if you have downloaded the slides, then you can just click on this Streamlit Cloud thing up there and it's gonna open up your browser and you'll hopefully see an app that works. If it doesn't, then it's probably because everyone else has just clicked on it and now it's sort of overrun. You can also just clone the corresponding GitHub repo and run the Streamlit app yourself in your own browser. That's what I'm doing here. There it is. And I'm gonna show you some of these apps across the course. If you find something about them that you'd like to change, then just do a GitHub pull request, and then Alfredo might take care of it for you if I don't have the time for it. And if you actually have some functional improvements to some of these apps that genuinely make them better, I might be willing to every now and then hand out some bonus points in the exam, but only for actual improvements, not for changing the color scheme and making it a bit nicer and adding some labels, right? Okay, so here is how this works. I'm gonna show you the pictorial view. What I'm doing now is basically the exact theorem again, but now with pictures. So what you see here at the bottom is, first of all, let me switch this off, in sort of yellow, golden, greenish, mustard colored, a Gaussian distribution. I haven't introduced the Gaussian distribution yet, but okay, come on, everyone knows what a Gaussian distribution is. So I want to transform this thing through some function. Here is a red line, that's a monotonic function. Right? How have I made this monotonic function? Well, I wanted to have a function I can control a bit and mess around with, but I wanted to make sure it's monotonic. So what I did is I've added up three sigmoid features. Those are these gray things here in the background that aren't dashed. You know what sigmoids look like, right? Like this. If you take a sum of a bunch of sigmoids, you get a monotonically increasing function, nice. And I can move these things around by taking, for example, the location of the first one and moving it to the left or to the right. And that will change the location of this first feature. All right? I'll put it to minus one again, or something, minus three. I can also change the gain. I can make it very flat, or I can make it very steep. A bit more steep, like this. And then I can do this with the last one here as well. So now I have a non-trivial monotonic function, okay? So what I've now done to get this dash-dotted line is the naive thing, which you can actually see down here. So here I'm defining the function. This is my sum of three sigmoids, in a differentiable fashion. So I'm computing both the function and its derivatives at once. And by the way, if you want to have something to think about yourself, you can wonder why I do it this way and not define the sigmoid and then compute value and gradient outside of it. If you want to know why, try it yourself and you'll get an interesting learning experience about autodiff. So a little hint: this is a multivariate-output function. It produces three outputs, right?
Three different features. And so now what I've done is I've literally just plotted this sort of golden line here, and then I'm plotting the thing you would get if you naively did this transformation without the Jacobian. So for each of these numbers here, I'm just checking, for this particular y, what the corresponding x is, and just plotting that number. And that already gives me something that doesn't look like a Gaussian, because of this non-linear transformation. By the way, those dashed lines are all derivatives. So the red dashed line is the derivative of the solid red line, and the gray dashed lines are the derivatives of the features, which I'm going to switch off now. So, and this is not actually correct. The way I checked that it's not correct is I'm computing integrals in a very, very, very stupid, naive way. So for this line, and you can look at the code if you want to on GitHub, I'm just taking a dense grid from minus three to plus three with something like, I don't know, 300 points or so, and I'm just summing up the corresponding little golden function values and then multiplying by the width of each bin, which is six divided by 300. Right, six for minus three to plus three. So that's the numerical algorithm called the trapezoid rule, which is a stupid way to compute an integral, but it's also a really straightforward one that just uses a sum. And then I do the same thing up here for this curve, and what you see is that I get out 4.7. Why is this bad? Yes, this is supposed to be a density, it should integrate to one. It doesn't. Why? Because we've lost track of truth. Actually, we've invented some extra truth in this case. So there are some points here, in particular where the derivative is very steep, where a little bit of truth is getting spread out across a large domain. So the amount of truth in this region gets spread over a larger region, and we pretend that it doesn't all come from the same smaller region. We forget about the fact that the region gets kind of stretched out in this part of the domain. So to fix this, we have to multiply by the Jacobian, or in this case, because it's one-dimensional, just the derivative of the inverse transformation. The absolute value of it, but the absolute value doesn't matter here, because the red line is monotonic, so its derivative is always positive, and we can just multiply by it. For that I have a thing down here. Boop. And we get this green line, which has roughly the right integral. It's not exactly correct, but not because the transformation is wrong; it's because everything is numerical, right? The gradient is numerically evaluated, I'm plotting something on a finite grid, and then I'm using the trapezoid rule, which is a very bad integration rule. So it's a bit wrong. And actually, if you play with this code, and I invite you to, you'll find settings of these features that make this integral approximation quite bad as well. You can make it like 1.8 or whatever, just because of the way that the integral is computed, but it's still better than the naive one. And how do I do this? Well, here's the thing that actually matters. Instead of the naive p, which is just basically evaluating the function, I'm computing the gradient of the red line, which is this df thing here. There we go. And dividing by it, right? Multiplying by the inverse of the gradient.
With an absolute value around it, just in case it wouldn't be positive, but you can leave out the abs and it's still going to work, because it's a monotonic function. Does this make sense? Good. So that was the first part of the lecture, where I had to tell you that there's this thing called PDFs. And to get there, we first had to talk about random variables that come from transformations. We had to talk about measures on continuous spaces. And that was enough to construct density functions. And then we just had to observe that changing the random variable, transforming a variable from one space to the next, changes the probability density function in a non-trivial but tractable fashion: you have to compute the Jacobian of the inverse transformation and multiply by the absolute value of its determinant. Or, if it's one-dimensional, just compute the derivative and multiply by the absolute value of the derivative, for monotonic transformations. And Jacobians are fine, because we can compute them on a computer. Okay. Now, do we want to take a quick four-minute break and then do something hands-on? Or do we want to go straight on? Okay, four-minute break. I'll continue at quarter past. So, I realize that this part of the class was very theoretical so far. And I'm really curious about your feedback on this at the end of the lecture. Because in my experience, there's always a small set of people in the class, usually the ones thinking very deep thoughts, who like this kind of stuff and who need it to be able to continue. And then there's also a significant part of you, maybe the ones that tend to not actually give feedback, who see this lecture number three and then just leave and never come back: something like, this was so tedious, I'm not gonna get this class. So I hope I can keep you here by doing the second half of this lecture today, which is like the antidote to the first half. Because this is a computer science class and we want to write machine learning algorithms at some point. And yes, we will build deep neural networks at some point and assign uncertainty to them. We just have to get there slowly, one step after the other. So now, as one tiny little step towards this goal, let's think about a concrete problem that is very, very, very simple, but an actual inference problem, and see if we can use everything we've seen so far to make an actual inference about something non-trivial. And the question that I want to take, something that also comes from the very first time I taught this class with Stefan Harmeling, is: let's figure out the probability of people in this room wearing glasses. Why glasses? Because it's a relatively simple thing that I can check by looking at you. It's hopefully not something you're embarrassed about, because you're wearing it openly, right? If it were such a problem, you'd probably be wearing contacts. And because it's an interesting quantity to think about. So here's the sentence: what is the probability for a person to be wearing glasses? I'm gonna use the symbol pi for this, which is dangerous, but p is already used for "p of", and I wanted something for a probability, and the Greek letter for p is just pi. And that's actually the main challenge in all of this. This thing we care about is a probability. And we're going to need to assign probabilities to it, to be uncertain about it, right?
So at first, if I don't see you in front of me, I won't know what the probability is of you wearing glasses. But without looking around, you probably have some feeling for what it is, right? That's what a prior is. You can sense it deep inside of you what this prior is. Think about, first of all, what this is. It's a number between zero and one, right? That's the probability. If it were zero, that would mean no one in this room, or out there, let's even talk about the population out there, let's just assume you're an IID sample of the population, which is not true, but whatever. So if it were zero, then no one anywhere in the world would be wearing glasses. If it were one, everyone all the time would be wearing glasses. And it's probably somewhere in between, right? Okay, so let's do machine learning. Let's do Bayesian inference on this. What's the answer to such inference questions? I'm gonna ask this question many, many times and there's always one answer that you can always shout out: Bayes' theorem, yeah. So we need to construct a prior, and, here we go, we need to construct a prior, then we need to construct a likelihood, and then we need to normalize, and then we're done, we have a posterior, right? So what is a good prior? What should it look like? Draw it with your hands in the air, that's something everyone can do. Here's the zero-one simplex. What does the prior look like? Gaussian? Mm-hmm. No one is drawing anything with their hands. What does it look like? Does it look like this? Does it look like this? Does it look like, ooh, ooh, ooh, ooh. One over pi is important somehow, whatever, I don't know. Ooh, so yours is one step ahead of us already. So first of all, I want you to go through this mental exercise, right? In your head, you have something that looks like, ooh, I don't know, maybe it's 20 percent, maybe it's 80 percent. Probably not, right? It's sort of, I don't know, 30, 40, whatever, right? So there's some distribution. And that's the philosophical part of Bayesian inference. You'd like to write down the prior. And then we need to talk about the likelihood. And immediately, you're coming up with a good point: we're gonna need to do this on a computer, so we have to think about how expensive this computation is going to be. So maybe we want to choose a distribution which keeps this integral down there tractable. Let's see. So I actually have an app for this as well. There's an app for that. This is my prior in red. This distribution is one. It's the uniform distribution. Actually, I have a slide for this as well that I should have shown you: uniform, right? It's just one everywhere on the simplex and then zero outside. Do you agree with this prior? Some nodding, right? Could be anything. Interesting, because there's actually very little nodding. Most of you don't. You just, you don't care, you know? So of course it's maybe not perfect, right? Do you actually believe it's likely to be zero or one? I mean, you already know that there will be people here wearing glasses, right? But it also doesn't feel so bad, right? It's just whatever, uniform, right? So there's something historical about this, right? This is what people used to argue about: oh, you have to have a prior, and priors are so dangerous. Right? And, ah, whatever. It's gonna wash out anyway. Interesting. And now I'm gonna start collecting data. I'll look at the first person in the room. He's wearing glasses. Ha!
What happens now? How do we do inference? Bayes' theorem, yeah? This could be like a whole chorus: Bayes' theorem. Okay, we need to multiply by the likelihood. What's the likelihood? Pi. Pi? Aha. Why pi? This is the trick. If you get this one, you get the rest. What is the likelihood for this observation, one person wearing glasses? So let's think about this for a moment. The probability to observe one person wearing glasses, if the probability to observe a person wearing glasses is pi, is pi, right? So what does this look like? It's a function that goes from zero to one. Because the likelihood is a function of the right-hand side of this conditional distribution. So we're talking of p of X given pi, and that's a function of pi. Now, as a function of pi, it's this gray thing below. It's this thing, this gray line here. Right? If you think that's now obvious, please nod. That's not obvious to everyone. Then let's go a bit slowly here, because that's really important. So what we need in our Bayesian inference is this thing here, right? That's the probability to observe an individual wearing glasses if the probability to observe someone wearing glasses is pi. And that probability is just pi, by definition. Can you nod now if this is clear? Much better, okay? So that's a linear function that goes from zero to one. And then we need to do Bayesian inference. So what's the probability of observing one person, period? Well, we need to integrate out, right? This thing down here is the integral over this expression, d pi. So the prior was just the constant one. So it's just flat, fine, it's just one. So it doesn't change the integral at all. So we just have to integrate this function, that is, pi d pi. So what's the integral over pi d pi from zero to one? You can read it off in this picture. It's the integral over this gray thing at the bottom. What is that? Shout it out. It's not one, what is it? It's one half, because this triangle is half of this rectangular block, which has volume one. So we have to divide by one half, we have to multiply by two. That's why this red line, which is the posterior, is steeper. And it integrates to one. That's our posterior. Okay, next person in the room, he's not wearing glasses. Ah, what's the likelihood to observe someone not wearing glasses if the probability of seeing someone wearing glasses is pi? One minus pi, very good, you got it. Now we're in business. Ooh, so I've seen two people, one is wearing glasses, one is not wearing glasses. And what I've done is I've multiplied in one factor of pi and one factor of one minus pi. And now I have to integrate this thing at the bottom, and now it's getting hairy, right? This is a quadratic function, okay? I can still do that, but we can guess that there's gonna be a problem once we have more and more and more people. But let's forget about this for a moment and let's just keep going. Next person, wearing glasses. So I multiply in one more factor of pi. This is what the posterior looks like. By the way, the dashed line is at the mode of this distribution, that's where it has the highest value, and the green line is at the mean of this distribution, that's its expected value. And of course you all know what that is, because you've had stats classes, and the error bar is the standard deviation around the mean. Well, should I continue, like, the next row? One person wearing glasses. Next person not wearing glasses. Next person wearing glasses. Now we've got four, five, six. Next row, let's go from the left. No glasses. Next person, glasses.
No, it's difficult, yes. No, yes, yes. Oh, that's a lot, okay? Next row up, not wearing glasses. One yes, two, three yes, one no. So one no, one, two, three. And then one yes, two no. Okay, before I get to your question: do you get what's going on? So what I've done is I've just multiplied in a bunch of functions that are all of the form pi or one minus pi. But if you multiply many of these pi's and one-minus-pi's, you get something that looks like this, where n is the number of positive cases, how many people with glasses I've seen, and m is the number of negative cases, how many people without glasses I've seen. And interestingly, this is actually a really complicated function. It's not just straight lines, right? A product of a bunch of straight lines makes up something non-trivial. There was a question. Ah, very good point. So here is an assumption I've made. I've assumed that I'm drawing IID from the population. So first of all, I've assumed that the people in the room are an IID draw from the population, which is probably not true, because you're university students, so you probably have worse eyesight, because you read more, hopefully. Although these days everyone's on their phone all the time, so maybe it doesn't matter so much anymore. And then, as you say, the people in the front rows maybe tend to wear more glasses, because if you have bad eyesight you want to sit in the front, maybe. Not sure. So these are the kinds of problems we will have. Is this something to do with Bayesian inference? Nah, it's just a problem with how you select data, and you have to care about it. In this case I don't, because I want to show a simple example, yeah? Yes. So okay, let's keep in mind that we had 14 and nine here. I'm gonna remove those again and take those out. Now we are back at the prior, and what I have up here is actually the thing we need to talk about, the prior. Do I have a good slide for this? So one thing to notice is that we can think of this prior, the constant one, as itself being of this form. If I take pi to the n times one minus pi to the m and set both m and n to zero, then that's like raising pi to the zero, which is one, times one minus pi to the zero, which is one, and that's our one function. So I could write something down like pi to the, let's say, alpha minus one, so that's how far we are from zero, and multiply this in, and that's actually what I do in this app, and that's the alpha here. I could take this down a little bit, and here as well, and then I get a steeper distribution. I could also raise it up, but I can't raise it up in this code beyond one. So what do I do? This is important. If I wanted to have a prior where the alpha goes beyond one, what do I do? So, yeah, I add an observation. If I wanted the first one to be larger than one, I just add one here. That's the same thing. And now I can do 1.6 and the prior looks like this. So the prior is a bit like observations. It's like pseudo-observations. It's like pretending to have seen observations before, and the person who came up with this particular thought actually realized this as well. Here he is: Pierre-Simon, Marquis de Laplace. That was the very first application of Bayesian inference, and he writes, oh, it's in French. Can someone translate? Huh? I'm gonna stop you right there. Thank you very much for your help. The good thing is, these days we don't need people like him anymore. We've got deep networks, so we can do it in English. Sorry, the joke was at your expense.
So yeah, thankfully we can all talk each other's languages now thanks to machines, and the machine says, at least, that the probability of most simple events is unknown. Even who's wearing glasses is unknown. Considering it a priori, it seems susceptible to all values between zero and unity. That's what we just did: we realized this probability is between zero and one. But if one has observed a result composed of several of these events, the way they enter makes some of these values more probable than the others. That's just fancy talk about inference. Thus, as the observed result is composed by the development of simple events, the individual observations, plus, minus, plus, minus, plus, minus, their real possibility becomes more and more known, and it becomes more and more probable that it falls within limits that constantly tighten and would end up coinciding if the number of simple events became infinite. So that's just his observation that we can do this with our, what did we have, fourteen and, I don't know... nine? Okay, so we had fourteen and nine. Okay, so this is ours, and actually, as you can see, what I change in the prior doesn't really matter. It just shifts it by a tiny amount. I can put it like this, it doesn't really change anything. Okay, so that's Bayesian inference, and I can treat the prior like some sort of pseudo-observation of previous values. We need to think about this probably a little bit more, but we'll talk about it in the lecture on Thursday. So we'll get there and think about this a bit more. What I want to do now is to just briefly talk about this annoying normalization constant down there, which isn't here yet, because Laplace actually had a problem with this as well. For this I'm quickly going to rush through some of the slides that you can look at afterwards, because they really just construct what we just did. So here we have a setting that basically uses everything we've done so far. There are various random variables, the individual observations: person wearing glasses, not wearing glasses, wearing glasses. They are binary values, either zero or one, and we have constructed something from them: a continuous probability distribution over an unknown quantity, namely the probability to observe someone wearing glasses. And so there's a picture for this. We make some assumptions. We assume that they are independent and identically distributed draws from the actual distribution. This is not correct, but we're just going to do it anyway. And then we make some algebraic observations, which I'm just going to quickly go through. We have a neat way of writing down the prior as a uniform distribution. It turns out we can actually think of it as a little bit like a likelihood term. And then we notice that the likelihood term can be constructed from exactly the kind of quantities that we had in the very first slide of the lecture. It's one of these sums over random variables, right? It's a binomial distribution, that's what this is called. And then I'm going to save you all of this stuff and go to slide number 30. You can look at this afterwards if you want to, but it's really just a repetition of what we just did by hand, and we find that we can write our posterior, up to normalization, as a product of a bunch of terms, which are all of the form pi raised to some power times one minus pi raised to some power.
And the typical notation is with a minus one there, for reasons that will become clear in a moment. It's just a redefinition of what a and b are. Okay? Have I lost you now with this quick tour? Are we still there? Okay, good, people are tracking. And this is actually what Laplace did as well. When he was trying to answer this question, I think he was motivated by some ancient philosophical question that Aristotle already posed, or maybe even Socrates, which was: what is the probability that the sun will rise tomorrow? There's this thing that philosophers used to say, you know, every day the sun rises. We all know that it rises, but how do we actually know? Like, maybe tomorrow it won't rise. Shouldn't this worry us? Well, no, there should be some process that makes us confident, because it has risen every single day of our lives, and therefore we are quite confident. And Laplace was trying to make this mathematically precise. So he made this argument and said, you know, if a and b are very large, then the probability will be high, except for, oh, there's this normalization constant. What do I do with that normalization constant? So Laplace had a problem, because he had this integral in there. That's this normalization constant, right? I hope you agree that that's what it is, because it's this thing: if Z is this thing, then this thing integrates to one by definition. And Laplace didn't know how to solve this integral, because it was seventeen-something, or the early eighteen-hundreds, 1812 or so, when he wrote this book, and he didn't know. But he realized that someone else had already talked about this integral: a Swiss mathematician called Leonhard Euler, who was actually a young man when he started thinking about this, because there was a treatise by Goldbach, I think he was Russian, maybe Prussian, I don't know, a mathematician who had asked the question of how to interpolate the factorial function. So here's the factorial function, the black dots. What is the function that interpolates between them? And for, well, not for generations, but for many years people had tried and couldn't find anything, and then Euler, being the mathematical genius that he was, came in and came up with this function, which is actually this function here. He didn't actually call it gamma, he just wrote down this integral in his text, and then I think d'Alembert found it and called it the third integral in the treatise by Euler. So therefore it's called gamma, because it's the third integral: alpha, beta, gamma. And before that, he constructs this object by using this other one, ooh, there it is, the second Eulerian integral, the second thing that shows up in the text, which is therefore called beta, because it's the second integral that shows up in this text. And he says this can actually be used, by rewriting it in this form, to construct this function gamma, and this gamma function is actually an interpolant of the factorials. So it's a function that goes straight through all the factorials, it's a continuation of the factorial, it's a wonderful thing. It's not the only one, actually. In the early twentieth century, Hadamard came along and came up with a different interpolation that also exists, it's this thing down here. It also uses the gamma function, and it also interpolates the factorials, so it's not like there's one unique interpolation of this function.
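A quick numerical sketch of this story, using SciPy's special functions (the counts below are just the lecture's fourteen-and-nine example with a uniform prior): the gamma function interpolates the factorials, Gamma(n + 1) = n!, and the beta function, B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b), is exactly the normalization constant that shows up in the posterior.

```python
from math import factorial
from scipy.special import gamma, beta

# Gamma goes straight through the factorials: Gamma(n + 1) = n!
for n in range(6):
    print(n, factorial(n), gamma(n + 1))

# The Eulerian beta integral, i.e. the posterior's normalization constant,
# e.g. a = 14 + 1, b = 9 + 1 for the glasses counts with a uniform prior.
a, b = 15.0, 10.0
print(beta(a, b), gamma(a) * gamma(b) / gamma(a + b))   # same number, two ways
```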
But all of this didn't help Laplace, because, I mean, there's this integral, but you still don't know what it is. There's this thing up there, we don't know how to compute it, and he needed a number in his tables because he wanted to draw pictures. So what did he do? This is going to be important for your homework. So Laplace sat down, this is the final thing we do today, and he said, okay, and this is a story that we will go through at least three or four times in this class, because it will turn out to be a magical key to Bayesian machine learning in 2023. I cannot overstate how important this slide is. So try and get it the first time round, and then we'll go slow the next time it happens again. So here is Laplace's argument. The first time he encountered this problem was in, I don't know, 1805 or so, when he wrote this book; I think it was published in 1816, but he wrote it over a long time, out of lectures at the school where he was teaching. So he said, this distribution we care about, it's x to the (a minus one) times one minus x to the (b minus one), divided by this normalization constant. So first of all, let's take the logarithm of this distribution, because then things get easier. We get (a minus one) times log x plus (b minus one) times log(one minus x), minus a constant, which is this normalization constant. And now what we're going to do is find the mode of this distribution. So we take the derivative of this function with respect to x. We can do this on a piece of paper, because it's 1805 and we have a lot of time between our busy job working for Napoleon or whatever. So we get (a minus one) over x minus (b minus one) over (one minus x). I hope you agree with that. Set this gradient to zero, solve it, and we find that the mode lies at (a minus one) over (a plus b minus two). I hope you follow. So why is this a good thing to do? Well, first of all, you can do this in closed form. Why is this the right mode? Well, because what we've taken is the logarithm of this function, but the logarithm is a monotonic function. If you think of what the logarithm looks like, it's this function, right? It rises all the time. So by taking the logarithm, we're not shifting the location of the mode. It just has a different number assigned to it now, the logarithm of the value of p plus a constant, but that doesn't matter, right? The location of the mode is the same, and conveniently, this annoying normalization constant has gone. So now we have the mode, but we don't yet know what the value of the function at that point is, because for that, we would need this normalization constant. So now we'll do something else. We take the second derivative of log p of x at this location. So the second derivative of log p of x looks like this; you can check this expression, right? I take the derivative of this with respect to x, which is minus (a minus one) over x squared, plus (b minus one) over (one minus x) squared times the inner derivative, which is minus one, so there's a minus there again. And then we plug in x hat and compute the curvature of this function at the mode. And it's an annoying expression, but it's just a's and b's and squares and things, and Laplace could take squares, right? So I have a picture of this. Actually, not quite yet, let's do this first. So what we've done here now is essentially a Taylor expansion.
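For reference, here is that derivation written out compactly, in the lecture's notation (with Z for the unknown normalization constant and psi for the curvature at the mode):

```latex
\begin{aligned}
\log p(x) &= (a-1)\log x + (b-1)\log(1-x) - \log Z,\\
\frac{\mathrm{d}}{\mathrm{d}x}\log p(x) &= \frac{a-1}{x} - \frac{b-1}{1-x} \overset{!}{=} 0
  \quad\Rightarrow\quad \hat{x} = \frac{a-1}{a+b-2},\\
\psi := \frac{\mathrm{d}^2}{\mathrm{d}x^2}\log p(x)\Big|_{x=\hat{x}}
  &= -\frac{a-1}{\hat{x}^{2}} - \frac{b-1}{(1-\hat{x})^{2}}.
\end{aligned}
```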
And I think Laplace wouldn't have called this a Taylor expansion, because he didn't know of Taylor, but he's expanding this function in terms of some polynomials. So we find the mode, and then at the mode, the second-order Taylor expansion is a constant, right, the value of the function evaluated at that point, plus a linear term times the gradient. What's the gradient at the mode? Zero, so there's a zero here. Plus a quadratic term, right? One half times the squared distance to the point at which we have taken the expansion, times the second derivative at that point. That's this thing. It's just some number in terms of a's and b's. So that means that up to second order, the logarithm of this distribution can be written as this quadratic function. And therefore, the normalization constant is approximately the integral over this thing if you take the exponential again, right, e to this object. So e to that object is e to the log of p of x hat, so that's just a number, a number that we don't know because it involves this normalization constant, times the integral over the exponential of the quadratic term here, because there's no linear term, which I write like this, dx. And we know that this has to be one, because it's a probability distribution. So if we know what this thing is, then we know what this thing is, and it involves the constant that we care about. Ah, we're nearly there. And now, thankfully, Laplace knew of someone who lived in Brunswick in Germany and had solved this integral. So back then, mathematicians got famous by solving one integral. Well, actually, Gauss did many other things, but he solved this one integral among other things, which is really not so straightforward if it's 1805. And he showed that this beautiful object has the value given by the square root of two pi times v. And Laplace knew this because he had read the book by Gauss, which had been published like five years before Laplace published his theory of probability, the Théorie analytique des probabilités. And so he plugs in this thing. He goes: this integral there, I know what this is. It's the square root of two pi times minus the inverse of psi. So there's this object up there. Why minus? Because there has to be a minus in the exponential for this integral to work. So I need to multiply by minus one to make the minus work. And then I need to invert it, because otherwise it doesn't look like the integral that Gauss has in his text, but whatever, it's just a real number. So I take the minus, invert, and get a number out. And p of x hat is our constant, one over B of a, b, the beta integral, the Eulerian beta integral, times x hat to the (a minus one), times (one minus x hat) to the (b minus one). And I can just solve for it, and, ah, there it is. This is my normalization constant. So this is called the Laplace approximation. And this is what it looks like in pictures. Whoop. So on the left, you see in black our actual probability density function, and I take the logarithm of it, plotted in log space here. So you can see that it's a convex function. Oh, concave, sorry. Then I find the mode, that's at the dashed line. I take the second derivative at the mode, that's this red parabola, which is actually a parabola. Oh, the other way around: the red thing is the true distribution, of course, because it has to go down at zero and one, and the black thing is the parabola, sorry. And then I can take the exponential again, and that's this black thing that you see on the left-hand-side plot.
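Here is a small sketch of that argument in code, just to make the formula concrete. The counts are again the lecture's fourteen-and-nine example; the comparison against the exact Eulerian beta integral is something Laplace could not do, and is only there to show how close the approximation gets.

```python
import numpy as np
from scipy.special import beta as beta_fn

a, b = 15.0, 10.0                    # e.g. 14 "glasses" and 9 "no glasses" plus a uniform prior

x_hat = (a - 1) / (a + b - 2)        # mode of x^(a-1) * (1-x)^(b-1)
psi = -(a - 1) / x_hat**2 - (b - 1) / (1 - x_hat)**2   # curvature of the log density at the mode

f_hat = x_hat**(a - 1) * (1 - x_hat)**(b - 1)          # unnormalized density value at the mode

# Laplace: Z ~ f(x_hat) times Gauss's integral of exp(psi/2 * (x - x_hat)^2) over the whole real line.
Z_laplace = f_hat * np.sqrt(2 * np.pi / (-psi))
Z_exact = beta_fn(a, b)              # the true normalization constant B(a, b)

print(Z_laplace, Z_exact)            # close, but not equal -- the homework looks at why
```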
And I know what the normalization constant for this is, so I can just plug it in, and that's an actual probability distribution. Now the red thing on the left is something that Laplace didn't have, because it involves this normalization constant. But you could approximate it with this black thing, and that was fine. So now I'll tell you what your homework is. Maybe you noticed that along this derivation we just did, I sneaked in an error, one that Laplace made as well, and he just doesn't talk about it. Because all great mathematicians know what they're aiming for, so they just keep going, even though it's a little bit wrong, but whatever. And then someone else, two generations later, can fix it for them. Or actually, in this case, like four generations later; it got fixed in like 1995 or so. Can someone guess what the mistake is? And to make it easier for you, because we don't have too much time, I'll give you a hint. There is an integral sign here, and there's an integral sign here. And I've left out the boundaries of the integral signs because they are not the same. Gauss's integral goes from minus infinity to plus infinity, and Laplace's integral goes from zero to one. So it's not actually true, but whatever, right? These two things are close to each other; in log space it's not quite the same. So in your homework, you'll get to check what happens if you do the integral correctly. And you can do it today, because you have these machines that sit in front of you. Laplace couldn't, because if he wanted to, like, cut this off here and here, he would have needed to be able to evaluate error functions, and he didn't have error functions, right? So he needed to do the closed-form thing with pi. And that's actually the last point I want to make. These days, of course, we don't have to worry about any of this, because we've got computers. So if you don't like the prior that I chose, if you don't like this uniform thing or this parameterization of the prior with alphas and betas, you can put in whatever prior you like. Here, in this case, the black dashed line is some prior that I drew by hand. Well, not quite by hand, right? And then I multiply by some observations, that's this red dashed line, and out comes a posterior. This is not of the form that you can do with this app, but it doesn't matter, because we've got computers. I just need to integrate this function from zero to one, and you all have enough power in your computers to do these integrals. So there's actually a lesson hidden in there, which is that we're going to try and do things like this to keep things fast, but if we really need to be Bayesian, we can always just do the full thing and compute numerical integrals. So with that, I'm at the end. What I wanted to do today is, in the first half, some heavy-lifting theory, which will be the last time we have to worry about this, because from now on we'll try and write code. We saw that we can take functions of random events and construct new variables from them, for example by taking the sum over those events. That constructs an object called a random variable. We can also construct probability distributions on continuous spaces. There are a few things to worry about, but actually, if you are on a nice topological space, say a Euclidean vector space, it's possible to construct sigma-algebras, and usually everything will be fine unless we do something very pathological, like thinking about sets that are countable but lie in some uncountable space.
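As a sketch of that last point, assuming an arbitrary made-up prior (this is not the prior from the slide, just something non-conjugate to make the point): the posterior can still be normalized with a plain numerical integral over the unit interval.

```python
import numpy as np

pi = np.linspace(0.0, 1.0, 2001)
dpi = pi[1] - pi[0]

# An arbitrary, hand-drawn-looking prior (made up for illustration), normalized numerically.
prior = np.exp(-80.0 * (pi - 0.3)**2) + 0.5 * np.exp(-80.0 * (pi - 0.8)**2)
prior /= np.sum(prior) * dpi

# The same kind of likelihood as before: 14 "glasses", 9 "no glasses".
likelihood = pi**14 * (1.0 - pi)**9

posterior = prior * likelihood
posterior /= np.sum(posterior) * dpi   # the evidence, a plain numerical integral from 0 to 1
```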
And on those continuous spaces, we can construct probabilities that sum up entire volumes that we can measure. And if we compute the probability of a volume up to some upper bound, then we can take the derivative with respect to this upper bound, and if that derivative exists, we get something called the probability density function, which is a very useful object because the laws of probability hold for it: the sum rule, the product rule, normalization, and Bayes' theorem. And we can use that, as we just did in this example, to actually learn non-trivial things, like the probability for someone to wear glasses, continuous probabilities from binary observations. When we do that, we suddenly have to start caring about the computation on the machine. And if we want it to be fast, we had better make sure that we compute everything in algebraic forms that actually make it tractable. Laplace already had this problem over two hundred years ago, and it has been with us the entire time. Even today, when we do deep learning and artificial general intelligence, we're going to be plagued by these problems of how to translate the philosophical idea of inference from data onto actual algorithms. And a large part of this class will be about how to do this with good algorithms. That's the end for today. I hope you'll give me some feedback on this somewhat half-and-half lecture, and then we'll be back on Thursday.