So, what we've been doing over the past few lectures, well, okay, the very first few lectures were mostly me pointing out that there is this notion of reasoning that allows us to keep track of many possible hypotheses while data accumulates, called probability theory. And one of the very first observations we made is that doing so can potentially be very, very expensive: exponentially expensive in the number of variables, and in a way linearly expensive in how many possible values the variables can take, which is still very bad if you think about continuous variables, because they effectively take an infinite number of possible values. Ever since then, so in lectures three, four, five and now six, it has all been about showing you that it is nevertheless possible to do probabilistic reasoning, and in fact that it's not even particularly expensive to do. There is this notion in the machine learning community that Bayesian inference, probabilistic reasoning, is somehow expensive and intractable and impossible, and that therefore we should just do deep learning and forget about probabilistic reasoning. This entire class is here to show you that that's not necessarily a good idea. Actually, it's just not a good idea, because when we jettison probabilistic reasoning from machine learning, we also lose all sorts of very interesting functionality that we'll get to later in the class. The way we've been doing this over the past few lectures is to look at a relatively broad class of probability distributions called exponential families, which provide a very interesting algebraic structure with which we can construct models.
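To make this language concrete in code, here is a minimal sketch (my own illustration, not the lecture's notebook) of the exponential-family form p(x | w) = h(x) exp(phi(x) * w - log Z(w)) for the simplest case, a Bernoulli likelihood, where phi(x) = x, the natural parameter w is the log-odds, and log Z(w) = log(1 + e^w):

```python
import math

def log_Z(w):
    # log partition function of the Bernoulli family: log(1 + e^w)
    return math.log1p(math.exp(w))

def bernoulli_expfam_pmf(x, w):
    # p(x | w) = h(x) * exp(phi(x) * w - log Z(w)), with h(x) = 1 and phi(x) = x
    return math.exp(x * w - log_Z(w))

# the natural parameter w is the log-odds of the usual probability pi
pi = 0.3
w = math.log(pi / (1 - pi))
assert abs(bernoulli_expfam_pmf(1, w) - pi) < 1e-12
assert abs(bernoulli_expfam_pmf(0, w) - (1 - pi)) < 1e-12
```

The whole modeling decision lives in phi and log Z; everything else in the density is fixed by the algebraic form.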
And I introduced this in abstract form: these are probability distributions over some variables x, which we call the observations or the data, parametrized by some parameters w, so you can think of this as a conditional distribution p(x given w). They have this algebraic structure: there is one term that depends only on the data, called the base measure; a mixing term in the exponential between a function of the data, called the sufficient statistics, and the parameters, which you could also represent in some other way as canonical parameters; and then a normalization constant, which depends only on the value of the parameters. And I made the point several times that what this boils down to is a language to describe models about data that hinges on phi and Z. Phi is something that you get to design: phi, the sufficient statistics, is your model of how the data comes about, of what you care about in the data. Z is the price you have to pay, because you need to be able to compute a normalization constant. If you decide to use this structure, this function Z is defined through phi, because it's the normalization constant of the density: you have to be able to compute the integral over this function with respect to x, and then you're left with a function of w, which is Z(w). But the point is that once you've done that, you're done. That's the entire modeling process, and everything afterwards follows procedurally from the rules of probability theory and just algebra. In particular, the prior that you should probably use for this kind of model, to be able to do tractable inference on w, kind of forces itself upon you, in the form of a conjugate prior. So we encountered this notion of a conjugate prior, which is an algebraic structure such that the posterior, after observing some data drawn from this likelihood,
has the same functional form as the prior. And we saw that every exponential family has a conjugate prior. This is a really busy slide, so let's look at it slowly again. Every exponential family of this form has a conjugate prior, which has this form, and which is itself an exponential family. It has a base measure that we can construct to be just the uniform measure, the Lebesgue measure, and sufficient statistics which consist of the natural parameters of the likelihood and the log partition function of the likelihood. That means its natural parameters are some kind of pseudo-counts of w, plus a counting variable that just keeps track of how many observations we've made so far. The only problem with this is that for this exponential family we also need a log partition function, and in general it can be intractable to compute this function. So the only real problem in doing Bayesian inference is not that there has to be a prior; it kind of forces itself upon you. The problem is that you need to know this function to do full Bayesian inference. Because if you can compute it, then the posterior is going to have this simple form: you just evaluate a variant of this function, where you add to the natural parameters the sufficient statistics of the data and the count of the data, and evaluate this function. And then you can do pretty much anything: you can evaluate the posterior, and you can predict future data by marginalizing over this conjugate prior, and you're left with this predictive object, which, by the way, will be part of your homework this week to implement. So in a way this is bad, because it means you need to know F, but it also really boils down what we need to know to be uncertain. We need to decide what our model is for the data, which in the exponential family case reduces to the sufficient statistics phi, so there's one function you need to define; and then we need to be able to do these two integrals, Z and F, and then we're done. And I
said that usually you know Z, because you're picking your likelihood from a standard collection of exponential families. If it's a discrete random variable, you're probably going to use either the multinomial or the binomial distribution, and then the conjugate priors are a Dirichlet or a beta distribution. If it's a real-valued variable, you might use a Gaussian likelihood. If it's a rate, then you might use a Poisson likelihood, with a conjugate prior called the gamma distribution. If it's some extreme event, you might use a Laplace likelihood, and so on and so on. So the main challenge that's really left is F, and that's where we ended the last lecture, because I ran out of time. Now I want to complete that story and say: well, actually, maybe we can get away, approximately, without actually computing F. The first observation for this I already kind of made, but I cleaned it up a little bit in the last lecture. Of course, even if you don't know F, we still know quite a lot about the shape, the geometry, of this probability distribution: we know basically the entire thing, up to a constant that scales the entire distribution. So in particular we can reason about the shape of this distribution, the spread that it has, and we can use that to find, for example, the mode of this distribution; not just of the likelihood, which is what I did last week, but also of the posterior. And it doesn't make it any harder to work with the posterior. Why? Let's look at this equation again. If this is our likelihood exponential family (I've already left out h(x), because it's not going to matter, since we care about the thing as a function of w), and this is its conjugate prior, as we just saw, then the posterior, as we saw, will be of the form something like this: an exponential family with sufficient
statistics given by the natural parameters of the likelihood and its log normalization constant with a minus in front, and natural parameters given by the updated form, that is, the prior parameters plus the sum of the sufficient statistics of the likelihood, and a counting variable for how many observations we've seen. And then there's a normalization constant. But if you only need to know the mode, we don't need to know the normalization constant, because it just scales the distribution up and down, so it's not going to change where the mode is. In particular, we can find the mode by taking the logarithm of this expression, because then the exponential goes away, which is nice. We're allowed to do that. Why? Yes: because the logarithm is a monotonic transformation. If you transform a function through a monotonic transformation, it doesn't change the location of the mode, because something that is higher than the others stays higher than the others. Okay, so you take the logarithm, then you can compute the gradient of that logarithm and set it to zero. You can look at this and directly find for yourself that this means we have to solve this equation, which is neat, because, as I already argued last Thursday, the right-hand side doesn't involve w. We're trying to solve for w, but the right-hand side doesn't depend on w, and the right-hand side is the thing that contains the data. So this is really not so much an optimization problem as a root-finding problem: there is a function of w whose root we need to find, and there's just a constant on the right that encapsulates the entire data. In particular, this means that if I give you a data set, or if that person you're working with has given you a data set, and you've computed the sufficient statistics, you're not going to come back to the data afterwards. You can give it back; you can hand the USB key back with the data and finish your computation afterwards. The only time you would need the USB key back is if you decide to change
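This root-finding view can be sketched concretely. As an illustration (assuming a Bernoulli likelihood, whose log partition function is log Z(w) = log(1 + e^w), with made-up pseudo-counts), the mode condition grad log Z(w) = (alpha + sum of phi(x_i)) / (nu + n) is a one-dimensional root-finding problem; the right-hand side is a single number computed once from the data:

```python
import math

def grad_log_Z(w):
    # derivative of log Z(w) = log(1 + e^w) for a Bernoulli likelihood:
    # the logistic sigmoid
    return 1.0 / (1.0 + math.exp(-w))

# conjugate-prior pseudo-counts (hypothetical values) plus observed data
alpha, nu = 1.0, 2.0          # prior: "one success in two pseudo-observations"
data = [1, 1, 0, 1, 1, 0, 1]  # the sufficient statistic is just the sum
target = (alpha + sum(data)) / (nu + len(data))  # right-hand side: data only

# solve grad_log_Z(w) = target by bisection -- a root, not an optimum;
# after this point the raw data is never touched again
lo, hi = -30.0, 30.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if grad_log_Z(mid) < target:
        lo = mid
    else:
        hi = mid
w_hat = 0.5 * (lo + hi)
assert abs(grad_log_Z(w_hat) - target) < 1e-9
```

Note that `data` only enters through `target`; once that number is computed, you can hand the USB key back.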
your model: change the likelihood, and hence the sufficient statistics. Fine. The other nice thing about this is that it's actually posterior inference. Here what I've done is to let the prior enter through its natural parameters, and the only thing that has changed is that we have an extra little alpha up here and an extra little nu. You can think of those as regularizers, from a statistical perspective, or as this little epsilon that makes sure we're not accidentally dividing by zero. You could also just say it's well-written code to do it this way, because it ensures that your algorithm is always going to run. Even if you have no data yet, you can still make predictions; you can ask the algorithm from the start, if you set nu to, whatever, one or 0.5, and alpha to 0.5 or 1 or whatever you like, or maybe even 0; alpha isn't that important. So with this we can find the mode. That means there's our function, which I'm going to draw with my hands rather than on the blackboard, and it has a mode, which we can find. If we were statisticians, we'd stop there and say: ah, here's our estimate, right? Maximum a posteriori, or maximum regularized likelihood, or whatever you might want to call this. But we would like to be uncertain, and we're nearly there: we already have an estimate, we have a full prior and likelihood, we know what the shape is, we have the entire thing; the only thing missing is this normalization constant. So how would we address this with our current approach? Okay, so these are several questions wrapped into one. The question, if I can try to paraphrase for everyone else, is: if you're in a sequential setting (I made this point about the person with the USB key who entered the room and left the room again, and your data is now encapsulated in the sufficient statistics), what happens if you get more data in the future? Well, if you get more data,
then you just add more terms to this sum over there. Your question is about this phi, right: maybe I might decide to change phi later? Yes, there are settings where, as you get more data, you actually convince yourself that the sufficient statistics you used were the wrong ones, and we'll get to that, not today, but in not so many lectures from now. But a standard setting, and actually one that gets you quite far, as you will see in the next few lectures, is this: you've decided what phi is, what about your data you care about, but what you don't know yet are those parameters. For example, you've decided you want to know what the probability is for someone to wear glasses; you just don't know yet what the probability is, but you can keep getting more and more people. If you do that, all you have to do is update this count here, and of course that's going to shift where the mode is, so it's going to shift our point estimate. If you're really in that sequential setting, you might want to think about how to do this update efficiently. Now, for our very simple cases with the beta distribution, you don't need to worry about it, because I've shown you that you can even do it in a silly web app; it's so simple that it's really straightforward. But of course you might have something a little bit more challenging on your hands, and then you have to think a bit more about how to do this. Okay. The point, though: maybe your question also was that there will be an error in this model, so why shouldn't we just keep the point estimate? My main message is, again: no matter whether you believe in your model or not, it's always a good idea to track uncertainty. If you don't believe in your model, then of course you also don't believe in your uncertainty, and you have to think about what you can do to track the uncertainty about your model as well; we'll get to that. But it's never the right idea to say: ah, there's going to be something wrong with the model, so
let's not compute the uncertainty at all, because it's going to be wrong. That's a classic rhetorical trick from the frequentist side: to say, well, your assumptions are going to be wrong, so it's better to just make an estimate and not talk about the errors. That's just shifting the problem to someone else, the person who has to work with the output of your model, and then it's easy. Sometimes you're in this lucky situation, professionally, that when someone comes back and says your prediction was wrong, you can answer: I never claimed that it was right. So what was the error? I wouldn't know; I can't possibly quantify the error. It's easy as a rhetorical trick, but it doesn't get you anywhere. Of course, sometimes it takes an investment to say: here is a first idea of what the error might be. And then it's sometimes more difficult, rhetorically, to explain to the person you're talking to that this doesn't capture every possible error, but it still tells you something about your model. Just like when you write unit tests: it's usually a good idea to write unit tests, even though your test suite is not going to cover your entire code base. Actually, there are some people who try to get 100% coverage on their GitHub repo, but typically it's more like 80, 60, whatever percent, and people are quite happy about that already. That doesn't mean it's a bad idea to write unit tests and that you shouldn't write them; you should still try to cover as much as you can. You want to find out what's wrong, you want to diagnose your code, and uncertainty is a way to diagnose your model: to ask it, how much have you actually learned from the data yet, how much data do I even have, even if it doesn't capture all of the possible ways in which it could go wrong. So, how do we get away from maximum likelihood, or, as I just showed you, maximum a posteriori? Those are really not much of a difference because, yet again, my point is that the prior doesn't matter so much; the difference between maximum likelihood and maximum a posteriori is just
whether we add these two numbers, alpha and nu. So, funnily enough, the grand master of probability theory already basically told us how to do it, in 1810 or so. I showed you this very quickly in passing when we went through it in lecture number 3 or so, thinking about how many people are wearing glasses. In his particular distribution, which we now realize is an exponential family, this binomial distribution, there is this normalization constant which he didn't know: the beta function. Well, I mean, he knew what it was, but he couldn't compute it, because it was a difficult integral. So he already made this trick that we are now just going through for his particular exponential family. He found the mode by taking the logarithm of this expression, which means this annoying factor becomes a constant at the end; then we can take the gradient of this function with respect to x, set it to 0, and find the mode, which in this case happens to be at this particular value. Then we compute a second derivative of this function at the mode and find that it has a particular form; it looks like this. It's a bit of an annoying expression, but it just involves numbers, so you can actually compute it, and it's the kind of function that Laplace, even around 1810, could just compute, in particular if those a's and b's are integers. And what that tells us is the curvature of the problem at the mode. Laplace observed, without thinking yet about normal distributions and Gaussians (he just knew that Gauss had already solved the corresponding integral), that this allows a closed-form expression: if we find the mode, then at the mode we can express the log probability distribution as a quadratic. So the Taylor expansion is a constant term, the value of the function at the mode; plus a linear term, which involves the gradient, and because we're at the mode the gradient is 0, so that term vanishes; plus a quadratic term around the mode. And that quadratic
term is one half times the squared distance to the mode, times this object, which is the second derivative at the mode. It's just this annoying expression, so I've given it a name, psi. I'm going to use psi throughout the entire lecture for this thing; it's soon going to become a matrix, but for Laplace it was just a number. That means the log probability distribution is a quadratic, so the probability distribution is the exponential of a square. And Gauss had figured out how to compute integrals over exponentials of squares; actually, of negative squares, so there has to be a minus here, but whatever, we can just redefine psi, or shift minuses around, and then everything works out. And we're left with an expression for which we know the value (it has something to do with the square root of pi, and the square root of this), times the thing that we don't know, the normalization constant, evaluated at the mode. So that's a function that we can evaluate, up to the fact that we don't know the normalization constant. But we know that this is supposed to be a probability distribution, so it integrates to 1, and now we're left with an equation in which there's only one unknown variable, the normalization constant, and then we're done. This particular idea Laplace did for his distribution, but now, in 2023, we can think about this in code for the general case of our exponential family. So let's lift this back up to the abstract representation that we had. The exponential family is of this form; it has a conjugate prior of this form; so whether we have data or not, we're always going to deal with a distribution that has this form. I'm going to write it as a distribution over w with a bunch of parameters, which I'll call alpha prime and nu prime, maybe to make this very clear: at this point it really doesn't matter what alpha prime and nu prime are. You could get those from just a prior, not having seen any data yet; then you can predict data. You can predict them after a bunch of
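Laplace's calculation can be replayed numerically. For the integrand f(x) = x^a (1-x)^b the mode is x_hat = a/(a+b), the negative second derivative of log f at the mode is psi = a/x_hat^2 + b/(1-x_hat)^2, and the Gaussian integral gives the approximation f(x_hat) * sqrt(2*pi/psi); the exact value is the Beta function B(a+1, b+1). This is a sketch with variable names of my choosing:

```python
import math

def laplace_beta_integral(a, b):
    # Laplace approximation to the integral of x^a (1-x)^b over [0, 1]
    x_hat = a / (a + b)                        # mode of the integrand
    psi = a / x_hat**2 + b / (1 - x_hat)**2    # -(d^2/dx^2) log f at the mode
    log_f_hat = a * math.log(x_hat) + b * math.log(1 - x_hat)
    return math.exp(log_f_hat) * math.sqrt(2 * math.pi / psi)

def exact_beta_integral(a, b):
    # exact value: B(a+1, b+1) = Gamma(a+1) Gamma(b+1) / Gamma(a+b+2)
    return math.exp(math.lgamma(a + 1) + math.lgamma(b + 1)
                    - math.lgamma(a + b + 2))

a, b = 50, 50
rel_err = abs(laplace_beta_integral(a, b) - exact_beta_integral(a, b)) \
          / exact_beta_integral(a, b)
assert rel_err < 0.02   # the approximation tightens as the counts grow
```

With a = b = 50 the relative error is already below one percent; the approximation improves as the counts (the amount of data) grow.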
data, by including terms like this, the sufficient statistics of the data, and then you can predict future data, more data beyond the ones that you've seen. You could also look back and try to predict the data you've seen. That somehow feels a bit dangerous; you're probably going to become overconfident, it's like testing on the training set. But you can still do it, it's a function you can call; you just have to be careful what you claim about it, right? And you can also just treat this as a function that you can evaluate at any possible value of alpha prime, nu prime. The main point is: that's the algebraic form it's going to have; we know that it looks like this. Now we're going to do what Laplace did, but abstractly. We take the logarithm of this expression, so that the normalization constant literally becomes a constant, which makes it easy to think about the mode. We take the gradient of this expression with respect to w; that turns into a root-finding problem, as we just saw (here it is, written again), and we find the mode. How do we find the mode? We could find it on a piece of paper, like Laplace did, or we could call an optimizer. The optimizer is the pedestrian solution; it's the thing you do if you don't know what else to do: just hope that SciPy does it for you. It's a bit dangerous, because it's going to do something and give some answer, and you won't really know whether it worked or not, just like in deep learning. You could call Adam as well; of course it's not forbidden to use Adam, though there are better optimizers if you know the problem, but whatever, it's good to do something, and you'll have to deal with learning rates and decay and all that. So you'll get an estimate. But you can also sometimes do it on a piece of paper, because everything is relatively well understood, and if you can solve it on a piece of paper, of course that's going to save you a lot of time in the code; it's going to run way faster. So if it's
a homework exercise, maybe the numerical solution is easier, because you don't have to deal with the algebra; but if you're building real software, better make sure you make it as fast as you can. Okay. Once we've found the mode, we can think of our, I keep wanting to call it a loss function, but let's just call it a log posterior probability distribution, as a function for which we can do a Taylor expansion. That will contain a constant at the mode; a linear expression, which I've now written out explicitly, which at the mode is just zero; and then a quadratic term, which I've now written in full algebraic glory, because it might well be a matrix. If you have more than one parameter w, and our exponential families usually have more than one parameter, then the second derivatives span a matrix, whose ij-th element is the derivative of this function with respect to first the i-th element of w and then the j-th element of w. And what's that matrix called? The Hessian, yes. So your question is: what happens if the mode is at the edge of the parameter space? Then things are a bit hairy, and we need to think about what we do with it. Let me draw a picture, though, more generally, just to give everyone an idea of what's going on. Here's this distribution for which we don't know the normalization constant. We've taken its logarithm, so it looks like this in log space; we found the mode; and then we did a quadratic expansion, so typically this will look something like this. Now the exponential of this function looks like a Gaussian probability distribution, because it's e to the minus some quadratic form. Now, if you're somewhere in a corner, if this function looks like this but this is your parameter domain, that's not a good situation to be in, and we have to be a bit more careful about what we do. There are also many other ways in which this could go wrong: maybe this log probability distribution, actually, the
distribution might look like this, right? Then we're going to find an approximating distribution that looks like this, which is going to miss most of the uncertainty. So it's not like we can just stop thinking when we do this. But you can also imagine that maybe these are pathological cases, and quite often you'll get away with just doing this. More importantly, all of these situations are better than just returning the mode and claiming that you're not responsible for it; that's my most important point. If someone just says: it's just a point estimate, whatever, I'm not allowed to talk about the error; it's still better to say: I'm actually at the corner, that seems really dangerous; or: the curvature at this mode is not negative definite, so I might be in a nasty situation like this even if I'm not at the corner; and so on and so on. And then I can think about it and see what I can do. Yes? Well, when we talk about deep learning, these will become loss functions: x is the data, w is the weights of whatever your network is going to be. So the question is: how close to the mode do we have to be for the Laplace approximation to be a good approximation to the posterior? The story is that as you move away from the mode, we get more and more of a deviation between this quadratic approximation and the true log probability distribution. The mathematical answer is: that's the O of cubed term in the Taylor expansion, and that is one way to think about this. You could say: maybe I should compute a third derivative. Without going too far in a direction that might confuse everyone right now: the object you would need to compute is a tensor of third order, and if w is a long vector, this object will become very large, cubically large in w, and that's usually too much. So if you think of a deep neural
network, it's not going to happen. There are other tricks that people do in practice, but it's too early in the lecture to think about them now. Just to give you one simple, quick answer: this is a Gaussian distribution that we are approximating with. You could draw random values for w from this Gaussian distribution; the Gaussian tells you what the pdf should be at that point, if it is the correct one, and you just evaluate the true pdf at that point and see how much it deviates from the Gaussian. That can give you a stochastic estimate for how wrong you are. So, for example, in this case you would draw points over here, which happens very rarely, and then you would notice: ooh, actually the model would say this is still very likely; that's not good, I'll need to do something about this. Okay, so let's finish this thought. We've done our quadratic approximation, and the exponential of a quadratic is something that Gauss knows how to compute the integral over. By the way, this lecture is called Gaussian distributions, so in the second half, after the break, we'll talk about Gaussian distributions; if you're confused by these, just wait 15 minutes, we'll get there. And we know how to compute the normalization constant of this. The normalization constant of a Gaussian distribution is this term here: it's the square root of 2 pi to the dimensionality of the problem (that's the number of w's we care about) times the determinant of this negative inverse Hessian. The determinant of the inverse is the inverse of the determinant, so it's not so bad: if you can compute the determinant, you can also compute the determinant of the inverse. The inverse appears because of how Gaussians are typically defined, and we now know what's behind this: it's just exponential families again; the way we usually think about Gaussians is in terms of their canonical parameters rather than their natural parameters. So we could write our
probability distribution as the value of the actual, correct posterior at the mode (this is a function we know completely, up to its normalization constant), times a Gaussian probability distribution, normalized, standardized, with mean at the mode and covariance given by the negative inverse Hessian of the log pdf at the mode, times this annoying number. Whatever it is, we have to compute it, but it's just a number. That means we know everything up to this normalization constant, and now we can fix it, because we know that this has to be a probability distribution. So I've continued the argument. The first line is the same as on the last slide. We know that this object has to be a probability distribution, so it has to integrate to one. If you integrate over w: there is only one w in here, and this is a probability distribution, so it integrates to one; everything else involves just w hat, and w hat is a particular number, not a variable. So this has to be one. The integral over this is one, so we know that one has to equal this function, which contains an unknown constant, times this square-root-of-2-pi term. And now we plug in the value for p of w hat, which looks like this (I've just plugged in this algebraic form again), and the only thing we don't know about it is this normalization constant. So now we know that our normalization constant can be written like this. It's a function of alpha prime and nu prime, which depends in a complicated, nonlinear fashion on where we find the mode, but locally we can think of it as a simple function of alpha prime and nu prime, which only enter at three points: here, there, and in this Hessian. So let's add this to our piece of code. Here is our (can you read this, is it too small in the back? it's okay? good), here is our abstract base class again for exponential families. You've seen it last lecture: we defined it with all of these basic things that define the exponential
family. And then we realized that when you define these basic objects, the sufficient statistics, the base measure and the log partition function, that also engenders a conjugate prior, which we have constructed down here as an exponential family itself. It has sufficient statistics which arise from the likelihood; it has a base measure which is just the natural base measure, the Lebesgue measure; and it has a log partition function which in general we don't know. So we need to ask the likelihood: do you happen to know what the conjugate log partition function is? In general the likelihood is going to answer: I don't know, NotImplementedError. But if it does know, then it can return it, and then we can do full Bayesian inference; for example, for the binomial we know that the beta function is the conjugate log partition function. So now we can do something cool: we can add to our conjugate family the ability to approximate its normalization constant, by adding a function at the bottom here called laplace_precision. It says: okay, if you don't know what my log partition function is, then just tell me where my mode is, either because you worked it out on paper or because an optimizer told you, and then what I'm going to do is evaluate my negative Hessian at the mode, which is a function of the natural parameters of this object, of alpha and nu, and then I can return it. This is called the precision, because it's the inverse of the covariance matrix of this Gaussian. So the question is: what's the relationship with this matrix? They are very closely related, and I don't want to digress too far, but there is a big theory in frequentist statistics about the asymptotic error of a maximum likelihood estimate, and it's very closely related to this, up to some normalization, up to some 1-over-n factor. Okay. So now we can evaluate the unnormalized log pdf of the posterior, because its form is algebraically forced upon us by the
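What such a laplace_precision method computes can be sketched generically. Here I assume a Bernoulli likelihood with its conjugate family (so the unnormalized log posterior is alpha*w - nu*log(1+e^w)), and a central finite difference stands in for the autodiff Hessian; with JAX or PyTorch you would get the second derivative exactly. The precision psi then goes into the Gaussian normalizer, sqrt(2*pi)^d times det(psi) to the minus one half, which in one dimension is sqrt(2*pi/psi):

```python
import math

# unnormalized log posterior of the conjugate family, exp(alpha*w - nu*log Z(w)),
# here for a Bernoulli likelihood (an assumption for illustration)
ALPHA, NU = 3.0, 10.0

def log_p_tilde(w):
    return ALPHA * w - NU * math.log1p(math.exp(w))

# mode, solved on paper for this family: sigmoid(w_hat) = ALPHA / NU
w_hat = math.log(ALPHA / (NU - ALPHA))

def neg_hessian_fd(f, w, h=1e-4):
    # central finite difference standing in for an autodiff Hessian
    return -(f(w + h) - 2.0 * f(w) + f(w - h)) / h**2

psi = neg_hessian_fd(log_p_tilde, w_hat)      # the "Laplace precision"
F_laplace = math.exp(log_p_tilde(w_hat)) * math.sqrt(2 * math.pi / psi)

# brute-force check: numerically integrate the unnormalized density over w
F_numeric = sum(math.exp(log_p_tilde(-20 + 40 * i / 200000)) * (40 / 200000)
                for i in range(200001))
assert abs(F_laplace - F_numeric) / F_numeric < 0.05
```

For a multivariate w, psi becomes the full negative Hessian matrix and the square root of 2*pi/psi becomes sqrt(2*pi)^d / sqrt(det psi); everything else stays the same.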
sufficient statistics of the likelihood so let's do one example of this and then I'll go into the break very briefly so here's a quick summary but actually okay I'll briefly summarize and I'll do the example so exponential families should provide an algebraic structure in which you can do Bayesian inference by writing down a modeling assumption sufficient statistics basically everything else becomes predetermined there's a corresponding normalization constant called z there's a corresponding conjugate prior which has its own normalization constant called f and in general you can't evaluate f but you can do Bayesian inference up to normalization and if you don't know the normalization constant one out of actually several possible options but a very very efficient one is to find the mode evaluate the Hessian at the mode and we can do this using autodiff so this is also a notion that historically kind of got lost out of favor because people hated computing Hessians it was just too much work but it's 2023 and we have Jax and PyTorch and whatever all these automatic differentiation libraries so they do the Hessian for us and that directly gives us an approximation to the normalization constant so what can we do with that well he's a tongue in cheek not entirely serious example so you've noticed that these distributions are very somehow important right in classic statistics they are associated with the names of these big old white dudes right so Gauss the Gaussian distribution the beta distribution which is associated with Laplace the gamma distribution which comes from Euler the Dirichlet distribution there is at the top right and the gamma distribution also has something to do with Bernoulli because he voted down all these other people you know like well here's Boltzmann for example who used them in thermodynamics what if you know wouldn't it be kind of cool to add yourself to this right wouldn't you have your own name like on Wikipedia wouldn't that be nice somewhere 
like, be up there, have your own picture, extend the list a bit? well, you can do that. what do you have to do? you have to find one integral, called Z, that's it. so after this lecture you can go across the street into the main library of the university, find one of these really big books that list integrals, the Gradshteyn and Ryzhik and whatever, these big blue paperweights, go through it, find some fun integral that no one has used yet, and attach your name to it. and that's what I've done here. I've looked up a particular weird integral, which I found actually in one of these books, here it is. I've observed that this integral is known: the integral from 0 to infinity over the exponential of minus the first parameter times x squared, minus the second parameter over x squared, so that's x to the minus 2, is given by one half times the square root of pi over w1, times e to the minus 2 square root of w1 w2. it's just in this table of integrals, and if you want the integral from minus infinity to plus infinity, you just multiply by 2. okay, so that amounts to basically an exponential family with sufficient statistics given by these two functions, x squared and x to the minus 2. so now what's left to do is just to write a big cool paper with a nice story about how this is a really important distribution. so maybe, let's say, I don't know, something with physics, because there's going to be a picture. I've plotted in thin black in the background these two features: this is the quadratic function, actually no, it's just the square, and this is the one over square function. so maybe we can think of some orbitals of some atoms, right? but there's two forces, one attracting, that drops with the square, and one repulsing, that drops with one over the square, I don't know, electroweak interaction, whatever, right? and we're going to find some solution where the stuff is sort of at some constant distance to the nucleus, but we don't know what the parameters are
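As an aside, a table integral like this is easy to sanity-check numerically before building a distribution on top of it. Here's a minimal sketch in plain numpy (the parameter values are made up, and a simple Riemann sum stands in for a proper quadrature routine):

```python
import numpy as np

# the claimed table result (Gradshteyn-Ryzhik style):
#   int_0^inf exp(-w1*x^2 - w2/x^2) dx = 1/2 * sqrt(pi/w1) * exp(-2*sqrt(w1*w2))
w1, w2 = 1.3, 0.7  # arbitrary positive parameters

x = np.linspace(1e-6, 12.0, 400_001)        # integrand is essentially 0 outside this range
f = np.exp(-w1 * x**2 - w2 / x**2)
numeric = np.sum(f) * (x[1] - x[0])         # simple Riemann sum
closed = 0.5 * np.sqrt(np.pi / w1) * np.exp(-2.0 * np.sqrt(w1 * w2))

print(numeric, closed)                      # should agree to several digits
```

The factor of two for the full real line then follows by symmetry of the integrand.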
so maybe the parameters have something to do with how many protons there are in the nucleus and how many neutrons, or whatever, right? so we just give names to them, we call them w1 and w2, and that's it, here's our probability distribution. and now we just have to use a big curly character for it so that it looks like a big important thing, right? and of course I'm going to use an H, because I want this to sort of sneakily get into Wikipedia with my name, right. so okay, here we go, now we can do Bayesian inference on w. what do we need to do to implement this? we define this distribution, here it is up there. actually, I cleaned up the notation a little bit: I've introduced new parameters theta, I call them the canonical parameters, they are actually just the square root of the weights. why? because if you look at this expression, you notice there's a lot of square roots floating around, doesn't look good, so I just use theta, which is the square root of w1 or w2, and then everything looks a bit neater. and this is the distribution, here's the normalization constant, here is the linear term, and what I have to do to implement this is just this cell. it's called the bagel distribution, because it has this toroidal shape around the center. it's an exponential family, it has sufficient statistics which are given by minus x squared and minus x to the minus 2, that's the business end. it has a base measure which is just the Lebesgue measure, just one. it has these natural and canonical parameters, w and theta, which we can map to and from by taking square roots or squaring. and the only real thing, the only actual contribution that makes this a beautiful new probability distribution, is that I found somewhere in a book that this is the log partition function, the log normalization constant: it's just this expression up here. the logarithm of it is, the square root didn't get rendered, there's a square root missing here, weird markdown error, so it's one half the logarithm of pi, minus the logarithm of the first entry in theta, zero-based counting, minus the product of the two entries of theta, times two. done, that's it. and now we can do everything with it, everything is directly inherited. we could for example plot the distribution in x; it's also a likelihood in w; it has a conjugate prior, I haven't even written it down, it's not on the slides, but I don't need to, because the abstract base class will know it. it will generate its conjugate prior with the corresponding sufficient statistics and natural parameters, and it'll know how to construct the log unnormalized probability density function for this prior. actually, I can plot it, so I've done this here. this prior will have two natural parameters, actually three: alpha one and alpha two, matching w1, w2 or theta1, theta2, so it just has two alphas, I don't know, I've set them to minus one half, whatever, and it has a counting variable nu, which in this case I've set to zero, but you can set it to something else. those are the prior parameters for this prior, which I get by just telling my likelihood: please construct whatever the conjugate prior is, take care of it. and then I give it some data, which I've just invented, there are some experiments that look like this. then I can construct a posterior, which will be of the same form as the prior, because it's a conjugate prior, and it has natural parameters which I can just ask the thing to construct for me by adding in the data. so what will it do internally? it'll take the data, compute the sufficient statistics, add them to the prior parameters, and that's it. so this color shading in the background, that is the log posterior of this distribution, up to normalization. it's not normalized yet, but if I want to normalize it, the only thing I need to do is find the mode, at this point, compute the Hessian, and it'll give me, spoiler alert, this Gaussian approximation that you see as two red ellipses. so how have I found those? well, I could just call an
optimizer, and actually the code for this is here, it's just commented out. this is what I had initially thought we could do: we just load an optimizer and then tell the optimizer, take the unnormalized log pdf and find whatever the mode might be. but actually, let's wait a second and see if we can do it ourselves. so if we go a little bit forward, here's the reparameterization, like this, with theta. and now, I don't want to annoy you too much with this, but it turns out you can actually find the mode yourself. so if you take the log pdf, sorry, the log normalization constant log Z from the previous slide, and you don't have to follow this argument mentally right now, but I just take the function from the previous slide, take the logarithm, compute the gradient, which you can do on a piece of paper yourself, then you get stuck twice, which is why last Thursday I couldn't do this for you yet, there was a bug in the derivation, so I had to go back and spend another hour last evening getting it right. that is the sort of price you have to pay for doing it by hand. and then you find the mode, something like this, and then we can construct the estimates theta 1 and theta 2 in closed form from this, there are just some transformations. so if we do that, we actually do this here, that's this piece of annoying code, you know, I construct this mu bar and omega bar that is on the slide, and then do some simple rearrangements to find the mode. then we can ask, and that's the cool thing, here's the point, ask the posterior to evaluate its curvature at that mode, that's this line, and it'll do it for us, because it's 2023 and it can compute Hessians. actually, in 2007 this would not have been possible, and in 2011 it would still have been like an arcane art, but in 2023, you know, we have JAX and it just does it, and it immediately gives us, this is just a bit of plotting annoyance, this approximation. and now
we have an almost perfect posterior over those parameters of this function that I just came up with by looking at a textbook. so if you encounter some data that really has these weird sufficient statistics, or you decide that that's the thing you care about, then you can define your own probability distribution, and this abstract base class will take care of all of the algebra for you, up to full Bayesian inference. so really, Bayesian inference isn't that hard: you just define your model, that's what you do in other frameworks for machine learning as well, and then afterwards we just keep track of geometry, we find modes and then shapes around the mode. and this finding of the shape, this used to be something that people historically sort of forgot about, because when people like me wrote their PhD, finding that curvature estimate was really tedious. at the time I submitted my PhD thesis I would have said, this is stupid, I'm not going to do that, because if you have more than 5 parameters, that's 25 entries in the matrix, right, 5 by 5, I'm not going to sit down and compute all of those, there's going to be a bug in there and I'm not going to get it right. but these days, with autodiff, this thing has received some rejuvenation; this wouldn't have been a topic in a lecture 10 years ago. in my opinion it's the way we should think about probabilistic reasoning going forward, not as the correct and final and ultimate answer, but to understand why inference isn't actually so hard: you just need to find the curvature, I keep doing like this because I'm thinking of the negative log pdf, right, around the mode, and it gives you a Gaussian approximation, and everything is done with that. let's take a quick break, 5 minutes, quarter past 11 I'll continue. I want to now finally get to the point that was actually supposed to be the core of today's lecture, but it's become sort of an afterthought, one that we'll spend so much time on over the next 5, 6, 7, 8, 9, 10, 12 lectures that it
maybe doesn't matter so much. maybe the best way to frame this is a question that was asked by one of you last week in the feedback: if at the end we're going to do Laplace approximations for everything, why are we even doing all this business with these annoying exponential families in the first place? aren't you just going to eventually talk about Gaussian distributions anyway? and the answer is sort of yes. there's this beautiful structure that I wanted to talk about, and then we'll focus for the majority of this class on Gaussian distributions, and we'll get to talk a lot about why I do that, so I'm not going to do it now. but maybe the question is, why should we actually care about this Gaussian I keep pointing out? I made it sort of simple for myself, because I was pretty sure that everyone in this room, maybe I should check, knows what the Gaussian distribution is, or has encountered the Gaussian distribution before. here's the one chance where everyone can, ah, who knows the normal distribution? just checking whether everyone is awake or not. not everyone's hands are up, okay, I'm not trying to make fun of those who didn't raise their hands, but you've all heard of this bell curve. if you've had some ever so basic statistics class, you've heard about this man, where's the pointer, there, and actually the painting is the other way around, it's a mirrored painting, because otherwise it wouldn't have fit on the money. and this distribution, which you cannot read so well in the back, but here it is, and in the background there's the city where he did most of his work, Göttingen, and this is what the money looked like when I went to high school. and this distribution, this bell-shaped distribution, this normal distribution, which has this algebraic form that's actually on the money, f of x equals 1 over the square root of 2 pi sigma squared, times e to the minus (x minus mu) squared over 2 sigma squared, somehow seems to be really important. so important,
in fact, that I've reduced pretty much all exponential families to this distribution. here it is again, this bell curve, which is often parameterized by these so-called canonical parameters mu and sigma squared, which are, as it happens, the mean of this distribution, and the mean is an algebraic, or rather analytic, property, right, it's the integral over the linear function times the distribution, the expected value of the linear function, the first moment, and sigma squared is the second central moment, the variance of x; without the square we talk of the standard deviation. okay, so this is the boring bit for most of you, and you've heard that apparently IQs are distributed like this, and heights, and whatever else. why is this distribution so important? there are many possible answers. of course, the central limit theorem, that's the typical answer, it usually comes up. so there's this fundamental, the central statement of stochastics, I don't know, maybe stochastics is the right term for it, which, by the way, who's heard of the central limit theorem? I'm not gonna ask you to recite it, but who could recite it if I asked? okay. so the theorem says something like: the normalized sum of IID random variables is asymptotically distributed like this, and there are variants of it that even work without IID but have some other requirements. so this is a good motivation for it, but actually it's a bit post hoc. the reason we care about Gaussian distributions is that they have wonderful analytic properties, and the central limit theorem is a little bit like Taylor's theorem, which says that any function that's sufficiently regular can be locally approximated by a linear function, plus higher order terms. and it would be weird to say linear functions, linear algebra, are important because of Taylor's theorem; it's the other way around, Taylor's theorem is useful because linear algebra is so super cool, right, you can do so many interesting
things with linear algebra, and especially with computers and linear algebra, and therefore it's very useful that most functions can be locally approximated by a linear function. similarly, Gaussians have wonderful properties because they actually are just linear algebra: they reduce probability theory, as we're gonna see in a moment, onto linear algebra, and because linear algebra is so fundamental, the central limit theorem helps us, because it allows us to talk about most distributions in terms of Gaussians. it provides a theoretical motivation, a kind of rhetorical argument, to say I'm allowed to talk about Gaussians, because after lots and lots of data, locally things will look a little bit like a Gaussian. and in fact what I did here with Laplace is pretty much that argument: it's okay to find the mode and do a quadratic approximation around the mode, that means we're missing all the third and fourth and fifth and higher order terms, but it's gonna be good enough, because once you have lots of data, the distribution will be relatively narrow, and that's what the central limit theorem then says. so what are these wonderful algebraic, or analytic, properties of the Gaussian distribution? let's talk about them a little bit. first of all, we already saw that this distribution here is actually an exponential family; it's just an exponential family with natural parameters that look a little bit different from the ones we usually encounter. actually, let's first talk about the sufficient statistics, because they are the defining quantities of exponential families. the sufficient statistics are, depending on how you want to define them, either the mean, so the first moment of the data, the sum over x, and the second non-central moment, so the sum over x squared, or, depending on how you want to deal with the base measure, we can also introduce a constant, so it's like the zeroth, first and second moments of the data that define the sufficient
statistics. sometimes it can be more convenient to define them with a minus, doesn't really change much. and the natural parameters are one over the variance, called the precision, the inverse variance, and that precision multiplied with the mean, which is sometimes somewhat awkwardly called the precision-adjusted mean. but we've talked so long about exponential families that I don't want to annoy you further with that. of course, this fact that they are exponential families means that we can do very efficient inference from data with Gaussians, and here I'm not going to bore you with this, because you know it by now: we have a Gaussian prior, which is this curve, and single observations that are Gaussian distributed, so that means there is a likelihood that looks like this, then the posterior will also be a Gaussian. this is like inferring the mean of a Gaussian with known variance, and we've done that, and that's this red curve. and all across science it has become commonplace to not actually draw this entire likelihood, but just draw something like this, a circle with an error bar, and whenever someone draws a circle with an error bar, they mean that there is this distribution underneath, unless they really go into a long explanation of why it isn't. and if you have lots and lots of data, then we can keep doing that. if you have lots and lots of likelihood terms like this, even with varying variances, we can just add them, and adding them in literally means adding them into the sufficient statistics. so we update the precision, there it is, by starting with the prior precision and then summing up the precisions of all the individual observations, and we sum up the precision-adjusted mean by starting with the prior precision-adjusted mean and adding up all the precision-adjusted means of the data, which involve the actual observations. we've seen this example already, and this was really the original reason why Gauss cared about this, he wrote about this
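To make the precision bookkeeping concrete, this update is only a few lines of numpy. A minimal sketch, where the function name and the numbers are mine, not from the lecture code:

```python
import numpy as np

def infer_gaussian_mean(mu0, var0, data, noise_var):
    """Infer the mean of a Gaussian with known noise variance.

    Prior N(mu0, var0) on the unknown mean; observations y_i ~ N(mean, noise_var).
    In natural parameters, precisions and precision-adjusted means simply add up.
    """
    data = np.asarray(data, dtype=float)
    precision = 1.0 / var0 + data.size / noise_var        # sum of precisions
    precision_mean = mu0 / var0 + data.sum() / noise_var  # sum of precision-adjusted means
    return precision_mean / precision, 1.0 / precision    # posterior mean and variance

post_mean, post_var = infer_gaussian_mean(mu0=0.0, var0=4.0, data=[1.0, 2.0, 3.0], noise_var=1.0)
print(post_mean, post_var)
```

Each extra observation just adds one more term to each sum, which is why streaming data is no problem for this update.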
and by the way, in a moment I'm going to have to eat my words. here's, by the way, the original painting, as you can see it's not mirrored, on the money it was just flipped around. Gauss was interested in the paths of the planets around the Sun. this was after the Copernican Revolution, people had accepted that the planets revolve around the Sun, but he needed very precise measurements of those orbits, for very scientific reasons, and so he kept measuring them, and he realized that every time you measure, you make an error. so how do you keep track of those errors, which happen at different points in time across the trajectory? you measure Saturn here, and then there, and then there on the sky, and somehow there should be a way to keep track of this. and he wrote this amazing text, called, in Latin, something about the paths of the celestial bodies that revolve around the Sun in conic sections, because that's what ellipses are, intersections of cones. and he noticed that what you can do is add up the squares of the errors, and that will lead to an equation that he can solve in closed form. and now I have to eat my words, because it actually turns out that Gauss wasn't the inventor of that normalization constant. if you read the original text, here, sorry, it's not the original, it's the translated German text, because the Latin one I guess no one here in the room could read, he actually says: here's this distribution, which involves e to a constant times delta delta, so the square of delta, times another normalization constant. further you can see that K has to be negative, it has to be a negative number, otherwise it doesn't normalize, blah blah blah, so that omega can even have a maximum, which is why we see that we have to set it to something, and then, by "the elegant theorem first found by Laplace", so thanks to Laplace, who actually found out that this integral is the square root of pi,
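Laplace's result, in Gauss's notation, is just the statement that exp(-h² δ²) integrates to √π / h over the real line. A quick numerical check, with an arbitrary value for h:

```python
import numpy as np

h = 0.8                                        # arbitrary positive precision-like constant
delta = np.linspace(-12.0, 12.0, 600_001)
f = np.exp(-(h * delta) ** 2)                  # Gauss's unnormalized error density
numeric = np.sum(f) * (delta[1] - delta[0])    # Riemann sum over a wide interval
closed = np.sqrt(np.pi) / h                    # Laplace's closed form
print(numeric, closed)
```

Substituting u = h δ reduces this to the basic Gaussian integral ∫ e^(-u²) du = √π, which is the theorem Gauss is crediting to Laplace.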
so Gauss credits this wonderful result back to Laplace, and we can write our distribution like this, h over the square root of pi, times e to the minus h h delta delta, he couldn't write a square yet because that notation didn't exist yet. so he realized that after that, you can do a computation in closed form, and that's really beautiful, because he comes up with an algorithm for solving such systems of linear equations in this text, which these days is called Gaussian elimination. it's maybe the very first computer algorithm ever invented, because he writes it in the form of a mechanism, an algorithm, rather than a long-winded argument about how he finds some numbers, for maybe the very first time, at least in western literature. these days, of course, we think of this with hindsight, with all of this notation for linear algebra that has been invented over the last 200 years, in a much more compact form. we have learned how to write these things such that they are easy for our eyes to parse, and that's thanks to the hard work of people over generations who have fiddled with this notation to make it ever more convenient, and that still keeps going on to this day, as we now think about them in terms of arrays, which have multi-dimensional inputs and can be combined and consumed in various ways. so this distribution has a multivariate form, it can be generalized to the multivariate case like this, which, again, just to double check, raise your hand if you've seen an expression like this before. good, thank you. so there's a normalization constant, which involves 2 pi raised to the power of half the dimension of the problem, times the square root of the determinant of the covariance matrix sigma, maybe we need to talk about the determinant at some point, times the exponential of a quadratic form. and the fact that it's a quadratic form, that's the thing that makes it really work, because quadratic functions are wonderful. the main thing for this to work is, as Gauss actually points out, that K has to be, well, not negative; these days we say that sigma has to be a positive definite matrix. and I sometimes ask people in job interviews, what's a positive definite matrix, so maybe you can think for yourself whether you still remember your undergraduate linear algebra class where you learned about positive definite matrices. the standard textbook definition is this: a matrix is called symmetric positive semi-definite, and we're going to constantly flip definite and semi-definite back and forth, because it doesn't actually matter, it's just a case of keeping track of zeros correctly in the code, if for any vector v that you can throw at this matrix from the left and the right, the resulting inner product, v transpose A v, is always non-negative. we need this because otherwise this distribution won't look like a bell: it could deviate upwards in some direction, and then we can't normalize it to one, because it grows without bound. so this is a hard requirement on the covariance, that it has this proper algebraic structure. once we have that, though, everything becomes super efficient, because what we're left with, if we leave out all the business with the exponentials up here, is a quadratic function, and quadratic functions are some of the most regular objects known to humanity. for example, the sum of two quadratic functions: if you think of a polynomial, a0 plus a1 times x plus a2 times x squared, and another one, b0 plus b1 times x plus b2 times x squared, the sum of those two is another quadratic function. so what does this correspond to, the sum here?
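A tiny numerical check of that closure property, with made-up random coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=3)            # coefficients a0, a1, a2 of one quadratic
b = rng.normal(size=3)            # coefficients b0, b1, b2 of the other

x = np.linspace(-2.0, 2.0, 201)
qa = a[0] + a[1] * x + a[2] * x**2
qb = b[0] + b[1] * x + b[2] * x**2
c = a + b                         # candidate coefficients of the sum
qc = c[0] + c[1] * x + c[2] * x**2

# the sum of two quadratics is the quadratic with summed coefficients,
# hence the product of their (negative) exponentials is again such an exponential
print(np.allclose(qa + qb, qc), np.allclose(np.exp(-qa) * np.exp(-qb), np.exp(-qc)))
```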
well, it's in the exponential, so we have to multiply: the product of two of these functions will be another one of these functions, and that's what we use in conjugate prior inference. it also means that if you have two Gaussian terms about a random variable, you can multiply them with each other, and out comes another Gaussian. so if we have one piece of knowledge about x that looks like this red Gaussian distribution, and another piece of knowledge that looks like this blue Gaussian distribution, then their product, which is what we do in Bayesian inference, is another Gaussian distribution, which has a bunch of parameters that we need to compute. and now that we know that this is an exponential family, we actually know what those parameters are: the new precision is the sum of the precisions, so the new covariance is the inverse of the sum of the precisions; the new precision-adjusted mean is the sum of the precision-adjusted means, so the new mean is the inverse of the new precision times the sum over the precision-adjusted means. that's it. of course we have to do that in code, and we'll talk about that on Thursday. but a very, very important note: this is not the same as the distribution of the product of two Gaussian random variables. what I've done here is I've multiplied two Gaussian probability density functions, so that's one distribution over one variable called x, about which we have two pieces of information. you can think for yourself about what happens if you would like to know the distribution over the product of x1 and x2 if they are both Gaussian distributed, and, spoiler alert, that object is not a Gaussian random variable. it's not going to be super important for the purposes of this class, but it's a very, very common misconception, which I keep noticing in exams. the other really cool thing about quadratic functions, which is not so easy to see, well, actually it exists in the univariate case as well: if you
have a polynomial, a0 plus a1 times x plus a2 times x squared, and you rewrite x in a linear form, so you write u equals c times x for a constant c, then this function, as a function of u, is still a quadratic function, right? there will be a0, plus a1 times u over c, plus a2 times u squared over c squared, so as a function of u, that's still a quadratic function. more generally, in linear algebra terms, any linear projection of a Gaussian random variable is again a Gaussian random variable. so if z is Gaussian distributed with a mean and a covariance, then any linear map of z will also be Gaussian distributed. so here is a Gaussian distribution in red, and what I've done is I've projected this distribution down along these dashed lines; that's a linear projection, because the dashed lines are straight lines. if you take this entire blue cloud and you sum it up along those dashed lines, out comes another Gaussian distribution, up to normalization, but normalization is easy to compute, because it's a Gaussian, so the integral is closed form. and this new distribution has a mean that is the linear map of the mean, and a covariance that is a bilinear map, from both directions, of the covariance. a special case of this is marginals: in particular, I could project the entire distribution onto one of the axes, right? so now the dashed lines are axis-aligned, and that means that the map A I'm choosing is a sort of unit-vector-type matrix: in this simple case it has a single one and everything else is 0, but in general it can be a block matrix with a unit matrix and everything else 0. and what that means, if you look at this expression up here, is that we are selecting elements of the mean and covariance. so if you have a bunch of Gaussian random variables, many of them, 5, 10, 50, a billion, 200 trillion, right?
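This selection-matrix view of marginalization is only a few lines of numpy. A sketch with an invented 6-dimensional joint distribution (none of these numbers come from the lecture notebook):

```python
import numpy as np

rng = np.random.default_rng(1)
d, keep = 6, [1, 4]                       # six jointly Gaussian variables, keep two of them
L = rng.normal(size=(d, d))
Sigma = L @ L.T + d * np.eye(d)           # some positive definite covariance
mu = rng.normal(size=d)

# marginalizing = picking out the sub-vector of the mean and the sub-block of the covariance ...
mu_marg = mu[keep]
Sigma_marg = Sigma[np.ix_(keep, keep)]

# ... which is exactly the linear-map rule (A mu, A Sigma A^T) with a selector matrix A
A = np.eye(d)[keep]
print(np.allclose(A @ mu, mu_marg), np.allclose(A @ Sigma @ A.T, Sigma_marg))
```

The point is that no integral is ever computed: marginalization over the other four variables is a pure array lookup.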
if you only care about 5 of them, then the corresponding marginal distribution is given by a Gaussian distribution which you can very simply evaluate: you just pick out the elements of the mean of this big distribution that correspond to the variables you care about, and you pick out the sub-matrix, the corresponding block of this huge covariance matrix, and obviously looking up that sub-part of the big array is easy. this means that we can keep track of a very, very, very large set indeed of random variables, if they are jointly Gaussian distributed, as long as all the things we care about are linear maps of them, and it will turn out, in 3 or 4 lectures, that we can even keep track of an uncountably infinite set of random variables. it's a wonderful property of Gaussians, because it means we can have this really, really complicated set of objects we want to keep track of, and talk about only a subset of them while completely ignoring the rest, you just forget about the rest by picking out the part you need. the other wonderful thing about quadratic functions, because that's what Gaussians are, exponentials of quadratics, and this is really not so entirely obvious in the univariate case, is that if you take one of these quadratic functions and you cut linearly through it, at any angle, then the function evaluated along that cut is also a quadratic function. this is what you do when you complete the square in a quadratic completion problem; that's exactly what a cut is, applied to this thing. so if I take this red curve and cut along this black line, so what we've previously done is we've projected onto the orthogonal of it, we take the entire cloud and project it down, but I could also evaluate the function along the line and ignore everything around it, and then I also get a Gaussian distribution out. and of course, cutting through a distribution has an interpretation in probability language: exactly, I had this picture in lecture 3 on continuous variables, with this cloud, where we can project to get the marginal
or cut through to get the conditional. conditioning amounts to computing this object, which we will talk about a lot over the next few lectures, so I'm not going to talk about it right now, but actually on the next slide, we combine them together. and the first thing you have to observe is: computing this conditional distribution, which is saying, what is the distribution of x if, sorry, if x has been observed as some linear map of some other Gaussian random variable, then the conditional is a Gaussian distribution. Gaussian distributions have two parameters, a mean and a covariance, so there they are: there's a distribution over x, with a mean and a covariance. and the first observation we make is, there is the data in here, y, sort of the observed variable that we are conditioning on, and it shows up linearly in this expression. and then there are the parameters of the distribution and the projection A, and they don't all show up linearly, but the expressions in here are all linear-algebra-type expressions: we compute A times sigma times A transpose, and in your head you already have your numpy code, it's A @ Sigma @ A.T, these are basic operations. and then there's this inverse, and we'll need to talk about this inverse a lot, because that's the part that is expensive, but it's the inverse of a matrix, so we know that computers can somehow compute inverses of matrices; they do that, it's not entirely straightforward, so we'll need to talk about it, but they can do it. and maybe the final thing to note is that the expressions in the mean and in the covariance seem to mirror each other somehow: a lot of the stuff that's in here shows up in here as well. it's not quite the same thing, but there are similar objects showing up. so maybe a first gut feeling for this: if you think of the mean as a point estimate and the covariance as an error bar, then computing the squared error bar will not actually involve much more than you need
to compute the mean. okay, so here's our big philosophical point again: is Bayesian inference expensive? well, not if it's Gaussian, because the stuff you need to do to get the point estimate is pretty much everything you need to do to get the error estimate. seems somehow weird, but it's actually true. this object, and we will talk about this many, many times, I will make big arguments about it, this object, I said that the inverse of the matrix is usually the stuff that is complicated; notice that this matrix inverse shows up both in the mean and in the covariance. so the error estimate involves a quantity that we've already computed when computing the mean, so if we are careful with our code construction, computing the error bar is in general not going to be more expensive than the mean. so that means, if everything is Gaussian distributed and the relationships between variables are linear, then we can do marginalization and we can do conditioning. what else do we need for probabilistic inference, other than marginalization and conditioning? but what is Bayes' rule? it's just the application of marginalization and conditioning; it's a theorem, it's not an axiom. so we're done: if you have marginalization and conditioning, then you can do Bayesian inference just with linear algebra. so let's see how this works: if you have one variable x that is Gaussian distributed, so it has an associated distribution with a mean and a covariance, and then you observe, through a likelihood, some data called y, which is also Gaussian distributed, where the relationship to the unknown quantity x enters in the mean of the Gaussian through a linear function, sounds like complicated stuff, and maybe it is. so if the likelihood has in particular this form, and this is actually the business part, because, as we just realized in the first half of this lecture, this first part is kind of forced upon us by the structure, it's just the conjugate prior to this observation, then the marginal over y, so the normalization constant
in Bayes' theorem, is a number that can be evaluated by evaluating a Gaussian PDF, or a log-PDF for the log evidence. And more importantly, the posterior distribution, the result of Bayes' theorem, is a Gaussian distribution whose parameters, mean and covariance, can be computed by pure linear algebra. For the moment I will leave this red thing here just standing there as it is, and we will talk about it a lot on Thursday, but the main thing to take away from this complicated expression is that this computation involves numpy code: not torch code, not scipy.optimize or Adam or whatever. It just involves an A times A-transpose, then a matrix inverse, the sort of stuff that's available in a package without fancy numerical optimization. And the terms in the mean and the covariance are related to each other: they contain terms that you've already computed. Yeah, so, teaser: I will talk about this on Thursday again. These are really very, very fundamental mathematical aspects of linear algebra, rather than of Gaussian distributions. They are related to a fundamental result actually of group theory, one that is often expressed in terms of linear algebra but is even more general than linear algebra, and which is connected to many, many different names. So depending on how you write the equation and what you talk about, there are people like Sherman and Morrison, but actually the most interesting person to talk about historically is maybe this guy here. He had one of these very late-nineteenth-century, early-twentieth-century lives: in many ways, early on, I think, a happy life, but then later on a really complicated one. As you can maybe guess from the name, he was Jewish. He was born in what is now Belarus and moved with his family to study in what is now called, I think, Lithuania, and then he moved to Berlin to study mathematics. He became the successor to Felix Klein in Göttingen, maybe the most eminent German mathematician at the time, and he was always trying to get a position in Berlin. He became a member of
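The linear-Gaussian update just described can be sketched in a few lines of numpy. This is a minimal illustration, not the lecture's actual code; the function name, the variable names, and the scalar sanity check are mine. The point to notice is that the same matrix inverse appears in both the posterior mean and the posterior covariance:

```python
import numpy as np

def gaussian_condition(mu, Sigma, A, b, Lam, y):
    """Posterior p(x | y) for prior x ~ N(mu, Sigma) and
    likelihood y ~ N(A x + b, Lam).  Illustrative sketch only."""
    G = A @ Sigma @ A.T + Lam           # marginal (evidence) covariance of y
    K = Sigma @ A.T @ np.linalg.inv(G)  # the one inverse, shared by mean AND covariance
    mean = mu + K @ (y - A @ mu - b)    # posterior mean
    cov = Sigma - K @ A @ Sigma         # posterior covariance reuses the same inverse
    return mean, cov

# scalar sanity check: prior N(0, 1), observe y = x + noise with noise variance 1
mean, cov = gaussian_condition(np.zeros(1), np.eye(1), np.eye(1),
                               np.zeros(1), np.eye(1), np.array([2.0]))
# posterior is N(1, 0.5): the error bar came essentially for free with the mean
```

Note that once `K` is computed for the mean, the covariance is just one more matrix product, which is exactly the "error bars are not more expensive" point from above.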
the Leopoldina and various other high-ranking German organizations. And then the Nazis came, and he had to stop teaching, and in 1939 he emigrated, first to Switzerland and then to Palestine, where he died in 1941 of a heart attack. He worked on all sorts of aspects of group theory, and in particular he left us, as part of a proof, actually as a margin note for a major theorem that he provided, called Schur's Lemma, a representation of how you can think about structured terms in the inverse of a matrix. So when you have a matrix and you're computing its inverse, which, as you saw, we need to do for Gaussian distributions all the time, a lot of very structured objects show up that have really beautiful interpretations. They are actions of parts of the group on some other parts that are in some sense irreducible, and they in particular involve this object M, which is called the Schur complement, and they tell us something about what the numbers in these matrices, in Gaussian distributions, actually mean. There is no time to do it now, we'll maybe do it on Thursday, but basically, thanks to results like this, first of all you can speed up the computations, but also you can look at the numbers in the matrices, and they tell you something interpretable about the variables, about their marginal and conditional independence. So the directed graphical model we saw in the second lecture turns into a matrix with zeros and non-zeros in the Gaussian case, and if you've studied theoretical computer science, you know that it's a beautiful situation when you can replace a graph with a matrix, because then you can study the matrix, which is much easier to study. So Gaussian distributions, that's the final thing I want to say for a few minutes, translate this complicated business of probabilistic inference, all these fancy mathematical statements about keeping track of measures and sigma-algebras and these abstract objects that are functions being multiplied with each other, they
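To make the Schur complement concrete, here is a small numpy check. This is a sketch under the assumption that M is a symmetric positive-definite matrix split into blocks; the specific random matrix and block sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# build a random symmetric positive-definite 4x4 matrix and split it into blocks
X = rng.standard_normal((4, 4))
M = X @ X.T + 4 * np.eye(4)
A, B = M[:2, :2], M[:2, 2:]
C, D = M[2:, :2], M[2:, 2:]

S = A - B @ np.linalg.inv(D) @ C   # Schur complement of D in M

# structured terms in the inverse: the top-left block of M^{-1} is exactly inv(S)
assert np.allclose(np.linalg.inv(M)[:2, :2], np.linalg.inv(S))

# Gaussian reading: if M is the covariance of a jointly Gaussian (x1, x2),
# then S is the covariance of x1 conditioned on x2
```

This is the sense in which the numbers inside these matrices are interpretable: blocks of the inverse encode conditional relationships between the variables.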
translate all of this into linear algebra. Products of Gaussian probability density functions are Gaussian probability density functions; linear projections of Gaussian random variables are Gaussian random variables; conditionals of jointly Gaussian random variables, as long as the relationship is linear, are Gaussian random variables. And that means that all of Bayesian inference turns into linear algebra when we use Gaussians and linear relationships between them. In the most general case, this turns into a very complicated expression that takes the entire breadth of the slide, and we will stare at it for a while over the next few weeks to figure out what's going on there. But the main thing is that if you have a Gaussian prior over a variable, and observations that are linearly Gaussian related to that variable, or actually affine Gaussian related, the generalization of linear, then the posterior distribution will be Gaussian, and the parameters of that Gaussian will be linear algebra expressions of the parameters of prior and likelihood. And in fact any linear map of that posterior random variable will also be Gaussian distributed, like this. And because computers are good with linear algebra, and because autodiff is about linear algebra, Gaussians are everywhere, and Gaussians will be to probabilistic machine learning what autodiff is to deep learning: they are the fundamental object we want to work with. So over the course of this class we will build up a toolbox for building machine learning algorithms from a probabilistic perspective, which I will add more and more to. You've already encountered the fundamental framework, marginalization, conditioning and Bayesian inference, and we've encountered a bunch of structures that help us along the way, both for modeling and for computation. I will talk more and more about these. With this lecture we have ended our look at exponential families as the algebraic structures that translate potentially intractable Bayesian inference into tractable
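The closure properties just listed can be verified directly. Here is a tiny numpy sketch of the affine-map rule, with made-up numbers: if x ~ N(mu, Sigma), then z = A x + b ~ N(A mu + b, A Sigma Aᵀ):

```python
import numpy as np

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 1.0]])   # project onto the sum of the two components
b = np.array([0.0])

# closure under affine maps: z = A x + b is Gaussian, and its parameters
# are plain linear algebra expressions of the parameters of x
z_mean = A @ mu + b          # mean A mu + b
z_cov = A @ Sigma @ A.T      # covariance A Sigma A^T
# here z ~ N(0, 4), since 2 + 1 + 2 * 0.5 = 4
```

No sampling, no optimization: the distribution of the transformed variable is computed exactly by two matrix products.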
Bayesian inference, period. And now we're entering Gaussians, as a special case of exponential families, which map not just to tractable inference but to linear-algebra-tractable inference. And we've already seen today, right in the middle between the two parts of the lecture, that the Laplace approximation is one way to match these two to each other: you take any exponential family, and then, with some caveats about the domain of the parameters, if you can compute curvatures and the mode, then you can map from an exponential family to a Gaussian distribution. So with that I'm at the end for today. Please leave some feedback. From Thursday onwards we will talk much more hands-on about Gaussian distributions, and you'll see some more streamlit apps again, and actual pictures and code. Thank you.
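As a small postscript, the Laplace approximation mentioned above can be sketched for a density whose mode and curvature are available in closed form. The Gamma example and variable names here are mine, not from the lecture: match a Gaussian to the mode and the curvature of the log-density there.

```python
import numpy as np

# Laplace approximation of a Gamma(alpha, beta) density (alpha > 1):
# fit a Gaussian at the mode using the curvature of the log-density.
alpha, beta = 5.0, 2.0

mode = (alpha - 1) / beta            # argmax of log p(x) = (alpha-1) log x - beta x + const
# d^2/dx^2 [ (alpha-1) log x - beta x ] = -(alpha-1) / x^2
curvature = -(alpha - 1) / mode**2   # second derivative of the log-density at the mode
laplace_var = -1.0 / curvature       # Gaussian variance = inverse negative curvature

# resulting approximation: N(mode, laplace_var) = N(2.0, 1.0)
```

In general the mode would come from an optimizer and the curvature from autodiff, but the recipe is the same: mode and curvature in, Gaussian mean and covariance out.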