Hello and welcome to probabilistic machine learning, lecture number 15. We have advanced quite a bit in the course so far, more than halfway through, and maybe you've noticed a trend emerging. We saw in the very first lecture that probabilistic inference, the mechanism created by probability theory, extends propositional logic and the rules of propositional reasoning beyond statements of discrete truth value by distributing truth over a space of hypotheses. This gives rise to the mechanism called Bayesian inference, encoded in the fundamental equation called Bayes' theorem. We noticed already in lecture two that this process is, in general, computationally very hard. The fundamental reason is that distributing truth over a space of hypotheses means you have to keep track of all of these hypotheses when reasoning about even a single one of them. You can see this reflected in Bayes' theorem: to make a statement about the truth value of one particular hypothesis under the data, you have to keep track of all the other explanations, because you have to compare against them and sum them out. More generally, this means that integrals, computations that involve volumes in the hypothesis space weighted by the probability measure, are the core operation of probabilistic inference, whether to compute evidences in Bayes' theorem or simply to make analytic statements about probability distributions like this posterior. For example, you might want to know where most of the mass of a distribution is, how wide the region is where that mass is located, or where the centre of that probability mass lies. For these kinds of operations you need to compute moments or, more generally, expected values of functions, and that can in general be very hard. So the practical process of probabilistic reasoning, what we are trying to achieve in this entire course, is fundamentally a computational task. What I have tried to lay out in front of you in the past few lectures is a collection of tools to simplify, or at least make tractable, this process. For example, we saw that conditional independence, as reflected in the graphical view of graphical models, can drastically simplify computations: if one part of your inference problem becomes independent of other parts when conditioned on certain variables, you can use this to separate the process into individual steps. We also saw that sampling methods are a way to turn infinite sums or continuous integrals into discrete operations; they might still be hard, but at least they are tractably hard. We then encountered, and this became a theme for a large part of the lecture, the wonderful framework of Gaussian distributions, which have an intricate connection to linear algebra: if you are reasoning about a set of variables connected to each other by linear operations, and the joint distribution over these variables is Gaussian, then all the operations we need for probabilistic inference, marginals, conditionals and so on, are again Gaussian, and their parameters can be computed with linear algebra.
We used this insight to build a very powerful framework for large parts of machine learning, essentially more or less any form of supervised machine learning. We first saw that we can use the Gaussian framework for functions described in a parametric, finite-dimensional way with features. We then saw that we can learn these features, using a process connected to what we now call deep learning, by maximizing likelihoods. There is another, maybe equally exciting framework, which does not use a hierarchical deep structure of features but instead extends towards the limit of infinitely many, infinitely wide neural networks, if you like: kernel machines, and this gave rise to the concept of a Gaussian process. We saw how powerful this framework is and that it can be adapted to particular settings. For example, it allows particularly fast inference if the data has a temporal input structure, and we also saw that we can extend the framework even to settings where the function we are trying to learn does not have real-valued outputs. If it is discrete, if it is a classification problem, or has some other kind of structure, then we might be willing to squeeze it a little, suspend our disbelief to some degree, and approximate the associated non-Gaussian distributions using the Laplace approximation: basically turning something that isn't actually Gaussian into a Gaussian, by pretending that we want a Gaussian and using a particular recipe to construct such an approximation. What I want to do today is open the final third of this lecture by breaking with the Gaussian framework, or in other words, seeing where we can go if we deliberately leave Gaussian distributions behind. The first time we encountered a situation in our modelling in which we had to give up Gaussian distributions was in Gaussian process regression, when we wanted to learn representations for regression models that themselves used Gaussian distributions. Remember that we were trying to learn functions that map from an input domain over x in a nonlinear fashion to y, representing the nonlinear function through a set of features weighted linearly by some weights; that can easily be done with Gaussian distributions, and it is the main power of the Gaussian framework. But when we asked ourselves what the right features are to represent this function, we encountered the problem that the corresponding likelihood, which happens to be a marginal distribution of this joint model, is in general not a linear-Gaussian distribution: it is not a Gaussian that contains the quantity we care about, here labelled theta, as a linear map in its mean or argument. So to learn this parameter theta, what we did so far is essentially jettison the probabilistic framework: we gave up on probabilistic reasoning, said we don't know how to do this, threw up our hands, and just computed the likelihood itself, or maybe a posterior by multiplying in a prior (adding, in log space), and then found the minimum of this negative log posterior, which is the maximum of the posterior, to get a point estimate.
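As a reminder of what that point-estimation step looks like in practice, here is a minimal sketch (not the lecture's actual code): it finds the maximum of a posterior by numerically minimizing the negative log posterior with a generic optimizer. The specific toy model, the Gaussian noise level and the Gaussian prior variance are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup (illustrative, not the lecture's model):
# observations y_i = f(x_i; theta) + Gaussian noise, Gaussian prior on theta.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 30)
y = np.sin(1.5 * x) + 0.1 * rng.standard_normal(x.shape)

def f(x, theta):
    # a simple nonlinear parametric model: theta[0] * sin(theta[1] * x)
    return theta[0] * np.sin(theta[1] * x)

def neg_log_posterior(theta, sigma=0.1, prior_var=10.0):
    # negative log likelihood (Gaussian noise) + negative log prior (Gaussian), up to constants
    resid = y - f(x, theta)
    nll = 0.5 * np.sum(resid ** 2) / sigma ** 2
    nlp = 0.5 * np.sum(theta ** 2) / prior_var
    return nll + nlp

# MAP estimate = argmin of the negative log posterior
theta_map = minimize(neg_log_posterior, x0=np.array([0.5, 1.0])).x
print(theta_map)
```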
So of course this is in some sense disappointing, because it is not really probabilistic anymore: it does not provide uncertainty over these variables theta, and it does not allow us to integrate out an entire space of hypotheses over what theta might be. Back when this situation arose in the lecture, I simply presented it as a fact: I said, at this point we don't know how to do probabilistic inference anymore, so let's throw it out and use point estimation. That was a good decision back then because it allowed us to continue quickly, but now that we have a little more time, and have seen what we can do with this powerful framework, maybe we can look at this again and wonder something some of you may already have wondered back then: do we really have to be so quick and so radical in abandoning probabilistic inference? What exactly is it in this situation that makes the inference intractable? Back then I just assumed there is a general feature function, so we know nothing about its shape, and therefore we gave up; but of course there are situations in which you actually can learn such parameters in closed form. The Gaussian framework we've used so far for regression is a particular example of such a situation. Another way of phrasing the key property we've been using for regression in the Gaussian framework is this: suppose you get observations from a Gaussian distribution for which you just don't know the mean. You draw a bunch of samples that all come from a Gaussian with known noise covariance but unknown mean; in regression we get to see a linear map of the unknown function, but otherwise it is the same setting. Call the unknown quantity mu, which happens to be a scalar in this little plot, and we just get observations of it at various points. What we've done in regression so far, essentially, is notice that if we choose a prior on this unknown parameter mu which is itself Gaussian, with another mean and covariance, then posterior inference is possible in closed form, because the product of a Gaussian prior and a Gaussian likelihood is another Gaussian, and it happens to have parameters of this somewhat complicated shape. That means if we observe several samples from a Gaussian distribution, we can include them in the posterior in an analytic fashion, and we always get back a Gaussian distribution. So is this situation unique? Is this the only case in which we can do closed-form analytic inference on a variable, or a set of variables, in our model?
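Before answering that, here is a minimal sketch of the closed-form update just described, for a scalar unknown mean under the stated assumptions (known noise variance, Gaussian prior); the particular numbers are made up for illustration.

```python
import numpy as np

def gaussian_mean_posterior(x, m0, v0, noise_var):
    """Posterior over an unknown scalar mean mu, given x_i ~ N(mu, noise_var) i.i.d.
    and a Gaussian prior mu ~ N(m0, v0). Returns posterior mean and variance."""
    n = len(x)
    post_var = 1.0 / (1.0 / v0 + n / noise_var)               # precisions add
    post_mean = post_var * (m0 / v0 + np.sum(x) / noise_var)  # precision-weighted means add
    return post_mean, post_var

# illustrative data: 10 draws from a Gaussian with true mean 1.3 and noise variance 0.25
rng = np.random.default_rng(1)
x = 1.3 + 0.5 * rng.standard_normal(10)
print(gaussian_mean_posterior(x, m0=0.0, v0=4.0, noise_var=0.25))
```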
Well, actually no. Earlier in the course, in lecture 3, before we even encountered Gaussian distributions, we already saw a situation that is conceptually quite similar. You encountered it in the example I did on the proportion of people in the population who are wearing glasses. That is inference on an unknown variable, which we might call f, which is a probability, so a number between 0 and 1, and we get observations that are draws with that probability: coin tosses, if you like, random variables that are either 0 or 1, and n such draws. The probability for them individually to fall into one or the other class is given by this expression, which we can simplify by introducing variables n0 and n1 for the number of such observations, for example observing n1 people who wear glasses and n0 people who do not. We saw back then that if we choose as a prior over this unknown variable f the probability distribution we called the beta distribution, which is in some sense of the same form as this likelihood, it also contains terms f to some power and (1 minus f) to some power, and it seemed convenient to write the exponents as a parameter minus 1, then the posterior over f is tractable, in the sense that the posterior is of the same form as the prior: the observed counts just need to be added in the exponents, so we take the prior and add the numbers n1 and n0 in the expression. The only challenge, which is actually present in this framework in general, is that we need to compute the normalization constant of this distribution. Laplace actually couldn't do this integral, it's called the beta integral; we had to wait for Euler to sort it out. But notice that this is really the only challenge here: if you know how to compute this integral, you can do closed-form inference. Laplace approximated this integral using a Gaussian approximation; today we don't have to do that anymore, because we have computers to do it for us. So if you have a way to compute this integral, you have a closed-form inference process for an unknown variable that isn't Gaussian-distributed at all, it is a number between 0 and 1, and you saw back then that these distributions don't look anything like Gaussians. I mentioned back then that this concept, finding a prior distribution that in this sense fits the likelihood, is amenable to it, meaning that the posterior arising from this likelihood under this prior has the same analytic or structural form as the prior, is called a conjugate prior: this kind of prior is called conjugate to this likelihood because the posterior is of the same form as the prior.
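Here is a minimal sketch of that beta update, assuming counts n1 and n0 of positive and negative observations; the prior parameters and counts are illustrative. The normalization is exactly the beta integral mentioned above, which scipy already knows.

```python
from scipy.stats import beta

def beta_posterior(a, b, n1, n0):
    """Conjugate update: Beta(a, b) prior on f, Bernoulli observations with
    n1 positive and n0 negative outcomes -> Beta(a + n1, b + n0) posterior."""
    return a + n1, b + n0

a_post, b_post = beta_posterior(a=1.0, b=1.0, n1=27, n0=73)   # e.g. 27 of 100 wear glasses
posterior = beta(a_post, b_post)
print(posterior.mean(), posterior.interval(0.95))  # posterior mean and a 95% credible interval
```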
Now it turns out that this is actually not the only case in which such conjugate priors exist. The Gaussian inference framework we just looked at is another situation where we make observations under some likelihood and have found a prior such that, when we multiply prior and likelihood, the posterior looks like the prior. There it just happened that prior and likelihood were of the same shape, both Gaussian. Here, arguably, that is not the case, because the two distributions are in some sense different: they look conceptually quite similar, but notice that one is a distribution over x and the other is a distribution over f. As a function of f this object looks quite different from how that object looks as a function of x: here the argument sits in the exponent, and here it does not. It turns out that there is a generalization of this concept from the binary case, where we observe members of two classes, in or out, to the multi-class case, and it is due to Peter Gustav Lejeune Dirichlet. He is probably best described as a western European mathematician of the late age of enlightenment. He was born in a small town called Düren, which lies between Aachen and Cologne and back then was actually French. His grandfather, and that is the story behind his name, came from a small town to the north, in Belgium, called Richelette, so he was "le jeune de Richelette", the young one from Richelette, hence Lejeune Dirichlet. He became one of the great mathematicians of the 19th century, a contemporary of Humboldt and Gauss and Poisson and Fourier and many other great mathematicians, and he married a sister of Felix Mendelssohn Bartholdy, so he must have lived in really enlightened society. He made many great contributions to mathematics that go way beyond the simple observation we are about to make, but this one seems at least to be due to him. The observation is that we can take the situation we just had with the beta distribution in the binary case to the more general case in which there are k possible classes of observations. For example, every person might have one of k features in some observation: who is wearing glasses, who is wearing contact lenses, and who is wearing neither, that's already three. If we make such observations, each falling into class x with probability f_x if you like, where x runs from 1 to k, and we make n such observations, then the situation is very similar: every single observation comes with a certain probability, and if we count n_k observations of each individual class, say three observations of the first class, five of the second, and so on, then this function can be rewritten in this form, with not n terms in it anymore but only k. That is convenient, of course, because it summarizes the process from an unbounded number of data points into a finite number of classes. And we can think of a corresponding prior distribution that is conjugate to this expression; the argument is extremely similar to the one on the previous slide, so I can just show it to you. If we choose our prior to be a distribution whose business end is a product over individual terms f_k, one per class, each raised to some positive power, then the posterior, p of f given x, is of the same form: it again involves a product over the f_k, where the counts n_k simply get added to the alpha_k. The only trick, once again, is that we need to know the normalization constant of that distribution, and that happens to be what you might call the multivariate beta integral, the corresponding generalization, which you can imagine as the integral of this kind of expression over all the f_k. This distribution is called the Dirichlet distribution, after him.
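Here is a minimal sketch of that Dirichlet update, with illustrative prior parameters and counts; the log of the multivariate beta integral is computed from log-gamma functions, which is one way of "having access to this integral".

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_posterior(alpha, counts):
    """Conjugate update: Dirichlet(alpha) prior over class probabilities f,
    categorical observations with per-class counts n_k -> Dirichlet(alpha + counts)."""
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def log_multivariate_beta(alpha):
    """Log of the Dirichlet normalization constant (the multivariate beta integral)."""
    alpha = np.asarray(alpha, dtype=float)
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

# illustrative: Dirichlet(1,1,1) prior, counts for glasses / contact lenses / neither
alpha_post = dirichlet_posterior([1.0, 1.0, 1.0], [27, 11, 62])
print(alpha_post, alpha_post / alpha_post.sum())   # posterior parameters and posterior mean
print(log_multivariate_beta(alpha_post))           # its log normalization constant
```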
So if we have access to this integral, then inference becomes closed-form, and we can keep track of uncertainty over a large number of individual probabilities for individual events without having to resort to complicated numerical operations. So now we have two examples of closed-form inference: the Gaussian one, and, in general, the Dirichlet one, because the beta inference used by Laplace is essentially a special case of the Dirichlet. You might be wondering whether those are the only two. But we saw a structure here: we looked at the likelihood as a function of the unknown variable, and wondered whether we could construct a prior distribution that is conjugate. So let's see if we can do this in a more manual way for a situation we haven't encountered yet. Say we have observations from a Gaussian distribution, but we actually know the mean and just don't know the variance. Notice that this is quite different from the situation we've had so far: so far we assumed we know the noise but don't know the mean, and we used that to learn functions of x by putting a Gaussian distribution over those functions, the corresponding Gaussian prior for the mean, and then making observations of them with noise. In both cases we assumed we know the variance: we chose the kernel of our Gaussian process prior, and we chose the variance of our likelihood. Now, what happens if we don't know the variance but do know the mean? Maybe we can use this idea of a conjugate prior. To do so, let's look at the logarithm of this Gaussian likelihood. You may have noticed that it is often convenient to look at the logarithm of probability distributions, because products of prior and likelihood then turn into sums of log prior and log likelihood, and we can look at the individual terms. The logarithm of this Gaussian distribution is given by this expression, just the Gaussian density with the logarithm taken. Here is what we think of as the business end of the Gaussian, (x minus mu) squared divided by sigma squared, and the other two terms are what we think of as the normalization constant, the square root of 2 pi and the square root of the variance in the denominator. If you look at this as a function of the variance, though, then this normalization constant plays a really non-trivial role, because sigma is part of it, so we cannot just forget about it if we want a prior over sigma. So if we continue the thought process from the previous few slides and ask what kind of prior would be conjugate to this expression, we notice that we will probably need an expression that involves a logarithm of sigma squared, and something that is a rational function of one over sigma squared. It turns out that such a function exists, and it is connected to an integral that is famous for various reasons, a very important special function. Just to add one more name to our interesting collection, we can connect it to the Swiss mathematician Daniel Bernoulli, one of the famous Bernoulli family of mathematicians. We could also attribute it to Euler, but Euler did so much that maybe it is better to assign this one to Bernoulli; it actually arose from an exchange of letters between Bernoulli, Euler, and Goldbach. So how does that work?
Let's say we have a log prior of this form, and this is really the tricky bit: we think of this function as a function of sigma. There will be a term that involves the logarithm of sigma, or minus the logarithm of sigma squared, so let's actually think of sigma squared as the variable we really care about, that makes things easier; and another term that involves one over sigma squared. So we realise we might as well think of it as a function of sigma to the minus two, because then we just have that variable here, and inside the logarithm here. And then there are, of course, coefficients in front of these terms, so let's give them names, alpha and beta, and for historical reasons it makes sense to write alpha plus one here. It turns out that this is itself a probability distribution, called the gamma distribution, or rather this is the logarithm of that distribution. If you take the exponential, you get an expression that looks like this, up to normalization. Of course this is not yet a probability distribution; to make it one, we have to normalize it so it integrates to one, and that means we need to know the normalization constant, which here I've just called Z, so exponentiating gives one over the exponential of Z, and we need to know what that is. To find it, we need to take an expression like this and integrate it over sigma to the minus two. It turns out that this can be done, and how to do it is due to Bernoulli, who observed that, first of all, this is a function of the form x to some power times e to the minus x, apart from the parameter beta here; so we can do a change of variables, basically rescale x, and that produces this additional factor beta to the alpha, which comes from the change of variables, as you can probably imagine. Then we are left with an integral over x to some power alpha times e to the minus x, and that happens to be the gamma integral; that is the name of this important function. If we use this prior, then the posterior distribution, given observations from a Gaussian with unknown variance sigma squared, is again a gamma distribution, a distribution of the same form over the inverse variance, with parameters given by whatever the original parameters of the prior were, plus update terms: here we have an alpha in front of the log of sigma to the minus two, and in the observations there will be n terms, each introducing one half times the logarithm of sigma squared, so we get an additional n over two in our posterior; and for the other parameter, beta, the thing that multiplies one over sigma squared, the likelihood contributes n terms of this form, so what we need to add to beta is one half times the sum of these individual squared deviations. And that gives us a way to learn the variance of a Gaussian from data. Interesting. So now we know how to learn the mean of a Gaussian, we've been doing that for a long time now, and we also know how to learn the variance of a Gaussian if we know the mean. What do we do if we know neither the mean nor the variance? Here I'm going to give a quick answer and leave the detailed derivations to you; I might set that as a homework.
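Before moving on to that, here is a minimal sketch of the known-mean, unknown-variance update just derived, written for the precision (the inverse variance); the prior parameters and data are illustrative.

```python
import numpy as np
from scipy.stats import gamma

def gamma_precision_posterior(a, b, x, mu):
    """Conjugate update for the precision lam = 1/sigma^2 of a Gaussian with known mean mu:
    Gamma(a, rate=b) prior -> Gamma(a + n/2, rate=b + 0.5 * sum((x - mu)^2)) posterior."""
    x = np.asarray(x, dtype=float)
    return a + 0.5 * len(x), b + 0.5 * np.sum((x - mu) ** 2)

rng = np.random.default_rng(2)
x = 2.0 + 0.7 * rng.standard_normal(50)              # known mean 2.0, unknown std 0.7
a_post, b_post = gamma_precision_posterior(1.0, 1.0, x, mu=2.0)
lam_mean = gamma(a_post, scale=1.0 / b_post).mean()  # posterior mean of the precision
print(a_post, b_post, 1.0 / lam_mean)                # rough point estimate of sigma^2
```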
So suppose we have observations x, data, difficult to collect, that come from a single Gaussian distribution, drawn i.i.d. with unknown mean and unknown variance. Then we can essentially, and I'll just tell you that this is the case, do the analogous step to what I just did on this slide: we take the logarithm, look at this likelihood as a function of the unknown variables, here mu and sigma, and then concoct a functional form for the prior, where the form of the prior is dictated by the shape of this likelihood. What we are then left to do is rearrange the expression until we can figure out how to compute the integral over the unknown variables, over mu and sigma. That turns out to be a product: this conjugate prior is a Gaussian over mu given sigma, with a mean parameter and a scaling of the variance called nu, times a distribution over sigma inverse that is a gamma distribution with two parameters, which we just saw. If we do so, then the posterior distribution given these observations is of the same form again; this is called a Gauss-inverse-gamma prior. Here you see the shape of this kind of distribution, a prior over both mu and sigma, and these distributions tend to have this perhaps interesting shape. If you now make observations from this Gaussian, from some true value, here's one I've chosen, I've actually drawn it from this distribution, and keep making lots of observations, then this posterior will concentrate, and eventually it concentrates around the true value. This is actually a famous setting, as you can imagine, because it is the elementary form of scientific inference on a generative process with unknown measurement noise and an unknown value of an unknown variable: sigma is the measurement noise and mu is the variable. It is connected with the name of William Gosset, who wrote this paper under the pseudonym "Student", and maybe you'll get to live through that story yourself in a homework exercise. So what we've just done is see several examples of a quite interesting general structure that we'll study for the rest of the lecture, called a conjugate prior. A conjugate prior is a probability distribution, call it pi, over an unknown or random variable x, with some parameters theta, that is convenient to use in the context of a likelihood given by some probability distribution over data that is observable given the unknown latent variable x. It is convenient because it has the property that it can be written in some algebraic form, which we'll make precise in a moment, such that the product of the likelihood with that prior has the same algebraic form, so we can write it as pi over x with a different set of parameters. Maybe just from reading this definition you notice that it is a very weak definition: it doesn't actually say what we mean by "algebraic form". So what we're going to do next is think of a convenient algebraic form to formalize this process. Before we do so, here's a quick grey slide, so you can take a brief break and think about what we've done so far. Once you've done that, let's return to this question. In this definition of a conjugate prior, I've really just said that a conjugate prior is a probability distribution that makes inference easy. So what exactly do we mean by "easy"?
Well, we mean that the update of the parameters from the prior to the posterior should be easy to perform. Maybe the easiest operation to do on a computer is addition, so maybe we want our update for theta to be of the form that the posterior's parameters theta prime are the prior's parameters plus some update term. That was actually the case in all of the examples so far; you may want to go back and look at the slides to convince yourself of that. It turns out, and this actually requires a little bit of thinking, but thankfully it was done for us by smart people in the past, that there is a general form of probability distribution that gives us this kind of update, and this form is called an exponential family. An exponential family, and this is going to be maybe the most difficult definition of this lecture, so you might want to go slowly here, is a probability distribution over an unknown variable x given some parameters w. This is another parameterization of our situation: we want to do inference on x given some data y that is distributed according to a likelihood p of y given x; we want a prior for x, so that prior will have to be a probability distribution over x, and it will itself have parameters, which we call w. Such a probability distribution is called an exponential family if it is of this form: it is given by a function of x, times the exponential of an inner product between some function of x and the parameters w, minus a function that depends only on w and is conveniently written as the logarithm of some function Z of w. We can of course also rearrange things: the h of x could be moved into the exponential, giving the exponential of log h of x minus log Z of w plus this inner product of a function of x with w; or we could move Z out of the exponential and get an expression like this. All of these are the same. The important structure is that there are parameters w and a variable x, and they connect to each other such that there is one term that depends only on x, one term that depends only on w, and a mixing term that is linear in the parameters w but potentially nonlinear in x through phi of x. These particular functions in the definition have names because they are so important: h of x is called the base measure, w is called the natural parameter, or natural set of parameters, of this exponential family, and phi of x is called the sufficient statistics of this distribution over x; why that is, we'll see in a moment. Exponential family distributions simplify the construction of conjugate models. Before I show you how, let me first briefly convince you that all the distributions we've encountered so far, of continuous variables and of discrete ones, are of this exponential family form, or at least can be brought into it. Let's first consider maybe the simplest one, the binomial distribution, the distribution over the number of successes among n individual Bernoulli events, coin tosses, or more generally a certain number of successful experiments; for example, in a group of people of size n, the probability of observing k of them wearing glasses. As a function of this number of positive observations k, rather than the individual named observations, that distribution needs a correction factor to be a proper probability distribution over k.
This is the n choose k combinatorial term, the number of ways in which we can observe k positive cases among a collection of n, the number of ways of arranging them. This term was not on the previous slides, because back then we had a distribution over the individual events rather than over the counts themselves; if you want to talk about the counts, you need this additional term. Then that distribution, the binomial distribution, is given by this expression, where this here, the important bit, is what we have seen before, and it depends on an unknown variable q. So we can think of q as a parameter of this distribution and see if we can bring it into exponential family form. Notice that I am treating n as a fixed number here, not as a parameter; that may already raise some questions for you, which we'll get to in a moment. If we rearrange this expression, you can clearly see that we can write it as this constant in front, times the exponential of k times log q plus (n minus k) times log (1 minus q). Now we collect all the terms that involve k and get the exponential of k times the logarithm of q over (1 minus q), plus an expression that doesn't involve k, namely n times log (1 minus q). So this is of the form we are looking for, and we just have to rename the variables to see that this is an exponential family. The term in front, which depends only on k and not on q, notice again that we've decided to treat n as a fixed number, is the base measure h of k. The mixing term between k and the parameter tells us that the sufficient statistic of this distribution is literally just k, and the natural parameter is given by this so-called logit function, the logarithm of q over (1 minus q). The remaining term, minus the logarithm of Z of w, is given by n times the logarithm of (1 minus q); there is no w in it yet, it is still written in terms of q, but if you rewrite it in terms of what we just decided w is, then this function becomes n times the logarithm of (1 plus e to the w). Maybe a first observation here is that we decided to treat n as a fixed number. This is similar to how, in the Gaussian case, we said that the dimensionality d of the problem is a fixed number and not a parameter. Of course there are settings in which you can question whether that is the right thing to do, and that will then have an effect on what exactly you get to call the sufficient statistics, the base measure and the natural parameters. This is a first example of how the definition of exponential families is often a little bit vague and depends on interpretation, but we'll see that this is not really a problem, because what we are going to do with these distributions does not depend on these technicalities.
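Before the next example, here is a small sketch that evaluates the binomial pmf both in its standard form and in the exponential-family form just derived (base measure n choose k, sufficient statistic k, natural parameter w = logit(q), log Z(w) = n log(1 + e^w)); the numbers are arbitrary.

```python
import numpy as np
from scipy.special import comb

def binomial_pmf_standard(k, n, q):
    return comb(n, k) * q**k * (1.0 - q)**(n - k)

def binomial_pmf_expfam(k, n, q):
    w = np.log(q / (1.0 - q))          # natural parameter: the logit of q
    log_Z = n * np.log1p(np.exp(w))    # log normalization constant, n * log(1 + e^w)
    h = comb(n, k)                     # base measure
    phi = k                            # sufficient statistic
    return h * np.exp(phi * w - log_Z)

# the two forms agree up to floating point, e.g. k = 7 successes out of n = 20 with q = 0.3
print(binomial_pmf_standard(7, 20, 0.3), binomial_pmf_expfam(7, 20, 0.3))
```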
Here is another example. We used, as a conjugate prior for this binomial distribution, the beta distribution. The beta distribution is given by this distribution over the variable we just called q on the previous slide, with two parameters alpha and beta, and it has this form. We used it because it is a conjugate prior for a likelihood which also contains terms of the form q to the k times (1 minus q) to some other power. There we thought of that object as a function of k, a probability distribution over k with parameter q; now we are looking for a probability distribution over q, so it needs a different normalization constant, which is the beta integral. Again we can rewrite this expression: there is just a 1 in front, times the exponential of the logarithm of q times (alpha minus 1), plus the logarithm of (1 minus q) times (beta minus 1), minus the logarithm of the beta integral. So this is one way of writing it as an exponential family, with a base measure of 1, sufficient statistics given by log q and log (1 minus q), and natural parameters given by alpha minus 1 and beta minus 1. Notice, though, that we could also take those minus ones outside, as additional terms involving log q and log (1 minus q) dragged out of the exponential; then we would have an equivalent definition in which the base measure happens to be 1 over q times (1 minus q), and the natural parameters are simply alpha and beta. So again, here is an example of how the definition of what exactly the exponential family is depends on how you choose to write the expressions; one thing to keep in mind when dealing with exponential families is that there is, in general, no unique way of writing a particular probability distribution in exponential family form. There is, though, a whole list of interesting exponential families that can be used for various applications; here are just a few of them, and you can find an even longer list on Wikipedia if you like. There is the Bernoulli distribution, which we have just encountered, for individual discrete events like the successes of coin tosses or observations of people wearing glasses. There is the Poisson distribution, a probability distribution over the number of events in a certain time frame, given that they happen independently at a certain rate, for example how many emails you get per day given that the rate at which emails arrive is constant. There is the Laplace distribution, the exponential of an absolute distance, a distribution often used for extreme events, for disasters in nature like floods and so on. There is the distribution frequently called the chi-squared distribution; it is a bit annoying to have a distribution named after just a parameter, so here I am listing a person that could be connected to this name, the Helmert distribution, Helmert being a German geodesist, someone who worked on the measurement of the Earth; it is a distribution over variances and actually a special case of the gamma distribution. There is the Dirichlet distribution, which we already saw, a probability distribution over probabilities for multi-class events, so the conjugate prior to discrete distributions, distributions over individual events coming from more than two classes. There is the gamma distribution, connected maybe to Euler or to Bernoulli, as we saw; now I'm putting down Euler's name, just to be fair; it is in some sense a generalization of the chi-squared distribution and a good conjugate prior for variances of Gaussian distributions. There is the Wishart distribution, a multivariate version of the gamma, widely used as a conjugate prior for covariances, for example to model stock indices and how their covariances develop over time. And there is the Gaussian distribution, which we have already gotten to know and love, our prior over functions for supervised machine learning problems, and, as a generalization, the Boltzmann distribution, which we are going to encounter in a moment.
From this list, and for example the applications on the right-hand side, you can maybe already guess what kind of role exponential families might, or should, play in your mental toolbox: they offer data types in the probabilistic context. If you are a computer scientist, then you know and intuitively use the concept of data types, like integers and floats and arrays and lists and so on. These are types of objects that are particularly suitable for certain kinds of operations, and they come with certain interfaces that allow you to apply them in a particularly clean way to particular problems. Sometimes they overlap in their use cases; floats and integers are conceptually separate from each other, but some people don't care so much and use them in an overlapping fashion, which is maybe not particularly clean. Exponential families play a similar kind of role. They are particularly suited to certain types of variables, but in a somewhat more subtle way. If you are making observations of a variable, actually collecting data related to a latent variable of a certain type, then there is often a corresponding exponential family you might want to use, if you know about it. If you want to know the scale of a Gaussian distribution, you had better use a chi-squared or gamma or Wishart distribution for it, depending on the dimensionality and the number of observations. If you want to learn a real-valued variable, you use a Gaussian. If you want to learn discrete probabilities, use a Dirichlet prior, and so on and so on. Now, I have said several times that exponential family distributions have great properties, and that you want to know and use them, but I haven't actually shown you those great properties in a formal way; we have only hinted at them so far. So let's go ahead and see some of the great properties of exponential families. An advance warning: what I am going to show you in the next few minutes are overlapping concepts that relate to each other, so there are different ways of looking at essentially the same algebraic properties of exponential families. I nevertheless think it is a good idea to do it in this fashion, rather than to give you the most general introduction to their properties first, because it is easier to understand this way. The first great property of exponential family distributions is that they have conjugate priors, and that these conjugate priors are themselves exponential family distributions. To see this, consider a generic exponential family distribution like this one. Say we get to see data x, which we assume comes from this kind of likelihood, a distribution conditional on some unknown variable w, and we want to do inference on w, so we need a conjugate prior for it, and the likelihood has exponential family form. Then what we can do is repeat, in an abstract fashion, the process we went through at the beginning of the lecture with those concrete instantiations of exponential family distributions, to see that there is a conjugate prior and that it is itself an exponential family. To do that, I can just show you what that conjugate prior is: it is this probability distribution, with a normalization constant that is defined naturally from its structure. It is an exponential family whose sufficient statistics are given by w and the negative log normalization constant, minus log Z of w, and whose natural parameters, alpha and nu, are the things we will add to; that is just giving names to things.
And then there is a new normalization constant. Notice that we need to know two things here. First, we need to know the log normalization constant of our exponential family: if we don't know it, we can't really write this expression in this form, it stays a very abstract thing. So knowing the normalization constant of exponential family distributions is extremely crucial, and we will see that over and over again in the next few minutes. And then, to do anything interesting with this, we also need to know the log normalization constant of our conjugate prior, so we need to know what this integral is; and of course we only know whether we know that once we know the explicit forms of log Z of w and of w itself. If we have such a prior, let's just convince ourselves that it actually is conjugate. If we multiply this likelihood with this prior, and let's say we have many terms of that form, lots of observations x_i, because that is the most general case, then the posterior is the product of prior and likelihood, up to normalization at this point. Clearly this is of the right form: there is a bunch of terms in x, which arrive from the base measure, and notice that those don't matter, they just drop out into the normalization constant. This is one of the reasons why the base measure is often considered a kind of secondary object that is not so important in exponential family distributions: not only can it be absorbed into the sufficient statistics, it also doesn't really matter for the conjugacy property, because a term in x does not affect our posterior on w. The only terms that matter are these expressions in w, here and here, and we notice that the posterior is of the form of the prior, where the natural parameters are updated in an interesting fashion: to get the posterior natural parameters, we take the prior natural parameters and add the sufficient statistics of the likelihood, summed over the observations; and the second, special natural parameter, the one related to the log normalization constant, is a scalar, obviously, because there is only one log normalization constant, and it just keeps track of how many observations we've seen, how many terms we have in our likelihood; it simply counts up the number of experiments. We saw instances of this before, in the case of beta and Dirichlet and also gamma inference, and it is actually also hidden inside Gaussian inference, even though we didn't see it so explicitly there; we did an exercise to see that.
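Here is a minimal sketch of that generic bookkeeping: the conjugate prior's natural parameters, here called alpha and nu as on the slide, get updated by summing the sufficient statistics of the observations and counting them. The function names and the choice to pass phi in as a callable are mine, not the lecture's.

```python
import numpy as np

def conjugate_update(alpha, nu, phi, data):
    """Generic conjugate update for an exponential-family likelihood with sufficient
    statistic phi: alpha' = alpha + sum_i phi(x_i), nu' = nu + n (the observation count)."""
    data = list(data)
    alpha_post = np.asarray(alpha, dtype=float) + sum(np.asarray(phi(x), dtype=float) for x in data)
    nu_post = nu + len(data)
    return alpha_post, nu_post

# example: binary (Bernoulli) observations with phi(x) = x;
# alpha accumulates the number of successes, nu the total number of observations
alpha_post, nu_post = conjugate_update(alpha=[1.0], nu=1.0, phi=lambda x: [x], data=[1, 0, 1, 1, 0, 1])
print(alpha_post, nu_post)
```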
Now, Bayesian inference doesn't just require multiplying a prior and a likelihood; it also requires computing an evidence, the normalization term of a posterior like this. This we can compute, because it amounts to the normalization constant of such an exponential family distribution. Say we only think about one particular sample x, so we don't have n observations, just one. To predict that observation we need to compute this marginal probability distribution, and if we plug in the form of our exponential family likelihood and the form of our conjugate prior, we see that we have to compute an integral over an expression that is exactly an evaluation of the normalization constant of the conjugate prior at the updated natural parameters: we need to compute this capital F at alpha plus the value of the sufficient statistics at x, and nu plus one, all up to normalization. (There is a bug in this equation, by the way: there should be a minus sign here, sorry about that.) By the definition of the conjugate prior, that term then moves down here, and we just have to evaluate this ratio to get our predictive distribution, up to the base measure h of x; but of course we assume we know the base measure, so we can just evaluate it and put it in front. Here is another reason why base measures are not that important: we simply assume we can evaluate them. Notice again that to be able to do all this, we need to know this normalization constant F; if we don't know it, we can't run this kind of process. Actually, I could have defined, in a more practically minded way, an exponential family by saying that exponential family distributions are distributions of this form for which you know all of the quantities in there. It is not enough to be able to abstractly write down that these objects exist; if you actually want to use them, you need to know not just phi and h, but also what log Z of w is. And in fact it turns out that Z is really important: many of the great applications I'm about to show you rely on knowing Z. A very concise way of thinking about what exponential family inference means is that you are borrowing someone else's work to compute an integral. If someone has already gone ahead and computed F for you, for a particular choice of natural parameters and sufficient statistics, then you can use that work, the knowledge embodied in that integral, to simplify subsequent computations. You already saw that in this example: if you have this F, then you can predict future x, you can compute evidences if you like, and you can also compute posteriors, because the posterior involves this normalization constant. Another way of thinking about the role of F is on the next slide. Say we have data that comes from an exponential family, but we don't have a conjugate prior; maybe this slide is the generic setting. You would like to do Bayesian inference on a variable w, but you notice that to do so you need this normalization constant F; actually you need two normalization constants, Z and F. Suppose we don't have F, but we do have Z, so that's already one integral that might simplify our lives. Then we can still do approximate inference, in the maximum-likelihood sense, on the parameters w, and use the structure of the exponential family to simplify our life. To see that, say we get data x from an exponential family with parameters w, so it has this form. You'll notice that I have left out the base measure here; I have simply assumed it is one. You can see from what I'm going to do on the next few slides that this doesn't really matter: if you have a base measure h of x here, it just drops out one line below. If we have such data and we would like to know w, then ideally we would do conjugate-prior inference, which requires knowing F. If we don't know it, maybe we can do maximum-likelihood inference: we can at least estimate the best choice of w and report that as a point estimate. To do so, we could try to maximize this expression, this one here, for our n data points.
Of course, if you only have one datum, nothing changes, you just drop the sum. We've done this exercise many times before, so by now you probably know what to do: we take the logarithm of this expression, because we only want the maximum, and the logarithm is a monotonic transformation that makes things easier, so we might as well compute the maximum of the logarithm, that is, of the expression inside the exponential of our exponential family. Then we compute the gradient of this expression with respect to w and set it to zero. Because of this sum, in the exponential family setting, setting the gradient to zero means that we need the gradient of this expression, our log normalization constant, with respect to w, to equal the empirical mean of the sufficient statistics, this kind of empirical estimate of their expected value. Why is that good? For two reasons. First, there are the sufficient statistics in here, so even if we have n data points, at most we need to sum up the sufficient statistics of each datum; there are finitely many sufficient statistics, so we know this process has linear cost in the number of data points. That is already useful, because it means we don't have to do anything more complicated with the data, and maximum-likelihood inference will have linear time cost in the number of data points. Second, what we need to compute to actually solve this equation is the gradient of this log normalization constant with respect to w. That means we need to know what Z is, otherwise we can't compute this gradient; at least we need to know it in a numerical sense, we need code that can be automatically differentiated and that computes this log Z of w. In some cases the gradient of the log normalization constant is of such a form that you can solve this equation in closed form; then you have a particularly fast way of doing maximum-likelihood inference. If that is not the case, we can still compute the gradient and use it in a numerical optimization scheme to compute estimates of w. That is a little more expensive, but we still know it will be linear in the number of data points, because it involves the data only through the sufficient statistics. Actually, another great property of exponential family distributions is already hinted at in the expression you see down here: the gradient of the log normalization constant is related to this empirical estimate of what looks like an expected value, a Monte Carlo estimate if you like, of the expected value of the sufficient statistics. Let's make that formally a bit more precise, because it is actually true that this gradient equals an expected value of the sufficient statistics. Say we have an exponential family distribution p of x given w, a distribution of the form on the previous slide. First of all, if you compute the integral of this thing over x, that is just one, because it is a probability distribution over x, by assumption, or by definition of what log Z of w is. That means if you take the gradient of this expression, which is one, then the gradient of something that is constantly one is zero.
Then we can take the gradient inside, assuming everything is sufficiently regular, which it usually is. That means we need to take the derivative of this exponential, the exponential of this expression, with respect to w; by the chain rule we of course get that expression back, times the inner derivative, and the inner derivative of phi transpose w minus log Z with respect to w is phi minus the gradient of log Z of w. So let's do that: the p remains, because that is the exponential, and our integral over this gradient turns into this expression. There is a term here with the gradient of log Z of w, which doesn't depend on x by definition, so it can be moved outside of the integral; what remains here is still one, because it is still an integral over a probability distribution; and here we get the expected value of phi of x. Therefore, because this whole expression is zero, we can rearrange: the expected value of the sufficient statistics under this exponential family distribution is given by the gradient of log Z of w. So if you have something that you assume is exponential-family distributed, then you can compute expected values of the sufficient statistics, and the sufficient statistics are often interesting for various reasons, very directly, by computing the gradient of the log normalization constant. This is another instance, another symptom, of what I summarized before as exponential families being a way to leverage someone else's integration work. If someone has previously computed, or told you, what log Z of w is, then you can compute expected values of the sufficient statistics not by computing an integral, which, as you know, is an expensive, complicated process, but just by taking the derivative of this log normalization constant. For example, to compute the expected value of the log probabilities under a Dirichlet distribution, the log probabilities being the sufficient statistics of the Dirichlet, we can leverage Dirichlet's, or actually Euler's and Bernoulli's, work on the beta integral, by computing gradients of that gamma or beta function, rather than writing down an abstract integral that we would have to solve in a complicated, potentially high-dimensional, numerical fashion.
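As an illustration of that "derivative instead of integral" trick, here is a small sketch for the Dirichlet case just mentioned: differentiating the log normalization constant (the log multivariate beta function) with respect to alpha_k gives E[log f_k], which comes out as a difference of digamma functions; the Monte Carlo comparison is only a sanity check with made-up numbers.

```python
import numpy as np
from scipy.special import digamma

def expected_log_probs_dirichlet(alpha):
    """E[log f_k] under Dirichlet(alpha): the gradient of the log normalization constant
    log B(alpha) with respect to alpha_k, i.e. digamma(alpha_k) - digamma(sum(alpha))."""
    alpha = np.asarray(alpha, dtype=float)
    return digamma(alpha) - digamma(alpha.sum())

alpha = np.array([3.0, 5.0, 2.0])
print(expected_log_probs_dirichlet(alpha))

# sanity check: a Monte Carlo estimate of the same expectation
rng = np.random.default_rng(3)
samples = rng.dirichlet(alpha, size=200_000)
print(np.log(samples).mean(axis=0))
```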
A final thing I want to point out in passing, which doesn't seem particularly important yet but will be helpful a few lectures from now, is the simple, abstract fact that if you take the product of two exponential family distributions over the same random variable but with different parameters, so that is sort of the other way round from how we have used it before, then that gives another exponential family distribution in which you just have to add the natural parameters. This is going to be useful when we do approximate inference in a framework of exponential family distributions, because it is an operation that is very easy to implement on a computer: it is just summing floating-point numbers. So, to summarize what we've seen in these last few minutes: exponential family distributions have great properties. They have conjugate priors; the only problem is that the conjugate prior involves a log normalization constant which you might not always know, but that conjugate prior is itself at least an exponential family, so that is maybe interesting. More concretely, something we definitely can do once we have an exponential family distribution is maximum-likelihood inference on the parameters, because that can be done by computing the gradient of the log normalization constant and setting it equal to the empirical estimator, the Monte Carlo estimator, of the expected value of the sufficient statistics. That is relevant because there are finitely many sufficient statistics, so if you have many data points, maximum-likelihood-type inference on the natural parameters is feasible in linear time. And as a related result, we saw that we are able to compute integrals over the sufficient statistics, that is, expected values of the sufficient statistics, not by computing an integral but just by computing a gradient. All of these properties require us to know what the log normalization constant, or the normalization constant, actually is; otherwise we can't really make use of them. So exponential family distributions are particularly interesting if you know their normalization constant. In fact, one could run the entire argument the other way round and say: if you know how to do a particular integral that involves an exponential function, the exponential of something, then that directly defines an exponential family distribution. Exponential families are a way of turning an analytic integral into a probability distribution with good estimation properties, a distribution with which we can do maximum-likelihood inference, under which we can compute expected values of the sufficient statistics, and so on and so on. Before we continue with the grand finale of this lecture, let me summarize or recap what we have done so far in this lecture, which took a relatively phenomenological approach to the somewhat technical domain of exponential families. We began by observing, quite concretely and in practice, that for certain probability distributions it is possible to learn parameters of that distribution, latent quantities in these distributions, in a Bayesian fashion, through a conjugate prior. That is true, for example, for the parameters of discrete and binary distributions, or for the mean of a Gaussian, or the variance of a Gaussian. When we tried to formalize this notion of a conjugate prior, we noticed there is this other concept, called an exponential family, which simplifies the construction of conjugate priors; in fact, if we choose the likelihood to be of exponential family form, then, first of all, a conjugate prior actually exists, and that conjugate prior is itself of exponential family form. That conjugate prior, if it is available, meaning if it is tractable, allows Bayesian inference on the parameters of our probability distribution. It is not always tractable; if it is not, we can still do maximum-likelihood inference on the parameters of our exponential family likelihood, and doing so is possible in linear time in the number of data points drawn from this distribution, because the data enter only through the sufficient statistics. Maybe you've noticed that the process we have gone through here has something to do with learning: we are learning not a function, as in previous settings, but essentially a probability distribution. We keep getting lots of draws x from some unknown probability distribution, and then we use those draws to fit the distribution.
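To recap the maximum-likelihood recipe concretely: the data enter only through the mean of their sufficient statistics, and we solve grad log Z(w) = mean of phi(x_i). Here is a minimal numerical sketch for the Bernoulli case from earlier, with phi(x) = x and per-datum log Z(w) = log(1 + e^w); in this case the equation can also be solved in closed form (the logit of the empirical mean), and the optimizer-based version is only there to show the generic pattern.

```python
import numpy as np
from scipy.optimize import minimize

# Bernoulli observations in exponential-family form: phi(x) = x, log Z(w) = log(1 + e^w)
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1], dtype=float)
mean_phi = x.mean()                      # the data enter only through this number

def neg_avg_log_likelihood(w):
    # average negative log likelihood: log Z(w) - w * mean(phi), up to the base measure
    return np.logaddexp(0.0, w[0]) - w[0] * mean_phi

w_ml = minimize(neg_avg_log_likelihood, x0=np.zeros(1)).x[0]
print(w_ml, np.log(mean_phi / (1.0 - mean_phi)))   # numerical solution vs. the closed-form logit
```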
Can we make this process a bit more formal? Do exponential families provide a framework for learning probability distributions, similar to how Gaussians provide a framework for learning functions? To answer that, let's maybe go back to our framework for learning functions and think about how it would be described in a statistical and in a probabilistic framework, and how the two connect to each other.

In the lectures so far we focused on learning problems where we're trying to learn functions; that's supervised machine learning, if you like. In the first lecture on Gaussians I provided you with a tedious, lengthy version of what I just did with exponential families: we went through various good properties of Gaussian distributions, just as we now went through good properties of exponential family distributions. Then, in the lecture that introduced regression on Gaussian distributions, sorry, regression on functions, we decided to use a combination of a likelihood and structure in the thing we're trying to learn, a parametric structure for the function. We decided that the likelihood is going to be a Gaussian probability distribution: we get to see values (there's a bracket missing here on the slide) of an unknown function evaluated at various locations, corrupted by Gaussian noise. Then we made a further simplification and assumed that the function we're trying to learn is actually of a parametric form, so that it can be written with a bunch of features. Back then we noticed that there is a conjugate prior for the weights of this function under this Gaussian likelihood, and that conjugate prior happens to be a Gaussian probability distribution. We didn't always call it a conjugate prior back then, but that's what it is, because it gives a posterior over w that is itself a Gaussian distribution. And then we used that framework to learn functions.

Now, you can do a statistical analysis of this kind of process, of this Bayesian inference process, and identify it with a certain risk-minimization procedure; we've done that in previous lectures, and here's a quick recap. If you're computing the posterior distribution over these unknown weights w, then you're multiplying prior and likelihood and dividing by the evidence. If you think only of the maximum of this distribution, that is, a point estimate, then you can equivalently compute the minimum of the negative logarithm of this posterior, which simplifies the expressions. And we saw that if we use a Gaussian prior and a Gaussian likelihood, then this loss function, this empirical risk that we are minimizing when we are finding the maximum of the posterior distribution, corresponds to this expression, which is a sum over quadratic terms: the log likelihood, plus a quadratic regularizer on the weights, which is the log Gaussian prior. Here I've already rearranged terms so that the variance of the likelihood is moved over into the prior term.
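Written out (this is my notation, not necessarily exactly the slide's: sigma squared is the likelihood noise variance and lambda squared the prior variance of the weights), the point estimate minimizes something like

$$
\hat{w} \;=\; \arg\min_w \;\sum_{i=1}^{n} \big(y_i - \phi(x_i)^\top w\big)^2 \;+\; \frac{\sigma^2}{\lambda^2}\,\lVert w\rVert^2 ,
$$

which is the familiar regularized least-squares, or ridge regression, objective.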
Now, such a statistical analysis might include, among other things, looking at how this estimate behaves if you get lots and lots of data points. So suppose you have many data points, and let's assume these data points x_i come from some probability distribution p; let's even, for simplicity, assume that we actually get to see the function value without noise (with noise you have to think a little bit more, but things don't really get much more complicated). Then the loss function that we are minimizing here is actually the expected squared distance between the true function and our approximation to it. Notice that this really is the true function: this f is not of the parametric form written here, it's just the real f; we merely decided to represent f in this particular form. And then we have this regularizer; but if you have a large number of data points, the regularizer is basically drowned out. You can imagine that for large n this term here drops away, and the function we're going to find is the choice of w which, within the hypothesis class of functions that can be written in this parametric form, minimizes this expected quadratic risk.

Now, in this lecture we moved away from functions, to probability distributions, to generative models for data. Is there a corresponding concept for what we've just done with exponential family distributions? Do exponential family distributions play a similar role to parametric Gaussian regression when we do maximum likelihood inference in exponential family models, or actually full conjugate-prior Bayesian inference? The answer is yes, and the technical answer relates to a change of the empirical risk from this quadratic loss to what you might call a log loss. That log loss is connected with the names of two American statisticians who have two of the most mispronounced and misspelled names in statistics: Solomon Kullback and Richard Leibler, or at least that's the German pronunciation; these are American guys, so the anglicized pronunciations are probably okay as well. They were statisticians who worked for various secret agencies in the US during the Second World War and the Cold War, on cryptanalysis, and their names are attached to this particular function, which is called a divergence: the Kullback-Leibler divergence, or KL divergence. It's a sort of measure of how dissimilar two probability distributions p and q are, and I'm being particularly careful not to call it a metric or a distance measure. Assuming that p and q have densities, given by little p and little q, we can write this expression: take the logarithm of their ratio and integrate it against p.

A few interesting properties to note, which I'm not going to prove for lack of time. First, this is not a symmetric expression: if you exchange p and q, you get a different thing, because the integral is then not against p anymore but against q; it's quite a different object, actually. Secondly, this expression is not a metric; it doesn't fulfill the triangle inequality, for example. However, it does have the property that it is non-negative: its value is always greater than or equal to zero, and it is zero if and only if p and q are identical almost everywhere, that is, everywhere except on a set of measure zero. This thing is going to provide the loss that we minimize when we do maximum likelihood inference with exponential families. Before we move on, notice that this expression is similar to the quadratic risk above in that both are integrals against p; but inside the integral we don't have the squared distance between the function p and the function q, we have the difference between their logarithms. That's clearly asymmetric as well, right? It's the logarithm of p minus the logarithm of q, integrated against p, rather than the square of the distance between p and q.
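Here is a tiny numerical illustration of those two properties (my own example with made-up discrete distributions, not from the lecture):

```python
# Asymmetry and non-negativity of the KL divergence for discrete distributions.
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) for discrete distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # the convention 0 * log 0 = 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

print(kl(p, q), kl(q, p))   # two different numbers: the divergence is not symmetric
print(kl(p, p))             # 0.0: it vanishes exactly when the two distributions agree
```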
Now let's see how that loss function shows up when we're doing inference with exponential families. So let's say we get some data x, and this x is actually drawn from some unknown probability distribution p of x. We don't know what that distribution is, but we make the decision to approximate it with a parameterized probability distribution, with an exponential family distribution. So we assume that the true p, which we don't know, can be written in the form of a p hat of x, a function of exponential family form with parameters w, a normalization constant, and sufficient statistics phi. By the way, just to be clear again, because people keep getting this wrong: it's not that the set of all functions that can be written in this form is "the exponential family"; rather, every family of functions that can be written in this form, for a fixed phi and varying w, is an exponential family. So we want to approximate the true distribution this way: we have our data, we don't know what the distribution is, but we assume it can be written in this form.

Now let's just decide that we want to find the choice of w which minimizes the KL divergence between the true distribution p and the approximate distribution p hat. From the previous slide we know that this KL divergence can be written in this form; I've literally just copied over the definition. If we want to find the w which minimizes this expression, we can take its derivative with respect to w, but to do so let's first simplify a little. Here's a term that only involves p; this term actually happens to be the negative entropy of p, that's just what it's called, and it doesn't matter to us because it doesn't depend on w, so if we optimize over w we might as well forget about it. What remains is the second term in this expression, which is an expected value, because it's an integral against p, of the logarithm of p hat. So let's look at that logarithm: it's given by phi transpose w minus log Z of w. The log Z of w doesn't depend on x, so its integral here is trivial, it's just an integral over a probability distribution, so it just gives log Z of w; and in front we have the expected value, under the true distribution p, of the sufficient statistics, in an inner product with w. So if we want to minimize this, we have an expression we've seen before: we want to choose w such that the gradient of log Z of w is equal to this expected value, and here we go. One way to estimate that, and that's the empirical risk minimizer, is to just compute the empirical risk: the empirical estimate of the statistics, transposed with w, minus log Z of w. If you take the gradient of that, you find an empirical risk minimizer, given by the choice of w which sets these two quantities equal to each other. In the limit of many, many data points this Monte Carlo estimate, assuming we have i.i.d. data, converges to the true expected value, and then we are actually finding the member of our exponential family, the choice of w, which minimizes the KL divergence between the true distribution p and the estimated distribution p hat. So that is, if you like, statistical estimation of probability distributions.
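Compactly, and in my own shorthand rather than the slide's, the argument reads roughly

$$
D_{\mathrm{KL}}\big(p \,\Vert\, \hat p_w\big) \;=\; -\,H[p] \;-\; \mathbb{E}_p[\phi(x)]^\top w \;+\; \log Z(w)
\qquad\Longrightarrow\qquad
\nabla_w \log Z(\hat w) \;=\; \mathbb{E}_p[\phi(x)] \;\approx\; \frac{1}{n}\sum_{i=1}^n \phi(x_i),
$$

where the last step replaces the intractable expectation under p by the empirical average over the observed draws.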
Now, this lecture is called probabilistic machine learning, so we would like not just to do statistical estimation; we'd like to do probabilistic estimation. So let's fix that by moving towards the full Bayesian framework, and we already have all the ingredients for it from previous parts of this lecture. First of all, let's say we don't want a maximum likelihood estimate, maybe because we only have three data points but twenty sufficient statistics. Then we introduce a prior distribution over the parameters w of our exponential family. We already know how to do that: there is a conjugate prior for this exponential family, and it has the form we had on previous slides, namely the exponential of alpha transpose w, minus nu times log Z of w, plus a normalization constant which depends only on alpha and nu.

Now, that normalization constant is the really tricky part; that's why, so far, we're not doing full Bayesian inference yet. But if we want to do MAP estimation, if we just want the maximum of the posterior distribution, then we don't need that normalizer, because it doesn't depend on the value of w, only on the values of the prior's parameters alpha and nu. So all we need to do to maximize the posterior is take the logarithm of the full posterior: the prior p of w, which for us has this exponential family form (actually it could be of any form, but we're going to need the exponential family form in a moment for full Bayesian inference), times the likelihood, which is given by our exponential family. In the logarithm that just means we add the inner form of the prior, and everything moves through as before; the same arguments go through. This term is still an entropy, it still doesn't depend on w, so we get rid of it; here are the expressions we had on the previous slide; and now there are just two additional terms that involve w, here and there. If we take the gradient of this whole expression and set it to zero, we get a new, sorry, a MAP estimate; it's now a MAP estimate rather than a maximum likelihood estimate. And you can convince yourself that the optimum is at the point where the gradient of log Z of w, the model's expected value of the sufficient statistics, is equal to this sort of regularized expression. It involves the term we had before, but now corrected by the prior parameters and the number of data points. As n gets large, the prior again drops out: the parameter alpha doesn't matter anymore, because one over n goes to zero, while the term we had in the maximum likelihood estimate is still around, and nu plus n in the denominator is approximately n for large n. So we're back at the maximum likelihood estimate. That's a typical statistical estimation result: as the number of data points grows very large, the MAP estimate approaches the maximum likelihood estimate, the prior gets drowned out by the data, and again we get a point estimate that is asymptotically equal to the maximum likelihood estimate and therefore minimizes the KL divergence.
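As a small, hedged illustration of that regularized matching condition, here is a Bernoulli example of my own, with made-up prior parameters: the gradient of log Z (here the sigmoid of the log-odds) is matched not to the raw empirical mean of the sufficient statistics but to (alpha + sum of phi(x_i)) / (nu + n).

```python
# MAP vs. maximum likelihood for a Bernoulli likelihood in exponential family form:
#   phi(x) = x,  w = log-odds,  grad log Z(w) = sigmoid(w).
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.9, size=5)        # only a handful of fairly extreme coin flips

alpha, nu = 1.0, 2.0                    # made-up conjugate-prior parameters
n = len(x)

mle_mean = x.mean()                          # maximum likelihood: sigmoid(w) = mean(x)
map_mean = (alpha + x.sum()) / (nu + n)      # MAP: sigmoid(w) = (alpha + sum phi(x_i)) / (nu + n)

# The prior pulls the small-sample estimate away from the boundary of [0, 1];
# the natural parameter itself is recovered as the log-odds of these means.
print(mle_mean, map_mean)
```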
So let's see what happens when we do full Bayesian inference. What we would like to do is find a conjugate prior for our exponential family distribution; if we can find one, it will look like this. Finding it depends on knowing what F is, that's the entire trick: we know that the conjugate prior has to have this form, just by construction from looking at the likelihood, and to be able to normalize it we need to know what this integral is. So let's say we know what this log normalization function F is. Then we can use it to do full Bayesian inference on a probability distribution of exponential family form. We compute our posterior, which means we multiply the prior with the likelihoods (there are actually multiple terms, because we assume we have n data points) and normalize. We saw that we can do this normalization, because it involves just evaluating this partition function F at various points, and we get a posterior over w that is in the exponential family, in the conjugate-prior exponential family, with updated parameters given by the prior parameters plus the sum over the sufficient statistics of the data, and a count variable that keeps track of how many observations we've made. That is actually full Bayesian inference, and if you only care about Bayesian inference, you're done at this point: you can use exponential families to learn probability distributions. To be able to do so, you need to know two things. You need to know the exponential family itself that you are trying to learn, including its normalization constant, and you need to know the normalization constant of the conjugate prior, which is, up to isomorphism, the one conjugate prior; to be able to use it, you need to know that normalization constant.
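Here is a minimal sketch of such a conjugate update (a Poisson likelihood with its Gamma-type conjugate prior, my own example with made-up prior parameters, not taken from the slides): the posterior stays in the conjugate family, and its parameters are just the prior parameters plus the summed sufficient statistics and a count of the observations.

```python
# Fully Bayesian conjugate update for a Poisson rate.
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(3)
x = rng.poisson(lam=4.0, size=25)      # synthetic count data, true rate 4

alpha0, nu0 = 2.0, 1.0                 # made-up prior parameters

# Conjugate update: alpha <- alpha + sum phi(x_i), nu <- nu + n
alpha_n = alpha0 + x.sum()
nu_n = nu0 + len(x)

# In the rate parameterization this is a Gamma(alpha_n, rate=nu_n) posterior over the rate
posterior = gamma(a=alpha_n, scale=1.0 / nu_n)
print(posterior.mean(), posterior.interval(0.95))   # concentrates near the true rate as n grows
```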
Now, you can do Bayesian inference with that, but you might also be interested in what the corresponding connection to the statistical viewpoint is in this framework, and it is again connected to the Kullback-Leibler divergence. If we keep making such observations, so if n grows large, we can look at the behaviour of the posterior distribution around its mode, around the MAP estimate. We already know from the previous slide that this MAP estimate converges towards the maximum likelihood estimate. I'll leave it as an exercise for you to convince yourself that the Hessian, the curvature, of this posterior distribution, this exponential family distribution, at its mode is given by this expression. What you see is that this second derivative, this Hessian, is given by the value of the posterior at its mode, which is a large positive value (in fact the largest value the density takes), times the curvature of the log normalization constant at this particular point. The overall curvature has to be negative, because we are at the mode, a maximum; there's a minus sign in front, so the remaining factor is positive. And the important bit is this nu here, the parameter of the conjugate-prior exponential family that multiplies the log normalization constant: as the number of data points increases, this parameter becomes nu plus n, and n gets large, so the curvature gets very large. That means our posterior distribution is concentrating around its mode. Of course, that's not a full proof, because it's only a local statement, but it has to suffice for this argument. In the limit of infinitely many data points you can thus convince yourself that the full posterior distribution concentrates around its maximum, well, its maximum a posteriori value, and we know from the previous slide that that value goes to the maximum likelihood value. And we already know what the maximum likelihood value is: it's the choice of w that matches the expected value, under the true distribution p, not the approximate one, of the sufficient statistics. So what our Bayesian inference framework is going to do, as the number of data points increases, is find an estimate and concentrate the posterior around it, and that estimate is given by the choice of w which minimizes the KL divergence between the true distribution and the approximate distribution.

So with that, let's briefly summarize. Exponential families actually provide an entire framework for learning probability distributions. You can use them in a maximum likelihood, a maximum a posteriori, or a full Bayesian framework, depending on how many quantities you have available, that is, how many integrals you are able to solve. Given data drawn from some unknown distribution p, if you decide to approximate that distribution with an exponential family distribution, then you can do maximum likelihood inference in linear time by computing gradients, and that involves computing the sufficient statistics of the n data points. You can assign a prior, which always exists, called the conjugate prior. It's not always tractable; if it's not tractable, you can still do maximum a posteriori inference, and if it is tractable, you can do full Bayesian inference. Whichever of these three frameworks you use, in the limit of arbitrarily many data points the posterior distribution, or the likelihood, will asymptotically concentrate around a point estimate, given by the choice of w that minimizes the KL divergence, the expected log loss if you like, between the approximate distribution, the exponential family distribution, and the true generating distribution.

The reason I've gone through this exercise of showing you this perhaps somewhat technical notion of exponential families is that I want to empower you, and maybe take away some of the respect you have for the somewhat short and very specific list of probability distributions that you might find online, and which is associated with big names: the Gaussian distribution, the Dirichlet distribution, the gamma and chi-square distributions; sometimes they don't even have names of people attached, but only abstract, magical symbols. That might give you the impression that there is only a very finite list of these distributions and that you don't even have to think beyond them. The goal of this lecture course in general is to empower you to build your own tools, and that means you shouldn't be using black boxes, whether those black boxes come to you in the form of software libraries or in the form of some formula that some big name wrote down a long time ago. So let's end this lecture with a little, not entirely serious, game. Let's say I really wanted to be part of this list of cool people, and I would like to add my face to this gallery of wonderful, amazing mathematicians. Then maybe what I have to do is invent an exponential family probability distribution and somehow sneak my name onto it on Wikipedia. So let's see how I would do that. The one thing we actually need to define an exponential family distribution is a log normalization constant. If we have that, then we have everything, because its gradient defines the expected value of the sufficient statistics, and we can do inference, at least maximum likelihood inference, with it.
So what I did is I opened up a big book with a table of integrals and looked at the integrals over exponential functions. You can find such lists on Wikipedia, you can find big books in the library, and maybe go a little bit deeper into the books and find some obscure integral. I've actually found one, which is of this form: I know that the integral from zero to infinity of the exponential of minus w1 x squared minus w2 over x squared, dx, is given by a constant, namely one half times the square root of pi over w1, times the exponential of minus two times the square root of w1 w2. That's exactly the kind of thing we need. That's a normalization constant of an exponential family distribution, and all that's left to do is a little bit of PR work. We need to come up with some dataset, and that dataset of course has to be motivated by the shape of this distribution. If we vary w1 and w2 up and down, this amounts to creating a family of distributions that looks like these red curves here. You can think for yourself about what happens, about what the roles of w1 and w2 are, and maybe you can even invent names for them; that might be a good exercise, to think about the shape of this distribution. And once you have that, you're done: you can now do inference on datasets that look like this. Of course, you have to come up with a reason for why this is a good dataset or a good kind of structure to look at, but maybe you can invent one. For example, I could say: the sufficient statistics of this probability distribution are given by minus x squared and minus one over x squared, so maybe it has something to do with the eigenvalues of symplectic matrices, and therefore with the stability of some control problems; or maybe it has something to do with data that lies on a torus through which we have made a cut at some point, something like this. It doesn't really matter, right? Let's first invent the probability distribution, and then let's see whether there's a use for it.

So here I've just written down the integral again. This is something I actually found in a book; that's the hard bit, which I basically took on loan from whoever did this integral for me, apparently some Russian mathematician from a long time ago. Now that Z, treated as the normalization constant of the exponential family defined by this form, directly gives me everything I need. That's it: I don't actually have to contribute anything myself, other than maybe coming up with a fancy abbreviation for this probability distribution that somehow reminds people that I was the one who came up with it. So here it is. By definition again, here is what this probability distribution is, here is the exponential family form, and here is the normalization constant, printed right in front of it. I don't know the normalization constant of the conjugate prior for this distribution, because that constant isn't in the table of integrals I looked at, so all I can possibly do is maximum-likelihood-type inference.
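Written out (this is my reconstruction; the factor of one half and the square root in the exponent are as the result appears in standard integral tables), the normalizer and the resulting family read

$$
Z(w_1, w_2) \;=\; \int_0^\infty \exp\!\Big(-w_1 x^2 - \tfrac{w_2}{x^2}\Big)\, dx
\;=\; \frac{1}{2}\sqrt{\frac{\pi}{w_1}}\; e^{-2\sqrt{w_1 w_2}},
\qquad
p(x \mid w) \;=\; \frac{1}{Z(w_1, w_2)} \exp\!\Big(-w_1 x^2 - \tfrac{w_2}{x^2}\Big), \quad x > 0,\; w_1, w_2 > 0 .
$$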
So let me do that. To do maximum likelihood inference I need a data set, which I will collect; here on the right-hand side is already a first datum that I've observed, and there will be more in a moment. To do maximum-likelihood-type inference, let me just write down again what the normalization constant is, and its logarithm; that's what I have on the whiteboard behind me, I've just taken the logarithm of it. All I now need to do is compute the gradient of this function with respect to its two parameters w1 and w2, which is given by this expression; it's a simple exercise you can do on a piece of paper in a few lines. To solve for the maximum likelihood estimate, I need the empirical estimate of the expected value of the sufficient statistics, given by this sum, and then I solve this expression for the two parameters. It actually becomes quite natural to introduce two new quantities, call them mu bar and omega bar, which make it easy to solve this (not linear, but rational) equation for w1 and w2. And that's it: now, if you give me data, draws x, all I need to do is take each x, compute its square and one over its square, sum those up, and plug the resulting numbers into this estimation procedure to get w1 and w2. You can see here what this looks like. After one datum (black is the true value, red is the maximum likelihood estimate) the estimate is of course overly confident, because it's a maximum likelihood estimate; but as I get more data, even after just three data points, I already have a pretty good estimate of the true distribution, and if I keep going like this, the estimate gets better and better and this distribution approximates the true distribution. Of course, that's because I've assumed that the true distribution looks like the distribution I'm fitting. If I instead used a Gaussian distribution, or some other distribution, to draw my real numbers, then this process would give me the minimizer of the KL divergence between this family of distributions, this exponential family, and the true distribution.
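Here is a hedged sketch of that estimator in code (my own reconstruction, assuming the tabulated normalizer written above; the closed form follows from matching the gradient of log Z to the empirical sufficient statistics):

```python
# Maximum likelihood for the invented family p(x | w) = exp(-w1*x^2 - w2/x^2) / Z(w), x > 0,
# assuming log Z(w) = log(1/2) + 0.5*log(pi/w1) - 2*sqrt(w1*w2).
# Matching conditions:  E[x^2] = 1/(2*w1) + sqrt(w2/w1),  E[1/x^2] = sqrt(w1/w2).
import numpy as np

def mle(x):
    """Closed-form MLE from one pass over the data (only the sufficient statistics are needed)."""
    m1 = np.mean(x ** 2)         # empirical E[x^2]
    m2 = np.mean(1.0 / x ** 2)   # empirical E[1/x^2]
    w1 = 1.0 / (2.0 * (m1 - 1.0 / m2))   # positive, since m1 * m2 >= 1 by Cauchy-Schwarz
    w2 = w1 / m2 ** 2
    return w1, w2

# Data from *some* positive-valued process; the estimator then returns the member of
# the family that is closest in KL divergence to whatever actually generated the data.
rng = np.random.default_rng(4)
x = rng.gamma(shape=5.0, scale=0.3, size=1_000)
print(mle(x))
```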
With that, we're at the end of this lecture. Just to summarize once more: exponential family distributions provide a toolset to learn probability distributions, basically generative models, distributions over random variables, from data. You can use them to do maximum likelihood inference; that's always possible once you have an exponential family, just by computing gradients of the log normalization constant and setting them equal to the sufficient statistics, or rather to their estimated expected values, and it's possible in linear time. You can easily correct this to a MAP estimate by introducing a conjugate prior, and that conjugate prior is itself an exponential family distribution. If you happen to know the normalization constant of that exponential family conjugate prior, you can even do full Bayesian inference on probability distributions. And we saw a statistical interpretation of this process, which provides a guarantee that, as the number of data points increases, we will at least find a best estimate of the true distribution within our parameterized family of exponential family distributions. The linchpin of this entire process is knowing the log normalization constant. So if you find an integral somewhere in a table of integrals that allows you to compute expected values of some weird sufficient statistics, then you've just invented a new exponential family. Or, maybe a little bit more realistically: if you have a data set that you think can be described, whose generative process can be described, in terms of some sufficient statistics, maybe go out and see whether you can find the corresponding integral in a table of exponential integrals. With that, we're at the end for today. Thank you very much for your time.