So over the last four lectures, we went from seeing that there is this fundamental theory of reasoning with incomplete information, called probability theory, to: okay, this is a beautiful thing, but when you actually do it in practice, the complexity of the associated computation for discrete variables rises exponentially with the number of variables that we are tracking. So the complexity of doing inference in a problem that has D variables rises exponentially with, well, okay, let's first write it like this, with the number of variables. If you have binary variables, then there are two possible values each of them can take, and the complexity is something like two to the D. And in the third lecture I said, well, we can also think about continuous-valued variables. There's this complicated math around it, Borel sigma-algebras and whatever, that allows us to distribute truth over a continuous space of possible values. And then what's down here, what's the number of values that they can take? It's sort of infinite, right? Ooh, so we're not gonna be able to do any of this, right? If we leave it in this abstract form.

So in the last lecture, and the one before that actually, we encountered that there are sometimes situations where things just kind of work out. The first one was maybe found by Laplace in eighteen-hundred-something, 1805, 1808, whatever, 1815; 1812 was when he published the paper. I mean, back then plus or minus one year didn't really matter, it's not like today where it changes the world. A situation where everything kind of remains tractable, to the point where the entire computation is just adding up numbers. So what statisticians actually have been doing historically since then, through a large part of the 20th century, is to find lots and lots of situations like that where you can do the inference in closed form. And for that they found all sorts of distributions where this is possible to some degree. Actually, maybe directly the next one, or even the one right before Laplace, was Gauss, who did the same thing with real-valued rather than count-valued variables.

So what I'm trying to do with the last lecture and this one is to wrap all of this statistics up into one single package, so that we can discuss it in two lectures and then move on, because it's 2023 and we unfortunately can't dwell on 60 years of statistics and do a stats course where you talk about every single one of these distributions for a long time, derive tests and estimators and maximum likelihood estimators and whatever, study them in all sorts of glorious detail, and convince yourself that that's an entire field to work on. Instead, I'm trying to encapsulate it in the formal language of computer science, which allows us to formalize the process.

So what we're trying to do is Bayesian inference; that means computing a posterior, which is a probability distribution. Typically it's a probability density function for continuous-valued variables, and if there are discrete variables, that's basically just a special case. For that we have to multiply a prior by a likelihood and divide by the evidence. That's simple to write down, and often there's a simplification, which is that we have several data, many different data points, and we assume that they all come from the same kind of underlying distribution when conditioned on the latent quantities.
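Written down, that is just Bayes' theorem with a conditionally independent (i.i.d.) likelihood; in the notation used later in this lecture, with data X = {x_1, ..., x_n} and latent parameters w:

```latex
p(w \mid X)
  \;=\; \frac{p(X \mid w)\, p(w)}{p(X)}
  \;=\; \frac{\left[\prod_{i=1}^{n} p(x_i \mid w)\right] p(w)}
             {\int \left[\prod_{i=1}^{n} p(x_i \mid w)\right] p(w)\, \mathrm{d}w}.
```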
You remember that we encountered graphical models in the third lecture, where we saw that this has something to do with conditional independence. So this is a form of conditional independence of the data given the parameters in the model, which doesn't mean that they are independent marginally; they're just conditionally independent when given the model. Why is that interesting? Because somehow that seems like an interesting algebraic structure, because you can take more and more data and kind of learn from it. That's the process of learning, maybe. And unfortunately, just writing down this equation, even though of course you've all seen this equation, Bayesian inference, over the course of your entire bachelor education and you all go like, yeah, Bayes' theorem, I know, I know, it's actually, if you think about it, a very, very abstract statement. There are just these functions being multiplied with each other, and you're somehow supposed to be able to do that. And we now realize that we have to not just multiply two numbers, we have to multiply two functions with each other. If you think about it, that's a really complicated thing to do. What is the product of two functions? It's a new function. But okay, of course I can multiply floating point numbers with each other on a computer, but then I need to integrate, to normalize by the normalization constant, which is a big integral. Integrals are very hard. Why do we even talk about this like it's an easy thing to do? In practice it's not. There are only a few special cases where it's straightforward. So what we would like to be able to do is to find the structure of those cases where it is easy to do. And that's actually the basic situation for what we're trying to do in machine learning and AI: extract information from data in a computationally efficient fashion.

One way to think about the structure of the algorithm underneath that we'd like to have is like this. We'd like to be able to take some data. Data typically comes in the form of an array. Right, it has a data set size, a batch size, how many data points do we have? Let's call that the first dimension of the array, and it has size n. And then every datum has a number of dimensions. You could call those features or values or whatever. And sometimes they are indexed by D, or maybe by P. I've already messed up my own notation here. Or maybe actually, yeah, it's sort of the same D in a sense, right? So what we'd like to do is to write down some piece of code that pre-processes the data and transforms it into a bunch of numbers which tell us something about this data set. We would like to compute some statistics. And then, after that, maybe we're able to put the data away in the drawer and never touch it again. We just keep the statistics around. And then we hand those statistics to some algorithm that does this thing up here, that takes care of this complicated algebra. And that algorithm will have to do something on a computer. So these days we're kind of used to saying, oh, there's some magic happening inside, right? You ship it off to the cloud and some data center and call some fancy algorithm that does something. But what does it actually do? Ideally, it just does some floating point computations. And so we saw in the last lecture that a really appealing structure that we would like to go for is that of a conjugate prior, or a conjugate pair of prior and likelihood, actually.
So notice how the conjugate prior sounds like it's the important word, but actually it's a thing that you construct for your likelihood. So the thought process starts with the likelihood, and then we think about what the prior would be. Why? Because that's what actually happens in practice. I give you data and you think about, well, what structure does this data have? Where does it come from? What are the variables that create this data? Can I build a little model, draw a little graph, or draw a little picture for myself? And then that defines our likelihood. And with that likelihood we have to work and make Bayesian inference possible. So we usually think about the prior after the likelihood. It's actually post hoc, after the likelihood.

So such a conjugate prior has this kind of structure that I've written down before already, which has actually been thought about for just under 100 years, since Pitman, and which, well, maybe I'll read out, sort of, is as follows. Assume you have a likelihood, let's call it L. Likelihoods are functions of two things, of the data and of some underlying parameters. Let's call the data X and the parameters W on this slide; I keep changing it. A conjugate prior is a function G which takes data drawn from this likelihood and has the property that the resulting posterior, so the thing that Bayes' theorem dictates we use, is of the same form, where what happens inside is that we compute these things, these statistics, the sufficient statistics of the data, and then do something that allows us to stay within the same algebraic class. So one way to think about this in the lingo of computer science, the metaphors of computer science, is that G is like an object, like a class. It stays within its own type, right? You instantiate it as one object, and then in comes data; it does some internal computation with the data, drops the data again, but it stays the same object. It doesn't become a new kind of object by consuming the data. Maybe that helps, and it's foreshadowing for what we might do later.

What this sort of hides is an implicit assumption. I mean, there's nothing written in this text about computation, right? It's just a mathematical statement, the typical kind of thing that mathematicians write down, not a computer science statement. But of course, for us, the underlying assumption is that G is something tractable, a function that can actually be computed. And we already kind of reduced this question of what computability actually means a little bit in the last lecture, which I'll get to next with maybe just a quick summary. So conjugate priors are the metaphor for what I just described: for taking the data, extracting some information, and then doing inference by calling this function G. So G is the object that does the inference, magically. And that magically is written down here in the third point. There is some kind of underlying assumption that you're supposed to be able to do this on a computer. But we haven't actually said what being able to do this actually means, and it would be nice to. So in the example case that I showed you in one of the earlier lectures, beta inference, who's wearing glasses and who is not, being able to do it reduced to the fact that the algebraic form of the function is some polynomial of tractable order, and the normalization constant of this distribution is an integral over this polynomial of tractable order, which we just pretended was tractable.
So we didn't actually talk about this, but this beta function, this normalization constant of the beta distribution, is actually, even in 2023, not an analytically tractable function. The problem of Laplace has not magically vanished because we've become smarter than Laplace. We've just found algorithms that approximate it very, very well. So there are just Numerical Recipes and whatever books that have implementations of this function that approximate it to machine precision very efficiently, in low computation time. But it's not like the beta function is something that's mathematically tractable these days and wasn't in the past. It's just computable to machine precision. So we sort of assume that that's what we mean by tractable: there's this integral, and there's some fancy algorithm that comes from some numerics library that just does the computation for us.

So this thing of being able to evaluate the integral turned out to be important for the definition of the object that is in the title of this entire lecture and the last one: an exponential family. It's an algebraic structure that we encountered several times in several historic examples and extracted some algebraic notions from. So we saw that there are these distributions, actually I'll go forward and show them to you already, like the Bernoulli distribution, the Poisson distribution, the Laplace distribution, the chi-squared distribution, the Dirichlet distribution, the gamma distribution, the Wishart distribution, the Gaussian distribution, the Boltzmann distribution and many, many, many others that you've maybe seen in statistics classes before. And they all seem to be sort of important. Why? Because they allow us to do this. They're either conjugate to some other likelihood (so for example the beta or the Dirichlet distribution is conjugate, in some sense, to the Bernoulli distribution or to variants of it like the binomial and multinomial distributions, which are also exponential families but don't fit in this table anymore), or they are themselves important likelihoods, like the Bernoulli distribution, that have conjugate priors. And historically, of course, all these distributions didn't come in one table. There wasn't a book that said, oh, here are all the exponential families we know so far, let's add one more. That's not how they evolved. They evolved in individual works by individual people. But in hindsight we can go back and find one underlying algebraic pattern, which is that they are all of this form. And admittedly it's not a particularly easy form to read and understand, right? Lots and lots of symbols. It kind of reads like driving over gravel. It's just not a smooth ride.

So these are probability distributions over some data X. So they are likelihoods, right? They are distributions over X that are parameterized by some W. So you can also write them as p of X given W, and they are of this form. There's first a function that doesn't depend on the parameters but only on X; that's called the base measure. And then in the exponential there is a linear term which mixes the two, it mixes the data and the parameters. And then there's a term that only depends on the parameters. So really what this says is that it's a likelihood that depends on data and parameters, and we can write it as a function that only depends on the data. A function that only depends on the parameters.
And a mixing term, which can be written as e to the power of an inner product between features of the data and, well, without loss of generality, the parameters. We could have a more general form where we assume that there are some parameters theta, which I have down here, and then this W is just a function of those parameters, but that's just a rewriting, a redefinition of what the parameters are.

So why is this math so important? What can we do with it? Well, I think the only real way to see why this is useful is to transform it into code. So what I'll try to do today is show you an implementation of exponential families that I've written together with Marvin Pförtner, a PhD student in my group, to whom I'm very thankful. Without him I couldn't have finished this code in time. It was just a little bit tight, and actually we sort of sent the code back and forth between us in the end, and he had to leave for a business trip this week, so I finished it last night and I couldn't even finish the final experiment, so you'll see what I mean. And this is an experiment for me, because in my experience, looking at code together in a lecture hall is a bit complicated, but at the same time it's also important to look at code, just like it's important to draw plots and look at math and maybe talk about history. So I'm curious to hear from you in the feedback whether it helped or not to look at code.

So now I have to think about how to best split up my screen. What we are going to do is, I'll go back, we'll look at this definition and first of all think about what actually is on this slide in terms of code. Does this define a probability distribution? Can I write down an object that defines a particular exponential family with this definition? No, some people are shaking their heads. Yes, so what we're going to write is something abstract. This is an algebraic structure that we'd like to translate into code. So it's gonna be an abstract base class, some abstract data type that we're going to define. And for that, I have the code over here. Boop, I go up. First of all, can you read this text before we get into the details? No, then I'll zoom in. Of course the price of zooming in is that we can look at less and less. Can you read it now? Good, also in the back? Yes, very good. Thanks a lot. There's a thumbs up in the last row. So I'll try and go slow and then we'll see whether that gets boring or not.

So it's Python. First of all, we're trying to write very explicit code. Maybe that makes it also sometimes a bit tedious to read, but I'll try and point out the important things for you. So in case you haven't seen this, I'm loading a module of the base Python interpreter called abc, which stands for abstract base class. This allows us in Python to define abstract base classes. And then functools, which is also part of core Python and provides some form of functional programming. That's a bit of fun. And this annotations thing up here, maybe many of you, who has seen annotations before? Ooh, fewer, that's a minority. Okay, so there's a feature being slowly introduced into Python that provides type hints, where you can basically say that certain variables that you're handing to a function have certain types. That's meant to, in the future, make things easier and hopefully also make Python faster. We're already seeing some versions of that, but the behaviour we want is not the default in the current versions of Python, 3.10, 3.11; it's sort of being trialled for a later version, and that's why we have to import it from __future__.
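To have the definition we're about to implement in one place, here it is written out compactly (my transcription of the slide, with data x, sufficient statistics phi, base measure h, natural parameters w, and canonical parameters theta; for discrete x the integral becomes a sum):

```latex
p_w(x) \;=\; h(x)\,\exp\!\big(\phi(x)^{\top} w \;-\; \log Z(w)\big),
\qquad w \;=\; \eta(\theta),
\qquad
Z(w) \;=\; \int h(x)\, e^{\phi(x)^{\top} w}\, \mathrm{d}x .
```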
We're gonna use JAX, and you'll find that that makes things a bit complicated, but also, once it actually works, gives kind of elegant code. And we'll see, I'll tell you why this is annoying when we get there. So what we're going to define, and this is where we can step out again and think about the high-level picture, is an abstract data type, an abstract class. It just says: what actually is an exponential family? And the whole point of abstract classes, in case you didn't notice in your undergraduate courses, is that they give you something beyond just writing down what you're about to do. So that's what we're trying to get to. They give you some practical value. Otherwise, you wouldn't need to write them; you could just write lots and lots of concrete classes. So we'll see what the practical value of this is.

So here is an exponential family. Actually, I can just about show you the whole thing that's important, that's good. So let's go back to the definition. Exponential families are defined by a base measure, sufficient statistics, and a log normalization constant, and then by their parameters as well, but the parameters are just things that parameterize each individual family, right? Oh, by the way, I should point something out that's really important, because it's a common misconception. This is a list of many exponential families. This is not the family that is the exponential family; each row in this table is one exponential family. Why is it a family? Because they depend on, well, because they are parameterized by parameters, by W. So each of these rows defines probability distributions over X, which have parameters W, and by changing W we create different distributions over X. That's why this one row is one family. The Gaussian distribution is one family. The gamma distribution is one family. The Wishart distribution is one family. This is not the exponential family. Some authors see this differently, so Kevin Murphy in his book just says the exponential family: anything that can be written as h of X times e to the phi of X transpose W, everything like this is part of the exponential family. I find this not particularly useful, because you can't parameterize the space of all phis, the space of all sufficient statistics; it's uncountable and unmeasurable and it's just annoying. It makes much more sense to say: if I give you phi and H and log Z, then you can parameterize by W, and that's a family. So that's what we're going to do.

We need to define H, phi and log Z, and that's exactly what's here. So an exponential family consists of sufficient statistics, log base measure, log partition function, and then there's a special thing that we've added to account for this little final sentence in the definition, because sometimes the textbook definition of the distribution, like for example for the binomial, for the beta distribution, and for various others, or for the Gaussian, doesn't actually use the natural parameters. So remember, for the Gaussian the natural parameters are mean times precision and precision, but we all think of it as mean and variance. So sometimes we need a map from one to the other, so we'll define that as well. And actually that's it. This here, oh, that's Z. It's log Z. The Z is called the partition function. Actually it's on the slide as well. So here we go, right? Oops. Here. And they are all abstract, because we haven't defined anything yet.
We just say there will be a function that computes the sufficient statistics, and it takes in the data and nothing else; the data is an array, and it returns an array. And it maps from the number of features that the data has to some other dimensionality, the number of sufficient statistics. By the way, this is sort of common notation for JAX code. This also means that if there are additional dimensions to the data, let's call them N, so if the data is actually N by D, then this function is broadcast across N. So if I give you not one datum that consists of D entries, but five, it'll just do those computations five times and concatenate them with each other. The base measure takes in, yes, yeah, that's right, takes in X. Good point. It also takes in X, which is of the same type as before, of course, and it might return an array, or actually typically a scalar, but scalars for JAX are also arrays. So, yeah, here we go. And it's not normalized.

So these two things are objects that operate on the data. But sufficient, these exponential families are distributions, no, let's start the sentence again. Exponential families are functions of two things, of data and of parameters. So those are the two things that are functions of the data: sufficient statistics and base measure. And now we need to talk about the functions that consume the parameters and then mix them together. So the log partition function is a function of the parameters, which are also an array, and it returns again a scalar, which for JAX is also an array. By the way, for these type annotations, this jax.numpy Array is sort of the replacement for pretty much Any, because for JAX that always works. So it takes in the parameters and consumes them and returns a scalar. And finally, we have this map from the natural parameters to the, no, from the canonical parameters to the natural parameters, which is the thing that on the slide here we called eta, which maps from one representation of the parameters to the other. Yes? The meaning of the slash at the end in Python is that this is a function where you're not, at least not supposed to be, allowed to give keyword arguments. So you have to just call this function with (a, b); you can't call it with parameters equal to this and x equal to that, you just have to call it with these two things. And this also means that there's nothing else handed to this function, because that's what the exponential family is: it's a function of x and w.

Okay, so of course, by writing this down, we haven't actually defined any exponential family yet. But if we think about it, if we had these things, then we could do certain things with them. In particular, we can evaluate, do I, yeah, let's do it like this: we can evaluate the log probability density function, because that's actually what the slide that I keep showing you has defined. So if someone wants to call the log probability density function of this thing, which is a function of x and the parameters, then, well, we do what's on the slide. We compute the log base measure, compute the natural parameters; for that, we have to call this function that takes the parameters and maps them to the natural parameters, to w. And then we compute this linear term, which is the sufficient statistics of the data in an inner product with the natural parameters.
Compute the log partition function, and then call log base measure plus linear term minus log partition function. So this is a function of only the data, this is a function of only the parameters, and this is the mixing term, which takes a function of the data and a function of the parameters and multiplies them with each other. And this complicated indexing, that's the price we pay for elegant broadcasting in JAX. You basically have to tell JAX which dimensions are the ones that it's supposed to broadcast over and which are the ones that are supposed to be consumed. And you basically have to stare at this for some time to convince yourself that it's the right thing to do. And you'll notice it mostly afterwards, when we actually make some plots, to see why this works.

So maybe I'll stop with the code here and then think about how we would use this now to define an actual exponential family. We've done that here. So here's a Jupyter notebook where I look at one particular example. So I've loaded a bunch of stuff, this is not important, it's just a bunch of boring Python. Let's look at one concrete example. The Poisson distribution is a distribution that we haven't actually talked about yet, but maybe you've seen it in some other statistics class. It's a very common probability distribution. It's the probability distribution over things that happen with a certain rate, but otherwise independently of each other. For example, and I think this is a nice picture on Wikipedia that I saw yesterday actually, scrolling through this, the distribution of chewing gum on sidewalks is approximately Poisson distributed, because you're assuming people don't pay attention to where the other chewing gum is. They're sort of absent-mindedly spitting it out, so it just lands with a certain rate. And if you go around and count how many pieces of chewing gum are on each tile, then the distribution of those counts is Poisson. And it has this probability mass function. So here it is, that's the important bit. It's too small for you to read, so I'll zoom in, it's here. But actually, I mean, it's also here in the notebook. So that's what it looks like. And that's actually an exponential family, because most distributions that we use in practice are exponential families, with very few exceptions.

So if you can write it like this, right, let's just convince ourselves that it's an exponential family. Well, it's a function of two things: the observed variable K, the number of pieces of chewing gum you've seen on the tile, and lambda, the rate at which chewing gum lands on tiles. Oh yeah, okay, question: the log PDF, what does it return? It returns a number; it's literally the logarithm of this function. So it returns the log probability of observing K pieces of chewing gum on a tile, given that the rate is lambda. So in this case, actually, K is a discrete variable. It's an integer from zero to infinity, like zero, one, two, three, four, five, and so on, but an actual number. But it doesn't matter; it's just a thing to evaluate. So clearly, this distribution contains a term that only depends on K. We could call that the base measure. It contains a term that mixes K and lambda. That's sufficient statistics times natural parameters. And then it contains a normalization constant, which only depends on the parameters. So we can see that the sufficient statistics are K. The base measure is one over K factorial.
And the natural parameter is the logarithm of the rate. And the normalization constant, well, the log partition function is just the rate, e to the natural parameter, yes. No, K would be equal to X here. And lambda is eta of, sorry, eta inverse of W, right? So lambda is theta. And yes, this is confusing, but that's also actually why we have this eta in there. Each of these distributions comes historically from a different person, and it has its own notation. The mean of a Gaussian is just called mu, and the covariance of a Gaussian is just called sigma. But together, they are the parameters theta, which you can translate into the natural parameters, that's called W, of the Gaussian, which are the mean times precision and the precision. So there's all this confusion about the notation, and yes, that's confusing, but in a sense, that's also what we're trying to get rid of by making an abstract definition of them.

So the question is, it seems like lots of people have defined all these distributions, so why don't we just make a list and call them the data types for every single variable that you might want to use, and just use them? The answer is, sort of, we should do that, with some caveats. So actually, I think it's a useful metaphor to say exponential families provide base data types for observables that you see in the wild. When you observe a real number that can lie between minus infinity and plus infinity, the Gaussian is probably the standard distribution you're going to use. When you observe counts of successes and failures, or people wearing glasses and not wearing glasses, positive and negative, then the binomial distribution and the corresponding beta prior are probably the pair you're going to use. And when you observe classes, one through K, the categorical distribution is probably the one to use. At least at first. But typically, or not typically, but often, we then realize in practice that the data we actually work with has some slightly different properties that need to be modeled a little bit differently. For example, for a long time in natural language processing, Dirichlet distributions, or variants of Dirichlet distributions, were used to build language models, because you can use them to keep track of the probability of individual next tokens given the context. The classic sort of 1980s language model is: you keep a table of Dirichlet distributions, and for each context you just count how often you've seen each next token, and that gives a Dirichlet posterior. It's a very efficient data structure, and so language modelers were loath to give it up. But in 2023, we might think it could have been a bit more flexible. And in a sense, actually, transformers are like the next generation of that, just parameterized in a different way. There's still this big attention matrix in there. Yeah, I would say it is actually the standard thing to try first, and then maybe you want to deviate from it a little bit later, and we'll do this. I mean, there's a reason why this is lecture five. It's actually similar to standard data structures in programming languages. A list is a very good metaphor for lots of things. But then, when you start using the list, you realize that you need some other functionality that a list doesn't give you, and then you start, you know, using a different data structure.

So let's see if we can implement our Poisson distribution.
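As a reference while we translate this into code, here is the decomposition we just walked through, written out (my transcription, using the k and lambda from the Wikipedia form):

```latex
p(k \mid \lambda)
  \;=\; \frac{\lambda^{k} e^{-\lambda}}{k!}
  \;=\; \underbrace{\frac{1}{k!}}_{h(k)}
        \exp\!\Big(
          \underbrace{k}_{\phi(k)} \cdot \underbrace{\log\lambda}_{w}
          \;-\; \underbrace{\lambda}_{\log Z(w)\,=\,e^{w}}
        \Big) .
```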
So for that, I did this yesterday night, and I hope I haven't messed it up. We define an actual, non-abstract class called the Poisson distribution, which is an exponential family. It has a constructor, but that constructor doesn't do anything. Why? Because this exponential family doesn't have anything else that it needs to store. There are other distributions, and maybe I'll make one of them a homework for Monday, which have other quantities that you need to know before you can define the exponential family. One example that I'm not gonna use in the homework is from last Thursday: I talked about inferring the variance of a Gaussian if you know its mean, and then we saw that the gamma distribution is the conjugate prior we need. So for an exponential family called Gaussian-with-unknown-variance, we might need a known mean, and that mean needs to be stored somewhere; that could happen in this constructor. So actually we don't need to call this constructor at all. I wouldn't even need to define it; I'm just doing it for the sake of completeness. Yes? Oh, yes, yeah, whatever, inverse gamma, gamma. So the gamma is the conjugate prior for the precision of the Gaussian. So this is really the totally superfluous line; I could just remove it. It's just calling the constructor of the superclass.

Instead, what we need to define is the sufficient statistics. So let's look at the equation again. The sufficient statistic is this here, where it's just k. It's an identity function. It doesn't do anything, but it's there, right? So the sufficient statistics are: let's take in a k, let's just make sure it's actually an array, and return it again directly. Just translating it into the underlying data structures that are useful for broadcasting afterwards. The base measure is one over k factorial, so the log base measure is minus log k factorial, because that's the log of one over k factorial. And you may have noticed by now that we don't call this factorial function that you learned in your first year, as like your hello-world example in all sorts of programming languages. Instead, we always call the gamma function, because it's a very efficient implementation of a generalization of the factorial function to the continuous domain. The gamma of k plus one is equal to the factorial of k. This plus one is annoying; it's because that's how Euler defined the integral when he was trying to interpolate the factorial function. But we just need this plus one. And gamma is implemented in all sorts of packages, in particular also in the copy of SciPy that is in JAX. So we call, from the special functions, the logarithm of the gamma function at k plus one.

Almost done; we need the log partition function. The log partition function is just lambda. So the log partition at lambda is just lambda. And here I've called it lambdas, for two reasons. First of all, you probably all know that I can't call it lambda, because that's a reserved keyword in Python, for good reasons. But also because we're typically going to pass in an array of lots of lambdas, so we might as well call it lambdas. And then finally, there's this thing, coming back to your question, right? On Wikipedia it's called lambda, but lambda is actually not the natural parameter; the natural parameter is the logarithm of lambda. So we have this function called parameters to natural parameters, which takes in lambdas, turns them into arrays and just returns the logarithm. And that's it. So now we've defined an exponential family.
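To make this concrete, here is a minimal, self-contained sketch of the kind of implementation described above. This is not the actual lecture code: the class and method names are mine, it handles a single datum rather than doing the full batched broadcasting, and it only needs abc, jax.numpy, and jax.scipy.special.gammaln.

```python
import abc

import jax.numpy as jnp
from jax.scipy.special import gammaln


class ExponentialFamily(abc.ABC):
    """p_w(x) = h(x) * exp(phi(x)^T w - log Z(w)), with w = eta(theta)."""

    @abc.abstractmethod
    def sufficient_statistics(self, x, /):
        """phi(x): a function of the data only."""

    @abc.abstractmethod
    def log_base_measure(self, x, /):
        """log h(x): a function of the data only."""

    @abc.abstractmethod
    def log_partition(self, natural_params, /):
        """log Z(w): a function of the (natural) parameters only."""

    @abc.abstractmethod
    def params_to_natural_params(self, params, /):
        """eta(theta): canonical parameters -> natural parameters."""

    def log_prob(self, x, params):
        """log p(x | params), combining the three pieces from the slide."""
        w = self.params_to_natural_params(params)
        linear_term = jnp.sum(self.sufficient_statistics(x) * w)  # phi(x)^T w
        return self.log_base_measure(x) + linear_term - self.log_partition(w)


class Poisson(ExponentialFamily):
    """p(k | lam) = lam^k exp(-lam) / k!, with natural parameter w = log(lam)."""

    def sufficient_statistics(self, k, /):
        return jnp.asarray(k)                    # phi(k) = k (identity)

    def log_base_measure(self, k, /):
        return -gammaln(jnp.asarray(k) + 1.0)    # log(1/k!) = -log Gamma(k + 1)

    def log_partition(self, w, /):
        return jnp.exp(jnp.asarray(w))           # log Z(w) = e^w = lam

    def params_to_natural_params(self, lam, /):
        return jnp.log(jnp.asarray(lam))         # w = log(lam)


# e.g. log-probability of 5 pieces of chewing gum on one tile at rate lam = 3
print(Poisson().log_prob(5, 3.0))  # approx. -2.29
```

The jnp.sum over the elementwise product is the inner product phi(x) transpose w for a single datum; the lecture code instead uses explicit broadcasting so that an n-by-d batch of data is handled in one call.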
I'll leave out a bit at the moment that I want to do after a quick break. But we can now use this to make plots of Poisson distributions. So I can define a likelihood that is a Poisson distribution. And what might I want to do with it? In particular, evaluate its log probability density function at lots of different places. This thing is now a function that takes in data and parameters and returns log probability densities, or probability masses, to be precise, over K given lambda. And what I do in this notebook, I've uploaded it to Ilias, you can look at it later, is mostly just a bit of matplotlib fiddling to plot likelihoods for lots of different values of K for different values of lambda, and for lots of different values of lambda for different values of K. So actually, if you go back to the definition of the Poisson distribution, you've seen this plot on Wikipedia; this plot is now recreated on the left here, basically. For 10 different values of lambda, one, two, three, four, five, six, seven, eight, nine, ten, we get these red curves. And they have little bubbles because these are discrete distributions; they're only defined at the bubbles, at the actual dots. And at the same time, this is also a function of lambda. So for different values of K from one through ten, you get different curves. This is a likelihood for lambda, essentially, yes. So each of these red curves is a probability distribution over K, and each of these blue curves is a function of lambda that is not a probability distribution. Yeah, so when I use the word likelihood, I mean a function of the right-hand side of a conditional distribution, which is not necessarily a probability distribution. And when I use the word probability, it's something that integrates to one. So the sum over one of these red collections of numbers, if you were to sum to infinity, is one. But here, along the blue curve, the sum is not one.

Okay, at this point, maybe we should take a quick break, because afterwards we can talk about conjugate priors. So let's say five minutes; I'll continue at 9:13.

So okay, maybe let's quickly pick up the thread again. What we've now done, what all this complicated piece of Python was for, is to say: there is this algebraic structure, this abstract object called an exponential family. And what does it actually allow us to do? So far, the only thing I've shown you is, oh, if you define these things, you can evaluate this function. Okay, is that it? Why do we do all of this? That's not so useful, right? I mean, of course, once you define the function, you can evaluate the function. Optimization, we'll get to optimization later on; so far, there's no optimization. The reason why we care about exponential families is not that they can be written like this, but that they allow tractable Bayesian inference. Why? Because every single exponential family has a conjugate prior. So let's first do this on the slide, then think about what that means in code, and then do it in practice once. So if we have an exponential family, this up here, that's our definition of an exponential family again. Then we can notice that if we now get data that is distributed like this X, and we would like to know what W is, then we can find a prior, which looks like this, we'll talk about it in a moment, which allows us to compute the posterior very efficiently by just adding up numbers. Why? So this thing contains sufficient statistics, parameters, and a log normalization constant.
If we multiply it with a prior, then, as a function of W, we'll get lots of terms with phi of X, and lots of copies of log Z of W. So that means we can define a prior that is itself of this form: it takes W times a number, and then log Z of W times another number. Because if we multiply this with a bunch of copies of the likelihood, then there'll be a sum over phi, because this product in the exponential turns into a sum, times W, plus W times alpha. So we just need to add the phis to alpha. Remember that phi is itself a vector for each datum, so an array in general. So alpha will be an array as well; it doesn't matter, it's just a vector. So we just add those up. And for every individual datum, there will be one term like this that involves log Z of W. So there'll be N copies of log Z of W. So if we have a term like this in here, we basically just add N to nu. And we get back a posterior which is of the same algebraic form, so a conjugate posterior. What happens to H of X? It doesn't matter, because we'll normalize, right? In Bayes' theorem, there's the normalization constant at the bottom, and anything that only depends on X will show up both in the normalization constant and in the likelihood, so we can take it outside of the integral downstairs and it'll cancel out. So H of X doesn't actually matter.

So this here, this function which is a conjugate prior, is also an exponential family, right? Because it's of the right form. Well, first of all, let's make clear: in this exponential family, W is now the data object and alpha and nu are the parameters. It's sort of one level higher in the hierarchy; this is sometimes called hierarchical Bayesian inference as well. So this is a function of W and a bunch of parameters, let's call them alpha and nu. It has, first of all, a function that only depends on W, the base measure; that function is just one, we don't need it. Secondly, it has a function that only depends on W, called the sufficient statistics. These are the sufficient statistics, and they are sort of defined by the likelihood. Then it has a function of the parameters, the natural parameters alpha and nu, minus a function that only depends on those parameters alpha and nu: the log normalization constant. So if we have that function, then we can do Bayesian inference. The only thing that is not completely specified by writing down the likelihood is this thing, this integral over the distribution. So where is the definition? It's up here. So we've reduced the entire problem, this super complicated problem of Bayesian inference on multivariate continuous-valued variables, to one question, which is: what is this function F? If we can compute this F, if someone tells me this F, then we're in business, and we can do closed-form, array-centric Bayesian inference.

And in fact, we can do more. We can use it to predict, for example, future data. I did this last Thursday on several examples, so I'll show it to you again just briefly. If I want to predict some other x, basically I want to marginalize out W. You could either do this after inference or before inference; it's just a marginalization. I just want to know a distribution over X. Then, by the sum rule, I have to solve this integral. It's a distribution over X and W, out of which I want to integrate W. And then I can just do some staring at algebraic expressions for a while, which we already did on Thursday, so I'll do it a little bit quicker.
And we see that these two terms multiply together, this likelihood and prior, both exponential families, so they are a base measure times an exponential of a bunch of terms, which turn out to be of the form that also makes up our conjugate prior. So this integral is actually an evaluation of this F function, except for this term. This term down here doesn't depend on W, so we can just take it outside of the integral, and we're left with an expression that just asks us to evaluate F twice. So if we have F, we can not only do Bayesian inference on W, we can also do marginalization over W. We can do conditioning and marginalization, the central operations of probabilistic reasoning. So the entire problem has been reduced to: what is F? And all this structure, all this complicated stuff on this slide, is basically just imposed on us by the fact that the likelihood is an exponential family and that we'd like to do conjugate prior inference.

So what does that mean in code? In our definition of the exponential family, I've now scrolled down a little bit, maybe I should pick up the thread. We started the definition of the thing called an exponential family by defining these quantities: sufficient statistics, log base measure, log partition function, parameters to natural parameters. Once we have those functions, we can already evaluate log probability density functions. But the really interesting thing is that they define a conjugate prior, a conjugate prior which is of another type, a derived type called conjugate family, which I'll define below, and a conjugate family is itself an exponential family. So it inherits something from this abstract base class. I'll get to what this is up here in a moment. It's probably best to now go down and say what this conjugate family actually is. It's here, let me see if I can, how far can I go? Okay, so this is the code translation of this slide.

So the conjugate prior to an exponential family is an exponential family that inherits something from a parent, a likelihood. It's another case of the prior being induced by the likelihood. So it's itself an exponential family, and it's initialized by telling it about the likelihood that it is being constructed for. That likelihood is itself an exponential family. And then, when we initialize it, we can automatically construct its sufficient statistics and its log base measure and its natural parameters purely by looking at the likelihood. That's what the slide told us. So if I give you this likelihood, then the base measure is one, so the log base measure is zero. The sufficient statistics are W, so the parameters of the parent class, and minus the log of the normalization constant, the log partition function of the parent class. And the natural parameters, we just give them a name, are called alpha and nu. And how many of them are there? Well, there are sort of P plus one, where P is the number of natural parameters of the parent class. And this extra one is nu, and it's a counting variable. It keeps track of how many data we've seen, to do Bayesian inference. Okay, so the sufficient statistics of this conjugate family are the likelihood's natural parameters and the negative, there's a minus here, log partition function of the likelihood. Those are the sufficient statistics. They are defined automatically if I tell you the likelihood. The log base measure is just zero, because the base measure is one.
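On paper, the construction we just walked through looks like this (my transcription, in the notation from the slide, with F the conjugate prior's normalization constant):

```latex
G(w \mid \alpha, \nu)
  \;=\; \frac{1}{F(\alpha, \nu)}
        \exp\!\big(\alpha^{\top} w \;-\; \nu \,\log Z(w)\big),
```

and multiplying it with n i.i.d. likelihood terms gives the conjugate posterior update and, by the same trick, the predictive distribution as a ratio of two evaluations of F:

```latex
p(w \mid x_{1:n}, \alpha, \nu)
  \;=\; G\Big(w \;\Big|\; \alpha + \textstyle\sum_{i=1}^{n}\phi(x_i),\;\; \nu + n\Big),
\qquad
p(x \mid \alpha, \nu)
  \;=\; h(x)\,\frac{F\big(\alpha + \phi(x),\, \nu + 1\big)}{F(\alpha, \nu)} .
```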
And the log partition function? What is the log partition function? We don't know, right? That's the only thing we don't know. Annoying. Why? It's this thing, oops, sorry, it's this F. So in some cases, it might be that we don't know F, and then we know that there is this conjugate prior, but we can only do certain things with it. What we can do is, for example, multiply the prior and the likelihood, and then we can effectively define this function with this proportional-to sign inside, which is this product. And we know how to evaluate this product, because we know both of these functions now, up to normalization. So what we can do, and it's actually in the code, is: if you don't know the normalization constant yet, you can already evaluate the unnormalized log probability density function, because the base measure is one, so the log base measure is zero, it doesn't add anything; the log normalization constant we don't know yet, so we leave it out; but we can compute this inner product between natural parameters and sufficient statistics. So that gives us the shape of the curve, the geometry of the space, but not the normalization constant. So we basically need to ask the parent: do you know what my log normalization constant is? Do you know what F is? And if the parent knows, then the likelihood can say, I know what your log normalization constant is, and then you can do Bayesian inference and we're done. And if it can't, then we have to think about it. So up in the parent class, in the likelihood, there is a function called the conjugate log partition, which is actually not an abstract function in this abstract base class. It is the answer to the question: do you know what the normalization constant of your conjugate prior is? Because you know what your conjugate prior is, you know its sufficient statistics, its base measure, its natural parameters; the only thing missing is the normalization constant. Do you know what it is? And in general, the abstract base class answers: nah, I don't know. But when you instantiate the class, you can of course override it. You can say, maybe I know, and then I can do Bayesian inference.

So let's do this. Previously, we've talked about the Poisson distribution. Quick reminder: the Poisson distribution is this distribution of chewing gum on sidewalk tiles. It looks like this. It's an exponential family, which we've already implemented. What is the conjugate prior of this Poisson distribution? So let's look at this again. This is our Poisson distribution; it looks like this, sorry, it looks like this. So its conjugate prior, by construction, is the exponential family that looks like this. It has base measure one. Sufficient statistics: log lambda. Why? Because log lambda is the natural parameter of the likelihood. So the sufficient statistics are w and minus log Z of w. So w is log lambda. And remember, log Z of w is just lambda. Up here? Oh, it's here actually. So there we go, here it is. So the minus log normalization constant is just minus lambda. The natural parameters are alpha and nu, and now we just need to know what the normalization constant is. So if you just stare at this equation for a bit, we can rewrite it like this. What we need to know is: what is the integral over lambda to the alpha times e to the minus nu lambda, d lambda?
And then you look it up, you step deep into the cellar of mathematics, and you find, in seventeen-hundred-something, 1780, 1790, whatever, Euler, there he is again. And he said, I know this integral. It's called the gamma function, up to a reparameterization. So the normalization constant of this conjugate prior is the gamma function. The conjugate prior to the Poisson distribution is the gamma distribution.

Okay, so the only thing we need to tell our implementation of the Poisson distribution is: the normalization constant of your conjugate prior, if you give me its natural parameters alpha and nu, is, sorry, I'll go back down again, this is the point where, as I said, talking about code is difficult, you have to scroll back and forth and it gives you a headache. So the normalization constant is gamma of alpha plus one, divided by nu to the alpha plus one. I've implemented this up here: it's the logarithm of gamma at alpha plus one, minus alpha plus one times the logarithm of nu. Now that we've done that, we can do closed-form Bayesian inference. Because that's the only thing that matters, the only hurdle, even though we're talking about potentially several continuous-valued random variables, to doing closed-form Bayesian inference. And I do that here.

So we can do inference like this. We define our likelihood. I've done this particularly explicitly, so that it sort of drives home the point: our likelihood is a Poisson distribution. Now we can ask, automatically, what is the prior? We don't need to invent it. We don't have to come in and say, oh, what would be a possible prior? No, we just ask the likelihood: what prior do you like? What's your conjugate prior? And it'll construct it for us, because of the way we've implemented this abstract definition of the base class. It'll construct this derived object called conjugate family, which is an exponential family itself, which looks at the likelihood and says, okay, give me everything I need. And then it asks: do you know what my normalization constant is? And the likelihood says: yes, I know, because the user has just told me, they've just implemented it. Okay, great, then we can start working.

So, give me my parameters. I just picked some prior natural parameters; I set alpha to one and nu to one. You can change the code if you like. These are the prior parameters of this gamma conjugate prior. And then give me some data. Here, I've only collected one observation, which is: I've looked at one tile on the street, and there are five pieces of chewing gum on it. Okay, five. Five is my first datum. Of course, I could add more later. And then we ask: what is the posterior now? And here it is. This is the line that hopefully drives home the point. In functional language, the posterior and the prior are the same object; they just have updated parameters, because of conjugate prior inference. So we could, of course, leave out this line, but it makes the notation clean, right? The posterior is equal to the prior if we think of it as an object, a function of parameters and data. Except, of course, that we now evaluate it at different parameters. So what are those different parameters? For that, I have to ask: what are your parameters? And for that, we need a function in the likelihood called the posterior parameters, which I haven't talked about yet, but maybe you can guess what it is if you go back to the slide here. It's this line.
It says: take the sufficient statistics of the likelihood, compute them, add them to alpha, count how many data you've seen, add that to nu, and that's it. Those are the parameters that you need. And they are actually implemented not in the implementation of the Poisson distribution; they're implemented in the base class, because we know how to do it in general. So let me make a bit of space so that you can see it. Once you have your conjugate prior, you can compute posterior parameters by taking in the prior's natural parameters and the data. And then what you do is, you first translate the parameters into natural parameters. Oh, sorry, no, you don't have to translate, because in the prior they are already, by definition, natural parameters; we just have to make sure that they are arrays, that they sort of fit into the JAX framework. Then we compute the sufficient statistics of the data, phi of X. We count how much data there is; that's the extra, additional sufficient statistic of the conjugate prior, the one for nu. And then we just sum them up. So this is the magic thing here: you just take the sufficient statistics and, for every single datum, sum them up. So that's this object. And now we just have to add alpha and nu. That happens down here. We have to split them up a little bit: the first entries, up to the penultimate one, are the ones for the likelihood's natural parameters, and the last one is the count variable. So we add them in here, alpha and nu. And then we just say: now we know what our updated natural parameters are for the conjugate prior. They are alpha plus the sum of the sufficient statistics, and nu plus n. So this bit we never have to re-implement; in the Poisson, we get it for free.

And then we can make a plot. So here's the plot. I started with this prior; the gray thing is the prior distribution. It's just how I defined it. I could have picked other parameters up there, rather than one and one; I could have picked two and five or whatever, which would have given different distributions. The likelihood is the Poisson distribution evaluated as a function of lambda, given that I've seen five pieces of chewing gum on the street. And if I multiply this blue line with the gray line, the red line is what I get out. Including normalization, because the normalization is the tricky part, right? I could, of course, compute the product of the gray and the blue line; that's this unnormalized posterior log PDF function. But if I also have the normalization constant, then I can get this red curve. And I'm done. So I don't actually know yet what I'm gonna set as homework for Monday, but it's surely going to be to implement one of those exponential families that isn't the Poisson distribution in this form. And then you'll hopefully see that this automates the problem completely.

So yeah, now I'll do this and then skip over something and do a final bit. Okay, so if we decide that the mathematical language that we operate in is of the form of exponential families, which will actually be the case for pretty much the entire rest of this class, in some sense or another, at least approximately, then what does that mean? What does it come down to? It mostly means picking sufficient statistics. Once we decide what in the data we care about, what phi of x is, and the base measure is sort of unimportant, then we automatically get an entire inherited scaffold for Bayesian inference. That's really at the heart of this idea of probabilistic reasoning.
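Before moving on, here is a minimal, self-contained sketch of just this conjugate-update machinery for the Poisson example, written as plain functions rather than the class hierarchy from the lecture. The names are mine, not the lecture code's, and the prior is written as a density over lambda, matching the gamma integral computed above.

```python
import jax.numpy as jnp
from jax.scipy.special import gammaln


def conjugate_log_partition(alpha, nu):
    """log F(alpha, nu) for the Poisson likelihood: the gamma integral
    int lam^alpha exp(-nu * lam) d lam = Gamma(alpha + 1) / nu^(alpha + 1)."""
    return gammaln(alpha + 1.0) - (alpha + 1.0) * jnp.log(nu)


def posterior_parameters(alpha, nu, data):
    """Conjugate update: alpha <- alpha + sum(phi(x_i)), nu <- nu + n,
    where phi(k) = k for the Poisson."""
    data = jnp.asarray(data)
    return alpha + jnp.sum(data), nu + data.shape[0]


def log_pdf(lam, alpha, nu):
    """Log of the gamma prior/posterior density over lam:
    alpha * log(lam) - nu * lam - log F(alpha, nu)."""
    return alpha * jnp.log(lam) - nu * lam - conjugate_log_partition(alpha, nu)


# prior alpha = 1, nu = 1; one observation: 5 pieces of chewing gum on one tile
alpha, nu = posterior_parameters(1.0, 1.0, [5])
print(alpha, nu)                 # 6.0  2.0
print(log_pdf(3.0, alpha, nu))   # posterior log-density at lam = 3, approx. -1.14
```

Evaluating log_pdf on a grid of lambdas with the prior parameters and with the updated parameters gives curves like the gray prior and red posterior in the plot just described.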
I said in the very first lecture that the difference between statistical machine learning and probabilistic machine learning is that in statistical machine learning, you keep coming up with new ideas and then you always have to prove that they work. That's very flexible, but it's also tedious, because you have to keep writing proofs that things work. In probabilistic machine learning, it's the exact opposite. We convince ourselves that measure theory is the only right way to think about reasoning, and after that, we only have to solve computational problems. We define what we care about in the data. We say: here's my data, okay, tell me something. It's actually what happens when you talk to users as well, right? If you're a data scientist or a machine learning engineer, someone comes in with their data, and you ask: what is your data? What do you care about? What would you like to predict? It's typically a very complicated conversation. And what you do there is, you basically write down phi. Once you know phi and H, you don't need the person with the data for anything else. You can send them home and say, fine, we've computed phi, now my job as the machine learning engineer starts. And then you think about how to do this, okay? So the first thing is, you think: I'm gonna define an exponential family for my likelihood; that automatically tells me what the prior is, because there is a conjugate prior for it. And the only thing we need to know to be able to do full Bayesian inference is this F function, this log partition function of the conjugate prior. And the fact that all of this is automated, this entire process automated, I've proven to you by writing down code, right? That was like a constructive proof, by writing this abstract base class and all these functions, with one example. This solves, in some sense, machine learning, right? It's the scaffold that we're going to hang everything on. The only problem is that quite often, we don't know what F is. We don't know what the normalization constant is.

So I will skip over one slide that I wanted to use just to remind you, but I'm gonna say one sentence about it, because we'll need it for the next thing: the Gaussian distribution is also an exponential family. And the only tedious thing about it, which is maybe by now really useful as a teachable moment, is its natural parameters. The sufficient statistics of the Gaussian are the first two moments of the data: x and minus one half x squared. So they are like Taylor coefficients, in some sense, on the data. The only annoying thing is that the natural parameters are not the ones that we're used to. We're used to mean and variance, or mean and covariance for multivariate Gaussians. It happens that the natural parameters are mean times precision, so mean over variance, inverse covariance times mean, and the precision, the inverse covariance. And the normalization constant is known; it's this thing, basically a variant of Gauss's integral. So we'll need this to think about, in the final 10 minutes, what we do when we don't know what F is.

And here, my main point is going to be that, historically, I think it's not entirely unfair to say that this problem has actually been a main Achilles heel of Bayesian inference. So what many statisticians then argued, and actually you hear this to this day, if you leave this university and go out and work in industry and say I'd like to do Bayesian deep learning, they're like: ah, Bayesian deep learning.
And here my main point is going to be: historically, I think it's not unfair to say that this problem has actually been a main Achilles heel of Bayesian inference. So what many statisticians then argued, and you actually hear this to this day if you leave this university, go out and work in industry, and say you'd like to do Bayesian deep learning: ah, Bayesian deep learning, it's so expensive, it doesn't work, it's complicated. Why? Because you don't know F. And this has maybe also led statisticians to say: oh, let me just do maximum likelihood estimation and Fisher information matrices and whatever, complicated stuff, and confidence intervals, and let's do proofs. So what I would like to leave you with, and maybe we'll do it briefly at the beginning of the next lecture again, is: you don't have to despair just because there's one function you can't evaluate. You can still salvage a lot by using what we do in machine learning, which is: take care of your code, look at the geometry of problems, and use automatic differentiation and linear algebra to your advantage. And the funny thing, sort of the joke about all of this historically, is that the solution was not invented in 2020, or in 2018, or in 1980, or in 1905. It's already in the original book by Laplace, with his approximation; he basically solved it for us already. So the first observation I'd like to make is that if you have an exponential family, but it's intractable in the sense that you don't know its normalization constant, like for our prior, then you can still find its mode with optimization. And these days, in 2023, the main message I'm sending with that, hopefully, is: this problem is not any harder than the stuff you've become completely accustomed to in deep learning. Finding the mode of such a distribution is at most as hard as training a deep neural network. Why? Well, let's think about this object again. If I wanted to find the mode of this (I've left out the H already, because you can probably guess why), so if I need to find the mode of this function in W, then I take its logarithm. Or maybe even of a product of such terms, that doesn't even matter. So let's say I have lots of data that comes from this distribution and I'd like to know the mode in W, like a maximum likelihood estimate. Then I take the logarithm of this object, so the exp is gone. Take the gradient, and now something nice happens: because W is the natural parameter, the gradient with respect to W is just whatever is in front of W, so it's the sum of the sufficient statistics. And then the other term on which I have to compute the gradient is the log normalization constant. So I need Z, and for the likelihood I have it, of course, because I had to work it out to define the distribution in the first place; for the Poisson and so on, we did all these integrals. This is typically one step easier than the F in the conjugate prior. Then I can just compute the gradient and solve this root-finding problem. I call it a root-finding problem because it has this structure: there is this term on one side with the data, but there is no W in it, right? So we can write this as: gradient of a function minus a constant equals zero. That's a root-finding problem. And the only thing that's maybe interesting about this is that it doesn't require us to go back to the guy with the data, right? This person we just had our conversation with about the data, they've told us the sufficient statistics and now they've left the room, and we're like, oh no, I wanted to know the mode of this... oh, I don't need this person anymore. They've already told me everything I need to know to find the mode. I don't need to go through the data again. I don't need a data loader that keeps going through it. Once I have the sufficient statistics, I can just compute this gradient and somehow follow it, or whatever, and find the mode, okay?
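Here is a minimal sketch of that root-finding view in code, assuming we have the likelihood's log partition function as a JAX-differentiable function; the function names and the plain gradient ascent are my own choices, standing in for whatever optimizer you would actually use:

```python
import jax
import jax.numpy as jnp

def mle_from_sufficient_statistics(log_Z, phi_sum, n, w_init, lr=1e-2, steps=5000):
    """Maximum likelihood in the natural parameters w of an exponential family,
    using only the summed sufficient statistics phi_sum = sum_i phi(x_i) and the count n.
    Stationarity condition (the root-finding problem): phi_sum - n * grad(log_Z)(w) = 0."""
    grad_log_Z = jax.grad(log_Z)
    w = w_init
    for _ in range(steps):
        # gradient of the log-likelihood  w @ phi_sum - n * log_Z(w)
        w = w + lr * (phi_sum - n * grad_log_Z(w))
    return w

# Illustration with the Poisson likelihood: natural parameter w = log(lambda),
# log Z(w) = exp(w), sufficient statistic phi(x) = x.
data = jnp.array([5.0, 3.0, 4.0])
w_hat = mle_from_sufficient_statistics(lambda w: jnp.exp(w[0]),
                                        phi_sum=jnp.array([data.sum()]),
                                        n=data.shape[0],
                                        w_init=jnp.zeros(1))
# exp(w_hat) should be close to the sample mean, here 4.0
```

The point of the sketch is only that the data is touched once, to form `phi_sum`; after that, the optimization loop sees nothing but the sufficient statistics and the count.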
Now, how am I going to use the remaining time most efficiently? Because we're going to run out of time and I don't want to leave the punchline hanging. Okay: I will do this example, then we'll do Laplace on Monday, and I'll try to catch up with the Gaussians then. So what we're going to do here is construct a point estimate, okay? Let me take two minutes. If I give you data that comes from an exponential family, then a statistician, a frequentist statistician, will tell you: you can compute a maximum likelihood estimate from this data, and that gives you an estimator, a point estimate for what W might be. As a probabilistic person, you shouldn't be happy with that, because it's not a distribution, it's just one particular estimate. So I'm gonna leave you hanging today by telling you that we can construct such estimates, and then on Monday, including with code, I'll show you that we can automatically also estimate an uncertainty around it, because it's 2023 and we have autodiff. You can look ahead if you want to. So we'll leave it at that and just look at one example and think about how that would work. Let's say I have a Gaussian distribution, last slide, piece of annoying math. We have a bunch of data that comes from, let's say, a univariate Gaussian distribution. I don't need to introduce the Gaussian distribution because you've seen it many, many times. So the likelihood is a product of such terms, and each of them looks like this. We know that we can write each of these as an exponential family: the base measure is one over square root of two pi, the sufficient statistics are x_i and minus one half x_i squared, and the natural parameters are these. The log normalization constant is known as well. So we find the maximum likelihood estimate for the Gaussian. Of course, you may have done this before in an exercise sheet in some other class, but we can also find it algorithmically, procedurally, by using the fact that it's an exponential family. We take the log normalization constant, the partition function over here, and compute its gradient with respect to w. Now, w has two entries, w one and w two, so the gradient will have two entries. Let's first look at the derivative with respect to w one. Only this term depends on w one: there's a square here, the square comes down and cancels with the two, and we're left with w one over w two. That's the first entry of the gradient. For the second entry: there's a w two here and a w two here, but none here. So we get w one squared, and this is w two to the minus one; the minus one comes down, we get a minus in front, the one half stays, and there's a square now in the denominator. Plus the derivative of the log term with respect to w two, which gives minus one over two w two. We plug it all in, and now we replace w one and w two with the parameters we usually think about when we talk about Gaussians, which are defined up there; we just plug in the definitions. Here they are. And to find the maximum likelihood estimate, all we need to do is make sure that n times this gradient is equal to the sum of the sufficient statistics. That sum is basically a sum over x_i and a sum over x_i squared (up to the factor of minus one half); that's also why they're called sufficient statistics. Now you see that we're basically constructing the non-central moments of the data, the mean and the non-centered second moment, and that fits neatly to this thing over here, which reminds us of what this definition actually is. So the maximum likelihood estimate is going to be the data mean and the uncorrected, one-over-n version of the variance estimate.
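For reference, here is that calculation written out cleanly; this is my own transcription of the blackboard steps, using the exponential-family form of the Gaussian given above:

```latex
\nabla_w \log Z(w)
  = \begin{pmatrix} \dfrac{w_1}{w_2} \\[2ex] -\dfrac{w_1^2}{2 w_2^2} - \dfrac{1}{2 w_2} \end{pmatrix}
  = \begin{pmatrix} \mu \\[1ex] -\tfrac{1}{2}\,(\mu^2 + \sigma^2) \end{pmatrix},
\qquad
n \, \nabla_w \log Z(w) \overset{!}{=} \sum_{i=1}^n \phi(x_i)
\;\;\Longrightarrow\;\;
\hat{\mu} = \frac{1}{n}\sum_{i} x_i, \quad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i} x_i^2 - \hat{\mu}^2 .
```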
If you're wondering why there is no one over n minus one, because you've learned something in your stats class: maybe we can cover that on Monday, very briefly. So, let me sum up what we've seen today. I'm gonna skip to the end and leave out the bit that I wanted to do, my big joke; the punchline we'll do on Monday. But maybe it was still good to go slowly through the code. Please do give feedback and tell me what you thought of the code. Was this too much scrolling up and down? Is it too complicated Python, "I can't parse this"? Is it better to just do math? Or is it sometimes helpful to have these constructive proofs in code? What I tried to show you today is: if we commit ourselves to using this structure of exponential families, then it provides us with an interesting language to think about what actually is hard in Bayesian inference. It tells us that to do Bayesian modeling, you first need to take the data and think about which statistics of the data you care about, phi of x. That defines an exponential family, and with that, an entire process is set in motion that tells you everything you need to do afterwards. You can even write it down as actual code, as an abstract base class: something that, first of all, defines what the distribution will look like. The shape of the distribution for the data, and of the likelihood for its parameters, is pretty much precisely determined by the sufficient statistics that you've chosen. So by picking certain phi, certain statistics, you've basically assumed a particular distribution and parameters for it. And funnily enough, there is a pretty much automatic conjugate prior for those parameters, so the mechanism for inference on those parameters is also imposed on you, if you like. And the only thing you need to know to be able to do all of this is the log normalization constant, the partition function of the conjugate prior. That's assuming that you know the normalization constant of your likelihood, which you also often do. So actually you need both, you need Z and F, but Z is usually much easier than F. Sneak peek for deep neural networks: Z is usually something that sits in the loss, and F is the complicated thing we need to think about if we do Bayesian inference. So if you decide to use the cross-entropy loss, that's like knowing what Z is, basically. But doing Bayesian inference on a deep neural network, that's the F part, and that's the hard bit. So far, we've seen how all of this algebraic structure emerges. We haven't yet seen what we do when we don't know F. We'll talk about that on Monday. Thank you.