Good morning. It's a Thursday again, so it's all very early, and we can start by looking at the feedback from the last lecture, which shows a very marked shift from the first two lectures. Those were all "oh, this is so nice, lots of fun, not so complicated". Remember, this is roughly where you would like to be: very good, and just the right speed. Your scores have now clearly shifted towards "this was very fast", and for some of you it was actually way too fast, although you still quite liked the lecture, so that's good. I also asked about the difficulty of the exercises. This is for the very first sheet, which you handed in last Monday, and I think this is a decent distribution. It's always difficult to ask how hard the exercises are, because no one would say they're too easy, so I phrased the question carefully in the hope of getting a reasonably realistic answer from you. No one thought the exercises were impossible, and no one said they didn't do them, which is really good. So we've got a realistic distribution. The most important detailed feedback was that many of you found the Monday lecture too fast. There were two things we did pretty much right after each other, not at the same time, but right after each other: first, the theoretical foundations for how to work with functions of random events, called random variables, and how to construct probability distributions on continuous spaces; and right after that, an applied example. Partly because I wanted to trade off a bit between doing fun examples and hard math, and partly because you needed it for the exercises. That was probably too much, and it went too quickly. Well, I got through my Monday material, so I could give you the exercise sheet, but we should probably slow things down a bit.
So what we'll do today is this: I've decided to split this lecture, which could perhaps have been one lecture, into two lectures. We'll do the first part today and the rest next Thursday. By the way, there won't be an exercise sheet next week, because next Monday is a public holiday, and exercise sheets only go out on Mondays. So, no exercises next week. Of course, I can't quite keep myself from adding a bit of content if we do two lectures, but hopefully the split helps slow things down. So let's take a step back and think again, very carefully in terms of the concepts from the first half of Monday's lecture, about the stuff we did in the second half of Monday's lecture. The question was: what's the probability that someone, let's say in the general population, is wearing glasses? First of all, for the setup, let's think about the concepts we've used. There are random variables involved in this inference problem, two types of them. On the one hand we have observations: people wearing glasses or not wearing glasses. On the other hand we have an unknown probability for these observations to take some binary value. This probability, let's call it Y in this graph, and the observations, let's call them X1 through Xn. All of these are called random variables because you can construct them as functions on some elementary event space. And here you can already see an example of how we don't really need to talk about the elementary events: we just call everything a random variable, because it's not even important to think about what the elementary events were. We do have to think about the domains, the types, of these individual variables. The observations are binary, so they are either zero or one, and the probability is a real-valued quantity that lies between zero and one, inclusive. And then the other thing
that I want to point out, and I'll do it very slowly one last time here, after which I'll be much faster with it, because I know people are often confused about this: there are different symbols attached to the name and the value of these random variables. Y is the random variable that describes the probability for someone to wear glasses, and pi is the value that that random variable takes. It's like when you define a function, for example in Python: you give the variable that goes into the function a name, and then when you call the function, you pass in a value for that named variable. The name of the variable here is Y, and the value we pass in is pi. Kind of obvious for computer scientists, but maybe it helps to start thinking about it this way, because then it becomes clear why there are two different symbols showing up, Y and pi. It's customary to use capital letters for random variables and lowercase for the values they take, but that's a bit too much constraint on our namespace, so I'll later also use capital letters for something else; otherwise we'd run out of letters pretty quickly. Secondly, if we were proper probability theorists, we would in principle need to write p(Y = pi) to say: what's the probability for the variable Y to take the value pi? That's literally a function being evaluated with something plugged in. But that makes the formulas very long, so I'll usually just write p(pi). Correspondingly, the likelihood is a function of two variables: the probability of the variable Xi taking the value xi, given that the variable Y has the value pi. And I guess at this point it already becomes clear why the full notation is a bit tedious. So now we've defined the problem; that was the syntax. Now we need the semantics: how are we going to do inference when we observe individual data?
We use Bayes' theorem. That means we start out with a prior probability distribution over the unknown quantity pi, and I showed you this visualization. Here in the plot on the left you see the prior; I've decided to use a uniform prior again, and we already saw that we can vary this a little bit, but we'll get to that in a moment. A uniform prior might be the most obvious thing to do, in some sense, because if you really don't know, why prefer any of the individual values? Then, to do inference, we multiply this prior with the likelihood and normalize by the evidence; we do Bayesian inference. So when we observe the first datum, the first person, who in this case is not wearing glasses, we multiply the prior by the likelihood and divide by the evidence. The evidence is the normalization constant, so we have to integrate out the terms on top. Then, when we get the second observation, we multiply in a second likelihood. There is an assumption hidden in doing this; it's not just totally automatic. Multiplying the probability distributions like this amounts to the assumption that the second observation is conditionally independent of the first one, given Y, the probability to observe someone wearing glasses, or actually its value pi. What it actually means to be conditionally independent is a really tricky question; we already discussed it last Monday.
For example, here in this room there's a certain group of people that probably isn't an ideal sample from the entire population, because you are, for example, all pretty much the same age. If we had a sample that also included some older people, there would probably be more glasses around. I also started counting at the front, and the people in the front might be slightly more likely to wear glasses, because they need to sit in the front to see better, or something like that. Such assumptions are going to be all over the place, but we'll just make them, because they make the computation easier, and it's not as if everyone else isn't making them too. So let's actually plug in our observations. I've seen two people who were not wearing glasses, so now I've multiplied this prior twice with the likelihood. For that we had to ask: what actually is the likelihood? I stood up here and asked you a few times, and we convinced ourselves that the probability to see someone wearing glasses, given the underlying probability, is just that probability itself: it's just pi, the identity function. And if I now observe someone who is not wearing glasses, like in this case, then the probability to see someone not wearing glasses is just one minus the probability to see someone wearing glasses, because there are only two possible options: you either are or you aren't wearing glasses. Again, we could probably have some discussion about whether there's a third option, but presumably we all agree that this is a decent thing to do. So I've multiplied in the term (1 - pi) twice, and, by the way, we already observe that the third person is wearing glasses. How convenient. And we see that this gives quite an interestingly structured distribution.
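The multiply-and-renormalize process just described can be sketched numerically in a few lines. This is only a minimal illustration on a grid, standing in for what the lecture's app does; the three observations match the ones counted in the lecture, two without glasses and one with.

```python
# Sequential Bayesian updating for the glasses example, on a numerical grid.
# The closed-form Beta solution comes later in the lecture; this is the
# brute-force version of the same computation.

N_GRID = 1001
grid = [i / (N_GRID - 1) for i in range(N_GRID)]  # candidate values for pi

belief = [1.0] * N_GRID  # uniform prior over pi

def update(belief, x):
    """Multiply the belief by the likelihood of one observation x
    (1 = glasses, 0 = no glasses) and renormalize on the grid."""
    new = [b * (pi if x == 1 else 1.0 - pi) for b, pi in zip(belief, grid)]
    z = sum(new) / N_GRID  # crude numerical integral over [0, 1]
    return [v / z for v in new]

for x in [0, 0, 1]:  # two people without glasses, then one with
    belief = update(belief, x)

# Posterior mean of pi; with a uniform prior and this data it is 2/5.
mean = sum(b * pi for b, pi in zip(belief, grid)) / N_GRID
print(round(mean, 3))
```

Note that the grid version already shows the problem discussed next: the normalization constant z has to be computed numerically, because the product of likelihood terms is not itself a normalized density.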
Maybe this is a little surprising, because you'd think these individual terms are just linear functions, right? They all look like this, or like this, so if I just multiply a bunch of them, it shouldn't be so complicated. But actually it is, because as a function of the underlying unknown probability pi, this is a power law in some sense; it's a rather complicated higher-order polynomial. If you expand this bracket to get individual terms in pi and multiply everything out, you get a very high-order polynomial: high-order meaning of the order of the number of observations we've made so far, and pretty much all the terms are in there. One part of the problem is that we need to normalize by the unknown number that integrates this term; we'll talk about that in a moment. So, to normalize this object, we need to compute this integral, and if we know what this integral is, we can just give it a name, and then we can describe this as a probability distribution which is parameterized. What do I mean by parameterized? Let me tell you. After five observations we have this kind of form, and then we stare at this expression for a while and try to convince ourselves what kind of prior we might want to use to keep this computation tractable. A simple option is to just use a uniform prior, just multiply by one; then the question kind of goes away, because it's a factor of one and the problem is gone. But we can actually be ever so slightly more general, by allowing for priors that are of this form. Maybe I should stop talking here for a moment and ask you: what's the cool thing about this form? Can someone say a sentence about why this might be a good parametric form for a prior to choose?
So you're saying a and b essentially define the prior belief in... say again? ...the number of people. So maybe we can look at what this prior looks like again. If you choose alpha less than one, this distribution can look like this: an asymmetric, trough-shaped thing that puts high probability on zero and one. But we can make it... oh yes, yes, exactly. This prior is of the same, let's say, algebraic form as the likelihood. The term up in the middle, to the left of the prior, is something of the form pi raised to some power, call it n, times one minus pi raised to some other power, call it m. And if you choose the prior of the same form, pi to the alpha times one minus pi to the beta, where you could also write alpha as a minus one and beta as b minus one, then the posterior distribution will always involve only this kind of term: pi to some power times one minus pi to some other power. One nice thing about this is that if you wanted to generalize this prior to alpha values larger than one, you could just add in pseudo-observations, instead of setting alpha larger than one just for the visualization. So this would be the prior with 3.7 on the right-hand side and 0.5 on the left, and we could pretend that that's the prior and only start counting now. That's sort of like code reuse, right: you could use the prior that only takes alphas between zero and one, and then generalize it through observations. Another nice thing, which is actually more important, is that if we are able to solve this integral, then we will be able to keep full Bayesian inference tractable for any number of observations. And that's something Laplace observed in his original text as well. I'm not going to ask you to translate it again; I've already done that. In 1814, in his treatise on the analytic theory of probabilities,
he actually writes an observation like this. But it's 1814 and people spoke a bit quaintly back then, so let's read it: when the values of x, considered independently of the observed results, are not equally possible (the values of x are our probability pi), let z denote the function of x which expresses their probability. It is easy to see, by what has been said in the first chapter of this book, that by changing y into y times z in the formula, we will have the probability that the value of x lies within given limits. And this amounts to assuming all the values of x equally possible a priori, and to considering the observed results as being formed of two independent results, whose probabilities are y and z. One can thus reduce all cases to the one where we assume, a priori, before the event, an equal possibility for the different values of x. So he's trying to explain that you can rescale the prior such that it becomes like a uniform distribution: instead of having a different kind of prior, you start with a uniform prior and then plug in some pseudo-observations, like in this example. Now, Laplace had the problem that he couldn't actually solve this integral, but as I already mentioned on Monday, he found a way around it, using another integral that he could solve, which was close: the Gaussian integral. Yes? So you're already making the main point. This kind of situation, where no matter how much data we have, the posterior always keeps the same algebraic structure, is captured by what's called a conjugate prior, and here's the definition. Let's consider a data set and some unknown variable that we want to observe... that we want to infer, sorry. There are always going to be these two things: the observed part, the data, and the latent thing, which we call x. They are connected by this function called the likelihood, which is
really central to Bayesian inference. It's not the prior that's so important; it's the likelihood that's the core part of the problem. It's a function of two quantities: the data and the unknown thing x. A conjugate prior for this likelihood, for this function with these two arguments, is a probability measure with a probability density function, let's call it g of x and theta. So g is a function that takes x and some other parameters, such that the posterior distribution that arises from applying Bayes' theorem (that's the thing in the middle) can be written as g of x and the prior parameters plus some function of the data. By this notation I mean in particular that this g is of the same algebraic form. So g could be implemented by some computer code, some program that takes in these parameters, and then we can compute the posterior by calling the same function over and over again, updating the parameters by taking the data, running some other function called phi on it, and feeding the result into the posterior. This function phi that we apply to the data is called the sufficient statistics. Today we'll talk about the kinds of distributions that give us this structure. We'll first do a bunch of examples, then try to tease out the general algebraic structure we're looking for, and then next Thursday we'll talk about more advanced ways we could use these kinds of distributions, and how to actually write them in code.
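That structure can be written down directly as a minimal code sketch. The names theta and phi follow the slide; the Beta-Bernoulli case fills in the details, and the example data below are made up for illustration.

```python
# Conjugate-prior bookkeeping for the Bernoulli/Beta case. The belief is
# represented only by its parameters theta = (alpha, beta); the data enters
# only through the sufficient statistics phi, exactly as in the definition
# "posterior = g(x, theta + phi(D))".

def phi(data):
    """Sufficient statistics: collapse the data set to two counts."""
    n1 = sum(data)
    return (n1, len(data) - n1)

def update(theta, data):
    """Posterior parameters: prior parameters plus sufficient statistics.
    The same density g (here: the Beta density) is then reused with the
    new parameters; no other part of the computation changes."""
    return tuple(t + s for t, s in zip(theta, phi(data)))

theta_prior = (1, 1)  # uniform prior, Beta(1, 1)
theta_post = update(theta_prior, [0, 0, 1, 1, 0])
print(theta_post)  # (3, 4): two ones and three zeros, plus one each

# Updating in batches is the same as updating all at once, which is also
# exactly the pseudo-observation trick from earlier: a non-uniform prior
# behaves like a uniform prior that has already seen some data.
assert update((1, 1), [1, 1, 0] + [0, 0, 1, 1, 0]) == \
       update(update((1, 1), [1, 1, 0]), [0, 0, 1, 1, 0])
```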
So on Thursday next week we'll stare at some Python code for quite a while. At the outset, we can already point out why this might be interesting. There are actually two reasons why this is an interesting structure to have. The first is that it's evidently computationally efficient: if we can define the function g, then that function takes over; it automates the process of Bayesian inference. It basically says that in such situations we can do Bayesian inference in closed form, closed in the sense of within the abstraction of the function g, whatever g has to do inside. There's some code that someone can hand to you, and it's just going to do Bayesian inference for you. Maybe it needs to do some fancy optimization inside, but that's what it's going to do. The other, maybe slightly less obvious, nice thing is that the data processing in such a situation happens, in modern language, outside of the inference. Notice how we have this function phi, the sufficient statistics, which we call on the data set; once we have called it, we hand its result to the function that does the inference, and that function is not going to touch the data again, because the data is encapsulated by phi. This is actually, in 2023, the most interesting aspect of this process: if we have such an algebraic form, then we can encapsulate away the data processing and just compute these statistics. That's why they're called sufficient statistics. Afterwards we can throw away the data set and never touch it again. Notice how different this is from what you do in deep learning, where you have a data loader that keeps going back to the data. Yes? So your question is: can phi be a deep network? Of course; everything can always be a deep network, because it's just a function. But the more important aspect is how we use the output of this function and what we feed in. You're probably thinking
of something that transforms the data, after which we hand the transformed data to the inference. That's not what's happening here; instead, we're collapsing over the batch dimension of the data. Let's say you have a data set, and actually we can do it right here. I have a data set of, I don't know, 120 people, so the size of the data set is 120, all wearing glasses or not wearing glasses. But what I feed into my app is just two numbers: how many pluses, how many minuses. How many people with glasses, how many without. I can store this entire data set in two numbers. That's a drastic reduction, not in the representation of an individual input, but in the number of data points I have to store. So if you think of your image example, this is not about taking the images and transforming them into some lower-dimensional representation; it's about taking the 60,000 images and collapsing them into a few summary numbers. That's the cool thing happening here. So let's actually do this again for the case of Beta inference. Here I'm really just going to rewrite what we already had in the language of the concepts we just talked about. The process, to go through it again really slowly, is: someone comes in with a likelihood that looks like this. Oh, and on this slide I actually made the mistake of forgetting the binomial coefficient, but it doesn't matter; maybe it's even a nice teachable moment. So someone comes in with this probability for the observations, and all the x's are zeros or ones: we observe someone wearing glasses with probability f, or someone not wearing glasses with probability one minus f. Now I notice that I can rewrite this function by counting up how many zeros and ones I have.
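The compression point can be shown in a tiny simulation. The data here is simulated, and 0.3 is an arbitrary assumed probability of wearing glasses; the point is only that two integer counters replace the whole data set.

```python
# Stream 120 binary observations through the sufficient-statistics counters,
# one at a time. Only two integers survive; the raw data set of 120 values
# never needs to be stored, unlike a deep-learning data loader that keeps
# going back to the data.

import random

random.seed(0)

n1 = n0 = 0  # the entire "memory" of the inference
for _ in range(120):  # 120 people, as in the lecture hall
    x = 1 if random.random() < 0.3 else 0  # assumed p(glasses) = 0.3
    if x:
        n1 += 1
    else:
        n0 += 1

# Two numbers now stand in for all 120 observations.
print((n1, n0))
```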
It's actually correct as written, and I can check with you why. Clearly, if I have n1 people wearing glasses and n0 people not wearing glasses, then I can just count how many f's I have in the product, namely n1 of them, and how many (1 - f)'s, namely n0 of them, and write it like this. Now, I called something the binomial distribution at some point, but did I actually call this the binomial distribution or not? No: this is a distribution over the x's. On Monday I said that when you do a change of variables to some derived random variable, you need to make sure that in your PDF you multiply by the Jacobian. Or actually, no: that's for PDFs. For discrete distributions like this one, you have to count up the volume of the pre-image of the transformation. So if I want this to be a probability distribution over n1 and n0, rather than over x, I need to multiply by the binomial coefficient: capital N, which is n1 plus n0, choose n1 (or equivalently choose n0). That's the number of possible configurations N people could be in if n1 of them are wearing glasses. But since I wrote this as a distribution over x, it's fine; I can leave it as it is. So this is the typical situation: someone comes in with a data set. There's no prior yet; there's a generative process for the data, some underlying unknown f and some observed x, and the request: please tell me how to do Bayesian inference. Now your job as the machine learning engineer starts. You stare at this expression and say: okay, this is a likelihood. It's a distribution over x, and you can make it a distribution over the counts n. In n it's this exponential-type function, and in x as well, right? But in f it's a power law: f raised to something times (1 - f) raised to something else. So my prior is probably going to be of that same form.
It'll have to be a conjugate prior, so it'll also need to be of this form: f raised to some power times (1 - f) raised to some other power. For historical reasons we always end up with this minus one in the exponents; someone just decided that that's more convenient. Who? Euler. That's just how he wrote this integral, because he wanted to interpolate the factorial function, and for some reason he found that form more convenient. So if we choose a prior like this for this likelihood, then the posterior is going to have this form too, right? We just get to add up the sufficient statistics of the data. Here the sufficient statistics are simply: how many people have I seen who wear glasses, and how many who don't? The only problem left, even for Laplace in 1800-something, is that you need to normalize: you need to know how this integrates to one, and for that you need to divide by the integral over this thing. In a way this is still good, because the only thing we need to be able to do is solve integrals of this form. If someone gave you a big book in 1805 that contains the integral, and says you can write integrals of the form f to the alpha times (1 - f) to the beta as this number, then we can automate the whole process: we just plug in our numbers for n1 and n0, look up the value of the integral in the table, and plug it in. Now, in 1805 Laplace didn't have this; he only had Gauss's integral. That's why he used Gauss's approximation, or actually his own approximation, which I'll call the Laplace approximation. But it's 2023, and we have these things: they're called beta functions, and they are in SciPy. They're super precise, down to machine precision, so you just call them. Great. That's actually what I've done in this code, and here's the definition of this function again.
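A minimal stand-in for those library calls: the lecture's code uses scipy.special.beta, but the standard library's Gamma function does the same job here, since B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b).

```python
import math

def beta_fn(a, b):
    """Euler's Beta integral, B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b).
    scipy.special.beta computes the same quantity."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

# For integer arguments it reduces to an inverse binomial coefficient:
# B(n1 + 1, n0 + 1) = 1 / ((n1 + n0 + 1) * C(n1 + n0, n1)).
n1, n0 = 2, 3
n = n1 + n0
print(beta_fn(n1 + 1, n0 + 1), 1 / ((n + 1) * math.comb(n, n1)))

def beta_pdf(f, a, b):
    """Density of the Beta(a, b) distribution at f: the power-law term
    from the slide, normalized by the Beta integral."""
    return f ** (a - 1) * (1 - f) ** (b - 1) / beta_fn(a, b)
```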
So you can write it in this form. It's related to the gamma function, which is also in Euler's treatment. Euler was really interested in this function because, if z is an integer, gamma is an interpolation of the factorial function, which back then was defined in this way. The beta function can be constructed from the gamma function, and in fact, if the parameters are integers, we can think of it as essentially computing binomial coefficients for us. In some of your homework (the week after next) you'll get to use a bunch of these beta functions, and you'll find out that B is a convenient name for it, because in a sense it's also the binomial function. So now, to make sure we understand, let me ask you a question. Say I have such a distribution. This, by the way, is called the Beta distribution, because it uses the beta integral. There's a whole family of probability distributions with names of Greek characters, not because the Greek character matters in any particular way, but because the normalization constant happens to be the first, or the second, or the third integral that shows up in a treatment by Euler from the 1700s. So say I have such a distribution. For example, take the first row: I have [0, 0, 1], one success and two failures, and I've chosen my uniform prior to make it a bit easier. Now someone tells me: actually, in this room there are a hundred and twenty people. Please predict how many of them are going to wear glasses, without looking at the further rows. Of course I could just look at everyone, but if I want to predict the future from the data I've seen so far, what do I do?
You say we can either take the expectation of the posterior, or the arg max. Okay, so I wanted to ask: why not the posterior itself? This is maybe a good point to stop briefly. This is the posterior over the random variable Y, which takes values pi, the probability for someone to wear glasses. What I would like to have is an associated probability distribution over the number of people in this room wearing glasses. That's a different random variable, and in particular it's integer-valued: its values do not lie between zero and one, they lie between, let's say, zero and a hundred and twenty. Let's pretend I know that there are a hundred and twenty people in this room. For this I've actually made a slide; let's hope it contains the right things. It's a bit inconvenient that I've used x and y here in the same way, so let's look at this expression up here again. Say I know capital N, and I would like to predict N1 as a function of capital N, or, equivalently, predict the individual x's. Yes. So the first point is that what we are trying to do here is predictive inference. It's just prediction; that's what machine learning is supposed to do all the time, right? You get some data, you finish training (training was the process we just looked at in the app), and now you need to predict something else, the future. So what I need to do is write down the distribution of the quantity to be predicted, given the unknown quantity (say, for example, I want to predict whether the next person is wearing glasses or not), and multiply it with my current belief over the unknown quantity. So why is that the right thing to do?
First of all, this is the sum rule. I could also have written this as the joint probability of the unknown thing, the thing to be predicted, and the other thing that helps us predict it, and then just integrate out the part I don't know. It's a case of: I have two variables, but I only care about one of them, so what do I do? I sum out the one I don't care about. Now here is a case where the notation (and I'm going to keep doing this throughout the term) gets complicated. If you have some distribution over f, whatever that distribution is, you can predict other quantities x in this way. In particular, this p of f could be a posterior arising from some other data. It could be the posterior arising from those three data points in the front row, and I could use it to predict the next datum: datum number four, five, six, seven, eight, nine, or maybe all hundred and twenty. So what do I need to do to make this prediction? Well, I plug in the likelihood for the next observation. And here, as I made this slide late last night, I stupidly swapped between x and n. I could have called this p of n, or n1, given f and N, or written it as p of the next observation in this form. If I wanted to predict all the next observations, all capital-N of them, they are going to be of this form as well, right? Because they also depend on f in the very same way. And my belief about f, this distribution, whether it's a prior or a posterior, will be of this algebraic form. So, therefore, this integral that I need to solve: what is it?
Yeah, sorry about this x-and-n business. There are two situations that I mixed up on this slide: either I want to predict the next datum, one individual person, which is a binary variable, zero or one; or I want to predict the number of people wearing glasses in this room. These are two different quantities. If I just want to predict the next person x, then x is a binary random variable. If I want to predict the next capital-N people, then I need to write down a binomial distribution and multiply by a binomial coefficient here. But that binomial coefficient doesn't actually matter for the integral: it matters for the numbers that come out, but it sits outside the integral, because it doesn't depend on f. So, to clean up this slide: if I wanted to predict N new observations, out of which N1 are going to be positive and N0 negative, then this distribution here is a binomial distribution, which has a binomial coefficient in front, N choose N1. But N and N1 do not depend on f, so the coefficient can go outside of this integral, and this slide is about the integral; that's why I, annoyingly, forgot about it. You had a question? Uh-huh, exactly right. So what we observe is: to predict more data, we get an integral that has the same form as the normalization constant from our posterior inference before, and so we can solve it. It contains exactly the number we needed to know to do Bayesian inference in the first place, and then we just take a ratio: we divide one by the other. Actually, I have an app for this as well; let me just pull it up.
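The ratio just described, written out, is the Beta-Binomial distribution. A sketch using only the standard library; the numbers, one success and two failures on a uniform prior, are the ones from the glasses rows earlier, and 120 is the room size from the question.

```python
import math

def beta_fn(a, b):
    """Euler's Beta function, B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_binomial(n1, N, a, b):
    """p(n1 successes out of N | Beta(a, b) belief over f): the posterior
    predictive is a ratio of two Beta integrals, times the binomial
    coefficient that sits outside the integral."""
    return math.comb(N, n1) * beta_fn(a + n1, b + (N - n1)) / beta_fn(a, b)

# One success, two failures on a uniform prior -> Beta(2, 3) posterior.
a, b = 2, 3
N = 120  # predict the number of glasses-wearers among 120 people

probs = [beta_binomial(n1, N, a, b) for n1 in range(N + 1)]
mean = sum(n1 * p for n1, p in zip(range(N + 1), probs))
print(round(mean, 6))  # expected count is N * a / (a + b) = 48
```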
If you want to get it from the GitHub repo: last night I pushed a new folder, which contains another app, the one I have here. And here is another way of looking at this problem. Let's say there is some process that produces binary observations, either zero or one, and I get to choose how many of them I'm going to see, say 10 observations. Those 10 observations happen with some unknown probability f, and in this case I've rewritten things such that I actually provide this unknown f. It's as if someone secretly in the last row were setting the probability to some value; I'm not setting the individual observations directly, but I'm creating a situation in which such observations can be generated. So what I've done here is draw 10 observations from this actual binomial distribution. Here in red, in this bar chart, are the 10 observations: we have eight failures and two successes. The probability is 50/50, but if you draw 10 random numbers with 50/50 probability, this is actually quite likely to happen. It's almost a bit of an outlier, which is slightly annoying, but it just happens when you only have 10 observations. Now what I do is write down the likelihood for this observation. The likelihood is a function of two quantities; if we fix the data and think of it as a function of f, we get this blue curve. That's the likelihood. This is really the function f squared times (1 - f) to the eighth; that's just what it looks like. And you can see that this is not a probability distribution, not a probability density function: if you integrate it by eye, it doesn't add up to one. Its domain is zero to one, and we're adding up numbers that are less than one, so if you add up numbers less than one over a domain of width one, you'll get out something less than one.
So we need to normalize: we divide by the beta integral over those observations, and that gives us a posterior. This red posterior would be our belief about f if we didn't know that the true f is actually 50%, if no one had told us and we only had those observations. This black line here is the actual value of f, the true probability used to draw these numbers, and you can see that it lies in a region where the posterior has low probability, but not zero probability. It's still possible that that's the right answer. What I can now do is reverse the situation and say: if this red distribution were your belief over f, what would be your predictive distribution if I tell you I'm now going to draw 10 more observations? What is that distribution going to look like? Then you get these golden bars. So it's possible to get this observation, or this one, or this one, and you can see that it makes the red observations quite likely, but it also puts non-trivial probability on the correct value. And if I now increase the number of observations, then of course everything contracts around the 50%, because of the law of large numbers: both the posterior and the corresponding predictive distribution concentrate around the true value. Yes? No. So, well, no: we do not condition on how likely it is to observe six positives or negatives. What I've done here, and I realize this is a bit dangerous to do, is to say: first I give you some data; now use this data to learn something about the distribution, and then predict the data itself. Not the actual numbers of the data, but something like this data set. So let's be a bit more precise.
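The law-of-large-numbers concentration can be seen directly in the Beta posterior's standard deviation; a small sketch, with made-up numbers rather than the demo's exact draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_params(n, f_true=0.5, a0=1.0, b0=1.0):
    """Beta posterior parameters after n coin flips with true success
    probability f_true, starting from a Beta(a0, b0) prior."""
    x = rng.random(n) < f_true
    n1 = int(x.sum())
    return a0 + n1, b0 + n - n1

def beta_std(a, b):
    """Standard deviation of a Beta(a, b) distribution."""
    return np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# The posterior spread shrinks as the number of observations grows
spread_small = beta_std(*posterior_params(10))
spread_large = beta_std(*posterior_params(1000))
```

Whatever the random draws turn out to be, the 1000-observation posterior is much tighter than the 10-observation one, which is exactly the contraction visible in the app.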
I said: I have gathered a number of observations, in this case 45, and I have had this many positives and negatives. Now, if I gave you another data set of 45 observations, under the posterior that you have after those first 45 observations, what would be your predictive distribution for the outcome of that experiment? That's this golden distribution. It sounds a bit circular, but it actually isn't; it's just that I didn't want to have another set of sliders here with more plots, which would make it even more complicated. Otherwise I could have had two different n's, right? Of course I can predict two different experiments with different n's; here it's 45, so this is for the next 45. That's why you get ever more bars. Actually, if I go back: this code works for zero, it just doesn't predict anything, because I can always predict a data set of size zero. And if I have a data set of size one, then this code can only predict a data set of size one. Of course, I could take this posterior and use it to predict a data set of size a hundred. Yes, it's just for this plot that I have two different apps; you can use the one from the third lecture for the other situation. Okay, so by the way, last week after the lecture I noticed I had an email from Streamlit saying: your app ran out of resources. Everyone clicked on the link, used it, and then it died because it didn't have enough resources. Unfortunately, there isn't even a way to pay for the Streamlit cloud; it's just a community service, so I can't really do much about this. If you want to have the app running on your own machine, just do the steps down here and then you can run it and it's going to work, assuming you have Python. And one nice side effect of this is that it forces me to write this requirements list, because the app has to work on the cloud server, so I'm pretty sure that after following it, it should work for you.
It automatically makes sure you have the right packages. So this is the situation. Let me see where we are time-wise. Uh-huh, okay, let me speed up a little bit, but without going too crazy, because I've now really gone quite slowly to make this very clear. I hope this helps, but if it gets too slow at some point, let me know in the feedback. So now let's think of a slightly different situation. Let's say I don't ask whether people are wearing glasses, but I ask a question that has multiple possible answers. For example, I could ask, I'm not going to, but I could ask for your nationality or something; that's also a value of which everyone typically has only one, although nationalities don't actually work that way in Germany, but whatever, something else that has more than two possible answers. Then I would need what's called the multinomial probability distribution, sometimes related to the categorical. There's the binomial, which relates to the Bernoulli probability the same way that the multinomial distribution relates to the categorical probability distribution. The categorical distribution says: I have one event that takes one out of k possible values. The multinomial distribution says: I have n events, each of which can take values from one to k. That's a pretty straightforward generalization of the binary situation, and you're going to do it, or are already doing it, in your exercise sheets, so I'm not going to tell you too much about it. But I'm going to solve sort of the first part of the exercise for you. This notion is connected to this young man from Richelette, which is in modern-day Belgium and back then was part of some late medieval or early modern state. He came from a family around there; they all come from this little town, Richelette, and one of his forebears wanted to distinguish himself from his own father.
So he called himself the young one from Richelette, "le jeune de Richelette", and you can sort of hear the Belgian accent coming through. So he named himself after his family's hometown; he's called Peter Gustav Lejeune Dirichlet, and that's his distribution, the Dirichlet distribution. There's something about these distributions: they're always named after some random historical fact. The beta and the gamma distribution are named after whether the integral that Euler solved appears in the first or the second position, and the one that actually has a proper name is named after a guy who named himself after his father, who named himself after the place he came from, as the young guy from there. Weird. Nevertheless, this is the idea behind it, and now let's see if we can do it quickly but still get the patterns right. Someone comes in with a likelihood that says: I have observed a bunch of numbers that are categorically distributed, drawn independently of each other with some unknown probability distribution, which is of course parameterized by k numbers if you have k categories. Actually it's k minus one, because the last one has to make sure they sum to one, but let's say something like k parameters. And I would like to learn that probability distribution; I would like to know with which probability each of these classes shows up. You can imagine that this is a very common situation all across inference. So now we stare at this expression, and we notice that we could also parameterize it in terms of how many observations we've made, not what the individual observations were but how many of each kind we had. If we wanted to make this a distribution over the counts n rather than the observations x, we would have to multiply with some multivariate generalization of the binomial coefficient.
You can get those from the same kind of integral, but we don't have to, because it doesn't matter for posterior inference: it's just a constant that shows up both in the numerator and the denominator, so it cancels out anyway. For Bayesian inference this doesn't matter. We can write this thing like this. Sorry, this is a probability distribution over x but not over n; but it doesn't matter, because we care about its relationship to f. We want to do Bayesian inference on f. Therefore our prior will have to be of the same algebraic form; it will also have a term like this in it, and terms like this need to be normalized. So now we need to know this normalization constant again. But actually it turns out there is a straightforward generalization of the beta case to the multinomial case, and it's this multivariate beta function, which looks like this, and you see that it again uses the gamma function. So if you can call gamma functions, you can compute these things. If we do this, then, and here's the point where you quickly have to pay attention: clearly, if I do Bayesian inference and multiply this likelihood with this prior, I get out a posterior in which each element of this vector f is raised to some power. And what I need to do is just add up how many observations I've made. So the sufficient statistics are simply the counts: let's just count how many observations of each class I've made. If you give me the entire data set, I go through it and count up how many of each class I've seen; I can do this in one pass through the data set, and then I never have to touch the data again. I can do the Bayesian inference afterwards. This is what it looks like as a little app. It's the same situation.
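The one-pass counting update described above can be sketched in a few lines; this is my own illustration, not the lecture's code:

```python
import numpy as np

def dirichlet_posterior(alpha, data, k):
    """Conjugate update for categorical data: the sufficient statistics
    are just the per-class counts, computed in one O(n) pass, and the
    posterior is the prior pseudo-counts plus those counts."""
    counts = np.bincount(data, minlength=k)
    return np.asarray(alpha) + counts

# Flat Dirichlet(1, 1, 1) prior over k = 3 classes
data = np.array([1, 2, 1, 0, 1])   # observed class labels
post = dirichlet_posterior([1.0, 1.0, 1.0], data, k=3)
# post is [2., 4., 2.]: prior pseudo-counts plus the observed counts
```

After this one pass, the original data can be discarded; all further inference runs off the posterior parameters.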
It's just more complicated to plot. Now we have three classes, because with more than three I couldn't plot this distribution over here; it's hard enough to draw it in three. The three classes come from an original distribution that I set down here; in this case I've set the first class to 0.3 probability, that's this one, and the second class to 0.23 probability, which then of course defines the third class, because they have to sum to one. That's the golden bar here. Ah, here we go, there's a thing I had not updated yet. Okay, so this is one draw; zero draws. Before I see anything, I have a prior. The prior is over here. You don't see anything in this plot; why? Because it's flat. I've set all the parameters to one, so it's a uniform distribution over this thing, which is called the simplex, by the way: the polytope that constrains the space of all non-negative numbers that sum to one. I could also change the prior a bit, and then you'll see some structure in this Dirichlet prior. The Dirichlet prior has these parameters, an alpha for each of these terms, and we can use them to make pretty steep, crazy distributions, or slightly peaky distributions that are centered somewhere; if all the parameters are larger than one, you get some kind of centered thing. I won't talk too much about this, because you're doing it in your exercise and you can play with the distribution yourself if you want to. So I'll go back to the uniform prior, so that you believe me that it's nothing about the prior. I set the true probability, let's actually make it pretty peaky, so it's somewhere decently outside of the center of the simplex, and now we get the first observations. One observation: it's the second class. Two observations.
It's the second and the third class. Three observations: the second class once and the third class twice, and so on and so on. And if I now make several observations, say 17, the posterior will concentrate around the true value, which is this point, and I can use this posterior to make a prediction about what a data set of size 17 might look like; those are these golden bars. How do I make that prediction? Here's our sanity check to see... actually, no, I'm not going to ask you here, because that's in your exercise. But we can do it for another case. Here's another situation, which is connected to this wonderful chap, Daniel Bernoulli, clearly a Baroque person from the age when people wore wigs, because their own hair had died, presumably from all the lice they had. Let's say you make observations that are continuous-valued; they are not counts, but they come from a Gaussian probability distribution, so they are normally distributed. And you even know what the mean of this Gaussian distribution is; let's say, without loss of generality, the mean is zero. But what you don't know is your measurement error: you don't know the variance, the standard deviation, the spread of your Gaussian distribution. So you observe x's that are drawn independently from a Gaussian distribution with some mean and variance. Gaussian distributions look like this; I'm not going to introduce the one-dimensional Gaussian distribution, because that would be insulting to you. So it looks like this, and let's say we know what mu is, but we don't know what sigma is. What do we do? Maybe by now you've seen the recipe, and we can follow the recipe. Yes? Yeah, as I said, we don't care about mu at the moment, we just care about sigma. So what we can do is stare at this expression and see what kind of algebraic form it has in sigma. Oh, and there's a bug.
This should be a sigma squared, not a sigma, of course; there's a hat missing here. So sigma shows up in two places in this equation: down here in the normalization constant and over there in the exponential. Maybe it's easier to see if you write down the log of this distribution; then we see that the logarithm is a function in which sigma shows up in two places, here, now also correct without the stupid typo, and here, and then there's a bunch of constant numbers floating around as well. To follow the very same process, we now say: let's think of a prior that has the same algebraic form, and the tricky part is that it has to have the same algebraic form in sigma, not in x and mu, sorry, because sigma is the thing we don't know. So we'll need a function whose logarithm looks like this as a function of sigma, and it looks like this. We assume that our conjugate prior distribution for sigma will look like something times some power of one over sigma squared, times an exponential of minus some constant times one over sigma squared. Why is there an inverse? Well, because there's effectively an inverse up here as well; this minus sign is something I could draw in here, and you can see there's an inverse here, the sigma squared in the denominator: minus some other constant times one over sigma squared, minus some normalization constant. So the only thing we need to know is the normalization constant Z of alpha and beta; we need to be able to integrate over the exponential of expressions of this form, because that's the probability distribution. It turns out that you can do that, and it gives an answer that looks like this. That's the inverse of the normalization constant, and there's a beta to the alpha.
That's just a transformation-of-variables thing. But the tricky part is this stupid integral that we've had a few times already, the Eulerian integral, the gamma function. The story goes that Euler wrote it in the third position in his essay, on page four or so, which is why this distribution is called the gamma distribution; it's also connected with the work of Daniel Bernoulli. So if we decide to parameterize our prior in this form, then the posterior will be of this algebraic form as well. To update it, we need to take the terms that look like the one-over-sigma-squared bit and add them up. So what are the terms there? It's n over two. So we need to update alpha with n over two, essentially counting how many things we've seen, and we need to update beta with this sufficient statistic: the sum of the squared distances from the mean. Notice how this is something you can compute as you go through the data, in O(n), in one pass, and then you never have to touch the data again; you can do everything afterwards just with this. I have an app for this as well. Let me make sure it's properly initialized. Here is no data. If I get one datum: in red I show you, in this case, the correct distribution, because it's a bit complicated, right? It's a Gaussian distribution with an unknown variance; here I've set the variance to one, but I can make it larger or smaller. Let's set it to one so that it's easy. And on the right I plot the likelihood; the likelihood is just the Gaussian function plotted as a function of sigma rather than of x and mu, so it doesn't look like a Gaussian, but it is one. It's just a different way of looking at a Gaussian distribution.
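The alpha-and-beta update just described fits in a few lines; this is my own sketch of the rule, not the app's code:

```python
import numpy as np

def update_variance_posterior(alpha, beta, x, mu=0.0):
    """Conjugate update for the gamma-type prior on the unknown variance:
    alpha grows by n/2, beta by half the sum of squared deviations from
    the known mean mu. Both sufficient statistics come from a single
    O(n) pass over the data."""
    n = len(x)
    return alpha + n / 2, beta + 0.5 * np.sum((x - mu) ** 2)

# Three observations around the known mean zero
alpha, beta = update_variance_posterior(1.0, 1.0, np.array([0.5, -1.2, 0.3]))
```

After this pass the data can be thrown away; alpha and beta carry everything the posterior needs.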
I can multiply this thing with its conjugate prior; this conjugate prior is the gamma distribution. It also has parameters alpha and beta, which I have here set to one. You get to play with them if you install the app for yourself and shift those sliders around. Then I get a posterior distribution that looks like this red curve over here. After a single datum, that was a lucky datum that landed just right at the true distribution; if I actually increase the number of data points, it's probably going to be a little bit off. Yeah, so now the posterior moves around a bit, but it quickly concentrates on the true value. So the red curve concentrates on the true value. In this plot, the golden thing is the prior distribution, which doesn't change, and on the right-hand side you see the predictive distribution, in gold or mustard color. This is if we ask: what's the probability for one more Gaussian sample? What is its distribution going to look like, where is it going to fall? And this distribution is not a Gaussian distribution. You can maybe see that it has heavier tails, especially if I go to a smaller number here; it decays more slowly. Why? Well, let's think about how you would compute this thing. This is connected to a story; actually, maybe it helps to tell the story here. I've mentioned it before, last year, if you were in the other lecture. This guy is called William Sealy Gosset. He studied in the UK and then lived in Ireland, working for the Guinness brewery to make beer, an important job in the early 20th century; he was at some point the master brewer of the Guinness brewery. And if you make beer, that's like a continuous experimental design problem. You set up these barrels and then they sit, and sometimes the beer is good and sometimes it's bad. Of course, these days.
It's very narrowly, precisely controlled, but in 1900 it wasn't; it was a very messy kind of process. So he needed to figure out what the variances of his experiments were, and he had studied statistics under Pearson; he knew all about this beautiful math, and so he did the following derivation, which we are going to do now. It's the same thing as for the beta and the gamma, really the same thing. We have a likelihood, which in this case someone has given to us, which looks like this. We've convinced ourselves that this is the conjugate prior for it, so we have constructed a conjugate posterior of this form: that's just a function that gets evaluated at sigma and has parameters alpha and beta. Those parameters might come from some previous experiments, or they might just be set in some way; somehow we have those parameters alpha and beta. If I now want to predict where the next observation is going to fall, then I need to marginalize out this unknown parameter sigma in this generative process. And if this likelihood is of the Gaussian form, which I plug in here, and this prior is of the gamma form, which I plug in here, then this looks really confusing, and the first time you see this equation you go: where am I even supposed to look? Well, think of this thing as a function of sigma. Sigma shows up here, and there, and there, and there. So there are two terms that look like sigma inverse raised to some power, and two terms that involve an exponential of minus sigma to the minus two. We just rearrange those terms, take all the stuff that doesn't depend on sigma outside, and add up the exponents. So it's just like updating the posterior with one more datum, and then we marginalize to get a normalization constant, which is given by the normalization constant of the gamma distribution.
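The marginalization over sigma can be checked numerically against the closed form; in this sketch, which is mine and not the lecture's, I place the gamma prior on the precision tau = 1/sigma², in which convention the marginal is a Student-t with 2·alpha degrees of freedom and scale sqrt(beta/alpha):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

alpha, beta, mu = 2.0, 3.0, 0.0   # illustrative posterior parameters

def marginal_pdf(x):
    """Numerically marginalize the precision tau out of the generative
    process: integral of N(x; mu, 1/tau) * Gamma(tau; alpha, rate=beta)."""
    integrand = lambda tau: (stats.norm.pdf(x, loc=mu, scale=1 / np.sqrt(tau))
                             * stats.gamma.pdf(tau, a=alpha, scale=1 / beta))
    return quad(integrand, 0, np.inf)[0]

# The closed form: a Student-t, heavier-tailed than any single Gaussian
closed_form = stats.t.pdf(1.3, df=2 * alpha, loc=mu, scale=np.sqrt(beta / alpha))
numeric = marginal_pdf(1.3)
```

The numeric integral and the Student-t density agree at every point, which is exactly the derivation Gosset carried out in closed form.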
It's just evaluated at other parameter values. So it looks like this, and then you can convince yourself that you can also write it like this, which is a form that some of you who have taken a stats class have seen before. This distribution really should be called the Gosset distribution, but somehow these distributions always end up with odd names. He was working for a company, so he was under an NDA; he wasn't allowed to publish under his own name, but they allowed him to publish under a pseudonym. So he called himself Student, and this thing is called the Student-t distribution, because in his paper he called it the t distribution. That's what happens if you work in industry and can't publish: you have to use a pseudonym. So that's this golden distribution, and it's a more heavy-tailed thing. You can see that it's not a Gaussian; it's not an exponential of something, it's a rational function: one over a quadratic term in x, raised to a power. That's why it's sometimes called a rational quadratic. One more, and then we're done. What if you have a normal distribution and you know what the variance is? That's a typical situation in science: you look it up on your instrument, it says measurement error 0.1, but you don't know the mean. You want to measure something, you get to measure it many times, and you don't know what the average is. Imagine it's the early 1800s, you're living in Brunswick, and your local lord has asked you to figure out the size of his country. That was his job; that's what he did.
That's what he got paid for. So you triangulate the entire country: you build all these towers, traveling around with an entourage of assistants, making precise measurements across the country to figure out how large the Kingdom of Hanover or whatever it is actually is. And you keep making measurements, and you know pretty much what the error of your measurements is, but you want to know the values of the underlying quantities. Translating this into math, what he did is to say: we get observations that are all drawn from a normal distribution centered on the same location, but I don't know what that location is. So that's a Gaussian distribution whose mean I don't know. What's the conjugate prior for this distribution? Some of you will know, but let's follow the recipe. We take the logarithm of this distribution, as on the previous slide. There's a bunch of terms that all look the same, here's my typo again, that come up here and there; they don't depend on x, so they're essentially constants now. On the previous slide they were important, because we were trying to infer sigma; now we want to infer x. And then there is this quadratic term in here. But the quadratic term has y's in it, so how do I come up with an algebraic form that keeps this as a function of x? I need to expand the brackets. So I really go in and say: this is a quadratic term, so it's y_i squared minus two y_i x plus x squared, and now I see: ah, okay, the conjugate prior of this thing is going to be an exponential of a quadratic function. That gives me a hint: it's probably going to be something Gaussian, because the Gaussian is the exponential of a quadratic. But what are its parameters, actually? If I want to be able to sum things up, what are the sufficient statistics of the data?
What do I need to compute from y to be able to hand it on? I need the average, so I need the sums that involve y here, and actually there are three sums. You can think of this as a polynomial in y: it's raised to the second power, the first power, and the zeroth power. So what I need to know is: how many data points I have, that's the sum over ones; what their average value is, or rather their sum, up to normalization, and then I need to divide by the number of observations, but thankfully I've just stored that; and I need the sum of squares. So there are three sufficient statistics: the number of observations, the average value of the observations, and the average square of the observations. So my prior that will be conjugate to this will have to have the same algebraic form. It'll have to be something with this parameter v squared plugging in for sigma squared, times a square of x, like the term over here with a minus in front, then something linear in x, and something constant. To update those sufficient statistics, it's just that the coefficients of this polynomial annoyingly have to have a complicated form, so that I can write it in the form we tend to think of when we do Gaussians. So this is a quadratic, and you can think of it as a Gaussian with mean m and variance v squared; it doesn't look like it yet, but if you write it like this, then we can do conjugate Bayesian inference. We can multiply the likelihood with the prior, which is like adding those two terms, because we're working with their logarithms, and that just means I need to add the quadratic terms, add the linear terms, and add the constant terms. What does that do to my parameters v and m? Well, this is sort of a finger exercise that you could do in a homework, but I'm not going to ask you to do it in a homework.
It's called completing the square in English, or "quadratische Ergänzung" in German. I need to sit and stare at it for a bit to figure out how to set the parameters, and you'll end up finding that the updated Gaussian representation looks like this: the posterior on x, after observing a bunch of y's, is computed as follows. You compute the sum over the observations, sorry, the sum over the observations, you count the number of observations you've seen, and then you do a bunch of algebra: you add up the inverse of the prior variance plus n times the inverse of the observation noise. If you find this confusing, it doesn't matter, because we're going to have three lectures on this kind of algebra in proper form in a bit. The main point to take away is: if you have Gaussian distributions, then they also allow conjugate prior inference, in the same way that we could do this for categorical, binary, or real-valued distributions. Psi is defined over here; it's this term that shows up here, because if I plugged it in there, the line would be too long for the slide. Okay, so there are these objects. Actually, I have an app for this as well, but I'll just let you play with it; it looks the same and you can do the same things with it. There are these distributions which allow closed-form Bayesian inference in these admittedly very boring situations. This isn't AI yet; it's just inference over a bunch of variables.
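The add-the-precisions rule just stated can be written out concretely; this is my own sketch of the standard update, with my own function names:

```python
import numpy as np

def posterior_mean_params(m, v2, y, sigma2):
    """Conjugate Gaussian update for an unknown mean x with prior
    N(m, v2) and observations y with known noise variance sigma2:
    precisions add, and the posterior mean is a precision-weighted
    average of the prior mean and the data."""
    n = len(y)
    prec = 1.0 / v2 + n / sigma2               # 1/v^2 + n/sigma^2
    mean = (m / v2 + np.sum(y) / sigma2) / prec
    return mean, 1.0 / prec                    # posterior mean and variance

y = np.array([1.0, 2.0, 3.0])
mean, var = posterior_mean_params(m=0.0, v2=10.0, y=y, sigma2=1.0)
# With a broad prior, the posterior mean sits close to the sample mean 2.0
```

Note that only the count n and the sum of the y's enter, exactly the sufficient statistics identified above.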
It'll quickly become AI. And they're neat, because they reduce the problem of full Bayesian inference, producing an entire posterior, not just a point estimate, to two things: first, writing one function, the sufficient statistics, which is a preprocessing of the data, after which we never have to touch the data again; and secondly, writing down a function that does the closed-form conjugate prior inference, and all it needs to do is add the sufficient statistics to some prior parameters. Then we normalize by evaluating the normalization constant, hopefully a known one; that's the tricky part. So basically these methods reduce to figuring out what the normalization constant is, and if you have a book by someone who tells you what the gamma function or the beta integral or whatever other interesting function is, or if it's available to you in SciPy, then you're done. And now, in the remaining few minutes before I let you off the hook, let's think about the structure we need for this to work. This is basically a translation of these individual recipes we just followed into a clean algebraic statement. It's connected to the notion of what's called an exponential family. Remember what we did: someone gives you the likelihood, you take the logarithm, and then you look at what form the logarithm has as a function of the data. If you work with the logarithm, that means the distribution itself is going to be an exponential. So exponential families are probability distributions over a variable, here called x, which are parameterized by a bunch of parameters that we call w on this slide. They have this algebraic form, which is carefully designed to be, well, actually not quite; in a moment, on the last slide, we'll see what's going to be the most
general thing we can write without breaking this conjugacy situation. So what do they contain? They contain a term in x out front, which depends only on x and is not parameterized by the parameters, a function h of x; in a moment we'll see that the binomial coefficient was a little bit like this. This is sometimes called the base measure. And then here's the business end: there's an exponential of a term that is linear in the parameters and a nonlinear function of x, called the sufficient statistics. Those parameters are then called the natural parameters of this exponential family. And then there's the important bit, the normalization constant, also called the partition function; with the logarithm in front, we sometimes call it the log partition function. For historical reasons, sometimes there isn't actually a w in there, but some other parameters, which we can think of as providing the natural parameters; those are then called the canonical parameters. So here's an example which you've already had: the binomial distribution. This is the way we encountered it; now I have called it q rather than f, just to confuse you even more. And if you stare at this expression, you can convince yourself that you can rewrite it in this way. Even though it doesn't look like there's an exponential function in there, there actually is, right there in q to the k. So this is a distribution over k, parameterized by some parameter q, and we can write it in this form. This doesn't quite look like the expression we were looking for, but it has something that looks like a base measure, and then an exponential which involves functions of k. So if we assume that we know how many observations we make, then we can write it in this form, and now it really looks much more like the exponential-family form: it has a base measure, something that doesn't depend on the parameter, there's no q in here, and then the exponential of a function of the data.
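You can convince yourself of this rewriting numerically; in this little check of mine I take the standard natural parameter w = log(q/(1−q)) and log partition function n·log(1+e^w), which are the textbook values rather than anything specific to this slide:

```python
import numpy as np
from scipy.special import comb

n, q, k = 10, 0.3, 4

# Standard form of the binomial pmf: C(n, k) q^k (1-q)^(n-k)
standard = comb(n, k) * q**k * (1 - q)**(n - k)

# Exponential-family form: base measure h(k) = C(n, k),
# sufficient statistic k, natural parameter w = log(q / (1 - q)),
# log partition function A(w) = n * log(1 + exp(w))
w = np.log(q / (1 - q))
A = n * np.log1p(np.exp(w))
exp_family = comb(n, k) * np.exp(w * k - A)
```

The two expressions agree to machine precision, because exp(w·k − A) is algebraically identical to q^k (1−q)^(n−k).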
So here the sufficient statistic is just the actual number of observations, and then there's a parameter, which in the usual notation happens to be the logarithm of q over one minus q, but we can also just call it w, and then a normalization constant which depends on q, and therefore also on w, but not on the data: there is no k in it. That's why it's an exponential family. Here's the beta distribution; that's a distribution over q, parameterized by alpha and beta. We encountered it like this, and we can rewrite it like this. There is a one in front, so the base measure is trivial, just one, which is convenient because q lives not on the simplex but on the interval from zero to one, so we don't need a base measure. Then there's the stuff that depends on the variable q, in particular log q and log of one minus q, and then parameters, which we could call w1 and w2, historically identified with alpha minus one and beta minus one. And then there's a normalization constant which doesn't depend on q, but only on alpha and beta, or on w. By the way, it's quite often the case that you can parameterize these families in different ways. For example, you could also write the same distribution this way; it's just a question of how you decide to pull the dependence on the variable away from the parameters. So another way to parameterize is to say that alpha and beta themselves are the natural parameters, but then there is a base measure out front. This touches on a long philosophical discussion about whether the uniform prior is the natural prior, or whether a spiky prior that has mass only at zero and one is the correct one, Jeffreys priors and so on, if you've ever heard about this. And then, finally and most importantly, the Gaussian distribution is also an exponential family.
So here is the form that we usually encounter in the textbooks, and you can rewrite it by dragging out a base measure, the 1 over square root of 2 pi, and then observing that there is a structure in here: terms that depend on the variable x, and a bunch of terms that don't depend on x, which form the partition function. Annoyingly, the natural parameters of the Gaussian distribution look a bit weird compared to the ones we are used to encountering. Let's take the second one first: it is one over the variance, which is sometimes called the precision (here it appears with a minus sign, the negative precision). The other natural parameter is the mean divided by the variance, sometimes called the precision-adjusted mean. The sufficient statistics are functions of x, so we can use them to estimate mu and sigma, to do inference. Here they are x and one half x squared, so they are polynomial terms in the data: the sufficient statistics of the Gaussian are the non-central moments of the data. And the normalization constant is this one.

We are going to come back to this over the next, I don't know, ten lectures, because the main business with Gaussians, which will turn out to be a wonderfully powerful tool, is encapsulated in this situation: the parameters we usually care about, the spread (the variance sigma squared) and the center (the mean mu), are hidden inside the natural parameters. If you want to know the things we usually care about, where the center of the distribution is and how broad it is, we need to do some computations on them, which in the scalar case are straightforward, just annoying algebra; in the multivariate case they become computationally taxing.
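In the scalar case that back-and-forth between the moment parameters (mu, sigma squared) and the natural parameters (mu / sigma squared, minus one over sigma squared) really is just algebra. Here is a small sketch (my own code and naming) that converts in both directions and checks the two ways of writing the density against each other:

```python
import math

def natural_from_moments(mu, sigma2):
    # w1 = precision-adjusted mean, w2 = negative precision
    return mu / sigma2, -1.0 / sigma2

def moments_from_natural(w1, w2):
    # invert the map above: sigma^2 = -1/w2, mu = w1 * sigma^2
    sigma2 = -1.0 / w2
    return w1 * sigma2, sigma2

def gauss_pdf_standard(x, mu, sigma2):
    # textbook form of the Gaussian density
    return math.exp(-(x - mu)**2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def gauss_pdf_expfam(x, w1, w2):
    # base measure 1/sqrt(2*pi), sufficient statistics phi(x) = (x, x^2/2),
    # log partition function log Z(w) = -w1^2 / (2*w2) - 0.5 * log(-w2)
    log_Z = -w1**2 / (2 * w2) - 0.5 * math.log(-w2)
    return math.exp(w1 * x + w2 * (x**2 / 2) - log_Z) / math.sqrt(2 * math.pi)

w1, w2 = natural_from_moments(1.5, 2.0)
assert abs(gauss_pdf_standard(0.7, 1.5, 2.0) - gauss_pdf_expfam(0.7, w1, w2)) < 1e-12
```

In the multivariate case the same conversions involve inverting the covariance matrix, which is where the computational cost mentioned above comes from.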
So it turns out there are even more of these distributions. If you had been here in 1954 and this were a stats lecture, this might in a way have been the last lecture, the end of term: we would have carefully gone through all of these different special settings for how to do inference, and they engender all the corresponding wonderful concepts from statistics, with p-values and unbiased estimators and all the things you have seen in your stats lecture, if you have taken one. For example, those of you who study cognitive science take a stats class at some point, and all of the tests and things you learned about there are connected to these exponential families.

There is the Bernoulli distribution, which is the trivial one, just a distribution for a coin toss. There is the Poisson distribution, closely related to the gamma distribution, a distribution over counts of things that arrive with a certain rate; for example, if you want a model of how many emails you get every day. There is the Laplace distribution, a heavy-tailed thing used for extreme events, like floods that happen every now and then. There is the chi-squared distribution, which was arguably invented by a German called Helmert; chi-squared is yet another one of these Greek names that don't mean anything. It is the distribution over the sum of squares of normally distributed random variables, so you use it to estimate variances in specific settings. There is the Dirichlet distribution, which we encountered for categorical variables.
There is the gamma distribution, which is named after an integral that Euler solved, so maybe it should really be called the Euler distribution. There is a generalization of the gamma distribution to multivariate quantities called the Wishart distribution, which can be used to estimate covariances; it is used in economics to study covariances between different derivatives. There is the Gaussian distribution, which is very important, and it has sufficient statistics given by the first two non-central moments of the data; actually the first three, including the zeroth one. And there are even crazier generalizations: the Boltzmann distribution is in some sense a very general distribution, which also makes it very difficult to work with.

Why are these so important? That is the final thing I will tell you before we head out. First of all, every single exponential family distribution has a conjugate prior. I will go through this slide relatively quickly now, but we will start the next lecture with it very carefully, because this is actually the business end of exponential families. If you have an exponential family distribution, here it is up here; that is the definition of such a distribution, with this algebraic form.
It is a distribution over some x with some parameters w. Then you can do an abstract version of the staring-at-the-expression exercise that we did for the beta and the Dirichlet and the Gaussian and all the other ones, and see that you can write down another distribution, over w, parameterized by alpha and nu, whose sufficient statistics are the natural parameters of the conjugate likelihood and the negative log partition function of the likelihood, and invent parameters for them. And why is this the right thing to do? Because clearly, if you multiply this expression with this expression, you get a new expression of the same form, with w times alpha plus phi of x, so you add the sufficient statistics of the data to alpha, and nu plus the number of observations you have made. So nu is the counting variable and alpha is the sufficient-statistics variable.

The only thing you need to be able to use this is the function we called g on the first slide: for conjugate priors I said we need two things, phi and g, where phi is the data preprocessor producing sufficient statistics and g is the inference engine. And the only thing we need inside g is to know the normalization constant of this thing; let's call it F. If you know what that is, you are done. And if you don't, then we have to think, and that is what we are going to do in the next lecture.

So all of these individual distributions come down to one smart person, a hundred or two hundred years ago, staring at one particular expression of this exponential family form, discovering that they knew how to construct the integral for the conjugate prior, maybe from a textbook, maybe from a letter from their best friend, and then writing a nice paper about it and attaching their name to the distribution. It is basically the invention of one individual integral that gives rise to each of these distributions. And why was that important in 1805?
Because they didn't have computers. And this was still important in 1920, in 1930, in 1960, even in 1990, because at the back end of your complicated Bayesian inference scheme you wanted something tractable, where you just take the data, do some preprocessing, do some simple computation, and you are done. Even up until the nineties there are a lot of papers in statistics and computational statistics (the field that turns into machine learning) that keep going: I have this new, cool way of very efficiently using data to construct some posterior distribution. And they are usually based on exponential family distributions, because these split the process of inference into two parts: processing the data into sufficient statistics, and then doing Bayesian inference, which amounts to evaluating F, if you know F. But now it is 2023 and we have computers. So what we need to figure out is how much further we can go if we push beyond completely tractable operations. We will think about that on Thursday.

Here is the summary for today. Conjugate-prior inference is the most efficient, neat structure you want to look for; it makes Bayesian inference tractable, in the sense that you preprocess the data, sum up a bunch of computed sufficient statistics, and evaluate a known integral. The corresponding algebraic structure we need for this is that of an exponential family. We just discovered that there are a lot of them already; there are whole books full of them, a large part of statistics revolves around them, and I have even left out many of their interesting aspects. You have seen some examples of how we can use them to do Bayesian inference and to do predictive estimation for new data sets. And the main challenge this all comes down to is building the function called g: to write the function g, we essentially need access to an integral. Next Thursday, when we are back here in this room, we will try to translate all of this into code.
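As a small preview of that, here is a minimal sketch of the phi/g split for the Bernoulli likelihood with a beta prior (plain Python rather than JAX, with function names of my own invention, not the lecture's actual code): phi preprocesses the data into sufficient statistics, the conjugate update just adds them to the prior's hyperparameters, and g only needs the known beta normalization integral F.

```python
import math

def phi(x):
    # data preprocessor: sufficient statistic of one Bernoulli observation
    return x

def posterior(alpha, beta, data):
    # conjugate update: add up sufficient statistics and observation counts
    s = sum(phi(x) for x in data)
    n = len(data)
    return alpha + s, beta + (n - s)

def log_F(alpha, beta):
    # the "known integral": log of the beta normalization constant B(alpha, beta)
    return math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)

def posterior_mean(alpha, beta):
    # mean of the Beta(alpha, beta) posterior over the unknown probability
    return alpha / (alpha + beta)

# uniform prior Beta(1, 1) plus six coin flips
a, b = posterior(1.0, 1.0, [1, 0, 1, 1, 0, 1])
```

All the inference here reduces to summing sufficient statistics and evaluating a function we already know in closed form, which is exactly the tractability the summary is pointing at.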
The plan is a skeleton for analytic Bayesian inference on simple variables in Python, or actually in sort of JAX-flavored Python, and hopefully that will help you think about the algebraic structures we care about. I do this because I want you to get a sense of the computational aspects; this entire lecture, and in fact the rest of this course, will be about the computational aspects of inference, because that is what makes Bayesian inference hard. It is also the last time we talk about these very, very simple types of data sets, where you just make observations of individual real numbers or rates, or even integers, or even binary random variables.

Okay, I hope that today's pace was a bit better, even though I forgot to take a break, so you are probably going to let me know. Please give feedback: currently about a third of you actually raise your phones and give feedback, so I keep having to mention it; maybe it will be a bit more than a third at some point. Don't forget that Monday is a public holiday, so we won't be here. There is also no exercise sheet next week, but we will be back here on Thursday next week, and then there will be an exercise sheet on the Monday afterwards. But you still have to submit this week's exercises by Monday: the deadline for this week's exercise is still next Monday; there is just no new exercise sheet next week. Thank you.