Let's recall a bit what we did last time, with three slides. The motivation for this entire part of the course was: what do we do with the parameters in our probabilistic models? If we decide to do probabilistic modeling, that is, we assign probability measures to variables and then try to compute posteriors, we invariably still find parameters in our model. There are still some choices we have to make, and we don't really want to think about them, but we have to, because otherwise the model doesn't make sense. So how do we set them? Ideally, we would like to do full Bayesian inference: we marginalize out the variables that are part of the inference, and that yields the evidence, the normalization constant of Bayes' theorem, which we can treat as a likelihood for the model, or for the parameters. But that usually involves intractable integrals. If it didn't, we would probably just treat these parameters as variables. So instead we need some other procedure for optimizing these parameters. One simple option is to construct a Laplace approximation in the parameter space, which lets us optimize the parameters, though not on the actual model evidence but on an approximation to it. And we saw as an alternative the algorithm called EM, expectation maximization, which is not something we can always do, but which in certain cases, where we are able to construct certain kinds of probability distributions, can be used to optimize evidences quite efficiently. It works as follows. We want to maximize the evidence: the marginal over the latent variables, under the model parameterized by theta, for the data X. If we integrate out Z, we are left with a likelihood for theta, which we can then deal with. We assume we can't do this integral easily. So instead we consider computing the expected value of the log of the joint distribution under the posterior for the latent variables Z. And we saw that this can be written more generally as computing what we call an evidence lower bound, an ELBO: a quantity that is strictly less than the log evidence, where the difference between the two is the Kullback-Leibler divergence between q and the true posterior p(Z | X). So what we do is iterate: we set q to this posterior under a particular choice of parameters, which is the optimal choice in the sense that it maximizes the evidence lower bound in q; then we maximize the bound in theta, and we alternate between maximizing in q and maximizing in theta. Those two iterative steps amount to a kind of coordinate ascent between parameters and posterior: compute the posterior, update the parameters, compute the posterior, update the parameters. We did this as an example last Thursday for a Gaussian mixture model. I showed you the textbook example of the Old Faithful geyser data from Yellowstone National Park, which has a cluster structure: there are two groups of data points in this two-dimensional data set, and we can represent that structure with a model that as a graph looks like this, and as an equation looks like this.
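In symbols, the decomposition just described looks like this (a reconstruction in standard notation, with Z the latent variables and theta the parameters):

```latex
\log p(X \mid \theta)
  \;=\; \underbrace{\int q(Z)\,\log \frac{p(X, Z \mid \theta)}{q(Z)}\,\mathrm{d}Z}_{\text{ELBO}\;\; \mathcal{L}(q,\, \theta)}
  \;+\; \underbrace{\mathrm{KL}\!\left(q(Z) \,\middle\|\, p(Z \mid X, \theta)\right)}_{\geq\, 0},
```

and EM alternates the two steps

```latex
\text{E-step:}\;\; q(Z) \leftarrow p(Z \mid X, \theta^{\text{old}}),
\qquad
\text{M-step:}\;\; \theta \leftarrow \arg\max_{\theta}\; \mathbb{E}_{q(Z)}\!\left[\log p(X, Z \mid \theta)\right].
```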
So we assume that for each datum a decision is taken at random to put it into one out of K clusters; the identity of the cluster is assigned to a latent variable called z_nk, the cluster identity of the n-th datum. Once we know which cluster we're drawing from, we look up the parameters of that cluster (this is sometimes called the base distribution; in this case it's a Gaussian with a mean and a covariance) and then draw from that cluster. But we don't know which data point belongs to which cluster, nor do we know what the parameters of those clusters are. So the thetas now are mu, pi, and Sigma, and the latent quantities are the identities of the clusters: which data point is in which cluster. We did the derivation for this, and last time I didn't show you the Python code for it, so here it is. This is actual code that works, this time not in JAX but just using numpy and one function from scipy (I should probably have added an import line for it; it's scipy's multivariate normal). Here is what the algorithm looks like. We say how many clusters we're going to use, two; we figure out the shape of the data set; we initialize the parameters of the distribution, pi, mu, and Sigma, which together make up theta. Then there is R, which holds the parameters describing the posterior on Z: because Z is a set of binary random variables, they are assigned a discrete probability distribution, which is just an array of numbers between zero and one such that each row sums to one. R is initialized as a zero matrix; we don't need it yet, so this is basically just memory allocation, not a correct estimate. Then, in the E-step, we compute from the math the quantities that tell us what to set R to. This is the expectation step: we're computing the expected value of the latent quantities Z. Why is it called the expectation step? Because for a discrete distribution, R literally is the expected value of Z. Once we have computed this, we can find, for each part of the parameter space, the maximally likely values of the parameters pi, mu, and Sigma under this posterior. That's the update we derived last week, so I won't show the math again. And then we just let this run. Notice that it's just a while loop: we let it run until nothing changes anymore. We could measure 'nothing changes anymore' in different ways: whether mu changes, or Sigma, or R, or all three together. We could also compute the ELBO itself; that's just a number, and we could watch it rise. That's often a good idea, actually, though I didn't want to add the extra line here. Why is it a good idea? Because it's a sanity check: you can see whether it goes up, and if it doesn't go up, you've done something wrong; either you've misimplemented the ELBO or you're not actually optimizing. It has to rise in every single step, until it doesn't rise anymore, and then we stop. So this is a way to fit the parameters theta of a model, but it requires us to be able to compute the posterior distribution over the latent quantities. So what do we do if we can't compute that in closed form?
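For reference, here is a minimal reconstruction of that kind of code. It is not the exact listing from the slide, just a sketch with the same structure (initialize theta, E-step, M-step, loop until nothing changes); the convergence test on pi and the small jitter added to the covariances are my own choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, tol=1e-6, max_iter=500, seed=0):
    """Minimal EM for a Gaussian mixture model; X has shape (N, D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # initialize theta = (pi, mu, Sigma)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]   # K random data points as means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    R = np.zeros((N, K))                           # posterior on Z ("responsibilities")
    for _ in range(max_iter):
        # E-step: R[n, k] = p(z_n = k | x_n, theta), i.e. the expected value of z_nk
        for k in range(K):
            R[:, k] = pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
        R /= R.sum(axis=1, keepdims=True)          # each row sums to one
        # M-step: maximize the expected complete-data log-likelihood in theta
        Nk = R.sum(axis=0)                         # expected number of points per cluster
        pi_new = Nk / N
        mu = (R.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (R[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
        # stop when nothing changes anymore (measured here via the mixture weights)
        converged = np.max(np.abs(pi_new - pi)) < tol
        pi = pi_new
        if converged:
            break
    return pi, mu, Sigma, R
```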
And this Gaussian mixture model maybe also seemed very constructed: a very specific choice of things, so that everything just works out on this tiny little data set. If we want a powerful framework that works on general data sets, one we can apply not just to waiting times at a geyser but to whatever you want to do, then maybe we have to think about where we can cut a few corners to build something more general. That is what the entire lecture today will be about. The idea is to look back at EM and notice what we actually did there. We noticed that we can write the log marginal distribution of the data under the model, log p(x) (notice that this is essentially a constant: it's just how likely the data is under the model, with all the variables integrated out), as a sum of two terms: the ELBO, the evidence lower bound, which is this integral over some distribution q of the log of the joint over q; plus (the minus sign is part of the definition) the KL divergence between this q and the true posterior p(z | x). What we constructed in EM was to repeatedly choose the optimal q, the one that makes this second term zero, and then use the resulting function of the model parameters to make the first term increase. But if we can't construct the posterior, maybe we can use the same framework to construct an approximate distribution q that somehow gets close to the posterior. Because there is a thing here that measures, for a particular q, how close we are to p; and if we manage to find a q that makes it zero, we have found the correct posterior. That is a tantalizing thought: there is a quantity that measures how close you are to what you want to have. It's like a loss. Maybe we can optimize q, in q-space, so that it gets very close to p(z | x); that would be a way to approximately do full Bayesian inference, and maybe we can manage it even if the full posterior p(z | x) is not actually tractable. That's the thought; we don't yet know whether we can actually do it, but let's think about it. This is the idea behind an entire, very powerful framework of approximate probabilistic machine learning called variational inference, which has been studied for decades and, at least for a while, was very popular. In the last two or three years it has fallen a little behind again; maybe it will come back in a few years, and I'll make a case on Thursday for why I think it might still be useful to learn about these kinds of tools. Before we get to how we actually do it, let's first try to understand why this is a cool thing to do. The cool thing is that these two quantities, the ELBO and the KL divergence, are what you could call functionals: functions of a function. They take in not values of x and z but q itself, a distribution, an infinite-dimensional object, and they deal with it through integration; integration is a way of dealing with infinite-dimensional objects, effectively an infinite sum. So they are statements about a very general object, just some probability distribution, that are in some sense sensitive to any choice we make about q. This is an approximation in function space that we're attempting, or at least a way to potentially construct one. Actually, I have a slide for this.
So, in general, we have written down a way to describe how far a particular q is from the correct thing we're trying to compute, the posterior distribution of z given x. If we had a very, very powerful language, we could just use this loss, the KL divergence, to optimize q until we reach a point where it is actually zero, and then we would know for sure that we have found the correct posterior. But of course we probably won't be able to do that, because if we could, then we could presumably just construct the true posterior in the first place. So what we expect is that we won't actually be able to construct the exactly correct q, but that we somehow find a way to get close to it. And now (yes, I think this is why I had a separate slide: I wanted to ask you what to do) maybe your first knee-jerk reaction is: deep neural network! I'll just build a really big network, parameterize q with some weights, compute this ELBO, and make it go up. Then I'll have a good approximation. And actually, maybe that is something you can do, and maybe we'll talk about it briefly, for one slide, on Thursday. But it's not the interesting approach, and it can actually be bad for several reasons. The first thing you might notice is that it means you have to parameterize the distribution, and it feels like we're throwing away the beautiful mathematical aspect of this KL divergence: that it is a functional, that it operates on function space. Wouldn't it be nice to stay in function space as long as possible, and not jump straight into a weight space where we just optimize a bunch of floating point numbers? The other thing is that just because you parameterize something doesn't mean you can actually compute the quantity you care about. To compute these two terms, the ELBO L or the KL divergence, we need to compute integrals over q. Notice that I've written down both, because you can describe this process in terms of either quantity: you could say 'I'm trying to make this one zero', or you could say 'I'm trying to maximize this one', since the two sum to a constant, p(x), or log p(x) actually. If we maximize one, we equivalently minimize the other. But both are integrals. So if we decide to use a neural network to minimize this loss, or to maximize this bound, then we need to somehow deal with these integrals. You can maybe guess what people do if they really decide to use a neural network: you somehow give up on the integrals. Either you find some magical language in which you can always solve the integrals in closed form (that is probably going to be a relatively restricted space of functions; this is actually connected to ideas that exist in non-parametric Bayesian modeling), or you just approximate these integrals in some way, draw some samples, do some Monte Carlo approximation, and it will just have to work somehow. That's the idea behind various approximate forms of variational inference. But what we'll do today, and this is maybe a little bit anachronistic, is to see whether we can keep optimizing q without yet breaking the fact that these objects really are functionals.
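For concreteness, here are the two equivalent objectives, written as functionals of q (a reconstruction in the notation used above):

```latex
\mathcal{L}[q] \;=\; \int q(z)\,\log\frac{p(x, z)}{q(z)}\,\mathrm{d}z,
\qquad
\mathrm{KL}\!\left[q \,\middle\|\, p(\cdot \mid x)\right] \;=\; \int q(z)\,\log\frac{q(z)}{p(z \mid x)}\,\mathrm{d}z,
```

```latex
\log p(x) \;=\; \mathcal{L}[q] \;+\; \mathrm{KL}\!\left[q \,\middle\|\, p(\cdot \mid x)\right]
\quad\Longrightarrow\quad
\arg\max_{q} \mathcal{L}[q] \;=\; \arg\min_{q} \mathrm{KL}\!\left[q \,\middle\|\, p(\cdot \mid x)\right].
```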
It turns out that this is connected to a super powerful idea that has been around for a really long time, because it was invented by people who didn't have the luxury of just ignoring that all these challenges exist, spinning up a virtual machine with a big GPU, and letting it approximate whatever needs approximating. They really had to think for a long time, and here is what they came up with. OK: we need some way of restricting the class of functions q that we consider. Because if we don't restrict the class of functions, then we know what the optimal q is: it's just the posterior. And we have already decided that we can't compute the posterior; if we could, we would just do it. So we somehow need to find a space of functions that is a little bit less powerful than the one that p(z | x) lies in, but we want to do this in a way that doesn't yet explicitly say what that function space is, that doesn't explicitly say 'here is a bunch of numbers that describe the space of functions'. And the idea they came up with is absolutely ingenious. It is to say: well, z is a bunch of variables, right? Typically not just one. If it's just one variable, OK, then maybe that's just a one-dimensional integral and we can actually just evaluate this thing. But if it's multi-dimensional, then one of the ways in which p(z | x) can be intractable is the fact that all the z's depend on each other; remember that independence and dependence are the most challenging aspects of probability theory. So what if we just said that we want q to be such that the individual elements of z are independent of each other? That is called a factorized q: we assume that q(z) is a product of distributions over some subsets of the z's, which I'm going to index with i. The most extreme case would be that every single z_i is just a scalar and we factorize everything, but the i's could also range over groups in variable space, parts of the model that you want to separate from each other. Writing a product like this is, by definition, the statement that those distributions are independent of each other, and we just demand this. That's the entire idea. We don't say which q we choose; there is going to be some distribution, we don't know yet what it is, we'll find out later. The one thing we impose is that it factorizes. Now let's see what happens. If the distributions factorize in this way, let's focus on the ELBO and see what happens to it. Remember that the evidence lower bound is the integral over q of the log of the joint over q. Now that we've decided q is a product, we plug all those products in: the integral over q becomes an integral over the product of the q_i's, of the log of the joint, minus (the minus is in the log, right) the logarithm of q, which, because q is now a product, becomes a sum of logarithms. Plug that in, and now we think about this integral for a little bit. We're going to have two parts here, a front part and a back part. Let's first focus on the front part. Say we pick out one particular variable z_j that we want to deal with in one step of the algorithm. Then that z_j and its distribution we can pull out of the product, to the front. And then we look at this integral here.
We see that it's an integral over dz_1, dz_2, and so on, over all the z_i, including z_j. So let's move the dz_j out and keep the other dz_i inside. Now we have a term in here which (and this is not a simplification at all, from this line to this one) we just want to think about for a moment; we'll keep it around, it's going to be our operative object. And what is happening at the back? There we have an integral, over all the z_i, of the sum of the individual logarithms log q_i(z_i). And this helps, because in every individual term of that sum there is a big integral over a lot of z_i's with i not equal to j that simply don't appear inside that term. So those are all just integrals over probability distributions; they are all just one, lots and lots of factors of one. They don't depend on z_j, so we can move them out to the back and call them a constant. The only term left in that sum is the one for z_j, and this is actually the entropy of q_j(z_j); that's just what this term is called. So now we can stare at this and say: what has happened is that there is a term inside that integrates out all the individual z_i that aren't j, and what we're left with is effectively a thing that only depends on z_j. When I say that, I'm glossing over the fact that, of course, we have to be able to compute this integral. So let's suspend that concern for a moment and say: suppose we could do this integral somehow. Then this would become a function of only z_j, because all the other variables are integrated out. Is this some kind of probabilistic equivalent of currying? There is a function here, p(x, z), that depends on a lot of z's, and we construct an emulator for a currying functionality, if that makes sense to you: we integrate out its dependence on all the other variables, so that we're left with something that is locally frozen in the z_i with i not equal to j, which we can think of as a function of z_j, assuming some internal machinery takes care of all the other z_i. And that means this thing here is the equivalent, for z_j alone, of the derivation of the ELBO for the whole distribution. This is an ELBO, right? It's an expected value, under q, of the log of p(x and only z_j) divided by q(z_j). It's like an evidence lower bound for a version of the problem in which only z_j exists. And so we can find its extremum (I have a slide for this, actually), its extremum, by minimizing KL divergences. So let me repeat what I just said. Assume you have a joint distribution with lots and lots of z's, and also big data. To find a good approximation, we decide to factorize the approximating distribution; the only thing we impose, to simplify the problem, is that it has to separate into individual pieces that we can optimize individually. Then the math works out such that, in each step, we are left with an evidence lower bound that can be separated into multiple evidence lower bounds, one for each term in the factorization, which we can think about individually. So what is the right thing to do with these evidence lower bounds? Well, we want to maximize those individual lower bounds.
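In symbols, the step we just went through looks like this (a reconstruction of the standard mean-field manipulation, along the lines of chapter 10 of Bishop's book):

```latex
\mathcal{L}[q]
 \;=\; \int q_j(z_j)\,
   \underbrace{\mathbb{E}_{q_{i \neq j}}\!\left[\log p(x, z)\right]}_{\text{all } z_{i \neq j} \text{ integrated out}}
   \,\mathrm{d}z_j
 \;-\; \int q_j(z_j)\,\log q_j(z_j)\,\mathrm{d}z_j
 \;+\; \text{const},
```

which, as we're about to see, is maximized over q_j by setting

```latex
\log q_j^{\ast}(z_j) \;=\; \mathbb{E}_{q_{i \neq j}}\!\left[\log p(x, z)\right] \;+\; \text{const}.
```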
And we don't actually do this by computing a gradient and then setting the gradient to zero; we already know what the answer is. We just have to make the corresponding KL divergence minimal, that is, set it to zero, and that means we have to set the logarithm of q_j to the expected value of the log of the joint. That is just what it means to minimize the KL divergence. And why is this cool? I have a big star next to it; that means it's cool, and it's going to come up several times over the course of the next few minutes. Because if you look at this expression, with this expected value in there: it is a function of the data, OK, but the data is given; you don't have to worry about the data, it's just a bunch of numbers. And it is a function of only the one particular z_j; the other ones are integrated out. But it is a function. So it is the solution to our problem: the correct function that maximizes the evidence lower bound for z_j, or minimizes the KL divergence. If we manage to do the other integrals, we can read off this function, and sometimes, if we're lucky, we can just see what the distribution is, right there on the piece of paper. And I say piece of paper because this is very much the sort of thing you have to do on a piece of paper; it is not something the computer will just do for you automatically. It requires a little bit of staring at the algebra and figuring out what's going on, and we'll do an example today, so you'll see what I mean by the staring at the algebra. There is also a lot of theory that says this process will converge: this functional, the negative of L, can be shown to be convex in function space. It might not even be clear to you what it means to be convex in function space, and we don't need to talk about it; there is a function-valued version of curvature, beautiful functional analysis, which shows that this optimization problem is convex, and therefore the process will converge. This is also known in physics as mean field theory. Has anyone heard the words 'mean field theory' before? Not so many physicists in the room, OK; then it doesn't make too much sense to dwell on it, and I might mention it again on Thursday. The idea is that this formalism was invented by physicists who needed to construct statistical descriptions of the universe and all of its contents, and who realized that they can't actually do this, because even in Newton's laws of motion, with more than two bodies actually interacting with each other, it's intractable; you have to do it numerically, and they didn't have computers, but they wanted models for molecules with 10^23 parts. So they came up with this beautiful idea: you could imagine that every single particle in this collection of 10^23, in a free gas, interacts with all the others individually, but because each one keeps interacting with so many different particles, it is a little bit as if each of them were completely free, and there were just one joint field that affects all of them together in their behavior, somehow jointly created by all of them.
It's like an ant colony, where each ant doesn't actually care about which other ant it interacts with; all the individual ants it meets might as well be the same ant, but they all interact, and the jointness of the entire colony creates this effect on each of them. A particle in a free gas doesn't actually care about which other molecule it interacts with: all the other molecules jointly create one field, and that is the average field, the expected field under the interaction with all the other particles. Therefore it is the mean field, because it is the expected, average field. OK. So this connection creates an actual algorithm for us; from the computer science perspective, you can think about algorithms. It creates a framework for constructing probability distributions that allow us to approximate a non-analytic posterior distribution over latent quantities, for pretty much any distribution you might care about, by minimizing an abstract object: a functional that takes in our approximation and measures how good the approximation is. This is the KL divergence (of course there are other divergences as well, but this is the one we like), which we want to minimize; and that is the same as maximizing the ELBO, because the sum of the two is equal to a constant. And this is a powerful framework, because we get to choose which set of q's we want to consider; we have the freedom to decide how rough we want the approximation to be. At one extreme end, we just write down a simple product of parameterized probability distributions, with a bunch of parameters, such that we can do the integral in the KL divergence in closed form, or we approximate the integral, and then we just run gradient descent. But at the extreme other end, sometimes we don't even have to write down which space of approximations we want to be in; we only impose the fact that we want probability distributions that factorize into individual terms. Why factorization? Because factorization translates, as an iterative process on a computer, into for loops that go through the blocks of variables and update them one after the other (you'll see a skeleton of this in a moment). It is an extremely flexible, powerful framework. And I want to say already at this point: for a long time this was a very promising direction within machine learning, studied as maybe the next big thing that would change everything, that would yield a very, very powerful framework. Not least because it is a convex optimization problem; these variational approximation algorithms actually get much closer to the kind of behavior you have come to know and love from computers than deep learning does. They just work. You have to implement them, but once they are implemented, they are just pieces of beautiful code that work: they can be bug-fixed, they can be tested, they can be deployed; they don't break; they just run all the time without any drama. But constructing them is a very painful process, and we'll go through it slowly now, so that you get to experience the pain for once. Then you might get a sense of why people are not so keen on doing this kind of construction anymore, but also why it might actually come back at some point. And that is why I do this at the very end of this course: because this sort of thing, this kind of way of thinking about approximate distributions, is currently a bit out of fashion, maybe a bit anachronistic in 2023.
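To make the point about factorization turning into for loops concrete, here is a generic coordinate-ascent skeleton. This is a sketch, not a library API: the factor objects and the ELBO callable are hypothetical stand-ins for whatever a concrete model defines:

```python
def cavi(x, factors, elbo_fn, max_sweeps=100, tol=1e-8):
    """Generic coordinate-ascent variational inference loop (a sketch).

    factors:  one object per block of the factorization; each is assumed
              to have an .update(x, factors) method that sets log q_i to
              the expected log joint under all the *other* factors.
    elbo_fn:  callable returning the current ELBO; it has to rise after
              every sweep, which makes it a good bug detector.
    """
    elbo_old = -float("inf")
    for _ in range(max_sweeps):
        for factor in factors:
            factor.update(x, factors)    # one mean-field update per block
        elbo = elbo_fn(x, factors)
        if elbo < elbo_old:              # sanity check: the bound never decreases
            raise RuntimeError("ELBO decreased: bug in an update or in elbo_fn")
        if elbo - elbo_old < tol:        # converged: nothing changes anymore
            break
        elbo_old = elbo
    return factors
```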
I'll tell you a bit of the backstory of how to construct these things, but also maybe why it is a very powerful framework. Maybe at this point some of you feel like I haven't actually told you what the algorithm is yet; I've just said something that we are going to try to do. That was always my impression when, earlier in my career, I heard introductions to variational inference: it always felt like someone was not telling me what the trick is. They just say: oh, there is beautiful math, and it will work out, and we'll find the distribution that best approximates. And I was always like: but where is it? How do I do it? Can you give me the rules? Because this doesn't actually say; there is no line here that says 'this is the approximation'. Well, it is sort of here, but this is too arcane, right? What is this supposed to be? And that is maybe the main challenge of these variational bounds: they are very difficult to translate into a fully formal process, and as a result, the software stack for constructing them is not as rich yet, at least, as the deep learning stack. So here is what the actual practical construction of a variational bound looks like. At the top I have basically copied things in again from the previous slides (the first two items are the same as before), and now I'm telling you what we actually do; we're going to go through this recipe. The process is: you write down your model. That is step number one, the step that was always there in probability theory and probabilistic machine learning: always write down the probability of everything. When someone gives you the data, you sit down and write down a probability distribution p(x, z), and often this also means drawing a little graph and making yourself familiar with the distribution, thinking about it, staring at it a little bit. Then you try to figure out what is actually hard about this process of constructing the posterior distribution: what makes the integral, or computing the posterior, hard for you? And then you impose the factorization such that the hardness goes away. Notice that this is not really something I can formalize mathematically; you just need to have a mental picture of what you're trying to do. Then, once you have imposed the factorization, the line above takes hold and you can write down this object. Writing it down means you write this big bold E as an integral over a bunch of other distributions, and then you have a line on your piece of paper that tells you, as a function of z_j, what your approximating distribution is going to be. Then you have to squint your eyes (we'll do that together) and see what that distribution actually is that you have just constructed. It is going to be difficult to stare through the math and see the functional relationship, but that is what tells you what your approximation is. And then you go to Wikipedia, honestly, and look up all the properties of that distribution. Why? Because line 2 tells you what the distribution on z_j is, but it only tells you in terms of integrals over all the other distributions. So, if you have done this process of writing out the integrals, staring at them, and understanding what the distributions are that you're looking at,
you can then go to Wikipedia and hope that you're able to close the loop: to find, for each of these individual update equations, closed-form expressions for what these integrals actually are, for the corresponding approximations. This used to be a big part of the process of machine learning. When I did my PhD, I also got to derive a few of these variational bounds, and what we used to do is stand in front of the whiteboard, a bunch of people, for a long time, drawing graphs and joint distributions and saying: ah, this is something I want to model, this is something I don't want to model, but I need that; oh, but there is a collider here, and this will create interdependence; so what are we going to do? Ah, we're going to impose factorization at that point. And then you write down the factorization that you want, and then you go away to a lonely corner, with an A3 sheet of paper in landscape and a sharp pencil, and just write down the entire thing: the log of the joint, and the integrals for the individual parts, and you hope that you find some pattern. That is the part that the human does in this process, and for which there is some automation available in some software packages; if you want to look something up, check out Microsoft's Infer.NET platform. It is a little bit outdated by now, but it tries to do exactly this, though it is a bit restrictive in what you're actually allowed to do. And then you need to implement it, and the implementation, of course, is also something that goes wrong, because you need to write it by hand, and quite often you have a bug and you just spend another two or three weeks getting it to actually work. But then, when it works, it suddenly works, and you never have to worry about learning rates and batch sizes and all the other stuff. It is all just really beautiful, and you can make plots and understand what's going on. OK, so after the break we're going to do this for our Gaussian mixture model. And why are we going to go back to the Gaussian mixture model?
Well, we already saw how to fit this model using the EM algorithm, fitting two clusters to this data. But there are a few things that you might be unhappy with in this particular algorithm. The first one, maybe, is that we have to make point estimates for pi, mu, and Sigma, and maybe you want to be uncertain about them; maybe you want those to be probability distributions. Maybe you want to say: there are only three or four data points here, how could I infer five clusters from that? I wouldn't know. So we somehow need to be uncertain about some of these variables, so that the algorithm also works if you have very sparse data. But another thing, maybe even more prominent, that you may really want to do: maybe you are really frustrated by this K and you don't want to set it. You want the user to have the freedom to just tell the algorithm: find out how many clusters there are for me, I don't want to tell you how many there are. And to do that, we need to be uncertain about, in particular, this object up here: we need to be able to say to our algorithm, through a proper prior, that there might be a lot of zeros in this pi. I'll tell you how to do that after the break; let's continue at four past. ... So, there was a really cool question about this framework that I'm sort of tempted to answer now, but maybe it's not the right time; I can do it at the end of today's lecture instead. Now I'd like to go to the actual example: as I said, we're going to somehow pimp up this model to make it actually much more powerful. And (I could hear you moan as you got up for the break) I realize this is already a pretty complicated model. Now, this is maybe already a bit of a history lesson, not from the far-back history, but from a few years ago up until quite recently: this was actually quite normal. People would stand in front of these whiteboards and blackboards and draw things like this, and then they would say: ha, but I'm unhappy with the fact that I have to set K. What could we do about K? We need to make these into variables; actually, if you just wanted to deal with K, you might think it would be enough to make only one of them a variable, but no, we need to make all three of them variables, otherwise it doesn't work. And how do we make them variables? That means we need to assign probability distributions to these three dots. The dots mean that they are point estimates, not probability distributions; we want to make them circles, empty circles, with distributions over them. And that requires us to say what the prior distribution over those variables is. Turning something into a random variable means writing down the joint, a generative model: a prior for every single mu_k in the K copies, a prior for every Sigma_k, and a prior for pi. Of course, we could write them as one joint thing, but maybe we don't need to, because we can keep this independent of that under the prior. So what kind of distributions do we use? Ah, and this is where our toolbox actually helps, because we think about these quantities and realize: that is a probability vector, this is a real vector, this is a positive definite matrix. Aha: we have our standard distributions for those. They are called exponential families; they are the standard conjugate priors for observations of these types. So what is going to be
the prior for this, for a probability vector? The Dirichlet. What is going to be our prior for the parameters of a Gaussian, mean and covariance, if you observe data drawn from this Gaussian? That is the one I maybe didn't spend enough time on. Yes: it's the Gaussian-inverse-Wishart, and you're like, oh my god. But this actually is a mechanical process. Yes, it takes time to do this, but we know how to do it: we just look at the data types. Mu is a real vector, Sigma is a positive definite matrix, so the prior for Sigma is going to be a Wishart-type distribution, and the prior for the real vector is going to be a Gaussian, because the Gaussian is the conjugate prior for its own mean. But if you want to be jointly uncertain about mu and Sigma, there is this Gaussian-inverse-Wishart construction, which is down here, and which requires us to draw a little arrow from here to here, because we are actually going to jointly infer mean and covariance. This is important because it is impossible to estimate the variance of a Gaussian if you don't know the mean. Why? Because the variance is literally the expected square of the numbers minus the square of the mean, so if you don't know the mean, you can't estimate the variance. So we are going to have an arrow pointing from Sigma to mu, and then we look up what all these distributions are going to be. Our prior for pi is going to be a Dirichlet; Dirichlets have parameters called alpha, so now we have a new thing called a hyperparameter alpha, which moves the power of the model one layer up, one more level of abstraction. And our joint prior on mu and Sigma is going to be this Gaussian-inverse-Wishart: a product of a Gaussian distribution over the unknown mean of each cluster, which has a mean of its own and an unknown scale that scales the actual covariance of the Gaussian, and a Wishart prior over the inverse of this covariance, because that just happens to be the right algebraic form in the exponential-family sense. Why? Because the natural parameters of the Gaussian, remember, are the precision and the precision-adjusted mean, so the inverse of Sigma, the precision, is the natural thing to put a prior on. The Wishart has two parameters: one is a matrix that has to be symmetric positive definite, and the other is a scalar called the degrees of freedom, nu, which has to be larger than D, the number of dimensions of the problem (or larger than or equal to it). We just initialize those somehow, and that means we have now defined a new generative model for our data. It says: to draw the data, do the following. First, draw a probability distribution over the clusters from a Dirichlet distribution parameterized by alpha. Dirichlets, remember from the exponential-family lectures, have the ability to create priors over sparse distributions: if we make the elements of this vector alpha less than one, then the Dirichlet puts high mass on the corners of the simplex. So it allows us to say that maybe some of those elements of pi are actually zero, maybe almost all of them are, and then we get sparse solutions that use only a small number of components, even though more were initialized. Then draw a Sigma_k for each cluster from the inverse Wishart distribution, and then draw the mean of each Gaussian from this thing down here, this probability distribution. Now we have those three parameters, and we draw from the mixture model they define.
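As a sanity check of this generative story, here is a sketch of the forward pass in SciPy; this is my own reconstruction, with the hyperparameters alpha, m, beta, W, nu named as in the lecture and arbitrary values filled in:

```python
import numpy as np
from scipy.stats import dirichlet, invwishart, multivariate_normal

def sample_gmm_prior(N=272, K=5, D=2, alpha=0.5, beta=1.0, nu=4.0, seed=0):
    """Forward-sample the Bayesian mixture: pi -> Sigma_k -> mu_k -> z_n -> x_n."""
    rng = np.random.default_rng(seed)
    m, W = np.zeros(D), np.eye(D)                    # arbitrary hyperparameter choices
    # pi ~ Dirichlet(alpha, ..., alpha): sparse for alpha < 1
    pi = dirichlet(np.full(K, alpha)).rvs(random_state=rng)[0]
    # Sigma_k ~ Inverse-Wishart(W, nu), one (D, D) covariance per cluster
    Sigma = invwishart(df=nu, scale=W).rvs(size=K, random_state=rng)
    # mu_k ~ N(m, Sigma_k / beta): the mean's uncertainty scales with Sigma_k
    mu = np.stack([multivariate_normal(m, S / beta).rvs(random_state=rng)
                   for S in Sigma])
    # z_n ~ Discrete(pi), then x_n ~ N(mu_{z_n}, Sigma_{z_n})
    z = rng.choice(K, size=N, p=pi)
    x = np.stack([multivariate_normal(mu[k], Sigma[k]).rvs(random_state=rng)
                  for k in z])
    return x, z, pi, mu, Sigma
```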
So one sort of pedestrian thing to do with this model, not inference actually, but generating data from it, is to write the forward pass through this model in SciPy: you instantiate, from alpha and m and beta and W and nu, these three probability distributions, then you ask SciPy to draw from them, and then you draw from this generative model the identities of the clusters, and once you have drawn the assignment of each datum to a cluster, you draw the individual x's. Maybe you can convince yourself that, if you didn't have the data set x yet, you could write a simple piece of Python code using SciPy that just draws x's; that is just a pass down through this graph. Now, the challenge is that someone has given you an x and you want to go back up, and that is what we're going to do with our variational inference. So yes, this is very tedious; it is a lot of math, and lots of big scary symbols in those expressions, but now let's see whether we can follow the abstract recipe and see where we get. By the way, we have talked about graphs a bit, and their atomic structure: there is a really big v-structure in here, a collider, all of these arrows pointing down towards our data, and the data is the thing we have seen. Remember that when you have arrows that collide, where the arrowheads meet, and you condition on the variable where they meet, everything above it becomes dependent on each other. This is explaining away; this is the story of the alarm and the burglar, and so on. So now people would stare at this and say: oh, OK, this is going to be hard; we somehow need to break this dependence. But we have already done EM, and in EM we realized that this z plays a very important role. If I only knew what z is, if I knew which datum belongs to which cluster, then things might get easy. Why? Because, if you think about it, even without doing the math, I can see that if I knew which datum belongs to which cluster, I could of course just take the individual data and use them to do conjugate prior inference on each Gaussian; that is just an instance of conjugate prior inference for Gaussian distributions, which you have already done in your seven-scientists example. And of course, if I know which datum belongs to which cluster, then from the perspective of pi I don't have to care about x anymore, because this here is a chain graph: if I knew this variable, then pi becomes independent of x given z, and it is a simple instance of Dirichlet inference, like the example with who is wearing glasses. If I just observe which of my data points are in which cluster, then I can estimate how large the clusters are. So that probably means we have to factorize our joint distribution over z, pi, mu, and Sigma into a probability distribution over z, separated from pi, mu, and Sigma. That is the assumption we're going to plug in, the only thing we impose on our approximation. We say: there will be some q, I don't know yet what it is, but it will have to have the property that the z's are independent, under q, of the pi's and the mu's and the Sigma's. And maybe here is a moment where it pays off to reflect a bit on what this actually means, philosophically speaking. It seems like a weird thing to do, to say: well, if I make the z's independent of pi, mu, and Sigma, isn't that a completely different model?
No, right? But the cool thing is that we're going to build, through variational inference, the best possible approximation of this factorizing form to the actual distribution, under which they are not independent. So yes, of course this model is very different from the real one, but it is going to be the best one of that type compared to the real one, and we're just going to make sure, through variational inference, that we get very close to it. Question? Ah, no: so here, remember that when you write p, or q, of something, for probability distributions (in lecture three I had this slide on notation), probability distributions are things that know their inputs. So this q is not the same as this q, which otherwise wouldn't make sense, just like up here this p is not the same as this p. It is just some distribution, which we will call q(z). OK, so now starts the process of doing the tedious math. The recipe (I'll go back again, sorry for hopping around a bit) says: to find your optimal q, you have to write down the logarithm of the joint and then compute expected values under the other q's. Here we only have two q's now: the q of z, and the q of pi, mu, and Sigma. So we will need to do two steps. First, to get q(z), we need to compute the expected value of the log joint under the other q, the one over pi, mu, and Sigma. And then we swap roles: to find the q of mu, Sigma, and pi, we have to compute the expected value of the same thing, the log joint, under q(z). If we have written them both down, hopefully we will find some structure that we can then use, and this is where the magic happens. If you just wait one moment, here comes the magic. OK, so let's first look at this joint. We can go back one slide and see how we actually generate x and z and pi and mu and Sigma. Here we go: this is the joint, this is the line. This thing, p(x, z, pi, mu, Sigma), is this: we said we first draw pi, then we draw Sigma, then we draw mu given Sigma, and then we draw x and z given pi, mu, and Sigma. So if we want to construct our approximation for q(z), we need to take the logarithm of this, which is a sum of this plus this plus this plus this, with the logarithm in front, and compute the expected value under q(pi, mu, Sigma). Now, there is going to be a term here where we would have to compute the expected value of log p(pi) under q(pi), but we don't care, because it is just going to be a constant: there is no z in that term. And there is going to be a term here where we would take the expectation of the logarithm of p(mu, Sigma) under q(mu, Sigma), but we don't care, because there is no z in that term either. The only term that actually matters for our function of z is the logarithm of this thing, and then we need to take the expected value under q(pi, mu, Sigma). So now we go up here; this is the business end of this model. The probability for the individual data x and z is this mixture model: a product over all of the data n, and a product over all of the components k, of pi_k raised to the z_nk power
(the k subscript here is maybe not needed), times a Gaussian over x_n given the parameters of this mixture component. That is the thing we need to take the logarithm of, and then compute the expected value under q(mu, pi, Sigma). When we take the logarithm, those two products become a sum: a sum over the logarithm of this, plus the logarithm of a Gaussian, each with a z_nk in front, and that is what I have done here. So we need to take the expected value of the log of this, which separates into a probability for z given pi, plus the expected value, under q(mu, Sigma), of the logarithm of p(x | z, mu, Sigma). When you know which datum belongs to which cluster, you can then draw x, and for that you need to know what the components of the cluster are, mu and Sigma. So that is the big expression in here, and now comes the bit where the squinting of the eyes comes in. This is the thing that is very difficult to do for a computer, and that you can do when you have written things down on your A3 sheet of paper, where you are sufficiently far away from the screen to see it; for me it is good that I have already done it. We can stare at this expression and think about what it actually is. For that, we look at z: where does z show up in this equation? Ah, it is only here in the front. There is no z in here, no z, no z (that is just a constant), no z; only z here. Ah, and then there is a sum, and this is supposed to be the logarithm. Aha: so q(z) is going to be a product over n, a product over k, of something nasty raised to the z_nk power. So let's give a name to this nastiness: it is the logarithm of something, rho_nk. Why nk? Because it is indexed by n and k; it sort of sits inside this double for loop. And that means the probability distribution over z is going to be rho_nk raised to the z_nk, where the z_nk are these binary variables that sum to one if you sum over k. So what is this? It is a discrete distribution (sorry, not a multinomial, just a discrete distribution). Aha: so we have abstracted away all this complication here into just a constant that says: if you give me r_nk, which is given by rho_nk normalized, then I have a probability distribution, and I know what the expected values of z are under discrete probability distributions: they are just given by r_nk. So the first half of our cycle is already built; we know that we're going to get these probability distributions. (I added this slide right before the lecture; I should have called this r_k, whatever, and maybe I should have just called this a discrete distribution rather than a multinomial, because otherwise I would have had to put a normalization constant in front. This is just my version of a Wikipedia page: you go there and see that there is the discrete distribution, that it can be written like this, with binary values for z that sum to one, and that its properties include, for example, that the expected value of z_k is just rho_k.) Nice. One other thing to note, maybe, as we pass by here: there are more products here. What this thing says is that every datum is drawn independently. And then, well, this here isn't actually an independence; it just looks like independence, because there is a hidden dependence in the fact that these r_nk have to sum to one, which is sort of visible from this equation.
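Collecting what we just read off for the first factor (a reconstruction, in the notation above):

```latex
\log \rho_{nk} \;=\; \mathbb{E}_{q(\pi)}\!\left[\log \pi_k\right]
 \;+\; \mathbb{E}_{q(\mu, \Sigma)}\!\left[\log \mathcal{N}\!\left(x_n \mid \mu_k, \Sigma_k\right)\right],
\qquad
r_{nk} \;=\; \frac{\rho_{nk}}{\sum_{k'} \rho_{nk'}},
```

```latex
q^{\ast}(Z) \;=\; \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{\,z_{nk}},
\qquad
\mathbb{E}_{q^{\ast}}\!\left[z_{nk}\right] \;=\; r_{nk}.
```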
But there actually is independence here over the data n, and we did not put that in; at least, we didn't put it into our variational bound. We never claimed that this q(z) has to be a factorized distribution over the individual z_n. It is just that our model already had this: there was already this sum here over n, that is the plate in our graph, and the variational bound just kind of inherits this independence. So now we have the first half of our job done, and the second part is to close the loop: now we need to compute the approximation q over pi, mu, and Sigma. For that, before I actually show you the slide, I'll go back again and look at our recipe. The recipe says: now we need to compute our q* over mu, pi, and Sigma, and for that we need to take the expected value of the log joint under the other q. The other q is now our discrete distribution over the z's. So we go back to our model and say: in this big joint, p(x, z, pi, mu, Sigma), which factorizes like this, where this term is given by this thing up here, we need to take the logarithm, then take the expected value under q(z), write it as a function of pi, mu, and Sigma, and see whether we can find structure in it. Now we are not going to be able to drop these terms, because they actually depend on pi, mu, and Sigma: even though there is no z in those terms, we still need to leave them in, because they tell us what q(pi, mu, Sigma) is going to be. And now we need to take pretty much the whole thing: we take the logarithm of this and need to compute the expected value under q(z). The only place where we're going to save some stress is that, when we take the expected value over q(z), only this term will actually matter, because there is no z in the others; but we leave them in so that we see the functional relationship on pi, mu, and Sigma. I'll actually only do one half of this; the other bit is just too tedious, so I'll just tell you how it works. So here we go. Don't look at this up here yet; I'm just defining a bunch of constants, I'll tell you about them in a moment. To find our approximate distribution over pi, mu, and Sigma, we take the expected value, under the other q, of the logarithm of the joint; that is the line from the recipe. Now we look at the joint and see that it factorizes into: log p(pi), which is a Dirichlet given alpha; plus log p(mu, Sigma), given the parameters of the Gaussian-inverse-Wishart; plus log p(z | pi); plus log p(x | z, mu, Sigma). And there is no pi in that last one, actually, because once you know z, you don't need pi anymore; it is a chain graph, right? So we notice that this thing does not depend on z, so we can take it outside of the integral; this thing does not depend on z, so we can take it outside of the integral; and these things depend on z, so we have to leave the integral in, but there is at least a sum here, so now we are going to have two different integrals, this one and then this one. So this is the first one, and here is the second one, down here. And here I have already plugged in the actual structure of this log; I can go back to the corresponding slide, but you might also remember that the mixture model has this component: pi_k raised to the z_nk, and then a Gaussian over x_n given mu_k and Sigma_k, raised to the z_nk.
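Written out, the quantity we are staring at has this shape (a reconstruction; the expectations of the z_nk have already been replaced by the r_nk):

```latex
\mathbb{E}_{q(Z)}\!\left[\log p(X, Z, \pi, \mu, \Sigma)\right]
 \;=\; \underbrace{\log p(\pi) \;+\; \sum_{n,k} r_{nk} \log \pi_k}_{\text{depends only on } \pi}
 \;+\; \underbrace{\log p(\mu, \Sigma) \;+\; \sum_{n,k} r_{nk} \log \mathcal{N}\!\left(x_n \mid \mu_k, \Sigma_k\right)}_{\text{depends only on } (\mu,\, \Sigma)}.
```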
So this has now already come down. OK, I'll take this last equation (it is actually two lines) and move it up to the next slide, because there is not enough space, because I don't have an A3 sheet and a sharp pencil, so I need to put it on multiple slides. Ah, it's gone. OK, well, I'll just point with my hand. And now we see that, if you look at where pi, mu, and Sigma show up in this expression: at the top (OK, I'll move my mouse) here is a term that depends on pi; here is a term that depends on mu and Sigma; here is a term that depends on pi, where we have integrated out z; and here is a term that depends on mu and Sigma. But there is no mixing term: there is no point where the pi's and the mu's come together. And this is supposed to be the logarithm of our joint distribution over pi, mu, and Sigma, so when we take the exponential of it, we are going to get one factor in pi, which involves the exponential of this times the exponential of this, and one factor in mu and Sigma, which involves the exponential of this times the exponential of this. Which is interesting, because it means this thing factorizes into one part in pi and one in mu and Sigma. And we didn't require this! We didn't tell our approximation method that we wanted to have this factorization; it just so happened. This is called induced factorization: it is something that comes in from our model structure. It is also something I already sort of waved my hands about: if you know which datum belongs to which cluster, then you can estimate the components of the clusters without knowing pi, and you can estimate pi without knowing the clusters. So let's think about pi first, the distribution on pi. For that (here is the rule again), to find the optimal distribution on pi, we need to take the expected value, under q(z), of the log of the joint. We look at this thing up here and realize that these two terms don't matter, so we can get rid of them; we only need to take the expected value of the log of this plus this. There is no expected value needed here, because there is no z in it; we only have an expected value here. So it is the log of p(pi), which is a Dirichlet distribution, plus the expected value of the log of p(z | pi). And p(z | pi) is a discrete distribution: the probabilities for the z's, given pi, are just the pi's of the corresponding components. And the logarithm of a Dirichlet? Remember the Dirichlet distribution; actually, I have a slide on this. The Dirichlet, as you can find on Wikipedia, is defined with this probability density function, with an annoying normalization constant called the beta function, and people tend to use this alpha-hat thing, the sum of all the alphas; it is just convenient notation. And this is its logarithm. It has all sorts of beautiful properties: we know its expected value, its variance, its covariance, its mode, the expected value of the logarithm of pi, its entropy, and lots and lots of other things. So we might note: OK, maybe we can come back to this; it might be useful. So let's think about this. The logarithm of a Dirichlet is therefore, from the previous slide, the sum over (alpha minus one) times the logarithm of pi, plus a constant. Oops, sorry, wrong line. Here we go.
If alpha were actually a vector with individual components, it would sit inside the sum; under the prior we might assume we just have one constant alpha, so we can drag it outside of the sum. And the logarithm of the discrete distribution is just this: a sum over the individual clusters k, and over all the n, of r_nk times log pi_k. Well, actually it's the sum over z_nk log pi_k, but I've already taken the expected value, because the expected value of the random variable z under the discrete distribution with probabilities r is just r — that was two slides ago. So here we've essentially closed the loop: we used the fact that we have the other approximating distribution, the discrete thing on z, to construct an exact expression for the expected value in the variational bound.

And now we can look at this and ask: what kind of function is this, actually? This is where you take your step back and say: ah, there's a logarithm of pi here and there, and a big sum over k. So we can move the sum outside and collapse things; it's just one term in log pi_k, with alpha minus one plus the sum over n of r_nk. Let's give a name to that sum over n of r_nk; let's call it N_k — the expected number of points in cluster k. And now we just have (alpha plus N_k minus one) times the logarithm of pi_k. Ah — and that's the logarithm of a Dirichlet distribution! So our approximating distribution on pi is going to be a Dirichlet, because Dirichlets are given by a product over k of pi_k raised to some power. There it is. So we know what our approximating thing for pi is going to be.

And now we might gather some hope: if you go back two slides, where we had this nasty expression for log of rho_nk, we might be able to actually compute this expected value, under q of pi, of log of pi_k. Why? Because we can go to our Wikipedia slide and ask: what are the properties of a Dirichlet — does it have something about the expected value of the log? Ah, here it is: the expected value of the logarithm of pi is this. This is the digamma function, the derivative of the log of the gamma function. Nasty thing — who knows what this is? Ah, it's available in Python, so, whatever. We are saved by living in 2023: someone has already written the code for us, so we don't have to think about what this nasty function is. It's just there.
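As a minimal sketch of the q(pi) update just derived — assuming responsibilities r as an (N, K) array and a symmetric Dirichlet prior with scalar concentration alpha0; the function and variable names are mine, not the lecture's code:

```python
import numpy as np
from scipy.special import digamma

def update_q_pi(r, alpha0):
    """r: (N, K) responsibilities, each row sums to one;
    alpha0: scalar concentration of the symmetric Dirichlet prior."""
    Nk = r.sum(axis=0)        # N_k: expected number of points per cluster
    alpha = alpha0 + Nk       # q(pi) = Dirichlet(alpha)
    # E_q[log pi_k] = digamma(alpha_k) - digamma(alpha_hat),
    # the quantity needed to close the loop in the next step
    E_log_pi = digamma(alpha) - digamma(alpha.sum())
    return alpha, E_log_pi
```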
So the last thing that's left, the only thing we actually still need to deal with, is this other expected value — and here I'm going to wave my hands around a little bit. We still need the expected value, under the z_nk, of the bits that depend on mu and sigma: the prior on mu and sigma, plus the expected value of this logarithm of a Gaussian distribution, weighted by the z_nk. If you actually do this — and it gets really tedious now; you get a lot of long expressions on your A3 sheet of paper — you realize that what you have in front of you is, just like the logarithm of a Dirichlet we just had, the logarithm of a Gauss inverse Wishart. You may remember that this Gauss inverse Wishart expression is a really tedious, long thing that you had to deal with once on an exercise sheet, so I'm not going to show it to you again. And we find that, just like for the Dirichlet, where we had this simple update of adding N_k, there's going to be a similar update for the Gauss inverse Wishart — it just involves annoyingly more quantities to compute, but I'll tell you what they are.

We need to compute the expected number of observations in each cluster, N_k — those are the counts for our Wishart prior. We need to compute expected sample means: x_bar_k is the sum over the x_n weighted by r_nk — it was on a slide a moment ago; yeah, OK, it's up here, this thing. And then we also need this weighted form of squared distance to the sample mean, which is used to estimate the covariances. And you may remember from your homework that this is actually the beautiful bit about the Gauss inverse Wishart mechanism: there's an extra term here that also captures how far the observed data mean is from the prior mean, and it increases the estimate for the variance — that is, lowers the precision — if that distance is large. Why? Because if the data mean sits far from where you expected it, you would otherwise underestimate the variance, since part of the spread gets soaked up by the mean estimate — and therefore you need the extra term. OK, so that's an update line.

The main thing is that you can find on Wikipedia how to compute the expected values we need. What do we actually need? To close the loop, we need expressions for the last two terms in that sum: under a Gauss inverse Wishart, the expected value of the logarithm of the precision, and the expected value of this squared distance. You can find expressions for these on Wikipedia, or in Chris Bishop's book, and in various other texts — they just happen to be available. And that means we can build an update.

For that I have an actual slide that says: close the loop by setting q of z to the discrete distribution whose parameters are given by this expression, which can be computed from the other end of the approximation. We know this is a Dirichlet, so the expected value of log of pi is something involving digamma functions, which we can call as scipy.special.digamma. The expected value of the log determinant of a precision matrix under a Wishart is also given by something nasty with a digamma function, plus a bunch of logarithms — oh, that should probably be a two pi here — plus the log determinant of the parameter W_k. And the expected value of a squared distance under a Gauss inverse Wishart is again something we can actually compute. So now we have update equations that we can implement.

We can write a while loop that iterates between doing two things, one after the other. First it constructs an approximate discrete distribution on z, using the exponential of the line from the previous slide: you take the individual terms, with the logarithms in front, add them up, then take the exponential, and then you normalize — there's a proportionality sign there, so you need to normalize. Actually, there's a fun little trick in Python: if you have such an expression where you've computed the logarithm and now want to take the exponential and then normalize, that's the same as the softmax, so you just call the softmax on the log — because the softmax is the exponential divided by the sum of the exponentials.
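A minimal sketch of that E-step, using the softmax trick just mentioned. The inputs stand for the three expectations discussed above; their names (and the function name) are my own placeholders, not the lecture's code:

```python
import numpy as np
from scipy.special import softmax

def update_q_z(E_log_pi, E_log_det_Lambda, E_quad, D):
    """E_log_pi: (K,) E[log pi_k] under q(pi);
    E_log_det_Lambda: (K,) E[log |Lambda_k|] under q(mu, Sigma);
    E_quad: (N, K) E[(x_n - mu_k)^T Lambda_k (x_n - mu_k)];
    D: data dimension. Returns (N, K) responsibilities r."""
    log_rho = (E_log_pi + 0.5 * E_log_det_Lambda
               - 0.5 * D * np.log(2 * np.pi) - 0.5 * E_quad)
    # exponentiate and normalize each row in one numerically stable call:
    # softmax(x) = exp(x) / sum(exp(x))
    return softmax(log_rho, axis=1)
```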
And then we construct an estimate — this isn't even on the slide; it's on the previous slide — for this Gauss inverse Wishart, and then we keep iterating, and we keep checking how the estimates change. One way to do that, again, is to check the parameters we are optimizing and just see how they evolve and whether they converge at some point. Another is to actually write down the elbow — which we can; it involves all of these complicated terms — and just watch it rise until it reaches some point and stops.

And why would you do that? Well, one of the nice things about this is that you can use it to estimate the number of components. That was actually a big selling point for these kinds of models for a while: around 2007, 2008, '09, '10 there was a whole explosion of models with infinite degrees of freedom, using variational and sampling-based probabilistic formalisms. The very first one — at least as far as I know, in the machine learning community — was on Gaussian mixture models; it was called the infinite Gaussian mixture model, and it came from Carl Rasmussen and Zoubin Ghahramani, who worked here in Tübingen and in Cambridge. It basically showed what I'm going to show you now.

What you can do with these kinds of models is initialize on this data set with — in this case I've taken six clusters, so there are six Gaussians here: one, two, three, four, five... where's the sixth one? I don't know. What I'm doing in this plot is that these individual clusters have an alpha value, so they are transparent, and the transparency is chosen to be proportional to their probability under the variational posterior for the corresponding component of pi: if they are completely transparent, they have very low probability, and if they are dark red, they have high probability. And I've initially drawn them from some kind of empirical-Bayes prior: I computed the mean and the variance of the data set, drew the initializations of the cluster means from that Gaussian distribution, and drew initial weights from a Dirichlet. Why do I need to do that? Because if I set them all to the same thing initially — one point in the middle — they would all get the same update under the variational update, all move to the same point, and get stuck there, and that would not work. So we have to break the symmetry, as it's called, by adding some randomness initially — but only once; I'll put a small sketch of this initialization after the summary below. After that, there's no sampling involved; it's just a bunch of optimization steps, during which the algorithm decides that it would like to use only two clusters. It has faded out all the other ones — it says two is enough — and here are your cluster means. And that's just one of the many nice things about this.

Why would we use this framework? Let me first summarize a bit. Variational inference is maybe a tool from a more civilized age — these are your father's variational bounds. It's a general framework for constructing approximating probability distributions when you can't compute the posterior directly. They work by minimizing the KL divergence, the Kullback-Leibler divergence, between the approximation and the true posterior, which is the same as maximizing the evidence lower bound for a fixed model. And they make this kind of maximization tractable not by parameterizing the distribution but by imposing factorizations — product structure — on the probability distribution, and then analytically, iteratively updating the components of the variational bound to locally minimize the KL divergence.
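Here is the symmetry-breaking initialization described above, as a minimal sketch: means drawn from the data's own mean and covariance, weights from a Dirichlet, exactly once, as the lecture describes — though the function and variable names here are my own:

```python
import numpy as np

def break_symmetry(X, K, alpha0=1.0, seed=0):
    """Randomized empirical-Bayes initialization: identical initial clusters
    would all receive identical variational updates and collapse onto one
    point, so we perturb them once, at the start, and never sample again."""
    rng = np.random.default_rng(seed)
    mus = rng.multivariate_normal(X.mean(axis=0), np.cov(X.T), size=K)
    pis = rng.dirichlet(alpha0 * np.ones(K))
    return mus, pis
```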
And that amounts to setting them to this value — where "setting them to this value" means: the expected value, under the other parts of the approximation, of the logarithm of the joint. To make this actually work in practice, you first need to write down what this log joint is — you need to build a model; you don't get around that, of course. Then you impose the factorization: you decide in what kind of schedule you want to go through the variables. Then you have to inspect what this function actually looks like in terms of its algebraic form — you leave the integrals in, but you look at how things functionally depend on each other — and hopefully detect that the pieces amount to certain known exponential families that you can find on Wikipedia. And then you hope that you can also find the necessary expectation terms in analytic form.

A question I got during the break was: how do I know that this will work? And the simple answer is: you don't. You have to sit down and actually do the math. That's also why this maybe fell out of favor: it's much more tempting to just say, ah, it's a deep neural network, let's buy a bunch of GPUs and have it somehow work, rather than pay someone — and there aren't many people in the world able to do this; by now more than a few handfuls, actually, but still not enough to satisfy the needs of industry — to spend time in front of a piece of paper rather than producing code.

But there are a few ways to automate this. One thing to observe is that you get to choose, in a sense, how good the approximation is going to be, by imposing more or less factorization: by deciding to use more or fewer factors to break dependencies between variables, you can make your life easier, moving more integrals out into the bits that give an algebraic form — not a parameterization — to your approximation. The other thing you can do is decide to use exponential families everywhere, because then things tend to work out a bit better. And there is actually a framework called variational message passing — due to a combination of people like Chris Bishop, David MacKay, John Winn, Tom Minka and a few others — that boils down to effectively saying: (a) we're going to factorize everything, so every single variable in our model will have its own distribution; that's the full mean-field idea; and (b) we're going to use only exponential family distributions, of which we have a small, finite collection, and we'll use the standard ones: if you have a probability vector, you assign a Dirichlet; if you have a real-valued thing, you assign a Gaussian; if you have a positive definite thing, you assign a Wishart; and so on and so on. And then you're almost done: if you restrict yourself to models that involve only certain types of interactions between the z's — the x's are fine, because you can mask them away with feature functions — typically linear, maybe log-linear interactions, then you can actually be guaranteed to be able to automatically construct your variational bound. This is the idea behind the Infer.NET toolbox that Microsoft built more than a decade ago, and which, maybe in hindsight, didn't take off as much as it could have.
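For reference, the coordinate update that both the hand derivation shown today and variational message passing implement is the standard mean-field formula (as in Bishop's chapter 10 and the variational message passing literature):

$$
\log q_i^*(\theta_i) \;=\; \mathbb{E}_{\prod_{j\neq i} q_j(\theta_j)}\!\left[\log p(x,\theta)\right] + \text{const},
$$

that is, each factor is set to the exponentiated expected log joint under all the other factors, and these coordinate-wise updates are iterated until the elbow stops rising.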
So you saw that this is a tedious and maybe painful process to do, and I realize that it's maybe not so much fun to watch me do it either — that's maybe why it didn't really take off, and also why I do it at the end of this lecture course, right before the exam, if you like. We first had to cover all the things that you, in 2023, expect a course on machine learning to cover: deep learning and regression and unsupervised learning and so on. What I showed you today is maybe a blast from the past, of how these models used to be constructed. But that doesn't mean you shouldn't think about these kinds of approximations anymore; it's just that they maybe aren't the first tool in your toolbox.

So variational inference, I would argue here, is a powerful tool — a very powerful tool, actually, because it works the way computers are supposed to work. Once you've implemented the variational bound, there are no surprises anymore: it just converges; it just works, pretty much all the damn time, if you set it up correctly. It requires a lot of expert knowledge to construct: you need to know about exponential families, about graphical models, about all the properties of exponential families, about scheduling and stabilization of your iterative optimization routine; you need to monitor it, and you need to initialize it correctly, and so on and so on. But then it just works.

People used to joke — it used to be this thing for computer science or machine learning PhD students — that sooner or later, once you'd set up your model, your advisor would tell you to build a variational bound to make it actually work. You had fiddled around with some approximate thing, and it didn't work, and you'd lost too much time; or you had done some sampling — there's this other general piece of the toolkit, called Markov chain Monte Carlo, that I didn't cover in this course, which is pretty much just sampling: you call a Markov chain Monte Carlo sampler and it runs forever and takes forever and ever, but then eventually produces some interesting samples. And then your advisor would tell you: now is the time to sit down and actually do this derivation for the variational bound. And it was very painful to do.

And I haven't even told you that, of course, at the end you also have to implement this thing. I actually have a piece of Python code for this Gaussian mixture model, but I'm not going to show it to you, because it's about this long and it's a lot of JAX, so you wouldn't be able to parse it anyway. And of course you make bugs when you do this. Once during my PhD I had a three-week period in which I couldn't make any progress because I had a big bug in my variational bound. It ended with me inviting another PhD student to stay with me after hours; I brought along two bottles of wine, and we spent the entire evening in a lecture hall completely covering blackboards — me slowly filling them, reading off what I'd done, and him just sitting there, slowly checking every single line. After two hours we found the bug. I don't even remember what it was — some minus sign, a minus one missing or whatever — and then it just suddenly worked. Machine learning used to be like this: it used to be hunting for bugs in math. Now it's, you know, trying to reduce your learning rate.

Well, OK. I'll leave it at that. On Thursday I will actually try to do a little bit of a history lecture, as a closing thing at the end of
term. So there's going to be no new content. I hope that you nevertheless enjoyed today's highbrow math lecture; give some feedback, and I'll see you on Thursday.