Okay, hello. Welcome back. Please take your seats. As always, I'll start with questions from last time. Yes? I'm going to ask you to repeat your question now. Yes? I would like to know what happens if the prior and the likelihood are two completely different distributions, for example. I have a uniform prior, and I have a Gaussian likelihood. Does it happen that after some updates the posterior tends to a Gaussian, starting from a uniform? And is it possible for the posterior to change shape relative to the prior? Yes, it is. So the special case where the shape, the family of distributions, does not change is called a conjugate prior. So when the posterior is like the prior, then you've got a conjugate prior. And there are whole long tables of which likelihood distributions take which conjugate priors. So for a Gaussian likelihood, a Gaussian prior is conjugate. For a Bernoulli likelihood, a beta prior is conjugate, and so on. These tables are widespread; you can look them up on Wikipedia. But as a side remark, I have to say that these conjugate priors are often not unique. You get a range of choices of conjugate priors you can choose from. And this can have consequences: your update equations may look much more straightforward, much better interpretable, depending on which conjugate prior you choose. Yes? My question is, is there a way to interpret the value of the KL divergence? I mean, is there a quantitative way to say how similar two distributions are? If I calculate the KL divergence, I see that it is large, or small. Can I infer from the number that I get that these distributions are a lot like one another or very different? Basically, I would say no. But it's a question I've never asked myself, so I've never gone and started interpreting the actual numbers I get. Relatively, you can compare them, yes. Here we have a suggestion. Yes, of course. I mean, maybe.
But the question is, would one mean anything? What does a Kullback-Leibler divergence of one mean? Why compare it to one and not to two, or to pi? Yes. So I would say that when it is much bigger than one, they are very different. And if it is much smaller than one... Well, yes. You would have to base such a scale on a distribution of distributions. So you would have to look at what values of the KL divergence are common, and where you are in the kind of distribution you want to look at. So the answer is basically: perhaps somebody can; I cannot. Further questions? Good. Then the topic for today is to go into variational Bayes in detail. There will be basically four parts to this. The first will be a little rehash of the basic concepts on the blackboard. Then we shall look at the free energy bound on the log model evidence again. And then we'll go into the thick of it by looking at the variational optimization of the free energy under the mean field approximation. I don't know whether we'll finish this today; otherwise we shall continue on Thursday, when you will get a big dose of me, a four-hour marathon from nine to one. So at least then we should get through all of this. We will do this on the blackboard, so I hope those who were longing for the blackboard will be happy with this. So, first part: a rehash of the basic concepts. What we're dealing with is always a generative model, p of y and theta given m. This is the model, and it is also the joint distribution of y and theta. According to the product rule of probability theory, you can take this apart into the conditional probability of y given theta (and everything is always conditioned on a particular model m) times the marginal probability of theta. Now these two things have names. What are the names of these things? What's the name of the first one? Likelihood. The first one is the likelihood.
And the second one is the prior. The distinction between states and parameters becomes important when we look at time series, where we infer on hidden states of the environment. As I already briefly mentioned, states change with time. So from one time to the next, the states are going to be different, but parameters are constant. For the purpose of what we're doing today, this distinction does not matter; we just take them together in one big set theta. And y, those are our observations: data, measurements. So these are the basic quantities involved. Yes. Where should I write? Not below the line here, so I shall redraw this line. That's what the line was for. What I wrote here is "observations, data, measurements", just three words belonging to this arrow. These are the basic quantities we're dealing with. And then we have Bayes' theorem. This is the foundation of everything. The conditional distribution of theta given the observations y and the model m is simply the likelihood times the prior divided by the model evidence. The posterior, this is new; we didn't have this on the left side. And this is the model evidence, or marginal likelihood, as it is also sometimes called. Not to be confused with the likelihood proper up here. And now let's look at how we got this. What is this? This is again the product rule that gives you this. If you multiply this to the other side, then you simply see the product rule of probability theory. And if we apply the sum rule, we know how to get this. So the model evidence is the integral over theta of the term up here. Up here is for a specific theta, and if we integrate over all thetas in this expression, we get the denominator here. So we take the likelihood times the prior, and put a prime on theta because it's our integration variable. This is the model evidence. And by the product rule here, you can see that this is a simple marginalization over theta.
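To make this marginalization concrete, here is a minimal numerical sketch (all numbers, the coin-flip counts and the grid resolution, are made up for illustration): a Bernoulli likelihood with a beta prior, where the model evidence is computed by integrating likelihood times prior over theta on a grid, and the resulting posterior matches the conjugate beta result mentioned in the questions earlier.

```python
import numpy as np

# Hypothetical example: n coin flips with k heads, Beta(a, b) prior on the
# head probability theta. All numbers are made up for illustration.
n, k = 10, 7
a, b = 2.0, 2.0

theta = np.linspace(1e-6, 1 - 1e-6, 100_000)   # integration grid over theta
dt = theta[1] - theta[0]

likelihood = theta**k * (1 - theta)**(n - k)   # p(y | theta, m)
prior = theta**(a - 1) * (1 - theta)**(b - 1)
prior /= prior.sum() * dt                      # normalize Beta(a, b) numerically

# Model evidence: p(y | m) = integral of p(y | theta) p(theta) d theta
evidence = (likelihood * prior).sum() * dt

# Bayes' theorem: posterior = likelihood * prior / evidence
posterior = likelihood * prior / evidence

# Conjugacy check: the posterior is again a beta, Beta(a + k, b + n - k)
exact = theta**(a + k - 1) * (1 - theta)**(b + n - k - 1)
exact /= exact.sum() * dt

assert np.allclose(posterior, exact)
assert np.isclose(posterior.sum() * dt, 1.0)
```

The grid integral is crude but makes the point: the denominator of Bayes' theorem is nothing more than the numerator integrated over all thetas.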
So these are the basic concepts that we're going to deal with. And now it's already time for part two: the free energy bound. The first ingredient we're going to use is an arbitrary density. This is the q of theta that you saw in the slides. We're going to write it like this: q of theta, then a semicolon, and then a lambda. And this means q of theta parameterized by lambda. Just imagine the simple case again: q of theta is, for example, taken to be Gaussian. And then lambda are the two sufficient statistics, mu and sigma, mean and variance, that parameterize our Gaussian. So this is an arbitrary density, parameterized. Is that an assumption we're going to make? In the most general case, no. But we're going to want to use this, because we want to make assumptions about our q so that we can control it. As I said, in very general terms, we are going to wiggle around our q of theta in order to reduce the KL divergence between the q and the true posterior, because that gets us closer to the true posterior and brings the variational free energy closer to the log model evidence. So, for instance, q Gaussian, lambda equals mu and sigma. This is one of the ingredients that we're going to use, plus a model, as before. These are the ingredients we're going to use. No, m is basically the only model we're going to deal with here. So we're going to choose one m; that's going to be our model m. We choose a model m and a q of theta parameterized by lambda. P of y and theta given m, this is the model. I understand what you mean. Yes. So another way to write this, and in some sense I'm going to ask you to live with this degeneracy, but another way to write this would be p sub m of y and theta.
And then your p sub m of y given theta would be the particular likelihood, and p sub m of theta the particular prior. So what I mean is: I have a particular likelihood and a particular prior, and those are probability distributions, and my way of saying these refer to a particular model m is simply to condition on my having the model m. Yes. Theta is the set of states and parameters we want to infer, so we have no control over them. The lambda is what parameterizes our approximate posterior. We will adjust it, and the goal is, let me write this down: the goal is to adjust lambda so that q of theta parameterized by lambda is approximately the posterior here. So we make observations y, we have our model m, theta are the states and parameters we're inferring, and lambda is what we're shifting around in order for this to be as like this as possible. Yes. We will go through examples; many more blackboards will be filled with examples. Mu and sigma are lambda here. Lambda is just a generic placeholder for the set of sufficient statistics that our particular q has. Well, a one-dimensional Gaussian always has two sufficient statistics. They're separate things: theta is what we're trying to infer. This is what defines the world that we want to understand. Theta in some sense generates y. We observe y, we take y, and we infer back on theta. So it's this picture that we had in the slides. We have basically the brain here. Let me try to... yeah, this is a brain here. And now this is the world outside. Maybe this is North America, and then South America, like this. And then with Europe and Africa it's more difficult. So this is the world. And we have a forward model that tells us how the world, this is theta, generates y. So this is our model: the probability of seeing y given a particular theta. And then we also have a prior on theta. And y is the observation we make. But now what we're interested in is knowing what the world outside is like.
And in order to do that we need to perform Bayesian inference, in order to get the posterior given the observations that we have. This is the inverse problem. This is what we're trying to solve, and this is the forward model. The short answer is no, because theta is the reality out there. So this distribution over theta is a probability distribution, and this distribution is different from the prior distribution, but theta is theta is theta. Theta is the external reality. To go back to this example that we'll come back to again: if I'm out at sea and I'm measuring the angle between the lighthouse I'm seeing and north, theta is that angle. There is one true angle, and I'm trying to infer that. Before I make my observation I have a certain belief about theta, and this is my prior. After I make my observation I have a new belief about theta. But theta is theta is theta. Slide number 21, in this example. Yes. So this is a particular example. This here is generic, what you see now, and then I take particular values for the quantities here, I fill them in, and I get a Gaussian. But, okay, I see what you say: theta is another thing here. What is theta here is lambda there. I may have to adjust that. So, no, it's a different one. Yes. I should be more consistent in the notation. So here, what is x there is y here. And the theta that we use to infer, so we have an added layer of complexity here. So theta here is the external reality that we're trying to infer. There is a consistency, but there's an added level of complexity, because in this whole scheme here, which is actually much simpler than the one we're looking at here, we don't have a q. Yes, exactly. But now we're taking the... exactly, exactly. So the external reality that we're trying to infer is these parameters in this setting. So conceptually, this theta and the theta here are the same. But we've added a level of complexity now.
So we have sort of interposed this q between theta and ourselves. We are mediating with theta on the basis of lambda. We're using an approximate posterior: we are describing theta by lambda. We're learning a description of our universe, whereas there we're directly inferring on the parameters of the distributions generating our data. There's an added layer of complexity. The added layer of complexity is that we are not directly inferring the posterior. So what we're doing there is solving Bayes' theorem directly. We're solving the inverse problem directly, without ever going through an approximate posterior q. What this will give us here is an approximate solution, and what that gives us there is an exact solution. So we are parametrically describing reality using our q here. There we're directly going to the parameters generating our input, and here we're using a parametric description of that, and we're going to adjust it so that it comes close to this. So we're not going to directly find theta. We are going to find the best lambda to describe our theta. That's the added layer of complexity that we have. And we do this because it allows us to solve a problem which would otherwise be impossible to solve. We are doing all of this because, in all interesting cases, Bayes' theorem, which you had here ten minutes ago, cannot be solved analytically. There is in principle no solution to that equation, so you cannot derive an analytic solution to it that you can write down. So we have to find a way around this, and our way here, there are several ways, but our way here to get around this, is to use the free energy bound on the log model evidence, as the title says. That's what we're doing now. I hope this will become clear in the course of the lectures as we also do examples.
So we will get to a whole class of examples where we apply this strategy to determine what is going on outside and what the states and parameters of our systems are. You can just ask, so we can discuss it. Yes, okay. The brain makes observations, has sensations. Yes. I can show you another slide that may be helpful here. Okay. This one here, I like this slide. This is our agent: our brain, or the robot we're building, or whatever. This is the world outside. There are two interfaces between these two separate little worlds, the outside world and the inside world. One interface is the sensory input, and the other interface is the action our agent takes on the world. So theta is basically out there, it's the state of the outside world, and the lambdas are what describe that. Keep this in mind; I'll leave it up. We continue by introducing the mean field approximation. We'll do this over here. The mean field approximation comes originally from physics. Now, for clarity, I'm going to stop writing the m. The m is always the same, right? So let's forget about it. Q of theta is the product, over all elements j of an index set J that partitions our thetas, of q of theta j. So we have our theta: all of our thetas are in here, partitioned. This is little j and this is capital J. Well, how can we... so there is in probability this sort of bad habit, which is at variance with all the rest of mathematics, that when I write p of x and I write p of y, this is not the same function. It just means probability distribution of x and probability distribution of y. This is unfortunate, but it's ubiquitous in probability. Yes? I know. Of course, hardcore mathematicians cannot tolerate this, and they then write something like f sub capital X for the density of the random variable capital X, something like that.
We're not going to do this, because in much of the literature that we deal with, in probability, machine learning and so on, this is not done. So it's not just me who's doing this, and you'll have to get used to it. P just means probability distribution, and this distribution can be different from that one, and these q's are going to be different from each other. So what counts, in some sense, is the variable inside: the argument tells you what function this is. This is the distribution of x, this is the distribution of y. It's unfortunate, but it's a fact that has historically grown. So we're partitioning up our thetas into different groups of thetas, and here we have little j equals one, little j equals two, little j equals three, and so on. Theta is the whole set of states and parameters that we want to infer, and big J is an index set. That's what we're doing. Now some notation: theta without i. This is a backslash; you know from set theory you have A without B. I'm not going to say minus i, I'm going to say without i. This is defined as the set of theta j's where j is not equal to i. It's all thetas in here except the ones that are in the i-th subset. And you'll be happy about this: we're just going to write q i as a short form for q of theta i, and q without i is going to be q of all theta j with j unequal to i. That's q without i. And in all of this, i and j are elements of big J. This will be used as soon as we have looked at the free energy functional. Free energy functional. A functional is a function of a function: not a function of a variable, but a function of a function. So we are going to look at the free energy as a function of q, and q itself is a function. That's why it's called a functional. We take the log model evidence: the log of the marginal, the integral of the joint p of y and theta given m, d theta. And then we're going to do some algebra.
We're going to keep the log outside here, but inside the integral we're going to do a very simple trick. We're going to multiply by q of theta parameterized by lambda, and then we're going to divide by the same thing again. So effectively we've done nothing: we've multiplied by q of theta parameterized by lambda and we've divided by it again. So this equality holds. And now we take the log inside. We take the log in here, whereby we get this inequality. This is Jensen's inequality. It is itself easy to prove; you can look it up on Wikipedia, there's a proof on Wikipedia. So we get q times the log of the joint divided by q, integrated over theta. And then this thing continues: we take this apart, simply again using the rules of logarithms. So we now have two integrals: q of theta parameterized by lambda times the log of the numerator here, that's the log joint, minus q of theta parameterized by lambda times the log of q of theta parameterized by lambda, d theta. The first term is how the internal energy is defined, because this is the expectation under q. That's what this notation means: the expectation under q. And this is again a functional, and I'm going to write functionals with square brackets: the expectation of the log joint, plus the entropy, which is also a functional, a function of a function, so I can just write q here. And this in turn defines our free energy functional. To be consistent with the slides, we call it A, and this will be a function now of lambda, of the observations, and of whatever we put into our model, so our prior and our likelihood. M here is just a placeholder for the definition of prior and likelihood. This is what the variational free energy is as a functional. This is the expected log joint; the physical analogy to it is the internal energy, we saw that in the slides. To be exact, this is the negative internal energy. And this is the entropy.
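As a quick numerical sanity check of the bound just derived, here is a sketch with entirely made-up ingredients (a Gaussian likelihood with unit variance, a standard Gaussian prior, one observation, and several Gaussian q's of varying quality): for every q, the expected log joint plus the entropy stays below the log model evidence, and the gap is the KL divergence between q and the true posterior.

```python
import numpy as np

# Made-up model: p(y|theta) = N(y; theta, 1), p(theta) = N(0, 1), one datum y.
theta = np.linspace(-10.0, 10.0, 20_001)
dt = theta[1] - theta[0]
y = 1.5

def log_gauss(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)

log_joint = log_gauss(y, theta, 1.0) + log_gauss(theta, 0.0, 1.0)  # log p(y, theta | m)
log_evidence = np.log(np.exp(log_joint).sum() * dt)                # log p(y | m)

for mu, sigma in [(0.0, 1.0), (2.0, 0.3), (0.75, 0.71)]:           # candidate q's
    log_q = log_gauss(theta, mu, sigma)
    q = np.exp(log_q)
    Z = q.sum() * dt
    q, log_q = q / Z, log_q - np.log(Z)
    energy = (q * log_joint).sum() * dt     # <log p(y, theta | m)>_q
    entropy = -(q * log_q).sum() * dt       # H[q]
    F = energy + entropy                    # variational free energy
    assert F <= log_evidence + 1e-9         # Jensen: F never exceeds log evidence
    print(mu, sigma, log_evidence - F)      # the gap is KL(q || true posterior)
```

The third candidate is close to the exact posterior of this toy model, so its gap is nearly zero; the mismatched q's pay for their mismatch with a looser bound.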
And here in this notation we need a minus sign to get the variational free energy, because energies correspond to negative log probabilities. We want to have our optimal lambda, and I will write down this goal that we have. Yes, why is there no theta here? Because q is parameterized by lambda. So if you have your q, you have your distribution on the theta. But this is not a function of theta; it is a function of lambda. In this here, you have a distribution over theta, but it is parameterized by lambda. So what you put into the function here is the lambda. Imagine a Gaussian again: in order to define your Gaussian, you give it a mean and a variance. The variable whose distribution is described by that Gaussian is not what you put in. So theta is what this Gaussian describes, but the Gaussian is a function of the mean and the variance you put in. So it only depends on this here. What we want is basically lambda star, which is the argmin of the variational free energy. Sorry about this, I shouldn't have done it like this. Let's call this F. Then this here is the negative variational free energy, and to be consistent with the slides, this is minus A b. So this is the quantity you have in all the slides: this is F and this is A b. So we want the lambda that minimizes the variational free energy: the argmin of A b, which is a function of lambda, y and m. This is what we want: the optimal lambda, the one that makes our q as similar as possible to the true posterior. Yes, it's not just the mean. It's the optimal mean and the optimal variance of the Gaussian that describes theta. So it is your posterior belief, because some uncertainty about theta will remain; you will still have a variance in your belief.
So under Bayes' theorem, if you could solve it exactly, you would have a posterior distribution with a certain mean and variance. The exact distribution may not be a Gaussian; it may be any kind of distribution, but it will be a distribution. If you approximate that distribution using a Gaussian, you will have a certain mean and a certain variance, and those will be the mean star and the variance star for which lambda star is a short form. Let's just boldly move on. So what we have is a constraint on our q's, and that is that all of the q i have to be probability distributions. And that means, in mathematical terms, that the integral over each of the q i has to be one, for all i. So we want to minimize A b under this constraint; we have to observe this constraint while we wiggle around with all of our q's, changing the lambdas that parameterize them, in order to minimize A b. What is the way to do that, from your calculus classes? If you minimize something under a certain constraint, what do you employ? Exactly. We're going to solve this using Lagrange multipliers. So we introduce Lagrange multipliers: we're going to use the functional f tilde, which is defined as the functional we had before, plus all of these Lagrange multiplier terms, lambda j, these are the multipliers here, and inside we have the constraint: the integral of q of theta j d theta j, minus one, which must be equal to zero. That's how you use Lagrange multipliers: you write the constraint so that it equals zero, multiply it by the multiplier, and add it to the functional you want to minimize. And now we do the following. We fill in the definitions: the integral over the whole q of theta, this is now our whole q of theta, not divided up into the thetas, times the log probability of the joint, minus the entropy term.
Also the whole q of theta times the log of the whole q of theta, plus our Lagrange multiplier terms: lambda j, and in here, again, the integral of q of theta j d theta j minus one. I'll clean this up here. And now we're going to separate this out into different parts. We're going to separate out one of the q i's. So this becomes the integral over q of theta i times the integral over q of theta without i times the log joint, d theta without i, d theta i. This is exactly the same as here, but now we've taken it apart: we have an inner integral over all the theta j's where j is not i, and then an outer integral over theta i. And we do the exact same with this term here, too. So q of theta i, integral of q of theta without i, and then the log, and this is the log that applies to both of them: q of theta not i times q of theta i, and this is the end of the log, so all of this is in the argument of the log; d theta not i, d theta i. Plus, and here it's easy to separate them, the lambda i times the integral simply over q of theta i d theta i minus one, plus a sum over all the j's that are not i, so I'm simply going to write j not equal to i here, this is what we're summing over, and this is lambda j times the integral of q of theta j d theta j minus one. And now we're going to write this as the functional f tilde with two arguments: the first of them is going to be q i, and the second argument is going to be q not i. Yes, I was waiting for somebody to spot this. You're right, we don't need the sum... Actually, let's be totally precise and do it like this: let's put in indices j here, and let's explicitly write j not equal to i, like this. Do you agree? This is an i. Sorry. Ah, so here we're summing over all of the lambda j's. It should be... Oh, okay, sorry, I didn't see that. Yes, absolutely, thank you. So we are going to take the functional derivative of this with respect to q i.
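For reference, here is what is now on the blackboard collected in one place, in the notation defined earlier (theta without i written as a set difference; note that the lambda j here are the Lagrange multipliers, not the variational parameters):

```latex
% The Lagrangian functional, before separating out q_i:
\tilde{F}[q] = \int q(\theta)\,\ln p(y,\theta \mid m)\,d\theta
  - \int q(\theta)\,\ln q(\theta)\,d\theta
  + \sum_{j \in J} \lambda_j \left( \int q_j(\theta_j)\,d\theta_j - 1 \right)

% Separating out q_i, using q(\theta) = q_i\, q_{\setminus i}:
\tilde{F}[q_i, q_{\setminus i}]
  = \int q_i \left( \int q_{\setminus i}\,\ln p(y,\theta \mid m)\,d\theta_{\setminus i} \right) d\theta_i
  - \int q_i \int q_{\setminus i}\,\ln\!\left( q_i\, q_{\setminus i} \right) d\theta_{\setminus i}\,d\theta_i
  + \lambda_i \left( \int q_i\,d\theta_i - 1 \right)
  + \sum_{j \neq i} \lambda_j \left( \int q_j\,d\theta_j - 1 \right)
```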
Who in here is not familiar with variational calculus? Everybody familiar with variational calculus? So you're fine with taking the functional derivative. It's basically like taking the derivative with respect to a variable. If you have a function of x and y, you can take the partial derivative with respect to x, or you can take the partial derivative with respect to y. Here we're dealing with functionals, functions of functions, and in exactly the same way we can now take the functional derivative with respect to q i. So this f tilde has a certain value, and we want to find out whether we can find a minimum with respect to q i. This minimum will have as a necessary condition that when we vary q i at the point of the minimum, we have a derivative of zero. So what we're going to do is take the functional derivative with respect to q i, set that to zero, and solve the whole thing for the optimal q, which will be parameterized by lambda. Then we have found the optimal q i with respect to our free energy and can use this in actual applications to models. So I will first give you the definition of the functional derivative again, and then we shall actually start computing it. We'll probably finish with it tomorrow, but we shall at least start now. The functional derivative of f tilde with respect to q i is the derivative with respect to a variable epsilon, evaluated where epsilon is zero, of f tilde of q i plus epsilon times phi i, where phi i is a test function; the second argument, which we're not worrying about at this stage, is q without i. This is the definition of the functional derivative, and we're just using this definition here. Phi can be basically as badly behaved as allowable, but it has to vanish at the ends. It is in a thing called a Schwartz space, which means it cannot grow to infinity at the ends.
It certainly has to be continuous, and it has to be differentiable; I don't know how many times, but I can check. So it has to be reasonably well behaved, but we don't want to be too restrictive either. You can find pathological phis where this doesn't work anymore. So basically the rest is algebra, but it's going to be a little bit of algebra before we get our final result. We're going to take the derivative with respect to epsilon, evaluated at epsilon equals zero, and we're just going to put things in. I'm going to give you the first step, and then basically I will leave out a few steps and let you fill in the rest during a tutorial. When are your tutorials set, or how are they organized? When can I let you fill in the gaps in the whole derivation? Is there time allotted for that, or do we do it all here? All here, you're saying. You don't want homework? Okay, I will not do every little step, but I'm happy to take questions about the missing steps that may still be there. So: q of theta i plus epsilon phi of theta i, times the second integral here, because much of the whole thing is copying from above: q of theta not i times the log joint, p of y and theta given m, d theta not i, d theta i. And this is then minus the derivative with respect to epsilon of, and okay, this is a little algebraic trick, I do have to give you this, otherwise you may get stuck: q of theta i plus epsilon phi of theta i, times q of theta not i, times the log of, q of theta i plus epsilon phi of theta i, times q of theta not i, and then close the parenthesis, d theta not i, d theta i. And then we continue; we need two more terms here. Plus lambda i times the integral of q of theta i plus epsilon phi of theta i, d theta i, minus one. Plus, again inside the derivative with respect to epsilon evaluated at epsilon zero, the sum over j unequal to i of lambda j times the integral of q of theta j d theta j minus one. So this is the energy term, and then the entropy term, and then the constraints. The four terms are energy and entropy, for theta i and theta not i; that gives you four terms. Take out the phi i's.
That's the next step we do: the phi i's are going to come out before the integral. So I'm going to write the first line. I hope everybody's finished with copying here. Everybody finished? No? Okay, sorry, I'll slow down a bit. We're just going to do the next step, and there is going to be a radical simplification already, because we're going to take all of these derivatives. So after we take the derivative in the first term, all that we're left with is phi of theta i, sorry, the integral over phi of theta i, I shouldn't forget the integral, times the integral of q of theta not i times the log joint, d theta not i, d theta i. That's all that remains of our first term. Second term: the integral over phi of theta i times the integral of q of theta not i times the log of the two q's multiplied together, q of theta i times q of theta not i, d theta not i, d theta i. Third term, and that's going to look a bit more complicated: q of theta i, q of theta not i, and now a big fraction with phi of theta i and q of theta not i over q of theta i and q of theta not i. So you see, we're doing our old trick again: we're just multiplying and dividing by the same thing. Plus our last term, which is very nicely simple. And from here it's uphill. In the third term, is it... let me first check whether I did everything correctly. Yes. Your question is about which term exactly? The whole third term. Let me check. No, it was somewhat misleading to relate the previous four terms directly to these four terms, because if you look at the previous four terms we had, the last of them does not contain epsilon, so when you take the derivative, it falls away. And this one... Exactly, yes. You were absolutely right, thank you for spotting this. Shouldn't we have phi in both terms, here and here? On the right, what do you mean? No. Phi times q, q times phi. Yes, this can slide. We're going to need this in the next step.
Because we want to end up with interpretable terms, like an entropy and a partition function, so that everything will be very, very simple in the end. So let me give you, in the last few minutes, a preview, a sneak peek at the end result, just to show you that the effort will be worth it. We will end up with a q of theta i, our optimal q of theta i, that is one over Z i, which is just a normalization constant, times the exponential of what we're going to call the variational energy, capital I of theta i. This term, I of theta i, will turn out to be simply the expectation, I'm going to use angle brackets for that, of the log joint under q not i. So all we need to do to find our optimal q is this: we take the log joint, which is simply our model, our likelihood and our prior, and we take the expectation of it, so basically we integrate over all the theta j's where j is not i, so that only theta i remains. This gives us a function which is only a function of theta i, because all the other thetas have been integrated out. We exponentiate that function, and this is our q. Finished. So, exceedingly simple. All that we still need to do is normalize it, but that's easy. All you need to do to find your optimal q here in this mean field approximation is: take your log joint, take the expectation of your log joint with respect to all of the q's that are not i, exponentiate, and you've got your q i. Then you multiply all of your q i's together and you've got the whole q, the whole approximate posterior, which is going to be optimal under your assumptions about the parametric form of q. We're going to finish these algebraic steps on Thursday, and then we will look at the Laplace approximation. So, in the remaining three minutes or so, any questions right now? I suggest you digest this at home. See if you find any more sign errors I could have made.
I hope there aren't any more, and then we'll complete this on Thursday. If there aren't any immediate questions: thanks for bearing with me. See you. Yes, I have this. You can have this written down at some point. A direct interpretation: we call this the variational energy because this is the log joint, and the physical analogue of the log joint is the internal energy. So it's the internal energy.
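To preview where the algebra leads in practice, here is a minimal sketch of the mean-field fixed-point scheme on a discrete toy problem (the tabulated joint is randomly made up and is not any particular model from the course): each factor q i is set to the exponentiated expectation of the log joint under the other factor, then normalized by its Z i, and each such update can only increase the free energy.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up discrete joint over (theta_1, theta_2): log p(y, theta_1, theta_2)
# tabulated on a 30 x 40 grid, known only up to a constant.
log_joint = rng.normal(size=(30, 40))

# Mean field: q(theta) = q1(theta_1) * q2(theta_2); start from uniform factors.
q1 = np.full(30, 1 / 30)
q2 = np.full(40, 1 / 40)

def free_energy(q1, q2):
    energy = q1 @ log_joint @ q2                       # <log p>_q
    entropy = -(q1 @ np.log(q1)) - (q2 @ np.log(q2))   # H[q1] + H[q2]
    return energy + entropy

F_values = [free_energy(q1, q2)]
for _ in range(50):
    # q1* proportional to exp(<log p>_{q2}): expectation under the OTHER factor.
    log_q1 = log_joint @ q2
    q1 = np.exp(log_q1 - log_q1.max())
    q1 /= q1.sum()                                     # the 1/Z_1 normalization
    log_q2 = q1 @ log_joint
    q2 = np.exp(log_q2 - log_q2.max())
    q2 /= q2.sum()
    F_values.append(free_energy(q1, q2))

# Each coordinate update can only increase the free energy.
assert all(b >= a - 1e-10 for a, b in zip(F_values, F_values[1:]))
```

Subtracting the maximum before exponentiating is just numerical hygiene; the normalization absorbs it, exactly as the 1/Z i on the blackboard absorbs any constant in the variational energy.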