OK, welcome back everybody. As always, first: are there any questions from last time? If not, here is a little reminder of what we're in the middle of. We're in the middle of taking the functional derivative of our free energy F, with the normalization constraint added through a Lagrange multiplier, and setting it to 0, so that we can find the optimal approximate posterior q of theta i. So q star is going to be our optimal posterior. I'll start with the step where we ended yesterday and write it on the board again. This is just after we took the derivative, in the direction of a test function phi of theta i, and we were left with four terms. The first is the integral over phi of theta i times the inner integral, over theta-not-i, of q of theta-not-i times the log joint; we integrate over theta-not-i and then over theta i, so over all the thetas. The second term has a minus sign: first the outer integral over phi of theta i, and then the inner integral, which is the expectation under q of theta-not-i of the logarithm of all the q's, that is, of q of theta i times q of theta-not-i, integrating over theta-not-i. The most interesting one is the third term. This time the outer integral is over q of theta i, and in the inner integral we have q of theta-not-i times this fraction: phi of theta i times q of theta-not-i in the numerator, q of theta i times q of theta-not-i in the denominator; let's finish writing it, integrating over d theta-not-i. If you look at it quickly, you see that this factor I've introduced actually cancels, and so does this one. The only reason I wrote it that way is that we're going to need it in the next step, to take things apart into terms we can interpret; the only real meat left in the integral is the function phi. Yes? Theta-not-i, you say? Oh, sorry, you are absolutely right, the first one is the right one; that was the one I was thinking of. And the fourth term is the simplest one, the one with our Lagrange multiplier: lambda i times the integral of phi of theta i over theta i. That's all; we already had this yesterday. Now we're just going to turn it around algebraically and end up with terms we can interpret. First, the first term up here: the inner integral is simply an expectation, the expectation of the log joint under this distribution. So we write the outer integral over phi of theta i, and inside, with angle brackets indicating the expectation, the log joint under the distribution we'll call q-not-i, integrated over d theta i. That's our first term now. The second term we take apart: minus phi of theta i times log q of theta i, and we can pull this log q of theta i outside the inner integral, because the inner integral is over theta-not-i. So the inner integral is just q of theta-not-i integrated over theta-not-i, and then we close the outer integral. Now, what is this inner integral? Exactly: it evaluates to unity, so we can forget about it.
Then, turning to the further terms: next we have minus the integral over phi of theta i, and the inner integral is q of theta-not-i times log q of theta-not-i, d theta-not-i, d theta i. Now, what is this? In the second term there is a log q of theta-not-i that I missed, you're saying: minus phi of theta i, log q of theta-not-i, q of theta i, d theta. Yes, we're going through the terms one by one, taking everything apart, so in the end everything should be accounted for. If you're still not happy once we're through with all the terms, because we're taking them apart and reassembling them, then shout. But in the meantime, what is this? Exactly: an entropy. This is the entropy of the distribution q-not-i. The further piece we have to deal with is this term here: phi of theta i times, again, q of theta-not-i, d theta-not-i, d theta i, and as before the inner part evaluates to unity. And the last little term, which I'll write above here, is simply the Lagrange multiplier again: lambda i times the integral over phi of theta i, d theta i. That's it; let me wipe the board. Everybody finished here? Next, we take everything together: the whole integral over phi of theta i times one big parenthesis. Inside it we have the expectation of the log joint under q of theta-not-i, minus log q of theta i, plus the entropy of q-not-i, minus 1, plus the Lagrange multiplier, and we integrate over theta i. And now we're going to give these things names. We're going to call this the variational energy; we already have that in the slides. The expected log joint, or actually the expected negative log joint, is going to be an energy, because negative log probabilities are energies, so we put a minus sign into the definition: this is I of theta i. Because we've integrated over all the thetas with an index other than i, it is still a function of theta i; all the other thetas have been integrated away. It is the expected log joint under all the other distributions over theta, with the sign flipped, and so a simple function of theta i. And this whole remaining piece I'm going to call minus log Z i, because it belongs to the i-th partition; importantly, it is constant with respect to theta i, so there's no difficulty in integrating over it, it doesn't depend on theta i. Yes, question: should it be plus the entropy? You're right; we'll write a plus there. So the entropy term, together with the minus 1 and the Lagrange multiplier, is the constant that goes into minus log Z i; it's constant with respect to theta i, so we have no problem with it. What we end up with, in simplified form, is the integral of phi of theta i times minus I of theta i, minus log q of theta i, minus log Z i: the expectation, if you like, of the functional derivative of F with respect to q i, taken against phi i. And this means that we can now write down our optimal q by setting this to 0; optimal means the functional derivative is 0. The functional derivative here is minus I of theta i, minus log q of theta i, minus log Z i, and this should be 0 at the extremum. So this is what we write, and we solve for our q: the optimal q is 1 over Z i times the exponential of minus I of theta i. That's the culmination of all the work we've been doing.
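Written out, the stationarity condition we just arrived at on the board, with phi an arbitrary test function:

$$\int \phi(\theta_i)\Bigl[\bigl\langle \log p(y,\theta \mid m)\bigr\rangle_{q(\theta_{\neg i})} - \log q(\theta_i) - \log Z_i\Bigr]\, d\theta_i = 0 \quad \text{for all } \phi,$$

so at the extremum

$$-I(\theta_i) - \log q(\theta_i) - \log Z_i = 0, \qquad I(\theta_i) := -\bigl\langle \log p(y,\theta \mid m)\bigr\rangle_{q(\theta_{\neg i})},$$

$$q^{*}(\theta_i) = \frac{1}{Z_i}\,\exp\bigl(-I(\theta_i)\bigr).$$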
We've seen that if we do the mean field approximation, that is, we partition the variables theta into groups indexed by i, then the approximate posterior has this shape in the i-th partition. I call this Z because that is the usual name for the normalization constant, the thing that turns an exponentiated quantity into a probability distribution. And the quantity we've exponentiated here is the negative variational energy, whose definition was the negative log joint expected under all the other q's, the q's of theta j where j is unequal to i. This is the exact same thing here on the slide; there I gave q a star to show that it is the optimal q. Any questions about this? Yes? Yes: when we say this is q star, it is the one that fulfills this equation; out of all the q's, this is the one we give a star to, our optimal q star. Why is it a mean field approximation? We have taken all of our thetas, which in our case, where we do inference, are all the states and parameters in the outside world that we want to infer on (in physics those would be your particles), and we said that the posterior, p of theta given our observations y under model m, is approximated by a function q of theta, and this q of theta is the product over an index set of the q of theta j. In general these are index sets: you get an index for each partition of the thetas that you choose. You can put several thetas into one partition if you like, but mostly you take one theta per partition, describing one particle. Physically, this amounts to saying: in the simplest example I have three particles, one here, one here, one here. If I start looking at how this one influences this one, and I change the state of this one on the basis of that influence, then the influence on this one changes, and that in turn influences this one, so we have to update that one as well; we cannot do all of this at the same time. That is why we look at one particle at a time, and we look at the mean field that the other particles create to influence that particle. This is what we're doing here when we take the expectation of the log joint under all the other particles, all the other partitions of our parameter space: the variational energy is a function only of the partition we're interested in, of the particle we're interested in, because we've integrated over all the others. That is the mean field. Now, in general it is not true that probability distributions can be factorized like this, so you always have to keep in mind that this is an approximation. It may work in practice, but there is no guarantee; I cannot give you any guarantees, and developing guarantees for approximations like this is an active field of research. In practice, people often use the approximation and then check whether it works: because this is an efficient way to do inference, there are other ways to do inference that are much less efficient but allow you to check whether this one is working.
So if you solve a problem approximately like this and you have a big enough cluster, you can go and try to find the posterior using sampling, and then you can compare them. Because sampling, given infinite time, is going to give you a posterior that looks like the true posterior, this gives you a way to check whether the approximation works. In the practical application where I used this, I applied it to a time series, and then I also built a particle filter, as it's called; it's basically sampling across time steps. You take an ensemble of particles, do your time update, look where the ensemble sits, then do the next time update, and so on. It's very inefficient and it can very easily break, but at some point we got it to work, and using this particle filter we saw that, in the case we were applying it to, it actually gave us the same thing.

So let me give you the example that we will do. In the hierarchical Gaussian filtering model we're going to look at, we have a hierarchy of states called x1, x2, x3, and so on up to xn; you can add as many as you like. And then there is a time dimension, because it's a time series: this is at time point k, and then you get the next time step, x3 at time k plus 1, x2 at time k plus 1, x1 at time k plus 1. And we partition it like this: each state simply gets its own posterior Gaussian. That is our partition. As for the sign that is missing on the slide: people in information theory often define energies the other way around from how physicists do, and I've been inspired by bad examples. The way physicists have it, the higher your energy, the lower your probability; that is how the Boltzmann distribution works, you have a lower probability of being at a higher energy. But then you always have these minus signs everywhere, and in information theory people sometimes, out of a false sense of convenience, just invert the sign. That is what happened in my slide. Other questions? Good.

So after this long slog, another thing we need to look at is the Laplace approximation. The basic idea of the Laplace approximation is that you take a probability distribution and you enforce a Gaussian shape on it. It's called the Laplace approximation because Laplace first did this, with the beta distribution. So, I think this is chapter 3 here: the Laplace approximation. Who is not familiar with the beta distribution? Some of you, yes. The beta distribution describes your uncertainty about a quantity in the unit interval. You have 0 here and 1 here, and you don't know where on that line your quantity is, so your probability distribution will look something like this. It's the distribution you use when you are uncertain about a quantity on a bounded interval, bounded below and bounded above, and it sits somewhere in there. Say you get a coin and you don't know whether it is fair. You would like the coin to give you one half heads and one half tails, but it may not be fair, and how do you find that out? You toss the coin, and you keep tossing it, and the more you toss it, the more you find out about how probable heads is as opposed to tails. What you are doing as you find this out is updating your beta distribution over where this parameter sits: the parameter of the Bernoulli distribution, your probability of heads. That may be one half, it may be three quarters, or whatever.
And as you reduce your uncertainty, you get a narrower and narrower beta distribution. Anyway, Laplace knew all this 200 years ago, and he found it inconvenient, because the functional form of the beta distribution has gamma functions inside and so on; it is not very tractable. So he said... yes? Yes, I was just using it as an example, but I'll give you the functional form. I don't know it by heart, or at least I'm not sure I would get it right, so let me guess it and then let's check it; don't write this down, it's going to be corrected. The probability of x, parametrized by alpha and beta, is something like: gamma function of alpha times gamma function of beta, divided by gamma function of alpha plus beta, times x to the alpha, times 1 minus x to the beta. Something like that. Let's see how close I was... yes, here: I mixed up the denominator and the numerator, but otherwise I was fine; oh, and there is a minus one in the exponents. So this is your beta distribution: gamma of alpha plus beta, over gamma of alpha times gamma of beta, times x to the alpha minus 1, times 1 minus x to the beta minus 1. The interpretation of alpha and beta is that you have your two possible outcomes and you count them. Alpha is your count of heads in the coin-toss example, and beta is your count of tails, so alpha plus beta is the number of times you've tossed the coin; this is how it gets narrower and narrower as you keep tossing. And with no observations you've got a flat distribution, whose density is of course just 1 everywhere on the interval: 1 times 1 is 1, so that's your density. Sorry? Well, it depends on whether you put the minus one in the exponents, I think; there are different parametrizations. Let me see, this is the one I took from Wikipedia. We can check that and say more next time; I think when you parametrize it the way it's done here, with the minus one, then you start at alpha equal to one and beta equal to one for the flat distribution.

So this is your beta distribution. Now, Laplace thought this was complicated to work with; he preferred to work with Gaussians. So he said: well, let's go and fit a Gaussian in here. Of course, this is in some way absurd, because the Gaussian will have probability mass outside the unit interval; it will spread out too far. But Laplace found that it worked well anyway, because close to the peak the approximation is very good, as long as you don't go far away from the peak. So using a Gaussian as an approximation at the peak of your distribution has come to be called the Laplace approximation, and that is what we're going to do in general here; that was just the basic idea of it. Yes? No, that was a historical remark; don't get hung up on the beta distribution. I just wanted to make a quick remark on the history of the Laplace approximation, on why it is called that. I only wrote the beta distribution down because somebody asked the question; this is not about the beta distribution. Let's switch away.

What we want to do is take our posterior, p of theta given y and m, as always, and approximate it by a Gaussian q. So now we're going to fill our q with some meat, with some content. Our q will be Gaussian, which specifically means it is parametrized by a mean and a covariance: this is a capital Sigma because it is a matrix, a covariance matrix, and this is the mean vector mu. So this is N of theta, parametrized by mu and Sigma.
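A minimal numerical sketch of exactly this idea, on the coin-toss example (this is an illustration rather than anything from the slides; it assumes NumPy and SciPy are available, and the counts are made up): fit a Gaussian at the peak of a beta distribution, with the curvature of the log density at the peak setting the variance.

```python
# Laplace approximation to a Beta posterior in the coin-toss example.
# Assumes NumPy and SciPy; the counts below are made up for illustration.
import numpy as np
from scipy import stats

# Flat Beta(1, 1) prior, then observe 12 heads and 4 tails.
alpha, beta = 1 + 12, 1 + 4

def log_p(x):
    # Log density of the Beta, up to its normalizing constant.
    return (alpha - 1) * np.log(x) + (beta - 1) * np.log(1 - x)

# Peak (mode) of the Beta, in closed form for alpha, beta > 1.
mu = (alpha - 1) / (alpha + beta - 2)

# Curvature of the log density at the peak (second derivative, central differences).
h = 1e-4
d2 = (log_p(mu + h) - 2 * log_p(mu) + log_p(mu - h)) / h**2

# Laplace approximation: Gaussian with mean mu and variance -1 / d2.
laplace = stats.norm(loc=mu, scale=np.sqrt(-1.0 / d2))

# Compare the two densities on the unit interval.
x = np.linspace(0.01, 0.99, 99)
print(mu, -1.0 / d2)
print(np.max(np.abs(stats.beta(alpha, beta).pdf(x) - laplace.pdf(x))))
```

Near the peak the two curves agree closely; the mismatch is mostly in the tails, where the Gaussian puts mass outside the unit interval.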
And the definition of the Gaussian in d dimensions is: 2 pi to the power of minus d over 2, times the determinant of Sigma to the power of minus one half (the square root of the determinant, in the denominator), times the exponential of minus one half, theta minus mu transposed, times the inverse of Sigma, times theta minus mu. If you remember your linear algebra: this is a column vector, this is a row vector, and this is a square matrix. So what does that give us in here, what kind of tensor is this? It's a scalar, yes. We exponentiate the scalar, and we have a scalar probability density here. Good.

Some notation and preliminaries before we derive the specifics of the Laplace approximation. First, if a function has theta in its index, that means the partial derivative of the function with respect to theta; we just use the index theta as a sign for the derivative. And of course, if we use two thetas, that means we've taken the second derivative. Then we define the quantity L, since we're dealing with the Laplace approximation: it is a function of theta, defined as the log joint. For the log joint we'll just write L of theta, to make things simpler, because theta is the only variable in there. y are our observations; they're fixed, we've made them, we cannot change them, so they're not variable. The model m is our model. So the only variable in there is theta.

Next, what the large Sigma is: it is a covariance matrix. Your Gaussian in one dimension is characterized by two quantities. You have the mean mu, in that case a scalar because we're in one dimension, and the width is characterized by the standard deviation; the variance is the square of the standard deviation, so it's equivalent, and we describe it by the letter sigma. Now we expand this to several dimensions: the mean becomes a vector and the variance becomes a covariance matrix. To give you an intuition for that, let's look at two dimensions, x1 and x2, and place a Gaussian distribution on this plane, drawn with contour lines, say roughly aligned with the axes. These are the contour lines of the Gaussian, and this is mu, now a vector with components mu1 and mu2, the center of our two-dimensional distribution. If the two axes of the Gaussian are aligned with the axes of the coordinate system, then our Sigma is a diagonal matrix: here we have the variance along the first dimension, there the variance along the second dimension, and the off-diagonal terms are 0. You start needing the off-diagonal terms as soon as your Gaussian is no longer aligned with the axes. If you have one lying at an angle, you get off-diagonal terms, which are the covariances between the values on the different dimensions. And if you have a cloud of dots (that's my third point here) and you know they were drawn from a Gaussian, you can estimate the covariance matrix by taking the variance in this direction, which goes in here, the variance in this direction, which goes in there, and the covariance between the two dimensions, which goes in the off-diagonal entries, both of them, because the matrix is symmetric. Covariance matrices are always symmetric, and positive definite. What does positive definite mean?
All the eigenvalues are positive, exactly. Because if you diagonalize your covariance matrix, it is geometrically impossible for the variance in any direction to be negative; that would be absurd, it wouldn't make sense. So it has to be a positive definite, symmetric matrix. Other questions? Then let's start our preliminaries over here. For a d by d matrix A, we want the expectation under a distribution q of a term like theta minus mu transposed, times A, times theta minus mu. You can already see where this is going if you look at the definition of the Gaussian we just had. We write it out and take it apart: it is the expectation of a sum (careful, this is a summation sign, not a matrix), the sum over i and j of theta i minus mu i, times the element A i j of the matrix, times theta j minus mu j. This is simply the same thing written out in indices, still under the expectation with respect to q. We can take all the terms that don't depend on theta out of the integral, because this is q of theta, and we can take the sum out as well, so the sum sits outside, and inside we're left with the expectation under q of theta i minus mu i times theta j minus mu j, times the element A i j. Now, what is this expectation? Yes. Correlation? Close, but correlation is not covariance; exactly, it's the covariance, that's the definition of your covariance. (Oops, the microphone is getting tangled; I don't know what I can do here, I hope this will hold.) So this is our covariance matrix; specifically, this is the i j element of the covariance matrix. And if you remember your linear algebra: the sum over i and j of the i j-th element of one matrix times the i j-th element of the other matrix gives you the trace of the product of the two matrices (using that Sigma is symmetric); tr denotes the trace. This will be useful in what follows; it's just linear algebra. Now, if the matrix A in here is the inverse covariance itself, this simplifies radically. Specifically, the expectation under q of theta minus mu transposed, Sigma inverse, theta minus mu, exactly as it appears in the Gaussian, is then the trace of Sigma times Sigma inverse. How much is this? d, exactly. Let me put in one intermediate step: Sigma times Sigma inverse is the identity in d dimensions, and the trace of the identity is d. This is just something we need to remember.

So now let's start; let's look at the free energy under the Laplace approximation, say F under the Laplace approximation. This is the expectation under q of the log joint, which is a function of theta, plus the entropy of q; that is the definition of F. We take these two terms and treat them separately. First the first term, written with angle brackets: this is the expectation of L of theta under q. And here I should write 'approximately', because now we're doing the Laplace approximation. How do we do the Laplace approximation? We expand the log joint to second order. Anybody not familiar with the Taylor expansion? I know, I know: we're being good physicists here, we're doing a Taylor expansion. The point is this: if you take the logarithm of a probability distribution, and this probability distribution happens to be a Gaussian, what do you get, what shape? You get a quadratic function. So what we want is a quadratic approximation to our log joint.
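Written out, the density we are working with and the two identities just derived:

$$\mathcal{N}(\theta;\mu,\Sigma) = (2\pi)^{-d/2}\,\lvert\Sigma\rvert^{-1/2}\exp\!\Bigl(-\tfrac{1}{2}(\theta-\mu)^{\top}\Sigma^{-1}(\theta-\mu)\Bigr),$$

$$\bigl\langle(\theta-\mu)^{\top}A\,(\theta-\mu)\bigr\rangle_{q} = \operatorname{tr}(A\Sigma), \qquad \bigl\langle(\theta-\mu)^{\top}\Sigma^{-1}(\theta-\mu)\bigr\rangle_{q} = \operatorname{tr}(\Sigma^{-1}\Sigma) = \operatorname{tr}(I_d) = d.$$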
And as soon as we have that quadratic approximation, we exponentiate it, and there is our Gaussian. So this is our log joint, and now we're going to expand it inside the expectation, at a point that we're going to call mu; this is our expansion point, where the peak of the Gaussian will sit. So we have L of mu (not lambda; I know why I keep saying lambda, but it is an L), plus the derivative of L with respect to theta, in our notation L theta, evaluated at mu. That is a gradient, a row vector, and here we have theta minus mu, so multiplied together we again get the scalar we need. No, it's not an outer product, it's a dot product. Plus one half times theta minus mu transposed, times L theta theta, which is a matrix, times theta minus mu. I mean, you know this; it's just a Taylor expansion. And this all sits inside our expectation. Now we take the constant parts out. L of mu is of course a constant, it doesn't depend on theta, so it goes outside the expectation. The derivative evaluated at mu is also not dependent on theta, so out it goes too, and we're left with it times the expectation of theta minus mu under q, plus one half times the expectation of theta minus mu transposed, L theta theta evaluated at mu, theta minus mu, under q. Now this simplifies. Who sees in what sense it simplifies, what happens to this middle term? It will be 0, exactly; this vanishes. Exactly, yes: this is basically the expectation of theta minus mu. You could do it in several steps: take the minus mu outside, which gives you just minus mu, and inside you're left with theta, whose expectation is mu, so you've got mu minus mu. And what do we have here, in the last term? According to what we saw, this is the trace of the covariance times L theta theta, because we can rearrange it: we take L theta theta out, since it doesn't depend on theta, and what's left inside is exactly the definition of Sigma. Should I do this in several steps, or does everybody see why this is the case? Typically the ones who don't see why won't say so. Good, we'll leave it at that. So, in total, this first term of the free energy, continuing here, is simply the log joint evaluated at mu, plus one half times the trace of the covariance times the second derivative of the log joint evaluated at mu. This is our first term.

Now we turn to the second term, which is the entropy term. This is the negative of the expectation of the log of q under q, the definition of the entropy. We simply take the definition of the Gaussian that we saw before, and this gives us the expectation of: d over 2 (d is the number of dimensions) times log 2 pi, plus one half log determinant of Sigma, plus one half theta minus mu transposed, Sigma inverse, theta minus mu, all under q. Taking everything out of the expectation that we can take out, we have of course d over 2 log 2 pi, plus one half log determinant of the covariance, plus one half times this last expectation, which we know from before is equal to d. So all in all we have d over 2 times log of 2 pi e, where the e is there to absorb the extra d over 2, plus one half log determinant... let me make this a bit clearer; oh, I've gone over the line below which I shouldn't write.
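Written out, the two pieces so far (the linear term of the expansion has dropped out because the expectation of theta minus mu under q is zero):

$$\bigl\langle L(\theta)\bigr\rangle_{q} \;\approx\; L(\mu) + \tfrac{1}{2}\operatorname{tr}\!\bigl(\Sigma\,L_{\theta\theta}(\mu)\bigr), \qquad H[q] = -\bigl\langle\log q(\theta)\bigr\rangle_{q} = \tfrac{d}{2}\log 2\pi e + \tfrac{1}{2}\log\lvert\Sigma\rvert.$$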
So anyway, I'll write this line again up here so you can see it: d over 2 log 2 pi e, plus one half log determinant of the covariance. And this means the free energy under the Laplace approximation, taking the two terms together, is: the log joint evaluated at the point mu, plus one half the trace of the covariance times the Hessian (you may remember from calculus that the matrix of second derivatives of a function of multiple dimensions is sometimes called the Hessian), plus d over 2 log 2 pi e, plus one half log determinant of Sigma. Now, the challenge is, or rather let me ask you: what are we going to do next, what do we have to do? Exactly, exactly: we need to maximize F. We have to find the optimal mu and Sigma. So basically, we take the derivative with respect to Sigma. How do we do that? Variational calculus? No, this one we're going to do straight up, exact and analytic, not variational calculus; we can do that. But we first have to look at how to take the derivative of a function with respect to a matrix. Sigma is a matrix, not just a scalar variable, so taking the derivative with respect to it will require a little work. So yes, in essence, this is what we will do; let's just look a bit more closely at what we do. Yes? Why maximize? This is because we have the energy definition with the negative sign, the same as before: this is basically a negative energy, and so is the one we had; they have the same sign as the entropy. You may remember that last time we defined F with a minus in front of the energy. That's the key to this: F is a negative energy. So, for the conditional covariance Sigma, I'll just write the goal down: find the covariance Sigma that maximizes F. It's a negative energy; that's why we're maximizing. We set the derivative to 0 and solve for Sigma.

Again we have a few preliminaries. First, let the vectors e1, e2, up to e d be an orthonormal basis of R d. The next preliminary is the partial derivative, with respect to Sigma, of the trace of Sigma times A; this will help us later, we'll use the result in the actual calculation. And then there will be a second preliminary after that. Yes? Exactly, we're going to use that, exactly that; it will simplify everything. In the first step we can simply say, without loss of generality, that Sigma is diagonal with respect to this basis; that's exactly what we're going to do. So, the trace of Sigma A: careful, this is a summation sign, not another Sigma, and the indices are k and l, otherwise I'll get confused later. It is the sum over k and l of the elements Sigma k l times the elements A l k; that's the definition of the trace. And now we have to unpack what this partial derivative with respect to Sigma means. It means the sum over i and j of the partial derivatives with respect to the elements Sigma i j (one element of Sigma, the i j element), and inside the derivative we have what we had before, the sum over k and l of Sigma k l times A l k. And because the result is a matrix, we attach the tensor product of the basis vectors e i and e j to each term. This gives us the sum over i, j, k and l of delta i k times delta j l (these are Kronecker deltas) times A l k, times the basis vectors e i tensor e j.
The Kronecker delta, as you all know, is equal to 1 when i is equal to k and 0 otherwise. With the deltas and the basis vectors, this simplifies to the sum over i and j of A j i times e i tensor e j; so here we have j i, and here we have i and j, exactly: this is equal to A transposed, which for the symmetric matrices we care about here is just A. So intuitively, you can calculate with matrices as you would with scalars. This is just like saying: I want to take the derivative of a times x with respect to x, and that gives you a. This is the equivalent with matrices; instead of the product you have the trace, but it is the multidimensional generalization of the same thing. So that was the preliminary with the basis vectors, and this was the second preliminary. Now a third preliminary: what happens when we try to take the derivative of the log determinant of Sigma with respect to Sigma? We're going to have to think about that a little. As you already said, without loss of generality we can assume that Sigma is diagonal with respect to the basis we have, because it is symmetric and positive definite, which means it is always diagonalizable: we can always find basis vectors in which Sigma is diagonal, and we assume without loss of generality that we're working in this basis. Then the whole thing simplifies to the derivative, with respect to Sigma, of the log of the product of the diagonal elements Sigma k k. We take the product out of the logarithm and we're left with a sum over k of the logs of the diagonal elements. And again we use the same definition of this partial derivative as over there: the derivative with respect to the i j element of the sum over the log diagonal elements, times the basis vectors e i tensor e j. This gives us, with Kronecker deltas again, delta i k times delta j k, times 1 over Sigma k k, because that is the derivative of the logarithm. (I do need the sum over k in there; the deltas only let a term through when i equals k and j equals k, that is, on the diagonal, and there we insert one over the k-th diagonal element.) And what is this? It's a matrix again: the inverse of Sigma, exactly. So once again, wondrously, we see that things work in multiple dimensions the way they do in one dimension. Taking the derivative of the logarithm of x, when x is a scalar, gives you 1 over x; taking the derivative of the log determinant of Sigma with respect to Sigma gives you the inverse of Sigma. So the derivative with respect to Sigma of the free energy is one half the Hessian evaluated at mu, plus one half the inverse of Sigma. And if we set this to 0, it follows that for the Sigma we need to choose, our optimal Sigma (I'm going to give it a star again), the inverse of Sigma star is minus the Hessian evaluated at mu. So we take our log joint, we calculate its Hessian at the point where we want to have our Laplace approximation, at the point mu where we do our quadratic expansion.
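Collected in one place, the free energy under the Laplace approximation, the two matrix-derivative identities, and the stationarity condition:

$$F^{\text{Laplace}} = L(\mu) + \tfrac{1}{2}\operatorname{tr}\!\bigl(\Sigma\,L_{\theta\theta}(\mu)\bigr) + \tfrac{d}{2}\log 2\pi e + \tfrac{1}{2}\log\lvert\Sigma\rvert,$$

$$\frac{\partial}{\partial\Sigma}\operatorname{tr}(\Sigma A) = A^{\top}, \qquad \frac{\partial}{\partial\Sigma}\log\lvert\Sigma\rvert = \Sigma^{-1},$$

$$\frac{\partial F}{\partial\Sigma} = \tfrac{1}{2}\,L_{\theta\theta}(\mu) + \tfrac{1}{2}\,\Sigma^{-1} = 0 \quad\Longrightarrow\quad (\Sigma^{*})^{-1} = -L_{\theta\theta}(\mu), \qquad \Sigma^{*} = \bigl(-L_{\theta\theta}(\mu)\bigr)^{-1}.$$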
Yes? Yes, so mu: the way this is practically often done, but not always, is that if you do the pure Laplace approximation, you find the maximum of the log joint, you calculate the Hessian there, and you use the negative of it, inverted, as the covariance of your Gaussian, right here in place of the star. So yes, I'm going to lose the star in this. Do we need the inverse? We shall see. In practice, you never know whether you've found the global optimum; that's one of the problems of doing this. I mean, it's hard to get a guarantee for a global optimum. Those are important questions, and people devote their careers to finding guarantees for global optima, but in practice you almost never have a guarantee for a global optimum. Good. So this is where we stop for today. If today wasn't enough for you, tomorrow you're going to get the double dose. Anyway, try to digest this, and if any questions appear, we'll have enough time tomorrow to deal with them. So I'll see you tomorrow.
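As a compact recap of that recipe, here is a minimal numerical sketch (not from the course materials; it assumes NumPy and SciPy, and the log joint below is a made-up stand-in): find the mode of the log joint, take the Hessian there by finite differences, and use the negative inverse Hessian as the covariance.

```python
# Laplace approximation in d dimensions: mu = argmax of the log joint,
# Sigma = inverse of minus the Hessian of the log joint at mu.
# Assumes NumPy and SciPy; log_joint is a made-up stand-in, not a course model.
import numpy as np
from scipy import optimize

def log_joint(theta):
    # Stand-in for log p(y, theta | m): any twice-differentiable, peaked function.
    return -0.5 * np.sum((theta - np.array([1.0, -2.0])) ** 2) - 0.1 * theta[0] ** 4

def hessian(f, x, h=1e-4):
    # Central finite-difference Hessian of a scalar function f at the point x.
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = np.eye(d)[i] * h, np.eye(d)[j] * h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

# Step 1: mu maximizes the log joint (so we minimize its negative).
mu = optimize.minimize(lambda t: -log_joint(t), x0=np.zeros(2)).x

# Step 2: Sigma is the negative inverse Hessian at mu.
Sigma = np.linalg.inv(-hessian(log_joint, mu))

print(mu)
print(Sigma)
```

The finite-difference Hessian just keeps the sketch self-contained; with an analytic or automatically differentiated Hessian the recipe is the same. And, as discussed, nothing here guarantees that the optimizer has found the global maximum.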