OK, so there are at least three interesting ways that we can decompose our variational free energy. We'll look at them in turn, but I'll show you all of them right here at the outset.

The first one is a decomposition into the expected energy and the entropy. We've already seen that: that's how we introduced the variational free energy, by replacing the posterior with q of theta everywhere it appeared. But there are other ways to decompose the variational free energy. The second way is to decompose it into the Kullback-Leibler divergence between the approximate posterior q of theta and the true posterior p of theta given y and m, minus the log model evidence; the negative log model evidence is the true free energy. The third way is to decompose it into the Kullback-Leibler divergence between the approximate posterior and the prior, minus the expectation of the log likelihood. The expected log likelihood we can call accuracy, and this divergence here we can call complexity, and then the free energy becomes a trade-off between complexity and accuracy. Which one is the second one? Yes, the second one is this one on the second line. We'll go through them in turn in the next few slides.

I don't know why the current version of PowerPoint for Mac has font issues, why this doesn't work and why the title is in the wrong font; this started when I updated PowerPoint, and they haven't come up with a solution yet. But I hope you can decipher this: this says complexity, this says accuracy, and this is the Kullback-Leibler divergence between the approximate posterior and the true posterior.

Who knows what a Kullback-Leibler divergence is? About a third of you. We'll see the definition in a few slides, but very quickly: the Kullback-Leibler divergence is a measure of how different two probability distributions are. It is not a distance measure, because mathematically a distance has to be symmetric: the distance from A to B must be the same as the distance from B to A, and that is not the case for the Kullback-Leibler divergence. You get a different number depending on whether you put q of theta first or the posterior first. But it is a measure of how dissimilar two distributions are: it is zero if they are equal and positive otherwise. Does it fulfill the triangle inequality? No, it does not.

So the first decomposition, as we saw, is expected energy minus entropy. It illustrates the mathematical analogy to statistical mechanics, and it tells us, and this is its most important property, that we can calculate the variational free energy F, because it only contains terms that we know. We know q of theta because we put q of theta in; we chose it. And we know the log joint, log p of y and theta given m, the log joint probability, here you have it again, because it is the product of the likelihood and the prior, and the model is something we chose as well. We chose the model, we chose q of theta, therefore we know all the quantities in here and we can calculate F. It is a quantity accessible to us.

For the second decomposition we need this Kullback-Leibler divergence, and this is its definition: for probability distribution 1 with respect to probability distribution 2, you take the expectation, under probability distribution 1, of the log of the ratio of the two probability distributions. And this gives you a measure of how different they are.
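As a quick numerical sanity check (not part of the lecture; the toy conjugate-Gaussian model and all numbers below are my own), the three decompositions really do give the same number. A minimal sketch in Python:

```python
# Toy conjugate-Gaussian model (my own choice, not the lecture's example):
# prior  theta ~ N(m0, s0^2),  likelihood  y | theta ~ N(theta, s^2),
# and an arbitrary Gaussian approximate posterior q(theta) = N(mq, sq^2).
import numpy as np

m0, s0 = 0.0, 2.0      # prior mean and standard deviation
s = 1.0                # observation noise standard deviation
y = 1.5                # a single observation
mq, sq = 0.7, 0.6      # an arbitrary (non-optimal) q

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians (v = variance)."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Expected log likelihood and expected log prior under q, and the entropy of q
E_loglik   = -0.5 * np.log(2 * np.pi * s**2)  - ((y - mq) ** 2 + sq**2) / (2 * s**2)
E_logprior = -0.5 * np.log(2 * np.pi * s0**2) - ((mq - m0) ** 2 + sq**2) / (2 * s0**2)
H_q        = 0.5 * np.log(2 * np.pi * np.e * sq**2)

# Exact posterior and log evidence (available in closed form for this model)
post_var  = 1.0 / (1 / s0**2 + 1 / s**2)
post_mean = post_var * (m0 / s0**2 + y / s**2)
log_evidence = -0.5 * np.log(2 * np.pi * (s0**2 + s**2)) - (y - m0) ** 2 / (2 * (s0**2 + s**2))

F1 = -(E_loglik + E_logprior) - H_q                              # expected energy minus entropy
F2 = kl_gauss(mq, sq**2, post_mean, post_var) - log_evidence     # KL to true posterior minus log evidence
F3 = kl_gauss(mq, sq**2, m0, s0**2) - E_loglik                   # complexity minus accuracy

print(F1, F2, F3)   # all three print the same value (about 2.23 here)
```

All three expressions print the same free energy; only the second one needs the exact posterior and the evidence, which is exactly why it is the version you usually cannot evaluate directly.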
Another way to read the definition: the Kullback-Leibler divergence is the expectation, under probability distribution 1, of the surprise of probability distribution 2, minus the expected surprise of probability distribution 1 under itself, which is the entropy of probability distribution 1. Yes, there are many measures that try to achieve that, but since this one is very useful and also drops out of expressions like the ones we used, it is probably the most popular. A simple way to symmetrize it would be to take the mean of the KL of P1 with respect to P2 and the KL of P2 with respect to P1; then you have a symmetric measure.

Now, the second decomposition shows that F is always greater than or equal to the true free energy for all q, because the divergence is non-negative. And this term here, including the minus sign, is the true free energy, the negative log model evidence. So we have the true free energy plus a KL divergence, and that means F has to be the greater number, unless the divergence is zero. But if the divergence is zero, then q of theta and p of theta given y and m are the same, and that is the case when our approximate posterior equals the true posterior. That is another way to see that by minimizing the difference between the approximate posterior and the true posterior, F moves closer to the true free energy, and the other way around: as F moves closer to the true free energy, this divergence here shrinks. So that is how we can infer the posterior. We wiggle q of theta around so that F goes down, and if it goes down we know it is moving closer to the true free energy, and that means our approximate posterior is moving closer to the true posterior. That's the magic of this.

Now the third decomposition of the variational free energy shows you why it is a good measure of model performance. We want to score our models; we want to find out how good they are. You may know measures for model scoring like the Akaike information criterion, the AIC, or the Bayesian information criterion, the BIC. This is the quantity that stands behind them, because they are approximations to it; we will see that in a moment. Let's take the simple term first. The accuracy is simply the expected log likelihood under q: this is an expectation value, and this is the log likelihood. So how much error do we expect when we use this model? That is the measure we get out of the second term there, the accuracy.

Now you might say: isn't it better to have as little error as possible? Shouldn't we just minimize the error and maximize the accuracy, and forget about complexity? What's the problem with that? Yes, we'll end up with a very complex model. And what exactly does that mean in practice? Overfitting, exactly. We'll be overfitting.

Let me draw a picture of overfitting. Here is our horizontal axis, and let's call the vertical axis y. Now let's make some observations here. This is an observation. This is an observation. This is an observation. This is an observation. This is an observation. This is an observation. This is an observation. OK, so all of our observations have been in this region here. Now, should I make an observation here? Let's say here. I'm going to move my piece of chalk upwards, and you shout stop when you think I have left the region where an observation is still plausible. Going to make an observation here. Going to make an observation here. I think you get the game, right? Everybody's laughing because an observation up here is vastly implausible. OK, so you get the game.
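Before coming back to the picture, here is a quick numerical look at that "magic" of minimizing F, in the same toy conjugate-Gaussian setup as the sketch above (again my own numbers, not the lecture's): we wiggle the mean of q and watch the free energy.

```python
# Wiggling q in a toy conjugate-Gaussian model: F is smallest exactly where q
# matches the true posterior, and there it equals the negative log evidence
# (the true free energy). All numbers are illustrative.
import numpy as np

m0, s0, s, y = 0.0, 2.0, 1.0, 1.5           # prior N(m0, s0^2), likelihood N(theta, s^2), one datum y
post_var  = 1.0 / (1 / s0**2 + 1 / s**2)    # exact posterior variance
post_mean = post_var * (m0 / s0**2 + y / s**2)
true_free_energy = 0.5 * np.log(2 * np.pi * (s0**2 + s**2)) + (y - m0) ** 2 / (2 * (s0**2 + s**2))

def free_energy(mq, vq):
    """F = complexity - accuracy for a Gaussian q(theta) = N(mq, vq)."""
    complexity = 0.5 * (np.log(s0**2 / vq) + (vq + (mq - m0) ** 2) / s0**2 - 1.0)
    accuracy   = -0.5 * np.log(2 * np.pi * s**2) - ((y - mq) ** 2 + vq) / (2 * s**2)
    return complexity - accuracy

for mq in (0.0, 0.5, post_mean, 1.5):
    print(f"q mean {mq:5.2f}: F = {free_energy(mq, post_var):.4f}")
print(f"true free energy (negative log evidence): {true_free_energy:.4f}")
# F never drops below the true free energy and touches it only when q equals the posterior.
```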
Now, let's forget about these invented observations, because I only invented those; we actually made these observations down here. If we have a perfectly fitting model, then, let me get another color of chalk, this perfectly fitting model might look like this. Let's fit some kind of polynomial to these points. How shall we go about it? Perhaps we could have a polynomial that comes up from here, then goes up very steeply, goes to the sky, comes back down from the sky, then does something like this, goes back up to the sky, comes back from the sky, and goes down straight to hell. Now this model would predict a point somewhere down in New Zealand. I didn't want to imply that New Zealand is hell, but the model would make wildly implausible predictions.

So there is a trade-off between fitting your data points exactly with your model and making good predictions. A plausible model here would simply be to say: well, this straight line is my model, and then we have some noise on our observations. What the red curve does is called overfitting: taking a model that fits your data exactly but doesn't make good predictions. And that is why we cannot simply look at the expected log likelihood. We also have to take complexity into account.

Complexity is the divergence between the approximate posterior q and the prior. It is a measure of how much the model has to distort itself, how much it has to update its prior in order to get to the posterior. People often speak of this as model complexity, but that is slightly misleading: it is actually the complexity of the data relative to this particular model. The same data can look very straightforward to one model and very complex to another, and the model they look complex to is the one that has to tie itself in knots in order to fit these data. So remember, this complexity is really the complexity of the data with respect to the model being scored: how complex do these data look to this model?

And of course our red model here has lots of parameters. To fit a polynomial to these points we need at least 1, 2, 3, 4, 5, 6, 7 parameters plus an intercept, so many more than for the straight line, where we can just use an intercept and a slope. Plus, if you look at the coefficients of this polynomial, they will be huge. So you would have a large complexity: you have many parameters, and you have to move them far from their prior values to get a well-fitting posterior.

This also illustrates how you minimize surprise, because that was our point of departure: the true free energy, the negative log model evidence, is the surprise of the model at the data. To be least surprised, you need to find the right balance between complexity and accuracy, because then you're making good predictions. A good model is one that makes good predictions, and that means that inferences based on currently available data have to generalize to new data, as in our little exercise here. There are two dangers that have to be balanced and avoided. One is seeing patterns where there are none; that is what we did here with the overfitted model. In overfitting you're basically fitting noise: the deviations from the green line, you can't see that it's green, but it's supposed to be green, the one going up here, those deviations are just noise. We have a generative process here, which is a straight line, and then there is some noise in the generation of the observations.
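Here is a rough numerical version of that blackboard picture (my own sketch; the straight-line process, the noise level and the polynomial degree are invented, not taken from the lecture):

```python
# Data generated from a straight line plus noise, fitted once with a straight
# line and once with an exactly interpolating degree-7 polynomial.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                        # 8 observations in a narrow region
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.2, x.size)    # generative process: line plus noise

line = np.polyfit(x, y, deg=1)   # 2 parameters: slope and intercept
poly = np.polyfit(x, y, deg=7)   # 8 parameters: passes through every point

print("line coefficients:", line)   # modest numbers, close to 2.0 and 0.5
print("poly coefficients:", poly)   # typically huge coefficients

# Extrapolate a little outside the observed region: the polynomial's prediction
# usually shoots off wildly, while the straight line stays sensible.
x_new = 1.5
print("line prediction at x=1.5:", np.polyval(line, x_new))
print("poly prediction at x=1.5:", np.polyval(poly, x_new))
```

The interpolating polynomial typically needs enormous coefficients and makes wild predictions outside the data, which is exactly the complexity story above: many parameters, pushed far from any sensible prior values.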
And if you fit an exactly fitting model, like the red curve, then you are fitting noise. You are seeing patterns that aren't actually there. The other danger, of course, is to miss patterns that are actually there, and one way to do that is to fit a model with too little complexity. That also exists. In this example, that would simply be taking a constant, and if we look at our observations, I should draw that constant further down. That would have too little complexity. But in general, if we increase our accuracy, we pay a price in complexity, and we have to find the sweet spot where the two are balanced.

The principal reason why the variational free energy is a good measure of model quality is that a difference in F is an approximation to the log Bayes factor. The Bayes factor is simply the ratio of model evidences: the probability of the data under model 1 divided by the probability of the data under model 2. It is often used to score different models. The data have to be the same; we can only compare how well different models fit the same data set, so y is the same data set here, and we compare model 1 and model 2. If we take the log of this ratio, we get log p of y given m1 minus log p of y given m2. And F approximates these log evidences: the log evidence is the exact quantity, and it is approximately minus F, because the minus sign belongs to it. So we can compare models by looking at the difference in their variational free energies; we can score them according to their variational free energy.

And the AIC, the Akaike information criterion, and the BIC, the Bayesian information criterion, are approximations to this that use the exact same accuracy term but a simple heuristic for the complexity term. In the AIC, it is the simplest possible approximation to complexity, because it simply counts the number of parameters: complexity in the AIC is the number of parameters. In the BIC, it is a function of the number of parameters and the number of observations.

A little illustration of this is here. We want to minimize surprise; that's where our best model is, here, where surprise is minimal. And we minimize surprise by looking at the difference between complexity and accuracy, and this difference is minimal here. You see how we can increase accuracy by increasing the precision of the likelihood in this simple Gaussian model, the same kind of model we saw everywhere today. So we can increase, increase, increase our accuracy, and this decreases our surprise. But at some point, if we increase accuracy further, we get an increase in complexity that makes our surprise worse again. Because we want surprise to be low; we don't want it to rise.
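Since the AIC and BIC were just described as the same accuracy term minus a simple complexity heuristic, here is a minimal sketch of that scoring applied to a line-versus-polynomial comparison (the Gaussian-noise likelihood, the data and the exact conventions are my own choices, not the lecture's):

```python
# Score a straight line against a wiggly polynomial with AIC- and BIC-style
# approximations to the log evidence: maximized log likelihood (accuracy)
# minus a complexity heuristic.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.2, x.size)    # true process: line plus noise

def max_loglik(y, y_hat):
    """Maximized Gaussian log likelihood, with the noise variance set to its
    maximum-likelihood estimate (the mean squared residual)."""
    sigma2 = np.mean((y - y_hat) ** 2)
    return -0.5 * y.size * (np.log(2 * np.pi * sigma2) + 1.0)

n = y.size
for name, deg, k in [("line", 1, 2), ("degree-7 polynomial", 7, 8)]:
    ll = max_loglik(y, np.polyval(np.polyfit(x, y, deg), x))
    aic_logev = ll - k                      # AIC heuristic: complexity = number of parameters
    bic_logev = ll - 0.5 * k * np.log(n)    # BIC heuristic: also grows with the number of observations
    print(f"{name}: accuracy {ll:7.2f}  AIC-style {aic_logev:7.2f}  BIC-style {bic_logev:7.2f}")
# Higher is better here; the textbook AIC and BIC are -2 times these numbers.
# The polynomial buys some extra accuracy, but typically not enough to pay for
# its extra parameters once the complexity penalty is subtracted.
```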
Questions at this point? Yes, yes, yes. So that is, of course, a fundamental problem that you have to solve when building a model. Let's take an example from medicine. Exercise is healthy; that's what most people think. So if I stand here and jump up and down three times, and now I do a study on many, many people who do this, I watch them over decades, and I want to find out whether this increases or decreases the probability of dying of cancer, then I have a parameter for that: a multiplicative parameter that says by how many times my probability of dying of cancer was multiplied by jumping up and down. Now let's say I find that this parameter is 0.5, so jumping up and down three times halved my chance of dying of cancer. How plausible do you think that is? Very implausible, right? So if you put a prior on that parameter, you will be careful to choose it in a way that 0.5 is effectively excluded. So let's put a prior on it; call the parameter theta, as we often do. And you say: if this decreased the chance of dying of cancer even by one percent, I would already be very surprised. So I would put the scale here at 0.99 and here at 1.01, and I would choose a prior like this, because even an effect of that size is quite improbable. But let's be open-minded and choose a prior like this. What we would certainly not do is choose a very wide prior; we would not have 0.5 here and 1.5 here.

The point I'm trying to make is that it is often not true that you know nothing about the parameters you're trying to estimate. There is a point at which you would be surprised by the result, where you would say: no, I don't think that's it. So choose your prior in a way that takes your background knowledge into account, for instance your background knowledge of human health, or of the field of physics you're working in. There is always a lot of background knowledge, and you can use it in defining your priors. And this will help you not to overfit, because that is exactly what happens when you overfit: you get huge parameter estimates.

So as far as we know, there is quite a lot of prior knowledge that is, to avoid the word hardwired, at least to a large degree evolutionarily determined, through the DNA and various other channels; the embryonic cells you grow out of contain much more than just DNA. If you look at a newborn foal, it just stands up and walks around, so it does have some knowledge of how to navigate the world from the start. And at a very fundamental level, you have lots of prior expectations. You have a prior expectation for what surrounding temperature you will find yourself in, and you will move to areas that give you that temperature. Your blood pH is kept within a very narrow range, and your body reacts immediately when it goes outside that range. You can interpret all of these in this way, and I think it is an important and extremely productive way to generate insights about how biological agents behave, by assuming they act to confirm their predictions. You can actually build little models in silico, and also little robots, that exhibit adaptive behavior by programming them to fulfill the predictions they have. So if there aren't... no, I won't say it like that. Are there any further questions?

Then I will give you an outlook on tomorrow: variational Laplace. We've had the variational free energy here, and up to now this q, our approximate posterior, wasn't fleshed out, filled with any content. All I said about q was that it is an arbitrary probability distribution. Now we're going to fill this q with content. So we have our model, we want an approximate posterior, and we also want to be able to do variational inference of the kind illustrated here, where we minimize F. And variational Laplace is one way to do that. It is shorthand for variational Bayes under the mean field approximation and the Laplace assumption. We'll see what those are, and we'll work it out in detail on the blackboard.
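Going back for a moment to the jumping example: to put a rough number on what an informative prior does there, here is a toy sketch (the log-normal form and every number in it are my own invention, not the lecture's):

```python
# An informative prior on a multiplicative risk parameter theta, centred on
# "no effect" (theta = 1) and tight enough that a halving of the risk
# (theta = 0.5) is effectively ruled out, compared with a very wide prior.
from scipy.stats import lognorm

informative = lognorm(s=0.01, scale=1.0)   # log(theta) ~ N(0, 0.01^2): effects beyond ~1% implausible
wide        = lognorm(s=0.5,  scale=1.0)   # a very wide prior that happily covers 0.5 and 1.5

for theta in (1.0, 0.99, 0.5):
    print(f"theta = {theta}: informative prior density {informative.pdf(theta):.3g}, "
          f"wide prior density {wide.pdf(theta):.3g}")
# Under the informative prior, theta = 0.5 has essentially zero density, so the
# posterior cannot wander there unless the data are overwhelming; the wide prior
# leaves the door open to implausibly large effects.
```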
Now, the mean field approximation is the assumption that the true posterior can be approximated by a q of theta that factorizes across subsets of theta. This is something that was originally introduced in physics, and that's why it's called the mean field approximation. It is the following idea: if you have a many-body problem and you're looking at the field generated by, let's say, three particles here, then as soon as you change something about this particle, that changes how this particle influences this one; and if you change something about that one, this in turn affects how this particle influences this particle. In the mean field approximation, you neglect these complications and just look at the mean field that this particle sees from all the other particles. You play around with the state of this particle, neglecting how this affects all the others, until you're done; then you update everything, and then you go around from particle to particle.

Here, we do the same with theta. Theta is our collection of parameters or states in the model, a very long vector, and we partition it in some way into chunks theta 1, theta 2, up to theta n. And we assume we can factorize our q of theta into a q1 on theta 1, times a q2 on theta 2, and so on. This is an assumption; we lose accuracy when we make it. The Laplace assumption is an additional assumption, on top of this factorization, that the posterior is Gaussian. In particular, q of theta will be Gaussian if each of the qi is Gaussian. So we will divide our approximate posterior q of theta into lots of Gaussians multiplied with each other, and this will be our approximation to the true posterior. We will see what that gets us, and we will apply variational calculus to this. I don't think you need the reminder at this point: we minimize F to get close to the true free energy and to find this optimal q.

Now, in the space of all these approximate q's, which are not going to be exact, there will be one optimal one. You have all the q's that fulfill these criteria: they factorize, and they are all products of Gaussians. Out of all these q's, one will be the best, in the sense of being closest to the true posterior. We're going to call it q star. And the question will be: how do we find q star, the q among all the allowable q's that is optimal? We will do variational calculus. We will take the functional derivative of F with respect to the qi, and we will want it to be zero, so that we know we are at an extremum in function space with respect to minimizing F. Solving this equation for q star, we find a solution; this is what we'll do next time on the blackboard. It turns out that the solution is proportional to the exponential of the variational energy, and this here is the definition of the variational energy. We will look at this in detail next time, which will be tomorrow. Tomorrow afternoon, we will put this to use, applied to actual models that can do inference on what's happening in real experiments. Any questions? I'm already keen to go to lunch, so I'll see you tomorrow. Thank you.
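As a small preview of what those mean field updates look like in practice, here is a standard textbook-style sketch (a two-dimensional Gaussian posterior with my own numbers, not the lecture's model): q of theta factorizes into two Gaussian factors whose means are updated one at a time, each in the "mean field" of the other.

```python
# Mean-field variational inference for a two-dimensional Gaussian posterior:
# q(theta) = q1(theta1) * q2(theta2), each factor Gaussian, updated in turn.
import numpy as np

# Suppose the true posterior is N(mu, Sigma); Lambda is its precision matrix.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lambda = np.linalg.inv(Sigma)

# Optimal mean-field factors are q_i = N(m_i, 1 / Lambda_ii); only the means
# need iterating. Each update holds the other factor's expectation fixed,
# exactly like updating one particle in the mean field of the others.
m = np.zeros(2)
for _ in range(20):
    m[0] = mu[0] - (Lambda[0, 1] / Lambda[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lambda[1, 0] / Lambda[1, 1]) * (m[0] - mu[0])

print("mean-field means        :", m)                    # converge to the true posterior means
print("mean-field variances    :", 1 / np.diag(Lambda))  # narrower than the true marginals
print("true marginal variances :", np.diag(Sigma))
```

The factor means converge to the true posterior means, but the factor variances come out narrower than the true marginal variances; this underestimation of uncertainty is a well-known price of the mean field factorization.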