OK, welcome back. So up to now, we've been looking at Bayes' theorem and saying: well, we have our likelihood here, our prior here, and we use Bayes' theorem to derive the posterior. Now we will attack the same problem in a totally different way. We're not going to try to solve Bayes' theorem directly. We will do something else, and in the end we'll see that it's exactly equivalent to solving Bayes' theorem. What we'll do is minimize surprise, or free energy.

As we saw, and maybe I didn't make this clear enough, Bayesian inference makes optimal use of the available information. That is what it means to do Bayesian inference: using all available information. If you do something other than Bayesian inference, that always amounts to throwing information away; you're disregarding information that you have. Bayesian inference is the only way to take account of all the information that is there. So it makes sense that Bayesian inference leads to a minimization of surprise at new input: because you've taken account of all the information you have to predict new input, you will be minimally surprised by the new input that you get. We will take this as our point of departure, and we shall see that minimizing surprise is equivalent to doing Bayesian inference, and that it is often simpler to do Bayesian inference implicitly, by minimizing surprise.

So what is surprise? How surprising an event is felt to be depends on its probability. Surprise, like belief, is an everyday expression that seems psychological and emotional, but now we're going to give it a precise mathematical definition. It makes intuitive sense to take the negative logarithm of the probability of the observation y given the model m as a measure of surprise. If the probability of observation y given model m is 1, the outcome was certain: we knew y was going to happen anyway, so there is no surprise at all, and indeed the negative logarithm of 1 is 0. However, if under our model m the observation y is impossible, its probability is 0, and we observe it anyway, then under our model the surprise is infinite: the negative logarithm of 0 is infinity. In between impossible events and certain events, surprise is greater than 0, and it increases for less probable observations. That is surprise in a graph: the probability of the event y is on the horizontal axis, and the surprise at seeing such an event given your model m is on the vertical axis.

Now, another quantity we need to look at in order to understand all this is entropy. [Inaudible question from the audience.] Yes, by the model we mean a prior and a likelihood that combine to give a generative model. So in the example we had, the prior p of theta was Gaussian with mean mu_theta and precision pi_theta, and the likelihood of the observation y given theta was Gaussian around the true value of theta with observation precision pi_epsilon. Take these together and what you get is your generative model. And by the product rule of probability theory, we know that the product of likelihood and prior is the joint probability of y and theta. Yes, exactly: this is m. And under this model, you have a particular probability of making observation y. Further questions?
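Before moving on to entropy, here is the definition of surprise as a tiny sketch (hypothetical Python, not part of the slides): surprise is just the negative log of the probability that the model assigns to the observation.

```python
import numpy as np

def surprise(p_y_given_m):
    """Surprise (in nats) at an observation that has probability p under model m."""
    return -np.log(p_y_given_m)

print(surprise(1.0))    # 0.0     -> a certain event carries no surprise
print(surprise(0.5))    # ~0.693
print(surprise(1e-6))   # ~13.8   -> rare events are very surprising
# surprise(0.0) would be infinite: an "impossible" event that happens anyway
```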
Yes, that is the definition: a model is always a generative model consisting of a prior and a likelihood. Yes, why do we use the log? Can I defer the answer until later? Actually, I can give it to you now: negative log probabilities correspond to energies in physics, as we will see. You can think of the negative log joint probability as the internal energy of the system, and then you can use the same mathematical tools that you use in physics. Another very practical reason to like log probabilities is that they lead to fewer computational problems: probabilities are often very small, in the sense of being positive and close to zero, and if you take their logarithm they are numerically much easier to deal with; your computer likes them much more than very small positive numbers. And if you multiply many of them and take the log, you get a sum, which is also much easier. So there are many reasons to use log probabilities.

Now, entropy. A concept closely related to surprise is entropy: the more ignorant we are about a quantity, the greater the surprise we may expect when observing it. So this is about expected surprise, and that is what we call entropy in information theory, because it is mathematically structured exactly like entropy in statistical physics. It's the expectation of the surprise: you've got the negative logarithm inside, and around it simply an expectation. (Who is not familiar with how you calculate an expectation value in probability? Everybody is familiar with that.) So the entropy is simply the expectation value of the surprise. Entropy is a measure of ignorance, and its name is due to a mathematically analogous quantity in thermodynamics.

As a simple example, let's look at a coin toss. There are two possible outcomes, so y, the observation, is heads or tails. Since the outcomes are discrete and binary, we use a sum instead of an integral, and the binary logarithm, the logarithm to base 2, to define the entropy. The entropy is then the sum over all outcomes y, here just heads and tails, of the probability of the outcome times the log of that probability, with a minus sign in front. Say we have a fair coin, so the probability of heads and the probability of tails both equal one half; that gives an entropy of 1. Let's do it on the blackboard: the entropy is minus the probability of heads times the binary logarithm of the probability of heads, minus the probability of tails times the binary logarithm of the probability of tails. For a fair coin this is minus one half times the binary logarithm of one half, minus one half times the binary logarithm of one half. The only question remaining is: what is the binary logarithm of one half? It is minus 1, because 2 to the minus 1 is one half. So the entropy is 1 bit. Now, if you put in other probabilities, like nine tenths for heads and one tenth for tails, you get a different entropy, and this entropy will be lower. In fact the entropy is maximal when the coin is fair, because then you are maximally uncertain about what will happen. As soon as the probability moves away from one half, as soon as the coin isn't fair, you are less uncertain about what will happen; your expected surprise decreases.
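The coin-toss calculation from the blackboard, as a minimal sketch (hypothetical Python; the base-2 logarithm means the answer comes out in bits):

```python
import numpy as np

def entropy_bits(probs):
    """Expected surprise: H = -sum_i p_i * log2(p_i), skipping zero-probability outcomes."""
    probs = np.asarray(probs)
    nonzero = probs[probs > 0]
    return -np.sum(nonzero * np.log2(nonzero))

print(entropy_bits([0.5, 0.5]))   # 1.0 bit     -> fair coin, maximal ignorance
print(entropy_bits([0.9, 0.1]))   # ~0.469 bits -> unfair coin, less expected surprise
```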
And that's how you can make money with an unfair coin: you know in advance what is more likely to happen; your entropy is lower. So that is the concept of entropy: expected surprise, a measure of ignorance.

Now, free energy. We have to keep apart a number of kinds of free energy, and we will see how free energy relates to entropy, to surprise, and so on. In information theory, the free energy is the surprise given a model m and a particular set of observations y; that is the definition. People often use the letter F; I'm using the letter A. One banal reason is that this is what chemists do; there is a deeper reason that I won't go into today. Yes, because this is a probability, you can take its logarithm. Is it an energy? Yes, in some sense, and here we don't need to worry about dimensions, because we are dealing with dimensionless probabilities.

So there are at least three kinds of free energy we need to keep apart: thermodynamic free energy, informational free energy, and variational free energy, which we'll look at later. Free energy in thermodynamics and statistical mechanics is just a little repetition; you probably know this from high school and from your physics studies. We have the Helmholtz free energy, which is the internal energy minus temperature times entropy, A = U - TS. There is also the Gibbs free energy, which involves pressure and volume. And, very importantly, in statistical mechanics the Boltzmann distribution describes the relation between energy and probability. If the particles in a system can be in states s1, s2, s3, and so on, corresponding to energy levels E1, E2, E3, and so on, then the probability p_i of finding a particle in state s_i is given by the Boltzmann distribution: the exponential of minus E_i over kT, divided by Z. (How many of you do not have a physics background? So most of you already knew this.) Here T is the temperature, k is Boltzmann's constant, and Z is called the partition function; it is simply the sum over all states of these exponentials of minus energy divided by kT.

Taking the logarithm on both sides and rearranging gives us minus kT log Z equals E_i plus kT log p_i; you will get the slides, so you can check this at home. Taking the expectation value on both sides, we get: minus kT log Z equals the expected energy minus temperature times entropy. The definition of the entropy here is as in information theory, except that in physics we have Boltzmann's constant in front: it is minus k times the sum of probabilities times log probabilities. So, in analogy to the definition of the Helmholtz free energy in thermodynamics, A = U - TS, this motivates the definition of the free energy in statistical mechanics as minus kT times the log of the partition function. This is the free energy in statistical mechanics, that was the free energy in thermodynamics, and the link between the two is the Boltzmann distribution.

Now we return to information theory. On the slide before, we had classical thermodynamics with the Helmholtz free energy, and statistical mechanics with the Boltzmann distribution. In information theory there are no more k's, just dimensionless probabilities. We take the definition of informational free energy and perform a series of algebraic operations on it. We start from the surprise at seeing observation y given m, the negative logarithm of p of y given m; that was our definition. Then we just put an integral around it.
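As a small numerical sketch of the statistical-mechanics side (hypothetical energy levels, with k and T set to 1; not something from the slides), one can check that minus kT log Z really does equal the expected energy minus temperature times entropy:

```python
import numpy as np

kT = 1.0                          # Boltzmann constant times temperature (set to 1 here)
E = np.array([0.0, 1.0, 2.5])     # hypothetical energy levels E_1, E_2, E_3

Z = np.sum(np.exp(-E / kT))       # partition function
p = np.exp(-E / kT) / Z           # Boltzmann probabilities of the states

A = -kT * np.log(Z)               # free energy, A = -kT log Z
U = np.sum(p * E)                 # expected (internal) energy
S = -np.sum(p * np.log(p))        # entropy, in units where k = 1

print(A, U - kT * S)              # the two numbers agree: A = U - T*S
```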
We take the negative logarithm of p of y given m and put it inside an integral over the posterior distribution of theta given y and m. Because the posterior is a probability distribution, it integrates to 1, so this equality holds; it is just a more complicated way of writing the same thing. But now this allows us to use Bayes' theorem. Bayes' theorem tells you that the posterior is the joint probability, that is, the product of likelihood and prior, divided by what we call the model evidence. We rearrange that to write the model evidence as the joint divided by the posterior, and that is where the second equality comes from. Then we use simple algebra to take the logarithm of this ratio apart, and you can see that the first term can be interpreted as an expected energy, while the second has exactly the structure of an entropy: it is the entropy of the posterior distribution. So the free energy is this expected energy minus this entropy, and that gives us an information-theoretic analogue of the definition of the Helmholtz free energy in statistical mechanics.

The main comparison is this. In statistical mechanics, A is minus kT log Z, where Z is the partition function. In information theory, A is defined as the negative logarithm of the probability of y given m, and that probability is the integral of the joint distribution over y and theta, where all the thetas have been integrated out. These two are exactly the same if we set kT to 1 and interpret the negative logarithm of the joint probability distribution as an energy. That is what I already said in advance: if you use the negative logarithm of the joint probability of y and theta given m as your definition of energy, then from the information-theoretic definition you get the statistical-mechanical one. This is the correspondence of negative log probabilities to energies, the deep and intimate connection between physics and information theory.

But there is a problem: this entropy contains the posterior. The posterior is what we get when we apply Bayes' theorem, and that is exactly the problem we are trying to solve. So this has not saved us from solving Bayes' theorem; we still have the posterior in here, and we cannot apply any of this before we know the posterior. The trick, and in some ways it is a spectacular trick, is to say: let's pretend we know the posterior. We take the true informational free energy and replace it with something we call the variational free energy, where everywhere the posterior appears we put in an arbitrary probability distribution q of theta, just any probability distribution. The only conditions are that it is a non-negative function and that it integrates to 1. We take any such function and insert it everywhere we had the posterior we don't know, and we call the result the variational free energy. The reason for the name is that we are now going to wiggle q of theta around and watch how the variational free energy, call it A_v, changes.
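In symbols, the quantities just described look roughly like this (reconstructed here from the verbal description, so the notation is a plain rendering rather than a copy of the slides):

```latex
\begin{align*}
A &= -\ln p(y \mid m)
   = -\int p(\theta \mid y, m)\,\ln p(y \mid m)\, d\theta
   && \text{(the posterior integrates to 1)}\\
  &= -\int p(\theta \mid y, m)\,\ln \frac{p(y, \theta \mid m)}{p(\theta \mid y, m)}\, d\theta
   && \text{(Bayes: } p(y \mid m) = p(y, \theta \mid m)\,/\,p(\theta \mid y, m)\text{)}\\
  &= \underbrace{-\int p(\theta \mid y, m)\,\ln p(y, \theta \mid m)\, d\theta}_{\text{expected energy}}
     \;-\;
     \underbrace{\Bigl(-\int p(\theta \mid y, m)\,\ln p(\theta \mid y, m)\, d\theta\Bigr)}_{\text{entropy of the posterior}}\\[4pt]
A_v[q] &= -\int q(\theta)\,\ln p(y, \theta \mid m)\, d\theta
          \;-\;
          \Bigl(-\int q(\theta)\,\ln q(\theta)\, d\theta\Bigr)
   && \text{(posterior replaced by an arbitrary } q\text{)}
\end{align*}
```

The first term plays the role of the expected energy, and the second is an entropy, exactly as in A = U - TS with kT set to 1.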
Because it turns out, and this is really a piece of mathematical magic, something that should leave you gaping open-mouthed: whatever q of theta we put in, whatever crazy probability distribution, the variational free energy A_v will always be greater than or equal to A. If we take the true free energy and simply replace all the posteriors with q of theta, we end up with an A_v that is larger than, or at best equal to, A. So we can put in any kind of q of theta and then change it, vary it around, and whenever A_v goes down, we know we are moving in the right direction, towards the true answer, the posterior we are looking for. The branch of mathematics that describes how this works is variational calculus, and that is why we call this the variational free energy.

So without having to know anything about A, we can vary q of theta such that it minimizes A_v. We don't know the true A, and we don't know the true posterior, but we do know q, because we put it in ourselves, and because all the quantities in A_v are known to us, we can calculate it. Without ever knowing A, we can vary q and minimize A_v, and we know it is getting lower, getting closer to A. Minimizing A_v with respect to q of theta therefore leads to an approximation of the true posterior by q of theta, because of the theorem above, and because A_v equals A exactly when q of theta is the true posterior. So when we have minimized A_v until it equals A, we have the true posterior. We can use variational calculus to find q of theta without ever having to know the true posterior itself; we don't have to solve Bayes' theorem. This is also how the brain can build, update, and compare models of the world without ever seeing behind the scenes of its sensory input: you never know the truth behind it all, but you can approximate it with procedures like this. Here is the proof that A_v is always greater than or equal to A; it is a few simple lines, and the only thing we need is Jensen's inequality, which is itself easy to prove; you can look it up on Wikipedia.

OK, I think this is a good point to end for today. Is it right that we were scheduled to end at 10:45? OK, good. So I'll see you next week, but are there any final questions before I run off? Yes, I will send the slides to the secretary and she will distribute them. Yes, back there? More loudly, please, and a bit more quiet here, please. [Question about what the free energy is the surprise of.] It is the surprise at the model evidence, at this quantity here. In general, a surprise is just the negative log of any kind of probability, but the free energy is the surprise associated with the model evidence, the quantity where all the thetas have been integrated out. So it is a special case of surprise. Yes? In physics, entropy increases in systems that are isolated; the universe as a whole tends towards maximum entropy, and so on. But there is an equivalent of the second law of thermodynamics for other kinds of systems, and I have to be careful not to make a mistake here, you can correct me next time if I do, but I think it is systems where the temperature and the volume are held constant: there the second law means that the free energy is minimized. So it is another way to state the second law of thermodynamics, for a different kind of system than the kind where entropy increases. OK, I'll see you next week.
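To make the inequality at the heart of today's argument concrete, here is a small numerical sketch (an entirely hypothetical two-hypothesis example in Python, not something shown in the lecture): any q gives a variational free energy at or above the true free energy, with equality only when q is the true posterior.

```python
import numpy as np

# Hypothetical discrete example: a parameter theta with two possible values.
prior      = np.array([0.5, 0.5])        # p(theta | m)
likelihood = np.array([0.8, 0.3])        # p(y | theta, m) for the observed y
joint      = prior * likelihood          # p(y, theta | m)

evidence  = joint.sum()                  # p(y | m), the model evidence
A         = -np.log(evidence)            # true informational free energy = surprise
posterior = joint / evidence             # what Bayes' theorem would give us

def A_v(q):
    """Variational free energy: expected energy under q minus the entropy of q."""
    q = np.asarray(q)
    return np.sum(q * (-np.log(joint))) - np.sum(-q * np.log(q))

print(A)                    # the target we pretend not to know
print(A_v([0.5, 0.5]))      # an arbitrary q: larger than A
print(A_v([0.9, 0.1]))      # another q: still >= A
print(A_v(posterior))       # q equal to the true posterior: equals A exactly
```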