First, I want to give you the opportunity to ask questions that may have come up in the meantime, and then I'll give you a quick recap of what we went through last time. So first, any questions that you have right now? Yes? No, there will be a certain dose of blackboard beginning today: there are just going to be three slides on variational Bayes, and the rest is going to be on the blackboard. Were you longing for the blackboard to play a role? Okay, good, you'll get something in that department today. Other questions? Yes?

(And please don't forget to sign the attendance list.)

A model is something more specific than a probability distribution. A probability distribution is simply a positive function that integrates to one. If you have a model, you need a likelihood, which is basically a model of how you make observations, and you need a prior on the parameters governing the observations. You can then continue this hierarchically: you can have priors on your priors, which are hyperpriors, then priors on your hyperpriors, and so on. That's how you build a hierarchical model. But a model is more than just a probability distribution. As we saw last time, you need two components in your model, the likelihood and the prior. If you combine likelihood and prior, you have your full generative model. That's what you need to generate data from your model, and it is also the joint probability distribution over your observations and parameters.

So one way to put it is: your model is the joint probability distribution over observations and parameters; it's a very specific kind of probability distribution. You can decompose it into the likelihood, which is your observations conditioned on the parameters or hidden states of your environment, times the prior on those hidden states or parameters. The difference between hidden states and parameters is simply that parameters are constant while hidden states change with time. If you have a model that looks at time series, then your states are going to change from time point to time point, whereas your parameters stay constant. Was that what you were getting at with the question? Good. Further questions? Okay, good.

So this is a good point to start the quick recap of last time. This is the basic way we do Bayesian inference: we use Bayes' theorem to go from one conditional probability, p(a given b), to the reverse, p(b given a). That's what Bayes' theorem does for you, and all you need for it is the rules of probability theory. It is what people do when they run experiments. They have a model of the system they're studying, in this example the brain. They have observations they make on the brain, in this case EEG or MRI measurements, and they have a model of how the states of the brain produce measurements. On the basis of that model, you can infer back onto what the state of the brain was as you were making your measurements. The same thing applies to how the brain itself operates, or, if you want to build a robot that works like a brain, how that robot would operate in relation to the outside world. You need to have a model of that outside world in order to navigate it successfully. You infer the state of the world, you solve the inverse problem, on the basis of the model you have of the outside world: of what states the outside world will evoke in you via your sensory input.
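(Written out, as a compact summary of what was just said, with M standing for the model: the generative model is the joint distribution, which factorizes into likelihood times prior, and Bayes' theorem performs the inversion from p(a given b) to p(b given a).)

\[ p(y, \theta \mid M) = p(y \mid \theta, M)\, p(\theta \mid M), \qquad p(b \mid a) = \frac{p(a \mid b)\, p(b)}{p(a)} \]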
Then we saw this: every good regulator of a system must be a model of that system. So there's no way around having a model, or even, in some sense, being a model of the environment you're trying to regulate.

This raises the question: how do we update beliefs? How does the brain do that? Bayesian inference is the optimal way to do it, because Bayesian inference does not throw information away. That's the key feature of Bayesian inference: if you do inference in any other way, that amounts to throwing part of the information you have away. If you do Bayesian inference, you take account of everything you know. So not doing Bayesian inference is, in some sense, illogical, irrational. The problem is that Bayesian inference is very complicated and can carry a very high computational burden, so we need to use approximations when we do it. When a biological system like the brain does it, it will also have to resort to some sort of approximation, because if it tries to do everything exactly, it will take longer to finish the calculation than it takes its adversaries to eat it. Organisms that take too long calculating their posteriors are not going to be around for very long.

So we looked at updates in a simple Gaussian model. We had a Gaussian prior and a Gaussian likelihood, and by Bayes' theorem this gives us a Gaussian posterior. The interesting thing then is how we pass from the sufficient statistics of the prior and the likelihood to the new sufficient statistics of our posterior.

We talked a bit about sufficient statistics; there was a question during the break. Sufficient statistics are simply the parameters that fully describe your probability distribution. If you have a parametric distribution, a distribution that can be described using a few parameters, then those few parameters are its sufficient statistics. For instance, if you have a very simple Bernoulli distribution, as in a coin toss, then you only need one sufficient statistic. If you have a fair coin, that parameter is going to be 0.5, because heads and tails are equally likely. If you don't have a fair coin, your parameter may be 0.6, representing a probability of 60% of getting heads. If you have a Gaussian distribution, you need two sufficient statistics: the mean and the variance.

Here we're not working in terms of variance but in terms of precision, which is the inverse of the variance, because that gives us a simple update equation. The posterior precision is simply the prior precision plus the likelihood precision, the observation precision. And the posterior mean is an update to the prior mean by a precision-weighted prediction error. Your observation y will differ by something from your prediction, which is simply your prior mean mu_theta; that difference is your prediction error. You weight this prediction error, in some sense, and add it to your prediction, to your prior mean, and the weight is the observation precision divided by the posterior precision. This gives you your new mean. So you have a prediction, you have a prediction error, and you have a weight, and this weight you can interpret as a learning rate. That makes sense: on top of being correct and mathematically derived, it's also interpretable.
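(In symbols, the two update equations just described, using subscripts as my shorthand for the quantities named above: posterior precision is prior plus observation precision, and the posterior mean is the prior mean plus the precision-weighted prediction error.)

\[ \pi_{\theta \mid y} = \pi_\theta + \pi_\varepsilon, \qquad \mu_{\theta \mid y} = \mu_\theta + \frac{\pi_\varepsilon}{\pi_{\theta \mid y}}\,(y - \mu_\theta) \]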
That's a very nice feature, because we can say: how much we learn here, in the numerator, works in favor of the prediction error, and how much we already know sits in the denominator, so it works against the prediction error. The more we already know, the less weight the prediction error has; the more precise our observation is, the more the prediction error counts.

Yes? In this little model it's assumed to be known. It is the precision of your measurement instrument, basically. You can also simply make observations and infer it from those observations; we shall look at that in what is to come. Here we're basically looking at updates to the mean and the precision of a Gaussian. If the precision of your observations is also unknown, you need a slightly more complicated distribution, like a Gaussian-gamma distribution, and you have to update more sufficient statistics in order to perform inference on it, but it is possible. Here we just assume that we know pi_epsilon; pi_epsilon is a constant here, assumed known, but it can be inferred.

Yes, exactly: the more information you have, the more precise your posterior will become. This is one of the fundamental problems we're going to deal with. No, I'm not going to give the answer away yet, so I'll let you find the problem with this; there is a problem with this.

Yes, the question is about the two precisions. We have the precision in the likelihood, pi_epsilon, and the precision of the prior, pi_theta. The mean of the prior is mu_theta, and the width of the prior is characterized by pi_theta: to be precise, what you get here is the standard deviation, which is the square root of 1 over pi_theta. So the larger pi_theta is, the larger your precision is and the narrower your prior is. This is before we make an observation.

Now we make an observation. The observation y is, let's say, here. But the thing is, you can only make your observation with finite precision, because there will always be noise in any observation you make. Let's say your observation is quite precise, so you have a high precision here; then the width of this Gaussian will be 1 over the square root of pi_epsilon. It basically reflects the precision of your observation. Now you update by taking the difference here: this is the prediction error, y minus mu_theta. It's positive, because y is greater than mu_theta. You use this prediction error to update mu_theta, so you add it to mu_theta. As we've seen, the weight is positive; it's a fraction of two positive quantities, because precisions have to be positive. So we know that our posterior will be (let's use blue, or maybe yellow, that's better visible) to the right of the prior. That makes sense, because if your prior is here and your observation is here, you're going to want to adjust your belief in the direction of the observation.

So a question for you; I'm going to give you two options. Which one is the more sensible location for the posterior mean, the updated mu_theta given y: is it going to be somewhere around A, or somewhere around B? Who votes for A? A considerable number. Who votes for B? About equal. So, somebody who voted for A, give me a reason, as you can see it from the equation. You've changed your mind to B? B is the correct answer. Can anybody state very simply why? Yes? That's exactly what I wanted to hear.
That's how you can formulate it in very few words: it's going to be closer to y if the observation was more precise than the prior, closer to mu_theta if the prior was more precise than the observation, and exactly in the middle if they're equally precise. Just look at the equation: if these two precisions are equal, we get exactly one half here, which means we add one half of the prediction error to our prior, and we end up in the middle. In our case we're going to be near here, because we had a much more precise observation than prior.

Now, how precisely should I draw this one? Who would like it like this? Hands up, at least one hand. I should give you an alternative before I ask: who would prefer it more like this? Sorry, I didn't get that. Those are the two alternatives, and one of them is more sensible than the other. And always remember, these are Gaussians; I'm trying to draw Gaussians, all of them are supposed to be Gaussians, I'm just varying their precision. The sharper one. Why? Yes, exactly: we made an observation, so our precision has increased. The precision of the posterior is the sum of the precisions of the prior and the likelihood, so this is it, the more precise one. Yes, exactly, we know more after making the observation, so our posterior will be more precise. There are tricks, ways in which you can get broader distributions after making a new observation, but not in this simple, straightforward model. In this model, and in models like it, simply making observations increases your information; it increases the precision of your beliefs. It's a very fundamental example, and I'm glad we're discussing it in depth. If you have any remaining questions about this, please ask.

Yes? Well, okay, you have to account for the failings of your observations in your likelihood, in your observation model. If you assume there may be a systematic error: well, if you don't have a way to assess your systematic error, all you can do is look at the variance of your measurements. But if you do have a way to estimate your systematic error, then you should introduce a parameter into your likelihood describing that systematic error and estimate it along with the other parameters in the model. For instance, if you're looking at a time series with seasonal ups and downs, you're going to include a parameter for that. Or you have a time lag, so you make an observation now and you believe it tells you something about the state of the world three months back: you introduce a parameter for that. So you do worry about it, and you worry about it in a very specific way, and we will also get to that.

What do you do with that? Well, that's your new prediction, and then you make your next observation. Yes, exactly: you use this sequentially to update your beliefs.
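(A minimal numeric sketch of this sequential precision-weighted updating; all numbers here are assumed purely for illustration, not values from the lecture.)

```python
# Sequential Bayesian updating in the simple Gaussian model discussed above.
mu, pi = 0.0, 1.0        # prior mean and prior precision (assumed values)
pi_eps = 4.0             # observation precision, assumed known in this model

for y in [2.0, 1.0]:     # two successive observations (assumed values)
    pi_post = pi + pi_eps                # posterior precision: prior + observation precision
    lr      = pi_eps / pi_post           # learning rate weighting the prediction error
    mu_post = mu + lr * (y - mu)         # prior mean plus weighted prediction error
    print(f"y={y}: mean {mu:.2f} -> {mu_post:.2f}, "
          f"precision {pi:.1f} -> {pi_post:.1f}, learning rate {lr:.2f}")
    mu, pi = mu_post, pi_post            # yesterday's posterior is today's prior
```

Note how the posterior mean lands between prior mean and observation, closer to whichever is more precise, and how the precision only ever grows, so the learning rate shrinks as more observations come in.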
Yes, you can construct such models in various ways. You can do Bayesian inference, and this is the first thing we'll look at, not just on your inputs but also on the variance of your inputs. Then, when you see that you've underestimated the variance of your inputs, you increase the width, at least of your predictive distribution. So you run inference on this and inference on that, and then you use what you know about this to make predictions about that. We'll see how that works.

Another thing you can do is to violate the rules of Bayesian inference. In the analogy between statistical mechanics and information theory, I didn't dwell on the point, but you may have noticed that temperature was set to one: you always had internal energy minus temperature times entropy, and when we went to information theory the temperature was gone, so effectively it was set to one. Now, if you mess around with that, you're going to violate the rules of Bayesian inference, but you can get situations, when you introduce a temperature greater than one, that give you posterior distributions broader than the exact Bayesian ones. We're not going to do that here.

Yes? So, your data generation process: implicitly, in what you've said, you had a generative model of the data that we're fitting with this model, and if the two don't match, then this kind of inference will be inadequate, not appropriate. You always want to have a model that corresponds to how the data are really generated. You were proposing to generate data using a process that is totally at variance with the likelihood we assumed here, and then of course this model will fail. You will calculate some kind of mean, but as you already see, it doesn't work for the kind of data generation process that you assume to be underlying the data you're fitting. So if you have reason to believe that a particular process is responsible for generating your data, you should always use that as your model. And if you're not sure, you can always use model comparison to look at different possibilities for how these data could have come about and what the best explanation for them is. We will look at some simple ways of comparing different models, because in general you don't want to have just one model: you want several models that you can compare, to see which one works best.

Okay, then we'll just continue the little recap. Something we shall get to again in more detail: this can be generalized to all exponential families of distributions; we'll do that again. Then, just by way of letting you know: a hierarchical message passing system is assumed by the most advanced theories of cortical functioning to be responsible for the brain's predictions. How the brain updates its predictions is, according to these theories, a hierarchical message passing scheme in which an error signal is passed upwards and a prediction signal is passed downwards.

Then we changed tack. We said we're not going to calculate the posterior directly, as we did here. There we had a very simple model, we could just do the analytic calculation, we could solve the integral in Bayes' theorem, and then we had an analytic solution for mu_theta given y and pi_theta given y, and that characterized our posterior.
In most situations this is not possible, and in almost all interesting situations it's not possible, so we have to find clever ways around that: clever ways to calculate our posterior, or to approximate it, in situations where we cannot find an analytic solution. One that you've certainly heard of is sampling, and another, the one we take here, is to come at the problem from another direction and minimize surprise.

So we looked at what surprise is, and surprise is just this: you take the probability of an event given a model M and take the negative log of it. That's your surprise. An event that is certain is not surprising at all, your surprise is zero; an event that is impossible under your model is, at least under that model, infinitely surprising.

Then we looked at entropy. Entropy is the expectation value of surprise, so it's how much you expect to be surprised by what happens. It's again the negative log probability of an observation, and then the expectation value of that. We looked at this with a coin toss example, and we saw that the entropy is highest when the coin is fair, when heads and tails are equally probable. As soon as the coin is unfair, the entropy of the outcomes goes down, because the outcomes are less surprising, or rather, you can expect to be less surprised. From an entropy of one for a fair coin we go down to an entropy of about 0.47 for a heavily unfair coin with a probability of heads of nine tenths and a probability of tails of one tenth.

So much for entropy. Then we started to look at free energy, and we distinguished thermodynamic free energy, which is simply the Helmholtz free energy: internal energy minus temperature times entropy. Then we looked at free energy in statistical mechanics, and we saw this little comparison between the free energy in statistical mechanics and the free energy in information theory, which is simply the negative log model evidence. That is the quantity in the denominator of Bayes' theorem.

Let's look at this for a second, because it's important. In Bayes' theorem, your posterior probability of theta given observations y and model M is the likelihood, the probability of an observation y given parameters theta and the model M, times your prior on the parameters, divided by the denominator, which is your model evidence. The model evidence is the integral of the numerator over all thetas. Here you have a specific theta; now that I'm integrating theta out, I'm going to call it theta prime. The model is always given, so this is the evidence for this model: it is basically the probability of the data weighted by all possible values of the parameters, weighted by their prior, as you see in the next equality. We integrate theta out, and that gives us this. This is the sum rule of probability, and this is the product rule: the product rule says conditional times marginal gives you joint, and the sum rule means that if you marginalize over theta prime, you get this marginal distribution just on the data. It says: given my model, including the prior on the parameters theta, how probable are the data? And once we take the negative logarithm of that, as we do here, we have the surprise at the data. That's basically what the free energy is: how surprised is this model at seeing these data?
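(For reference, the pieces just discussed written out, with A as my symbol for the exact free energy used below: Bayes' theorem, the model evidence as a marginal over theta prime, and the free energy as the negative log model evidence, i.e. the surprise at the data under the model.)

\[ p(\theta \mid y, M) = \frac{p(y \mid \theta, M)\, p(\theta \mid M)}{p(y \mid M)}, \qquad p(y \mid M) = \int p(y \mid \theta', M)\, p(\theta' \mid M)\, d\theta', \qquad A = -\log p(y \mid M) \]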
Yes, well, you could take the surprise at this distribution or at that distribution; this is a particular kind of surprise, the surprise at the model evidence. In general, surprise is the negative logarithm of a probability distribution, and you can apply that to any probability distribution. The negative log model evidence is the surprise at this quantity here, so the free energy is the surprise at the model evidence.

Then we did this trick where we took this expression apart. Applying some algebra, we showed that it was the same as minus log p of y given M; this was on the slides before, so we derived it. Then the problem was that we couldn't calculate it, because it contained the posterior, and that's exactly the problem we're trying to solve: we're trying to get at the posterior. So now, very boldly, we just replaced the posterior that we didn't know with an arbitrary probability distribution q of theta. Now, instead of internal energy minus entropy, we have the variational energy (we call it the variational energy because it contains q, and we're going to vary q using variational calculus) and the variational entropy. And it turns out that for whatever q of theta we choose, the variational free energy A_v will be greater than or equal to the exact free energy A. This means we can wiggle q of theta around and see which direction A_v goes: if it goes down, we know we're getting closer to the exact free energy. In this way we can approximate the exact free energy by varying the variational free energy, without ever knowing the true value of A; we just know we're getting closer if A_v goes down with the changes we make to q. This, quickly, is the proof that A_v is always greater than or equal to A. And then there are three ways to decompose A_v, which we didn't get to yet.

Before we do that, I think this is a good moment for a quick break. Anybody not want a break? Want to be unpopular, raise your hand? So we'll meet back here in a bit less than ten minutes, seven or eight minutes.
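(A minimal numeric sketch, added for illustration, of the bound A_v >= A in the simple Gaussian model from earlier. All numbers and the Gaussian form of q are assumptions for the example: for any Gaussian q(theta) = N(m, 1/p), the variational free energy stays above the exact free energy and touches it only when q equals the exact posterior from the update equations.)

```python
import numpy as np

mu0, pi0 = 0.0, 1.0   # prior mean and precision (assumed values)
pi_eps   = 4.0        # observation precision (assumed known)
y        = 2.0        # a single observation (assumed value)

def variational_free_energy(m, p):
    """A_v = E_q[-log p(y, theta)] - H[q] for Gaussian q(theta) = N(m, 1/p)."""
    e_lik   = 0.5 * np.log(2 * np.pi / pi_eps) + 0.5 * pi_eps * ((y - m) ** 2 + 1 / p)
    e_prior = 0.5 * np.log(2 * np.pi / pi0)    + 0.5 * pi0    * ((m - mu0) ** 2 + 1 / p)
    entropy = 0.5 * (np.log(2 * np.pi / p) + 1)          # entropy of the Gaussian q
    return e_lik + e_prior - entropy                     # variational energy minus variational entropy

# Exact free energy A = -log p(y | M); in this model the evidence is itself Gaussian.
var_y = 1 / pi0 + 1 / pi_eps
A = 0.5 * np.log(2 * np.pi * var_y) + 0.5 * (y - mu0) ** 2 / var_y

# Wiggle q around: every choice gives A_v >= A ...
for m, p in [(0.0, 1.0), (1.0, 2.0), (3.0, 10.0)]:
    print(f"q = N({m}, 1/{p}):  A_v = {variational_free_energy(m, p):.4f}  >=  A = {A:.4f}")

# ... with equality exactly at the posterior given by the update equations.
p_post = pi0 + pi_eps
m_post = mu0 + (pi_eps / p_post) * (y - mu0)
print(f"posterior q:  A_v = {variational_free_energy(m_post, p_post):.4f}  =  A = {A:.4f}")
```

The point of the sketch is the one made above: you never need the true value of A to make progress; lowering A_v by adjusting q is guaranteed to bring you closer to it.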