And the tutorial on Thursday has been shifted to 17:30. And today I'm also glad to introduce you to the director of SISSA, Professor Stefano Ruffo. Given that SISSA is a strategic partner of the whole activity, and a partner of the master in the physics of complex systems and of this spring college, the director wants to greet you. Thank you. So I'm very happy to be here, especially because I'm introducing a very optimistic workshop: a spring college one month in advance. So maybe I'm wrong about the date of spring, but okay, the seasons are changing. I'm very happy especially to see so many students. I know that this workshop gathers students from different years, and it's important that you establish contact across the different years. I will also welcome you at SISSA on the 7th of March: you will have a tour of our departments and you will visit the labs. And of course, if there is anything you need, you can pass your comments or complaints through the organizers, and I would also be very happy if you want to come visit me in my office at SISSA. Another comment is that I am indeed very close to the topics of your field. I've worked in statistical physics, and I've watched the physics of complex systems grow. I see that the spread of topics in this workshop is amazing, and I would like to be a student of the workshop myself, because many of the things, for instance what Christoph is presenting today, I don't know at all. But unfortunately I cannot stay. So have a nice workshop, and I'm at your disposal if you need anything. Okay, thank you very much. Today it is my pleasure to introduce Christoph Mathys. He is originally Swiss, even though he was five minutes away. He has had a very interesting path: he studied physics in Zurich and then moved more towards neuroscience during his studies in London.
And then he eventually landed at SISSA, one year ago more or less. Yes. And the topic of the lecture is really exciting and very important in research. So the floor is yours. Thank you, Andrea, and thank you to the organizers for having me as a lecturer here. It's a great pleasure, and I also love the fact that there are so many of you here at the spring college, even though it's still cold winter. So I will be speaking to you about hierarchical inference, mostly on time series. In this lecture today, I will give you a brief introduction, and then I will also make a connection with the brain, because that is mainly why I do this: I'm interested in how the brain processes information. The brain is an information-processing machine, and we can learn a lot from how the brain does that, and from the first mathematical principles of how information can, and optimally should, be processed. We can learn a lot about understanding the brain, because the two must in some sense match: how the brain has solved its problems, and what the problems to solve are. This gives us a connection between the two fields of neuroscience and information theory, and of understanding complex systems in nature; of course, the brain is one of the most complex systems we have in nature. I will run you through Bayesian inference; I don't know how much basic knowledge you have about that. For some of you it may be a bit of a repetition, but I don't think that hurts. Then we'll talk about free energy as a tool of inference. This is information-theoretic free energy. It is related, in the mathematical tools that we use to analyze it, to physical free energy; basically it just exists in information space. Then, more concretely, I want to look at variational Laplace, which is a method of doing inference; we'll use the blackboard for that a little and look at some equations. And please remember to pass the attendance sheet around.
Please always sign the attendance sheet, especially in my lectures, but also everywhere else. I don't know how many signatures you need to pass here, but be sure to sign it, and if it never comes to you at the back, then scream, please. Now, a quick overview of inference and learning in the brain. One conception of what the brain does, which goes back to the 1860s, is due to Helmholtz, who said that the brain actively predicts what is going to happen next. The alternative view would be that it passively sits there, waits for input, and then acts as a kind of filter on that input: it processes the input and hands it on to the next level, because we all agree that the brain works hierarchically. Information comes in at one end and is then passed up a hierarchy of levels as it is processed. So the first of the two alternative views is that this is a passive, filtering kind of information processing: input comes in, is somehow processed, and is passed on to the next level. Helmholtz's view, by contrast, is that the different layers of this hierarchy don't just sit there. They actively predict what will come in to them from the level below. The lowest level, our sensory organs, for instance the retina at the back of our eyes, has a prediction of what will happen next, of what sensory input it will receive next, and the higher levels of the hierarchy have a prediction of what the lower levels are going to hand up to them. This is more efficient, and we have lots of empirical data that it actually happens this way in the brain, because you only have to react to prediction errors; as long as what you predict happens, you don't have to do anything. All that you might have to do is adjust the precision of your prediction: if your prediction comes true, then you can make it slightly more precise, because it's slightly more certain to be accurate.
This will be a big part of what we're going to talk about: adjusting precision to be just right to perform an inference. So prediction errors can be used to update predictions, and the optimal way to do that is Bayesian inference. We'll look at in what way this is optimal just a bit later, and then we'll look at the mathematical properties of Bayesian inference. So when using sensory input to infer states of the environment that are causing that input, the brain must use a model. That's another fundamental truth; you can prove it. You have to have, or in some sense even be, a model of your environment in order to manage that environment successfully, where to manage it means to be successful in an evolutionary sense: you live a long and happy life and reproduce widely. Such a model contains parameters that are taken to be constant in the short term. However, in the long term the parameters have to be adjusted to reflect the structure of the environment. That's what we call learning. So we perform inference on states, but the model that describes these states has parameters; these parameters are constant in the moment we're fitting the model and are updated on a longer time scale by a system like the brain. That's what we call learning. And this can be implemented in the brain by methods like predictive coding. I won't go deeply into predictive coding, I'm just mentioning it here: predictive coding is a concept of how the brain implements Bayesian inference, implements learning, and acts in a predictive way. One very interesting field that has emerged in the past few years is computational psychiatry. This is basically trying to understand disorders of the mind by describing them in terms of information processing breaking down, using mathematical methods, and seeing whether in models we can reproduce the kinds of symptoms, behaviors, and syndromes that we see in psychiatric disorders. That is what happens when inference and learning break down.
Another thing that we'll be looking at is that it is possible, with some restrictions, to reduce Bayesian inference to belief updating. And here I mean belief in a concrete mathematical sense: a probability distribution. So you update a probability distribution; that's what you do when you update a belief. And you can reduce that, in many cases, to a precision-weighted prediction error update. Then I will talk about what I do day to day, which we call hierarchical Gaussian filtering. It's a particular way of analyzing time series using hierarchical Bayesian inference. I will show some experimental studies and results, and we will look at the derivations of the equations, the mathematics behind hierarchical Gaussian filtering. So let's look at the predictive, Helmholtzian brain for a second; we'll get to the deeper mathematics very soon, but this is just the general principle. I'll just use the mouse to point; you can see my mouse, yes. So the basic components of Bayes' theorem are the prior, here, and then you get your new information: this is the likelihood, in red. And you use Bayes' theorem to arrive at the posterior. As you can see, the posterior is a compromise between the prior and the likelihood. And in this simple Gaussian model it is more precise, more peaked, than both the prior and the likelihood; in fact, in this model the posterior precision is the sum of the prior and likelihood precisions. So this is just the mathematics, but this is what we do when we perform an experiment. As practicing scientists, we look at some kind of reality out there. Here it is an ion channel in a neuron, but any other kind of system that you may be looking at as a physicist, a chemist, or a biologist, you describe in terms that ideally are mathematical and ideally include this prior and likelihood. Because that gives you a full description.
That is what we call a generative model. And once you have that, you can make your measurements; here the measurements are y. And you can use Bayes' theorem to turn that around, to invert it. This is what we call model inversion. This is our model; it's called the forward model because it goes forward from the state of the system out there to the measurements we make. But when we perform the experiment, all we have is the measurement. So we have to take y and use it to infer the state of the system we're interested in, and we use Bayes' theorem to do that. This is the posterior distribution: we invert our generative model to find out what state the system of interest was in while we were making our measurements. And this is exactly what the brain also does. The brain is confronted with a world out there whose states are unknown and not directly accessible to us. All we have is the sensory input we receive: the photons hitting our retina, the sensations of touch we have, what we hear, what hits our eardrums, and so on. And our brain constructs a picture of the world from that. Just like the scientist infers on the system he's studying, our brain constructs an image of the world. The way Helmholtz put it: objects are always imagined as being present in the field of vision as would have to be there in order to produce the same impression on the nervous mechanism. That's slightly convoluted, it's from the 1860s, but it's basically this idea, and it goes back that far, to Helmholtz. He came up with it while studying optics and how the eye works. So the mind actively infers the objects of its perception, in what Helmholtz calls unconscious inference. If this is true, it means that to understand the mind and the brain we need to understand the mechanics of inference, and that is what this will be about: the mechanics of updating predictions. So after this initial overview, we'll get down into the details of inference and updating predictions.
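Before going further, the prior-to-posterior compromise just described can be sketched numerically. This is a minimal illustration with made-up numbers, using the Gaussian case mentioned above, in which the posterior precision is the sum of the prior and likelihood precisions:

```python
# A minimal numerical sketch (made-up numbers) of the Gaussian
# prior-likelihood-posterior compromise described above.
mu_prior, pi_prior = 0.0, 1.0   # prior mean and precision (1 / variance)
y, pi_eps = 2.0, 4.0            # observation and observation precision

# The posterior precision is the sum of prior and likelihood precisions,
pi_post = pi_prior + pi_eps
# and the posterior mean is the prior mean plus a precision-weighted
# prediction error: the ratio pi_eps / pi_post weights the error y - mu_prior.
mu_post = mu_prior + (pi_eps / pi_post) * (y - mu_prior)

print(pi_post, mu_post)  # 5.0 1.6
```

Note how the posterior mean (1.6) lies between the prior mean (0.0) and the observation (2.0), pulled towards the observation because the likelihood is more precise than the prior.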
So what we need to find out is how to reason about uncertain quantities. We have one quantity that we cannot observe exactly, another quantity that we cannot observe exactly, and we have to find out in what way they're related. So here's an idea: does chocolate make you clever? We have chocolate on one side, cleverness on the other side. How do they relate? This was a few years back; it made a splash in the news, and the BBC wrote that eating more chocolate improves a nation's chances of producing Nobel Prize winners, at least according to what a recent study appeared to suggest. But how much chocolate do Nobel laureates eat, and how could any such link be explained? Well, what do you do? You apply statistics to this. This is what somebody actually did. So here's a graph. On the horizontal axis you have chocolate consumption in kilograms per year per capita, so the countries with their flags towards the right eat more chocolate. On the vertical axis you have Nobel laureates per 10 million population, so countries with their flags higher up have more Nobel Prize winners per capita. And you can see a very clear association: the more chocolate a nation eats, the more Nobel Prizes it wins. So how are we going to look at this relation? Well, we could just do a correlation. And the correlation is impressive: it's 0.8, and the p-value is really small, 0.0001. And this is not just something that appeared somewhere in a newspaper; it was actually published in the New England Journal of Medicine, which is one of the most prestigious medical journals, perhaps the most prestigious. But if we want to look at this more closely, if we're not satisfied with just a correlation like that, what do we do? We'll look at how to reason about this in terms of Bayesian inference right now.
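To make the correlation step concrete, this is how such a Pearson correlation would be computed. The country-level values below are invented for illustration; they are not the study's data:

```python
import numpy as np

# Invented per-country values, for illustration only (not the study's data):
chocolate = np.array([1.8, 4.4, 6.3, 8.5, 10.2, 11.9])   # kg per year per capita
nobels = np.array([1.0, 5.5, 11.0, 18.5, 24.0, 31.5])    # laureates per 10 million

# Pearson correlation coefficient between the two variables
r = np.corrcoef(chocolate, nobels)[0, 1]
print(round(r, 3))
```

On data this close to a straight line, r comes out near 1, which is exactly the kind of striking number the study reported.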
Of course, I'll say at the outset that to really make sense of something like this, or perhaps to make nonsense of it, you need to have a model for how this happens: a model that goes from eating chocolate to winning a Nobel Prize. Just doing a correlation will, of course, not be enough. So as a first step to making some sense of this, let's look at reasoning in terms of uncertain quantities. Like almost all scientific questions, this question cannot be answered by deductive logic. Deductive logic is what you use when you do a mathematical proof: you have definition A, definition B, and so on, and then you deduce your theorem from them. Nonetheless, we can reason quantitatively about uncertain quantities, but the answer can only be given in terms of probabilities. So our question here can be rephrased in terms of what is called a conditional probability. Who has seen conditional probabilities? Hands up, please. Yes, okay, so we can go through this very quickly. I'm a bit hidden behind the lectern here; I could move to the center, but then the cable interferes. Good, thank you. Okay, so what we're interested in is the conditional probability of winning a Nobel Prize given that I eat lots of chocolate. The tool for calculating with these quantities is Bayesian inference. And Bayesian doesn't mean anything mysterious; it simply means conforming with the rules of logic, and logical simply means in terms of probabilities. If we don't violate the rules of probability theory, then we're being logical and we're doing Bayesian inference. Here is a quote from the physicist Maxwell, from 1850, who says that the true logic of this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind. All we're trying to do is be reasonable, and being reasonable also means not throwing away any information that we have.
So: processing all the information we have, optimally. In what sense is probabilistic reasoning logical? In the 1940s it was shown by R. T. Cox that the rules of probability theory, which you can simply define as axioms, as Kolmogorov did, can be given a deeper justification. In some sense that's counterintuitive if you believe that an axiom is just something you posit, that is there, and that you just accept. But it may be somewhat unsatisfactory just to posit these axioms. So Cox showed that the axioms can actually be derived from three basic desiderata. If you admit three things, you can no longer deny the validity of the axioms of probability theory. Of course, you're free to deny these three things, but if you accept them, the axioms follow. The first is the representation of degrees of plausibility by real numbers. This is the idea that if I make a statement like "it's raining outside", you can attach a real number to the plausibility of that statement, and some statements have a higher number than others because they're more plausible. The second thing you will perhaps admit is that there should be a qualitative correspondence with common sense: if I receive new information that increases the plausibility of my statement, then the plausibility number should go up; if I receive new information that makes it less plausible, the number should go down. And the third is consistency: if there are two ways to calculate my new plausibility number, then these two ways should lead to the same number. You can go and look at Cox's paper, but from these simple desiderata he derives the three rules of probability. The first is normalization: all probabilities add up to one.
Marginalization: if I marginalize the joint probability of A and B over A, I am left with the marginal probability of B, so just the probability of B. And conditioning, which most of you (I think about 80 to 90% of your hands went up) know about: the joint probability of A and B is the conditional probability of A given B times the marginal probability of B, or the other way around. As Laplace, another physicist and mathematician who was instrumental in developing this, said: probability theory is nothing but common sense reduced to calculation. These are the rules of calculation that you can apply, and then you're on the right side of common sense. So, conditional probabilities: you know this. Let's apply it to the chocolate example. As I said, the quantity of interest is the probability of winning a Nobel Prize given that you eat lots of chocolate. That cannot be observed directly. However, what you can do is take your phone, call up lots of Nobel Prize winners, and ask them how much chocolate they eat. Then you can tabulate this quantity: the probability that a certain amount of chocolate is eaten given that somebody has a Nobel Prize. You can also calculate the probability, in the population at large, of having a Nobel Prize: what's the proportion of Nobel Prize winners in the population? And you can also tabulate, in the general population, the probability of eating a certain amount of chocolate. If you have these three quantities, you can use Bayes' theorem to calculate your quantity of interest. Now, of course, what's missing here, as I already mentioned, is a mechanism. As a scientist, you want a mechanism; simply associating stuff with other stuff doesn't give you insight. And we'll be looking at models for the whole rest of this.
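Before moving on to models, the Bayes'-theorem step just described can be made concrete with a toy calculation. All numbers here are invented placeholders, not real survey data:

```python
# Bayes' theorem applied to the three tabulated quantities described above.
# All numbers are invented placeholders, not real survey data.
p_choc_given_nobel = 0.6   # P(lots of chocolate | Nobel), from phoning laureates
p_nobel = 1e-6             # P(Nobel), proportion of laureates in the population
p_choc = 0.3               # P(lots of chocolate), in the general population

# Quantity of interest: P(Nobel | lots of chocolate)
p_nobel_given_choc = p_choc_given_nobel * p_nobel / p_choc
print(p_nobel_given_choc)
```

Even with a strong association in the likelihood term, the tiny base rate of Nobel Prizes keeps the posterior probability tiny, which is one reason a mechanism, not just an association, is needed to draw any interesting conclusion.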
This was just an introductory example to show you how you can reason about things that are uncertain, that you can only measure distributions of, and how you go from the likelihood and prior of your generative model to the posterior. Because this is your model. Again, I briefly mentioned this, but it's an important point: you have a generative model, and you have described the system you're investigating fully if you have a prior and a likelihood. If we, for instance, do neuroimaging, what we do is construct the likelihood, and that means a model of how the states of the brain that we're measuring will result in the measurements that we make using EEG or MRI or whatever. And then we use these measurements, via Bayesian inference, to infer on the state of the brain while we were making the measurements. Again, this is what we think the brain does, and the evidence is accumulating that the brain in effect works this way. Another point that I briefly mentioned, and that I'll go into in some more detail now, is that every good regulator of a system must be a model of that system. That's a theorem by Conant and Ashby, 1970. The abstract is short and sweet, and I will show it to you in full here: The design of a complex regulator often includes the making of a model of the system to be regulated. The making of such a model has hitherto been regarded as optional, as merely one of many possible ways. In this paper a theorem is presented which shows, under very broad conditions, that any regulator that is maximally both successful and simple must be isomorphic with the system being regulated. The exact assumptions are given. Making a model is thus necessary. And they immediately saw the consequences for the human brain, or for the brain of any organism: The theorem has the interesting corollary that the living brain, so far as it is to be successful and efficient as a regulator for survival, must proceed, in learning, by the formation of a model (or models) of its environment.
This applies to any system you want to regulate successfully. If you want to build a robot, and that robot navigates its environment, the robot must explicitly or implicitly have a model of its world, or be a model of its world. So: updating beliefs. Bayesian inference simply means inference on uncertain quantities according to the rules of probability theory, that is, logic. Agents who use Bayesian inference will make better predictions, because they use logic, as opposed to others who don't, and this will give them an evolutionary advantage: at some point, only agents that do something resembling Bayesian inference to a reasonable degree will be left. Another constraint is that you have to be fairly quick at doing your inference, because otherwise the other agent will have eaten you before you've finished with your inference. Predictions are predictive probability distributions, and we refer to these distributions as beliefs. So whenever I say belief, I don't mean something ill-defined or unclear; I mean a probability distribution. Bayesian inference then amounts to the updating of beliefs. But how can we reduce Bayesian inference to a simple algorithm that can be implemented by neurons, and also by computers, in little systems that we build? So: updates in a simple Gaussian model. One image that I like to use, and perhaps we'll return to it, is the following. You are on a boat, out at sea; you can imagine the Bay of Trieste here. You've lost your modern equipment, you don't know where you are, and there are high waves. You want to get back to shore, but the shore is dangerous, full of cliffs. You have to know where you are in order to know where you need to go to reach the shore. So the only thing you can do is determine, at your current position, the angle between north and a lighthouse that you can see on the shore.
You can only see that lighthouse when you're at the top of a wave; when you're down in a trough you can't see it. So you make a series of measurements, and then you try to infer that angle. Later we'll look at a simpler heuristic for doing this, but if you do it using Bayesian inference, you need to have a prior and a likelihood. In this simple model we'll assume that both of them are Gaussian. So our Gaussian prior, a prior about the true angle theta, has a mean mu_theta and a precision pi_theta. The precision, in this simple example, is simply the inverse of the variance: the variance of your Gaussian is the precision to the minus one. Your likelihood is centered on the true value: if you make an observation y, it will be Gaussian distributed around the true value theta with a certain observation precision pi_epsilon. Your observations are never going to be 100% precise; you will always have some uncertainty, because your instruments are not perfect, because the boat is wiggling around as you make your measurements, and so on. This is captured by the precision of your measurements: the observation variance is pi_epsilon to the minus one. Now Bayes' rule tells you that if you have a Gaussian prior and a Gaussian likelihood, the posterior will be Gaussian again. You can do this integral at home; you will find that this is actually the case. So you now have a new Gaussian, with a new mean mu_theta|y and a new precision pi_theta|y of the angle theta given your observation y. Throughout, theta is the parameter we're trying to infer, the angle between north and the lighthouse, and y is the observation we're making. And the big question now is: what are the new mean and the new precision of our posterior? We know that the shape of the distribution is again Gaussian, but it's now shifted. It's updated.
We first had our prior, then we made our observation, and now we have a posterior. So, in terms of the quantities we had, the prior mean, the prior precision, and the observation precision, given our observation y, what are the posterior mean and the posterior precision? It turns out this is very simple. The posterior precision is simply the sum of the prior precision and the observation precision. Exceedingly simple. And the posterior mean is the prior mean plus some weight on the difference between the observation and the prior mean. This means the mean is updated by an uncertainty-weighted, more specifically precision-weighted, prediction error. Let's look at this. The size of the update is proportional to the likelihood precision pi_epsilon and inversely proportional to the posterior precision. So this is a precision ratio that weights the difference between the observation and the prior mean; the prior mean is our prediction, that's where we believed this to be before we made our new observation y, and the difference is therefore a prediction error. It turns out that this is not specific to the univariate Gaussian case: it generalizes to the Bayesian updates for all exponential families of likelihood distributions with conjugate priors, and we'll look at what that means in what follows. So, to run you through the ingredients we have here: we have a prediction, we have a prediction error, and we have a weight on that prediction error. And you can interpret that weight, I hope you can read this, as how much we're learning here divided by how much we already know. This makes sense. It's mathematically true because we derived it; this is an instance of using deductive logic, when we solve that integral and so on, so it has to be mathematically true. But it's more beautiful than that: it is interpretable, and it makes sense. Because if you predict something to be here, and it turns out to be here, were you off by much?
Is that a lot? Who thinks it's a lot? It depends, yes, exactly. A prediction error means a lot if your prediction was very precise: the more precisely you believed you could predict this, the more it means if you're off. And this is expressed in this equation by the likelihood precision, the observation precision, in the numerator: the larger the precision of your prediction, the more weight your prediction error has. But what if you already have lots of information? Stuff you learned in the past tells you that it is really here, and yet your observation is over here. Then you're likely to dismiss that observation, because your posterior precision is very high: you have already accumulated lots of knowledge about the location of your parameter of interest, and this works against your prediction error; it's in the denominator here. The more you already know, the less your prediction error means. So: the more precise your observation, the more the prediction error means; the more you already know, the less it means. That is what this update equation demonstrates, and we can show that it applies in a very wide range of cases when we do Bayesian inference. This was only a very simple Gaussian example, but it really applies across the board for exponential families of distributions with conjugate priors. Yes: for different distributions, we can derive these update equations. We can derive them in a very general case; I will show you this just now, the generalization to all exponential families of distributions. And then I will even have a recipe for how you can go about deriving update equations of this kind in cases where we don't have exponential family distributions. There you will have to work with approximations, and this is much of what we'll do in this lecture series: we will look at more complicated models where we derive update equations that are precision-weighted prediction errors, just as we saw them here, and we will
look at the strategies, the approximations, that we need to make in order to get them. Very good question. You all have a mathematical background, so you probably have an idea of what an exponential family distribution is. Here's just a partial list of them: the beta, gamma, binomial, Bernoulli, multinomial, categorical, Dirichlet, Wishart, Gaussian-gamma, log-Gaussian, multivariate Gaussian, Poisson, and exponential distributions, and many more. Being an exponential family distribution means it can be written in this form: h(x) times the exponential of eta(theta) times T(x), minus A(theta). For each of these distributions, these components are filled with different content. So for the univariate Gaussian that we have here, which you're all familiar with, the vector x in bold is simply the scalar x, the vector theta is (mu, sigma), h(x) is one over the square root of two pi, and so on. If you insert these quantities here, what falls out is this; if you insert different quantities, you get each of these different distributions. So for each of these distributions there is a particular way of defining x, theta, h(x), eta(theta), T(x), and A(theta) in order to get that distribution. There are distributions where this is not possible. Does anybody know examples of distributions that are not exponential families? Sorry?
Okay, yes, power laws are prominent distributions that are not exponential families. What about distributions with fat tails? Cauchy. The Cauchy distribution is a special case of the t-distribution, and these are not exponential families. But for exponential families we can derive this updating by precision-weighted prediction errors, and if we don't have exponential families, there's a way around it, which I'll show. So the likelihood is an exponential family, in the general form you see here. And now the clever bit is the choice of the prior: we want the prior to be conjugate. As most of you surely know, a conjugate prior is one that results in a posterior with the same structure, the same kind of distribution: if you have a Gaussian prior, you end up with a Gaussian posterior; if you have a beta prior, you end up with a beta posterior; and so on. There are several ways to choose a conjugate prior for exponential families, but if you do it in the way you see here, then you get a posterior of the same structure. You can immediately see that this has the exact same structure, where you replace the parameter nu by nu plus one, and the hyperparameter xi by an updated xi prime, using the same kind of precision-weighted prediction error as we saw before. So your prediction error here is g(x) minus xi, where x is the observation and xi is the prior hyperparameter. You have your prediction error on the hyperparameter, and you update it in a precision-weighted way: nu increases with the information you have, nu represents precision here, and you have a precision-weighted prediction-error update on your hyperparameter. So it's the same structure we saw before, for all exponential families of distributions. This is the proof; you can take a picture, but anyway I'm going to share the slides. It's a simple proof. So these principles: even before we knew what I just showed you, when we still thought that this had to be done
approximately, always, Karl Friston came up with an idea for how the brain might work along these principles: you have the input coming in here, then an error signal going up to the next level of the hierarchy, and the prediction coming down. These are Karl's ideas about how this might be implemented in the cortex of the brain, with prediction errors being processed in the superficial layers of cortex and predictions being processed in the deep layers. The cortex has six layers and is mostly populated by so-called pyramidal cells; these triangles represent pyramidal cells, superficial and deep. Questions about this? That was a quick introduction to the general outlook we will be taking. The main keyword is precision-weighted prediction errors; you'll hear that a lot over the coming nine lectures. Now, we're going to go on until 11. Is it usual to have a break here, or should I just talk at you for two hours? A break would be great? Right, okay, so let's meet back here at 10 past 10.
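As a concrete closing illustration of the conjugate exponential-family update discussed before the break, where nu increases by one with each observation and the hyperparameter moves by a precision-weighted prediction error, here is a Beta-Bernoulli sketch. The choice of the Beta-Bernoulli pair is mine, not from the slides, and the numbers are arbitrary:

```python
def beta_bernoulli_update(a, b, x):
    """Observe x in {0, 1}; return the updated Beta(a, b) parameters."""
    return a + x, b + (1 - x)

a, b = 1.0, 1.0        # flat Beta prior
nu = a + b             # plays the role of the precision-like count nu
mean = a / nu          # predicted probability that x = 1

x = 1                  # a new observation
a2, b2 = beta_bernoulli_update(a, b, x)
posterior_mean = a2 / (a2 + b2)

# The same number, written as a precision-weighted prediction-error update:
# new mean = old mean + (x - mean) / (nu + 1)
pe_form = mean + (x - mean) / (nu + 1)
print(posterior_mean, pe_form)  # both equal 2/3
```

The more evidence has been accumulated (the larger nu), the smaller the weight 1/(nu + 1) on each new prediction error, which is exactly the "the more you already know, the less a prediction error means" behavior of the Gaussian example.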