Guys, please take your seats. OK, so we'll start. I realize you just had an exam and are perhaps a bit exhausted, so we'll slowly ease our way back into the hierarchical Gaussian filtering that we're doing.

What we did at the end of yesterday was take apart the learning rate at the outcome level in the standard HGF, the hierarchical Gaussian filter. And we saw that all three kinds of uncertainty that we knew had to be taken account of actually were taken account of. We have outcome uncertainty, the irreducible kind of uncertainty — risk, in economic terms. We had informational uncertainty, our ignorance about the state of the world, which we can reduce by making more observations. And then we had the uncertainty stemming from the environment changing, states moving around. All of this was represented in the adaptive learning rate that we wanted to have. This is what we set out to get, and now we've got it.

Then, a first little toy application to a financial time series. At the first level, it's simply the time series of exchange rates that we're looking at. At the second level, it's the volatility of that time series. And we can see that at interesting points, when interesting stuff happens, this volatility shoots up — and even the volatility of the volatility at the third level shoots up. What this means is that the model realizes it has somehow lost its grip on reality, that the underlying reality has changed. Old information has to be thrown away, and the way to do that is to increase the learning rate. And you can see this dramatic spike in the precision weights applied to the updates at the different levels. It's most extreme at the second level, where you see the precision weight shooting up. But then, after the model has caught up with reality, you see the precision weight dropping again, because now it would be maladaptive to throw information away too quickly. So we want to slow down.

In the next few slides, you will see the update equations. What I was projecting there — I promised always to go over here, because apparently people can more easily see it over here; or maybe the best way is to use the mouse. So here is the prediction error, and here is the precision weight on it. And this adjusts automatically in response to an increase in prediction errors: an increase in prediction errors leads to higher precision weights, and this leads to a greater effect of prediction errors. So there's a kind of self-reinforcing weight on prediction errors once the system says, these prediction errors are more than I expected. You can actually see the algebra here. This is the update. Now if i is 1, this is the update we have in mu1, the mean of the belief on x1 — basically the mean of this red line here. If the update here is large, this leads to a large quantity here: the volatility prediction error, the prediction error on how quickly stuff is changing. So this increases. And if this is large for i minus 1 equals 1 — so here we have i and here we have i minus 1 — so if this is large at the first level and we're updating the second level, this is delta 1 for the update of mu2. The weight that drives the update on mu2 is then in turn also increased, so this prediction error gets more weight. And what you see is this spike here at the second level.
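Schematically, in the notation of the slides, the mean update at level i has this form. (This is a reconstruction of the structure only; the exact prefactor on the weight depends on the coupling between levels and on the variant of the scheme.)

```latex
% Schematic HGF mean update at level i on trial k (reconstruction):
% posterior mean = prediction + precision weight * prediction error,
% with the weight given by a ratio of precisions.
\mu_i^{(k)} \;=\; \hat{\mu}_i^{(k)}
  \;+\; \underbrace{\frac{\hat{\pi}_{i-1}^{(k)}}{\pi_i^{(k)}}}_{\text{precision weight}}
  \,\underbrace{\delta_{i-1}^{(k)}}_{\text{prediction error}}
```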
This is the weight applied to the updates at the second level. It shoots up because of the volatility prediction error at the first level, delta 1: a prediction error that tells you how much you were off in your prediction of how much your x1 would move. We can parse this term by term.

The main thing to remember is that down here, in the denominator, we have predicted variance, and this has two components. The first is our ignorance about x1 — this would be sigma 1 here. And then we have the environment driving changes in x1; this comes from the level above, so this would be mu2. If this is the volatility prediction error at the first level, this would be delta 1 and this would be mu2. So if mu2 is driving this, then this is the expected amount of change in x1.

In the numerator, we have the observed amount of change in x1. We have our updated uncertainty about x1 — careful with the time indices: the one with index k minus 1 is from before we make the observation, and this one is from after we've made the observation and performed our update. So this is the new, updated uncertainty about x1. And this is the update in the mean of the belief about x1, squared — so we don't care in which direction it went, just by how much we updated. In a sense, then, the numerator is observed uncertainty about x1, and the denominator is predicted uncertainty about x1.

Now, because this works at the log scale — in the definition of the HGF, if we go back, back, back, back, here, we have an exponential, so the levels are linked by a volatility that is situated at the log level — our prediction error is not a difference but a ratio. If you take the log of a fraction, you get a difference; you know that. So it is observed uncertainty divided by expected, or predicted, uncertainty. And this is a prediction error driving this update in mu2. You see that if our observed uncertainty is exactly like our predicted uncertainty, the fraction will be 1, and 1 minus 1 gives you 0: if the volatility is exactly as expected, we do not need to update mu2. Mu2 will just be the predicted mu2, with no volatility-driven update. However, if the observed uncertainty is greater than the predicted uncertainty, the fraction will be greater than 1, and subtracting 1 gives you a positive volatility prediction error. This pushes up mu2, the mean of the estimate of x2. And conversely, if the observed uncertainty is less than the predicted uncertainty, the fraction will be less than 1, subtracting 1 leaves you with a negative prediction error, and the estimate of x2 goes down.

And this is exactly what we see happening here. Where you see this spike — I'll use the mouse — at first the volatility prediction error is always positive, and this drives mu2 up; the updates in mu2 are all positive. And then it comes slowly down again as the system settles back into a belief that it knows what's going on, so it can decrease its learning rate.
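Written out, the term-by-term parsing above corresponds to a volatility prediction error of this form (a reconstruction from the description, for the sub-level i − 1 driving the update at level i):

```latex
% Volatility prediction error (reconstruction): observed over predicted
% uncertainty at level i-1, minus one.
\delta_{i-1}^{(k)} \;=\;
  \frac{\overbrace{\sigma_{i-1}^{(k)} + \bigl(\mu_{i-1}^{(k)} - \mu_{i-1}^{(k-1)}\bigr)^{2}}^{\text{observed uncertainty}}}
       {\underbrace{\sigma_{i-1}^{(k-1)} + \exp\!\bigl(\kappa\,\mu_i^{(k-1)} + \omega\bigr)}_{\text{predicted uncertainty}}}
  \;-\; 1
```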
Yes? [Question about whether the updates are iterated.] So what we do here — and the reasons are, one, efficiency, and two, we also want to anchor this in the brain — is a one-shot update: we use the update equations once. But we could also do this until convergence. I haven't actually tried to prove that it will converge, or when exactly it converges. Years ago, I programmed an application where we actually ran the updates through the system until convergence, and the difference it made compared with the one-shot update was, at least in the examples I looked at, minimal. But it's a good question — if you'd like to do a research project on that, that would be a good idea. We like to keep things simple, because we think the simpler the operations are, the more likely it is that the brain, or a collection of neurons, can do them. But it is actually not implausible to assume that different populations of neurons keep passing messages between each other until they reach a certain equilibrium. So that might actually be what happens. For computational efficiency, we decided to stick with one-shot updates, because in the cases I looked at there wasn't an appreciable difference between waiting until convergence and doing one shot.

So how do we get the update equations? Exactly in the way I showed you — and we'll do this next week; I first want to give you an overview of what's possible with this, and then we'll go into the details and the derivations. As we saw this week, we derive the variational energy: we fill in the model and use the definition of the variational energy at the i-th level. For each level, we have a variational energy; for a generic level, it looks like this. And then, using these equations and that variational energy, we derive the update equations for the new variance — or, if you take the inverse, the precision — and for the new mean. That gives us our update equations, and this is what you have here: the update equations for a generic HGF level. We shall see that these are modified if you change the way the HGF works; we'll see some concrete examples of this.

So why is this practically useful? First, the updates have a general and interpretable structure. The whole story I told you about the volatility prediction error is sort of the main feature of this filtering scheme: we do not only have a prediction error on the quantity we're predicting, we also have an undirected prediction error about the volatility of the time series we're filtering. Other schemes do not have that. And the great thing is that this drops out of the derivation of the update equations. Even though you've derived it — we'll do this, at least partly, on the blackboard — and even though you've calculated away for a while and get this thing that at first looks ugly, if you look at it long enough, you can see that it makes sense. It totally makes sense. It's interpretable: this is the predicted uncertainty at the i-th level, this is the observed uncertainty at the i-th level, and that gives you a prediction error regarding not just the quantity at that level but also the uncertainty. So that's one main feature. It's a general structure, always the same, interpretable, and computationally extremely efficient. And a great feature, when we fit this to actual data from different brains, different people, is that we have parameters that can differ from subject to subject — when a neuroscientist or a psychologist says subject, they mean a person doing an experiment, so from subject to subject means from person to person.
These parameters can be individually estimated from experimental data. This enables comparison of parameter estimates between subjects, so between persons, and of evolving beliefs on states within the same person. And because it gives us a measure of the free energy, it provides a basis for model selection: the free energy is an approximation to the log model evidence, as we saw — the log model evidence is always greater than or equal to the free energy. So we can compare different learning models, with different hierarchical depths, for example. And we can compare so-called decision models; we'll see what decision models are in just a few minutes.

So here's an experiment we published a few years back. It's a very, very simple experiment — almost the simplest experiment you can think of. The idea is this. We have a person, a subject, sitting here and hearing a tone. The tone is either low or high; it is referred to as a cue and has a duration of 300 milliseconds. And this tone will be followed by a picture of either a house or a face. Now, why houses and faces? Because when they're processed by the brain, houses and faces have very distinct signatures. When we do this in the MRI scanner, it is easy for us to tell whether the person was looking at a house or at a face: we know the kind of activation we'll see in the brain from looking at houses and from looking at faces, and they're different. That's why we chose houses and faces.

And of course, these cues, the high and low tones, are predictive of whether there will be a house or a face. Perhaps a high tone allows you to predict that a face is more likely. And conversely, because we constructed it that way, if the high tone is more likely to result in a face, then the low tone is more likely to result in a house. Very simply, we have: the probability of a house given a high tone is 0.8; the probability of a house given a low tone is 0.2; and conversely, the probability of a face given a high tone is 0.2, and the probability of a face given a low tone is 0.8.

Yes — they know that. They know there are going to be these probabilistic associations, as we call them, between houses and the different cues, and the reverse associations between faces and the cues. And now the twist is that this changes in the course of the experiment. At first we have this, and then we switch to something like 0.3, 0.7, 0.7, 0.3, and then we switch again, and so on. We could actually check that; we didn't, because that was not the question in that experiment. I would bet it would work, but we didn't do it. Yes? We wanted to keep it simple here. At the modeling level, this allows you to assume that with every outcome, people learn about both kinds of contingencies, because we're fully transparent about how the experiment works — we don't hide anything from them. So from every outcome, they can learn about both kinds of contingencies, and we only have to model one learning trajectory. But we have done other experiments where this was not the case, where they have to learn about these associations separately; it depends on the experiment. I have colleagues who didn't tell people whether there was this coupling between the probabilities. In this case, we told them how it was.
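Just to make the paradigm concrete, here is a minimal sketch of a single trial under the contingencies above. (A sketch only: the random seed is arbitrary, and the block switches of the real experiment are left out.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Contingencies from the first block of the experiment:
# P(house | high) = 0.8, P(house | low) = 0.2; faces are the complement.
P_HOUSE_GIVEN_CUE = {"high": 0.8, "low": 0.2}

def run_trial(rng):
    cue = rng.choice(["high", "low"])            # 300 ms tone
    is_house = rng.random() < P_HOUSE_GIVEN_CUE[cue]
    outcome = "house" if is_house else "face"    # picture after the tone
    return cue, outcome

print([run_trial(rng) for _ in range(5)])
```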
So really, it's a choice you make in designing the experiment. And of course, there's a whole zoo of experiments like that, and people do small variations and see how the effects change in response. One interesting question is always whether sub-populations of subjects behave differently. If you take people who are somehow special — people with schizophrenia, for instance — and you subject them to two different paradigms, does the way they adapt to a change in the paradigm vary in a characteristic way compared to how healthy controls react to the same change? Then you can publish a paper saying people with schizophrenia are more or less able to deal with this kind of change, or this kind of difference in the paradigm.

No, they have to learn them — that's the point of the experiment. These numbers are hidden from the people; they don't know them. They just know there is an association between the tones and the outcomes, and that these associations are symmetric. And this was done for 320 trials, plus 64 so-called null events. We need those for the MRI analysis, so that we have a baseline where we just present a tone but the outcome is nothing: neither a house nor a face.

Now, the thing is this. In the HGF as I introduced it, we had continuous quantities performing Gaussian random walks, like x2 here and x3. That's the standard HGF setup. But the outcome we have on each trial is binary. So, outcome coding: a high tone followed by a face is coded as an outcome of 1, and likewise — because it's the same kind of outcome, since the probabilities are coupled — a low tone followed by a house is also coded as 1. A high tone followed by a house, as well as a low tone followed by a face, is coded as 0. These are the two kinds of outcome you can have on each trial. And the probability of getting an outcome of 1 — we're going to call this state x1 — the probability of x1 equals 1 is what? Who can tell me? I heard the right answer somewhere around here: 0.2. We code the outcome where the high tone is followed by a face as 1; given a high tone, the probability of a face is 0.2 — so there's our answer. But we can also check whether we're right by looking at the other 0.2: the probability of a house following a low tone. So we're correct. And the probability of x1 equals 0 is then 0.8. This is our x1.

And x1 is always a state of the external world. This is an objective fact — this is how the experiment turns out. It's out there; it's not in my head. It is a binary state of the world: the outcome of the experiment on each trial is either 1 or 0, and there's a definite probability of getting a 1 and of getting a 0. This is what the subject has to learn. Now, how do we get such binary outcomes into our modeling? Since we assume Gaussian random walks, and a Gaussian random walk by definition extends over the whole real line — it's unbounded — we cannot directly model changing probabilities, because probabilities are confined to the interval from 0 to 1. But there's a very simple trick: we can map the real line onto the unit interval using a logistic sigmoid transformation. The logistic sigmoid function — useful in many situations — has a graph that looks like an S; that's why it's called a sigmoid.
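A minimal sketch of this mapping (just the standard logistic function; the example values are arbitrary):

```python
import numpy as np

def logistic_sigmoid(x):
    """Map the whole real line onto the unit interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x2 = np.array([-4.0, 0.0, 4.0])       # states on the real line
p_x1_is_1 = logistic_sigmoid(x2)      # P(x1 = 1) = s(x2)
p_x1_is_0 = 1.0 - p_x1_is_1           # P(x1 = 0) = 1 - s(x2)
print(np.round(p_x1_is_1, 3))         # [0.018 0.5   0.982]
```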
Here's 1, here's 0, this is the x-axis, and here we have one half. This maps the whole real line to the unit interval. So we can take a state x2 whose sigmoid is the probability of getting an x1 of 1. And that's what you have here: the probability of x1 equals 1 is the sigmoid of x2, and the probability of x1 equals 0 is 1 minus the sigmoid of x2, because — as you'll immediately believe me — if this is the y-axis, then this is the graph of y equals s of x, and this is the graph of y equals 1 minus s of x.

[A suggestion from the audience about using a step function.] You could try that, but I'm not immediately convinced by the idea, because what we want is a probability on the unit interval: the probability of getting x1 equals 1 is between 0 and 1, and as we know from the way we conduct the experiment, it can change — it can sometimes be here, sometimes here, sometimes here. With a step function, you would just get a jump from 0 to 1, right? Just to parse your proposal, if I'm understanding you correctly: you would map x2 onto x1 directly. You can do that, but I don't think it would make much sense in this situation. So that's what we do: we model this changing probability of getting an outcome of 1 as a change in x2. And we take this model and apply everything we learned this week to it. We basically just write it down.

Yes — the cumulative distribution function of the Gaussian? Ah, yes, that's another sigmoid. You could use that, but as you'll see, using the logistic sigmoid gives you quite straightforward update equations. Does it make a substantial difference which sigmoid you use? Substantially, no, it doesn't make much difference. You could also use the hyperbolic tangent, which is itself a sigmoid. In theory you can use the arctangent, you can use the cumulative Gaussian — anything that maps the real line onto the unit interval in a sensible, continuous, and monotonic way will in principle work. In practice, you want a function that gives you straightforward update equations.

[Question: couldn't subjects just learn the probability of a house or a face appearing?] There is, of course, some background to this that I didn't mention. We wanted to dissociate pure association learning from reward learning. We had another condition where people didn't just get houses and faces, but got monetary rewards or not. We wanted to see whether the same mechanisms in the brain were in play when prediction errors were about rewards as when they were simply about associations. So we had to have two things that are associated with each other, or associated with a reward. Specifically, we wanted to know whether the dopaminergic midbrain becomes active only in learning about reward, or also in learning about associations. And the answer was the second: the dopaminergic midbrain cares about both rewards and simple associations. In some sense, it cares about getting it right — and getting it right is a reward in itself. You don't need an external reward, monetary or otherwise. (If you did the experiment in animals, you would use something like food or drink as a reward.) So the dopamine system also cares simply about learning associations. Could you model that in exactly the same way? Yes, absolutely — the modeling would be even simpler, because you wouldn't have this strange kind of mapping here, and so on.
But there's always sort of a neuroscientific background to this, a neuroscientific question that we are trying to answer through modeling. The modeling helps the neuroscience; it's not done just for the sake of the modeling itself. Further questions?

Yes — what is the modeling doing? Basically, this is a model — one of several possible models, and we compared many of them — of how a subject processes what is going on during the experiment. What idea does the subject have of how the world works? This is one possible way external reality could produce the sensations you have as a subject while you're doing this, and we do this for several models. To be strictly exact: no, we are modeling the world here, the outside world. You can see this is x, and we're actually consistent with everything we've done before — I said x is a special case of theta. So that's outside, in the world. Once we do inversion, once we do inference, then we're working inside the subject's brain. Once we derive the update equations, as we do here, that is what is going on inside the subject — that is inference. Whenever we're dealing with the mus and the pis, we're dealing with beliefs about the x's: x is the outside world, and mu and pi are the belief inside my head about x. So that is inside the subject, but the model here is a picture of a probabilistic process outside in the world. Because we don't know the exact process as it really is, we have to use simpler, generic ideas of how these things work: this is a quite generic model of how information that has the form of a time series can be processed.

So the generic HGF update equations, the ones we just saw, apply at all levels where we have continuous states. x2 is continuous, x3 is continuous; x2 and x3 live on the whole real line. But x1 is binary — it lives in the set {0, 1}. So we need to derive update equations for x2 on the basis of the observation of x1. We know the outcome: we see whether it was 1 or 0. So we don't have to infer x1; we see x1 directly. But we have to infer x2, and our inference on x2 takes the form of a belief, a probability distribution with two sufficient statistics, mu2 and pi2 (or its inverse, sigma2). So we need a way to get from x1 to the belief on x2 — a way, for every new observation of x1, to update our mu2 and our sigma2. And again, it's a precision-weighted prediction error. As this slide shows, to get at the exact nature of the precision weighting you can do a Taylor approximation, and you can see that it's all made up of precision terms. And again, it's the same structure: prediction plus weighted prediction error. The delta 1 here is outcome minus prediction, and the prediction is mu1 hat, the sigmoid of mu2.

Yes — why three levels? Because in the model comparison, three levels were optimal: we had the most model evidence for that number of levels. Whenever we fit the model, we calculate the model evidence — which is basically the free energy that we have — and then we compare that across many models on the same data set. So we use an HGF with two levels, with three levels, with four levels, and we compare, and we see we have the most evidence for three levels. And then we compare with all kinds of other models: we also applied Rescorla-Wagner learning to this, and reinforcement learning models, Sutton's K1 model, and so on. What is the evidence? The probability of the data given the model, with the parameters integrated out. That's the model evidence.
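Written out, with theta standing for all the states and parameters of the model, as before:

```latex
% Model evidence: the likelihood of the data under model m, with the
% states and parameters theta integrated out.
p(y \mid m) \;=\; \int p(y \mid \theta, m)\, p(\theta \mid m)\, \mathrm{d}\theta
```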
When we had Bayes' theorem, we had: the posterior, given the data and the model, is the likelihood times the prior divided by the evidence. So in Bayes' theorem, here, we've got the evidence. The deeper reason why we use it is this: if you compare two models — just two models for now — what you want to look at is the ratio between their evidences. This is called the Bayes factor: the evidence of model one divided by the evidence of model two. Now, what is a sensible question to ask when I tell you this is the Bayes factor? A question that immediately springs to mind: does it look like a factor? It looks more like a quotient, doesn't it? So why are we calling it the Bayes factor? The answer is that it is the factor you need to multiply your prior odds with to get your posterior odds.

Yes — that's an important point. We have two different models; for instance, and this is where we started, an HGF with three levels and an HGF with four levels, so m1 is three levels and m2 is four levels. We fit the same data with both and calculate the model evidence for each. We can only calculate it approximately, but the variational free energy is a good approximation to it. Other approximations are the AIC or the BIC, things you see in model scoring — there's a whole zoo of these measures. But if you have the variational free energy available, it's one of the better approximations to the model evidence.

So let's say you and I both have an idea of how probable model one and model two are, and we both write this down. Now we collect some data y, and this changes how probable we think the two models are. This is my belief in the relative probability of the two models before I see the data y, and this is my belief after seeing the data y. The ratio of the probabilities of two events is called the odds — anybody familiar with betting will know that. When you go to the betting shop, they give you the odds: they say the odds of this horse winning are seven to one, and the longer the odds, the less likely that horse is to win. So these are the prior odds, and these are the posterior odds. And what takes us from the prior odds to the posterior odds is the Bayes factor. Now, the thing is, this is objective in the following sense: the prior odds may be different for you and for me, but the factor that takes both of us from our prior odds to our posterior odds is the same for us both. And it is a measure of how good, in the light of the data y, model one is relative to model two. If, in the light of the data y, they are both equally good, this ratio will be 1, and neither of us will change our opinion: the posterior odds will equal the prior odds. However, if on these data y the model m1 is superior to the model m2, then both of our beliefs will shift in favor of model one. So this is the model evidence, and this is why it's so important: it allows us to compare models. We can only calculate it approximately in all interesting cases, but it's fundamental for model comparison. And on the basis of this, we decide how many levels to include in the modeling.
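Collecting what's on the blackboard into one line:

```latex
% Posterior odds = Bayes factor * prior odds. The prior odds may differ
% from person to person; the Bayes factor is the same for everyone.
\underbrace{\frac{p(m_1 \mid y)}{p(m_2 \mid y)}}_{\text{posterior odds}}
\;=\;
\underbrace{\frac{p(y \mid m_1)}{p(y \mid m_2)}}_{\text{Bayes factor}}
\;\cdot\;
\underbrace{\frac{p(m_1)}{p(m_2)}}_{\text{prior odds}}
```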
Yes — do we have any sort of prediction of how the model evidence will change if we add a level? No, not in advance. But it's always, as we saw, a trade-off with the accuracy. The complexity will usually rise when you add new levels, but the accuracy may also rise, and if the accuracy rises more than the complexity, the evidence rises. Does this overfit? No, because you've got the complexity term in there, which safeguards against overfitting. We always ask: how much does an additional level — or, in general, additional parameters — affect my complexity and my accuracy? Whether your new model is superior depends on the balance of these effects. If the introduction of new parameters increases the accuracy by more than it increases the complexity, the new parameters are justified. Conversely, if adding another level adds more to the complexity than to the accuracy, then you shouldn't add it. That's the logic: when comparing different models, you prefer the ones with more evidence, and evidence is accuracy minus complexity. So I calculate the evidence for two levels, for three levels, for four levels; I see a maximum at three levels, so I take three levels. Could it go down for four levels and then up again for five? There are principled reasons not to expect that: adding a level only helps you if it increases your accuracy, so if adding a fourth level doesn't help, adding a fifth won't suddenly start helping. This is a quite straightforward situation, though not all situations are so straightforward — you cannot always predict what will happen to complexity and accuracy when you tweak the model at this end as opposed to that.

Sorry, I didn't get that. So the thetas here are the states and parameters of the model: theta is the collection of everything we have here. The x's are part of theta, along with the kappas, the omegas, the thetas, and so on. Because this is a time-series model, what gets updated in the inversion process is the following. The parameters kappa, omega, and theta are not updated, because they are assumed to be constant — that's a modeling assumption. However, the sufficient statistics describing our beliefs on x1, x2, x3 are updated step by step; that's what we have the update equations for. So the update equations are for mu2 and sigma2 (or mu2 and pi2) and for mu3 and pi3. That's what gets updated trial by trial.

And here, this is an example of exactly what I'm talking about. This is a simulation, so the three parameters are chosen: I go and choose a theta of 0.5, an omega of minus 2.2, and a kappa of 1.4. Then I give these x1's, the little green dots, to my little model, and this is how mu2 gets updated in the course of the 320 trials that we have in the experiment, and this is how mu3 gets updated. At both levels, the uncertainty gets updated too — it's not plotted in this figure, but at this level we have pi2 and at this level pi3, and they're also updated. The parameters theta, omega, and kappa stay constant. In this example, we see an interesting effect, because what I did when I simulated this was to use the exact same inputs in the first 100 trials as in the last 100 trials. But you can see that the belief updates differ — and what I've plotted here at the first level is the prediction that the next outcome will be 1.
So this is our little agent making a prediction about the next outcome, in the form of the probability of getting an outcome of 1. You can see that in the phases where there are lots of outcomes equal to 1 — I promised to show everything over here — the belief that the outcome will be 1 increases. And then there are almost no outcomes equal to 1 anymore, but lots of outcomes equal to 0, so the prediction that the outcome will be 1 goes down. And as soon as there's another 1, it ticks back up a little, and so on.

Now to this intervening period of volatility here. The black line is the objective ground-truth probability of getting a 1. We leave it at 0.5, so 1 and 0 are equally likely, for the first 100 trials, and quite appropriately, the belief that the next outcome will be a 1 fluctuates around 0.5. Actually, learning is taking place at a somewhat too quick pace here — this is basically overfitting by our little agent. It reads too much into the fluctuations between 1 and 0, which are entirely random. Ideally, you would just have a flat line around 0.5, because there's actually nothing to learn from these outcomes: a 1 is exactly as probable as a 0. But then this changes: the probability of getting a 1 jumps up to 0.8, and our agent learns that a 1 is now more probable. Then the probability jumps down to 0.1 or 0.2 — I don't remember — and the belief goes down, and up again, and down again, and up again. And now, because the agent also learns about volatility, its belief about the volatility has increased — this is the volatility level. And now the same input that the agent got here is interpreted in a different way: it is seen as much more informative, in the sense that, because we're in a volatile environment, we have to learn quickly. So even though the agent was already learning a bit too much here, it's now learning even more: its beliefs are swinging around wildly, in response to the same inputs as here. And this is just because the belief about the volatility of the environment has increased.

Yes, exactly — you build a memory of how volatile the environment is, or has been, and that affects your learning rate. To put it very simply: here you've got a higher learning rate than back here, because the agent has learned that these changes in contingencies are quite likely to occur. If you then kept things stable for a very long while, this belief about the volatility would come down again — it's just not long enough here.

At the bottom, the black line is the ground truth from which the green inputs are sampled; the agent doesn't know it. Here I sample my outcomes, 0 and 1, from a Bernoulli distribution with parameter 0.5; here I sample the outcomes from a Bernoulli distribution with parameter 0.8; and here from a Bernoulli distribution with parameter 0.2, and so on. That's what I'm using to feed the agent with inputs, with x1's — how I'm generating the x1's. Yes — everything changes again, and you have to relearn. It's not useless, though: here you accurately learn that a 1 is now much more likely, and here you accurately learn the opposite. But as soon as the agent has learned what it needs to learn, we change the world again.
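A minimal sketch of how such an input sequence can be generated. (Only the first block of 100 trials at 0.5, the total of 320 trials, and the repetition of the first 100 inputs at the end are from the lecture; the middle block lengths are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Piecewise-constant ground-truth probability of outcome 1 (the black line).
probs = np.concatenate([
    np.full(100, 0.5),   # nothing to learn: 1 and 0 equally likely
    np.full(60, 0.8),    # contingency switch
    np.full(60, 0.2),    # and again
])
inputs = rng.binomial(1, probs)   # the green dots: sampled binary outcomes

# Replay the first 100 inputs at the end, so the agent sees identical data
# under a different (higher) volatility belief.
inputs = np.concatenate([inputs, inputs[:100]])
print(inputs.shape)               # (320,)
```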
So let's look at some real data. This is an actual subject — you can see this is actual data. The x's indicate trials this person missed, where he wasn't quick enough in giving an answer. And you can see this guy learns a lot: he has a kappa of 4.1 — we'll see what that means — an enormous learning rate, ridiculously high. And because, for him, everything's changing all the time, you can see his volatility belief also increasing most of the time. So this is one subject, and this is an inferred learning process for this actual human being. And here you have another guy, who is much slower — so slow that he misses giving an answer all the time. But we still have hundreds of answers from him, and this still allows us to fit his belief trajectory. And actually, the learning rate here is much more appropriate. You can see these fluctuations in the beliefs.

But it's not as if it were pointless for them to try to do a good job on the task, because they do get more answers right than wrong: more of their predictions are correct than incorrect. It's not as if you couldn't learn anything here. It's hard — it's not easy — but it's not pointless. It would be pointless if randomly answering gave you the same score in the end as trying to learn something, and that's not the case here. You will be far from correct on every answer, but you will be correct on considerably more than half of your answers as you do this task. And also, just as a consideration of experimental design: people are lying in the scanner while they're doing this, and if what they have to do is too easy or too boring, they just go to sleep. I don't know how many times you've been in a scanner, but it's a very sleep-inducing environment. So out of that consideration too, you have to gauge your task so that it is interesting: it has to be somewhat challenging, not too easy, but not pointless either. And this worked well in this task — it takes about 25 minutes for people to do it.

Yes — sorry. This is from the behavioral data we recorded while people were in the scanner. We also recorded how their brains reacted to what they were doing, and we have nice pictures of how the brain processes what's going on here and what's going on here. It's mostly the dopaminergic system dealing with this — at least that's what we found in this experiment — and mostly the cholinergic system dealing with this. These are two different neurotransmitters, two different neuromodulators. Dealing with this happens mostly in the midbrain: at the top of your brainstem, you have the midbrain, an old part of the brain, mostly populated here by dopaminergic neurons — neurons that release dopamine as a neurotransmitter. And for the volatility level, the activity is mostly in the basal forebrain, a younger part of the brain, in a region with neurons that release acetylcholine.

No — that's the actual data you get from their behavior. It's their button presses. They lie in the scanner with a button box; one button means "I predict a face" and the other means "I predict a house". The MRI data, the MRI signal from their brain, is separate — this here is just from the behavior. All we have is their button presses, and the button presses are the orange dots.
If their button press corresponds to a 1 — we don't have it on the blackboard anymore, but if the button press corresponds to a 1 — the orange dot is up here; if it corresponds to a 0, the orange dot is down here. By fitting their responses, everything is estimated together: we have one estimate for omega and one estimate for kappa. The one you saw before had a kappa estimate of around 4, if I remember correctly; this one has a kappa estimate of 1.2. The theta is estimated here, and omega is fixed — to minus 4.

And it's a good question, because we need to have a model of how people translate their beliefs into decisions. On top of the learning model, which describes how they update their beliefs, we need a so-called decision model. As I said, we update the mu2's and sigma2's, the mu3's and sigma3's, on each trial. And then we have a decision model, which in this case — you can build other decision models — is based on the sigmoid of mu2. Why is that? Because the belief fluctuates on mu2. Here we have 0. If we apply a sigmoid to this, we get what we call mu1 hat: the prediction on x1. When mu2 is up here, the belief that the next outcome will be 1 is high; when mu2 is down here, that belief is low. And this corresponds to the belief trajectories we just saw. Take this one: this is mu2 moving around, and here is 0. Whenever it goes high, the probability that the next outcome is 1 is high; whenever it goes low, that probability is low. So our decision model is based on the subject's belief about the probability of outcome 1: the sigmoid of mu2 is basically a measure of the subject's belief that the next outcome will be 1. And now this is mapped onto the probability of the subject predicting an outcome of 1.

We want to allow for the possibility that even when a subject believes the next outcome is 90% likely to be a 1, the probability of the subject predicting a 1 is not 90%. There may be many reasons for that. The simplest explanation is that people are, in some sense, noisy in their decisions: decisions are not always an accurate reflection of beliefs. Another, more principled explanation is that people are exploratory. Even when you're certain that one option is the most rewarding, you may choose another option because you want to learn about your environment. You know that if you go to restaurant A, you will have an enjoyable evening — you've been there a thousand times. But now there's a new restaurant, and your expectation of having an enjoyable evening there is actually lower, because it's a new restaurant, who knows, and your prior is set around the average new restaurant. But you want to go there because you want to explore. So beliefs about the outcomes are not entirely predictive of behavior: it's not 90% probable that people will predict an outcome of 1 when they believe an outcome of 1 is 90% probable. So how does the belief map onto the subject's decision? This is our model — also a sigmoid, but a different kind: the unit-square sigmoid, which is a sigmoidal map from the unit interval to the unit interval. So we have a decision model.
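A minimal sketch of such a unit-square sigmoid. (This functional form is commonly used with the HGF; treat the details here as a reconstruction, with zeta as the per-subject decision parameter.)

```python
import numpy as np

def unit_square_sigmoid(mu1hat, zeta):
    """Map belief P(next outcome = 1) onto P(subject predicts 1).

    zeta = 1 gives probability matching; large zeta approaches the
    deterministic, reward-maximizing step at 0.5; small zeta is noisy.
    """
    return mu1hat**zeta / (mu1hat**zeta + (1.0 - mu1hat)**zeta)

belief = np.array([0.1, 0.4, 0.6, 0.9])
print(unit_square_sigmoid(belief, 1.0))   # identical to the belief
print(unit_square_sigmoid(belief, 64.0))  # ~0 below 0.5, ~1 above 0.5
```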
And based on this decision model, we can infer the beliefs of the subject as the subject does the task. Yes — I have the actual decisions of the subject, the orange dots, and I know what I put in, the green dots. And then we go in search of the parameter values that best explain the subject's decisions in the light of the inputs. That's what we do when we fit this model. Yes, we estimate that subject by subject: each subject gets their own value there, because some people are noisier than others. Some people very reliably choose the option they think is more likely — that would be a high zeta. You can see that for a zeta value of 64, you get an agent who, as soon as the probability of getting a 1 is greater than 0.5, will always predict 1. That's sort of the reward-maximizing behavior. Yes — if you take an economic definition of rationality, where a rational agent is one who maximizes his reward, that would be the most rational. But there are many ways to think about this, and there are more kinds of reward than the immediate reward you get from an outcome; there are also rewards from getting to know your environment. If you want to know more about this, there's a nice paper, "Active inference and epistemic value", with Karl Friston as first author. There you can see how the actual value of an action has more to it than what economists would call rational choice based simply on utility — and you can show that in a quite rigorous way.

Yes, I'm happy to do that — this is actually a good slide for it. This is, again, our generative model. So it's x here — sorry, I'm again on the wrong side; I should do it over here, or do it with the mouse. At x3, the parameter we have in the model is theta: the variance of the Gaussian random walk at the top level, because x3 at time k is normally distributed around x3 at time k minus 1 with variance theta. That's the top level, and that's where this theta comes from. At the second level, x2 at time k is normally distributed around x2 at time k minus 1, with a variance that is the exponential of kappa times x3 at time k, plus omega. Yes — we fixed omega, because the data only allowed us to estimate kappa and theta. There's a deep reason for that, which would take too much time to explain here, but the data only allowed us to estimate these two; omega is fixed to minus 4 in this experiment. Yes, exactly — in the decision model there's also zeta, and zeta is also estimated, subject by subject. And then, to make it even slightly more complicated, the initial values of all the trajectories are also parameters, and some of them are fixed and some estimated — I don't remember exactly which initial values we estimated and which we fixed.

In general, if you have data from 320 trials with binary decisions, the amount of information you have is 320 bits: 320 ones or zeros. That is not much information, so there are limits to how many parameters you can estimate from it. That is the reason — and it's also why we fixed omega and some of the initial values — that there's a limit to the number of parameters you can estimate. Maybe we'll have time to go into how we decide which ones these are.
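To keep the roles of these quantities straight, here is the generative model just walked through, collected in one place (a reconstruction in the lecture's notation):

```latex
% Three-level binary HGF generative model at trial k (reconstruction):
\begin{aligned}
x_3^{(k)} &\sim \mathcal{N}\!\left(x_3^{(k-1)},\ \theta\right)
  && \text{top level; } \theta \text{ is the random-walk variance} \\
x_2^{(k)} &\sim \mathcal{N}\!\left(x_2^{(k-1)},\ \exp\!\left(\kappa\, x_3^{(k)} + \omega\right)\right)
  && \text{variance set by the level above, at log scale} \\
p\!\left(x_1^{(k)} = 1 \mid x_2^{(k)}\right) &= s\!\left(x_2^{(k)}\right)
  && \text{logistic sigmoid link to the binary outcome}
\end{aligned}
```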
We also have to look at the posterior correlations of these parameter estimates, because if two estimates are highly correlated, positively or negatively, that means the two parameters are fundamentally explaining the same thing. All you can do in such a situation is choose a value for one of them and estimate the other. So the interesting parameters that we estimated here were the theta, the kappa, and the decision noise zeta. I think we're scheduled to go until 1, right? It's now 5 past 1. OK, so I wish you a happy weekend, and see you next week.