This is where we stopped. Are there any questions from last time? Yes, yes.

So what we have here is a little simulation of an agent processing binary inputs. The foundation of everything else is the green dots here. Our little agent receives 320 inputs, and each of them is either 0 or 1. That's what you see plotted here. The way we present these inputs to the agent is by sampling them from a probability distribution indicated by the black line here. You can see my mouse. We start out with 100 inputs sampled from a Bernoulli distribution with parameter 0.5, so each input is equally likely to be a 1 or a 0. Then we change the parameter of the Bernoulli distribution from which we draw the inputs to 0.8 for about 20 trials. Then the parameter drops to 0.2, rises to 0.8 again, and so on a few times until, at the end, there are another 100 inputs drawn from a distribution with parameter 0.5, where 0 and 1 are equally likely. And you can see that, of course, when the inputs are drawn from a distribution that makes ones more likely, you get more ones. So up here there are many more ones, with only a few exceptions where you get a 0. Conversely, if the parameter is low, you get many more zeros than ones, and so on.

Now, what you see here, the red line, is what the agent learns. The red line is the agent's prediction about what will come, in the form of the probability of getting a 1 as the outcome. If we take this point here as an example, the agent correctly thinks that an outcome of 1 is 0.8 probable. And here, the agent accurately believes that an outcome of 1 is 0.2 probable, and so on. So what you see in red is basically the learning curve.

Yes. Yes, this is all simulation. There are no right parameters; there are just parameters that lead to different effects. Yes, exactly. And here the effect I want to show is that the same inputs at the start lead to very different learning at the end, because the learning rate has increased owing to an intervening period of volatility. Yes, basically, yes. But during the simulation these parameters are constant. The parameters don't change; the agent learns. So I don't fiddle with theta, omega, and kappa; these parameters are constant. The only thing that changes is the agent's beliefs about the states. The agent has beliefs about x2, captured by mu2 and pi2, and it has beliefs about x3, captured by mu3 and pi3. These beliefs change, and because these beliefs change, the learning rate changes. At the beginning of the simulation, I chose a theta, an omega, and a kappa that let me show you this nice graph and illustrate this effect.

When we do experiments, we estimate these parameters after we get the data. The framework in which we work is called observing the observer. The agent, in that case a human subject, observes experimental input and forms beliefs based on it. We observe the observer's behavior, and on the basis of that we, in turn, infer what is going on inside our subject. So there are two levels of inference: our subject infers on the state of the world generating its experimental inputs, and we infer on the state of the subject's mind.
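In code, a minimal sketch of the input-generating process described above might look like the following. The overall 320-trial structure (100 stable trials, a volatile middle period of roughly 20-trial blocks, 100 stable trials) is from the lecture; the exact block boundaries, the random seed, and the use of numpy are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Block structure: (number of trials, Bernoulli parameter).
# 100 trials at p = 0.5, a volatile middle period alternating between
# p = 0.8 and p = 0.2 in ~20-trial blocks, then 100 trials at p = 0.5.
blocks = [(100, 0.5),
          (20, 0.8), (20, 0.2), (20, 0.8), (20, 0.2), (20, 0.8), (20, 0.2),
          (100, 0.5)]

# u is the sequence of binary inputs presented to the agent (the green dots).
u = np.concatenate([rng.binomial(1, p, size=n) for n, p in blocks])
assert len(u) == 320
```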
The middle panel captures the mean of the probability distribution on the state x2. For x2, the model assumes a Gaussian random walk, and in order for that to work, x2 has to be continuous. So there is a sigmoid, a logistic sigmoid transformation, from here to here. If you pass the red line up here through the logistic sigmoid, what you get is the red line down here. This lower line is confined to the unit interval, the interval between 0 and 1, whereas the upper line can go to either minus infinity or plus infinity.

Yes? Yes. At this point, the random walk, we assume, does not have any drift. But if you bear with me for a little while, perhaps still today, we will introduce parameters that produce drift, even variable drift. Where do we have that? Here it would be a constant drift. Would it be interesting to have a kind of drift? Perhaps. I wanted to keep the example simple. But in the experiments we run, we do often check whether our subjects believe in a kind of drift, because, as I said, we're trying to find out what's going on in their minds: does their behavior show any kind of belief in drift? We would pick that up in our parameter estimates. Concretely, we would have an additional parameter rho, or at least one additional parameter rho, indicating a drift in the random walk here. And then we could estimate all of these parameters based on a subject's behavior. If rho is essentially 0, that would indicate this subject doesn't see any drift in the input-generating process. However, if rho is positive, then this subject would be seeing a drift going up; if rho is negative, then the subject would be seeing a drift going down.

One way you can interpret this drift: imagine first that 0 and 1 are neutral. In the one experimental example I gave you, it was just associations between high tones and faces and low tones and houses, so that's basically neutral. But imagine a situation where the outcome 1 indicates a reward and the outcome 0 indicates no reward, or even a punishment. Then somebody with what we call an optimism bias, somebody who tends to believe that outcomes will be better than they usually are, would exhibit a certain drift, a certain rho towards the top. They would always be slightly overoptimistic in their expectation. Of course, if you always get punished, even if you're overoptimistic, even if you have a drift that pulls you up, reality can still drag you down. So the curve can still come down despite a drift upwards. In experiments where we looked at optimism bias, we used exactly these drift parameters to quantify the level of optimism bias that people have.
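As a small sketch of these two ingredients, the logistic sigmoid and a random walk with drift rho: the specific values of rho, the step variance, and the trajectory length below are illustrative assumptions, not the lecture's parameters.

```python
import numpy as np

def logistic(x):
    # Logistic sigmoid: maps the real line onto the unit interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)

# Gaussian random walk on x2 with a constant drift rho. A positive rho
# biases the walk (and hence the predicted probability of outcome 1)
# upwards, as in an optimism bias; rho = 0 recovers the driftless walk.
rho, step_var, n = 0.05, 0.1, 300
x2 = np.cumsum(rho + rng.normal(0.0, np.sqrt(step_var), size=n))

# Passing x2 through the sigmoid confines the trajectory to (0, 1).
p1 = logistic(x2)
```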
If there are no questions at this point, I'll run you through the other elements of the graph. OK. So we have this level, and when we apply the sigmoid transformation, we get this one; there is no more information contained in this red line than in that red line. But up here, this is different. This is the volatility level, the level that determines how quickly the level below it evolves. You can see that when this level rises here, the line below wiggles around much more. And that's exactly what we induce by this period of volatility in the middle: the agent, our simulated little agent here, learned that the environment is likely to change a lot, so it adjusts its volatility beliefs upwards, and this leads to an increase in the learning rate. Now, of course, once we're back in this regime, that is somewhat inappropriate. But it's also very difficult for the agent to learn that this is actually a constant probability, because if you have an equal probability of getting a 1 or a 0, it all just looks very erratic.

If we continued this for a very long time, you would see the volatility estimate coming down again, but it takes a long while, because it's a confusing environment where basically there's nothing to learn; it's just random outcomes, random zeros and ones. Whereas these environments here, where you have an 80% probability of getting a 1, are easy environments, characterized by much lower entropy than the environment you have here at the start.

Well, no, this is purely a simulation. It is implied in the value of theta and the value of omega. The value of omega gives you what we call the evolution rate at this level, and theta gives you the evolution rate at this level: we have this thing at the third level, and theta is the variance here. That's basically the meaning of theta.

I think what you're getting at is a kind of exponential decay, am I right? And what's the time scale of that decay? You can have that, but you have to modify the model a bit. If you want an x3 that can move away from an equilibrium value and then decay back to that equilibrium value, you're going to have to modify the process a little. This here is a Gaussian random walk, but you can instead use an AR(1) process, which stands for autoregressive, first order. Then x3 depends on its own previous value: x3(k) is distributed as a Gaussian around x3(k-1) + phi3*(m3 - x3(k-1)) with variance theta. This means the equilibrium value is m3; the process will go back to m3, and the rate at which it goes back depends on the parameter phi3. Imagine x3 is greater than m3. Then the term phi3*(m3 - x3(k-1)) is negative, so x3 is pulled down. Conversely, if m3 is greater than x3, it is pulled up.

It depends on which parameter regime you're in. I haven't systematically explored this, but intuitively, you get a phase transition. On one side, in one phase, you get this relaxation: the agent learns that this is actually a constant probability. But if you're on the other side of the phase transition, the agent just looks totally confused. Left, right, left, right, 0, 1, 0, 1, and it gets more and more confused. At some point, the trajectory explodes.
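A minimal sketch of that mean-reverting third level: the update rule is the AR(1) form just described, while the function name, parameter values, and seed are my own illustration.

```python
import numpy as np

def simulate_x3_ar1(n, m3, phi3, theta, x0=0.0, seed=2):
    # AR(1) process at the third level:
    #   x3(k) ~ Normal(x3(k-1) + phi3 * (m3 - x3(k-1)), theta)
    # For 0 < phi3 < 1 the trajectory relaxes back towards the
    # equilibrium m3; phi3 = 0 recovers the pure Gaussian random walk.
    rng = np.random.default_rng(seed)
    x3 = np.empty(n)
    prev = x0
    for k in range(n):
        prev = prev + phi3 * (m3 - prev) + rng.normal(0.0, np.sqrt(theta))
        x3[k] = prev
    return x3

# Example: a walk pulled back towards m3 = 1 on a time scale set by phi3.
trajectory = simulate_x3_ar1(n=320, m3=1.0, phi3=0.1, theta=0.05)
```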
That is a very good question. Yes, yes. Let us have another look at the update equations. It's a bit complicated, but in the end it's straightforward. This governs the update at the third level, so let's look at mu3 and see what happens here. Let me use the mouse. Basically, we have our old mu3, or rather the prediction on mu3. If there is no drift, then the prediction on mu3 is simply the posterior from the previous trial. However, if we introduce the parameter rho I spoke about, then the prediction will be somewhat different from the posterior, because the model assumes that this quantity is drifting. That's why I write mu-hat3(k); i is going to be 3 here. So this is our prediction, and relative to our prediction we get an update that is a precision-weighted prediction error. This is the quantity v2, because i is 3, and it is defined down here. This is the precision of the prediction at the second level, pi-hat2. This is the posterior precision at the third level, this quantity here. And this is the interesting bit: this is the prediction error, and it's the prediction error on the volatility. You have the mu2 update here, so this is your prediction on mu2; and here we have delta(i-1), so in this case delta2.

So delta2 is mu2(k) minus mu-hat2(k), where mu-hat2 is the predicted mu2 and mu2 is the posterior mu2. We have to be careful about what we call the prior. Yes, in the trial-by-trial setting, it is the prior. But not with respect to the whole time series, the whole trajectory, because we also have initial values and priors on these initial values when we estimate. Just to mention the point. But for our update, yes, this is where we start; this is basically our prior mean. Exactly.

So if I switch back to the other slide, this is slide 42. In this case, because we have no drift, the prediction is simply the posterior from the previous trial: mu-hat2 at time k is simply mu2(k-1), the posterior from the previous trial. And we update mu3 based on the update in mu2, because the mu2 update gives us the prediction error that drives the update in mu3. If this update is large, the prediction error is more likely to be positive, and then the update on mu3 will be an increase. If it is very small, the prediction error is more likely to be negative, and then mu3 will decrease, yes. That's how the updates between the two levels are linked. And then you also have a link in the update to the precision, because the same prediction error here, which is a volatility prediction error, a prediction error about how the time series is changing, is it changing more or less than I believed, also enters the update for the precision. So at each level, we have a precision update and a mean update, and together these give us our posterior Gaussian. Exactly.

So is the model we chose itself a particular kind of HGF model? No. These are the generic HGF volatility prediction updates, and you can build all kinds of models with them. Again, I will give you a sneak preview of the kinds of things we can do. You can build a model like this, where you have two very different quantities here, two different quantities giving you one outcome. This one, we will see, is the mean of a distribution you're inferring; this one is the variance of a distribution you're inferring. That allows you to make an outcome prediction while updating your belief on the mean and on the variance of the generating distribution separately. And they both have their own volatility, so you have two HGF hierarchies going up here. You can do other kinds of things, like this here, where you have only this quantity generating the output and this other quantity pushing it around, for instance by taking their sum. You can take basically any kind of relationship between them, but this is just the simplest possible example, where it's just the sum. So if z1 is positive, it's pushing x1 in a positive direction; otherwise, it's pushing it down. And then x1 and z1 both have their volatility hierarchies on top of them, and so on.

So you can assemble these models modularly. In all of these cases, you can use the exact same framework we saw: the mean field approximation and the Laplace approximation we did, where we expanded around the expectation value to second order. You can use this exact same variational energy, and the update equations you can derive from it, for each of these nodes. And then for each node, you have an update for the mean and for the precision.
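Going back to the third-level update we walked through above, here is a schematic sketch of a single update step. The weighting v2 * pi-hat2 / pi3 and the driving prediction error follow the structure as described in the lecture; note that this is a simplified rendering (in the full HGF equations the volatility prediction error also involves variance terms), so treat it as an illustration, not the exact published update.

```python
def update_mu3(mu3_prev, mu2_post, mu2_hat, v2, pi2_hat, pi3):
    # Prediction at level 3: with no drift (rho = 0), the prediction is
    # simply the posterior from the previous trial.
    mu3_hat = mu3_prev
    # Prediction error at the level below (posterior minus prediction).
    # Simplified here; the full HGF volatility prediction error also
    # involves the variances.
    delta2 = mu2_post - mu2_hat
    # Precision-weighted update: the weight combines v2, the precision
    # of the prediction at level 2 (pi2_hat), and the posterior
    # precision at level 3 (pi3).
    return mu3_hat + (v2 * pi2_hat / pi3) * delta2
```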
The circles are constants. The diamonds here are also time series, but non-Markovian time series: time series that don't depend on their own previous state, whereas these hexagons do depend on their own previous state. We'll get to this notation in more detail. So that's basically it for this graph. Any more questions about this one? Oh, yes, sure.

What is the difference between x and mu? It is fundamental, in the sense that one belongs to the generative process and the other belongs to the inferential process, sometimes also called the recognition process. The generative process is the process we impute to the environment. We're basically saying: this is the way the environment works; this is the way x3 is generated. This is how the agent believes the environment to be. The x's are part of theta here. This is how we think the world works, our model of the world. Now, as we receive input, we invert that model. So this is the forward process, and this is solving the inverse problem here: we invert the model that we believe produces u from theta. The result of that inversion is lambda, which comprises the sufficient statistics of our beliefs about the environment, so the mu's and the pi's in our concrete case. So mu is part of lambda, and x is part of theta.

No, sorry, the parameters do not get updated; the beliefs about the states get updated. You can call these the representations. The technical name for lambda is the representations. They get updated, yes. And if you look at the equations, you get an interesting situation: depending on the structure, and you've seen that there are many different structures of models you can build, every model you build has a certain dependency among the updates. Some updates cannot take place before others, because the quantities needed in these updates first have to be calculated by a lower-level update. So yes, you have to do these updates in a certain order.

And now I'll give you another sneak peek at something I'm really excited about. This is not published yet? Yes. Oh, OK. I think Matteo may have gone to get the sheet; otherwise, I'll run and get it from Erica. OK, the sheet is on its way.

Since you asked about the order of the updates: these are EEG data, and the great advantage of EEG data is that they have an extremely high time resolution. These are milliseconds here. And you can see how these updates take place in the order that they have in the HGF hierarchy. Here are three quantities that basically don't depend on each other, so the updates here could theoretically take place in any order. Then, as you go higher in the hierarchy, you depend on the updates that are made first, lower down. And what you see here is the signatures of these updates: the time points at which the magnitudes of these quantities modulate the EEG signal. This is just a very quick preview; I will show you the experiment underlying this. We have three different prediction errors here: a cue-related prediction error, an advice prediction error, and an outcome prediction error. We have the precision of the belief about the accuracy of the advice; that's what advice precision means. And we have a volatility prediction error.
That's the prediction error about the volatility of the precision of the belief about the advice's accuracy. And then we have the precision of that volatility. So the higher up you get in the HGF hierarchy, the later you see the signature in the brain. This is evidence that the brain, too, processes these updates in order. First of all, it processes them at all, which is already very gratifying for me to see. But then it also processes them in the order you would expect from the update equations.

That one is from MRI. We did the same experiment with EEG and with MRI, because in EEG, as I said, we have the high time resolution. MRI is magnetic resonance imaging. In medical diagnostics, you usually do this in a structural way: if you've twisted your knee and you get an MRI of your knee, it takes several minutes to take a 3D picture of your knee, and then your doctor looks at this static picture and sees what's broken or what's still there. We do what is called functional MRI, where we take a picture of the brain in about two seconds, then every two seconds we take a new one, and we look at the changing oxygenation levels of the blood in the brain. The point is this: if a brain region is very active and using lots of oxygen, it asks for more oxygen, and this effect tends to overshoot. So, perhaps counterintuitively, the most active regions have the most oxygenated blood. And because the nuclear magnetic resonance properties of oxygenated blood differ from those of deoxygenated blood, we can tell which brain regions are highly oxygenated and which ones aren't. But the time resolution is only at about this two-second level, so much lower than we have in EEG. However, the spatial resolution goes down to cubes of about 2 millimeters. Those are so-called voxels, in analogy to the pixels you have in 2D images; in a 3D image, you have voxels. So the voxel size is about 2 millimeters cubed. And I think you were referring to what's here: yes, you see lots of these color-coded regions. These are the regions where we get significant activity based on our MRI measurements.

Good. Next question. Yes, so that's on the order of 500 milliseconds until you get to the top level, for a whole new input to be processed. Which fits roughly with the experience you have when you're in traffic and suddenly have to brake: that's about how long it takes for you to realize you need to brake. Further questions?

OK, has the sheet arrived? No, OK. Apparently Matteo has it, but Matteo has walked out, so we're going to have to wait for Matteo to return before we get the sheet. Anyway, we're going to take a break at some point, and we're scheduled to go on until 4:15, right? OK, so let's have a break now of about 10 minutes and then another three quarters of an hour. And I'll go in search of the sheet.

The exam you will have on Thursday. The exam will be a multiple-choice affair for the most part. I will ask questions that you should be able to answer by looking at the slides. The second batch of slides should now be available to you for download; I created a link to them. So if you look at the slides, you should be able to answer these questions. And then at the end of the exam, with slightly more points given, I will ask for a few calculations. Most, perhaps all, of them should also be not too difficult to derive from the slides. You're going to be alone.
It's going to be you and your pen. That's basically the deal. Actually, I don't know in what time frame you will need the results. In a week? Oh, you don't need them. OK. Well, I do need to make an exam; I would probably be in trouble if I didn't. So I've made the exam; it exists. We will just meet on Thursday, and you will sit the exam. And by looking at the slides, you should be able to answer the questions. So, yes?

Yes. We had this little discussion with the tutors a few days ago, and they said they would like to have some material for a tutorial. When would that tutorial be? Tomorrow. OK, so that would be rather short notice. One thing I can recommend: I can send a paper to the tutors and also to Erica, so she can send it to you. Let me just show the paper here; I have to adjust the display preferences, mirror displays. OK. This is a technical paper on the HGF with my colleagues, called Uncertainty in Perception and the Hierarchical Gaussian Filter. Let me give you a peek. This paper is about the technical details of the HGF. You will recognize this figure here, and here are all the update equations and so on, and some discussion of fitting and simulating. And then in the appendix: coupling between levels, we've been through that; variational inversion; and then the actual calculation of the variational energy. So in the concrete case of the HGF, we have the generic variational energy at a generic level; here we have the result of the quadratic expansion and how to find the updates for the mu's and the pi's. And then here, as an exercise, I recommend that you, in the tutorial, if that's all right with you, the tutors, go through the calculation of the variational energy, so you understand the details of where the update equations of the HGF come from. In the end, we have these variational energies, and we also have the variational energies for categorical outcomes, which I just showed you the results of. And then we get the results in the three-level HGF: the variational energy at the first level, at the second level, at the third level. From these, you can calculate the update equations. So especially appendices B, C, and D would, I think, be very instructive for a tutorial. Most of it is already given to you; you just have to fill in the blanks between the different steps. But being physicists, I trust you will be able to do that. So that's my idea for a tutorial.

And basically, for the exam: look at the slides again, and you will be able to answer the questions from the slides. There will be a few calculations. Maybe two thirds of the points will come from multiple-choice questions and a third from calculations, perhaps with somewhat more weight on the calculations; but between one half and two thirds of the points will be from multiple choice. Any questions on that? Yes.

A sample exercise for the calculations? They will be along the lines of what you saw during the lectures; I'll have to check whether this was in the first batch of slides. So let's go here: along the lines of these things here. There are a few steps in between, things like this, relations like these.
Well, I'm not going to tell you exactly which one it's going to be. But perhaps I'll ask you to show me why energy corresponds to the negative log joint when we map physics, statistical mechanics, onto information theory, something like that maybe. But it'll be nothing beyond basic algebra and calculus. So basic calculus, basic algebra, and of course background knowledge; but you can get the background knowledge from looking at the slides. Other questions? OK, in that case, yes? Is there negative marking? No. The rules for the multiple-choice questions will be that more than one answer can be correct, and you only get the points if you tick all the correct answers and none of the incorrect ones.

OK, so let's continue a little here. What we had here were three parameters that we set, and these three parameters defined the whole way this agent learned. Now we're going to fiddle around with these parameters and see what effects we get. Theta was 0.5, omega was minus 2.2, and kappa was 1.4.

First we're going to play around with theta. Conveniently, theta is already on the blackboard here: it's the variance at the third level. And you can see that if we reduce theta, that flattens the trajectory at the third level. Not very surprising. But there you can see how, by changing parameter values, you get different effects. What this also does is destroy the effect of the increased learning rate after the period of volatility. You can see how the trajectory during the first 100 inputs looks almost exactly like the one during the last 100 inputs, which are the exact same inputs, because the learning rate in this example doesn't increase. And that is due to the flat volatility trajectory here. So we got this very different pattern of learning simply by reducing theta from its reference value of 0.5.

The next parameter we're going to play around with is omega. Theta is back up at 0.5, and with respect to the reference, only omega is changed. Omega is the sort of evolution rate at the second level, and you can see how much flatter the second level is now that omega is lower, and how much less learning there is, even at the start. If you compare the learning here at the start, with omega at minus 4, to the learning with the higher omega, even at the start you have much less learning, and here the difference is even more extreme. And because not much is going on at the second level, nothing much is going on at the third level either, even though theta is now the same as at the start.

The third thing we can play around with is kappa. Kappa is the coupling between these two levels, because, as a reminder, x2(k) is distributed as a Gaussian around x2(k-1) with a variance of exp(kappa*x3(k) + omega). So kappa couples x2 and x3. If we reduce kappa from 1.4 to 0.2, the lowest level is not directly affected: we have the same amount of learning here at the beginning as in the reference trajectory. But because the volatility increase here isn't passed up to the third level as much as it is with kappa equal to 1.4, we see this repeated tendency of the volatility estimate to go down. So when the environment stabilizes, the entropy is reduced, the probability is at 0.8, you can see the volatility estimate coming down, and then going up again as the contingency changes. So everything is happening quite appropriately here, but you don't see the same kind of increase in the volatility estimate as in the reference trajectory, because the coupling between these two levels is now much weaker, at 0.2 as opposed to 1.4.
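Here is a minimal sketch of the generative process we've been varying, using the reference values theta = 0.5, omega = -2.2, kappa = 1.4 from the lecture; the function itself and the random seed are my own illustration.

```python
import numpy as np

def simulate_hierarchy(n, theta, omega, kappa, seed=3):
    # Three coupled levels:
    #   x3: Gaussian random walk with variance theta,
    #   x2: Gaussian random walk whose step variance exp(kappa*x3 + omega)
    #       is set by the current x3 (kappa couples the two levels),
    #   u:  binary outcome, Bernoulli with parameter sigmoid(x2).
    rng = np.random.default_rng(seed)
    x3, x2 = np.zeros(n), np.zeros(n)
    u = np.zeros(n, dtype=int)
    for k in range(1, n):
        x3[k] = x3[k - 1] + rng.normal(0.0, np.sqrt(theta))
        x2[k] = x2[k - 1] + rng.normal(0.0, np.sqrt(np.exp(kappa * x3[k] + omega)))
        u[k] = rng.binomial(1, 1.0 / (1.0 + np.exp(-x2[k])))
    return x3, x2, u

x3, x2, u = simulate_hierarchy(320, theta=0.5, omega=-2.2, kappa=1.4)
```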
Then what happens to the precision of the inference? At each level, we infer a mu and a pi; at the second level, mu2 and pi2. And if we plot the pi2 update, what we actually plot here is the square root of sigma, the square root of 1 over pi, so the square root of the inverse precision. Here you can see what goes on at the second level. We have the reference scenario at the top. Then, if we reduce theta, the uncertainty about mu3 also decreases. If we reduce omega, not much is learned at the second level, and because not much is learned there, we're also quite uncertain about what is going on at the third level. You can see this uncertainty about the third level increasing, both with respect to the reduced-theta scenario and with respect to the reference scenario. And if we reduce kappa, the information flow between the second and the third level is impaired, and this leads to lots of uncertainty at the third level, because the information is just not getting through from the second level. So these are the effects we get, and you can see the richness of the different behaviors we get just from playing around with these three parameters.

OK, here is the decision model. I think we saw this before. This is, again, the map allowing for a certain imprecision, a certain noise, in the mapping between beliefs and choices. This is what the subject chooses as its prediction in the simple example we had: after a high tone, does the subject predict a face or a house? This decision model has a noise parameter zeta. Zeta can be very high, which basically gives a step function: the sigmoid converges to a step function as zeta goes to infinity. That is an agent who always wants to be right: if this agent's estimate of the probability of getting an outcome of 1 is greater than 0.5, it always predicts 1, and as soon as the estimate falls below 0.5, it always predicts 0. In between, if we lower zeta, we get curves that allow for more exploratory behavior on the part of the agent, so we can also model people who sometimes act in a way that doesn't give them the highest probability of being right, perhaps because they want to explore, perhaps because they're not concentrating fully, for whatever reason.

This kind of decision noise parameter is like a safety valve for the modeler. We always want a bit of observation noise, because that relieves the pressure on us to get every prediction right: this decision model is basically the observation model for us, the experimenters. We infer the subject's beliefs, and then the subject makes a decision, y. We have to allow for some noise in that, because our model won't be perfect, and the agent, even on the basis of its own model, won't be entirely perfect either. We allow for that by having some decision noise with which we can explain away unexpected decisions.

Yes? Yes. So the sigmoid of mu2 is here. mu2 is on the real line, in R. If you pass it through a sigmoid, it's mapped onto the unit interval from 0 to 1.
And basically, the sigmoid of mu2 is your prediction about x1. Yes, yes, yes. So there is no direct mapping; it's just one function after another. The unit-square sigmoid here is like this. What am I going to call it? The logistic sigmoid is called S, so let's call the unit-square sigmoid USS. USS(x) = x^zeta / (x^zeta + (1 - x)^zeta). And what we have here is USS after S of mu2: S(mu2)^zeta / (S(mu2)^zeta + (1 - S(mu2))^zeta). That's the whole thing; that's what you see on the slide. And S itself is defined as S(x) = 1 / (1 + e^(-x)). You can now go and fill in S, and perhaps find some algebraic simplification. But that's basically the probability of the decision being 1 as a function of mu2. That's this.

Yes, yes. So I have the probability of the subject predicting an outcome of type 1, and I can simulate decisions by the subject this way, yes. But in many cases, I already have the decisions by the subject, and then I compare every actually observed decision to its probability under my model. I use this to fit the model, in the sense that I wiggle the parameters around in order to increase the probability of each decision that was actually observed. This happens in an optimization process that is different from the inference process we've been discussing. If you look at the paper the tutors are going to discuss with you, this is described in detail there; let me give you the equation numbers. This is the right paper, yes. I'm talking about the decision model calculations that we have. The abstract discussion of this is in the section Maximum A Posteriori Parameter Estimation, on page 5 of the paper, equations 18 to 21, 22.

So what we're doing when we're fitting the model, when we have data from our subjects, is finding a quantity psi star: the optimal psi. Inside psi are all the parameters of both the learning model, that is, the inference model, and the decision model. In the concrete case we had here, that would be omega, kappa, theta, zeta. Taken together, the set of these parameters is psi, and we want to find the optimal set of these parameters given the data, the observations y we have. So we take psi star to be the argmax over psi of the posterior of psi given y, where y are the decisions by the subject and u are the inputs we provided. We can unpack this and say it is the argmax with respect to psi of the following expression: the sum over k (k, again, is the trial index, the time index) of the log probability of y(k), given lambda(k) (which depends on chi, lambda(0), and u) and given zeta, plus the log prior on psi.

Let me unpack this for you. The y's are the subject's decisions, the observed decisions. The lambdas: lambda(k) is the sufficient statistics of the beliefs, mu2(k), pi2(k), mu3(k), pi3(k) in our simple model. Chi is the parameters of the inference model only, a subset of psi: omega, kappa, theta. Lambda(0) is the initial values of the belief trajectories: mu2(0), pi2(0), mu3(0), pi3(0). u is the inputs to the agent. And zeta is zeta; that's just the parameter of the decision model. And this is the prior on psi. And then we have to find an algorithm that does this.
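As a sketch of the decision model and this MAP objective in code: function names are mine, and for brevity the belief trajectory mu2 is taken as given, whereas in a real fit it would be recomputed from chi, lambda(0), and u on every evaluation.

```python
import numpy as np

def logistic(x):
    # S(x) = 1 / (1 + exp(-x)), the logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

def p_choice_1(mu2, zeta):
    # Unit-square sigmoid applied to S(mu2):
    #   p(y = 1) = s^zeta / (s^zeta + (1 - s)^zeta),  s = S(mu2).
    # zeta -> infinity gives a step function (always bet on the more
    # probable outcome); low zeta gives noisier, more exploratory choices.
    s = logistic(mu2)
    return s**zeta / (s**zeta + (1.0 - s)**zeta)

def neg_log_joint(y, mu2_traj, zeta, log_prior):
    # Negative of: sum_k log p(y(k) | lambda(k), zeta) + log p(psi),
    # i.e. the MAP objective above, negated for use with a minimizer.
    p1 = np.clip(p_choice_1(mu2_traj, zeta), 1e-9, 1.0 - 1e-9)
    log_lik = np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))
    return -(log_lik + log_prior)
```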
There are several candidates, and I'll have some slides on that. I mean, this looks very complicated, so one simple way to write it: what's here on the blackboard is radically simplified into this graph. We have a perceptual model, or inference model, and a decision model, or observation model. All of these expressions are used interchangeably: this can be called a decision model or an observation model; this can be called a perceptual model or an inference model. Shaded quantities are observed; they're known; we see them. These are the trial outcomes, is it a 1 or is it a 0, and these are the subject's decisions. They are observed, they are known; there is no uncertainty about them.

Then, hexagons. I'll show you the definition of the notation on the next slide. Hexagons are states that depend on their own previous state. So y(k) depends on y(k-1), and y(k+1) depends on y(k); in this sense, they are Markovian. The diamond notation corresponds to the so-called plate notation; if you're familiar with machine learning texts, you often see plate notation. The diamond notation is even simpler. The circles, of course, indicate constants, so they have no time index. The diamonds have a time index, but they don't depend on their own previous state. The hexagons also have a time index, and they do depend on their own previous state. So if you see a structure like this, a diamond with an arrow going into a hexagon with an arrow going into a diamond, that means you have states like this, where the hexagon states have arrows going from their own previous state to the current state, and the diamond states do not have these arrows. That's the notation we use here, and in this way, we can write models very compactly without getting confused. The notation is unambiguous, so there cannot be misunderstandings, and yet it's quite simple.

Parameter estimation: I've already been talking about this. It's basically this process of finding psi star, the optimal psi, the psi that we think was at work when the particular agent we observe produced its decisions. And the question, of course, is: if we have a concrete subject, are we able to estimate the kappa, the omega, the theta that were at work when that agent produced its decisions? Does it mean anything if one agent has a kappa of plus 1 and another has a kappa of, I don't know, 4? Does that difference mean anything at all?

We systematically explored this. What we did was draw a grid on which we varied the parameter kappa and the parameter zeta. The black bar is always the ground truth. So, for instance, let's look at the easiest case, where zeta is 24 and kappa is 0.5. This is the ground truth; this bar indicates kappa equals 0.5. And now we go and estimate. Of course, doing it once doesn't tell you anything; you have to do it 100 or 1,000 times. I think we did it 1,000 times. So we simulated an agent with a zeta of 24 and a kappa of 0.5, and we let that agent make decisions according to our model, according to the models you saw. What happens is that, with the particular kappa, omega, and theta this agent has, it will have a particular kind of belief trajectory, as we saw when we varied these parameters and got different trajectories. Then, on the basis of these beliefs, applying the decision model gives us simulated decisions.
And these decisions, in this case, were also just binary. So we had a series of, I think in this case around 400, zeros and ones, produced from a particular zeta and a particular kappa. And because there is decision noise, the unit-square sigmoid model gives us a probability for each decision, so we can do this 1,000 times and always get slightly different answers. That's the nature of a noisy system. And then we take what we get and try to estimate the parameters back. We see whether we get back what we put in. And because all the simulations are slightly different, all the estimates, the parameter estimates, will be slightly different.

Here you can see the case of zeta equal to 24; that means little decision noise, so the decisions always reflect the agent's beliefs very accurately. In that case, we can estimate the ground truth, the true parameter value of kappa, very well. You can see four little box plots here, and these correspond to four algorithms for determining the parameters, for basically doing this part here, determining the argmax with respect to psi. The four methods are, first, fminsearch, a built-in function in MATLAB; its fancy name is the Nelder-Mead simplex algorithm. Then GP, Gaussian process optimization, a global optimization method. VB is variational Bayes. We haven't gone into variational Bayes in detail, but it's also a variational optimization method where, as in the mean field approximation, you partition your parameter space, update one half of the parameters based on the sufficient statistics of the others, then turn around, and iterate until you converge. The fourth was MCMC, which stands for Markov chain Monte Carlo. This is sort of the gold standard, because you can prove that if you sample for an infinite time, you will get a sample distribution equal to your posterior distribution.

All methods work very well when we apply no decision noise or very little decision noise. You can see that in all cases it's possible to estimate kappa almost exactly: the kappa we put into the simulation, we also get out again. This means our estimation methods work. Then, if we increase the noise, three of our methods still work well; however, fminsearch, the Nelder-Mead simplex algorithm, which is in some sense the most primitive of these methods, starts showing some abnormalities here and here. And if we increase the noise even further, the Nelder-Mead simplex algorithm shows clear inadequacies, and the other methods also start working less and less well.

And if we have very much decision noise, that is a zeta of 0.5. If we go back a few slides, this is this curve, the zeta of 0.5; it goes like this. So you basically don't have much discriminatory power on your mu2: it's very hard to infer back from the decisions to mu2. In generating decisions, you go from this axis to this axis, but when doing inference, you're going from this axis back to that axis. And if you're here, in this region, then you can basically be in many places here. So it's hard to infer back on the state of belief the agent was in when the decision was made. And that's what you see here in these very wide distributions for the possible values of kappa that we infer when we repeat this process 1,000 times.
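A compressed sketch of this simulate-then-recover logic, using scipy's Nelder-Mead as the fminsearch analogue. To stay short, it recovers only zeta from a fixed, stand-in belief trajectory; in the actual study the full inference model would be re-run for each candidate parameter set, and kappa, omega, and theta would be estimated as well. All names and values here are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_choice_1(mu2, zeta):
    # Unit-square sigmoid decision model from above.
    s = logistic(mu2)
    return s**zeta / (s**zeta + (1.0 - s)**zeta)

rng = np.random.default_rng(4)
mu2 = rng.normal(0.0, 1.5, size=400)             # stand-in belief trajectory
zeta_true = 4.0
y = rng.binomial(1, p_choice_1(mu2, zeta_true))  # simulated binary decisions

def nll(params):
    # Negative log-likelihood of the simulated decisions; we optimize
    # log(zeta) so that zeta stays positive.
    p1 = np.clip(p_choice_1(mu2, np.exp(params[0])), 1e-9, 1.0 - 1e-9)
    return -np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))

fit = minimize(nll, x0=[0.0], method="Nelder-Mead")
# Estimate of zeta; repeating this many times yields the box plots.
print(np.exp(fit.x[0]))
```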
If we do it the other way around and look at how the zeta estimates come out plotted against kappa, we see that in all regions of kappa, we can estimate zeta reasonably well. What you can see is that all the estimates, or rather the medians of the estimate samples, are somewhat higher than the ground truth. This is a result of the prior we applied, because we wanted a so-called shrinkage prior on the decision noise. We wanted to force the model to explain decisions substantively, that is, by attributing differences in decision patterns to differences in learning between agents; we didn't want the model to take the easy way out and just attribute everything to noise. That's why we put a prior on the noise that pulled it downwards. So you can see that even when the noise is quite high, for low values of log zeta, the estimate is actually close to the truth, but slightly biased towards a low noise value, because we consciously forced the model to make a relatively low noise estimate.

Some further statistics on the parameter estimation, and comparisons between variational Bayes and MCMC. In these algorithms, we get a posterior interval, the interval where you think your parameter lies with 95% probability. If that interval is too wide, you're underconfident, and if it's too narrow, you're overconfident. So the interesting thing is to look at the proportion of posterior intervals containing the true zeta. You would want that proportion to be 0.95 for an algorithm with the right amount of confidence: it's your 95% posterior interval, so you want the true zeta to be in there 95% of the time. Otherwise, your algorithm is either overconfident, if a lower proportion of the intervals contain the true zeta, or underconfident. For the kappas, variational Bayes does a very good job of getting the right level of confidence; it's always very close to 0.95.

Then, looking at the errors in the estimates: of course, as we increase the noise, the errors in the estimates increase. This is for the zetas, the estimate of log zeta for different levels of noise, and this is for the kappas for different levels of zeta.

OK. So now we know we can actually infer individual belief trajectories. We have mechanisms, at least four algorithms, that give us tolerably accurate results when inferring the parameters underlying actually observed behavior. This is one example of real data: a real subject making real decisions that we recorded. The orange dots are decisions. Here, the subject predicts an outcome of type 0; here, the subject predicts an outcome of type 1. And where you see an x, the subject missed making the decision within the required time. So this is one learning trajectory inferred from actually recorded behavior. And this is another one. This subject misses a lot of trials and doesn't learn very much, so you can see his learning trajectory is quite flat, and his volatility trajectory also looks very different. So you can see vast inter-individual differences.

I think we're five minutes over time, and I'm, of course, happy to take one or two final questions. Otherwise, we'll see each other on, is it tomorrow or Wednesday? Wednesday. Good. See you Wednesday.