OK, so in what follows, we shall be looking at hierarchical Gaussian filtering. That will be the application of all the preliminary work we've done, all the conceptual insights we've had, to the problem of performing inference in a volatile, quickly changing environment.

So we've learned a lot about inference. But there's this question: does inference, as we've described it, adequately describe the situation of actual biological agents? We found various ways of trying to infer the underlying distribution of a series of outcomes that we see. Now, if we want to apply this to the real world — not just in the sense of economic data, but imagining a biological agent that moves in an environment where there are other animals, where there's night and day, where stuff changes and stuff moves — what did we not consider up to now? Basically, I already said it, but I want to hear it from you again. What did we not consider in the way we did inference up to now? Time, yes. Things change. In this picture here, we assumed that all these x's — x1, x2, x3, x4 — were drawn from a distribution that was static. What if, as time goes on, observations are more over here, and then suddenly they're over here? How do we deal with that?

And this is where this picture comes in again. This is the fundamental way we think about this. We have our agent here and the outside world here. The two connections, the two interfaces, are the sensory input the agent receives and the actions the agent takes. To take appropriate actions, the agent needs a tolerably realistic picture of the true states of the world. I call them hidden states because they're not directly accessible; they have to be inferred on the basis of sensory input. There's no way to look at them directly. On the basis of its sensory input, the agent constructs a picture of the world parameterized by this lambda here, and on the basis of the beliefs parameterized by lambda, it takes action. And then the cycle repeats.

So the dynamics were missing here. Now what about dynamics? As I already said, up to now we've only looked at inference on static quantities. Biological agents live in a continually changing world. In that boat example, we didn't consider that the boat's position changes all the time — and with it, the angle to the lighthouse. So one problem we've simply ignored is that older observations are worth less than newer observations. How can we take into account that old information becomes obsolete? As we already saw at the very start of this morning, our learning rate in the mean-updating formula becomes smaller and smaller — and likewise in the Gaussian update formula, because our posterior precision gets higher and higher. The learning rate becomes smaller and smaller because our equations were derived under the assumption that we're accumulating information about a stable quantity. And this assumption is wrong as soon as you live in a dynamic environment.

Now here's the question: what's the simplest way to keep the learning rate from going too low? The absolute simplest way. Yes? Steady? Ah, you said "study" — I thought you said "steady", but that would be another word for what I'm looking for. Yes? You could do that, but it wouldn't be the simplest way. Decrease it every time? It would do that automatically, and that's exactly what we want to prevent — we want it not to go too low. Increase it? OK, yeah.
You can do that, but it wouldn't be the simplest way to do it. A constant. Well, why not just make it constant? I think you can't get simpler than that: just use a constant learning rate. Nothing sophisticated, extremely simple — just keep it constant. Then the only thing you have to worry about is what the value of that constant is going to be.

So I've written the mean-updating formula again, and I've switched the indices around, replacing the n's and n-plus-1's with n-minus-1's and n's. So instead of 1/(n+1) we now have 1/n, but it's still the same equation. We take our 1/n, which we don't want to decline from time point to time point, and we simply replace it with a constant alpha. And this amounts to an exponential downweighting of observations as they recede into the past. I said this too early before, when it wasn't yet true; now it's true. If you use a constant learning rate, then it's easy to show that this implicitly amounts to downweighting your previous observations exponentially, the further they recede into the past. And the higher your learning rate, the steeper the exponential curve that downweights them.

In neuroscience and behavioral science, this is called Rescorla–Wagner learning, because it derives from a paper by Rescorla and Wagner in 1972. They didn't arrive at it by this kind of reasoning, but in some other way. It's been around for more than 40 years now, and despite its extreme simplicity, it has been quite successful at describing the way animals learn. These were two rat researchers, and they applied this to the way their rats learned — and it was a very successful description. You simply take your prediction error and update your value by multiplying the prediction error with a constant. That's all you need to describe many experiments very successfully.

But does a constant learning rate solve our problems? Remember, we wanted a learning rate that can reflect the way the environment changes: sometimes the environment changes quickly, sometimes more slowly. So does a constant learning rate solve our problems? Not really — only partly. It implies a certain rate of forgetting. Another way to see it is that it amounts to taking only the last n = 1/alpha data points into account. But it doesn't address the fundamental problem that we might have to adapt our learning rate in response to changes in the way the environment changes. The learning rate is supposed to reflect uncertainty in Bayesian inference, because a more quickly changing environment creates more uncertainty — the learning rate takes the role of uncertainty in Bayesian inference. And as we already saw, the learning rate in the Bayesian update was a ratio of precisions, so a ratio of inverse uncertainties. The question is: how do we know that alpha reflects the right level of uncertainty at any one time? That's what we need. We want an adaptive learning rate that accurately reflects the level of uncertainty at any one time, and — this is basically a corollary of the first point — we want alpha to be able to change in response to events. If it's constant, that's not possible. Again: what we really need is an adaptive learning rate that accurately reflects uncertainty. Now, this forces us to think a bit more about what kinds of uncertainty we're dealing with. What do we want reflected in the precisions that go into our learning rate?
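(Before we get to the kinds of uncertainty: here is a minimal sketch of the constant-learning-rate idea in code. This is my own illustration, not from the slides; the function names and the value of alpha are arbitrary choices.)

```python
import numpy as np

# Rescorla-Wagner updating with a constant learning rate alpha:
# each step adds alpha times the prediction error to the current estimate.
def rescorla_wagner(outcomes, alpha=0.1, mu0=0.0):
    mu = mu0
    trajectory = []
    for u in outcomes:
        delta = u - mu            # prediction error
        mu = mu + alpha * delta   # constant-learning-rate update
        trajectory.append(mu)
    return np.array(trajectory)

# Unrolling the recursion shows the implicit exponential downweighting:
# an observation j steps in the past carries weight alpha * (1 - alpha)**j.
def exponential_weights(n, alpha=0.1):
    j = np.arange(n)
    return alpha * (1.0 - alpha) ** j
```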
And there are different kinds of uncertainty that we have to deal with. One possible taxonomy of uncertainty — going back to, for instance, Yu and Dayan and other famous learning researchers — starts with outcome uncertainty as the first kind. This means anything that's not accounted for by your model; it's basically the precision of your likelihood. It is where you, as a modeler, in some sense give up and say: OK, this is what I choose not to model. You're all physicists, and you understand that we often talk about coin tosses as random events — whether it's heads or tails is a random event. But as physicists, you know that that's not actually the case. If you knew the initial conditions of the coin flip, you could calculate its outcome. It would be complicated, but you could; it's basically determined. If you could recreate the initial conditions exactly, you would get the same coin toss again. So it's actually a modeling choice to say that outcomes are random. Physically, they're not random; it's just that if we toss a fair coin 1,000 times, we'll get around 500 heads and about 500 tails. This outcome uncertainty is what we choose not to model. We say: for the purpose of what we're doing, it's much too complicated to look at the physics of each coin toss, so we're just going to call it random. It's often called irreducible uncertainty — but it's not physically irreducible, only irreducible in the sense that we've chosen not to model it. In our Bayesian example, it was the likelihood precision. Economists call this risk.

Then another kind of uncertainty is informational uncertainty, sometimes called expected uncertainty. We are uncertain about the value of model parameters, and we have to infer them. Staying with the coin-toss example: the irreducible uncertainty — the outcome uncertainty, the first kind — is what is left after you already know exactly what parameter governs your outcome. If you know your coin is fair, then you know the parameter is 0.5, but you're still left with that first kind of uncertainty. Now let's say you don't know whether you have a fair coin. This is an uncertainty you can do something about: you can go and toss your coin 1,000 times and learn about your parameter. If you get 562 heads and the rest tails, then you know your parameter is about 0.56. So this is a reducible kind of uncertainty, something you can learn about. It's called informational uncertainty, or expected uncertainty, because you know how uncertain you are about this parameter.

And now imagine that during your 1,000 coin tosses you have to go to the loo. When you come back, you're not entirely certain whether somebody came in and changed the coin. This is environmental uncertainty. It is sometimes called unexpected uncertainty — in some sense a slightly problematic name — but it is the uncertainty that comes from not knowing whether the parameters in your model have changed. So these are the three kinds of uncertainty you have to deal with: the irreducible one, the outcome uncertainty, what the economists call risk; your lack of knowledge about the parameters, even if they were constant; and your lack of knowledge about the movement of your parameters or states — I'm going to call them states if they move.

Several efforts have been made to deal with this problem. These are some examples.
The first one is a famous one: the so-called Kalman filter. It is sometimes credited with putting men on the moon because it was used in guiding the Apollo rockets. The Kalman filter works very well with physical systems that are linear — it is provably optimal for linear dynamical systems. The problem is that realistic data — and by realistic data I mean the settings we work with, where we don't have a relatively simple object like a rocket going to the moon, but a very complex environment with interacting agents — behave nonlinearly, and there the Kalman filter breaks down. Very briefly, what happens with the Kalman filter is this: it starts out with an acceptably high learning rate, and then, as in the mean updating, the learning rate goes down as it learns about its environment. But when the environment suddenly changes, it has no mechanism to increase the learning rate appropriately. It doesn't go down indefinitely — it doesn't go to zero, as it would if you just continued the mean updating; it converges to a certain level. But once it has converged to that level, it effectively works like Rescorla–Wagner learning: it has a constant learning rate. So it decreases at the start and then converges to something that remains essentially constant — and it falls on its nose when unexpected stuff starts to happen. Then these others: there are reinforcement learning efforts by Sutton, and these down here are more Bayesian efforts. And I'll let you guess which of these approaches we'll focus on in the rest of the lecture. We'll focus on that one. We use a generic nonlinear Bayesian model that allows us to derive update equations that are optimal in the sense that they minimize surprise — a feature the other approaches don't have.

OK, so what's the big idea? The big idea is actually very, very simple. We have a quantity of interest, x1. This is the quantity we want to track, and it moves in time. Sometimes our x1 moves slowly — it drifts up and down very gradually — and then it moves more quickly, and then it slows down again. That's basically what we want to model. On the basis of noisy observations of x1, we want to infer the position of x1, and we want to minimize our surprise at our observations of x1. What we assume is that x1 evolves in discrete time — because we observe it at discrete intervals — as a Gaussian random walk. The index up here is not an exponent; that's why I put it in parentheses. It's a time index. So the probability of finding x1 at a particular place at time k is distributed as a Gaussian around where it was at the previous time, k minus 1. That is, the mean of the Gaussian distribution describing x1 at time k is x1 at time k minus 1. Very simple and straightforward — this is called a Gaussian random walk.

Now, the other variable we need in our Gaussian random walk is the variance, and this is where the hierarchy comes in. The variance of the Gaussian random walk in x1 is a function of a quantity x2 — and at this moment, the only constraint on that function is that it's positive, because it's a variance. And now the whole game repeats: x2, in turn, moves sometimes slowly and sometimes quickly. And the higher x2 is, the more quickly x1 moves, because this is a monotonic function: x1 moves more quickly when x2 is large and more slowly when x2 is small. And no — it is not the same particle; x1 and x2 are different quantities by construction.
Basically, think of them as states of the environment — I wouldn't think of them as physical particles. There is no physical model that I think works like this. So it's a hierarchy of Gaussian random walks coupled via the variance: the next higher level always determines the variance of the level below, until we get to the nth level, where we have a constant variance theta. Because at some point we have to stop — it's like doing your Taylor expansion up to nth order. We build the model up to the nth level.

Yes, this will arise in practice in the sense that when we update our beliefs about x1, we will have to use the latest state we knew of x2. We will have to look into the past: where was x2? And then we can infer the new x1. Then, on the basis of the new — well, it's not the new x1, it's actually the new sufficient statistics for x1, so mu1 and pi1 — on the basis of mu1 and pi1, we can infer mu2 and pi2, and so on up through the hierarchy, whenever a new observation comes in. Why Gaussian? Because it's more tractable if it's Gaussian. Yes, x1 is the lowest level. So how it moves depends on the higher levels.

At the observation level — we call our observations u or y, mostly u here, because later there's going to be a twist where y is another, more proximal observation — this observation here we're going to call u. Now, u is going to be Gaussian distributed around x1. So u at time k (these are k's, yes?) is Gaussian distributed around x1 with a variance we call pi-hat-u to the minus one. This is the observation variance corresponding to the pi-epsilon we had in the simple Gaussian model. So this is our observation, and we say, OK, we have some observation noise. That's the simple observation model — basically, this is our likelihood. Yes, they're at the same time. That's the observation model, the likelihood.

Is x2 like a velocity? It's not quite as simple as that — you can interpret it that way, but a velocity is a vector, it goes in a direction. This just tells you how far x1 is likely to travel, in whatever direction. Yes — you're right, I should have put an index in here. In this model, we always have to distinguish the generative model, which is this, from the inference process. In the inference process — when we get a new u, and from that u we infer x1, x2, and so on — we will have to look into the past. You will see this when we derive the update equations. Yes, it is a function: the variance is a function. And we take it at this time. So basically we have x2, and we generate a new x1 from it. Of course, when you generate from this here, you have to take the last state you had when you generate the new one. As you generate from here, you go down like this — so you have the x2 at time k.

You always have to distinguish the generative side. This is a generative model — a generative model of the time series. This is how our time series of u's comes about, and you can write a little program that does this: first it samples a new xn, then a new xn-minus-1, and so on down, then a new x2, then a new x1, and then it samples the observation u. Do you agree? I think this is where your apprehension comes from. But when we perform inference — when all I give you is the u, and you have to use this u to perform inference on x1, x2, up to xn — then you will have to work the other way around, from the bottom up.
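As for that little program that samples top-down, a minimal sketch might look like this for a two-level hierarchy. The parameter values are my own toy assumptions, and the choice of f = exp anticipates the specific exponential form introduced shortly; any positive f would do at this point.

```python
import numpy as np

# Generative process: a hierarchy of Gaussian random walks coupled via the
# variance, sampled top-down, plus a noisy observation u of x1.
def simulate_hgf(n_steps, theta=0.1, f=np.exp, obs_var=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x2 = np.zeros(n_steps)  # top level: random walk with constant variance theta
    x1 = np.zeros(n_steps)  # lower level: random walk whose variance is f(x2)
    u = np.zeros(n_steps)   # observations
    for k in range(1, n_steps):
        # exactly the order described above: first x2, then x1, then u
        x2[k] = rng.normal(x2[k - 1], np.sqrt(theta))
        x1[k] = rng.normal(x1[k - 1], np.sqrt(f(x2[k])))
        u[k] = rng.normal(x1[k], np.sqrt(obs_var))
    return x1, x2, u
```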
And in inferring the new x1, you will have to use your last state of knowledge about x2 — and that will be from k minus 1. This process down here is the so-called generative process, the process that we assume generates the time series. And our little agent, who wants to learn about its world, performs an inferential process. This inferential process proceeds in the other direction, from the bottom up, because all our little agent has is the observations. Yes, absolutely — I totally understand that, and that's why we're going to look at this in detail.

So, in a nutshell, what we're going to do is this: we take our model, we do a mean field approximation, and then we do something close to a Laplace approximation, but better. This will give us update equations for the sufficient statistics. So I will have an update equation, on the basis of u, for mu1 and pi1; then an update equation for mu2 and pi2; and for mu3, pi3, up to mu-n, pi-n — on the basis that my belief about all the x's is Gaussian, so that two sufficient statistics are enough to describe my posterior about each x. At each level, I need an update equation for my mean and for my precision. No, they're not simple — but we're going to radically simplify them.

The first thing we're going to do is look at these f's and restrict them. Right now they're very unrestricted: all I said was that they're positive and monotonic. We're going to narrow that down to a much narrower class of functions. OK. Now, the promise of this is that it provides a generic solution to the problem of adapting one's learning rate in a volatile environment — and what is more, it does so in an efficient way. Variational inversion leads to update equations that are precision-weighted prediction errors, as always.

Now, turning to the f's. The f's have to be positive. And what do we do when we want to simplify something? We do a Taylor expansion. The thing with f is that because it has to be positive, we cannot expand it directly: if f is a polynomial, it will in general turn negative at some point. Yes — the only reason I want monotonicity is that it's a requirement I put into the model. Mathematics doesn't force me to put it in; I just want a situation where x1 moves more quickly when x2 rises. That's just me, as the constructor of this model, demanding monotonicity. So what we do is take the logarithm of f, which has to exist: for every f that is positive everywhere, there is a function g such that the exponential of g is f. So basically, we take the logarithm of f, expand that, rearrange a bit — and then this is our definition of the parameter kappa and of our parameter omega. So this is the class of f's that we allow: our f's are always the exponential of a linear function, kappa x plus omega. (I'll write this out below.) We take the exponential, and this is positive everywhere and monotonically increasing. Yes — that's a nice way to interpret it. These are all the f's that we allow.

So now we will do this on the blackboard — just a little preview of how it will work. As we saw, we can do a mean field approximation: we can separate our levels, we're going to partition our model. There we partitioned the thetas, and I said the x's — the states — are a special case of thetas. So we're going to partition our x's: each x gets its own partition.
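To write out the restriction on f from a moment ago — this is my rendering, and I'm assuming the expansion point is x = 0, which the transcript doesn't pin down:

```latex
g(x) = \log f(x), \qquad
g(x) \approx g(0) + g'(0)\,x
\quad\Longrightarrow\quad
f(x) \approx e^{\kappa x + \omega},
\qquad \kappa \equiv g'(0), \quad \omega \equiv g(0)
```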
Then we're going to do the mean field approximation. What the mean field approximation gives us is a variational energy at each level: a variational energy for x1, one for x2, one for x3, and so on. Now, in blue, imagine that this is your true variational energy. It is not exactly Gaussian — we will write it down, you can look at it, and you will see it is not Gaussian, that is, not quadratic in log space. But we want something quadratic, because we want our posterior to be Gaussian; we're enforcing that — we're going to say our q has to be Gaussian. So, if we did the Laplace approximation, we would go in search of the maximum of I and do our quadratic expansion at the maximum of I. But we don't do that; we do something cleverer. We take our expectation — the previous mean of our belief about the x, the mu at time k minus 1 — and we use this as our expansion point. Then we have what we call I-tilde of x, a quadratic approximation to the true variational energy, and different from what doing Laplace would have given us. So in blue we have the true variational energy; in green, what Laplace would have given us; and in red, what we are actually doing: expanding around the previous mean. And this gives us update equations for mu — and here I wrote it for sigma, but for pi it's the same — it gives us these update equations. The ingredients of these update equations are, because they are update equations, the previous mu at time k minus 1, and the variational energy, the I-tilde, evaluated at this point — because we're expanding here. That's the big trick: because we're expanding here, I and I-tilde are the same at this point, so we can evaluate I-tilde at this point and it will be equivalent to I. And this gives us the update equations for mu and for sigma. We'll do this in detail; this is a basic overview, a preview. So: mean field approximation, quadratic approximation — and this leads to simple one-step update equations.

And what is absolutely remarkable is that these have the same structure as the value updates in Rescorla–Wagner learning, and as in Bayesian updates everywhere, even when you do them exactly. This is only an approximate update, of course — we have two approximations in there, a mean field approximation and a quadratic approximation — but even though it is approximate, it gives you updates that are precision-weighted prediction errors. So delta-mu-i, your update to mu at the i-th level, is a precision-weighted prediction error. And because this is a hierarchical model, it is the prediction error at the level below that drives your update at the level above: the prediction error at the (i minus 1)-th level drives the update at the i-th level. And this is how we update our beliefs — several people were asking themselves how we do that; it is because the prediction error at the level below is used for the update at the level above. So precision determines the learning rate — which is exactly what we want to have. And this drops out of the equations; it's not because we put it in by hand. We have exactly the precision of the prediction onto the lower level. Hats, from now on, always mean predictions.
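Schematically, the structure just described — with hats marking predictions — reads:

```latex
\Delta\mu_i \;\propto\; \frac{\hat{\pi}_{i-1}}{\pi_i}\,\delta_{i-1}
```

that is, the precision of the prediction onto the level below, over the posterior precision at the current level, weighting the prediction error from the level below.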
So pi-hat is the precision of a prediction; pi without a hat is a posterior precision. Prediction, posterior. In the numerator, we have the precision of the prediction onto the level below — the i-th level is predicting the (i minus 1)-th level, and this is the precision of that prediction. And in the denominator, the posterior precision at the current level, the i-th level. So again we have this precision weighting of the prediction error. And this is the full update equation that you get at the higher levels of the HGF. It looks a bit more complicated, and we'll see what all these terms mean. But again, here you have the precision ratio — the one you saw on the slide before — and you have the prediction error. And this was our update in the simple Gaussian case: the structure is the same, the exact same structure, a precision-weighted prediction error.

Now let's look at the outcome level, here at the bottom. Here we have exactly what I wrote down before: this is our observation model. This is how we observe x1 — with a certain level of precision, or noise if you take the inverse of the precision. And now we look at how the update for x1 looks at this outcome level. The update you saw on the slide before was a higher-level update; this is the outcome level. Very similar to the simple Gaussian model: the precision update — the pi1 update — is the precision of the prediction plus the precision of the observation. The first is the precision with which we believe we know x1 at the time we make the observation; the second is the precision with which we observe u. Together they give us the posterior precision on x1. The update of the mean is a precision-weighted prediction error, with the observation precision in the numerator and the posterior precision in the denominator, as usual — and this here is the prediction error. Now, the interesting part will be the learning rate. Let's take it apart and see what's inside it.

Yes — hatted variables are always predictions; they always refer to predictions. We always have a prediction step and then an update step to the posterior. So we go from time point to time point in two steps. First, we start with our posterior from the previous update. Then, basically, time evolves before we make the next observation, and we have to adjust our parameters for that. In these simple models, we don't have any drift; we don't have states pushing each other around. But in more complicated models — I know they may look complicated already, but once you've dealt with them for a while you'll see this is the simplest case — the states actually do push each other around. So before you make your next observation, stuff will have happened, and you update your posteriors to predictions: you have new predictions based on the time that has elapsed between your last observation and the next. That's what the hats refer to. First, from the posteriors, you go to the hats, the predictions. Then you make your next observation, and based on that observation you go from the hats to the next posteriors — which don't have hats again. That's how it goes, yes? Yes. Well, these are precisions. The observation goes into the update of the mean, and the observation here is a simple scalar. In other cases, we will have observations that are squares — as when we were trying to infer the width of the Gaussian, there will be squares of changes at the level below.
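For reference, the simple outcome-level update just described, written out in my notation (with pi-hat-u the observation precision):

```latex
\pi_1^{(k)} = \hat{\pi}_1^{(k)} + \hat{\pi}_u,
\qquad
\mu_1^{(k)} = \hat{\mu}_1^{(k)}
+ \frac{\hat{\pi}_u}{\pi_1^{(k)}}\,\bigl(u^{(k)} - \hat{\mu}_1^{(k)}\bigr)
```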
At those higher levels, you will then have a much more complicated update to the precision. But because here you're just making one scalar observation, this is how your update takes place. So let's take the learning rate apart. This is just copied over from the previous slide — we take just the part we're interested in, the learning rate, yes? Necessarily, the precisions at the lower levels have to be updated before we update the higher levels. And you will see that the higher levels influence the precision of your prediction on x1 — on the next slide, one step further, you will see that the higher level is in here, hidden in here.

So: f is part of the generative process, and pi is part of the inference process. f comes into play when you go from the top down, when you generate new data; pi is in play when you do inference. The f is what determines your next x1. You have x1 here — this is x1 at time k, this is x1 at time k minus 1, and this is f of x2 at time k. So f comes into play when you take your x1 at k minus 1 and your x2 and generate a new x1. This is how we model the world working outside. Now, in our heads, we get new observations u, and we update our beliefs about x1, and our belief about x1 is described by mu1 and pi1. That's what's in our head. Outside in the world, there is x1 — that's the reality, and that's how reality evolves. However, we do not have direct access to reality; we only have access to u. That's our interface to the outside world. On the basis of u, we have to form an opinion, a prediction, about x1. And this belief about x1 — belief in the formal sense of a probability distribution, our probability distribution on x1 — is characterized by the two sufficient statistics, mu1 and pi1. So this part belongs to the generative process, and this part to the inference process.

And now we're going to take the learning rate apart. What we see is this. The posterior precision, as we saw on the previous slide, is the precision of the prediction on x1 plus the precision of the observation. Now we take this part here, the precision of the prediction. You have to derive it — and if you do, you see it is this. And now let's look at what this means for our uncertainty. This here is the outcome uncertainty: our observation precision, the irreducible part of our uncertainty. This is, again, what the economists refer to as risk — where we, as modellers, say we're not going to bother with the details of that; we're just going to say our observations are a bit noisy. So we have this kind of uncertainty in our learning rate. Then informational uncertainty: sigma1, which is simply the inverse of pi1 — how uncertain we are about x1 — where k minus 1 is the index of the previous time point. Our uncertainty about x1 is the informational uncertainty: this is what we can learn about. In principle, if we make lots and lots of observations of x1, sigma1 will decrease. And the third kind of uncertainty, the environmental uncertainty, is also represented: this is the states of our environment changing. Here you have the higher level — here you have mu2. And because the inference process goes from the bottom up, before you've done the update that this is the learning rate for, you cannot yet update the higher level. So you have to take the last value you had in here: this is k minus 1, and this is also k minus 1.
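Putting the pieces together — this is my reconstruction of the decomposition being described, using the exponential-of-linear f from before:

```latex
\frac{1}{\hat{\pi}_1^{(k)}}
= \underbrace{\sigma_1^{(k-1)}}_{\text{informational}}
+ \underbrace{e^{\kappa\,\mu_2^{(k-1)} + \omega}}_{\text{environmental}},
\qquad
\frac{\hat{\pi}_u}{\pi_1^{(k)}}
= \frac{\sigma_1^{(k-1)} + e^{\kappa\,\mu_2^{(k-1)} + \omega}}
       {\sigma_1^{(k-1)} + e^{\kappa\,\mu_2^{(k-1)} + \omega}
        + \underbrace{\hat{\pi}_u^{-1}}_{\text{outcome (risk)}}}
```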
And you can see that the learning rate will be larger the larger mu2 is. The larger we believe x2 to be, the larger — the more quickly — we believe x1 to move around. Because this sits in the denominator of the denominator, the learning rate increases with increasing mu2. So all the kinds of uncertainty that we said we had to deal with are dealt with in this learning rate. And this is not because we put them in by hand; this is what falls out of the analytic derivation of the update equation. Now, how does this work in practice? Yes — the environmental uncertainty is like an entropy. I actually think that's a good way to think about it, because the higher the entropy in your system, the less predictable it is. That's exactly what's happening here.

So what we do here is simply apply the HGF to a financial time series: the exchange rate of the US dollar against the Swiss franc during 2010 and much of 2011 — just the closing rate on each day. The US dollar and the Swiss franc are usually about the same value, so it starts close to 1, and then it sort of wanders along here. But there are two really interesting points: the first one is here, the second one here. Now, what happens here? This is in about April 2010, when the markets suddenly realize that Greece is effectively broke and that the eurozone has a problem. This leads to people buying lots and lots of dollars, and the value of the dollar rises, also against the Swiss franc. So you see the spike in the dollar, and you see that the learning curve in red lags for a while behind the green exchange-rate curve, because the system wasn't expecting this. What the system expects is what you see here: little fluctuations, with the curve smoothly going on, unimpressed by these fluctuations — as it should be. So at first it is confused: it tries, of course, to see this as just another one of those fluctuations that will go away. But no — there's something systematic behind it. And where you see this is here, at the next higher level of the HGF: the belief about x2 shoots up because the system realizes, oh no, something has changed — I am now in a regime where x1 moves around much more quickly than before. So this shoots up, the learning rate increases, and the learning curve catches up with reality. And as the markets settle down, so does the volatility estimate.

But then — this is around, what was it, September 2011 — the euro crisis deepens and deepens, and suddenly the markets realize that the Swiss franc isn't the euro, and they start buying Swiss francs and selling euros. So the value of the Swiss franc starts to rise and rise, also against the US dollar, and the value of the US dollar starts plunging against the franc — more and more and more. At this point — and this is bad for the Swiss economy, because foreign tourists have more expensive holidays in Switzerland, and it's bad for the Swiss export industry because all Swiss products cost more abroad, and so on — the Swiss central bank intervenes and says: we're putting a floor under how far the euro can fall against the franc, simply by printing a new franc for every franc that is sold in exchange for a euro. And this leads to an immediate turnaround in the exchange rate between the franc and the dollar: the dollar rises back again. And again, this is reflected here.
And if you even do a third level, where you have the volatility of the volatility, the third level realizes: oh, x2 is now moving much faster. So x3 increases, but only mildly. And here you have the same again. What you can also see is that if you do the same here, but without including a third level, your second level will jiggle around much more. So the function the higher levels have is that of smoothing the lower levels — of making the inference process less erratic, in some sense, than if you only had fewer levels.

Now, what was what? x2 is the volatility — it's exactly the HGF model that I showed. x1 is the exchange rate: we assume we observe sort of the true value of the dollar with respect to the Swiss franc with some noise — there's noise in the markets — so x1 is sort of the true value of the US dollar expressed in Swiss francs. x2 is the volatility of that, just the next level up, by exactly the equations I showed. And then x3 is the volatility of x2 — so the volatility of the volatility of x1.

So now — yes? The shaded area is the uncertainty. It basically gives you an idea of pi2 and pi3: what I'm shading is the 95% interval — the square root of pi inverse, with some factor applied, I think. Why is the uncertainty here decreasing? For that we would have to go through the details of the update equations; the intuition is simply that the world has become more certain again. Our belief about x3 is becoming more precise here: at this point, our uncertainty about x3 is maximal, and then it decreases again. You see it when you look at the update equations. This is the magic of these volatility prediction errors: we're basing these updates on the squares of the updates at the level below, and when we're very surprised by how much our time series moves, we increase our uncertainty. This is something I said in one of the earlier lectures: if you're just doing a kind of information-accumulating inference, your precision will always increase. But if you're learning another thing — and here we are learning another thing, namely the volatility — then an increase in the volatility can push up the uncertainty of another state. That's exactly what happens here. Yes? Of the second — of the third, yeah. So this, again, is because something's happening down there; I don't know exactly why it's increasing here. It may be a reflection of the fact that the timescale of the fluctuations is changing — here you have sort of quicker wiggles, and then here, I don't know. It would be an interesting question to look at in more detail. What it certainly reflects is that the system is getting more uncertain about the level of x3.

And here is a depiction of the precision weights — so basically of the learning rates at each level. And you can see how these learning rates shoot up when something interesting happens. The green line is the learning rate of the second level; it shoots up most markedly when interesting stuff happens. This is basically a sign that the system has realized it no longer understands the world it's trying to predict and has to learn anew — so it increases its learning rate very quickly. And then, as prediction errors go down, so does the learning rate: the system is satisfied, again, that it knows what's going on, so it can reduce its learning rate. Because you don't want a learning rate that's too high: if your learning rate is very high, you're throwing away too much information, and in a more or less stable environment, information from long ago is still very informative. So yeah, this is the magic of precision weighting.
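Before we wrap up — to make the whole bottom-up inference loop concrete, here is a compact sketch of a two-level continuous HGF filter. The second-level update follows the published form in Mathys et al. (2011) as I recall it, not the lecture's slides; the parameter values and variable names are my own toy choices.

```python
import numpy as np

def hgf_filter(us, kappa=1.0, omega=-4.0, theta=0.5, pi_u=1e2):
    """Minimal two-level HGF sketch: track x1 (and its log-volatility x2)
    from noisy observations u. Toy parameters; second-level equations
    follow the form in Mathys et al. (2011)."""
    mu1, sigma1 = 0.0, 1.0   # belief about x1: mean and variance
    mu2, sigma2 = 0.0, 1.0   # belief about x2: mean and variance
    out = []
    for u in us:
        # --- prediction step: time passes, beliefs become less precise ---
        v1 = np.exp(kappa * mu2 + omega)   # predicted random-walk variance of x1
        pihat1 = 1.0 / (sigma1 + v1)       # precision of the prediction on x1
        pihat2 = 1.0 / (sigma2 + theta)    # precision of the prediction on x2
        muhat1, muhat2 = mu1, mu2          # no drift in this simple model

        # --- level-1 update: precision-weighted prediction error ---
        pi1 = pihat1 + pi_u
        mu1 = muhat1 + (pi_u / pi1) * (u - muhat1)
        sigma1 = 1.0 / pi1

        # --- level-2 update: driven by the volatility prediction error,
        #     i.e. the squared update at the level below ---
        delta1 = (sigma1 + (mu1 - muhat1) ** 2) * pihat1 - 1.0
        w1 = v1 * pihat1
        pi2 = pihat2 + 0.5 * kappa**2 * w1 * (w1 + (2.0 * w1 - 1.0) * delta1)
        pi2 = max(pi2, 1e-6)  # guard: the quadratic approximation can misbehave
        mu2 = muhat2 + 0.5 * kappa * (w1 / pi2) * delta1
        sigma2 = 1.0 / pi2

        out.append((mu1, mu2, pi_u / pi1))  # track the level-1 learning rate too
    return out
```

Feeding this filter the output of the generative sampler from earlier should show exactly the behavior described above: the recorded learning rate shoots up when the volatility increases and decays again as prediction errors settle.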
OK. So, I do think it's been a long morning. I'm glad you enjoyed it as much as I did — no, really, I did enjoy it. And we'll see each other tomorrow, if there aren't any immediate questions. OK.