He came back specifically for the class and then had to step out again, so you're stuck with me for today. But what's nice about today is that most of the material I'll be covering is actually my own work, so I can actually answer questions about it, unlike most other topics. Does anyone have any questions before we get started? Any questions about the homework, or about whether yesterday's material made sense?

I have a question. Yeah. The weights on the basis set, g — what sort of initialization should we use for those? What's the initial value?

So finding the initial value is actually the hardest part of the problem. Someone stole all the markers. If you remember, there are basically three equations that determine the model. The sensitivity on trial n is a function of the error e(n) — not multiplied by e(n) — namely the sum over all N basis functions:

η(e(n)) = Σᵢ wᵢ gᵢ(e(n)),

where each gᵢ is a standard Gaussian basis function,

gᵢ(e(n)) = exp(−(e(n) − μᵢ)² / (2σ²)).

So you can imagine you have some error space. Here's zero error, here's 1, here's 2, minus 1, minus 2 — arbitrary units. For the homework, I'm just giving you plus 1 and minus 1 errors, or plus 1 perturbations and zero perturbations. In a real case, if you were making reaching movements, this would be in physical units — whatever your error is. But in the homework it's just arbitrary units. And you have bases centered at particular μᵢ's that span the error space — you choose how many bases you want — whose height is initially some value w₀.

And your question was: what is this value of w₀? I have not given you the value of w₀. What I am telling you is that for all errors, if you sum up all of the bases at every single one of the errors, you should end up with a constant sensitivity equal to 0.1 — 10%.

That's the naive learner? That's the naive learner. Before they've even seen any errors? Before they've experienced any errors. Just like in Part A of the homework, we arbitrarily set η equal to 10%. What that means is that when you see an error equal to one arbitrary unit, on the next trial you change the output of the system to compensate for 10% of that error. So now your output is 0.1; on the next trial it's going to be just less than 0.2, as you exponentially approach the perturbation. The naive learner that does not adapt its error sensitivity has a fixed sensitivity of 0.1. Your naive learner initially has a sensitivity to error of 0.1, but that sensitivity is going to change as a function of the history of errors. Does that make sense?

So w₀ is going to be a heck of a lot smaller than 0.1, and it depends on how many bases you have. If you have 15 bases spanning this space, it's going to be really, really small; if you have 100 bases, it's going to be even smaller, because you're summing up each of the individual values of the Gaussians.
So the way to do it: you can actually figure it out from how wide the Gaussians are — the width of the Gaussians. But to be honest, the easiest way is to take the η value of 0.1 that I'm giving you, start with a w that's very, very small — basically 0 — and just add tiny amounts to it until the sum of all the Gaussians equals 0.1. Does that make sense? You can do it iteratively. It's a hack.

So wᵢ is constant over i? wᵢ is constant initially, but it won't stay constant over i. Well, could you have an initialization where they're not constant but still give you 0.1 within your range of errors? Yeah — if they were not spaced evenly, or if they were stacked on top of each other. Even if they're spaced evenly, you could have weights that vary but cancel out to give 0.1. I didn't lock the value of wᵢ to be the same for all i; you don't have to have a solution that's a flat line. The answer is going to be the same regardless. The easiest thing, I would say, is to fix all of the w's to be the same. You could also solve it with least squares — yeah, exactly, you can find the weights by least squares. So you can solve it properly, but I'll just do it the hacky way.
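To make the initialization concrete, here is a minimal NumPy sketch of both routes — the iterative "hack" and the least-squares solve. The basis centers, width, and error grid below are illustrative choices of mine, not values from the homework:

```python
import numpy as np

mus = np.linspace(-2, 2, 15)            # basis centers (illustrative)
sigma = 0.3                              # shared basis width (illustrative)
errors = np.linspace(-2, 2, 201)         # grid over the error space
eta_target = 0.1                         # desired naive sensitivity

# Activation matrix: Gmat[j, i] = g_i(errors[j])
Gmat = np.exp(-(errors[:, None] - mus[None, :]) ** 2 / (2 * sigma ** 2))

# The hack: uniform weight w0, grown until the summed bases reach ~0.1.
w0 = 0.0
while (w0 * Gmat.sum(axis=1)).mean() < eta_target:
    w0 += 1e-5

# The least-squares route: solve Gmat @ w = 0.1 over all errors at once.
w_lsq, *_ = np.linalg.lstsq(Gmat, np.full(errors.size, eta_target), rcond=None)

print(w0)                          # much smaller than 0.1, as promised
print((Gmat @ w_lsq)[::50])        # ~0.1 everywhere across the span
```

Either initialization gives a naive learner whose summed sensitivity is flat at 0.1 across the error span; the uniform w₀ is the simplest to reason about.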
I have another question, about the plots you asked for. We're plotting the magnitude of the error against η times the error? Against — exactly — learning from error, in this particular case. Remember, the simplest learning rule (x̂ and ŷ are going to be the same here, so I'll write it in x):

x̂(n+1) = x̂(n) + η (x(n) − x̂(n)),

where x(n) is the true state on trial n. η is your sensitivity to error: it represents how much you're going to learn, the fraction of the error that you account for on the next trial. The term η times the error represents learning — how much you learn on each trial. What I'm really asking you to do is plot this term for all possible errors, over the span of the error space. You're going to have a set of bases, and some of those bases are going to have higher or lower weights than others. For every one of these errors, I want you to go through and tell me what the sensitivity value is — at the end of all the trials. After you've exposed the learner to each of the environments, just go through and ask: what is the learning from error? Find the sensitivity at −2 and multiply by −2; find the sensitivity at −1 and multiply by −1. Does that make sense?

So in the simplest case — this is why I had you do the simplest case first — what is learning from error with a fixed η of 0.1? It's a tenth of the error. What would it look like if I plotted error on this axis and η·e on that axis? It's a line that passes through the origin whose slope is 0.1.

So when you say magnitude, you don't mean magnitude — not absolute value? Did I say magnitude? In the homework, I think it says magnitude. Oh, I said magnitude? Well, you caught that — with the absolute value it's going to look the same. But it's a line with a slope of 0.1, not a horizontal line at 0.1. Does that make sense? The sensitivity η is constant at 0.1; η times e is a line in e with slope 0.1. Does that clear it up? Yes, thank you.

I apologize — I created the homework literally 10 minutes before class, so if there are questions, I'm happy to answer them and revise the homework appropriately. It shouldn't be too bad. The first problem should take you about 10 minutes; the second problem might take a little longer. The way I would look at it: remember, β represents how much things are changing on every trial. So initially set up all your code with β equal to 0, and the output should look exactly like the target — it should look exactly the same.

Okay, any other questions? So, we talked — I just erased the equation, but I'll write it again. The simplest possible learning rule: we'll call this a one-state rule. You are estimating some state x, and to make it clearer I'm going to use y's for the outputs: y and ŷ, where ŷ = c·x̂ and c = 1. So you're observing some true output, you have an estimated output, and you change the estimate of your state based on the observed values you're getting. The amount you change it by is set by the sensitivity-to-error parameter.

This predicts some psychophysical results very nicely. Imagine a perturbation — perturbation on the y-axis, trial on the x-axis. You have no perturbation for a while, and then you abruptly turn the perturbation on. I don't care what the perturbation is: it could be pushing your hand to the left, it could be rotating the visual environment. It turns out that people do a good job of learning this, and their learning curves look like exponentials: you get incrementally better and better and better until you finally reach some steady-state value, which may or may not be exactly equal to the perturbation — people tend to end up a little below it, compensating for about 95% of the perturbation, depending on which perturbation you use.

But interestingly, there are a lot of phenomena that this simple model cannot describe. One of the earliest experiments — this is chapter eight in Reza's book, by the way — is Ross in 1968, so long ago. This particular experiment is actually not in the book, but it's the first example of this phenomenon that I was able to find. They had a line of 24 female rabbits, a conditioned stimulus (CS), which is a tone, and an unconditioned stimulus, which is a shock delivered through a little electrode near the eyelid. When the rabbit gets shocked, it blinks. Initially the rabbits just hear the tone — tone, tone, tone — and they don't blink; they have no reason to blink. You just play tone, tone, tone, tone, and the rabbits do whatever the heck they want to do.
At some point, you pair the tone and the shock together. And just like when Reza talked about blocking, the rabbits begin to learn the pairing of the tone and the shock, so that if the tone is ever presented by itself, some percentage of the rabbits will respond by blinking, even though no shock was presented. After about 50 pairings of tone plus shock, about 85% of the rabbits blink when they hear the tone. Learning here is measured as the percentage of rabbits that respond — most of them learn to pair the tone and the shock.

Then you wash out: you have a period of washout in which you remove the shock and present just tones. So the sequence is tones alone, then tone plus shock, then tones alone again. Initially the rabbits hear the tone and blink, but over time, with the tone no longer paired with the shock, they realize they don't have to blink anymore, and over a number of trials they stop blinking. After about 50 trials of tone alone, basically none of the rabbits respond.

But here's what happens when you re-expose them to the tone plus the shock. It's the same pairing, they've washed out completely, they no longer respond to the tone alone — yet as soon as they see tone plus shock again, they learn quite a bit faster. In many fewer trials than it took to learn the pairing initially, they respond at the same level. Does that make sense?

Clearly the simplistic model cannot account for that result. You have an η value that is fixed for all trials. You expose the model to this sequence of zeros and ones — zero being no perturbation, one being the tone and the shock — and it will learn at some rate, then decay at some rate, and then learn again at exactly the same rate it learned before. Does that make sense? This phenomenon is called savings: learning a perturbation A, followed by washout or extinction, followed by faster relearning of A.
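You can see the failure directly in simulation. Here is a minimal sketch — schedule lengths are my arbitrary choices — of a fixed-η one-state learner run through adaptation, washout, and re-adaptation; the trial-by-trial learning on re-exposure is indistinguishable from the first exposure, i.e., no savings:

```python
import numpy as np

eta = 0.1                                   # fixed error sensitivity
# Perturbation schedule: adapt (1), wash out (0), re-adapt (1).
y = np.concatenate([np.ones(60), np.zeros(60), np.ones(60)])

x_hat = np.zeros(len(y) + 1)                # estimated state, trial by trial
for n in range(len(y)):
    x_hat[n + 1] = x_hat[n] + eta * (y[n] - x_hat[n])

first = np.diff(x_hat[:10])                 # learning during first exposure
second = np.diff(x_hat[120:130])            # learning during re-exposure
print(np.allclose(first, second, atol=1e-3))  # ~True: no savings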
To give you a couple more examples: Yoshiko Kojima, back in 2004, recorded some interesting data from monkeys — this one is in the book again. They used what's called an intrasaccadic step paradigm. The monkey fixates a point on the screen, and then another point appears off to one side — say ten degrees away. The monkey is trained to make a rapid eye movement, a saccade, to that second target, and we track its eye movements very closely. Now, importantly, when you make a saccade — a fast eye movement — you're basically blind during the movement: while your eyes are moving you get essentially no visual feedback, so it's a very ballistic movement. So as the monkey starts to look at the target ten degrees away — in that very brief period while the eye is moving — we take away the target we presented and instead present a target that's further away. On that first trial the monkey lands exactly ten degrees away, looks, and realizes: hey, I fell short. I'm supposed to be looking at the target, and there are these two degrees of error on the first trial. So we trick them a little bit — humans fall for it too.

What happens in this intrasaccadic step paradigm is that over a large number of trials the monkeys learn that a ten-degree saccade is not a ten-degree saccade; it's a twelve-degree saccade. I'm plotting gain on the y-axis — here's a gain of 1, here's a gain of 1.4 — and trial on the x-axis. You start in a period where you don't jump the target at all, so the gain is pretty close to one: ask the monkey for a ten-degree saccade and you get pretty much a ten-degree saccade. Then you turn on the perturbation — you turn on the steps. I said two degrees before; in this experiment they actually jump the target by four degrees. The monkey — these are individual saccades on individual trials — slowly starts to learn the pairing, so say at this point it's sitting at a gain of about 1.2. It still has a little bit of error, but it has learned to make bigger saccades toward the target than before.

We then do the exact opposite: instead of showing the target at fourteen degrees, we jump it back by four degrees, to six degrees. The gain starts coming right back down — and if we left them in that environment, they would end up with a gain less than one. Does that make sense? But once they get back to baseline, a gain of one, we go back the other way and start gaining them up again. So this is a period of adaptation, then extinction or washout, then a period of re-adaptation.

If you fit a mean line through the first period of adaptation and a mean line through the second period of adaptation, the slope during re-adaptation is larger than the slope during adaptation: they learn faster. Now, that's over roughly the first 150 trials. After about 150 trials they actually learn at about the same speed as before — is "parallel" the right word? it's been a long time since geometry. If you keep letting them adapt to this 1.4-gain situation, their adaptation slows down after about 150 trials, and they learn at exactly the same speed they were learning over here. So this period is the same thing we talked about with the rabbits — it's basically savings.

Is it surprising that they go back to their slower rate? I'll get to that at the end. Do the extinction slopes also follow the same pattern — if you do another extinction after this, will the second extinction be faster? I don't know whether anyone has done the experiment where they continue the extinction beyond this point — if you were to extinguish beyond what you learned over here, beyond a gain of 1, I'd expect no difference; you'd learn at exactly the same rate as the first time. It's only the fact that you've seen this previously that speeds you up. Sorry — so if you do a second extinction period, does that also go faster the second time than the first time?
Yes, it will go faster the second time than the first time. I can't tell you whether or not it eventually slows down — I don't think anyone's done that experiment — but yes, a second extinction will be faster. You can actually do repeated periods of learning and extinction, and over time it becomes what we call a skill: you can basically turn it on and turn it off in one trial. Something that initially takes many, many repetitions, you can eventually switch on and off at will — you've created a skill. Okay, anyone have any questions?

Interestingly, the second thing they did in this experiment: after they've washed you out — after they've extinguished the memory and you have a gain of 1 again — in the second version of the experiment they simply left the room, turned off all the lights for 30 minutes, came back, and ran the same intrasaccadic steps. Here's my 30-minute delay. What ends up happening is that the monkeys start quite a bit higher than where you left them: they have a gain larger than the gain of 1 they had when you left the room. If you just let the monkeys sit there for 30 minutes, you come back and they have a gain of 1.1 instead of 1.0. You haven't done anything but leave the room and turn off all the lights — they're getting no feedback about their eyes, it's a totally dark room — but when you come back and test them, they immediately have a gain higher than what you left them with. This is called spontaneous recovery: the recovery of a motor command — of something you learned — in the absence of any feedback. You spontaneously recover the memory that you adapted to; you're not sitting down at a gain of 1.

So clearly the one-state model cannot describe that phenomenon either. Expose the model to the same series of perturbations, let it settle at a gain of 1 for a couple of trials, and when you come back it's still going to have a gain of 1 — it will never spontaneously jump up. So spontaneous recovery and savings: two phenomena that cannot be accounted for by this simplistic model.

Back in 2006 — I'll move over here — Maurice Smith, along with Reza, so this is Smith et al. 2006, proposed a multi-state model, and it's actually very straightforward. The multi-state model can have many states — we'll talk about that in a second — but in the simplest case it has two. You have a fast state, which I'll denote with an f:

x̂_f(n+1) = a_f x̂_f(n) + b_f (y(n) − ŷ(n)),

and a slow state:

x̂_s(n+1) = a_s x̂_s(n) + b_s (y(n) − ŷ(n)),

and ŷ(n) is simply the sum of the fast and slow states on trial n:

ŷ(n) = x̂_f(n) + x̂_s(n).

So what are these a and b parameters? In the 2006 paper they assume that in the fast state you learn things very quickly — you have a very high error-sensitivity term b_f (I could write it as η; it's the same thing) — but you don't retain what you learned for very long, because a_f is small. In the slow state you learn very slowly — I'll give an example in just a second — but as you learn things, you retain them for a longer period of time: a_s is closer to 1. So 0 < a_f < a_s < 1: the fast state retains very little.
The slow state retains quite a bit from trial to trial — it basically remembers what it was doing. The b's do exactly the opposite: b_s is smaller than b_f, so you learn more from error in the fast state.

What does that look like? Again imagine a trial axis: we start with a null perturbation, no perturbation at all, and then abruptly turn a perturbation on. On that trial you have an error of 1 — you predicted nothing would happen, but in fact there was a perturbation of 1. The fast state learns very quickly, and then forgets very quickly: this is x̂_f. The slow state learns very slowly but retains just about all of its information over time. The total motor output — I'll draw it as a dotted line — is the sum of these two states: you initially rise quickly and then sort of creep your way upwards. It looks sort of like the exponential we drew over there — in fact it looks almost exactly like it, because it's a double exponential.

So why is this important? We've just made our original model somewhat more complicated — but consider what happens with a period of extinction. I'll redraw this: here's my −1 perturbation, here's +1, and trial is again on this axis. If I do extinction training — I give you a −1 perturbation, the opposite perturbation, just like in the Kojima experiment, for a very brief period, and then null — the fast state comes down very quickly and then begins to forget. The slow state comes down very slowly. And the total motor output, the sum of the two states, looks like it goes to zero. Does that make sense? The fast state came down so quickly that it even started to forget; the slow state is coming down too, but a lot slower; and the sum of the two looks like there's no perturbation left. So I've totally washed you out — I've done extinction exactly the way the Kojima experiment did, back to a gain of 1.

Now here's what's interesting: re-expose the model to the +1 perturbation and notice what has happened. The slow state — which took a very long time to ramp up the first time, and I'll draw it even more pronounced because it should have the same slope as before — is nowhere near zero, where it was when you started learning; it's actually quite a bit bigger. So the fast state returns toward the perturbation very quickly and begins to forget, but the slow state is already elevated and keeps learning the perturbation just as it did before — and the sum of the two states now rises much faster than it did initially.

So in this very simple model you have two hidden states, a slow state and a fast state, whose sum is the motor output: one memory of the perturbation that develops very quickly but is forgotten over a short period, and another memory that ramps up slowly but is retained for a longer period of time.
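Here is a minimal sketch of that savings mechanism in the two-state model. The parameter values are roughly those fitted in the Smith et al. (2006) paper, and the schedule lengths are my own choices:

```python
import numpy as np

a_f, a_s = 0.59, 0.992        # retention: fast forgets quickly, slow retains
b_f, b_s = 0.21, 0.02         # learning: fast learns more from each error

# Adapt to +1, brief extinction at -1 (until output is near zero), re-adapt.
y = np.concatenate([np.ones(100), -np.ones(12), np.ones(100)])

xf = xs = 0.0
out = []
for yn in y:
    e = yn - (xf + xs)                 # error on this trial
    xf = a_f * xf + b_f * e            # fast state: learns and forgets quickly
    xs = a_s * xs + b_s * e            # slow state: learns and forgets slowly
    out.append(xf + xs)

out = np.array(out)
print(np.round(out[:5], 3))            # initial learning from zero
print(np.round(out[112:117], 3))       # relearning: faster, since xs is still elevated
```

On re-exposure, the output climbs faster than it did the first time even though none of the parameters changed — the savings lives entirely in the still-elevated slow state.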
Is that biologically plausible? Sure — and in fact we'll talk about it in just a second. Here we're making artificial perturbations — rotating the environment, pushing on your arm, things like that — but you can imagine perturbations that you learn over very long periods of time, years, and perturbations that you learn over seconds to minutes. For an example of a perturbation learned very slowly: imagine you're growing — you're 12 years old, growing 6 inches in a year. You need to recalibrate your motor output over a long period of time as your limbs grow longer; reaching for things is not going to work the same way it did when you were shorter. So you have this very slow build-up of, basically, a perturbation: you make errors and you need to correct them, just because your body is changing. Similarly, imagine a very fast perturbation: you're walking up the stairs and your muscles become fatigued, so you need to change your motor output to get the same power going up the stairs. That perturbation develops over a very short period of time, but it doesn't last very long — sit down for two minutes and you're no longer fatigued. You learn the perturbation quickly, and the perturbation disappears just as quickly. So from a biological standpoint, it makes sense that parts of the brain would learn slower or faster than other parts.

In fact, most of this learning of perturbations is cerebellar-dependent: it relies on your cerebellum, which sits at the back of — rather, underneath — your brain; it's the oldest part of the brain. It turns out that people whose cerebellum is damaged can't learn to adapt their movements. The condition is called cerebellar ataxia: usually when you're about 30 or 40 years old the cerebellum starts to degenerate, and these patients have all sorts of problems — walking, moving, and so on — but interestingly, they also can't learn. Put them in an environment with a perturbation and they will not adapt. So we know these memories rely on the cerebellum. And interestingly, if you look at the rabbits we talked about before — a group, Mike Mauk and Javier Medina, have been studying rabbits for the last 40 years in the same conditioned/unconditioned stimulus paradigm, looking specifically at parts of the cerebellum — it turns out that the slow state corresponds to changes in synaptic plasticity in some of the deep cerebellar nuclei, while the fast state corresponds to changes in the cerebellar cortex. So you see plasticity at these two different timescales in two different parts of the cerebellum. Does that answer your question?

Still, this model won't be able to handle very frequent changes — flipping back and forth between plus and minus — which we can adapt to very fast. Right, this model does not account for that. So is it possible to add a parameter to adapt for that?
Sure — you would change your fast-state b value, using the error-sensitivity ideas Reza described yesterday, to account for the fact that sensitivity changes. We'll talk about it at the end, in another 10 minutes or so.

Okay. So this model also accounts for spontaneous recovery, for exactly the same reasons. Instead of the counter-perturbation here, suppose you stuck the subject in a room where they're getting no feedback. The way we model that is with something called error-clamp trials, in which we basically force y(n) − ŷ(n) to be equal to zero. We say: you made a perfect movement, you did everything just right — even if you didn't, that's what we tell you. You get no feedback about what you actually did; you're just told it was good. What happens is that the b terms drop out — because the error is zero — and all you're left with is the a terms, which are strictly less than one, so the states decay back toward zero. Does that make sense?

So instead of the counter-perturbation, stick people in error-clamp trials — which I'll denote with two bars. The fast state — sorry, here's the fast state — goes back toward zero quickly; the slow state is also returning to zero, but much more slowly. And in the error-clamp period the sum of the two states — notice — rises and then decays: the fast (negative) state decays quite a bit faster than the slow state, so you end up with this peaked response. That is spontaneous recovery, and it turns out to be exactly what we see in human subjects.

You can give humans error clamps in the intrasaccadic step experiment too. Remember, we had the fixation point and the target 10 degrees away, and if we've gained you up, you're going to make a saccade to 12 degrees. Computers are fast enough nowadays that we just wait until your saccade is done, note where you stopped, and move the target to exactly where you landed. Congratulations — you made it to the target. Does that make sense? So you can force people to have zero error regardless of their saccade gain, and just record how large their saccades are. In that case, people spontaneously recover a gain greater than one.

Wait — a computer knew where I was looking, and everywhere I looked there was a dot that followed me? Doesn't that freak people out? We do the experiment, and it doesn't really freak anyone out. You can't tell — you're blind during the saccade. I moved the dot, and you just think: oh, I must have done a great job. Congratulations.
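The same two-state sketch from before, with an error-clamp phase in which the error is forced to zero, reproduces the spontaneous-recovery peak (parameters as in the earlier sketch; schedule lengths are my own choices):

```python
import numpy as np

a_f, a_s = 0.59, 0.992
b_f, b_s = 0.21, 0.02

xf = xs = 0.0
out = []
for n in range(160):
    if n < 100:
        e = 1.0 - (xf + xs)       # adaptation to a +1 perturbation
    elif n < 112:
        e = -1.0 - (xf + xs)      # brief extinction at -1
    else:
        e = 0.0                   # error clamp: error forced to zero
    xf = a_f * xf + b_f * e
    xs = a_s * xs + b_s * e
    out.append(xf + xs)

# During the clamp the fast (negative) state decays away quickly,
# unmasking the still-positive slow state: output rises, then slowly decays.
print([round(v, 3) for v in out[110:125]])
```

The printed values show the output near zero at the end of extinction, climbing during the first clamp trials, and then drifting slowly back down — the peaked response described above.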
Has anyone done the experiment where you wait — maybe 30 minutes in the dark room, or something in between, ten minutes maybe — and then subtract off one of the terms to see if you get the other? There's a paper by Criscimagna-Hemminger and Reza, back in 2009, that asked exactly that question — so Jerry, you're only about four years too late. What they did in that experiment is exactly this, except they waited different periods of time. If you wait no time and just put people in error-clamp trials, you get the peaked response that eventually decays. If you wait two minutes, it looks pretty much the same. If instead you wait ten minutes, it looks a little different: you get significantly less spontaneous recovery.

To give you the full experiment: group A does perturbation A, followed by extinction with B, and then testing in error-clamp trials — so A followed by B, which I'll call A + B. A second group just does A and then goes straight into error-clamp trials. So you have A, and A + B. If you test immediately after learning, group A comes down a little bit in the clamp — I'm not plotting the fast and slow states separately — and group A + B does exactly what you'd expect: the spontaneous recovery. That's after zero minutes of delay. If instead you insert a delay of, say, 10 minutes, the A + B group no longer has that hop up: group A doesn't really change, but group A + B now starts right where A does — as if they'd forgotten B entirely. That tells us the fast state is in fact decaying over time, not just over trials. In that 10-minute period, basically all of the fast state washed out and you're left with just the slow state, so you don't get the rebound. Does that make sense, Jerry? If you wait even longer, things come down a little more and the two curves get closer and closer together; if you test 24 hours later, the two curves are right on top of each other.

Aren't you just observing the same thing a different way — you observe the same thing whether you wait zero minutes or ten hours? Not exactly, and here's why. You'd imagine the fast state decays with some time constant, and the slow state with another. It turns out that over that 24-hour period you cannot fit a single pair of a_s and a_f parameters — everything should have decayed to zero; subjects should have come back making zero output. This whole experiment takes about 5 to 10 minutes, so if this were truly a linear model, after 24 hours there's no possible way you could have anything besides a zero state; you should have totally washed out. That's not what happens: if we train you up and send you home for 24 hours, on the first trial when you come back you have some retention — some non-zero output. So it can't just be passive decay in time. There has to be some sort of switch, where you stop decaying when we tell you to leave, or you decay more slowly once you've left. Does that make sense? It's not that we're merely observing the system 24 hours later, because a truly linear model with these two states would predict no motor output above zero by then.

What counts as zero — they actually make a 10-degree saccade to the 10-degree target? Right. So is there an even longer, even slower state — something we've learned over a lifetime about how to make a 10-degree saccade — and is it doing anything on the time scale of this experiment?
Sure — great question. Back in 2007 — this is Kording as the first author, so Konrad Kording, with Reza; Kording et al., 2007 — they addressed exactly that question: what happens if you have many states, more than two, and some of the timescales have a and b parameters on the order of days or years? The actual data are from Robinson et al., 2006. They put monkeys in a gain-change task — let's do a gain-down task, because it's a little more fun. Here's a gain of one and here's a gain of 0.5: a gain-down task of one half. I ask you to make a 10-degree saccade and jump the target back by 5 degrees. Over the course of 22 days they trained monkeys on these gain-down movements. On the first day you start at one and come down a little — maybe you're at 0.75 by the end of day one. On day two you've drifted back up a little toward where you were before, and you do just a little better; day three you drift back a little again; day four, and so on — until by day 22 you get down to about 0.5.

Then they switch it back to a gain of one: pure washout. We ask you to make a saccade to a 10-degree target and we leave the target at 10 degrees. Here's my gain of 0.5 and my gain of 1: you start down here, because I've gained you down, and you start learning to gain your saccades back up. Interestingly, the between-day forgetting now goes in the other direction — you're not drifting back toward a gain of 1; you're drifting back toward a gain of 0.5. That suggests there's a timescale on the order of minutes on which I'm learning; a timescale on the order of days on which I forget — but on subsequent days I forget less and less, so I'm retaining something with a timescale of days; and then there's a timescale of weeks, because the fact that I'm now forgetting in the opposite direction means my weeks-long state has learned that the gained-down saccade is the appropriate action, not the original one. Does that make sense?

The way they modeled it — and you can actually model it quite easily — is as a vector of states:

x(n+1) = A x(n) + b (y(n) − ŷ(n)), with ŷ(n) = cᵀ x(n), where c is a vector of ones.

All we've done is extend the two-state model to have as many states as we want. In the paper they had 30 states, with timescales ranging from seconds to weeks, and you can get exactly these curves back out. A here has diagonal terms a₁₁, a₂₂, a₃₃, and so on — and there are off-diagonal terms too, which I don't necessarily want to draw as zeros. What are those off-diagonal terms? If they're zero, all the states are independent, just like before. If they're non-zero, it means that one state contributes to the next: as I learn things in the fast state, they leak incrementally into the next-slowest state, so over time my fast memory becomes a slower memory. The way they actually fit this is with a Kalman filter — you have some noise, with some variance, on the states and the observations. But good question. Yes: timescales on the order of days to weeks.
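A minimal sketch of that multi-rate generalization — a diagonal A with 30 independent states. The geometric spacing of the retention factors and learning rates below is my own illustrative choice, not the fitted values from the paper:

```python
import numpy as np

n_states = 30
# Retention factors spanning fast to very slow timescales (my choice of spacing).
a = 1.0 - np.logspace(-0.5, -5, n_states)
# Faster states learn more from error (also an illustrative choice).
b = np.logspace(-1, -3, n_states)

A = np.diag(a)                  # zero off-diagonals: independent states
c = np.ones(n_states)           # y_hat = c^T x

# Gain-down adaptation followed by washout, as in the saccade experiment.
y = np.concatenate([-0.5 * np.ones(500), np.zeros(500)])

x = np.zeros(n_states)
out = []
for yn in y:
    e = yn - c @ x              # trial error
    x = A @ x + b * e           # every state learns from the same error
    out.append(c @ x)

# The slowest states retain the gain-down memory long into washout,
# which is what produces forgetting *toward* the adapted state.
print(np.round(out[::100], 3))
```

Non-zero off-diagonal terms in A would let fast memories leak into slower states, as described above; with a diagonal A the states only interact through the shared error.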
The question really is: how do you come up with A? This is now a 30-by-30 matrix. What is A? How do I know it needs to be 30 states? Reza will actually get to that in just a couple of lectures: you can use subspace analysis to basically tell you exactly what A and b are, and how big they need to be. I'm not going to talk about it today, but that's the real question — how do you know what those things are?

Okay, let me just make sure I've covered everything I needed to cover. Yes. Okay.

So I'm going to return briefly to this experiment right here. If you recall, when you re-teach the gain-up task after washout, you get a slope that's greater than the slope you had before, and then, after some number of trials, a slope that is parallel. The question is: why is that the case? Clearly no multi-state model can account for it: in any multi-state model the rate of learning is fixed — you have some fixed b parameters — so if you learn faster here, you should learn faster for everything.

Similarly, imagine a scenario exactly like the one we envisioned here: a perturbation A and a period of washout where you just make normal movements — the saccade target stays at 10 degrees, so here you learn, here you decay. But now let's extend that period of washout for a really long time — a rather boring experiment — say 10 times the length of the learning period for A. In any rate-based or state-based model, the input is the error, and all of the states decay with retention factors that are guaranteed to be less than 1: you can't retain more than you've actually learned, only exactly what you learned or less. If the error has been zero — you've seen nothing but null perturbations — then all of your states have to be essentially 0 after some point in time. Say A was 30 trials; since this system obeys superposition, after a long enough washout all of the states have, by definition, decayed back to zero. Does that make sense to everyone?

And yet: put people back in A and they still show savings. Give them 3 minutes of A at the beginning and then an hour of washout, and they will still learn faster when they see A again. And what's even funnier about this paradigm: if you expose subjects to exactly the opposite perturbation, they'll also learn that faster. They have never seen a −1 perturbation before, but they will learn the perturbation faster.
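You can verify the superposition argument directly with the two-state sketch from before: after a washout much longer than the training block, both states are back at essentially zero, so the model relearns no faster than a naive learner — unlike the people in the experiment:

```python
import numpy as np

a_f, a_s, b_f, b_s = 0.59, 0.992, 0.21, 0.02   # illustrative two-state parameters

def run(schedule):
    xf = xs = 0.0
    out = []
    for yn in schedule:
        e = yn - (xf + xs)
        xf = a_f * xf + b_f * e
        xs = a_s * xs + b_s * e
        out.append(xf + xs)
    return np.array(out)

# 30 trials of perturbation A, a washout many times longer, then A again.
trace = run(np.concatenate([np.ones(30), np.zeros(600), np.ones(30)]))
naive = run(np.ones(30))                  # a learner that has never seen A

# Both states have decayed to ~0 during the long washout, so relearning
# is indistinguishable from naive learning: the model predicts no savings.
print(np.round(naive[:5], 3))
print(np.round(trace[630:635], 3))
```

Any linear multi-rate model, no matter how many states, makes the same prediction here — which is exactly why these results force something beyond fixed learning rates.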
Is this humans or monkeys? Humans do it, monkeys do it, insects do it, birds do it. So this is an example — in the initial case it was just standard savings, with an extended washout in the middle, and no multi-state model can account for those results — of something we call meta-learning. You have learned something about the environment, the rules of the environment: whatever the +1 perturbation was, I can infer what a negative perturbation is. I have learned something about the structure of the environment, and by learning that, I learn faster when I see a −1 perturbation, even though I've never experienced it before.

And this is where the model Reza presented yesterday comes in. Here we only have memories of the perturbation: x̂ is a memory of the state of the system — of the perturbation. The way I draw them, you always approximate the perturbation you're being given, and when I turn that perturbation off, you approximate null. In the equations Reza gave yesterday, you have a single state,

x̂(n+1) = a x̂(n) + η(n) ỹ(n),

where ỹ(n) is the error on trial n, and you modify the value of η on every trial via a basis set: the weights internal to that basis set are changed by the history of errors you experience. Remember, in that case we said

w(n+1) = w(n) + β · sign(e(n) · e(n−1)) · g(e(n−1)),

where w is a vector and β is some learning rate: you update your weights based on the sign of the two consecutive errors you experienced, around the error you experienced. Does that make sense?

So intuitively, what does this rule mean? It says: if I'm learning something, and I learned a little and did a little better, I need to increase the error sensitivity. I made a reaching movement and fell short; the next time I see that error, I need a higher error sensitivity, so that if my goal is to reach exactly to the target, on the next trial I get there. If I reached exactly to the target — in which case e(n) is 0 — I don't want to do anything: my sensitivity to error was perfect. And if e(n−1) and e(n) have different signs — in one case I undershot the target, in the other I overshot it — that means I overcorrected: my error sensitivity is too high and I need to bring it down, so that if I ever see that error again I reach directly to the target. That's the intuition behind the rule: if I'm doing poorly but getting a little better, increase sensitivity; if I overcorrected, decrease sensitivity; if I'm doing just right and reaching directly to the target, do nothing — sensitivity stays the same.

So what happens here with a perturbation of 1? You see an error of 1. On the next trial you do a little better, because that's what your error sensitivity dictates. So you update your error sensitivity around an error of 1: well, I didn't reach exactly to the target in one trial, I didn't correct all the way, so I need a slightly higher error sensitivity the next time I see that error. You up-weight your sensitivity to error around an error of 1, and you do the same thing as you go along.
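A minimal sketch of this sensitivity-update rule; the basis centers, width, and β are illustrative choices, not values from the lecture:

```python
import numpy as np

mus   = np.linspace(-2, 2, 15)     # basis centers over the error space
sigma = 0.3                        # basis width (illustrative)
beta  = 0.01                       # sensitivity learning rate (illustrative)

def g(e):
    """Gaussian basis activations at error e."""
    return np.exp(-(e - mus) ** 2 / (2 * sigma ** 2))

def eta(e, w):
    """Error sensitivity read out from the weighted bases."""
    return w @ g(e)

def update_w(w, e_prev, e_curr):
    # Consecutive errors of the same sign -> still under-correcting ->
    # raise sensitivity around e_prev. Opposite signs -> overcorrected ->
    # lower it. e_curr == 0 leaves w unchanged, since sign(0) == 0.
    return w + beta * np.sign(e_curr * e_prev) * g(e_prev)

w = np.full(mus.size, 0.1 / g(0.0).sum())   # naive weights giving eta ~ 0.1
w = update_w(w, e_prev=1.0, e_curr=0.8)     # same-sign errors: sensitivity rises near e = 1
```

Note that the update is local: because g(e_prev) is a bump around the previous error, only the sensitivity near the errors you actually experienced gets modified.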
So you're basically updating your error sensitivity to say: I did well, I just didn't do fantastically — I need a slightly higher error sensitivity. Now, when you come back to that perturbation sometime later, those weights are still set exactly where they were before, and because you have a slightly higher error sensitivity — your η value is larger for a perturbation of 1 — you learn faster: you learn more from the error on each trial.

Similarly, think of the washout period as also giving you errors: here you have an error of +1, there you have an error of −1. Does that make sense? As soon as I switch off the perturbation — because you were estimating the perturbation pretty well — you have an error in the opposite direction, an error of −1, and you correct for that error by bringing your output down. And it's consistent: you're not perfect, you didn't correct for the error immediately on the next trial, so you need to increase your sensitivity to error — not around errors of +1 now, but around errors of −1. So if you then give the opposite perturbation, those are exactly the errors you saw during the washout period — which is why you learn it faster. You've increased your sensitivity to negative errors as well as positive errors, thanks to the washout.

This makes a couple of very strange predictions as well. Isn't the slope during washout steeper? It is slightly steeper, but that's because of the contribution of the a parameter: a is less than 1, so you're not only learning from error, you're also forgetting. Does that make sense? If you removed that — if you set a equal to 1 — the two would have exactly the same slope, exactly the same exponential.

So, a couple of weird predictions. Instead of applying the perturbation abruptly — turning it on and turning it off — you can apply the perturbation gradually: ramp it up so that people don't even know you're doing anything. For instance, in a gain-up experiment, instead of applying a 4-degree jump you apply 1-degree jumps; people will happily follow along, thinking they're making totally normal movements, even though by the end they're fully compensating for the perturbation. But here you never experience a large error — you never experience an error of about 1, only very small errors: the difference between your estimated output and the perturbation on each trial. So you're updating your error sensitivity near errors of 0, not near errors of 1. And if you come back later and test with an abrupt perturbation, they don't learn any faster — they learn exactly as fast as if they had never seen the perturbation before, even though they learned it over a much longer period of time than the abrupt group.

I have a question about the perturbation — that x̂(n+1) equation: is that the same as in the homework, except that the a term is added? Yes — in the homework, as I said, it makes life a lot easier to assume a is equal to 1, and you'll see how things change when a is not equal to 1. The difference when a is not 1 is that if you give a unit perturbation, you won't approximate the perturbation exactly: you'll have some error at the end, you won't learn fully, because on every trial you forget a little bit.
And the added term — is it η of the error in y, or this η function times the error in y? Is η still a coefficient, or is η a function? It's exactly identical to the homework: η is a function of ỹ, and in this case it also multiplies ỹ — a function of ỹ, times ỹ, if that helps. Does that make sense? It's exactly the same as the homework: you see an error, you ask what the sensitivity is around that error — you read it off the sum of all the Gaussian bases — and then you multiply by the error. In the homework I did not put that extra set of parentheses with the error inside, because then it would read as the error squared. So the sensitivity is a function of the error, and it is being multiplied by the error:

x̂(n+1) = a x̂(n) + η(ỹ(n)) · ỹ(n).

You can actually drop the function notation and just write it as a multiplication; the key difference from before is that I write η with a trial index n, as opposed to just a fixed η. I think it's just the parentheses that are confusing — think of it as an asterisk between η and the error. But g is a function — yes, g is a function of the error, and so is η.
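Putting the pieces together, here is a minimal sketch of the full homework-style loop — the adaptive-η one-state learner, with a = 1 as suggested for the homework and illustrative basis parameters of my own:

```python
import numpy as np

mus, sigma, beta, a = np.linspace(-2, 2, 15), 0.3, 0.01, 1.0  # a = 1, as in the homework

g = lambda e: np.exp(-(e - mus) ** 2 / (2 * sigma ** 2))
w = np.full(mus.size, 0.1 / g(0.0).sum())   # initialized so sum_i w_i g_i(e) ~ 0.1

x_hat, e_prev = 0.0, None
for y in np.ones(100):                       # a constant unit perturbation
    e = y - x_hat                            # y_tilde: the error on this trial
    x_hat = a * x_hat + (w @ g(e)) * e       # eta(y_tilde) * y_tilde, not eta alone
    if e_prev is not None:                   # sensitivity update from consecutive errors
        w = w + beta * np.sign(e * e_prev) * g(e_prev)
    e_prev = e
```

Plotting (w @ g(e)) * e over a grid of errors after the trials is exactly the learning-from-error curve the homework asks for: a line of slope 0.1 for the naive learner, and something bumped up around the errors the learner actually experienced afterward.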