Hi. Welcome back, guys. What happened? Did we pass the drop date or something? The class shrunk. Well, it's good to see you, the ones that are here. Today I want to talk to you about a concept that was fairly instrumental in my own career, which is the idea of generalization. The idea is that when you look at a biological system and you see how it learns, can you say something about the region of the brain that might be involved in the learning that's taking place? And the way we made progress was through this concept of generalization, which means that the learner is learning from error: it makes a guess, sees an observation, forms an error, updates its belief, and then you look at how it changes its belief about other things it hasn't seen. The way it generalizes says something about the shape of its basis functions. And those basis functions are what was interesting to us, because different parts of the brain have different kinds of basis functions, and so one can try to estimate what part of the brain is being used. To start out, there was one thing from last week's lecture that I wanted to go back to, because I think it's an important idea that you may want to know, since many of you will decide to take your oral exams with me when you want to pass your qualifying. I want to distinguish between two concepts. When we have a random variable x that is the sum of x1 and x2, each one a random variable, and we ask what the variance of x is, what we do is find the variance of x1 plus the variance of x2 plus two times the covariance of x1 and x2. That's something we've described. But that's very different from writing a probability distribution that is made up of two different distributions. For example, last week we talked about having a probability distribution p of x that was rho1 times a normal distribution with some mean mu1 and variance sigma1 squared, plus rho2 times another normal distribution with mean mu2 and variance sigma2 squared. So, for example, we have a distribution that looks like this, right? What I mean by variance here is very different from the first case. And just to show you that, I want to go through the math and describe the expected value and variance of a random variable described by a mixture of two distributions, which is very different from a random variable that is a sum of two different random variables. Okay? So let's go through this to understand what it means. What's the expected value of x in this case? Well, what is the definition of expected value? It's the integral of x times the probability of x. And what's the probability of x? It's rho1 times a normal with mean mu1 and variance sigma1 squared, plus rho2 times a normal with mean mu2 and variance sigma2 squared. And of course, for this to be a proper distribution, the whole thing must integrate to one, so rho1 plus rho2 must equal one: each is at most one, and together they sum to one. Because this is an integral, I can separate it into two terms: rho1 times the expected value of x under the first normal, which is mu1, plus rho2 times mu2. So the expected value of x is just a weighted mean of these two distributions. What about the variance of x?
Variance of x is equal to the expected value of x squared minus the expected value of x, quantity squared. So what's the expected value of x squared? That's rho1 times the integral of x squared times a normal with mean mu1 and variance sigma1 squared dx, plus rho2 times the integral of x squared times a normal with mean mu2 and variance sigma2 squared dx, with the integrals going from minus infinity to infinity. That's equal to rho1 times the quantity mu1 squared plus sigma1 squared, plus rho2 times the quantity mu2 squared plus sigma2 squared. So the variance of x is equal to this minus the expected value squared, that is, minus the quantity rho1 mu1 plus rho2 mu2, squared. Let me just check whether this makes sense. If I set rho2 equal to zero, the variance of x should just be sigma1 squared, right? At the board it looked for a moment as if the rho's needed to be squared for that to come out, but they don't: if rho2 is zero, then rho1 has to be one, otherwise it's not a distribution, since the whole thing has to integrate to one. With rho1 equal to one, the expression becomes mu1 squared plus sigma1 squared minus mu1 squared, which is sigma1 squared, as it should be. The algebra only cancels because the rho's sum to one, which they must, otherwise it's not a distribution. So the point is that when you have a random variable that is a sum of two random variables, its variance is very different from when you have a distribution that is a mixture distribution. And this is a mixture distribution: a mixture of two different Gaussians in this case.
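As a quick numerical check on that algebra, here is a minimal Python sketch (assuming NumPy; the mixture parameters are arbitrary illustrative values). It draws samples from a two-component Gaussian mixture and compares the sample mean and variance against the formulas above: the mean is rho1 mu1 plus rho2 mu2, and the variance is rho1 times (mu1 squared plus sigma1 squared) plus rho2 times (mu2 squared plus sigma2 squared) minus the mean squared, with no squaring of the rho's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters (arbitrary example values)
rho = np.array([0.3, 0.7])      # mixing weights, must sum to 1
mu = np.array([-1.0, 2.0])      # component means
sigma = np.array([0.5, 1.5])    # component standard deviations

# Draw samples: first pick a component, then draw from that Gaussian
n = 1_000_000
which = rng.choice(2, size=n, p=rho)
x = rng.normal(mu[which], sigma[which])

# Closed-form mixture mean and variance from the lecture
mean_formula = np.sum(rho * mu)
var_formula = np.sum(rho * (mu**2 + sigma**2)) - mean_formula**2

print(f"sample mean {x.mean():.4f}  vs  formula {mean_formula:.4f}")
print(f"sample var  {x.var():.4f}  vs  formula {var_formula:.4f}")
```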
All right. Let's go back to another thing we talked about on Wednesday, which was the loss function. From the experiments in that paper where they estimated the loss function for individuals, they found that it looked like the error, y tilde, raised to the power 1.75. So the error is not quite being squared; it's being raised to a power less than two. I want to show you what this means from the point of view of how much you learn from error. What it says is that your sensitivity to error is particularly large when you have small errors: percentage-wise, you learn a lot when your errors are small, but when your errors become really large, you don't learn as much from them. That's what this kind of loss function means. If we had a quadratic loss function, it would mean that I'm going to learn, say, 20% of my error regardless of whether the error is small or large. But when the loss function looks like this, it implies that I'm going to be more sensitive to small errors and I'm going to discount large errors. That's the intuition I want to show you with this particular loss function. So what this means is that we're going to look at the difference between y and y hat and raise it to some power q. Let me call q the power I'm raising it to, and suppose I write my estimate using some basis set that evaluates the input x, summed up with some weights w, to form my estimate y hat. Then I have my learning rule. Let's just worry about the first derivative: whatever the derivative of the loss is with respect to w, I'm going to move in the opposite direction. I can rewrite that by taking the derivative with respect to my estimate and using the chain rule. Now I'm going to define the sensitivity to error as how much I change my weights from before I made the prediction to after: w hat of n plus 1 minus w hat of n, the change I made in my belief, divided by the error. How sensitive am I to that particular error? I'm going to call this variable the sensitivity to error: the change in the weights divided by the size of the error. Is that a number or a vector? It's going to be a vector in this case, because the weights are a vector and y is a scalar. Okay. So that's just equal to this expression, which I can find analytically. The derivative of the loss with respect to y hat, writing the loss in this form, is minus eta times q times the absolute value of the error raised to the power q minus 1. The derivative of y hat with respect to w is this function g of x, a vector. And then I have the final factor, 1 over y tilde, because I'm dividing by the error. So my sensitivity to error, apart from the vector g, is a function of the error size, and it depends on two quantities: the error raised to the power q minus 1, divided by the error itself. If q is 2, that's just something raised to the power 1 divided by itself, which means the sensitivity to error is constant. So if I plot this quantity as a function of the error for a quadratic cost function, it's a constant equal to minus eta times q, with q equal to 2. But for this particular loss, where q is less than 2, the numerator goes as the error to the power 0.75, and you divide by the error, so overall the sensitivity falls as the error raised to the power minus 0.25. You get a function that is high for small errors and falls as the error increases. That's q equal to 1.75. For losses raised to a power greater than 2 you see the opposite: the sensitivity grows with the error; for q equal to 2.5 it grows like the square root of the error, and for q equal to 3 it grows linearly. So how much you change your weights at a particular x, apart from the factor g of x, depends on the size of the error, and as the error size increases, this sensitivity to error decreases for the exponent 1.75.
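Here is a minimal Python sketch of that sensitivity curve (NumPy assumed; the learning rate and error values are arbitrary). It evaluates only the scalar factor eta times q times the error raised to the power q minus 2, which is the sensitivity up to the sign convention and the vector g of x: constant for q equal to 2, falling with error size for q equal to 1.75, rising for q greater than 2.

```python
import numpy as np

eta = 0.1                                      # learning rate (arbitrary illustrative value)
q_values = (1.75, 2.0, 2.5, 3.0)               # loss exponents to compare
errors = np.array([0.25, 0.5, 1.0, 2.0, 4.0])  # error magnitudes to evaluate

for q in q_values:
    # scalar part of the sensitivity: eta * q * |error|**(q-1) / |error| = eta * q * |error|**(q-2)
    sensitivity = eta * q * errors ** (q - 2)
    print(f"q = {q:4}: sensitivity over errors {errors} ->", np.round(sensitivity, 3))
```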
Yes, please, go ahead. The question was: if they shot way over there and had a huge error, wouldn't that force them to change their prediction a lot? Right, right. But what we're talking about is, as the magnitude of the error increases, do you learn the same percentage of it? Say you're going to learn 20% of the error: if the error is this big, you change your behavior by this much, 20% of it. Now if the error is small, do you also learn 20% of it? If you have a quadratic cost function, that's what you would do: you always learn the same percentage. But if your loss is a power less than 2, what it means is that, percentage-wise, you learn a larger proportion of small errors than of large errors. Effectively you treat very large errors as outliers and ignore them. That's what people seem to be doing as well. Someone also pointed out that this kind of discounting might be appropriate in one situation and not in another. Absolutely, yeah. So the reason I wanted to show you this is that the original paper describing this particular loss function was published around 2004. A few years later we did an experiment where we actually controlled the error sizes: given a particular distribution of perturbations, we controlled the error size and asked what percentage of each error people were learning. And what we saw was that they were more sensitive to small errors than to large errors. That made us go back to that study and say, well, this is actually consistent with that study: the loss function is not quadratic but something with a smaller exponent. It's discounting very large errors, possibly implying that very large errors are unlikely to be your fault, just outliers. So this idea comes from that particular loss. All right, let me switch gears. I want to talk about generalization and how we're going to estimate it. Here is the framework I want to pursue. I describe some basis set g here, and we don't know what these g's are. All we can see is that the learner makes an estimate: you query the system, you say, here's x, give me y, and they give you y hat. You give them the truth, so there's an error between the two, and then they change their weights. The idea is that they will take that error and change their weights everywhere, not just at the particular x that you gave; it will affect everything. And the question is, how are you going to understand how that information got spread? You may have only given information to the learner at one particular x, but they will have generalized it everywhere, and you can't track it. You can only ask them: okay, give me your estimate at some other x. The learner can't tell you everything he knows; he can only tell you what he knows at a particular location, but his model has changed everywhere. So the question is how one can estimate this process of learning from those individual queries. How can I ask what these basis functions are, given that I've only been asking from this location and that location, and on every trial the learner is changing everything he knows, while I can only ask: what do you know at this particular location, give me y hat for this particular x?
As soon as you give the actual y for that location, they change their w's everywhere. So there's no way for you to know what that whole system looks like; you can only tell at the particular location you queried. So how can one estimate the g's? And why is it important to estimate the g's? Well, if you look at how biological systems learn, there's consistency in them. They don't just randomly pick some basis set and then use it to do the learning. They come to the learning problem with a particular basis set, and that basis set seems to be consistent across the individuals in the population that you test. So from that point of view it's interesting to ask: what can I say about their basis set, and why did they choose it? One idea is that maybe there was some part of the brain that they used, and that part of the brain had receptive fields with a particular shape, and by looking at how the learning took place and how they generalized, I can say something about the basis. That was the idea. So back about twenty-some years ago, when I was a graduate student sitting in a class like you guys, there was this idea that if we look at learning not just from the point of view of the learning curve but also at how it generalizes, we might be able to say something about what part of the brain is being used, because we know something about the receptive fields at that location. Those are the g's. What do those g's look like? A few years later we came up with a very simple technique to estimate, not the g's themselves, but something related to them, called the generalization function: the way that the information you got at a particular x gets spread out to the other parts of the model that you have. And that's what we're going to do today. By the end of the day you're going to end up with a simple set of tools. I've given you a data set for your homework where all you have is basically what a learner would have: you have the input x, you have y hat, the learner's estimate, and you have the error. You have a sequence of examples like this, and from that you're going to estimate what the generalization function looks like, how they learned in that experiment. So okay, let's formalize our problem. We want to say something about the g's and what they look like; in a very simple case, they may be narrow or they may be broad. We have y hat equal to w transpose times g of x, where g of x is a vector. And we have our simple learning rule, plus, if we want to use the Newton-Raphson technique, a normalization by the second derivative of the loss, which for this model involves g times g transpose. So when we write it for a single trial, the change in the weights is the learning rate times the error times g evaluated at x of n, with that normalization applied. I can write the change in my weight vector as this function. Okay, so if I now project the change in my weights onto the output space, the y space, the change in y is the change in my weights times g of x. This is a function: it says, you figured out this change in your w; project it onto your g at any x, and I'll tell you how much change you got in your output y. Sorry, y hat. So, to give you a sense of things: here's some function that I'm trying to estimate, and here's x. Let's say I ask you to tell me something at this location, x of n. You say, here's my prediction.
My prediction is this number here. This is y hat at x of n, on trial n. And here's your error; this is y tilde. You're going to learn from that error and change your estimate; next time you'll be a little bit better, maybe move up to here. This difference is delta y hat at trial n plus 1. But it's not going to change your belief only there: you're going to generalize it to other locations. So if your prior belief looked like this, if this was the function you had before, afterwards it's going to be different over some region of the space. There's one location where I gave you the error, and you change your belief over some particular range. That's what you generalized. And the difference between y hat at n plus 1, after the weights changed, and y hat at n, before the weights changed, is what you generalized. It's what you learned, which is this quantity: my change in weights times the basis g gives me my new y hat, and that y hat is now a function. So the difference between the red and the black curve extends over some region, and that's what I've learned. And the width and shape of that depend on what? They depend on these g's. So if I write out what this delta y hat of x is, it's a constant a times the error, times g of x of n, times this normalization to the minus one, times g of x. I'm going to call this whole term a function, b of x and x of n. x is the place where I'm evaluating my new belief; x of n is where I saw the error. Before I started to learn, you asked me for my belief about x of n, and I gave you y hat, this red point here. Then you gave me the error and I learned from it; I changed my belief everywhere, and now I get this function. It depends on something I'm going to call the generalization function. This generalization function depends on two things: where did I see the error, and where are you asking me to generalize to? x is anywhere along this line; x of n is the particular location where I saw the error. So the change in y depends on this generalization function, and that's what we're interested in understanding: how do we quantify this function? Any questions before I continue? Okay.
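To make that picture concrete, here is a minimal Python sketch of a learner of this form (NumPy assumed; the Gaussian bases, their width, and the learning rate are illustrative assumptions, not values from any experiment). The learner sees a single error at one location and updates its weights with the simple gradient rule; the change in y hat at every other location is the generalization, and its width tracks the width of the bases, so narrow bases give narrow generalization and broad bases give broad generalization.

```python
import numpy as np

# Gaussian basis set: centers tile the input space, width controls how far learning spreads
centers = np.linspace(-5, 5, 21)
width = 0.8                                   # try 0.3 vs 2.0 to see narrow vs broad generalization

def g(x):
    """Vector of basis activations at input x."""
    return np.exp(-(x - centers) ** 2 / (2 * width ** 2))

w = np.zeros_like(centers)                    # learner starts naive
eta = 0.5                                     # learning rate (arbitrary)

# One trial: query at x_n, observe the true y, form the error, update w (simple gradient rule)
x_n, y_true = 1.0, 2.0
y_hat_before = w @ g(x_n)
error = y_true - y_hat_before
w = w + eta * error * g(x_n)

# The change in the learner's estimate at *other* locations is the generalization
xs = np.linspace(-5, 5, 11)
delta_y = np.array([w @ g(x) for x in xs])    # w was zero before the trial, so this is the change
for x, dy in zip(xs, delta_y):
    print(f"x = {x:5.1f}   change in y_hat = {dy:6.3f}")
```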
So you're going to have some data, and you're going to ask yourself: how do I quantify a function that looks like this? Well, we need to parameterize things. It's a two-dimensional input space, x and x of n, because you have an input that you learn from and you generalize that to anywhere along this line. x can be anywhere, x of n can be anywhere, so we have a two-dimensional input space to represent for b. So I'm going to write my function b of x and x of n as a matrix. This matrix has rows and columns: b one-one across to b one-P in the first row, down to b P-one across to b P-P in the last row. There are some numbers there, and we're going to have to estimate what they are. What do the rows represent? The rows represent the quantity x; the columns represent the quantity x of n. x of n tells me the location where I received the error that I learned from; x is the location you're asking me to generalize to. To describe this, what we're going to do is take the input space x and divide it into P equal locations; we just digitize your input space into P different places. And if you do that, you can then write your y hat at some time point n as a vector: the function evaluated at each of those locations, y hat of x1 at trial n down to y hat of xP at trial n. So take my function and write it so that at any particular x, which takes one of P values, I can tell you what y hat is. And by the generalization function I mean this matrix: the matrix that tells me, if I were to experience an error at a particular location x of n, what fraction of that error I will generalize to any other of the P locations. We're going to have to estimate that from the data. Let me show you how to do it. Suppose I define a vector k made up of P elements, only one of which is a 1. This is a row vector that selects the location, call it location i, where the input x is queried. Somewhere along my input space you're going to ask me to make a guess, and that location is where this selector k has a 1 in it; everything else is 0. What that means is that when you ask me at trial n, you take k and multiply it by the vector y hat of n. This is a scalar: your guess on trial n. Let me write it as y hat of x of n. You take this vector k, multiply it by y hat of n, and that gives me the output at that particular location. What I want to do is show you how we can estimate this matrix from the sequence of data that you have. The idea is as follows. You have a guess from the learner at a particular location. Then next time you test them at this location, that location, this location, that location, and so forth, until at some later time point you come back to the same location. During these periods of guessing, the learner is learning: from this error, from this error, from this error, and so forth, and each time he learns he generalizes to this location. Each one of those generalizations is a number in this matrix. So all you have to do is keep track of the places where you asked the learner to guess and the errors at each of those locations, and you end up with a single equation whose unknowns are the elements of this matrix corresponding to the locations where you asked. You get one equation with as many unknowns as the number of places you visited. Then you do it again: every time you come back to the same location, you get one equation composed of all the generalization that happened between the two visits. In trial 50 you're at this location; in trial 59 you come back to it; in between you were at some other places, and those places all contributed to what the learner knows at this location. So the change that took place from trial 50 to trial 59 is a sum of all those generalizations in between. That's one equation with nine unknowns. In trial 51 you were over here, and at trial 73 you come back to that location; in between you have all those generalizations, so you get another equation with a bunch of unknowns. We end up with many equations and far fewer unknowns, as long as we have a space that can be quantized so that we come back to each location over and over. And by keeping track of that, we end up with a set of equations that we can solve, and we end up with our estimates of these b's.
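In code, that bookkeeping might look like the following minimal sketch (NumPy assumed; P and the bin edges are arbitrary): digitize the continuous input into P locations and build the one-hot selector k that picks out which element of the vector y hat you actually get to observe on a given trial.

```python
import numpy as np

P = 8                                         # number of discrete input locations
edges = np.linspace(-5, 5, P + 1)             # bin edges covering the input range

def location_index(x):
    """Map a continuous input x to one of the P discrete locations."""
    return int(np.clip(np.digitize(x, edges) - 1, 0, P - 1))

def selector(i):
    """One-hot row vector k with a 1 at location i, zeros elsewhere."""
    k = np.zeros(P)
    k[i] = 1.0
    return k

# Example: the learner's (unobservable) internal estimate over all P locations
y_hat_vec = np.linspace(0, 1, P)
i = location_index(1.3)                       # location queried on this trial
print("queried location:", i)
print("observed y_hat_i:", selector(i) @ y_hat_vec)   # the scalar k times y hat
```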
So it looks like this. On trial n the selector k of n is the same as on trial n plus m: at trial n I'm at some location, and m trials later I come back to the same location and ask you again what you know there. So my y hat sub i of n is k of n times y hat of n: the i-th element of this vector, chosen by the selector vector k. My error is y of n minus y hat sub i of n. Then what I learned at location i, y hat sub i of n plus 1 minus y hat sub i of n, is equal to b of i comma x of n, times my error. Similarly, y hat sub i of n plus 2 minus y hat sub i of n plus 1 is equal to b of i comma x of n plus 1, times the error on trial n plus 1. And I continue like this up to y hat sub i of n plus m. If I sum these equations, the intermediate terms cancel, and what I'm left with is y hat sub i of n plus m minus y hat sub i of n equal to the sum, over the trials in between, of b of i comma x of that trial, times the error on that trial. That's one equation with at most m unknowns. The change in the estimate at location i from trial n to trial n plus m is the sum of those elements of the generalization matrix times the errors at the locations visited in between. So you keep track of the errors, you keep track of which elements of the matrix describe how you generalize, and that must equal the change in your estimate at that location. Question: our representation of y hat is a vector, so at any particular trial I can ask you about one location, but in fact what you know is a whole vector; you don't just know one location, you know it everywhere, right? Yes, exactly: y hat sub i means the element of that vector associated with location x sub i, one of these entries here. So basically you're treating y as a function that you've now discretized. Exactly. That's what it means. And m is arbitrary; it's just however long it takes to come back to the initial location. The key assumption here is that at trials n and n plus m, k is the same: whatever discrete space you have, if you're here to begin with, at some point you return here. The reason is that you don't know any of the intermediate terms; you only know the first one and the last one. Why don't you know the intermediate terms? Because in trial n I asked you about this location, and in trial n plus 1 I asked you about that location, but my equations are all written about this location. In trial n plus 1 I ask you about here, but I'm keeping track of what you know about there, and I can't measure it until I come back to it and measure it again. Does that make sense? Someone asked whether you could probe the learner without giving feedback; I'm assuming you can't do that: in a real biological system, as soon as they make a guess, the environment gives them the truth. We're making the assumption that the learner learns everywhere, even though you gave input at only one location, and that you can't measure anything about what he's learned except at the location where you've asked him for a guess. And m can change; it's arbitrary, it doesn't matter, as long as you come back to each location at least once. It's just a random sequence of guesses that you're asking them to make.
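Putting those equations into code, here is a minimal sketch of how one might estimate a single row of the generalization matrix from a trial sequence (NumPy assumed; the function and variable names are hypothetical, and the small arrays at the bottom are made-up numbers just to show the call). Every pair of successive visits to location i contributes one equation: the right-hand side is the change in the stated estimate at i between the two visits, and the regressors are the errors accumulated at each location visited in between. With a long enough sequence you get more equations than unknowns and solve by least squares; the tiny example here is underdetermined and only shows the mechanics.

```python
import numpy as np

def estimate_B_row(i, loc, y_obs, err, P):
    """Estimate b(i, :) from a trial sequence.

    loc[n]   : discrete location queried on trial n (0..P-1)
    y_obs[n] : learner's stated estimate on trial n (valid only for location loc[n])
    err[n]   : error y(n) - y_obs(n) given to the learner on trial n
    """
    visits = np.flatnonzero(loc == i)          # trials where location i was queried
    rows, rhs = [], []
    for a, b in zip(visits[:-1], visits[1:]):
        # Between visit a and visit b, errors at locations loc[a..b-1] generalized to i
        reg = np.zeros(P)
        for n in range(a, b):
            reg[loc[n]] += err[n]              # accumulate errors experienced at each location
        rows.append(reg)
        rhs.append(y_obs[b] - y_obs[a])        # change in the estimate at location i
    A, d = np.array(rows), np.array(rhs)
    b_i, *_ = np.linalg.lstsq(A, d, rcond=None)
    return b_i                                 # estimated generalization from every location to i

# Tiny synthetic example (hypothetical numbers, just to show the call)
loc   = np.array([2, 4, 1, 2, 0, 3, 2, 4, 2])
y_obs = np.array([0.0, 0.1, 0.0, 0.3, 0.1, 0.2, 0.5, 0.4, 0.6])
err   = np.array([1.0, 0.8, 0.9, 0.7, 1.0, 0.8, 0.5, 0.6, 0.4])
print(estimate_B_row(2, loc, y_obs, err, P=5))
```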
Why was this cool? Around 1991, one of my professors, Tommy Poggio, used this idea, not in the form I've written it for you, but the idea of generalization, to make a guess about a specific kind of learning. Here was the experiment. They had people look at what are called verniers: two vertical line segments, one above the other, offset horizontally by a tiny amount, something on the order of seconds of arc, viewed from about two and a half meters away. You ask the person, is the top line to the left or to the right of the bottom line, and they make a guess. The lines are so close together that the offset is down around the size of the photoreceptors in your eye, so it seems like it should be impossible to tell. But the person is learning every time you give them feedback. You ask, is it to the left or right? They say whatever, and you tell them, it's actually to the left. Okay, to the left. And then they do it again. And what happens is that, trial after trial, the percent correct actually gets better, always with these vertical lines. Then, after some period of learning on vertical lines, they stopped the training and asked: can you do it now if I show it to you this way, with horizontal verniers? And what they found was that people could not do it; they were back to being naive. They could not generalize. Okay? So Poggio suggested that the learning was taking place with basis sets whose receptive fields were quite small, about the size I'm drawing here, roughly the size of photoreceptors. He wrote up this problem and said: if I had basis sets like this, I could learn to do the vertical task using that kind of function up there, just some basis functions, w transpose times g of x, each of these being one of my g of x. And after I've learned it, if you ask me to do the horizontal task, I'm no better than naive at it, because my receptive fields could not generalize to the horizontal configuration. And in that paper they postulated that the receptive fields involved were very small, close to what one might find in the primary visual cortex, in comparison to the large receptive fields one finds in higher visual areas, closer to the frontal lobe. So in that paper they said: look, by looking at the way people learn this task, the fact that they don't generalize is telling us something about the shape of these basis functions. They thought the basis functions must be quite small, the receptive fields quite tiny, and that would be consistent with the primary visual cortex. This was around 1990. Now, what happened later in my own work was that we were looking at generalization of movements, and we developed these equations to estimate the shape of the generalization function. We moved from the visual domain to a different domain, reaching movements. In the case of reaching movements, we began to study how people learned when they made a movement and there was an error in it. Here was the error: some force perturbed their arm. If you look at the next time they moved, they got a little bit better, the error was smaller, and with training they got better at this. And what was interesting was to ask: what did they generalize to these other directions of movement?
So if they acquired some model of these forces and learned to estimate them, we could estimate how much they learned and how much they generalized to these other directions. And using a scenario similar to the one you see there, we estimated what the basis functions looked like. The generalization had a shape, over a two-dimensional input, with a large peak in one direction and a smaller secondary peak in the opposite direction. So it was bimodal: a large peak like this and a small peak like this. It seemed to learn a lot in one direction but also generalize a little bit to the opposite side. This was a two-dimensional basis set like the one you see over there. And based on this we said: look, it seems to us that the basis set that people are learning from in this task has a shape that looks like this. Then, a few years later, people recording in the cerebellum of monkeys found that the coding of the receptive fields of the Purkinje cells for reaching movements had this same bimodal shape, which is perhaps kind of interesting, because we know this particular task depends critically on the cerebellum: patients with cerebellar damage can't learn the task. And the generalization function that we derived through this process had a bimodal shape that seemed to be consistent with what they found in the cerebellum. Okay, I'll stop for now. Do you have any questions? All right. So the basic idea is that we keep track of the errors, the inputs, and the guesses that the learner makes on every trial, and fit them to basically a linear set of equations whose unknowns describe how you generalize from any location to any other location. You end up with a set of equations that lets you estimate the generalization function. That won't tell you the basis set itself, but it will give you something pretty close. So you have a homework that's like that. Thank you very much.