Alright, I think we're ready to get started. Welcome back — good to see you guys, and I want to thank David for giving the lecture on Monday. The highlight of today's lecture is a set of experiments that tries to estimate what we call the loss function: what is it that we are trying to minimize in the learning process? The experiment looks at the loss function in people as they perform a task. To describe the loss function, the experimenters measure certain actions in the individuals and then ask: given the actions they took, what function were they trying to minimize with their behavior?

To get there I need to build up our case. We've been talking about regression, so let's review. We have some vector x that describes our input, with elements x1 through xp — a p-element vector — and a weight vector w, also with p elements. We have n data points, with observations y1 through yn; these are the observations we've made. On each trial we make an estimate, w transpose x sub i, and we compare that to our observation y sub i. We call the difference, y sub i minus w transpose x sub i, the error, and our loss is the sum of these squared errors over all the trials we've seen, i equal 1 to n, scaled by 1 over n; that's what we minimize. What you saw on Monday is that we can stack the inputs into a big matrix X — its first row is x1(1), x2(1), ..., xp(1), and so on down the trials — collect all the observations we've had into a vector y, and write our predictions as X times w. The loss function then becomes 1 over n times (y minus Xw) transpose (y minus Xw), and we minimize over w: w hat equals arg min over w of this loss. And what does arg min mean? Well, you know what min is. So if I have a function like this — call it q of m.
m is the variable on the horizontal axis, and q of m is the value I'm plotting. When I say the minimum of the function, I mean this lowest value — that's min of q of m. When I say arg min of q of m, I mean this value on the horizontal axis: the m that minimizes the function is the arg min. So the w that minimizes our loss is w hat equals (X transpose X) inverse times X transpose y. And this is where you guys got to last time.

We also saw a geometric way of thinking about how to change w from trial to trial, and you saw an equation that looks like this: w hat at time t plus 1 equals my estimate at time t, plus eta, some learning-rate constant, times the error — y of t minus w hat transpose x of t — times x of t, normalized by x of t transpose x of t. This was the least mean squares algorithm, the LMS algorithm. What I want to talk a little bit about today is what this means: where does this division, this normalization, come from, and what does this error sensitivity mean?

To motivate our discussion, let's begin with our cost function, the thing we're trying to minimize. I'm going to write this loss function as some function J; this is my loss, and of course it depends on the parameter w that I'm trying to estimate. So there's some world of w's and some J on this world, which says, basically, there's some location — w star — such that if I find it, I will have minimized my loss, my J. Okay. And I'm out here; this is where I am, w on trial t. This is where I want to be: w star, the best place, the lowest loss I can have. I want to go from here to there, and I want to know how to do it.

This function J is a function of w, so I can write J at w star — my notation is J evaluated at w star — and it's going to be some number. And I can write it, using a Taylor series expansion around w, in terms of J at w and its derivatives here.
The value at w star equals the value at w, plus (w star minus w) times the first derivative of J evaluated at w, plus (w star minus w) squared times the second derivative of J evaluated at w divided by 2 factorial — which is just 2 — plus (w star minus w) cubed times the third derivative of J evaluated at w divided by 3 factorial, plus the rest. So I've just written the Taylor series expansion of my cost function evaluated at this optimum parameter w star. Any questions about that so far?

All right. I can do the same thing for the derivative of this function. The derivative of J evaluated at w star is of course zero, because I'm at the minimum of the function — I'm at the best place I can be, at the bottom of the bowl — so the first derivative at that location is zero by definition. And that zero equals J prime at w, the first derivative at my current location, plus (w star minus w) times J double prime at w, plus the higher-order terms. Okay? So if I have a cost function that's quadratic, like the one I've been showing you up there, then all the terms beyond the second derivative are going to be zero. And even if I don't have a quadratic function, I can choose to ignore those higher derivatives.

So if I now just work with this equation, what I have is: w star times the second derivative evaluated at w equals w times the second derivative at w, minus the first derivative at w. So w star equals w minus the inverse of the second derivative evaluated at w, times the first derivative evaluated at w. What this shows you is an important step you can take to improve your estimate. Because if you are here, at w, it says that to get to w star what you need to do is change your estimate by precisely this amount.
The first derivative evaluated at w, divided by the second derivative evaluated at w. To go from where you are to the best place you can be, this is the change you have to make — call it delta. This distance here is the delta, and what that delta needs to be is the first derivative divided by the second derivative. So in the LMS equation up there, the term in the numerator is like a first derivative, and the normalizing term is something like a second derivative. And I want to show you that now with another example.

So let's look at our cost function and find its first and second derivatives. Our loss as a function of w is 1 over n times the sum from i equal 1 to n of (yi minus w transpose xi) squared. What I need to know is the derivative of J with respect to the vector w. How do I do that — what do I mean by that? This function J is just a number, a scalar. When I say find its derivative with respect to w, what I'm saying is: find dJ/dw1, dJ/dw2, up through dJ/dwp. If I have p weights in my model, then when I find the derivative of this scalar function with respect to the vector w, I end up with a vector of size p — the same size as w.

That derivative is minus 2 over n times the sum of (yi minus w transpose xi) times xi: there's a 2 here because of the square, there's a negative because of the minus w transpose xi inside, and the derivative of the inside with respect to w is just xi. That's the first derivative of this function with respect to w. Now the second derivative — and just to be clear what this means: the proper notation for the second derivative of a scalar function with respect to a vector is d squared J over dw dw transpose. That's my nomenclature for it.
What does that mean in terms of this first derivative that I wrote? Now I'm going to get a matrix. I take the derivative of dJ/dw1 with respect to w1, then with respect to w2, and so forth — d squared J over dw1 dw1, d squared J over dw1 dw2, and so on — so I end up with a matrix of size p by p. And the second derivative of our loss — the derivative of the first derivative with respect to w — is just going to come from this term here, xi. What I have is plus 2 over n — there's a minus out front and a minus inside, so it becomes plus — times the sum of xi times xi transpose. I have to get a matrix, right? And you see that the only way a matrix comes out of this is if xi multiplies its own transpose, which is what's shown here. In your notes, in the slides, I work this out on a specific size of x to show you how the second derivative comes to look like this. But to give you an intuition: the first derivative has to be a p by 1 vector, and you see that it is — this part is a scalar, and this part is p by 1. The derivative of that p by 1 vector with respect to a p by 1 vector becomes a p by p matrix, and you see that this formulation gives a p by p matrix.

The reason I did this is that I want to compute the optimum change in my weights. According to my formulation, what I need is the first derivative of my cost function and the second derivative of my cost function, evaluated at the particular w that I'm at. Here's the first derivative; here's the second derivative. So now what I need to do is take the inverse of the second derivative and multiply it by the first. And if you look, both have a 2 over n in them: when I invert one and multiply it by the other, the 2 over n goes away.
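As a quick aside, the gradient and Hessian just derived can be checked numerically. This is a sketch, not something from the lecture: the data sizes and values below are made up.

```python
import numpy as np

# Sketch: gradient and Hessian of the quadratic loss
# J(w) = (1/n) sum_i (y_i - w^T x_i)^2, with synthetic data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))                 # row i is x_i^T
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(p)

def J(w):
    r = y - X @ w
    return (r @ r) / n

grad = -(2.0 / n) * X.T @ (y - X @ w)       # dJ/dw: a p-by-1 vector
hess = (2.0 / n) * X.T @ X                  # (2/n) sum_i x_i x_i^T: p by p

print(grad.shape, hess.shape)               # (3,) (3, 3)

# sanity check against a finite difference in the first coordinate
h, e0 = 1e-6, np.eye(p)[0]
fd0 = (J(w + h * e0) - J(w - h * e0)) / (2 * h)
print(abs(fd0 - grad[0]) < 1e-5)            # True
```

The shapes come out exactly as argued: a p-vector for the first derivative and a p-by-p matrix for the second.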
And now what I get is: w star equals w minus the inverse of the second derivative times the first derivative, which is w minus (sum of xi xi transpose) inverse, times the sum of (yi minus w transpose xi) times xi. So this term is normalizing this one, and now you can see that the LMS equation up there is a one-data-point approximation of this, which is a sum over all the data you have. This is called the Newton-Raphson algorithm: in a single step it moves you from where you are to the best possible place you can be to minimize your error, assuming your loss function is quadratic and that you've used the Taylor series expansion to minimize it. Let me stop for a second and see if there are any questions.

Student: In that equation over there with the second derivative, how come on the last step you flipped the left and right multiplication of the inverse? On the right side you had to right-multiply by the inverse to get the identity, and the same with the other w term, but then you left-multiply the rightmost term by the inverse. It has to work out that way for the dimensions to be correct, but I don't understand why.

So in here, the way I did it, I assumed that J is a scalar — which it is — but its second derivative is not a scalar; in this case it's a matrix. And I wrote the Taylor series as if w were a scalar, but of course it's not, it's a vector. Yeah, you're right: the proper way would be to write the Taylor series expansion using vector notation for w, and then you'd see the w ends up on the correct side. That's correct — good question. Any other questions, guys?

So in LMS, what's happening is that you're saying: my estimate w at time t is my estimate at time t minus 1, plus some fraction — call it eta — of the derivative of J with respect to w times the inverse of the second derivative.
And because your estimates of these derivatives don't exist for the entire data set — you just have them for that particular data point — you approximate: the first term becomes the difference between your observation and your prediction, (y minus w transpose x at time t minus 1), times x at time t minus 1; and the second term becomes (x at t minus 1 transpose times x at t minus 1) inverse. The reason the transpose is on the inside here, whereas in the batch version it's on the outside, is that the sum of xi xi transpose is a matrix, while x transpose x is just a scalar. If I took a single x times x transpose, of course, I'd get a singular matrix — I wouldn't be able to find its inverse — but when I sum over many data points, it's no longer singular. Here I don't have a sum anymore, so I normalize by x transpose x, a scalar, rather than x x transpose, in which case you'd get a matrix you can't invert. Yeah, Max?

Student: [Asks whether you could keep more terms of the Taylor series in the Newton-Raphson step.] Yes, exactly. [Student: So if the loss function is not quadratic, the single step only takes you to the minimum of the quadratic approximation?] Yes — exactly, that's why. Good.

All right. I'm going to now give you a simple variant on this, called weighted least squares — a weighted least squares cost. The idea here is that when you look at your errors, maybe some errors cost you a lot more than other errors. So what we're going to write is that the loss function is now the sum of some weight — call it p sub i — times (y of i minus w transpose x of i) squared. This says that not all errors are made equal for me: I put a weight on each error. Sometimes an error means a lot and I should learn a lot from it; sometimes an error doesn't mean very much.
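Before going further with the weighted version, here is a sketch comparing the one-shot Newton-Raphson step with the per-trial LMS approximation just described. This is not the lecture's code: the data is synthetic and eta is an arbitrary choice.

```python
import numpy as np

# Synthetic regression problem.
rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))                # row i is x_i^T
w_true = np.array([0.7, -1.3])
y = X @ w_true + 0.05 * rng.normal(size=n)

# Newton-Raphson, all data at once:
# w* = w - [sum_i x_i x_i^T]^(-1) * [-sum_i (y_i - w^T x_i) x_i]
w0 = np.zeros(p)
H = X.T @ X                                # sum of x_i x_i^T (the 2/n cancels)
g = -X.T @ (y - X @ w0)
w_newton = w0 - np.linalg.solve(H, g)

# LMS: one data point per trial, normalized by the scalar x^T x.
w_lms = np.zeros(p)
eta = 0.5
for xi, yi in zip(X, y):
    w_lms = w_lms + eta * (yi - w_lms @ xi) * xi / (xi @ xi)

print(w_newton)   # the least squares solution, reached in one step
print(w_lms)      # drifts toward the same neighborhood, trial by trial
```

Because the loss is exactly quadratic, the Newton-Raphson step lands on the least squares solution in one shot; LMS takes many small normalized steps, which is why eta and the number of trials matter.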
So another way to write this weighted loss is: 1 over n times (y minus Xw) transpose P (y minus Xw), with a matrix P rather than a scalar p, where P is diagonal with p1 through pn on the diagonal. It's just a way of weighting the errors. Now, how do I find the optimum w that minimizes this loss function? Any questions before I do it?

All right. Expanding, this is equal to 1 over n times [y transpose P y, minus w transpose X transpose P y, minus y transpose P X w, plus w transpose X transpose P X w] — if I did that right. Okay, I think that looks okay. So I want to find the w that minimizes this: find w star equal to arg min over w of the loss. The minimum of this function is found at the location where the derivative of the loss with respect to w equals zero. So what is this derivative?

It's 1 over n times the following. The derivative of the first term with respect to w is zero. What's the derivative of the second term with respect to w? Well, let's take a look. w is a vector — call it size m by 1 — and y is a vector of size n by 1; that's the number of data points I have. So this whole term is a scalar quantity, and when I find its derivative with respect to w, what I have to get is an m by 1 quantity. The last term is quadratic in w, and again its derivative with respect to w has to end up m by 1. So to get that, let's see: what is the shape of X?
X is an n by m matrix: it has n rows, one per data point, and m columns; w is m by 1 and y is n by 1. So for the quadratic term, the derivative is 2 times X transpose P X w: X transpose is m by n, P is n by n, X is n by m, so X transpose P X is m by m, and times w, which is m by 1, I get an m by 1 — which is correct. For the other term, y transpose P X w, where y transpose is 1 by n, what I need is the transpose of y transpose P X, which is X transpose P y: X transpose is m by n and P y is n by 1, so X transpose P y is m by 1. So I claim the derivative of the whole function is 1 over n times [minus 2 X transpose P y plus 2 X transpose P X w].

The way I took derivatives of these scalar functions is by making sure the size of the final vector is appropriate. The derivative of y transpose P X w with respect to w is either going to be y transpose P X or its transpose, X transpose P y — one of those two — and the way I decide which one it is is by looking at the dimensionality of the result: my solution here has to be m by 1. That's all I'm doing. That's how I wrote each of these components down, and it's important for you guys: in this course we're going to take derivatives of these quadratic matrix functions a lot, so I hope you become comfortable with doing them.

So now, if I set this derivative equal to zero and solve for w, what I get is: w star equals (X transpose P X) inverse times X transpose P y. So when I have a weighted least squares problem, this matrix P that determines the weights plays a role in the final solution. Let me give you a brief application of what we just did.
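First, a quick numerical check of that closed-form solution — a sketch with made-up data and weights, not anything from the lecture:

```python
import numpy as np

# Weighted least squares: w* = (X^T P X)^(-1) X^T P y,
# with a made-up diagonal weight matrix P.
rng = np.random.default_rng(2)
n, m = 100, 3
X = rng.normal(size=(n, m))
w_true = np.array([2.0, 0.0, -1.0])
y = X @ w_true + rng.normal(size=n)

p_diag = rng.uniform(0.1, 1.0, size=n)   # one weight per observation
P = np.diag(p_diag)

w_star = np.linalg.solve(X.T @ P @ X, X.T @ P @ y)
print(w_star)

# Equivalent view: WLS is ordinary least squares on rescaled data.
Xs = X * np.sqrt(p_diag)[:, None]
ys = y * np.sqrt(p_diag)
w_check = np.linalg.lstsq(Xs, ys, rcond=None)[0]
```

The two computations agree: multiplying each row of X and each y by the square root of its weight and running ordinary least squares gives the same w star.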
So a few years ago I was doing some fMRI work with my student Jörn Diedrichsen, and we came across a scenario where we could use this weighted least squares to estimate the function we were trying to fit to the data. Let me show you what this looks like. When you do fMRI, you take an image of the brain, then another image two seconds later, and another two seconds after that. Each of those images is a set of positions — voxels — and each location gets measured over and over across time. The problem is that sometimes the person moves; maybe at time point 300 they move their head a little bit. In that case, for a given position in space that you've been measuring over time, the data at that moment is not going to be good. So what do you do? Do you throw away that data, or do you somehow down-weight it in the model you're fitting? What we came up with is basically a mechanism that uses weighted least squares to weight each observation based on the uncertainty associated with it — so that P, this thing here that says how much I believe the data I've taken, is weighted based on whether the person moved and how much they moved. So what is a typical fMRI experiment?
You have n voxels — each position in the brain is one location, and we call that a voxel — and you've taken measurements at time points 1 through capital T: you've measured the activity of the brain over T time points, and each time point is one image. Take one particular voxel; its measurements, y sub n, form a T by 1 vector — the activity at this one location measured T times during the experiment. One fits that to what's called a design matrix, X. The design matrix is known — you manipulate things on the screen, whatever — and you have some model of how that activity should relate to your measurements via a weight vector beta for this particular voxel: the unknown, the variable you're going to fit. The residuals, epsilon, are noise — the error in fitting your model.

One way to come up with the weights in this kind of model is to use the variances of that residual epsilon. You can have a matrix — call it P, diagonal with entries P1 through P of T, one per sample — times some noise variance sigma squared for this particular voxel n, and write the noise covariance as P sigma squared sub n. The components that go into P are basically the variances of the noise. If the person suddenly makes a movement, one of those components becomes large, but the rest are fine. So you don't throw away the data for that measurement just because they moved; you scale it based on how well you believe it fits your model. If it fits poorly, it gets down-weighted; if it fits well, it gets up-weighted. In your notes I have a reference to how you would estimate these quantities and then fit for beta. Basically, beta here is found exactly as before when we found w: beta hat equals (X transpose P inverse X) inverse times X transpose P inverse y. So it's again just weighted least squares, except we have a reasonable way to estimate how much we believe each data point.

A related topic is learning with basis functions. What we do is that instead of having x, our model is now w transpose times a vector g of x, where g of x equals (g1 of x, ..., gM of x) and w is M by 1. What this means is that instead of linearly encoding the input space via x, as I've been doing in these scenarios, I've encoded it through these functions. For example, if my x space is a scalar, I might have some function g1 of x here and another, g2 of x, here. And the problem really doesn't change, because all I've done is replace x with g of x: my loss function is now the sum over i of (y of i minus w transpose g of x) squared, where x is evaluated at time point i. You then have to decide what your g's should be — what shape they take, how wide they are, and so forth. So basis functions are just a natural extension of the linear description via x that we've been using: in the approach I've shown here these are linear bases, but I could just as well have nonlinear bases encoding the space. Any questions about that?

All right, the last topic I want to talk to you guys about today is this loss function. We've been using a quadratic loss function. Why? Well, because we can find its derivatives: you see that I need the derivative of the loss in order to tell you the optimum w, right, and for a quadratic function I can easily find that derivative. But that's just a mathematical convenience. There was a nice paper a few years ago that tried to estimate
what the loss function is in people when they are learning things. So in the last part of our conversation, I want to show you how that experiment worked and how they decided what the loss function was. To motivate it, imagine the following scenario: you're performing actions, you observe errors in your behavior, and you change your behavior. As you change your behavior, can we estimate what it is you're minimizing? That's the idea. Are you minimizing the squared error, or something else entirely?

So suppose you go to the amusement park and you see one of those games where there's a jar and you have to throw a coin so that it falls in the jar. If it falls in the jar, you get the bear; if it falls anywhere else, you don't. It's not that a coin falling close to the jar gets you a small bear — you just get no bear. If it falls in the jar you win; anywhere else, it doesn't matter. That's one kind of game. Another kind is one like horseshoes — I don't know if you guys have seen the horseshoe game. There's a bar here, and I stand here and throw my horseshoe. If my horseshoe lands close, that's good; if yours lands farther than mine, I win. This is closer than this, so this is better than this: it's not only one location where all the goodies are — distance matters, and the farther off I am, the worse I do. Similarly when you throw darts: on your target, if you get close to the center you get one score, farther away you get another, and the distance indicates how much you're going to get. Did I get that right? Are darts like that?

Student: There's a caveat — there's one part in the middle that gives you more, so it doesn't quite scale.

What's the theory behind that? Why is there one part — I mean, there must be something about that being hard. It's a
strip, a thin strip. I thought there was some nonlinearity to it that I might have gotten wrong.

So the game they played with their participants in this experiment is as follows. There's a bar here — the target — and imagine that my finger is down here, and there are these peas, let's say, shooting out of my finger. The way they describe it is basically: if your pea hits this bar, that's great — try to maximize your performance by having your peas hit this bar. What they did is control the distribution relating the location of my finger to where the peas ended up — call it pea shooting, since we don't have bullets. The idea was to ask: where do people put their finger if you ask them to maximize the number of peas that hit the target? And what they did is manipulate the relationship between the location of the hand and the distribution of these peas. So let me show you the experiment.

First, the loss functions. Here's your error along this axis, and of course it's great to be at the center. When you're at the amusement park throwing the coin, it doesn't really matter how far away you are from the cup; it only matters whether you're in the cup. So that loss function looks like this: the loss of y tilde equals zero if y tilde equals zero, and one otherwise. That's the amusement park game. Now, you could instead have a cost function like the one we've been using — the quadratic cost, y tilde squared. That's a different kind of cost function: it says it's good to be at zero, bad to be a little farther away, and really bad to be very far away — things get really bad as you get farther out. You could have a linear cost, which just grows linearly with error. And you could have a cost that grows less than quadratically — the absolute value of the error raised to some power, say one and a half. So this one is just
the absolute value of y tilde, and this one is y tilde squared. Okay. So they wanted to know what the loss function was in the behavior that individuals had. The objective of the game, they imagined, was to minimize the expected value of loss. And what does that mean — what's the expected value? The expected value of the loss of y tilde is the integral of the loss times the probability of the error given the parameter w, integrated over all the errors. We want to find the w that minimizes this expected value of loss. Here, this is the loss function — which we don't know, and which we're trying to find out — and this is the probability of error given my actions.

So: here's the goal line, the target, and here's my finger, some distance away — I'm going to call that distance w. If I put my finger at some location, there will be some peas coming out of my finger, and what I want to know is how to place my finger so as to minimize the expected value of loss — where loss is measured in terms of the distance of the peas from zero, the goal target. That loss for me might be a nonlinear function like this delta function, or it could be some other function — this linear one, or some quadratic one. I don't know which; that's what I'm trying to estimate. And now what I'm going to do is impose this: I impose the relationship between the errors you're going to see and the place you put your hand. If you place your hand to the left, you get some probability distribution of errors; move it to the right, you get a different one. You can control the errors by moving your hand. What I want to know is: where do you place your hand? Because assuming you're doing as well as possible, the place you put your hand must be the place that minimizes this expected value.
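This "pick w to minimize expected loss" idea can be sketched numerically. Everything below is a made-up illustration, not the experiment's actual numbers: the two-Gaussian error distribution, the offsets, and the sigmas are all assumed stand-ins.

```python
import numpy as np

# A stand-in error distribution that depends on finger position w:
# a mixture of a narrow and a wide Gaussian (all parameters invented).
rho, s1, s2 = 0.7, 0.05, 0.15

def p_error(err, w):
    def gauss(x, mu, s):
        return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return (1 - rho) * gauss(err, w - 0.1, s1) + rho * gauss(err, w + 0.1, s2)

errs = np.linspace(-1, 1, 2001)
de = errs[1] - errs[0]

def expected_loss(w, loss):
    # numerical version of the integral of loss(err) * p(err | w)
    return np.sum(loss(errs) * p_error(errs, w)) * de

ws = np.linspace(-0.5, 0.5, 501)
# quadratic loss: minimize the expected squared error
w_quad = ws[np.argmin([expected_loss(w, lambda e: e ** 2) for w in ws])]
# "coin in the jar" (delta) loss: maximize the density at zero error
w_delta = ws[np.argmax([p_error(0.0, w) for w in ws])]
print(w_quad, w_delta)   # different loss functions pick different positions
```

The point of the sketch: with an asymmetric, bimodal error distribution, the quadratic loss places w to zero the mean error, while the delta loss places w to put the narrow peak on the target — two measurably different behaviors.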
I know what p is, because I set it. You tell me what w is, because you place your finger. And I will compute your loss by finding that loss function — that's the idea. So how did they do it? They represented this p as a mixture of Gaussians. Here's w, and I'm going to plot p of the error given that w equals, say, w1. So you placed your finger at some location, and you observe a density that looks like this — a mixture of two Gaussians. This is the probability of error given that you placed your finger maybe to the left by some amount, w1. If you moved your finger to the right, this is p of error given w equals w2 — that's what you'd see. And p of error given w is described as a mixture of two Gaussians: (1 minus rho) times a normal distribution with mean w minus some constant, 0.2, and variance sigma 1 squared, plus rho times a normal with mean w minus 0.2 plus 0.2 over rho and variance sigma 2 squared. I add them, weighted by this quantity rho, and that's where the peas go when you place your finger at w: place it at w1 and the peas go here, place it at w2 and they go over there — the sum of two Gaussians.

So let's see what the best thing you can do is if your loss function is this one here — let's worry about the one that's like a cup. It means it only matters to me if my pea falls exactly on target; whether it falls close or far makes no difference, because in all those cases it didn't fall on the target. If that's your loss function, how would you behave? Where would you put your w? Here's what they're going to do: they're going to change rho — this constant here — which means they're going to change this relationship between your finger position and
error. And as they change rho, they measure where you put your w. For each of these candidate loss functions, they can write down how w should depend on rho: as they change the relationship between the peas and your finger, if your loss function were the cup, these are the locations where you should put your finger. So what we want to know is: if your loss function is this cup, this nonlinear delta function, then as I manipulate rho, how do you change your w — where do you put your finger? Let's see what that would do.

All right. If all that matters to you is putting the pea inside the cup, and it doesn't matter where anything else falls, then look at this function here. To set w to minimize the loss, what you want to do is maximize the probability of your error being zero. Right? If all that matters is your pea hitting the target, and it makes no difference how far the other peas are from it, then you should place your finger so as to maximize the probability of a pea hitting the target. To minimize this loss function, you should maximize the probability of the error being zero: find the position for your finger where this probability density has its maximum at error equal to zero. If all you care about is getting the bear, then of course what you should do is place your finger so that as many of the coins you've thrown fall in the cup as possible — maximize that probability. It costs you nothing that the others fall far away, as far as they want, as long as your objective
is to maximize the probability of getting the bear. To get as many bears as possible, place your finger in such a way that you maximize the probability of y tilde being equal to 0. So what that looks like is this: if I were to plot rho on this axis, moving rho from some value close to 0 to some value close to 1, and compute the w that maximizes the probability of y tilde being equal to 0, what I will find is some function of rho that looks like this. This is the w for the cup. To maximize, what I would do is basically solve this equation, finding the w for which the probability of y tilde equal to 0 is maximized; it would depend on rho, and I could plot that function. So what's interesting to us is to ask now: how would this function look different if I wanted not to maximize the probability under this particular loss function, which is what I've shown here, but instead to minimize the quadratic loss function, the one that we've been doing, this one here? We can find an analytic solution for that, and that's what I want to do for you next. So before I go, are there any questions? Yeah, Max? It means that as they observe the peas landing, they can estimate the probability of it? Yes, but they're assuming that that's the case. Yeah, exactly, good point: we're assuming that the empirical distributions the subjects observe are unbiased estimators of those probability distributions. Yes sir? Why do they use two Gaussians?
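The "cup" strategy described above can be sketched numerically: given a finger position w, the error density is the two-Gaussian mixture discussed in this lecture, and the all-or-nothing player should pick the w that maximizes the density at zero error. The values sigma1 = 0.02 and sigma2 = 0.1 below are illustrative assumptions, not the experiment's actual parameters.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Gaussian density N(y; mean, var)."""
    return np.exp(-(y - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def error_density_at_zero(w, rho, s1=0.02, s2=0.1):
    """Density of the error y~ at 0 given finger position w, under
    p(y~ | w) = (1 - rho) N(w - 0.2, s1^2) + rho N(w - 0.2 + 0.2/rho, s2^2)."""
    return ((1 - rho) * normal_pdf(0.0, w - 0.2, s1 ** 2)
            + rho * normal_pdf(0.0, w - 0.2 + 0.2 / rho, s2 ** 2))

def best_w_for_cup(rho, grid=np.linspace(-0.5, 0.5, 2001)):
    """Grid-search the w that maximizes the chance of landing in the cup,
    i.e. the error density at y~ = 0."""
    return grid[np.argmax([error_density_at_zero(w, rho) for w in grid])]
```

With these assumed sigmas the cup-optimal w sits near the narrow component for moderate rho, then jumps toward the broad component as rho grows, so it depends on rho; exactly where it moves depends on the sigmas chosen.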
Well, you'll see why. It'll turn out that the expected value of this function has a particularly simple dependence on w, but its variance will be independent of w. And so when we want to minimize the squared error, which we'll do right now, the variance won't matter, and it'll be particularly easy to handle. That's a good point. Alright, I should start bringing water here; I can tell my throat is getting dry. Alright, so suppose now our loss function is a squared error, and what we want to do is find the w that minimizes the expected value of the loss. So the question is: what is the expected value of my loss given w? If I can find that, and then find the w that minimizes it, I will find the optimal place to put my finger. So remember how we define variance: the variance of X is the expected value of (X minus its mean) squared, which is equal to the expected value of X squared, minus the expected value of X quantity squared. So the expected value of X squared is equal to the variance of X plus the expected value of X quantity squared. Likewise, the expected value of y tilde squared given w is the variance of y tilde given w, plus the expected value of y tilde given w quantity squared. So what I need to know is these quantities: the expected value of y tilde given w, that quantity squared, and this variance. So here's the probability distribution, and I need to find its expected value. I have: the probability of y tilde given w is (1 minus rho) times a normal with mean w minus 0.2 and variance sigma-1 squared, plus rho times another normal with mean w minus 0.2 plus 0.2 over rho and variance sigma-2 squared. That's my function. So the expected value of y tilde given w is going to be
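The identity being invoked here, that the expected value of X squared equals the variance plus the squared mean, holds for any distribution; here is a quick numeric sanity check on an arbitrary sample:

```python
import numpy as np

# Check the identity E[X^2] = Var(X) + (E[X])^2 on an arbitrary sample.
x = np.array([1.0, 2.0, 2.0, 5.0, 7.0])
mean = x.mean()
var = ((x - mean) ** 2).mean()   # population variance, E[(X - E[X])^2]
assert np.isclose((x ** 2).mean(), var + mean ** 2)
```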
equal to (1 minus rho) times (w minus 0.2), plus rho times (w minus 0.2 plus 0.2 over rho). And if you multiply that out and do the cancellation, that's equal to w. So, Max, the reason why they had that funny thing in there is because they wanted this expected value to be equal to w. And it turns out that the variance of this function is not dependent on w: the variance of y tilde given w does not depend on w. So this quantity, which is the sum of this variance plus this mean squared, is going to be w squared plus a constant. The w star that I want is the arg min over w of w squared, this term squared, plus the variance of y tilde given w, which does not depend on w, and that arg min is going to be equal to 0. So if I go back now to this plot of the optimum w as a function of rho, what I will find is that for a quadratic error I will not change my w; I will always keep it at 0. So this is my behavior if my loss function is that cup, and this is my behavior if my loss function is y tilde squared. So they constructed these two Gaussians in such a way that if I cared about the error raised to the power 2, that is, if my loss function was quadratic in the error, then I would keep my finger in exactly the same place as they changed this parameter rho. On the other hand, if what I cared about was winning by placing as many of those peas inside the cup, then I would be changing my finger position as they manipulated rho. And similarly they did the same thing for other losses: this is y tilde squared, and it turns out that if you use the absolute value of the error you get something like this. So they did it for a family of loss functions: is it the error raised to the power of 2, raised to the power of 1, raised to the power of 1.5? And what they concluded from the actual data that they got, which was a messy thing, was that people sort of behaved like
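The two claims in this step, that the mixture's mean is exactly w and that its variance does not depend on w, can be verified directly from the standard mixture formulas. The sigma values below are illustrative assumptions:

```python
import numpy as np

def mixture_mean_var(w, rho, s1=0.02, s2=0.1):
    """Mean and variance of the error mixture
    (1 - rho) N(w - 0.2, s1^2) + rho N(w - 0.2 + 0.2/rho, s2^2)."""
    weights = np.array([1 - rho, rho])
    means = np.array([w - 0.2, w - 0.2 + 0.2 / rho])
    varis = np.array([s1 ** 2, s2 ** 2])
    m = weights @ means                            # mixture mean
    v = weights @ (varis + means ** 2) - m ** 2    # mixture variance
    return m, v

ws = (-0.1, 0.0, 0.25)
for rho in (0.3, 0.7):
    stats = [mixture_mean_var(w, rho) for w in ws]
    # The 0.2/rho offset was chosen so the mean is exactly w ...
    assert all(np.isclose(m, w) for (m, _), w in zip(stats, ws))
    # ... and the variance is the same for every w, so
    # E[y~^2 | w] = w^2 + const, which is minimized at w = 0.
    assert np.allclose([v for _, v in stats], stats[0][1])
```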
this: somewhere in between. I'm sorry, so here is what you'd do for the cup, here for the error raised to the power 1, here for the power 2, and they got something that went in between power 1 and power 2. So they said the loss was best described as the absolute error raised to the power of 1.75; that was the best estimate of the loss function. So just to summarize what we did today with this experiment: what they manipulated was the relationship between error and your actions, the probability of error given the action that you could produce, and you control your actions. They computed the loss function from what you observed as a probability of error and what you did in terms of your actions, assuming that what you were doing was minimizing the expected value of the loss function. Okay. Yeah? Thanks. If they paid you based on some error function? So if they manipulated this directly, as in the amusement park example, the loss function is set: you're not going to get anything for missing. In that case your behavior should be quite different, yeah. Is that what you had in mind? I'm not sure. The easiest control for that would be to just move the target, because you're measuring your w relative to the target. So I guess saying you can just keep your hand in one place is a little facetious, because you actually can't just keep your hand in one place; they probably varied the target location and measured the w position relative to the target. But Max, what was your thought? So does it depend on rho whether the reverse could hold, where the cup would give a constant w and the other one would want to be changing? I see what you mean, I see what you mean. You're saying that not moving, in this case... can we set up a scenario where not moving is the best thing to do for the cup and moving is the best thing to do for a squared error? And do you still see something that is consistent with 1.75? That's right, that's right. Good point. Okay guys, thank you so much.
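The family-of-losses fit described above can be sketched numerically: for each exponent alpha, find the w that minimizes the expected loss E[|y tilde|^alpha given w] under the two-Gaussian error mixture, and see how that optimum depends on rho. The sigma values, grids, and integration range below are illustrative assumptions, not the experiment's parameters:

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Gaussian density N(y; mean, var)."""
    return np.exp(-(y - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def expected_loss(w, rho, alpha, s1=0.02, s2=0.1):
    """E[|y~|^alpha | w] under the error mixture
    (1 - rho) N(w - 0.2, s1^2) + rho N(w - 0.2 + 0.2/rho, s2^2),
    approximated by a Riemann sum over a dense grid of errors."""
    y = np.linspace(-2.0, 2.0, 8001)
    p = ((1 - rho) * normal_pdf(y, w - 0.2, s1 ** 2)
         + rho * normal_pdf(y, w - 0.2 + 0.2 / rho, s2 ** 2))
    return np.sum(np.abs(y) ** alpha * p) * (y[1] - y[0])

def best_w(rho, alpha, grid=np.linspace(-0.3, 0.3, 601)):
    """Grid-search the finger position w minimizing the expected loss."""
    return grid[np.argmin([expected_loss(w, rho, alpha) for w in grid])]

# alpha = 2 (quadratic): the optimum sits at w = 0 for every rho.
# alpha = 1 (absolute error): the optimum tracks the median of the
# mixture, which moves as rho changes -- so different exponents predict
# different finger positions, which is what the experiment exploits.
```

Fitting which exponent best matches the measured w(rho) curve is how one arrives at an estimate like 1.75.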
See you next Monday.