David, you ready? All right. So today I'm going to describe the idea of maximum likelihood, which is the beginning of thinking about what we call generative models. The idea is that you're going to describe a model by which your data was generated, not just fitting your data to some model, but describing the process that generated the data, which means you're going to have to make some assumptions about the probabilistic nature of the data. This leads us to the notion of maximum likelihood, which asks: what's the most likely scenario under which the data that I collected was generated? There are some parameters in your model, and these parameters include things like the weights, and maybe the variances associated with the noise. What maximum likelihood does is say: given the data that I saw, how do I choose the parameters so as to maximize the likelihood of seeing this data? So it's more than describing a model that fits the data; it's describing a model that produces the data. That's what's called a generative model. At the end of the lecture, I want to show you how we can use this to describe the mixing of multiple sensors. For example, I may have one sensor that measures something and another sensor that measures the same thing; how do I combine those two using these same maximum likelihood concepts? And then we'll talk about an experiment where they looked at how people combine what they see with what they feel. In that experiment, people were feeling an object haptically, so they had a sense of its size; they could also see it; and they combined these two pieces of information to form an estimate of the size of the object. We're going to use maximum likelihood to understand how they combine these two kinds of sensors. So first we build a mathematical framework to describe maximum likelihood.
We'll link it to the work we've been doing so far, minimizing the loss function and so forth, and then we'll link it to biology: how, in this condition where you have multiple sensors, you combine the information from those sensors. So we've been talking about a loss function, and for most of the class we've been describing it as a quadratic function, some sum of squared errors, something like this. It's fair to ask: what's good about this loss function, and where does it come from? Today we're going to talk about generative models, which describe how we believe our data is generated. And we'll see that, under certain assumptions, minimizing this loss is exactly the maximum likelihood thing to do: if you believe your noise is Gaussian with zero mean, then the loss function you are minimizing is equivalent to maximizing the likelihood of the measurements y. So there's a relationship between a quadratic loss function and a generative model in which you believe there's Gaussian noise. Let's see how that works. What I mean by a generative model is this: in our case, we have two things that we observe. We observe x and we observe y. The relationship is as follows. There's a y that we measure, there's an x that is related to this y, and there's this thing called w that we don't know. The nomenclature I'm going to use from here on is this: things that I measure get double circles; arrows describe conditional probabilities; and a single circle means something I can't measure, something I'm going to try to estimate. What I'm trying to do in such models is describe the relationship between the things that I see and the things that are producing them. So for example, in this case, y is equal to w transpose x plus some noise, epsilon.
Let's assume that this epsilon has the following properties: it has mean 0 and variance sigma squared. So that's a generative model. It says that if you were to give me an x, I would give you a y. But notice that the y I measure is not going to be the same for every x. For the same x, I might get a different y, depending on what the epsilon is, the noise on that term. So maximum likelihood is the idea: find a way to maximize the likelihood of seeing all the y's that you saw, given that you had this particular model. You're going to write the probability of y given w, x, and sigma squared, and that probability, in this case, is a normal distribution with mean w transpose x and variance sigma squared. Say you observed y1, the first data point that you got. If that value is far from the mean, w transpose x1, then that's a very unlikely data point. On the other hand, the certainty you have regarding this data point is related not just to the mean, but also to the variance. If your model says there's a lot of noise in my measurement, a large sigma squared versus a small one, then a given difference between what you predicted, w transpose x1, and what you observed, y1, can be very unlikely under the low-noise scenario but pretty likely under the high-noise scenario. So the same weights, depending on the noise, will give you a different certainty associated with your particular observation. What maximum likelihood does is say: let's find the parameters, w and sigma, such that we maximize the likelihood of actually observing the data that we did. So I'm giving you some data. You have a model that says how this data was generated.
Find the parameters, w and sigma, that most plausibly produced the data that you saw. If you do that, you've maximized the likelihood of seeing that data. That's what maximum likelihood is about. So in principle, what we're going to do is maximize the probability of the observations we had, given that there was some w that generated them, these x's that we saw, and some variance sigma squared associated with the distribution that generated the data. That's the likelihood. And what maximum likelihood does is find the parameters, w and sigma squared here; we'll collect them together and call them theta; that produced the data that you saw. In principle, whatever generative model you have, however you believe your data was generated, by maximizing the likelihood of the data that you saw, you estimate the parameters. And what we will see is that with this particular cost function, when we were trying to find the parameter w, what we were really doing was maximizing the likelihood under this particular generative model with this particular noise. Any questions about that? So let me give you some examples. Let's start with a particular distribution. Say I have an exponential distribution, and I'm going to give you some data points, capital N of them, drawn from that exponential distribution. What that means is that the probability of y being equal to yi, the particular data point that you saw, given some parameter a associated with my exponential distribution, is 1 over a times the exponential of minus 1 over a times yi. This says how likely it is that the data you saw came from this particular probability distribution. And what I want to know is: what's the parameter a that would maximize the likelihood of the particular data that I saw? So how many data points did I see?
I saw data points 1 through N, so I have the probability of y1 through yN given a; that's my likelihood. And because each of these observations is independent, the likelihood becomes the product of this quantity over all N points. If I multiply that out, it becomes 1 over a to the N, times the exponential of minus 1 over a times the sum of the yi. OK, so I want to maximize this likelihood. How do I maximize that function? Typically, instead of maximizing something with an exponential in it, we take its logarithm so that we have a simpler function to work with. The logarithm is fine for this purpose because it's a monotonically increasing function: if I want to maximize the likelihood, I can just as well maximize its logarithm. We call that the log likelihood, the log of this probability. Let's see what it is. It's the log of the first factor plus the log of the second factor, because log of a times b is log of a plus log of b. The first term, the log of a to the minus N, becomes minus N times log of a; the second term stays minus 1 over a times the sum of the yi. Now I want to find the a that maximizes this log likelihood, so I take the derivative with respect to a. The derivative of log of a is 1 over a, so the first term gives minus N over a. The second term has a to the minus 1 in it; its derivative is minus a to the minus 2, and with the minus sign already in front, it becomes plus 1 over a squared times the sum from i equals 1 to N of yi. Set this equal to 0, and I get a equals 1 over N times the sum from i equals 1 to N of yi. So the maximum likelihood estimate of a is just the sample mean. Does this look all right? OK, let's now do this for another problem. Say that we have a data set that's generated as follows.
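To make that concrete, here's a minimal numerical sketch; it's not part of the lecture, and it assumes Python with NumPy and SciPy, with variable names of my own choosing. It checks that the closed-form answer we just derived, the sample mean, agrees with directly maximizing the log likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
a_true = 2.0
y = rng.exponential(scale=a_true, size=10_000)  # N draws with mean a_true

# Closed-form ML estimate derived above: a_hat = (1/N) * sum(y_i)
a_closed = y.mean()

# Numerical check: maximize log L(a) = -N log(a) - (1/a) * sum(y_i)
# by minimizing its negative over a bounded interval.
def neg_log_lik(a):
    return len(y) * np.log(a) + y.sum() / a

a_numeric = minimize_scalar(neg_log_lik, bounds=(0.1, 10.0),
                            method="bounded").x
print(a_closed, a_numeric)  # both close to a_true = 2.0
```

The point of the check is that the calculus we did on the board (take logs, differentiate, set to zero) lands on the same number a generic optimizer finds.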
We have N data points from a Gaussian distribution. In this case, the probability of y being equal to yi depends on some mean mu and some variance sigma squared; that's a Gaussian, 1 over the normalizing quantity times an exponential. We've been given N data points, so the probability of y1 through yN given mu and sigma squared is the product of this quantity, which equals 1 over 2 pi to the N over 2, times sigma to the N, times an exponential of the summed quadratic term. That's my likelihood of seeing those data points; it's just the product of all those normals. Next I find the log of this, the log likelihood. That's the log of the first term plus the log of the exponential term; let me write it out. It's the log of 2 pi to the minus N over 2, plus the log of sigma to the minus N, plus the exponent itself, which equals minus N over 2 log of 2 pi, minus N log of sigma, minus 1 over 2 sigma squared times the sum of yi minus mu, squared. OK, now I take the derivative of this log likelihood with respect to mu, the first parameter I want to estimate. I have two parameters, the mean and the variance. To find the mean of the distribution, I differentiate with respect to mu: I get minus 1 over 2 sigma squared times the sum of 2 times yi minus mu, times minus 1, which makes the whole thing a plus, and the 2 in front cancels the 2 in the denominator. Setting this equal to 0 and solving for mu, the 1 over sigma squared factor is irrelevant, and I get mu equals 1 over N times the sum of the yi. That's the familiar recipe: you find the mean of a normal distribution by summing the values and dividing by the number of elements. And this is the maximum likelihood estimate of the mean of that distribution. Now let's find the estimate of the variance. We want the derivative of the log likelihood with respect to sigma. The first term gives minus N over sigma. The second term has sigma to the minus 2 in it.
The minus 2 comes out in front, giving sigma to the minus 3, and with the minus sign already there, that's plus 1 over sigma cubed times the sum of yi minus mu, squared. I set this equal to 0: minus N over sigma plus 1 over sigma cubed times that sum equals 0. I want to solve for sigma squared; multiplying through by sigma doesn't matter because of the 0 on the other side. What I end up with is sigma squared equals 1 over N times the sum of yi minus mu, squared. So this is my maximum likelihood estimate of the variance of my data. Any questions about this? Yeah? In that last equation, is that mu the ML estimate of mu? Yep. OK. So let's now do the problem of estimating w in our linear model, with a generative model that says how our data was produced. Suppose the likelihood is the product, over data points, of the individual likelihoods, each given by a normal distribution, something like this. If I multiply this out, I get a sigma to the N; you can see that. This is the likelihood for the model where every data point you see is generated by the weights you're trying to estimate plus some measurement noise, assuming that noise is Gaussian with zero mean and variance sigma squared. Every data point I gave you, you're going to assume was generated by a model in which there was noise, and that noise had variance sigma squared and zero mean. OK. If that's the model that generates your data, then this likelihood assigns a probability to every data point that you see. What we want to do is find the parameters that produced our data; w and sigma are our unknowns. So we want the log likelihood, and then we maximize it over the parameters w and sigma squared. What is the log likelihood? I have a product here, so it becomes a sum of logs: minus N over 2 log of 2 pi, minus N log of sigma, minus 1 over 2 sigma squared times the sum from i equals 1 to N of yi minus w transpose xi, squared. OK.
So look at this equation: I want to find the w that maximizes this function. Well, look here. What is this last sum? It's our loss function. And of course, what we did before was find the w that minimized that loss function. We're doing exactly the same thing now. If we assume the data was generated through a process in which there was a true w and a true sigma squared, and the relationship between the data I saw and the true quantities went through this Gaussian noise, then by finding the w that minimizes the loss function we are also maximizing the log likelihood, because the loss enters with a negative sign. We maximize the likelihood of seeing the data by minimizing the quadratic term inside it. So let's do this: find the derivative of the log likelihood with respect to w. I can write this sum in matrix form as y minus Xw, transpose, times y minus Xw. This becomes exactly what we had before. The derivative of that with respect to w is minus X transpose y, minus X transpose y, plus 2 X transpose X w, all multiplied by the minus 1 over 2 sigma squared out front. Set that equal to 0, and what we get is w equals X transpose X, inverse, times X transpose y. This is the maximum likelihood estimate of w. What's the maximum likelihood estimate of sigma? Go back to our equation, take the derivative with respect to sigma, and maximize the log likelihood. The derivative of the log likelihood with respect to sigma is minus N over sigma; that's the derivative of the first term. The derivative of the second term is plus 1 over sigma cubed times this sum. Setting that equal to 0, sigma squared equals 1 over N times the sum of yi minus w transpose xi, squared.
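Here's a small sketch of what we just derived; again this assumes Python with NumPy, and the data is simulated rather than anything from the class. It computes both ML estimates, the weights and the noise variance, from a data set generated exactly by the model on the board:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5_000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.5, -2.0, 0.5])
sigma_true = 0.7
# Generative model from the lecture: y = X w + epsilon, epsilon ~ N(0, sigma^2)
y = X @ w_true + rng.normal(scale=sigma_true, size=n)

# ML estimate of w: (X^T X)^{-1} X^T y, the same as least squares.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance: (1/n) * sum of squared residuals.
sigma2_hat = np.mean((y - X @ w_hat) ** 2)
print(w_hat, sigma2_hat)
```

With a reasonably large n, w_hat lands near w_true and sigma2_hat near sigma_true squared, which is the point of the equivalence: minimizing the quadratic loss and maximizing this Gaussian likelihood give the same w.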
So this is the maximum likelihood estimate of the variance of the noise in the process that generated my data, and this is the maximum likelihood estimate of the weights that generated that data. Any questions so far? You see what I did? First describe the likelihood of seeing your data, then find the parameters that maximize that joint probability distribution. All right, let's try a different example, with the following scenario. Oh, I forgot to tell you: we are now starting on the book. All the material for the rest of the class will come from your book. We are now in chapter 4.1, and today we're going to cover 4.1 to 4.5. So here's the problem I want you to consider. Suppose you have in your hand a GPS that tells you where you are. But you're worried about it being inaccurate, so you bought another GPS; now you have two GPS's. One of them was purchased in the United States, so it uses US satellites. The other was purchased in Europe, so it uses European satellites. So they give you independent estimates of where you are. Now I want to set up the problem of determining where you are based on the readings from those two GPS's, and the idea is to show you how to use maximum likelihood to give the best estimate that you can. So here's our problem; it's the hiking-in-the-woods problem. You are at some location, call it x, a 2 by 1 vector, some position in the world. You have two sensors: call one of them A and the other B. We want to set up this problem so that, with these two sensors, after you make your measurements, you form an estimate of where you are. To give you some intuition about what this problem will be like: in some sense, you're going to be averaging these two sensors.
Because one of them tells you, I'm over here, and the other tells you, I'm over there; you're probably somewhere in the middle. Now, the way you average them depends on the uncertainty associated with each sensor, meaning the noise there is in each sensor. Presumably, you're going to weigh the information you get according to that noise: whichever sensor is noisier gets downgraded more, and you rely more on the sensor that's less noisy. In describing our generative model, we're going to have to assume that we know the noises in this system; I'm going to specify for you what the noises are. Then you're going to make a decision about where you are so as to maximize the likelihood that these sensors are telling you what they're telling you. In a few lectures, I'll give you the example of what happens when one sensor says I'm over here, the other says I'm over there, and combining the two would put me in the middle of a river. I know I'm not in the middle of a river, so something is wrong, and in that case it doesn't make sense to combine the two pieces of information; I'm not going to believe both. Those are scenarios we'll consider later. But today, let's take the simplest case: two sensors that tell us something about the world. Now, this is a biologically relevant thing, because in your brain you also have the capacity to rely on multiple sensors. There are things you can touch and things you can see, and both of them tell you things about the world. Presumably, the way you combine them takes into account something about the quality of each sensor.
And that's one of the papers I want to show you at the end of the lecture: how they used this idea to demonstrate that when the brain made an estimate by incorporating multiple sensory modalities, it used something like maximum likelihood. So that's where we're going. All right, our problem is we're hiking in the woods. We're standing at some location, and we take out our two GPS's to estimate where we are. Let's set up the problem. I'm going to write my probability distributions as follows. y is my observation, the position that I'm reading, and it's made up of two pieces, yA and yB. Each one is a 2 by 1 vector, so y is a 4 by 1 vector: one GPS stacked on the second GPS. I'll write the relationship between what I measure and the true location I'm at as y equals C times x plus epsilon. And what I mean by C is an identity 2 by 2 stacked on another identity 2 by 2. So I believe that what I measure depends on where I'm actually standing, and that relationship is described by this matrix C, which basically says that my two sensors are measuring the same position I'm at, and they're unbiased, plus some noise epsilon, which is normally distributed with mean 0 and a variance I'm going to call R. This R is a block-diagonal matrix made up of two noise covariances, RA and RB, one for each sensor. So one of the sensors has noise RA, the other has noise RB, and each of those is a 2 by 2 matrix. Let me stop and make sure we understand what the problem is. Yeah? Does that mean RB is a 2 by 2 matrix, or R? RB is 2 by 2; R itself is 4 by 4. OK. The position is two-dimensional, that's all.
All right. So let's write the likelihood. What's the probability of y, that is, yA and yB, given x? It's a normal distribution with mean C times x and variance R. What that looks like is 1 over the square root of 2 pi to the fourth power times the determinant of R, times the exponential of minus 1 over 2, y minus Cx, transpose, R inverse, y minus Cx. OK. That's my conditional probability, the relationship that I'm looking for; this is the likelihood. So let's find the x that maximizes this likelihood. What is my position estimate; where do you think you are? That position is my maximum likelihood estimate: the x that maximizes the likelihood of the two sensor measurements that I made. Find the x that maximizes the probability of the observations. Now, before, we were using x for a known input, but don't get confused by that: x is now what I want to find. I'm standing at some location; I have two measurements, yA and yB; tell me the maximum likelihood estimate of the position x. The way I'll find it is by finding the x that maximizes this likelihood. Does that make sense? All right. To do that, I take the log of this function. The log likelihood is minus 2 log of 2 pi, minus one half the log of the determinant of R, minus one half of the quadratic term. OK, so let's find the x that maximizes the likelihood. To help us through, let me multiply out the quadratic: I get y transpose R inverse y, minus x transpose C transpose R inverse y, minus y transpose R inverse C x, plus x transpose C transpose R inverse C x. The derivative of the log likelihood with respect to x, up to that minus one half in front, is minus C transpose R inverse y, minus C transpose R inverse y, plus 2 times C transpose R inverse C x. Setting this to 0, x equals C transpose R inverse C, inverse, times C transpose R inverse y. So this is my maximum likelihood estimate of where I am, which looks very similar to what we've seen in the weighted least squares problem.
But I want to give you some intuition about what these terms are, so let's write out some of these matrices. What is C transpose R inverse? Well, C was the stacked identity matrix, so C transpose is identity next to identity. R inverse is RA inverse and RB inverse on the diagonal, with 0's off the diagonal: R is a 4 by 4 block-diagonal matrix, so when I invert it, I just invert the individual blocks. So C transpose R inverse is RA inverse next to RB inverse. And what's C transpose R inverse C? If I multiply that by C, I get RA inverse plus RB inverse. The term on the right, C transpose R inverse y, is RA inverse times yA plus RB inverse times yB. So the whole estimate x is RA inverse plus RB inverse, quantity inverse, times RA inverse yA plus RB inverse yB. Let me give you the intuition for what we just did. This says that your best estimate of where you are is to take your measurement yA and weigh it by the inverse of the variance of that sensor, take the measurement yB and weigh it by the inverse of the variance of that sensor, and then normalize by the sum of the inverse variances. Just like what we expected: we have two sensors, each with its own noise, RA and RB, and when I combine the two pieces of information, I weigh each piece by the certainty I have in it, which comes from the noise in that measurement. If the noise in one sensor is RA, I weigh its reading by RA inverse; the other by RB inverse; and then I normalize by the sum of the two inverses. So we now have a maximum likelihood estimate: this is where I believe I am. Now, my belief about where I am isn't just the mean. This is my mean, the expected value of where I am. What is the variance of my belief? I have some belief about where I am.
And because I've done this problem probabilistically, I not only have a mean, I also have some uncertainty about that mean: where is this location? That's the variance of my estimator. So let me find the variance of that equation. The estimate is a linear function of y, so its variance depends on the variance of y, and it looks like this: C transpose R inverse C, inverse, C transpose R inverse, times the variance of y, times that whole term transposed. Now, if I'm just standing still, look at the top equation, y equals Cx plus epsilon; if I know I'm not moving, the only variance I have is the variance in my reading, so the variance of y is just R. When I substitute that in, the R's cancel: C transpose R inverse C, inverse, C transpose R inverse, times R, times R inverse C, times C transpose R inverse C, inverse, transposed. The middle collapses, and I'm left with C transpose R inverse C, inverse; the transpose is irrelevant because this is a symmetric matrix. So that's the variance of my estimate. Now, why is this important? What is C transpose R inverse C? We worked it out right here: it's RA inverse plus RB inverse, so the variance of the estimate is RA inverse plus RB inverse, quantity inverse. What this tells you is that if you have one measurement with variance RA and another with variance RB, and you combine the two, what you end up with is better than either of them alone: the variance of your belief is smaller than either of the two measurement variances. That quantity, RA inverse plus RB inverse, inverted, is smaller than either RA or RB. So two sensors are better than one. In what sense?
They give you an uncertainty that's smaller than either of the two variances. And this is why having more sensors is good, as long as each one of them is unbiased and you know the variances associated with them. For example, if you go through this and draw how things would look: say you have an x-y axis that describes where you think you are. One of the sensors tells you, with this uncertainty, here's where I am; the other sensor tells you, with that uncertainty, here's where I am. Your maximum likelihood estimate is going to be some place in between, but not necessarily on the line between the two means, because it depends on the shapes of those distributions. The most likely location is somewhere in between, possibly off that line, with a distribution that's tighter than either of the two. That distribution is shown here: here's its mean, and here's its variance. That's your maximum likelihood estimate of where you are. Any questions? OK, let me give you an example now from biology. In 2002, there was a paper that appeared in Nature that began to use these concepts to describe how the brain incorporates data from multiple sensors. I want you to appreciate the timing here: on Monday, when we talked about estimating generalization, the paper I described was a Science paper that appeared in 1990, and then there was a Nature paper that appeared in the year 2000 that used the concept of generalization. So you can begin to see how this mathematics of basic statistics and learning is now being used to try to understand how the brain does it, and it results in very flashy papers that appeared not that long ago, ten or eleven years ago or so.
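Before moving to the biology, the two-GPS math above can be tied together in a small numerical sketch; this assumes Python with NumPy, and the readings and covariances are made-up numbers for illustration. It computes the fused estimate two ways, by the inverse-variance weighting and by the general matrix formula, and shows the fused variance is smaller than either sensor's:

```python
import numpy as np

# Two GPS readings of the same 2-D position, each with its own covariance.
yA = np.array([10.0, 4.0]);  RA = np.diag([4.0, 1.0])   # noisier east-west
yB = np.array([11.0, 5.0]);  RB = np.diag([1.0, 4.0])   # noisier north-south

# Inverse-variance weighting: x = (RA^-1 + RB^-1)^-1 (RA^-1 yA + RB^-1 yB)
RA_inv, RB_inv = np.linalg.inv(RA), np.linalg.inv(RB)
P = np.linalg.inv(RA_inv + RB_inv)       # covariance of the fused estimate
x_hat = P @ (RA_inv @ yA + RB_inv @ yB)

# Same answer from the general form x = (C^T R^-1 C)^-1 C^T R^-1 y,
# with C the stacked identities and R the block-diagonal noise covariance.
C = np.vstack([np.eye(2), np.eye(2)])    # 4 by 2
R = np.block([[RA, np.zeros((2, 2))], [np.zeros((2, 2)), RB]])
y = np.concatenate([yA, yB])
x_gen = np.linalg.solve(C.T @ np.linalg.inv(R) @ C,
                        C.T @ np.linalg.inv(R) @ y)

print(x_hat, np.diag(P))  # fused variances are below either sensor's
```

Note how the fused point is pulled toward whichever sensor is more reliable in each direction, exactly the intuition from the board.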
The reason I think it's useful for you to know this is that you can begin to see, I hope, that this kind of thinking is making major inroads into understanding how the brain works. And that's why we have this class, so you can see these things. This is not stuff from two centuries ago; it's from just a decade ago, as it's being used. Maximum likelihood has been around a long time, but it really wasn't until about 12 years ago that people began applying it to the question: does the nervous system incorporate decision processes that coincide with maximum likelihood? And the experiment was very simple. You're going to have a device that you use to feel the size of something. They give you a little robot that can render an object of a particular size, and you go and feel it haptically. As you feel it, they give you visual feedback on a monitor that displays the size. So you're going to see something and you're going to feel something, and your brain forms an estimate of the size of that object by using these two pieces of information: I saw it was this wide, and it felt this long. The aperture of my hand is one sensor; visually, I have another piece of information. Now, if you did this in a maximum likelihood way, what your brain would do is say: I have a certain uncertainty about my haptic sense; there's some noise there. I have a certain uncertainty about my visual sense, what I see. And you would think that the noise associated with your eyes is much less than the noise associated with what you feel; generally, we're much better at estimating things visually than by feel. So I have two kinds of noises: noise associated with my visual sensor, and noise associated with my haptic sensor. The experiment is just like our GPS problem: there is a true size that I'm trying to estimate.
Now, again, the caveat: if what they show you is tiny but what you feel is really large, it's like the problem where one sensor says I'm on this side of the river and the other says I'm on the far side; I know I'm not in the middle, so I'm just going to ignore one of them. But that's not the case here: the two readings are very close to each other, so we're going to believe that the two sensors are really measuring the same thing. That's the important initial assumption. Our generative model is that these two sensors are measuring the same thing, but one sensor happens to be much more accurate than the other. What they tried to do in this experiment is ask: when you make a decision about what size this thing is, are you combining these two pieces of information in a maximum likelihood way? How can they test that? Well, first they have to estimate the noise in each of your sensors. You have to know, for Reza doing this experiment, what is the noise in his haptic sense, and what is the noise in his visual sense? This has to be done independently first. Then, when I give you both together, do you make an estimate of size that combines them accordingly? The second problem is that these are distributions, but when I answer a question for you, I don't give you a distribution; I just tell you it's bigger than the other one, or smaller. So you have to use my responses, in the form I give them, to infer the probability distribution that I'm working from. The techniques for that depend on things called psychometric functions, which is what I want to show you. So mathematically, we have a framework to describe the problem, but we still have the experimental problem: how do we actually go and measure somebody's uncertainty about a size? How do we measure the noise in a particular sensor? That's where we're going. All right, let me show you their experiment.
As I said, the material from here on is covered in chapter four of your book, which I suggest you read. So our generative model is as follows. There is some height x that we're trying to estimate through vision and through haptics. As before, y = Cx + epsilon, and in this case epsilon is normally distributed with mean 0 and covariance R, where R is now the 2-by-2 matrix [sigma_h^2, 0; 0, sigma_v^2]. It's a one-dimensional problem: I have a variance sigma_h^2 associated with my haptic sensor and a variance sigma_v^2 associated with my vision. C is the vector [1, 1]^T, so y, my measurement, is 2-by-1: my haptics and my vision. If I do the maximum likelihood estimate, I end up with the following: x_ML = (sigma_h^-2 y_h + sigma_v^-2 y_v) / (sigma_h^-2 + sigma_v^-2). That's my normalized relationship. My best estimate takes the haptic measurement weighted by the inverse of its noise variance, plus the visual measurement weighted by the inverse of its noise variance, normalized by the sum of those inverses. And my uncertainty about x is going to be 1 over the denominator of that expression: var(x_ML) = 1 / (sigma_h^-2 + sigma_v^-2). To give you a sense of what this means: suppose your haptic estimate says you're here (that's the mean of your haptic sensor measurements), and your vision tells you you're there (the mean of your visual sense), and suppose we have a person with really bad vision, so sigma_v^2 equals sigma_h^2 and the two distributions have the same width. Then your maximum likelihood estimate is right down the middle, with a variance that's smaller than either one: that combined variance is exactly the quantity above.
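To make the inverse-variance weighting concrete, here is a minimal sketch. The readings and variances are made-up illustrative numbers, not data from the experiment:

```python
# Inverse-variance weighting of two noisy sensors measuring the same quantity.
# All numbers here are illustrative assumptions, not data from the study.
y_h, sigma2_h = 5.4, 1.0    # haptic reading (cm) and its noise variance
y_v, sigma2_v = 5.1, 0.25   # visual reading (cm) and its (smaller) variance

w_h = 1.0 / sigma2_h        # weight = inverse of the noise variance
w_v = 1.0 / sigma2_v
x_ml = (w_h * y_h + w_v * y_v) / (w_h + w_v)   # maximum likelihood estimate
var_ml = 1.0 / (w_h + w_v)                     # variance of that estimate

print(x_ml)    # 5.16 -- pulled toward the more reliable visual reading
print(var_ml)  # 0.2  -- smaller than either sensor's variance alone
```

Note that the fused variance is always smaller than either individual variance: combining the cues can only sharpen the estimate.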
On the other hand, most of us have the opposite scenario. I have some belief about what I feel, and a much sharper estimate of what I see, so sigma_v^2 is much smaller than sigma_h^2. In that case my maximum likelihood estimate is going to be much closer to the visual estimate. Does that make sense? You believe one sensor much more. And notice that the combined Gaussian is narrower than either of the two individual estimates, and its peak is correspondingly higher, because the integral of a Gaussian is always one. All right, so how do they measure the noise in each sensor? Let me introduce the concept of a psychometric function. The way they do it is to give you a standard-size object, say 5 centimeters; say they want to figure out the noise in your haptic measurement. You feel the 5-centimeter object, then they give you a second object, a bit bigger, and they ask: is this bigger than the first one or not? You tell them something. Then they give you one a little smaller: is this bigger than the first one? You answer yes or no, and that's the kind of process they go through. So they give you two objects, and say we're only talking about one sensor now, just the haptic sense. Your percept of the first object, Y1, is normally distributed with mean mu and variance sigma^2; that's your haptic sense. Your percept of the second object, Y2, is also normally distributed, with mean mu + delta (a slightly different size) and the same variance. When you compare the two, what you're effectively estimating is this quantity delta, via delta-hat = Y2 - Y1.
And delta-hat is a random variable with mean delta and variance 2 sigma^2, where this sigma^2 is your haptic noise. We want to estimate it. How would we do that? Well, you're going to tell us whether the second object was bigger than the first. So what's the probability that Y2 is bigger than Y1? Let's figure that out. Suppose delta is two centimeters, so the second object is two centimeters bigger. Then the distribution of delta-hat is a Gaussian centered on two centimeters, and you'll tend to judge that the second object was bigger. But what's the probability that you detect the second object as bigger? It's the integral of that probability distribution above zero: whenever delta-hat comes out positive, you say the second object is bigger, and the smaller the mass of the distribution above zero, the smaller the probability that you say so. Now compare a good sensor, meaning one with low noise, against a bad sensor with large noise. With large noise, more of the distribution spills over to the left of zero, so the integral above zero is smaller, and the probability of detecting the difference goes down.
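That integral above zero has a closed form: the mass of N(delta, 2 sigma^2) above zero is Phi(delta / (sqrt(2) sigma)), where Phi is the standard normal CDF. A small stdlib-only sketch, with illustrative values of delta and sigma:

```python
import math

def p_second_bigger(delta, sigma):
    """P(Y2 > Y1) when delta_hat ~ N(delta, 2*sigma^2): the mass of that
    Gaussian above zero, i.e. Phi(delta / (sqrt(2) * sigma))."""
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))), so the sqrt(2)s combine into 2.
    return 0.5 * (1.0 + math.erf(delta / (2.0 * sigma)))

# A 2 cm size difference, judged with a good versus a bad sensor.
print(p_second_bigger(2.0, 1.0))  # ~0.92: low noise, difference easy to detect
print(p_second_bigger(2.0, 3.0))  # ~0.68: high noise, much less certain
print(p_second_bigger(0.0, 1.0))  # 0.5: identical objects, pure guessing
```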
So now plot the probability of detecting the second object as bigger than the first, P(Y2 > Y1), as a function of delta: here's one, here's two, here's minus one, here's minus two. The second object is bigger than the first by two centimeters, by one centimeter, or it's actually smaller. That probability, the integral we just described, traces out a sigmoid. If the first and second objects are exactly the same size, then of course you say the second one is bigger with probability one-half. If the second object is two centimeters bigger, I'm pretty darn sure it's bigger. But the noise in my sensor determines the steepness of this function: if I have a lot of noise, my ability to detect the second object as bigger is not nearly as good as with small noise, and the curve is shallower. This is called a psychometric function. So you can calculate the noise in each sensor by looking at the probability of judging something bigger or smaller than a standard-size object. And again, the shallow curve is large noise and the steep curve is small noise; sorry, I'm using the same color for both of them. Does it make sense? Yeah? What's this axis? Probability. This one here? Also probability; this is the distribution I'm plotting. This number is the integral from zero out to this side. If that distribution is centered at zero, half of it is to the right and half to the left, right? As it moves to the right, a larger part is to the right, but how fast that part grows depends on whether I have small or large noise. With large noise, a whole bunch of the distribution stays to the left, so the probability doesn't grow nearly as fast as with small noise. Okay, so that's how they did it.
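One way to see the steepness story is to simulate the comparison trials directly and tabulate the psychometric curve. This is a sketch with illustrative noise levels (0.5 and 2.0), not the experiment's actual parameters:

```python
import random

def simulated_trial(delta, sigma, rng):
    # One comparison: noisy percept of the standard (size taken as 0,
    # without loss of generality) versus a comparison bigger by delta.
    y1 = rng.gauss(0.0, sigma)
    y2 = delta + rng.gauss(0.0, sigma)
    return y2 > y1

rng = random.Random(0)
deltas = [-2.0, -1.0, 0.0, 1.0, 2.0]
curves = {}
for sigma in (0.5, 2.0):  # small versus large sensor noise
    curves[sigma] = [
        sum(simulated_trial(d, sigma, rng) for _ in range(20000)) / 20000
        for d in deltas
    ]
    print(sigma, curves[sigma])
# Both curves pass through ~0.5 at delta = 0, but the low-noise curve
# rises much more steeply: that steepness is what reveals sigma.
```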
They asked people to detect the difference between a standard object and a second, comparison object, and they did it repeatedly on the same person. That gave them the probability of detecting a one-centimeter difference, the probability of detecting a two-centimeter difference, and so on. They fitted sigmoids to those probabilities, and from the sigmoids they computed the variance of the normal distribution that generated the data; the sigmoid on the left is the integral of the Gaussian on the right. By doing so, they measured the noises. Then they did the same thing for vision: detecting a small object versus a slightly bigger one, and how well people could do it. Now they had the noise in both sensors. Next they said: all right, now you can see this object and feel this object, and it's the same object; what's your belief about its size? And what they showed was that the belief was a combination of the two sensors, each weighted by the inverse of its variance, the confidence in that sensor. That was the maximum likelihood estimate. Any questions? Okay, see you Monday.
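The whole pipeline, recover each sensor's noise from yes/no judgments and then fuse the two readings, can be sketched end to end. Here the "sigmoid fit" is reduced to inverting the curve at a single delta, and all noise levels and readings are assumed values for illustration:

```python
import random
from statistics import NormalDist

rng = random.Random(1)
std_normal = NormalDist()

def estimate_sigma(true_sigma, delta=1.0, n=50000):
    """Recover a sensor's noise from binary judgments alone, as the sigmoid
    fit does: p = Phi(delta / (sqrt(2)*sigma)) inverts to
    sigma = delta / (sqrt(2) * Phi^{-1}(p))."""
    hits = sum(rng.gauss(delta, true_sigma) > rng.gauss(0.0, true_sigma)
               for _ in range(n))
    p = hits / n
    return delta / (2 ** 0.5 * std_normal.inv_cdf(p))

# Assumed true noise levels: haptics noisier than vision (illustrative).
sigma_h = estimate_sigma(1.0)
sigma_v = estimate_sigma(0.5)

# Fuse two readings of the same (5 cm) object by inverse-variance weighting.
y_h, y_v = 5.2, 4.95
w_h, w_v = 1 / sigma_h**2, 1 / sigma_v**2
x_hat = (w_h * y_h + w_v * y_v) / (w_h + w_v)
print(sigma_h, sigma_v, x_hat)  # x_hat sits close to the visual reading
```

This mirrors the experiment's logic: the psychometric measurements only ever see "bigger/smaller" answers, yet they recover each sensor's variance, and those variances predict how the two cues are weighted in the combined estimate.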