OK, guys, let's get started. A couple of things before we get going. You have a midterm exam two weeks from today, and there were a couple of questions about it. It's mostly short answer, usually with a few multiple-choice questions. They're usually similar to the homework, so if you've been doing the homework, you'll probably be just fine. They're usually not as involved as the homework. That said, if people are interested in doing a review session, I'm happy to do one. I would just need critical mass before we plan it, because I have to prepare all the questions. If only one person shows up, me spending four hours writing learning-theory questions is not a productive use of my time. But if five or so people want a review session, I'm more than happy to offer one. So let me know, and we can do it, maybe the Monday after class, right before the exam, if that makes sense and people want to do that. Or you can organize another time; I'm happy to do it whenever you want. I have graded your homeworks from last week; I'll have them up here after class. Due to the ambiguity in the homework, I graded them quite leniently. Basically, if you had the four or five equations in the proper order, even if your implementation was totally wrong, you got most of the points. So even if your graph looked ridiculous, if you had the proper equations in the proper order, I gave you the vast majority of the credit, an eight or a ten out of ten. No one got less than an eight. That's pretty much it. Does anyone have any questions about anything? Yeah.
No, it's all written. Yeah? Are there practice problems in the textbook that we can use? No. So for the next two weeks or so you don't have homework, so that you can study. But as far as practice problems? Not as far as I'm aware. I've read the book, so there aren't hidden ones in there, and I don't think there are any posted online. I can certainly come up with questions for you, as I said, for the study session. The exam questions are mostly similar to the homework questions and the derivations we've done in class. There's almost always something about the Kalman filter, almost always something about Bayesian integration, almost always a derivation of LMS, whether it's weighted or not, that sort of thing. Those are the main topics. He's probably not going to have you do matrix multiplication that's more than a two-by-two, and never inverses or singular value decompositions or anything like that. So you can narrow down what to study pretty substantially, if that makes sense. Any other questions? Excellent. So this is actually a somewhat fun lecture today, unlike the previous one that I had to give. This one is about causal inference, and there's actually not a whole lot of math in it, which is nice. The only new math that we're going to introduce is the circular normal distribution, which is a normal distribution defined on the interval from zero to two pi. That's really the only new concept we're going to introduce today. Beyond that, it's going to be examples from the literature of how real humans actually go about integrating multiple sensors, whether we do it in a Bayesian sense, et cetera.
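Since the circular normal is the one new piece of math today, here is a minimal numerical sketch of it. The von Mises form used below is an assumption (the lecture only says "a normal distribution defined from zero to two pi"), and the Bessel-function normalizer is computed by its power series:

```python
import math

def circular_normal_pdf(theta, mu, kappa):
    """Density of a circular normal (von Mises) distribution on [0, 2*pi).

    kappa acts like an inverse variance: large kappa concentrates the
    density tightly around the mean direction mu.
    """
    # Modified Bessel function I0(kappa) by its power series (the normalizer).
    i0 = sum((kappa ** 2 / 4.0) ** k / math.factorial(k) ** 2 for k in range(30))
    return math.exp(kappa * math.cos(theta - mu)) / (2 * math.pi * i0)

# The density integrates to ~1 over one full circle.
total = sum(circular_normal_pdf(2 * math.pi * i / 1000, mu=1.0, kappa=2.0)
            * (2 * math.pi / 1000) for i in range(1000))
```

Unlike an ordinary normal, this density wraps around: the points 0 and 2 pi are the same direction, which is what makes it the right tool for angular quantities like speaker locations on a half circle.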
So Reza has mentioned several times the two-GPS problem: you've got these two GPSs, and they each have noise, which presumably you've measured, and you're going to use a Kalman filter to estimate your hidden state X, which is the location where you are. And if one GPS says that you're on one bank of a river and the other GPS says you're on the other bank, and your Kalman filter estimate, your maximum likelihood estimate, says that you're in the river, in the middle of the two sensors, you're probably not going to believe that, because you know that you're not in the middle of the river. So the question that we're going to tackle first today is something called causal inference. Basically, how do you determine whether or not you should combine the information from two sensors? How do you know that the information that you gain from those two sensors comes from a single source? We'll start off with a very interesting experiment, published in Experimental Brain Research back in 2004. This is Wallace et al. 2004. They had subjects sit, so the subject is where the X is, facing a half circle. On the half circle they have little speakers at regular intervals, and above each speaker is a little LED. The experiment is actually quite simple. They had the subject fixate straight ahead at the LED that was straight ahead. Then they would have a sound come out of one of the speakers, and 200 milliseconds later they would turn on one of the LEDs, either one of the peripheral LEDs or the LED in the center. And they asked subjects to do two things.
One, they wanted subjects to say whether or not they thought that the LED and the sound came from the same place. And two, if they did come from the same place, where was that place? They could choose any one of these speaker-LED pairs. So you can imagine that there are two different hypotheses the subject might have in this particular setup. In one case, you have some hidden state X, which is what's generating both the sound and the light, and from that you observe Y sound and Y vision. So you make two observations, and from those you say that they came from the same source X: one X generated both the sound and the light. So if you had a lightning strike outside your house, and the sound and the light seemed to come from the same direction and occurred within a reasonable period of time, you would say that lightning caused the thunder. Whereas if the thunder shows up quite a bit later, or in a different direction, you would say, well, it probably wasn't that lightning strike; it was a lightning strike behind me. Alternatively, your alternative hypothesis is that you have some X1 that you can't see that generates the sound, and a different X2 that you can't see that generates the light. Does that make sense? So you might have two alternative generative models. And the questions are: when do you say that it is this generative model, and given that I think it's the same source, how do I combine those two pieces of information together? This is the problem of causal inference. So we can write this out a little bit more formally. Let's imagine that we have some binary random variable Z.
When Z equals 1, we're going to say that it's a common source, and when Z equals 0, that they are separate sources. So this is a binary random variable, and the probability that Z equals 1 is the probability of a common source. When Z equals 1, we're going to assume this model right here. We can formalize that by saying our vector of observations Y, which is YS and YV, equals some matrix, which I'll call C1 to delineate the two different cases, times XS and XV, plus some noise. C1 in this case is going to be 1, 0 in the first row and 1, 0 in the second, meaning that the source is the same: YS equals XS plus epsilon, and YV equals XS plus epsilon. Does that make sense? So that's our first model over here; this is a common source. We can write a similar model for the second case, in which they are distinct sources: Y, again YS and YV, equals some matrix that I'll call C2, times XS and XV, plus some noise; call it epsilon Y. Yeah, should that be the identity? C2? Oh, yeah, you're right, thank you. C2 is the identity, 1, 0 and 0, 1: YS is XS plus epsilon, and YV is XV plus epsilon. Two different sources, absolutely correct. So I have two different models: Y equals C1 X plus epsilon, and Y equals C2 X plus epsilon. Does that make sense? So let's assume that we have some prior belief about the location of the two stimuli, and we'll assume that it's normally distributed with some mean mu, which is a vector, and some covariance P.
Okay, so we have some prior belief about where the LEDs and the sounds come from. And from Bayes' rule we can write the posterior: the probability of Y given a common source, times the prior probability of a common source, divided by the marginal probability, which in this case is the probability of Y given Z equals 0 times the probability of Z equals 0, plus the probability of Y given Z equals 1 times the probability of Z equals 1. What this is equal to is the probability that there is a common source Z given some observation Y: our posterior probability in this particular case, assuming that we know everything and have some prior about where the distribution of sources comes from. Bayes' rule tells us the probability of a common source given some observation Y. So this equation is really telling us how we should decide whether to combine the two sources. Now, the probability of Z equals 1 is just 1 minus the probability of Z equals 0, because they are either a common source or they are not. So those terms we know. The thing that we don't really know is right here: we don't know the likelihood of seeing a Y given that they are a common source. If we knew that, we could easily compute the posterior probability. So what is that in this particular case? Well, the probability of observing some output Y given a common source is a normal distribution whose mean is given by C1 mu and whose covariance is given by C1 P C1 transpose plus R, where R is how the noise is distributed: epsilon Y is normal with mean zero and covariance R, just as it usually is.
So all this is telling us is that the likelihood of an observation, given that it's a common source, is a Gaussian centered at C1 times our prior mean, with the prior covariance passed through C1 plus the sensory variance. Similarly, the likelihood of Y given that Z equals 0 is a normal distribution with mean C2 mu and covariance C2 P C2 transpose plus R. Does that make sense? We're just running both models through the same equation. Okay. So what does the posterior look like? Just for viewing purposes, let's assume that the prior probability of a common source is 0.5: it's equally likely that you'll see a common source as a separate source. Let's assume that for the time being so that it's easier to graph. In that case, we can plot the posterior, and I apologize for my lack of three-dimensional drawing skills. On the vertical axis here, I'm going to plot the probability that it came from a common source given the observations. On one axis I'm going to plot YV, the observed visual location of the LED, and on the other axis I'm going to plot the observed location of the sound, YS. We'll say that each of these goes from minus 5 to 5 degrees. And if we plot the posterior probability, what you're going to see is that when the sources are at the same location, basically when YS and YV are the same, say both at minus 5, the posterior probability of them being a common source is quite high. That makes sense: if you see the light and hear the sound at the same place, the probability of it being the same thing is quite high. However, if you see the light at minus 5 and you hear the sound at plus 5, you're probably going to say they're two separate sources, so the probability falls off.
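The posterior surface just described can be computed directly. A minimal sketch, where the prior mean, prior covariance, and noise covariance are illustrative assumptions rather than numbers from the lecture:

```python
import numpy as np

mu = np.zeros(2)                          # prior mean over (x_s, x_v)
P = np.eye(2) * 4.0                       # prior covariance (illustrative)
R = np.eye(2) * 1.0                       # sensory noise covariance (illustrative)
C1 = np.array([[1.0, 0.0], [1.0, 0.0]])  # common source: both obs read x_s
C2 = np.eye(2)                            # separate sources: identity
p_z1 = 0.5                                # prior P(Z = 1), as on the board

def gauss(y, m, S):
    """Multivariate normal density N(y; m, S)."""
    d = y - m
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(
        (2 * np.pi) ** len(y) * np.linalg.det(S))

def posterior_common(y):
    """P(Z = 1 | y) via Bayes' rule with the two Gaussian likelihoods."""
    like1 = gauss(y, C1 @ mu, C1 @ P @ C1.T + R)  # common-source likelihood
    like0 = gauss(y, C2 @ mu, C2 @ P @ C2.T + R)  # separate-source likelihood
    return like1 * p_z1 / (like1 * p_z1 + like0 * (1 - p_z1))
```

Evaluating `posterior_common` on a grid of (YS, YV) pairs reproduces the ridge along the diagonal: coincident observations like (−5, −5) give a high posterior, while widely separated ones like (−5, 5) drive it toward zero.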
So, we can draw it: it's a ridge with its peak of 1 right where YS and YV are equal, falling off like a Gaussian on either side of the diagonal. Does that make sense? So let's take a slice through it. If we hold YS constant at zero, so that I can actually draw it instead of doodling on the board, if we assume that YS is zero, we always hear the sound directly ahead of us, then I can plot the posterior probability of a common source given the observation versus the disparity, which is going to be YS minus YV, how different they are. So when they're not different at all, when the disparity is zero, the probability is 1, and then it falls off as you get further away. Does that make sense? Okay, fairly straightforward. Now, out at these tails, when they're minus 5 and 5, we basically believe they're two different things; they're not the same. So the question is: how much does the fact that we see the light at, say, 3, and hear the sound at zero, influence our perception of where we actually heard the sound? Remember, the subjects were asked not only to say whether it was a common source, but, if it was a common source, to say where it actually appeared. And these things are actually quite close together; it's not like one is halfway across the room and the other is over there. So where did they hear the sound? We can formalize this using the Kalman filter. If the posterior probability given some observation Y is equal to 1, we use the top model, the single-source model, right?
And our single-source model was written up here, which basically means that YS and YV are both equal to XS plus some noise epsilon. So our best estimate is going to be the Kalman filter estimate: the expected value of the hidden state X, given our observations and the fact that we share a common source, is equal to mu, our prior estimate, plus the Kalman gain times the innovation. The Kalman gain in this case is P times C1 transpose, times the inverse of C1 P C1 transpose plus R, and the innovation is Y minus our predicted observation, which is C1 times mu in this case. Does that make sense? This is just the Kalman filter. We're using the Kalman filter to say: this is my posterior estimate of X, given that the sources are shared. On the other hand, if the probability of Z equals 1 given Y is equal to zero, now we have two separate sources, and we do the exact same thing: the expected value of X given Y and Z equals 0 is mu plus P C2 transpose, times the inverse of C2 P C2 transpose plus R, times Y minus C2 mu. So if it's a common source, we use the first equation; if it's separate sources, we use the second. Right? And in reality the posterior is not going to be exactly one or zero; it's going to be a mixture of the two. We believe that they're a common source to some degree; only when they overlap exactly do we believe it's truly a single common source. So if we let the posterior probability of a common source given Y be equal to some value a, it's now going to be a weighted update equation. All we have to do is say that our actual expected value of X given Y is a, the probability of it being a single source, which I just erased here,
a times the expected value of X given Y and Z equals 1, plus one minus a, because they're either a single source or two sources, times the expected value of X given Y and Z equals 0. Okay, so it's just a weight that we've assigned to these two different state update equations. So what does that look like? What I'm going to plot is X hat S, your estimate of the location of the sound. Remember that for these examples I've fixed the actual value of YS at zero, and all I'm doing is varying where the LED turns on. So what I'm plotting here on the horizontal axis is YS minus YV, the disparity if you want to call it that; here's one and minus one on the vertical axis, and five and minus five on the horizontal. And what you would actually see if you plotted this equation, assuming the same priors we had before, is that when YV is presented at zero and YS is presented at zero, which in this plot is right at zero, zero, you believe that the source is in fact at zero, and that it's a common source. However, as the light moves toward the periphery, your belief about where the sound actually is becomes biased by the fact that you saw a light. Again, this is X hat of S, your belief about where the sound came from. Because you don't fully believe that they're separate sources, you don't fully discount the light either. The actual source of the sound is always at zero, but your belief about the source of the sound varies. So you can imagine that you're in this room, and I'm staring straight ahead and the sound is coming from straight ahead, right.
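The weighted update above can be sketched end to end: compute the posterior weight a, run the Kalman correction under each model, and mix. The prior and noise values below are illustrative assumptions, not the lecture's numbers:

```python
import numpy as np

mu = np.zeros(2)                          # prior mean over (x_s, x_v)
P = np.eye(2) * 4.0                       # prior covariance (illustrative)
R = np.eye(2) * 1.0                       # sensory noise covariance (illustrative)
C1 = np.array([[1.0, 0.0], [1.0, 0.0]])  # common source: both obs read x_s
C2 = np.eye(2)                            # separate sources

def gauss(y, m, S):
    """Multivariate normal density N(y; m, S)."""
    d = y - m
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / np.sqrt(
        (2 * np.pi) ** len(y) * np.linalg.det(S))

def kalman_update(y, C):
    """One Kalman correction: mu + P C^T (C P C^T + R)^-1 (y - C mu)."""
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)
    return mu + K @ (y - C @ mu)

def causal_inference_estimate(y, p_common=0.5):
    """Model-averaged estimate: a * common-source branch + (1 - a) * separate."""
    l1 = gauss(y, C1 @ mu, C1 @ P @ C1.T + R)
    l0 = gauss(y, C2 @ mu, C2 @ P @ C2.T + R)
    a = l1 * p_common / (l1 * p_common + l0 * (1 - p_common))
    return a * kalman_update(y, C1) + (1 - a) * kalman_update(y, C2)
```

With the sound observed at zero and the light a few degrees away, the sound estimate (the first component) is pulled toward the light; at a very large disparity the weight a collapses and the bias disappears, which is exactly the curve being drawn on the board.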
I hear the sound coming from straight ahead, but the fact that I saw a light five degrees away causes me to believe that the sound didn't actually originate from right here; it causes me to believe that the sound originated a little closer to the light than it actually did. No, no, they're not hard bounds. In fact, the curve is influenced quite a bit by the value of your prior: if your prior changes, those bounds also change, as does the width of the curve. It's going to be the same sort of shape, not quite sigmoidal, something like the difference of two shifted Gaussians. In any case, when you get out here, what's happening is that the probability of a common source goes down, and so now you believe that they are in fact two distinct sources: the light originated separately from the sound. And the fact that the light originated separately from the sound means you're not biased anymore as you get out to the periphery. A light 15 degrees away from the sound says nothing about the location of the sound; you believe that the sound came from exactly where you heard it. Does that make sense? Okay. So Wallace et al. 2004 show basically exactly this. They show that people's perception of the sound is biased, and biased in a Bayesian fashion. I haven't read the paper in a long time, but I believe they always presented the sound first, and then the light came 200, 400, or 600 milliseconds later. So they also varied the delay, and the delay also affects how much you believe that it's a common source: the probability of a common source goes down as the delay increases. I do not believe they did the reverse order. No. Correct.
They were looking at temporal versus spatial disparity, and so they always did it in the same order. Okay. Does that make sense? The math isn't too bad there. Okay. So given that this is a more fun lecture today, I get to have audience participation, and someone gets to participate in my thought experiment. Yes, I see what you're trying to do. More or less, yeah. What does it mean if it's one? So these are somewhat arbitrary axes that I'm defining based on the prior probabilities that I've chosen. But if you want to put it in the Wallace et al. context, you can make both axes degrees. So if the light is presented five degrees away from where the sound was presented, my belief about where the sound was presented is biased by one degree: I believe that it is in fact one degree off of center instead of at center. Does that make sense? You can put whatever units you want here, but in this particular case you can think of it as degrees. The fact that I see a light at five degrees biases my belief about where the sound came from by one degree. And if the light showed up at minus five degrees, I'm biased in that same direction, also by a degree, so minus one degree. This is across-subject data: they did ten or twenty people, and this is what they get. It's probably not exactly one degree; these axes are defined by the prior probabilities and the likelihoods that we've assumed. So this is the mathematical equivalent of their experiment, which also shows a bias; I think theirs is a fraction of a degree, but people are indeed biased. Does that answer your question? No, but we'll see that in just a second; we're getting there. In this experiment, no, or at least I haven't shown it to you. It is in fact your belief. Yes, exactly. And that's based on several assumptions that we've made.
We made an assumption about the prior probability, an assumption about the shape of the likelihood, and an assumption about the distribution of my priors: I have some prior mean mu and covariance P. So given any arbitrary data, I can choose mu and P so that it fits; this is just a mathematical construct. Okay. So, volunteer. Can I choose someone? Excellent. Okay. So I have three cylinders; you'll have to imagine that they're all the same shininess. If you come over here, I'm going to mix them up. Now close your eyes. You get to feel them; you can't see them. Okay, which one weighs the most? This one. Okay, now open your eyes, do the exact same thing, and tell me which one weighs the most. Yeah. Okay. Yeah, that's true; this was a poor demonstration. I wanted you to do it with your eyes closed first, to see whether you gave a different estimate. So, he claimed with his eyes open that this one weighs the most. They in fact all weigh exactly the same; they're all exactly the same weight, so it was sort of a trick question. But we're talking about priors and the integration of priors here, and the fact that this cylinder is bigger seems to influence people. You gave me the wrong answer, but it's the same answer that everyone always gives, which is that the big one weighs more, when in fact they all weigh exactly the same. So the fact that we see something bigger seems to influence our belief about how much it weighs. And just to give you an example of that, an interesting experiment was done. I didn't give you the option of saying they weigh equally? That's true.
So if I didn't give you that option and you were totally unbiased, you would anticipate that, if I brought you all up here, the likelihood of choosing the large one would be one third: a third of you would say the large one weighs more, a third the small one, and a third the medium one. If everyone were a truly unbiased observer, and the cylinders all weigh exactly the same, and I present the question in exactly the same way, we would expect probabilities of a third across the board. That never happens. Everyone always says the bigger one weighs more. You can feel them after class. So this is from a fairly old study, but a good one, and it removes the bias due to the way the question is asked. This is Gordon et al. 1991. They had people come into the lab and do essentially the experiment we just did: they gave people three different sized boxes, a small box, a medium box, and a large box, and asked them to pick up the box. Now, when you pick up a box, you apply at least two different types of forces: a pinching force, to make sure the box does not fall out of your hand, and a force upwards, to lift the box up. So I apply a force here and a force here. They measured both of those. The grip force is the pinching one, and the load force is the one that goes up. I'll plot the load force for you; the grip force looks exactly the same. So this is time on this axis, and load force, in newtons, on the other; one second is about here. When people pick up the large box they end up doing something that looks like this; the small box, something like this; and the medium box, something like this. So this is large, medium, and small.
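These size-dependent force traces can be sketched as a scalar Bayesian update: the first programmed force reads out a size-driven prior, and sensory feedback then pulls the belief toward the true weight across lifts, in the spirit of Gordon et al. All numbers here, including the prior means and variances, are illustrative assumptions:

```python
def load_force_trials(prior_mean, true_weight, prior_var=4.0, sense_var=1.0, n=5):
    """Scalar Bayesian update of the believed weight across repeated lifts.

    The force applied on each lift is the current belief; after lifting,
    feedback updates the belief with a Kalman-style gain.
    """
    mean, var, forces = prior_mean, prior_var, []
    for _ in range(n):
        forces.append(mean)               # force programmed before feedback
        k = var / (var + sense_var)       # Kalman gain for a scalar state
        mean = mean + k * (true_weight - mean)
        var = (1 - k) * var               # uncertainty shrinks with feedback
    return forces

big = load_force_trials(prior_mean=8.0, true_weight=5.0)    # large box prior
small = load_force_trials(prior_mean=3.0, true_weight=5.0)  # small box prior
# first lifts differ (prior readout); later lifts converge on the true weight
```

The first elements of `big` and `small` differ because they are pure prior readouts; by the last lift both have converged near the true weight, which is the convergence the experiment shows once proprioceptive feedback arrives.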
So before you even get proprioceptive feedback about what you're doing, you're applying forces, and those forces are a readout of your belief about how heavy the box is. The fact that you pull up more on a large box indicates that you believe the large box is going to be heavier than the small box, even though they're exactly the same. And what's interesting is that after you get feedback from your sensors, you converge to the same value: exactly the same force is necessary to pick up the small box, the medium box, and the large box. So you end up correcting the error, but in the beginning we're reading out your priors. You have a prior belief that because this box is this size, it's going to weigh this much. Just like when you reach into your fridge for a pitcher of margaritas or whatever; it's opaque, and you have some belief that the pitcher is full, so you pick it up and spill it all over yourself, either because you're inebriated or because you had a prior belief about it being a lot heavier than it is. Don't we all? It happens to me all the time: I totally underestimate how heavy something is. Right, so you have a prior belief about what's going on there. The example that Reza likes to use is that soft drinks sold here, except now in New York, I guess, are a heck of a lot larger than what's sold in Europe; a regular large here would feed an entire family in Europe. So we have some belief about the size of a cup. Now imagine that you have some belief about the density of a liquid and how much a volume of that liquid weighs; call that the prior probability. Some liquids weigh more than others: a vat of iron weighs a fair bit more than the same volume of coffee.
So you have some belief about how much this thing is going to weigh. What we want is the posterior probability of how much something weighs given its size. By Bayes' rule, that's the likelihood of the size given its weight, times the prior distribution of weights, divided by the marginal probability of that size. Now, because in the United States we have these massive Big Gulps and things like that, the prior is going to be slightly biased, and so when we present an American and someone from Europe or Ethiopia the same size soft drink, the American's posterior is going to be biased: they believe that things are heavier than they actually are, just like when you yank on a door that's actually hollow instead of made of oak. So to make this a little more concrete, instead of just talking about hitting oneself with doors and Coke cans, let's talk about an experiment performed by McIntyre et al. This was an experiment performed up in space. Imagine we live in a 1 g environment, in which the acceleration due to gravity is 9.8 meters per second squared. We can formalize our belief about the position and velocity of an object that's dropped toward you as a system of equations. We observe X at time t plus delta, and X dot, the velocity, at time t plus delta, and that's equal to some matrix, which I'll call A, times the previous position and velocity, plus a vector containing zero and the acceleration due to gravity times the time step, plus epsilon X, my noise. And what goes in the matrix A? The new position is the old position plus the velocity times the time step, and the
velocity is exactly what the velocity was, plus the acceleration due to gravity times the time step. Okay, so this is just X of t plus delta equals A times X of t, plus a vector B, plus epsilon X, where epsilon X is normally distributed with mean zero and covariance Q in this case. And we observe some Y, which equals the X at time t plus some sensory noise epsilon Y, normally distributed with mean zero and covariance R, just as before. And the Kalman gain, if you have some prior estimate of X, is actually quite easy in this particular case, because there's no C here, it's the identity: the gain is just equal to P, your uncertainty at time t, times the inverse of that uncertainty plus R. So if you made an observation and you wanted to combine it with your prior belief, you would just use the Kalman filter. The posterior uncertainty, P of t given t, is, as it was before, I minus K times P of t given t minus one, and the prior uncertainty at the next time step, P of t plus delta given t, is just A P A transpose plus Q, same as it was before. So what we do is: we get an observation Y, we apply the Kalman gain to incorporate the observation into our previous belief about X at time step t, and then we update our posterior uncertainty and apply the update rule to get our prior uncertainty at the next time step. This is exactly the Kalman filter, except I didn't do any of the derivation; it assumes this generative model, and from that we derive the Kalman gain and the update equations. So if we put astronauts in orbit, where it's now a zero g environment, the gravity term goes to zero; there's no acceleration due to gravity, or it's quite small. And so when you drop a ball or throw a ball downwards,
So the question that McIntyre and colleagues wanted to ask in this experiment is: do subjects have some prior belief about when the ball is going to reach their hand? If I throw a ball at you in space, do you have some prior belief about when it's going to get there? They attached EMG electrodes at the wrist and compared the muscle activity to the moment the ball actually struck the hand. What they found is that astronauts open their hand much sooner than would be expected given constant velocity. What does that mean, exactly? It means they have some prior belief about how the ball would move in a one-G environment. Given the current observations and their system of equations, which has a g and a Δ in it, they believe the ball should be accelerating, and therefore that it should arrive at the hand sooner than it actually does. Does that make sense? And this phenomenon persisted: they ran the experiment every day for 15 days, and the astronauts kept opening their hand before the ball got there. So apparently this is not an error the system feels it needs to correct; you clearly have some generative model, some prior belief about how the ball is going to move, and your actions follow it. Similarly, in baseball you have curveballs and fastballs. We have prior beliefs about how balls fly through space; you've all done the physics problems about how far a ball flies in a one-G environment. But if you put topspin or backspin on the ball, it actually falls faster or slower than that, and the pitcher is relying on the batter's internal model, their generative model of how fast the ball should be dropping, so that the batter misses. For the astronauts, this was apparently an error
that they didn't seem to care about: their hand was open before the ball got there, and they never said, I need to correct that. So the prior persisted; it was pretty strong. OK. So another question: does the integration of prior beliefs work in a Bayesian sense? Basically, do people do Bayesian integration? We've now seen that priors do in fact influence you: you open your hand too early, you're biased by the size of soft drinks, you hit yourself with doors. So priors are clearly important. The question is whether we combine the prior and the likelihood in a Bayesian way; do we use Bayes' rule? This is a study by, I wrote down Slijper, I think it's Slijper et al., in which they installed spyware on people's computers, with their consent, and measured their mouse movements, specifically the directions of the movements. They were interested in two metrics. One is the length: it turns out people make a lot of very small movements; about three millimeters is the mean mouse movement. But the two angles are what matter here. I'm creating arbitrary axes at 0 degrees and 90 degrees. Subjects' movements are often straight, but sometimes they make curved movements, in which the initial angle of the movement, which I'll define as θi, differs from the angle to the actual endpoint, θe. So from the initial position to the end position, the path is a lot of the time straight, but sometimes curved. Now look at the distribution of endpoints: I'm plotting the probability of θe, where this is again 0 degrees and this is 180 degrees.
It looks sort of like a very weird star; I apologize for my lack of drawing skills. People are far more likely to move along one of the cardinal axes, 0, 90, 180, or 270 degrees, than at some angle in between. And if you look at the distribution of initial angles, it's not exactly the same as the distribution of endpoint angles. Why? Because people don't make straight movements all the time. So for the distribution of initial angles, the probability of θi, instead of drawing it in a circular fashion I'll plot it along an axis from 0 to 360 degrees. What it looks like is: again a fairly high probability of initial movements at 0, 90, 180, and 270, slightly smaller probabilities at angles in between, but then a roughly constant probability at all other angles. That's not what we saw for the endpoints, where the probability dips and comes back up. So people seem to be attracted to making straight movements in some cases but not others, if that makes sense; in some cases they actually make curved movements where θi ≠ θe. So let's imagine you have some desired endpoint θe, and given that desired endpoint, some probability of making an initial movement θi. That's

p(θi | θe) = p(θe | θi) p(θi) / p(θe),

just Bayes' rule again: the probability of making some initial movement angle given some endpoint movement angle. Now for the only new math we're describing here. The circular normal distribution, a normal-like distribution defined over 0 to 2π, is

p(θ) = 1 / (2π I₀(κ)) · exp(κ cos(θ − μ)),

where μ is the mean, so it has a peak at θ = μ, and I₀ (which I don't know how to distinguish from a k on the board) is the zeroth-order modified Bessel function, which basically just makes the probability integrate to 1. κ here is a shape parameter, analogous to an inverse of the variance: as κ goes up, the half-width gets smaller. OK, so let's approximate the probability of θi as a sum of these circular normal distributions with peaks at each of the cardinal locations:

p(θi) = (1/M) [ b + Σ_{j=1..4} a_j / (2π I₀(κ)) · exp(κ cos(θi − μ_j)) ],

where M is just a normalization constant, b is an offset, and the four peaks μ_j sit at 0, 90, 180, and 270 degrees, with different heights a_j. Sorry, this got a little messy: I'm just summing four circular Gaussians. They look the same as the endpoint distribution in my drawing only because of my poor drawing skills; what it should show is that the initial-angle peaks are very narrow, while the endpoint peaks are quite wide. So they are different, and the question is why. The endpoint distribution sort of makes sense: you have to go someplace, so at some point in time you're going to move in that direction, and that's why p(θe) is pretty smooth around the circle, although it does peak in the cardinal directions. But the initial angle you choose is very peaked: for some reason you choose initial angles close to the cardinal directions, and the question is why. So if we now assume that p(θe | θi) is approximately Gaussian with its mean at θe = θi, the endpoint tends to move in the same direction as it started.
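That posterior p(θi | θe) can be sketched numerically from the pieces just defined: the cardinal-peaked mixture prior and a Gaussian likelihood centered on θi. All parameter values below (κ, the offset, the likelihood width σ) are made-up illustrative choices, not values fitted in the study, but they produce a pull in the spirit of the 6-to-3-degree shift described next.

```python
import numpy as np

# Posterior over the initial movement angle:
#   p(theta_i | theta_e) is proportional to p(theta_e | theta_i) p(theta_i),
# with a prior that is a flat offset plus four circular-normal (von Mises)
# bumps at the cardinal directions, and a Gaussian likelihood in the
# angular difference. kappa, the offset, and sigma are illustrative.

deg = np.arange(0.0, 360.0, 0.1)          # grid of angles in degrees
rad = np.deg2rad(deg)

def circular_normal(theta, mu, kappa):
    """p(theta) = exp(kappa*cos(theta - mu)) / (2*pi*I0(kappa)); np.i0 is
    the zeroth-order modified Bessel function that normalizes the density."""
    return np.exp(kappa * np.cos(theta - mu)) / (2 * np.pi * np.i0(kappa))

# unnormalized prior over initial angles, peaked at 0, 90, 180, 270 degrees
prior = 0.02 + sum(circular_normal(rad, mu, 200.0)
                   for mu in np.deg2rad([0.0, 90.0, 180.0, 270.0]))

def initial_angle_mode(theta_e_deg, sigma_deg=3.0):
    """Most probable initial angle (degrees) given a desired endpoint angle."""
    # wrapped angular distance, so 359 and 1 degree are 2 degrees apart
    d = np.abs(deg - theta_e_deg)
    d = np.minimum(d, 360.0 - d)
    likelihood = np.exp(-0.5 * (d / sigma_deg) ** 2)
    return deg[np.argmax(likelihood * prior)]
```

With these numbers, an intended endpoint of 6 degrees gives a most probable initial angle pulled part of the way toward 0, while an intended endpoint of 22.5 degrees, midway between the cardinal bumps' reach, barely moves.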
Here's what you would see if we plot the results, for instance over here on this board. Plot the distribution p(θi | θe = 6 degrees): I plan to move to 6 degrees, so what is the probability of each initial angle? Here's the axis, and here's 6 degrees. If everything were unbiased, if we weren't doing Bayes' rule and always moved in completely straight lines, then it should be a Gaussian about 6 with some noise. But what you actually see is a Gaussian whose mean is shifted closer to the cardinal axis, to about 3 degrees. The fact that you tend to move in the cardinal directions pulls the initial value of θi closer to one of the cardinal directions, so you make a curved movement, initially starting at 3 and ending up at 6 degrees. If that makes sense; you look confused. As you get further away from a cardinal direction, the pull gets less and less. At 45 degrees there is a peak, but it isn't very large, so you get pulled a little bit toward 45; and if you choose an angle right in the center, at 22.5 degrees, you get pulled very little. So you get pulled strongly when you're close to, but not exactly at, one of the cardinal directions, and you make relatively straight movements when you're in between. OK. The last thing that we're going to talk about, and we'll get out a little early, is the influence of priors on basically mass guessing, cognitive guessing as it's called. This is a study by Griffiths and Tenenbaum, 2006. What they did in this study was present a large group of undergraduates with a questionnaire, because undergraduates really are the best in terms of
making definitive judgments about things. What they asked was: if a person is this many years old right now, how long do they have to live? Basically, when are they going to die? Morbid, right. Let t be the age someone is now, and let x be the age they're going to live to, their life span. What the undergraduates said is that if a person is 18 years old right now, they can expect to live until they're about 75. Similarly, if you're 39 years old, you can also expect to live until you're about 75. If you're 61, you can live just a little bit longer, not much: until you're 76. If you're 83, you're probably going to live until you're 90. And if you're 96 right now, you've got a whole whopping four years before they put you in the ground: you're going to live to be about 100. So what do these numbers say? Well, if you're relatively young, you're going to live until you're about 75. Oddly enough, that's about the mean life span for a male in the United States, about 75 years. If you're older than 75, you don't have very much longer to live: if you're 83, you've only got seven more years, and if you're even older than that, you have even less time; you're right on death's doorstep. So the 96-year-old has less time left than the 83-year-old. The question is, what are people doing in this case? Are they incorporating some prior knowledge, or are they just guessing randomly? If they were guessing randomly, the means would be all over the place, or all the same, so they're clearly not guessing randomly. Are they incorporating some previous knowledge, some prior information, into their guessing? Let's assume that the frequency of life spans is approximately Gaussian. Remember, x is our life span, so let's say the probability of living to a particular age x is approximately

p(x) = 1 / (√(2π) σ) · exp(−(x − μ)² / (2σ²)),

where μ is 75 years, the mean life span of a male, let's say, and the standard deviation σ is about 15 years. Now, age can't really be approximated like this: Gaussians are defined over minus infinity to infinity, and clearly you can't be less than 0 years old; similarly, you can't be infinitely old, we die eventually. Additionally, there's a fairly large amount of infant mortality, so there's actually extra mass at younger ages. But all those exclusions aside, let's approximate someone's life span, the probability of x, as a Gaussian with mean 75. Next, let's describe the conditional probability of someone being t years old given their life span x: p(t | x). A random person comes in off the street; you know they're going to live 80 years, and without seeing them, what's the probability that they're any particular age? In this case,

p(t | x) = 1/x  if t ≤ x,  and 0 otherwise.

What does that mean? If someone's going to live until they're 80, then without seeing them, the probability of them being any particular age is 1/80: uniform over all the ages they could possibly be. And you have to say 0 otherwise: you already know they're going to live to be 80, so there's zero probability that they're 81. So what does that look like? Here's my p(x), the Gaussian at 75, and here's p(t | x), a flat function at height 1/x that just turns off at whatever the value of x is; for x = 80, it's 1/80 up to age 80. OK. So, to see whether people incorporate these prior probabilities using Bayes' rule: we now have our likelihood function.
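As a quick sanity check in code, the uniform likelihood just defined really is a proper distribution over current ages for a fixed life span; x = 80 is the value from the example, and the grid bounds are arbitrary choices.

```python
import numpy as np

# Likelihood p(t | x): given a total life span x, your current age t is
# uniform over [0, x] with density 1/x, and zero beyond x.
x_lifespan = 80.0
dt = 0.01
ages = np.arange(0.0, 150.0, dt)
p_t_given_x = np.where(ages <= x_lifespan, 1.0 / x_lifespan, 0.0)

# numerical integral over all ages t; should come out to 1
total = p_t_given_x.sum() * dt
```

Nothing deep here, but it makes the "flat function that turns off at x" picture literal: constant at 1/80 up to age 80, then exactly zero.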
What we need is the denominator, the marginal likelihood, which in this case is p(t), the probability that some random person is some age. p(t) is actually pretty straightforward to compute: it's just

p(t) = ∫₀^∞ p(t | x) p(x) dx,

summing over all possible x to get t, exactly like what we did in the first example. Because of the bounds I placed on the likelihood function, that's also equal to

p(t) = ∫_t^∞ p(t | x) p(x) dx.

So p(t) is basically the distribution of ages of all the people who are alive today. If I plot p(t) versus age, it's relatively flat and then comes down; this is about 100, this is about 50, and around 75 it starts to roll off. OK. So we wanted to see whether subjects incorporate their prior knowledge about how long people live. What we want is the posterior probability, the conditional distribution of life span given that you are some years old today: given that you're 15 years old, what is the distribution of your life span going to look like? Well, it's very simple; we've calculated everything we need:

p(x | t) = (1/x) · 1/(√(2π) σ) · exp(−(x − μ)² / (2σ²)) / ∫_t^∞ p(t | x) p(x) dx   for x ≥ t,

and zero otherwise: clearly there's zero probability of you having a life span less than how old you are right now; if you're alive, your life span is at least that long. (Do I need the p(x) in the denominator? Yes, I guess I do, sorry.) This denominator is really just a normalization constant to make sure the probability integrates to 1. And if we plot these posterior probabilities, they're not Gaussians, because of the constraints on the system; they're in fact truncated Gaussians. So my poor drawing skills come into effect again, but let's plot the posterior p(x | t): you're at 100 here, 50 here, and let's do 25 and 75. If you're currently 25, which is about here, it's going to look like a Gaussian with mean right around 75. If you're 30, you've got a little bit of a tail here that's truncated. If you're 50, it's going to be the same Gaussian, except now a whole lot of the distribution is truncated, so the normalization factor has changed; it looks something like this, with 30 here and 50 here. And if you imagine someone who is currently 90, it's going to be a whopping big curve that is just the tail of the Gaussian. Given that you're 90, clearly the probability of living until 75, the mean of the distribution, is zero: by definition, your life span can't be less than how old you are, so it's clearly more than 90. So if you want someone's expected life span given their current age, you're not going to take the mean of this distribution. What you really want is the point where the area under the curve on one side equals the area on the other side: the median of the distribution. So you ask: for someone who's 90, at what value m does

∫₉₀^m p(x | t = 90) dx = ∫_m^∞ p(x | t = 90) dx = 1/2?

Find the m such that these are equal; that m is the median. I'm not going to solve for it here.
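The whole computation, Gaussian prior, 1/x likelihood, truncation at the current age, posterior median, can be sketched numerically under the lecture's assumed values μ = 75 and σ = 15; the grid bounds and step are arbitrary choices.

```python
import numpy as np

# Posterior over life span x given current age t:
#   p(x | t) proportional to (1/x) * Normal(x; 75, 15)  for x >= t, 0 otherwise.
# The reported prediction is the posterior *median*, not the mean.

mu, sigma = 75.0, 15.0
dx = 0.01
x = np.arange(dx, 200.0, dx)               # grid of possible life spans

def predicted_lifespan(t):
    """Posterior median life span for someone who is t years old now."""
    prior = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    likelihood = np.where(x >= t, 1.0 / x, 0.0)  # uniform age over the life span
    post = prior * likelihood
    post /= post.sum()                     # normalizing plays the role of p(t)
    cdf = np.cumsum(post)
    return x[np.searchsorted(cdf, 0.5)]    # median: where the CDF crosses 1/2
```

These numbers land close to the undergraduates' answers quoted above: roughly the low 70s for an 18-year-old, about 90 for an 83-year-old, and about 100 for a 96-year-old, always greater than the current age.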
But you can see two things: you want the median, and the value is greater than 90. And in fact, that's what they found. If you look at Griffiths and Tenenbaum, people do look like they're incorporating prior knowledge about how long people live, and doing all this crazy truncation of Gaussian distributions, because their answers agree with the Bayesian incorporation of the prior estimates. Does that make sense? Yeah, 200? If you plug in someone who's 200, the model says you might be alive now, but you're going to be dead by tomorrow, right. Yeah, exactly, so it does break down at the extremes. But for these nice simple cases, where you're in a fairly finite range and the Gaussian tail is roughly zero for anything much larger than 100, people incorporate their prior knowledge about ages, the mean and the standard deviation of how long people live, pretty well. OK. Has anyone seen those ridiculous commercials where they have people put dots on the oldest person they've ever known? Never mind, not important. The last thing I want to talk about, in the one minute I have with you, is that while people do incorporate prior knowledge, people also sometimes don't. There was a show long ago called Let's Make a Deal; the host of the show was Monty Hall, and this is the Monty Hall problem. You can do this at parties as a party trick. You have three doors. The way it's always described is: behind two of the doors are goats, one goat behind door number one and one goat behind door number two, and a brand-new Mercedes behind door number three. You're asked to choose one of the doors. Then the host says: of the two doors that remain, I'll take one away that I'm sure does not have the car, that I'm sure has a goat behind it. And he shows you the goat. The question is: do you stick with the door that you have, or do you switch to a different door? Do you think
that the car is going to be behind a different door? So: how many people would stay with the door they first chose? No one? OK, one. How many people would switch doors? OK. How many say it doesn't make a bit of difference, 50-50 probability? OK, you guys are actually pretty smart, because the vast majority of people would say it doesn't matter: 50-50 probability, everything being equal, so it doesn't matter which door I choose second; I could choose the one I currently have or a different one, and it's going to be about the same. There was actually a giant debate for something like 30 years after someone published the probabilities of this problem. But imagine a very simple scenario in which there are a thousand doors. You choose one of them, and the host opens all of the other doors except one, all doors that have goats behind them. Do you stick with the current door, or do you switch? Does that make sense? Obviously everyone switches: you had a one-in-a-thousand chance of choosing the car the first time, whereas if you switch, you have a much higher probability; you're almost assured that the remaining door has the car behind it. And yet most people do not believe that switching is valuable in the three-door case. They stay with the same door, saying it's a 50-50 probability, I'm just going to do what I did before, when in fact there's a two-thirds probability of getting the car if you switch doors and only a one-third probability if you stay. So: we've talked a lot about priors and how we incorporate them into our posterior estimates, and in a lot of cases that's true, but it seems we also have problems doing very simple arithmetic, and sometimes we can't incorporate the priors. That's the Monty Hall problem. I apologize for keeping you late; enjoy your weekend. I have your homework here, too.
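A quick simulation makes the two-thirds versus one-third split concrete; the door labels, seed, and trial count are arbitrary choices.

```python
import random

def monty_hall_trial(switch, rng):
    """One round: car behind a random door, you pick a random door, the host
    opens a door that has a goat and isn't yours; return True if you win."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # the host opens a door that is neither your pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # move to the one remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 100_000
stay_rate = sum(monty_hall_trial(False, rng) for _ in range(n)) / n
switch_rate = sum(monty_hall_trial(True, rng) for _ in range(n)) / n
```

Over many trials, staying wins about one third of the time and switching about two thirds, exactly the asymmetry the thousand-door version of the argument makes intuitive.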