Hi, I'm Zor. Welcome to Unisor Education. We continue talking about correlation among random variables. This lecture is part of the Advanced Mathematics course for teenagers and high school students. It's presented on Unisor.com. I suggest you watch this lecture from this website, because it has very detailed notes for each lecture, and registered students can engage in an educational process that involves enrollment and ends with exams. Everything is voluntary and the site is free, so it's up to you how to use it.
Today we'll go through three practical problems related to the lecture I did before on the theory of correlation. Actually, I said there are three problems, but it's really the same problem with different numbers, and I think it's important to go through some practical calculations. It helps to inculcate in your mind what exactly the correlation is and how it feels.
We will continue talking about two random variables, each of which takes only two values. We have a variable xi, which takes values x1 and x2 with probabilities p and 1 - p, and a variable eta, which takes values y1 and y2 with probabilities q and 1 - q. I put this in a table, which I'm going to use instead. It's more convenient, because there is also a mutual (joint) distribution that I have to record somehow. So these are the values which my random variable xi takes, and these are the values for eta.
Why did I put it in a table? Because I want to know the mutual distribution: for instance, what is the probability that xi takes x1 and simultaneously eta takes y1? For that we need some new information besides what I said before. I assume that this probability is known and equal to r. That is actually sufficient to determine all the other probabilities.
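The table on the board is not reproduced in this transcript. A possible reconstruction is sketched below, assuming that rows correspond to the values of xi and columns to the values of eta (the orientation is an assumption, not something the lecture states). The three cells other than r are the ones derived in the next part of the lecture.

```latex
\[
\begin{array}{c|cc|c}
             & \eta = y_1 & \eta = y_2    & \text{total} \\ \hline
\xi = x_1    & r          & p - r         & p            \\
\xi = x_2    & q - r      & 1 - p - q + r & 1 - p        \\ \hline
\text{total} & q          & 1 - q         & 1
\end{array}
\]
```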
The probability of xi taking x1 is p, and the probability of xi taking x1 and eta taking y1 is r, so the probability of xi taking x1 and eta taking y2 must be p - r, to make the total equal to p. Regardless of whether eta takes y1 or y2, the probability of xi taking x1 should be p. So if this one is r, that one is p - r. Similarly, the probability of eta taking y1 is q, and the probability of eta taking y1 together with xi taking x1 is r, so the probability of eta taking y1 together with xi taking x2 must be q - r. Finally, for the same reason, the probabilities for x2 must add up to 1 - p and the probabilities for y2 must add up to 1 - q. So the last one is 1 - p - (q - r) = 1 - p - q + r. Going the other way, 1 - q - (p - r) = 1 - q - p + r, which is exactly the same thing.
So this is the table I'm going to use. It contains not only the probabilities of xi and eta taking the values x1, x2, y1, y2, but also the probabilities of their taking certain values together.
Based on this table, and I refer you to the previous lecture, I have calculated the correlation between these two variables. I'm just writing the formula here, because it was derived in the previous lecture: R(xi, eta) = (r - pq) / sqrt(p(1 - p)q(1 - q)). So that was the formula.
What I would like to do now is take three different problems with different numerical values for p, q and r, and see how the correlation behaves.
Problem number one: p = 1/2 and q = 1/2, so 1 - p is also 1/2 and 1 - q is also 1/2. Now r will be a variable, and I would like to investigate how the correlation between xi and eta behaves as r changes. To know that, we first have to analyze in what range r can change. Obviously, since r is a probability, it should be greater than or equal to 0.
But that's actually not enough, because all the other probabilities in the table must also be greater than or equal to 0. That means r must be less than or equal to p, otherwise p - r would be negative. It must also be less than or equal to q, otherwise q - r would be negative. And 1 - p - q + r must not be negative either, so r must be greater than or equal to p + q - 1. These are the necessary conditions that have to be taken into account whenever I'm analyzing the correlation as a function of r, the mutual probability of xi taking x1 and eta taking y1.
So what does it mean in this particular case? r must be less than or equal to 1/2, and the condition with q is exactly the same. And r must be greater than or equal to p + q - 1, which is 1/2 + 1/2 - 1 = 0. So r is between 0 and 1/2, inclusive. That's my condition, and that's where I have to analyze how the formula behaves.
Obviously, it's a linear function of r, so it changes monotonically, and the coefficient at r is positive because the square root is always positive. That means my straight line is increasing. So to find the range of this function, we just take its values at the smallest and the largest values of the argument r.
I'll write the correlation slightly differently, as a function of r: R(r). For r = 0, the denominator is sqrt(1/2 · 1/2 · 1/2 · 1/2) = sqrt(1/16) = 1/4. So R(0) = (0 - pq) / (1/4) = (-1/4) / (1/4) = -1. Now r = 1/2 is the maximum value for r, and it gives the maximum value for the correlation if we substitute it here.
It is equal to (1/2 - 1/4) / (1/4), which is (1/4) / (1/4) = 1. So, as we see, the correlation coefficient between these two variables, depending on the parameter r of their mutual distribution, can be anywhere from -1 to 1.
Where is it equal to 0? It's equal to 0 when lowercase r is equal to 1/4, because then the numerator is 1/4 - 1/4. What does it mean that r is equal to 1/4? It means r = p · q: the mutual probability of xi taking x1 and, at the same time, eta taking y1 is equal to the product of the corresponding unconditional probabilities. When does that happen? It happens when the two variables are independent. If the variables are independent, then r, the probability of xi taking x1 and eta taking y1, is equal to the product of their probabilities, p times q. So for independent variables we have confirmed that the correlation is 0. We did prove that in the theoretical lecture, and it's obviously visible from the formula, but it's a reasonable check. By the way, what's interesting is that 1/4 is the midpoint of the domain: it's right in the middle between 0 and 1/2.
So the range of the correlation coefficient is from -1 to 1, with 0 for independent variables, and that's all I wanted to know about these random variables with these probabilities. That's my first problem.
My second problem is different only in the numbers: p = 1/2 and q = 2/3. What is the domain of this function, that is, what are the possible values of r? It must be less than or equal to 1/2 and less than or equal to 2/3, so the bound from the top is 1/2.
Now the bound from the bottom: 1/2 + 2/3 - 1 = (3 + 4 - 6)/6 = 1/6, since 3/6 is 1/2, 4/6 is 2/3 and 6/6 is 1. So r changes in the range from 1/6 to 1/2. That is the domain of this function. Mind you, it does not go down to 0: with these probabilities r cannot be equal to 0. That by itself says nothing about the correlation; it only means that xi taking x1 together with eta taking y1 cannot have probability 0.
Now that we have determined the domain of this function, let's determine the range. It's still a straight line, so what does it change from and to? Let's substitute the end values and see what the result is.
Start with R(1/6). The denominator first: p and 1 - p are 1/2 and 1/2, q and 1 - q are 2/3 and 1/3, so the product is 1/2 · 1/2 · 2/3 · 1/3 = 1/18, and we need sqrt(1/18). Since 18 is 9 times 2, that is 1/(3·sqrt(2)). Since it's the denominator, I multiply by 3·sqrt(2) instead. So R(1/6) = 3·sqrt(2) · (1/6 - 1/3), where 1/6 is r, my lowest value, and p times q is 1/3. Now 1/6 - 1/3 = -1/6, so the 6 goes to the denominator against the 3, and the result is -sqrt(2)/2.
Then R(1/2) = 3·sqrt(2) · (1/2 - 1/3). Now 1/2 - 1/3 = (3 - 2)/6 = 1/6, the 6 goes to the denominator, and the result is sqrt(2)/2.
And the correlation is equal to 0 where r = p · q = 1/2 · 2/3 = 1/3. So, as I said, r cannot be equal to 0, but the correlation can be equal to 0. Let's see what the mutual distribution looks like at that point, with r equal to 1/3 and all the other probabilities calculated correspondingly. Let me write it down; that would be interesting.
So r is equal to 1/3. Then p - r is 1/2 - 1/3, which is 1/6. q - r is 2/3 - 1/3, which is 1/3. And 1 - p - q + r is 1 - 1/2 - 2/3 + 1/3 = 1 - 1/2 - 1/3; since 1/2 + 1/3 = (3 + 2)/6 = 5/6, it's 1/6. Let's check: the probabilities for y1 add up to 1/3 + 1/3 = 2/3 and those for y2 to 1/6 + 1/6 = 1/3, which is right, and the probabilities for x1 and for x2 add up to 1/2 and 1/2, which is also right.
So this is the matrix of mutual probabilities. Notice that every entry is the product of the corresponding unconditional probabilities: 1/3 = 1/2 · 2/3 and 1/6 = 1/2 · 1/3. So at r = 1/3 these two variables are independent, and the correlation between them is 0, as it should be. If r is not 1/3 but moves toward the smaller value, 1/6, the correlation goes to -sqrt(2)/2, and if r goes up to 1/2, the correlation is positive and reaches sqrt(2)/2. So, unlike in the first problem, the correlation here cannot reach -1 or 1, and that's a very interesting property of correlation. That's problem number 2.
Finally, I would like to present a general problem, without specifying concrete numbers. So this is my general problem.
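The two numerical problems can be cross-checked with a short sketch. It uses the formula quoted from the previous lecture, R = (r - pq) / sqrt(p(1 - p)q(1 - q)), and the conditions that all four mutual probabilities are non-negative. The function names `r_domain`, `correlation` and `summarize` are made up for this illustration and do not come from the lecture.

```python
from fractions import Fraction
from math import sqrt


def r_domain(p, q):
    """Admissible values of r = P(xi = x1, eta = y1): all four cells must be >= 0."""
    return max(Fraction(0), p + q - 1), min(p, q)


def correlation(p, q, r):
    """R = (r - p*q) / sqrt(p*(1-p)*q*(1-q)), the formula quoted in the lecture."""
    return float(r - p * q) / sqrt(float(p * (1 - p) * q * (1 - q)))


def summarize(p, q):
    """Return the domain of r, the correlation at both ends, and the zero point r = p*q."""
    lo, hi = r_domain(p, q)
    return lo, hi, correlation(p, q, lo), correlation(p, q, hi), p * q


# Problem 1: p = q = 1/2
print(summarize(Fraction(1, 2), Fraction(1, 2)))
# Problem 2: p = 1/2, q = 2/3
print(summarize(Fraction(1, 2), Fraction(2, 3)))
```

Exact fractions are used for the probabilities so that the domain ends and the zero point r = pq come out as 1/6, 1/2 and 1/3 rather than as rounded decimals; only the correlation itself is converted to floating point, because of the square root.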
Now, to make it a little more palatable, and so that I don't have to deal with different cases, let me make an assumption: 1/2 ≤ p ≤ q. I can always make this happen. Obviously, either p or 1 - p is at least 1/2, so I will assign x1 and x2 in such a way that x1 corresponds to the probability that is at least 1/2 and x2 to the smaller one; if it's not so, I just reverse the indices. The same thing with q: if it's smaller than 1/2, I reverse y1 and y2. And q can be either greater or smaller than p, there are no other ways; if it's smaller, I just interchange which variable I call xi and which I call eta, because the correlation does not depend on their order. So I can always arrange the names and indices in such a way that this inequality holds.
Why do I need it? Because it makes my life easier. Instead of two inequalities from the top, r ≤ p and r ≤ q, I can use only one of them, the one with p, the smaller of the two: if q is at least p, it's completely sufficient to say that r is less than or equal to p. From the bottom, r must be greater than or equal to p + q - 1, and I'll leave that as it is; we'll see if it works. Since p and q are both at least 1/2, p + q - 1 is not negative, so r ≥ 0 holds automatically. So p + q - 1 ≤ r ≤ p. By the way, the left side is definitely not greater than the right side, because q - 1 is negative or equal to 0, so this is a correct inequality. Just checking: every once in a while, when you see some formula and you're not really sure about it, see if there is a simple way to check that it's true. That's one of those little checks.
Now let's see what happens if I substitute the smallest and the largest values of r into the expression for the correlation. Let's start with the smallest. The numerator is p + q - 1 - pq. From p and -pq I factor out p and get p(1 - q); what remains is q - 1, which is -(1 - q). Now 1 - q can be taken out, and what remains is p - 1, so the numerator is (1 - q)(p - 1) = -(1 - p)(1 - q). I put the minus in front because I would like to deal with positive values; the numerator itself is negative.
Now let's divide it by the denominator: R(p + q - 1) = -(1 - p)(1 - q) / sqrt(p(1 - p)q(1 - q)). First of all, there is the minus. Then 1 - p divided by the square root of 1 - p leaves the square root of 1 - p, the same thing with 1 - q, and all of that is divided by the square root of pq. Under one square root: R(p + q - 1) = -sqrt((1 - p)(1 - q) / (pq)).
The correlation coefficient is supposed to be at most 1 by absolute value, so what about this one? Look at (1 - p)/p: p is at least 1/2, so 1 - p is at most 1/2, and that's why the ratio is at most 1. Similarly, q is also at least 1/2, so 1 - q is at most 1/2, and (1 - q)/q is at most 1. We divide smaller by bigger, so both ratios are at most 1, and the square root of their product is also at most 1. So we have a negative number not exceeding 1 by absolute value. That's the left end of my range: -sqrt((1 - p)(1 - q) / (pq)).
Now what do we have on the right side? To get the largest value of the correlation, I have to put in the largest value of the argument, which is p. The numerator is p - pq = p(1 - q), divided by sqrt(p(1 - p)q(1 - q)). p divided by the square root of p leaves the square root of p, and 1 - q divided by its square root leaves the square root of 1 - q. So I have sqrt(p) and sqrt(1 - q) on the top, and sqrt(1 - p) and sqrt(q) on the bottom: R(p) = sqrt(p(1 - q) / ((1 - p)q)). Let me write it slightly differently, as sqrt((p/q) · ((1 - q)/(1 - p))), so you can see that it is also at most 1. p/q is at most 1: that's my original assumption. And since q is at least p, 1 - q is at most 1 - p, so again it's smaller divided by bigger, which is also at most 1. So the result is again a number not exceeding 1 by absolute value.
So that's my range for the correlation coefficient in the general case, when my two variables take only two values each, under this assumption (and again, we can always make that assumption by changing the indices): -sqrt((1 - p)(1 - q) / (pq)) ≤ R ≤ sqrt(p(1 - q) / (q(1 - p))). I forgot to put the minus in front of the left end; here it is.
Where is it equal to 0? Obviously when r = pq, because that makes the numerator equal to 0. That value should be somewhere between the two extreme values of r; let's verify. Obviously pq ≤ p, because q is a probability, which is between 0 and 1. Now how is pq related to p + q - 1? Let's check the inequality pq ≥ p + q - 1. Put everything on one side: pq - p - q + 1 ≥ 0. Is this true? pq - p is -p(1 - q), and -q + 1 is 1 - q, so the left side is (1 - p)(1 - q), and this is obviously greater than or equal to 0. All these transformations are reversible, so from the last inequality I can derive the first one. So pq is indeed between the extreme values of r.
Basically, that's it; that's all I wanted to talk about. I presented three calculations, including one for the general case, and they exemplify how the correlation coefficient is actually calculated. It has nothing to do with statistics, although statistical distributions are also used to check what the correlation coefficient is,
and as you see in these examples (and that's a very important observation), the correlation can be equal to 0. Remember the second example, with p = 1/2 and q = 2/3: r went from 1/6 to 1/2, and whenever r was equal to 1/3, the correlation between the variables was 0. For variables that take only two values each, that special value r = pq is exactly the independent case, because then all four mutual probabilities are products of the corresponding unconditional ones. In general, though, for variables that take more values, the correlation being equal to 0 is not a sign of independence: it can happen that the mutual distribution is such that the variables are dependent and the correlation coefficient is still equal to 0.
And, that's again a different story, even if the correlation coefficient is positive and even close to 1, it doesn't mean causation. Let me repeat what I said in the previous lecture: correlation does not mean causation. You don't know whether xi somehow reacts to eta, or eta reacts to xi, or some other random variable or some conditions are affecting both xi and eta in a similar fashion, and that's why they change in the same way. That is a completely different question, and it's not mathematics at all. Mathematics is only this: how the correlation is calculated and how it looks in different cases like these.
Another important note is that correlation is a relatively good indication when the random variables are linearly related to each other. If you remember, whenever one variable was a linear function of the other with coefficient A, the correlation coefficient was equal to 1 or -1, depending on the sign of A. So for the linear case correlation is important. For a non-linear case, counting on the correlation coefficient to tell you something about that type of dependency would not really work well. So in some cases it's a good indication of some kind of dependency, and in others it might not be.
All right, that's it for today. Thank you very much, and good luck.
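As a recap of the general problem, the two ends of the correlation range can be compared numerically with the formula R = (r - pq) / sqrt(p(1 - p)q(1 - q)) evaluated at r = p + q - 1 and r = p. This is only a sketch under the lecture's assumption 1/2 ≤ p ≤ q; the function names and the sample values p = 0.6, q = 0.75 are illustrative and do not come from the lecture.

```python
from math import sqrt, isclose


def correlation(p, q, r):
    """R = (r - p*q) / sqrt(p*(1-p)*q*(1-q))."""
    return (r - p * q) / sqrt(p * (1 - p) * q * (1 - q))


def general_range(p, q):
    """Closed-form ends of the correlation range, assuming 1/2 <= p <= q < 1."""
    if not (0.5 <= p <= q < 1):
        raise ValueError("the sketch assumes 1/2 <= p <= q < 1")
    lowest = -sqrt((1 - p) * (1 - q) / (p * q))   # reached at r = p + q - 1
    highest = sqrt(p * (1 - q) / (q * (1 - p)))   # reached at r = p
    return lowest, highest


# Illustrative values only (not from the lecture), chosen so that 1/2 <= p <= q < 1.
p, q = 0.6, 0.75
lowest, highest = general_range(p, q)

# The closed forms should agree with the formula evaluated at the ends of the domain.
print(isclose(lowest, correlation(p, q, p + q - 1)))
print(isclose(highest, correlation(p, q, p)))
# r = p*q lies inside the domain and gives zero correlation.
print(p + q - 1 <= p * q <= p, correlation(p, q, p * q))
```

The check `q < 1` is added only to keep the denominator away from zero; the lecture does not discuss that degenerate case.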