Hi, I'm Zorf. Welcome to Unisor Education. This lecture is part of the advanced course of mathematics for teenagers and high school students. It's presented on Unisor.com. I suggest you watch this lecture from that website because it has very detailed notes, and registered students have the option to enroll in certain classes and take exams. Everything is free, by the way. So today we are going to talk about one particular problem, which is, I would say, quite practical. The problem is related to correlation, statistical correlation. I tried to present this problem with numbers as practical as possible, so in some way it might actually be similar to a real problem. It's related to vaccination and to how much usefulness of a particular vaccine might or might not be observed. Now, I want not only to present the techniques for dealing with this particular case, but also to do some analysis, so that you understand better the meaning of correlation and the proper interpretation of the correlation between different things. So here is what we will talk about today. Some pharmaceutical company has been able to manufacture a vaccine and is trying to find out whether this vaccine is really working or not. Should it be recommended for use in real medical practice? Well, the way it's done is that they usually take some group of people, arrange some experiments with the vaccine, and then analyze, based on the results of these experiments, whether the vaccine is working or not. So I'm going to present one such experiment, and I will start with some more or less practical results. Here is what I mean. Let's say this pharmaceutical company observed 2,000 people. Well, this is actually a very large group of people. Usually it's much smaller, but the more the merrier in statistics, as you know. And they divided these people into two groups.
1,000 people were vaccinated and 1,000 people were not vaccinated; they were just given a placebo or whatever it is. So this group is vaccinated and this one is non-vaccinated. Now, the vaccine is supposed to be a remedy against some virus or whatever else, so during a certain period of time they observed who got sick. The results I am talking about are the following. 350 people out of 2,000 got sick: 50 of them were from the group of vaccinated people and 300 from the non-vaccinated. Well, that's it. That's all that is given. Now they have to make some kind of a judgment. Is the vaccine working or not? Well, there is a very big difference, by the way, between the number of vaccinated people getting sick and the number of non-vaccinated people getting sick. But is this difference significant enough, statistically significant from a real scientific point of view, to make a judgment about the effectiveness of this particular vaccine? Well, here is what we can do. We are going to build a statistical model of this, and then we will make a judgment based on statistical calculations on this model. So here is my model. Let's consider a random variable xi which takes the value 1 if a person is vaccinated and 0 if not. Considering we have 1,000 vaccinated and 1,000 non-vaccinated people, we can say that this variable is equal to one with probability one half and equal to zero with probability one half. But this will be addressed a little later. Then we will have another variable, eta, which takes the value 1 if this particular person got sick and 0 if not. In this case, the probability of eta taking the value of one is 350 over 2,000, and the probability of not being sick is obviously whatever is left, 1,650 over 2,000. So we have two random variables. Now, if the vaccine is absolutely not effective, then we can also say that these two variables are probably independent.
Because in this model, independence of the two variables is exactly what a non-working vaccine looks like. If the vaccine is not working, then basically these two variables are independent. So I really have to make a judgment about their dependency. Now, the tool we will use to make a judgment about their dependency is the correlation between these two variables. We know that independent variables have a correlation of zero. Now, dependent variables can also have zero correlation, but it's a rare occasion. Usually they do have a certain correlation. And the greater this correlation is by absolute value, the closer it is to either one or minus one, the more linear the dependency between these two random variables is. So my purpose right now is to calculate the correlation. Now, I decided to put all this information into a compact table, which I will present right now, and it will be easier to make calculations based on this table. So I just have a matrix of the values of xi and eta. Eta is equal to one, which is sick, or eta is equal to zero, which is healthy. Xi is equal to one, which is vaccinated, or xi is equal to zero, which means non-vaccinated. Now, according to the numbers I presented before, we had 50 people vaccinated and sick and 300 people non-vaccinated and sick. Altogether I have 1,000 people vaccinated and 1,000 people non-vaccinated, which means that here I have 950 vaccinated and not getting sick and 700 not vaccinated and not getting sick. And my totals are 350 sick people and 1,650 healthy people out of 2,000. Now here I have all the information I need to calculate my correlation. You remember how correlation is calculated. First we do the covariance, which is the expectation of their product minus the product of their expectations.
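For reference, the table just described and the covariance formula just recalled can be written out as below. This is only one possible layout; \xi and \eta stand for the vaccination and sickness variables of the model, and the row and column labels are added here for readability.

```latex
\begin{array}{l|cc|c}
 & \eta = 1 \text{ (sick)} & \eta = 0 \text{ (healthy)} & \text{total} \\ \hline
\xi = 1 \text{ (vaccinated)}     & 50  & 950  & 1000 \\
\xi = 0 \text{ (not vaccinated)} & 300 & 700  & 1000 \\ \hline
\text{total}                     & 350 & 1650 & 2000
\end{array}

\operatorname{Cov}(\xi, \eta) = E(\xi \eta) - E(\xi)\, E(\eta)
```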
Then I need the variance of each of them, and then my correlation coefficient is the covariance divided by the square root of the product of the two variances. So I'm going to just do all these calculations, which are really very simple. First of all, the mathematical expectation of their product. Their product is one times one, one times zero, zero times one, or zero times zero. These are the values which the variable xi times eta takes, and the probabilities are the corresponding proportions of the total number. So, for instance, xi times eta takes the value one with the probability of 50 divided by 2,000. So let's write the expectation of xi times eta. This is the probability, and the value is one, so I don't need to multiply it by anything. Now, all other values are zero, right? One times zero, zero times one, or zero times zero. So everything else contributes zero, and the expectation is 0.025. So that's my expectation of their product, okay? Now we need the expectation of each of them. Well, xi takes the value one or zero with probability 1,000 over 2,000 in each case. So it's one half for both cases. So xi is basically a Bernoulli variable, because it takes the value one or zero, each with probability one half. So we know that the expectation of xi is equal to one times one half plus zero times one half, which is 0.5. And now the variance. If you don't remember it, you can just refer to my probability lectures on Bernoulli variables. The variance of a Bernoulli random variable which is equal to one with probability P is P times one minus P. In this case P is equal to 0.5, so my variance is equal to 0.25. Now the expectation of eta. Well, eta has two values, one and zero. 350 people over 2,000 is the probability of the value one, and 1,650 over 2,000 is the probability of the value zero. So the expectation is one times 350 over 2,000 plus zero times 1,650 over 2,000.
The second term is zero, so I don't count it. So that's 0.175. And if you multiply these two expectations, of xi and of eta, 0.5 times 0.175, you have 0.0875, right? So that's the product of their expectations. The covariance is equal to the expectation of the product minus the product of the expectations, 0.025 minus 0.0875. Notice the first is smaller and the second is larger, so the result is negative. And the result is minus 0.0625. So what does it mean that it's negative? Well, that's obvious, because vaccination promotes health, right? Those who are vaccinated get sick much less often than those who are non-vaccinated. So the greater value of xi, vaccination, corresponds to a lower value of eta, closer to zero. That's why they go in different directions. More vaccination means less sickness. Xi equal to one correlates with eta equal to zero, and xi equal to zero correlates more with sickness, eta equal to one. That's why the correlation is supposed to be negative. Now, I have calculated the variance of xi. Now the variance of eta. Again, I know P, right? So it's P times one minus P, 0.175 times 0.825. The result is 0.144375. And if I divide the covariance by the square root of the product of 0.25 and 0.144375, I get my correlation coefficient, which is equal to minus 0.328976. Okay, so the correlation is negative. Well, it's closer to zero than to minus one. That's a very interesting observation, because we were thinking that, well, there is a significant difference, right?
With groups of the same size, among the vaccinated we had 50 and among the non-vaccinated we had 300 cases of sickness. So there is a significant, very noticeable difference between them. Yet the correlation coefficient is relatively far from minus one, which is what we would obviously like it to be. I mean, for strictly dependent variables in this case we should have minus one, right? But the dependency is not strict because, first of all, there are some sick people among the vaccinated. So the vaccine is not a hundred percent effective. And there are also some other considerations, which I will talk about later. So it's far from minus one. Now, is it insignificant? No, it's not insignificant, because 0.3 by absolute value is still within the range of, I would say, moderate correlation. Insignificant is probably less than 0.1 by absolute value. Strong correlation is greater than 0.7 by absolute value, I would say. That's kind of a general understanding of this. So it's not strong. It's a medium-sized correlation. It's good, but it's not good enough, let's put it this way. And this is my first set of calculations within this particular problem. Now let's think under what circumstances, with what numbers, you would expect the correlation to be greater. Well, greater by absolute value, I mean, since in this case it's negative. Obviously, if this number, 50, were zero, if none of the vaccinated people got sick, then it means the vaccine is working better, and the correlation between vaccination and not getting sick is stronger. Right? Let's do another set of calculations. In my next set of calculations I will have zero here and 350 here. So all sick people are among the non-vaccinated. This would be 1,000, so all 1,000 vaccinated people are healthy, and 650 here, so 650 out of 1,000 non-vaccinated people still were healthy. They did not get sick. All other numbers are the same.
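The comparison being set up here can be checked numerically. The following is a minimal sketch, not part of the lecture notes: the function name `correlation_2x2` and its argument names are made up for illustration, and it assumes the four counts come from a table like the one above, with both variables actually taking both values so that neither variance is zero.

```python
from math import sqrt


def correlation_2x2(vacc_sick, vacc_healthy, unvacc_sick, unvacc_healthy):
    """Correlation between xi (1 = vaccinated) and eta (1 = sick).

    Hypothetical helper: takes the four cell counts of the 2x2 table.
    """
    total = vacc_sick + vacc_healthy + unvacc_sick + unvacc_healthy
    p_xi = (vacc_sick + vacc_healthy) / total   # P(xi = 1), which is also E(xi)
    p_eta = (vacc_sick + unvacc_sick) / total   # P(eta = 1), which is also E(eta)
    e_product = vacc_sick / total               # E(xi * eta): only the (1, 1) cell counts
    covariance = e_product - p_xi * p_eta
    var_xi = p_xi * (1 - p_xi)                  # Bernoulli variance p(1 - p)
    var_eta = p_eta * (1 - p_eta)
    return covariance / sqrt(var_xi * var_eta)


# First table: 50 sick among the vaccinated, 300 among the non-vaccinated.
first = correlation_2x2(50, 950, 300, 700)
# Second table: no sick among the vaccinated, 350 among the non-vaccinated.
second = correlation_2x2(0, 1000, 350, 650)
print(round(first, 6), round(second, 6))  # approximately -0.329 and -0.461
```

The same function can be called with any other four counts, for example with equal numbers of sick people in both groups, to see how the correlation responds.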
And let's just calculate again whatever we need. All right, I'll wipe things out as I'm calculating. Well, first of all, the expectation of xi, with 1,000 vaccinated and 1,000 non-vaccinated, is exactly the same. And obviously its variance is exactly the same. For eta, again, it's 350 out of 2,000, so its expectation and its variance also stay the same. So individually these variables stay the same, because the totals are still the same. What has changed is their joint probabilities, and therefore the expectation of xi times eta is different now. Now, what is the expectation of xi times eta right now? Well, let's think about it. We have the value 1 times 1 with probability 0, right? Then the value 1 times 0 with probability 1,000 over 2,000, and then 0 times 1 and 0 times 0. So the product is equal to 1 only with probability 0, and every other value is 0. So the whole expectation is equal to 0, right? Now, the individual expectations are still the same, which means their product is still the same, 0.0875. So only instead of 0.025 I have 0 now. So my covariance now is equal to minus 0.0875, which by absolute value is greater than the previous one, minus 0.0625. And my r, which is this divided by the square root of the product of the variances, is different, and it's equal to minus 0.460566. So as you see, this is by absolute value greater than the previous one. That was about minus 0.3. This is about minus 0.46. Still relatively far from minus 1, right? But the correlation in this case is stronger. Okay. I will talk about the circumstances under which it will be equal to minus 1 after the next calculations, which I would like to make. Think about under what circumstances the correlation between these two things should be 0. Well, just logically speaking, if the proportion of sick people among the vaccinated is exactly the same as among the non-vaccinated, that actually means that the vaccine is not working at all, right? So let's just make another change here.
This is 175 and 175. The groups are the same size, so that's why it's the same number here. And correspondingly here we will have 825 and 825. The totals are again the same, which means that all these are still the same: the expectation and variance of xi, the expectation and variance of eta. Only the expectation of their product is now different. And what is the expectation of their product now? Let's just think about it. It's 1 times 175 over 2,000, right? So it's 175 over 2,000. Now, 1 times 0 is 0, and all the others are 0 too, 0 times 1 and 0 times 0. So the value of the product is always 0 except in this one case. So all other probabilities do not count, just this one, which is 0.0875. Now, lo and behold, this is exactly the same value as the product of their expectations, which means that as soon as we calculate the covariance, it will be equal to 0. From this I subtract this, and I have covariance 0. And therefore the correlation is 0. And that exactly corresponds to our intuitive understanding that if the proportion of sick people is the same among the vaccinated and the non-vaccinated, then vaccination really has no effect on being sick. So there is basically no correlation, and indeed the correlation is equal to 0. So that's logically correct. So look at it this way. First we had 50 and calculated that the correlation is about minus 0.33. Then we made the number of sick people among the vaccinated equal to 0 and expected that the correlation would be stronger, and it was, about minus 0.46. And then we made another experiment: if you have the same proportion of sick people among the vaccinated and the non-vaccinated, which means the vaccine is not working at all, there should be no correlation, and that is what we got. And the last question is, under what circumstances do we get a correlation of minus one? Well, if you read the notes to this lecture on Unisor.com, I did not put these calculations over there.
I said, well, try to come up with the solution yourself. But here in the lecture I'll just give a very simple hint. Obviously, if the number of sick people is exactly equal to the number of non-vaccinated people, all sick people are among the non-vaccinated, all vaccinated people are healthy, and among the non-vaccinated there are no healthy people, they all got sick, then if you calculate the same thing, you will have minus one. So why did I have to change the numbers to get there? Again, I'm just trying to give some logical explanation. It's actually easy, because to get sick it's not sufficient to be just not vaccinated. First of all, you have to really get exposed to the virus, right? Sometimes you are healthy just because you don't have any exposure, which means that in these cases, whether you are vaccinated or not, you won't get this particular illness. Number two, maybe you do get exposed and the virus gets into your body, but your own immune system can handle it and defeats the virus. That's why you can have healthy people even among the non-vaccinated. These two circumstances obviously diminish the observed effect of the vaccination because, again, sometimes you are vaccinated where it's not necessary, with no exposure, and sometimes you are not vaccinated, but your immune system is still strong enough and you don't get sick. So these two circumstances mean that there are other factors, which are actually always present in all this medical research. Whenever somebody wants to prove that a particular medicine works in a particular case of a particular illness, it's a very difficult and very complex process, precisely because there are many other factors participating in the whole picture, and to get rid of these factors is really very, very difficult. Sometimes it's just impossible. That's why in real life we never get a correlation equal to one or minus one.
We, as a rule, get something in between, and then it's a judgment call. Is a correlation of 0.3 by absolute value really sufficient as an indication of the effectiveness of a drug or a vaccine or anything like this? This is a judgment call, and it's really not easy, because sometimes people are saying, hey, maybe in 30 out of 100 cases the person who got sick will get well, and this is good as well. Right? Well, maybe. Judgment call, as I said. That's why statistics is not really an exact science whenever it's applied to real-life problems. Yes, by itself, in the abstract sense, it is an exact science, because it's based on the very exact theory of probabilities. But in practical situations it's not. All right, that was my kind of spiel about the validity of all these statistical calculations in real life. I'm just urging you to be very, very careful in your analysis. Even if your calculations are correct, the logical conclusions which you come up with based on these calculations are not necessarily obvious to everybody, and there are some cases when they are really wrong. Okay, that's it for today. I do recommend that you read again all the notes on the website Unizord.com. And it would be really very helpful if you calculate all these things on the matrix of results that gives minus one, or just make up your own results and try to calculate. This is just real practice, which is really very, very useful. All right, that's it for today. Thank you very much and good luck.
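For anyone who wants to check their answer to the minus-one exercise, here is one possible table worked through as a sketch. It assumes the same 2,000 people as in the lecture, with all 1,000 vaccinated people healthy and all 1,000 non-vaccinated people sick; \xi and \eta are the vaccination and sickness variables.

```latex
\begin{aligned}
P(\xi = 1, \eta = 1) &= 0, \quad
P(\xi = 1, \eta = 0) = \tfrac{1000}{2000}, \quad
P(\xi = 0, \eta = 1) = \tfrac{1000}{2000}, \quad
P(\xi = 0, \eta = 0) = 0 \\
E(\xi \eta) &= 1 \cdot 0 = 0 \\
E(\xi) &= E(\eta) = 0.5 \\
\operatorname{Cov}(\xi, \eta) &= 0 - 0.5 \cdot 0.5 = -0.25 \\
\operatorname{Var}(\xi) &= \operatorname{Var}(\eta) = 0.5 \cdot (1 - 0.5) = 0.25 \\
r &= \frac{-0.25}{\sqrt{0.25 \cdot 0.25}} = -1
\end{aligned}
```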