Hello and welcome to Module 9, where we are going to discuss linear correlation and goodness-of-fit tests. First stop is going to be linear correlation.

Bivariate data is data that contains two variables: typically you have an x variable and a y variable. Our goal in this module is to look at the relationship between these two variables. One way to do that is to create a scatter plot. It's a graphical representation where the values of the two variables are plotted along two axes, and the scatter plot can then be used to identify correlation. For instance, in my visual here on the screen, you have someone's height in meters, and on the y-axis you have their weight in kilograms. Notice I plotted everyone's height and weight, so every point represents the height and weight of one person. And look at this nice, pretty trend here: as height increases, weight increases, and not only that, but notice it's almost a nice, neat line. That's what we mean by correlation, specifically linear correlation, which is our focus.

A correlation exists between two variables when the values of one are somehow associated with the values of the other. One thing you want to keep in mind, though: just because you have correlation, it does not imply a cause-and-effect relationship. It could appear that there's a relationship between two things, but maybe that relationship is because of a third variable that we did not consider. So correlation does not imply causation. A linear correlation exists between two variables when there is a correlation and the scatter plot results in a pattern that can be approximated by a straight line.

So here I have displayed some scatter plots of data, each with its correlation coefficient r. I have r = -1; notice that is literally a perfect negatively sloped line.
That's how my scatter plot points appear. Just below that I have r = +1. The correlation coefficient r is always between -1 and 1, and notice that nice, perfect positively sloped line; that's what the points appear to be making. Now if you look at a correlation coefficient of r = -0.94, notice the points go downward, but they don't make a perfect line. And if you look at a correlation coefficient of r = +0.86, notice the points cluster around a line that slopes upward. That's called the line of best fit. The points appear to be forming a straight line, but it's not a perfect, exact straight line like the r = +1 picture. Then we have r = +0.08. There's not really much going on there; it's just a cluster of points. It may appear that there's a slightly positively sloped line of best fit that would go through those points. And then you have the correlation coefficient r = 0. Notice there's no actual linear pattern in our scatter plot points; it's kind of an arc.

So here are the requirements to do the linear correlation test we're about to do. The sample of paired data is a simple random sample of quantitative data.
So that means numeric data. The scatter plot confirms that the points approximate a straight-line pattern, so you need the scatter plot to give you that visual. And outliers must be removed if they are known to be errors. The effects of any other outliers should be considered as well by calculating r with and without them, because remember, outliers are data values that are way out of line from the rest, and they could really mess up our calculations.

For notation, n is the number of pairs of sample data, r is the linear correlation coefficient for the sample, and then there's this p-looking thing called rho. Row, row, row your boat, except it's spelled R-H-O. It's the linear correlation coefficient for a population.

To calculate the linear correlation coefficient, which is going to be important for conducting our hypothesis test here in a minute, use the following formula. Notice it has r equals: you take all these x times y products, add them up, and multiply by the sample size n; then you subtract the sum of all the x's multiplied by the sum of all the y's. I mean, there's a lot of work going on there. It's not impossible to do, but who wants to spend a long time calculating the correlation coefficient by hand? So we're going to use technology. But I want you to understand and appreciate what technology is doing for you; that's why I wanted to show you this fun formula.

So here's some important information about that linear correlation coefficient r. r is always between -1 and 1. If all values of either variable are converted to a different scale, the value of r will not change. So it doesn't matter if we put population in terms of thousands of people, tens of thousands of people, or millions of people.
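For reference, the formula read aloud above is presumably the standard computational form of the sample correlation coefficient. The slide itself is not reproduced in this transcript, so treat the layout below as an assumption. Here n is the number of data pairs and each sum runs over all pairs:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{n\sum x^{2} - \left(\sum x\right)^{2}}\;
          \sqrt{n\sum y^{2} - \left(\sum y\right)^{2}}}
```

The numerator matches the spoken description: n times the sum of the xy products, minus the product of the sum of the x's and the sum of the y's.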
Rescaling the data would not change the value of the correlation coefficient. The value of r is not affected by the choice of x and y: you could switch the x variable with the y variable and still get the exact same answer. r measures the strength of a linear relationship, and r is sensitive to outliers. But once again, technology will take care of all of that for us.

I often comment at this point that if I could make up a hypothesis testing handout, I would list all the different types of hypothesis tests that we encounter in a statistics class, and then I would write out all the different hypotheses you could deal with. Well, for a test for linear correlation, it's always the same two hypotheses. The null hypothesis is rho = 0. Remember, rho is the population correlation coefficient. If the correlation coefficient for the population is equal to zero, there is no linear correlation; that's the English translation. And the alternative is also always the same: rho is not equal to zero. Notice you have "equal to" and "not equal to"; equality always goes with the null hypothesis. And if rho is not equal to zero, that means there is a linear correlation.

Critical values can be found from the correlation coefficient critical value table using the significance level and n - 2 degrees of freedom, where n is the number of pairs of data. We'll talk about that in a moment. This is a two-tailed test, so there is a positive and a negative critical value. If you use the critical value method and the absolute value of the correlation coefficient r is greater than the critical value that we find using the table, we reject the null hypothesis, and that means there is sufficient evidence of linear correlation. If the absolute value of the correlation coefficient is less than or equal to the critical value, we fail to reject the null hypothesis, and then there is not sufficient evidence of linear correlation. That's the correlation coefficient and critical value approach.

And then we have our p-value and alpha comparison approach. If the p-value is less than alpha, if it's under that limbo bar, we reject the null hypothesis and conclude there is linear correlation. If the p-value is greater than alpha, then we fail to reject the null hypothesis, as usual, and conclude there is not sufficient evidence of linear correlation.

To conduct the test for linear correlation, we'll use the Regression tab in the Google Sheets spreadsheet. We'll clear out any data that's in column A and column B, and then we'll either copy and paste or type the data into columns A and B. Then we'll find the needed information, including our correlation coefficient, test statistic, and two-tailed p-value, in column E.

In my first example, the paired shoe length and height data, in centimeters, from five males is given. Conduct the hypothesis test of the claim that there is a linear correlation. So that's our claim: there is linear correlation between the two variables. Use a 0.05 significance level. They give me the shoe length and they give me the height for each of these males. My hypotheses are as follows. The null hypothesis is always rho.
Remember, that's the population correlation coefficient, and the null hypothesis says it is equal to zero. The English translation of that is that there is no linear correlation. Next, the alternative hypothesis. Remember, the alternative is always basically the opposite of the null, so it's rho is not equal to zero, and that means there is linear correlation. So notice the equality is with the null hypothesis, "not equal to" is with the alternative hypothesis, and our claim is that there is linear correlation.

Next step is to look at the scatter plot and see if there is a general linear pattern. In my scatter plot I have a point for every shoe length and height. So if the shoe length is 29.7 and the corresponding height is 175.3, I go to 29.7 and I plot a point at 175.3. If the shoe length is 29.7 and the height is 177.8, I go to 29.7 and I put a point at 177.8. So there are basically two points, one right above the other, in that area. And then you continue and plot the other three points. What I have drawn here is the line of best fit, and it's showing that there is kind of a slight upward, positive linear trend going on there. So let's go forth and find more information for this hypothesis test for correlation.

Now we're going to find r, the linear correlation coefficient. We're going to use Google Sheets for that. We'll also use Google Sheets to find the test statistic and the p-value. The critical value will be found by using the critical value table that we'll talk about in a moment. So I'm going to go to the Google Sheets spreadsheet document. You start off on the One Variable Stats tab; that's the default tab, and you need to go to the Regression tab. Starting in cells A2 and B2, you'll type the data values for x and the data values for y: you'll type your shoe lengths, and then you'll type your heights. So I've typed my five pairs of data into the spreadsheet, and notice you have your correlation coefficient r, which is about 0.59.
So that's cell E2. You have your test statistic, which to two decimal places is 1.27; that's what t is. And then you have your p-value. There's only one p-value to pick from here, and that's 0.2937. So it's about as easy as that.

All right, so our correlation coefficient, as we found, is 0.59. Our test statistic is 1.27. The critical value will come in just a minute. And our p-value is 0.2937; p-values are typically given to four decimal places.

To find the critical value, you need to know your degrees of freedom. Remember, degrees of freedom is just something that's used to help you find the critical value, and it's used in certain distributions, including this one here, when you run the hypothesis test. The degrees of freedom is the number of pairs of data minus two. I have five pairs of data, minus two, which is equal to three. So we use the critical value table to find the critical value when there are three degrees of freedom. Remember, the level of significance typically used for a hypothesis test is 0.05 unless they say otherwise. For these tests you'll always have a 0.05 level of significance. You go down to three degrees of freedom and look at the critical value. It's 0.878. Remember, there is both a positive and a negative critical value, so the critical value is plus or minus 0.878. If they ask you for the positive critical value, you put 0.878; if they want the negative critical value, you put -0.878.

So now, for our conclusion, I must compare the p-value to alpha. Remember, alpha is 0.05; that's our significance level. The p-value, 0.2937, is clearly greater than alpha, which means we fail to reject. We are not under the limbo bar. Sorry.
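The spreadsheet values for this example can be cross-checked by hand. The sketch below is not the course spreadsheet: it assumes the usual test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), which the lecture does not state, and the function names are made up for illustration. The value 3.182 is the standard two-tailed t critical value for 3 degrees of freedom at alpha = 0.05, taken from a t table rather than from the lecture. Because r is rounded to 0.59, results can only be expected to match to the rounding shown.

```python
import math


def t_statistic(r, n):
    """t statistic for testing rho = 0, from sample r and n pairs (assumed formula)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def critical_r(t_crit, df):
    """Convert a two-tailed t critical value into the matching critical value for r."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


def reject_null(r, r_crit):
    """Critical value method: reject H0 only when |r| exceeds the critical value."""
    return abs(r) > r_crit


# Example 1 from the lecture: r is about 0.59 with n = 5 pairs.
r, n = 0.59, 5
df = n - 2  # degrees of freedom = number of pairs minus two

t = t_statistic(r, n)
r_crit = critical_r(3.182, df)  # 3.182 is from a t table (assumption, not from the lecture)

print(f"t = {t:.2f}")
print(f"critical r = +/-{r_crit:.3f}")
print("reject H0" if reject_null(r, r_crit) else "fail to reject H0")
```

With these inputs the rounded results should agree with the lecture's t = 1.27 and critical value 0.878, and the critical value method gives the same decision as the p-value comparison: fail to reject.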
We fail to reject the null hypothesis, which is that there is no linear correlation. What that means is that I can't say anything in favor of my claim: there is not evidence to support my claim that there is linear correlation. So "there is not sufficient evidence to support the claim that there is linear correlation" is the actual structure for the conclusion statement for this test. Now, perhaps we should have used more data points, but once again, this is just an example to show you how the process works.

All right. Listed below are the numbers of enrolled students, in thousands, and the numbers of burglaries for randomly selected large colleges in a recent year. Is there sufficient evidence to conclude that there is linear correlation between enrollment and burglaries? We'll start off with the hypotheses. Same old hypotheses any time you do a test for correlation: rho is equal to zero, which means there is no linear correlation, and then rho is not equal to zero, which means there is linear correlation, which is our claim.

Then you have your scatter plot. Once again, I've plotted enrollment along the x-axis and burglaries along the y-axis. Remember, enrollment is in terms of thousands. So I went to 32 and up to 103, so I have a point at (32, 103). Then you have a point at (31, 103), a point at (53, 86), and so on. These can all be represented as ordered pairs, if you'd like to think back to algebra class and plotting points.

Anyway, let's find all the different things we need for this test. As a refresher, here are the hypotheses over on the right-hand side. Remember, we'll use Google Sheets to find the correlation coefficient, the test statistic, and the p-value, and we'll use a table to find the critical value. All right, so let's find the correlation coefficient, test statistic, and p-value now in Google Sheets.
We are on the Regression tab. We clear out any data that's currently in column A and column B, and then we're going to input all of those pairs of data. So input your x values, your enrollment values. Make sure you press Enter after you type each value; do not use the down arrow key. Otherwise it will not register for you and your calculations will all be off, which is not very fun if you ask me.

All right, so I typed everything in, and I have a correlation coefficient of 0.56, a test statistic of 1.92, and a p-value of 0.0914. Those are the things we need.

So we found that our correlation coefficient has a value of 0.56, our test statistic has a value of 1.92, and our p-value is 0.0914, and we'll compare that to alpha in just a minute. To find the critical value, we need to know our degrees of freedom. Our degrees of freedom is the number of pairs of data minus two. And how many pairs of data do we have? We have ten, so our degrees of freedom is going to be eight. So let's look at the table and determine what the critical value is when we have eight degrees of freedom. The critical value for eight degrees of freedom is 0.632. Remember, that's plus or minus 0.632.

Let's compare that p-value to alpha now. Alpha is 0.05, and it looks like the p-value, 0.0914, is greater than alpha.
So once again, we fail to reject the null hypothesis. So guess what: we don't have evidence to support our claim. We can't rule out the null, so we can't say anything in favor of the alternative, which is our claim. The proper structure is: there is not sufficient evidence to support the claim that there is linear correlation.

But would these results change? Would my correlation coefficient, test statistic, critical value, or p-value change if I gave the enrollments as 32,000 and 31,000 instead of 32 and 31? The answer to that is no, because scaling the data does not affect correlation. It would not affect any of the calculations in this test.

So that's how to run a test for linear correlation. Thanks for watching.
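The scaling claim is easy to check numerically. The sketch below computes r with the standard computational formula (assumed to be the one shown on the slide) and uses made-up enrollment and burglary numbers purely for illustration; they are not the lecture's data set, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Sample linear correlation coefficient via the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(n * sum_x2 - sum_x ** 2) * math.sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator


# Hypothetical data (NOT from the lecture): enrollment in thousands, burglaries.
enrollment_thousands = [10, 20, 30, 40, 50]
burglaries = [12, 25, 31, 48, 52]

# The same enrollments written out in full (10 becomes 10,000 and so on).
enrollment_full = [x * 1000 for x in enrollment_thousands]

r_thousands = pearson_r(enrollment_thousands, burglaries)
r_full = pearson_r(enrollment_full, burglaries)
r_swapped = pearson_r(burglaries, enrollment_thousands)

# Rescaling x, or swapping x and y, leaves r unchanged up to floating-point rounding.
print(math.isclose(r_thousands, r_full))
print(math.isclose(r_thousands, r_swapped))
```

Both checks should print `True`. Since the test statistic and p-value are computed from r and n alone, they would be unchanged by the rescaling as well.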