Hello and welcome to a new lesson on linear regression. Throughout this lesson we're going to walk you through a series of tests and ways in which we can conduct linear regression. In particular, this video is going to focus on correlation and how we can eventually conduct a hypothesis test for the significance of correlation. We're going to start with an introduction to the subject, and then later we'll get into the actual hypothesis test.

We are here in Google Colab, and we have several libraries that we've all used before in different settings. The question we're going to ask is: does using more water for fracking yield more natural gas production? If you're not familiar, there's some context provided here that you can read through on your own time, but essentially we are going to be using data from the Marcellus shale formation in Pennsylvania to answer this question. There are two CSV files that we're going to use, and I have them ready to go and read in.
The first one is lecture eight Marcellus wells and the second one is lecture 20 Pennsylvania wells frack. Both of these have data on fracking wells, but they have slightly different columns. In the Marcellus data, which I called mark, we've got all of this information about drilling, and what we're going to be interested in is the data down here, total gas and max gas. These are the values that tell us how much gas was produced from a well. The water data, on the other hand, has fewer columns. Basically what we're interested in is total base water volume, and there are a few other extra things.

Before we can actually get into working with this data we need to merge it together, so we need to do a little bit of pre-processing and some grouping. The first thing we're going to do is group the water data. I'm going to call this water two so that I don't overwrite my original data frame. I'm going to say water.groupby, and I want to group by API number. The API number is a way to identify the actual well that we are referring to. This water data set has multiple values per well, so we need to group them together and get the total for each well over the entire study period.

To give you a look at what this looks like after I run that, I'm using the .head command to print the first five rows. We've condensed our data set down to API number and total base water volume, as opposed to having the multiple columns up here. Because I'm not grouping or storing the other columns, they just drop off, which is fine because we really only need the API number and the total base water volume.

Now that we've grouped the data, the next step is to merge our water data and our wells data. However, before we can do that, if you look up here we've got this API number column, but it has a different name than our API number column here. Now these
are the same API numbers; they're just called different things in the different data sets. So in order to merge on them we need to rename the column. We don't need to rename every column; we just need to change API NO to API number so that it matches what the water data set has. In that same block of code I'm going to go ahead and merge our data frames using pd.merge and say mark, water two. We're going to do an inner join so that we only keep wells that appear in both data sets, and we're going to join on our newly renamed API number. Again I will print the first five rows. Here we can see we've got this big data set, and if we scroll to the end we've got our total base water volume at the very end to go with our total gas column. This is the data set that we're going to use throughout the correlation lectures.

Before we get started with correlation, a good habit to be in is to visualize our data so we know what is going on. We're going to use ggplot with merge df, and I'm going to use a slightly different visualization technique: I'm going to put our aes command within the larger ggplot command. It's still going to look the same. I'm still going to have an x value, which is total base water volume, and a y value, which is total gas, but the benefit to doing it this way is that these aes mappings carry throughout the whole plot. So when I go to do geom_point I don't need to provide an aes statement, because it carries down from the main ggplot command. That makes this go a little easier, because then we can add our stat_smooth. Again we don't need to give it the aes; we do still need to specify a linear model, and we can change the color to red.

Here we have what the total base water volume and the total gas look like, and now we could make some qualitative statements about the correlation. We can say that it is positively correlated, and that maybe, if I had to
estimate, it might be about 0.4, just based off of the examples that we have to go off of. But we haven't put a numerical quantity on that yet, and in order to do that we need to find the sample correlation coefficient, which is r.

I'm going to do this inside a print statement. I'm going to say correlation coefficient, and then I'm going to use round so that it's not printing a bunch of extra digits; we really only need a few. Then I'm going to give this the command for finding the correlation of sample data: we say merge df, total base water volume, and then the command after the square bracket is .corr, open parenthesis, and we give it the second variable, total gas. There's one more thing we need to do, and that's to finish off our round command. It's going to be right here; you can see how the parentheses are highlighted, so we're still inside the round command but outside of the correlation command. I'm going to say comma three, which rounds to three decimal places. We can print it out and see that the correlation coefficient is 0.494, which is okay. It's not the strongest correlation, but it's not terrible.

Based off of this plot and this correlation coefficient we can write a conclusion. Generally, when we're thinking about correlation, our conclusion should focus on the direction of the association, the relationship between the two variables, and the evidence for this relationship. I'm going to paste in some text that I had preloaded: there is a positive association between total water volume and total gas, which means that as total water volume increases, total gas tends to increase. The evidence for this is the correlation coefficient, which is positive, and the slope of the line, which we can see is also positive. This is a basic way to make
some inference from correlation, but it's still a bit qualitative. Although we do have a quantitative number, it's not necessarily the most statistically sound. So what we'll do in the next video is conduct a hypothesis test in order to get at more of that statistical inference, instead of just looking at one sample.
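
For anyone following along without the notebook, here is a minimal sketch of the steps from this video: group the water data by well, rename the key column, inner join, and compute r. It uses a tiny made-up data set rather than the Marcellus files, and every column name, variable name, and number in it is an assumption for illustration, not the lecture's actual data or code. Summing within each group is also assumed, since the video only says we want the total per well.

```python
import pandas as pd

# Hypothetical miniature data sets; names and values are invented for illustration.
wells = pd.DataFrame({
    "API_NO": [1, 2, 3, 4, 5],
    "total_gas": [10.0, 14.0, 15.0, 22.0, 21.0],
})
water = pd.DataFrame({
    "api_number": [1, 1, 2, 3, 3, 4, 5, 6],  # some wells appear more than once
    "total_base_water_volume": [2.0, 1.0, 4.0, 3.0, 3.0, 8.0, 7.0, 9.0],
})

# Step 1: one row per well, holding the total water used by that well.
# Columns that are not selected here simply drop out of the result.
water_totals = water.groupby("api_number", as_index=False)[
    "total_base_water_volume"
].sum()

# Step 2: rename the key column so both data frames share the same name.
wells = wells.rename(columns={"API_NO": "api_number"})

# Step 3: inner join keeps only wells present in both data frames
# (well 6 has water data but no gas data, so it is dropped).
merged = pd.merge(wells, water_totals, how="inner", on="api_number")

# Step 4: sample (Pearson) correlation coefficient r, rounded to three decimals.
r = merged["total_base_water_volume"].corr(merged["total_gas"])
print("Correlation coefficient:", round(r, 3))
```

With the real data you would replace the two hand-built data frames with the ones read from the CSV files; the grouping, renaming, merging, and correlation calls would have the same shape.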