Welcome back to the videos on correlation. Today we're going to continue to talk about correlation, but we're going to get into the statistical inference side. In particular, I'm going to introduce how to conduct a hypothesis test for the significance of a correlation. So let's jump in.

Here we are in Google Colab. I've already loaded in the data. We're still working with the Marcellus well and water data that we used up here to make this conclusion, but we're going to do a more quantitative analysis and focus on hypothesis testing today.

The first step in any hypothesis test is to state the hypotheses. Our null hypothesis is rho, the Greek letter, equals zero. The null is always going to be this, because the significance test for correlation asks whether or not the correlation is significantly different from zero. So we have rho equals zero, and our alternative here is that rho is greater than zero. This could be less than, or it could be not equal to, but for this case I'm going to go with greater than zero.

Once those are stated, we need to conduct the randomization procedure. I've outlined the steps here so that you have them later when thinking about the randomization procedure. We start with the data. In the past we've shuffled data around, we've done reallocation randomization. In this case we're going to do a reallocation of sorts, but we're only going to do it with one variable. Say we start with this data here, X and Y. We choose one variable to shuffle, here I'm doing Y, and we sample without replacement, reallocating the data for that one variable. Then we recalculate the correlation coefficient and repeat that capital N times. The goal with this reallocation is to break the pairs. So we don't want to keep them paired together.
We want to break them and then recalculate the correlation, because if the true correlation is equal to zero, then it shouldn't matter how much we shuffle the data. The correlation should stay close to zero. So by shuffling, we can get at the significance of our sample correlation.

So we can go ahead and get started with the code. The first thing we need to do is define our sample correlation coefficient, and I'll call it samp_corr. This is just how we calculated the correlation before: total base water volume, .corr, with total gas. So this is our sample correlation.

Before we get into our randomization loop, we need to initialize the variables. I'm going to make a copy, because we don't want Python to accidentally overwrite our original data frame, and I'm going to call it sim. We need capital N, which is 1,000 iterations. We need lowercase n, which is our sample size, and I'm just going to set it to the number of rows in our simulated, or copied, data set. And then we need to create an empty array called sim_corr, which we'll use to store our simulated correlations. So I'll go ahead and run that. If we come down here and check out samp_corr, we can see that it's the same value that we got up here, it just hasn't been rounded.

Then we need to start the randomization loop. We start it like all of our randomization loops: for i in range capital N. And then we say sim, total gas. So again, we're selecting one variable. This could be water, but I'm going to do total gas. It doesn't matter which one you select, as long as it's the same throughout all of your code. So I'm going to take sim, total gas, and I'm going to shuffle it. I'm going to say np.random.choice, and I'm going to randomly draw from, again, total gas, making sure it's the same variable. Our size is going to be lowercase n, and then replace equals False. So there are two things that are a little bit different here from what we've done in the past. We're sampling without replacement.
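As a small illustration of what that call does, here is a minimal sketch with made-up numbers rather than the well data (the array name values is hypothetical). Drawing all n values without replacement just reorders them, while drawing with replacement can repeat or drop values.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for one column of the data
values = np.array([3.1, 4.7, 5.2, 6.8, 9.0])
n = len(values)

# Drawing n values without replacement is a shuffle:
# every original value appears exactly once, in a new order
shuffled = np.random.choice(values, size=n, replace=False)
is_permutation = np.array_equal(np.sort(shuffled), np.sort(values))

# Drawing with replacement may repeat some values and miss others
resampled = np.random.choice(values, size=n, replace=True)

print(shuffled)
print(is_permutation)
print(resampled)
```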
So you need to make sure that you set replace equal to False. We've done that before, but the main change is that we're just going to randomly sample from one variable, shuffle it, and then recalculate our correlation. So this is total base water volume, .corr, and total gas, making sure that you're using the simulated data frame, sim, here so that you are taking the correlation with our new reshuffled data set. We can run this, and if we come down here, we can print sim_corr and see that we've got a lot of different correlation values, occasionally even some negative values. So some of our reallocations led to a negative relationship between total gas and total base water volume. So that is the randomization.

Like we have done before, a lot of times it's easier to visualize as the next step. Before we can do that, we need to create a data frame for sim_corr, because right now it is a NumPy array and ggplot doesn't like to work with those. So we can create sim_corr_df from sim_corr, and the columns I'm just going to write as sim_corr, so we keep the same name. Then we can say ggplot, sim_corr_df, and we can do a histogram. Our x value is the only one we need, just sim_corr. And then I'm going to add a vertical line where xintercept equals samp_corr, to mark the value from our sample data set. The color is blue, and we will do linetype equals dashed. I'm going to enter that down so it's a little easier to read and wrap this in parentheses, so Python knows it's all one statement. And then we can go ahead and run that.

Here we can see our simulated correlation values, and they're centered around zero. And this line is our sample one. Depending on how we framed our alternative, the p-value is basically going to be either zero, if it's the proportion of simulated values greater than that line, or one, if it's the proportion less than that line. But we have done a greater-than test.
So we can say print the p-value, and it is just the length of sim_corr where sim_corr is greater than or equal to samp_corr, divided by capital N. We can run this and see we get zero, which we expected, since there are no simulated values above our sample line. That leads us to reject the null hypothesis. In particular, we can say that the p-value is less than the significance level of 0.05 and give exactly what the p-value is. So we reject the null hypothesis in favor of the alternative that the correlation between these variables is significantly greater than zero. And that is how you can do a randomization procedure to test the significance of correlation.
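To pull the steps together, here is a minimal sketch of the whole procedure. It runs on synthetic data, not the Marcellus data set, so the column names water and gas, the data-generating numbers, and the seed are all assumptions made for illustration. The variable names samp_corr, sim, N, n, and sim_corr follow the ones described in the video, but their exact spelling in the notebook is a guess.

```python
import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in data with a positive relationship
water = np.random.normal(10, 2, size=50)
gas = 0.8 * water + np.random.normal(0, 1, size=50)
df = pd.DataFrame({"water": water, "gas": gas})

# Sample correlation coefficient
samp_corr = df["water"].corr(df["gas"])

# Initialize: a copy to shuffle, iteration count, sample size, storage
sim = df.copy()
N = 1000
n = sim.shape[0]
sim_corr = np.zeros(N)

for i in range(N):
    # Shuffle one variable only, without replacement, to break the pairs
    sim["gas"] = np.random.choice(sim["gas"].to_numpy(), size=n, replace=False)
    # Recalculate the correlation on the reshuffled data
    sim_corr[i] = sim["water"].corr(sim["gas"])

# Greater-than alternative: share of simulated values at or above the sample value
p_value = np.sum(sim_corr >= samp_corr) / N

print("sample correlation:", samp_corr)
print("p-value:", p_value)
```

For a less-than alternative the comparison would flip to sim_corr <= samp_corr, and for a not-equal alternative one common choice is to compare absolute values, np.abs(sim_corr) >= np.abs(samp_corr).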