Hi, in this video we're going to do a two-sample hypothesis test comparing means, but this time we're going to look at the mean of differences, that is, a paired comparison. This is in contrast to the difference of means that we did in the last video. So let's get to it.

The reason we're able to do a paired comparison here is that at each individual time we have both a measurement of wind percent deficit and a measurement of natural gas percent deficit. So we can compare what's going on at each individual time as we go along; we don't have to compare the entirety of one curve against the entirety of the other. There may be valuable information in constraining the comparison to the individual time steps we have here.

So how do we do this? The first thing is to reorganize our data a little. Let's create a new data object called DF2 from our original GenPIV by pivoting that data frame: the index will be datetime, the columns will be created from the fuel column in GenPIV, and the values will come from percent deficit. Looking at the result, it's clear what's been done: what was the datetime column in GenPIV has become the index; the fuel column, which contained our two groups, natural gas and wind, has become two separate columns; and the values underneath are the percent deficits. The reason we've organized it this way is that the pair at each time step is now spread across two columns, which makes it very easy to calculate the difference for each pair.

So we'll make a new column, difference, that is simply natural gas minus wind. Why natural gas minus wind and not wind minus natural gas? Because of our hypothesis: it concerns the mean of the difference between gas and wind, in the same order we used in our first test of two means, the difference-of-means test, where the order was gas minus wind. We're preserving that order. Again, this is based on the supposition from the data visualization that gas is underperforming relative to wind, and the goal here is to test that supposition, that hypothesis. And there we have it; our difference column has been calculated.

Finally, to get our sample statistic, we take that difference column and calculate its mean; we'll call it mean diff. Lo and behold, it's minus 12.5 percent. Note that, to the full precision displayed, this is identical to the difference of means we found above, minus 12.5628 and so on. They're mathematically the same thing, so we have the exact same sample statistic, just calculated slightly differently. However, the randomization test is going to be done quite differently. As a reminder, we do need numpy here, so include that to be complete. We'll also copy DF2 to a frame called SIM so that we don't alter our original data frame; that's just good practice.
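Here is a minimal sketch of the reshaping and the sample statistic described above. The identifiers (gen_piv, df2, and the names 'datetime', 'fuel', 'percent_deficit', 'Natural gas', 'Wind') are assumed spellings based on the narration, so adjust them to match your notebook; the numpy import and the sim copy appear in the next sketch.

```python
# gen_piv is assumed to be the long-format pandas DataFrame built earlier:
# one row per (datetime, fuel) combination, with a percent-deficit value.
df2 = gen_piv.pivot(index='datetime', columns='fuel', values='percent_deficit')

# Paired difference at each time step: gas minus wind, preserving the order
# from the earlier difference-of-means test.
df2['difference'] = df2['Natural gas'] - df2['Wind']

# Sample statistic: the mean of the paired differences (about -12.56 here).
mean_diff = df2['difference'].mean()
```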
Capital N again will be a thousand, and little n again will be the number of rows in our data frame of interest. We also need an empty vector to store our statistics in after we do the randomization, and we'll fill it up as we march through our for loop.

So what goes into the for loop? (A code sketch of the full loop follows at the end of this discussion.) First, we need to do our randomizing. The way we'll do it here is to calculate a multiplier, which will be either one or minus one (note that these go inside square brackets, as a list). That is, we either flip the sign of the difference or we don't, because flipping the sign of the difference is equivalent to swapping the values across the two groups, across the two columns. This is quite simple in terms of code: we use random choice again to randomly select from the vector of one and minus one, n times, with replacement of course. So every time through the for loop we end up with a vector, a column the same length as the difference column, filled with a random assortment of ones and minus ones. Then our new simulated, randomized difference column is simply that multiplier times our difference: either we flip the sign or we don't. And finally, we store our sample statistic, the mean of that simulated difference column, in the x-bar diff vector, the empty vector we set up.

So there we go. Then we can, of course, visualize the randomization distribution. We're still recycling the variable name x-bar diff, but we're no longer looking at the difference of means, we're looking at the mean of differences, so we'll change that label. And our sample statistic is now mean diff, so we'll label that accordingly as well. All right, let's see what this looks like.

There we go. Immediately you might say, okay, the mean diff, which is equal to this minus 12.5 percent, is still outside the realm of our randomization distribution. Again, this randomization distribution is the suite of possible outcomes we could see if the null hypothesis were true, and so you can immediately tell from this that our p-value is still going to be zero. But there is one difference I want to tease out in comparison to what we did with the difference of means. Take a look at this randomization distribution: most of it is concentrated between about minus seven and a half and positive seven and a half, whereas we had a wider sweep of possible outcomes when we did the difference of means. In other words, this randomization distribution from the mean of differences is a little more concentrated around the null hypothesis value, in this case zero percent deficit. The reason is that we're constraining the randomization to take place within each row. Before, with the difference of means, these values could swap rows and end up anywhere; here they can only stay within their rows. Because we're maintaining that pairing, because we're saying there's valuable information in keeping these values at the same times, we get a narrower range of possibilities in this randomization distribution.
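As promised, here's a minimal sketch of that randomization loop, again with assumed identifiers (df2, sim, xbar_diff, N, n) following the narration. The key move is a random plus-or-minus-one multiplier per row, which either flips the sign of that row's paired difference or leaves it alone.

```python
import numpy as np

sim = df2.copy()          # work on a copy so the original frame stays intact
N = 1000                  # number of randomizations
n = len(sim)              # number of paired rows
xbar_diff = np.zeros(N)   # empty vector to store the simulated statistics

for i in range(N):
    # Choose 1 or -1 for each row, with replacement. Flipping the sign of a
    # row's difference is equivalent to swapping the gas and wind values
    # within that row.
    multiplier = np.random.choice([1, -1], size=n, replace=True)
    sim['difference'] = multiplier * df2['difference']
    # Store this randomization's mean of differences.
    xbar_diff[i] = sim['difference'].mean()
```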
And thus, a more definitive answer in our hypothesis test, even though the p-value is still zero. I encourage you to take a look at a subsequent video, Effective Sample Size, in which we'll compare these approaches with a more limited sample size, and you'll see some interesting differences there.

Just to finish the discussion, let's calculate our p-value and arrive at a conclusion. We'll borrow our p-value calculation code from up above: the randomization statistics are still stored in x-bar diff, but the observed statistic is now mean diff, and the comparison is still "less than." Again, we get a p-value of zero (see the sketch below). And so our concluding statement would be: the p-value is low, it's zero, it's less than 0.05, so we reject the null hypothesis. That's the same conclusion we had under the difference of means. Okay? Thank you.
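For completeness, here is a sketch of the p-value step just described, using the same assumed names; the video's comparison is a one-sided "less than," though your notebook may phrase it slightly differently.

```python
# One-sided p-value: the fraction of simulated mean differences that fall
# below the observed mean difference.
p_value = np.mean(xbar_diff < mean_diff)
print(p_value)  # 0.0 here, so we reject the null hypothesis
```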