Welcome back to the video series on sampling distributions. We're going to continue working with our sampling distribution, and today we'll look at how to compare it to the true values. If you recall from the last video, we had samples of dice rolls that we visualized and then used to create a sampling distribution of the means of each sample. One of the reasons we develop sampling distributions is to test things, so we can ask: does the sampling distribution center around the true mean? One of the benefits of working with dice rolls as an example is that we know what the true mean is. For the sum of two dice, it's seven: in theory, if you roll enough times, seven will always be the most common and central number. That's because of how the outcomes combine. There's only one way to roll a two (one and one), but there are two ways to roll a three (one and two, or two and one), and so on, and seven has the most combinations of all.
So we can jump right into the coding. I've pre-coded the theoretical outcomes, which is a slightly faster way to build a large data frame because the values are already written out. If we run this cell, we can print out twos, for example, and it's just a single two, but if we look at sevens, we have a list of six sevens. This is a quick way to build the pieces and then connect them. We create a new data frame called true_df using the pd.DataFrame command, again with curly brackets to specify the column, outcome. And outcome is just twos plus threes plus fours plus fives, and so on.
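The claim that seven is the most common total can be checked by listing all 36 equally likely rolls of two six-sided dice. The sketch below is not the code from the video: the names `rolls`, `counts`, and `outcomes` are illustrative, and it rebuilds the pre-coded lists programmatically instead of typing them out.

```python
from collections import Counter
from itertools import product
from statistics import mean

# Every ordered pair of faces on two six-sided dice: 36 equally likely rolls.
rolls = list(product(range(1, 7), repeat=2))

# Count how many ways each total (2 through 12) can occur.
counts = Counter(a + b for a, b in rolls)

# Rebuild the theoretical outcomes as one flat list:
# one 2, two 3s, ..., six 7s, ..., one 12.
outcomes = []
for total in range(2, 13):  # range stops before 13, so totals run 2..12
    outcomes += [total] * counts[total]

sevens = [7] * counts[7]
print(sevens)              # [7, 7, 7, 7, 7, 7]
print(counts.most_common(1))  # the most frequent total and its count
print(mean(outcomes))      # the theoretical mean of the two-dice total
```

Because the list holds each total as many times as it can occur, its mean is the theoretical population mean.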
This is why we create these variables above: we can quickly add them together to create a data frame without having to sit there and write out every single number in our true data frame. We can come down here and print true_df, and we can see that we've got outcome here, with all of our data in a single column.
Then we can plot those values with a histogram. We say ggplot(true_df), then geom_histogram, and give it our aes statement. Our x is just equal to outcome; we don't have any y, and we're not doing any fill. We say bins equals 12 and binwidth equals one. Then, outside of the histogram, I'm going to add another line, scale_x_continuous, with breaks set to range from two to 13. This tells the histogram where we want the breaks for our bins. The first argument to range is the first value, and you'll notice that the breaks only go up to 12. This is something you have to keep in mind in Python: whenever you use a range, the end value is not included. So range from two to 13 excludes 13, meaning it really runs from two to n minus one, and in this case n minus one is 12.
Now we've got the plot of our true values, which is exactly what we expect: a nice symmetric, roughly bell-like shape with seven up here as the maximum, falling off on either side. So this is our population, our true population. What we want to do is compare our data, our samples, to that theoretical population. We've got several steps here. The first is to create a new column called label, which specifies that this is the true outcome as opposed to a sample. To create a new column, we just add it here and tell it what goes into it, so it's just going to be true.
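The steps described so far can be sketched as follows. This is a reconstruction under assumptions, not the exact notebook code: it assumes the plotting library is plotnine (the ggplot-style library for Python), that the label value is the string "true", and the helper name `plot_true` is made up for illustration. Only binwidth is passed, since in plotnine a binwidth overrides bins.

```python
import pandas as pd

# Pre-coded theoretical outcomes for the sum of two dice.
twos, threes, fours = [2] * 1, [3] * 2, [4] * 3
fives, sixes, sevens = [5] * 4, [6] * 5, [7] * 6
eights, nines, tens = [8] * 5, [9] * 4, [10] * 3
elevens, twelves = [11] * 2, [12] * 1

# Adding lists concatenates them, giving a single "outcome" column.
true_df = pd.DataFrame({
    "outcome": twos + threes + fours + fives + sixes + sevens
               + eights + nines + tens + elevens + twelves
})

# range(2, 13) excludes 13, so the breaks are 2, 3, ..., 12.
breaks = list(range(2, 13))

# Mark every row as belonging to the theoretical population
# (the string "true" is an assumed value for the label).
true_df["label"] = "true"
print(true_df.head())


def plot_true(df):
    """Histogram of the theoretical outcomes (hypothetical helper)."""
    # Imported here so the data preparation above runs without plotnine.
    from plotnine import ggplot, aes, geom_histogram, scale_x_continuous
    return (
        ggplot(df)
        + geom_histogram(aes(x="outcome"), binwidth=1)
        + scale_x_continuous(breaks=breaks)
    )
```

Calling `plot_true(true_df)` in a notebook cell should display the histogram, provided plotnine is installed.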
If we print out the first five rows, we can see that we've got this label column and everything is set to true. The next thing we need to do is reset the index on our sample data frame from up above, so that the sample number is stored in a column instead of in the index. This is just data_df.reset_index. Now we can see that what used to be an index is a column. Then we need to rename it so that it matches the true_df column; we want both to be called label. We need to save the result, so we say data_df equals data_df.rename, and we change sample to label, making sure that we match exactly what we wrote up above, down to the letter case. Just to show how that changes things, we can print out the first five rows, and now we can see that the column is no longer called sample; it's called label.
Finally, we can concatenate the two data frames into one while maintaining our long format. We'll call it df_concat. We use the pd.concat function and give it our first data frame, comma, our second data frame, in square brackets. We print that and see that it starts with all of our true data and ends with all of our sample data.
The last thing we can do to compare the true population with our samples is create a histogram. Again using ggplot, we say df_concat is our data, and again geom_histogram. We give it the aes statement, so x equals outcome. In this case, we're going to add a y, which we use to make the histogram plot a computed statistic rather than a count. We say after_stat density, which tells it to use the density calculation as the y value instead of a count, and that helps when the groups have different numbers of rows. And then we want to fill based on the label.
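The reset, rename, and concatenate steps can be sketched as below. The sample data from the previous video is not shown here, so `data_df` is a made-up stand-in (four samples of ten rolls, with hypothetical names such as "sample_1"), and `true_df` is rebuilt compactly; only the three pandas calls mirror what is described.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the stand-in data is reproducible

# Stand-in for the earlier sample data: 4 samples of 10 two-dice rolls,
# with the sample name stored in the index (names and sizes are assumptions).
rows = [
    {"sample": f"sample_{i}",
     "outcome": random.randint(1, 6) + random.randint(1, 6)}
    for i in range(1, 5)
    for _ in range(10)
]
data_df = pd.DataFrame(rows).set_index("sample")

# Stand-in for the theoretical population: all 36 two-dice totals.
true_df = pd.DataFrame(
    {"outcome": [a + b for a in range(1, 7) for b in range(1, 7)]}
)
true_df["label"] = "true"

# Move the sample name out of the index and into an ordinary column.
data_df = data_df.reset_index()

# rename returns a new DataFrame, so the result is assigned back.
data_df = data_df.rename(columns={"sample": "label"})

# Stack the frames row-wise; columns are matched by name, so the result
# stays in long format with one "outcome" and one "label" column.
df_concat = pd.concat([true_df, data_df])
print(df_concat.head())
print(df_concat.tail())
```

The column names must match exactly, including case, or `pd.concat` will create separate columns padded with missing values instead of one shared label column.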
Again, we specify 12 bins with a binwidth of one. We also add an extra argument called position, which prevents the bars from stacking on top of each other and instead places them next to each other. Then we add our scale_x_continuous so that our breaks are as expected. We can run this, and now we can see the distribution of all of these values. Our true values are in purple here, with that nice bell-like shape, but we can see that sample four, for example, had a lot of sixes. Down here we can see that several of the samples, such as one, two, and four, have more sevens than the true distribution, and so forth. We can use this to see how these distributions compare.
Then, if we want the sampling distribution, we again need the distribution of the means, so we'll call this df_means2. We say df_concat.groupby, and in this case we group by label and take the mean. We also rename the column so that it's a bit more descriptive: outcome becomes sample means. Now we can see each of our sample means, with the true population mean down here at the bottom, and we can use that to compare. What we'll get into in the next series of videos is how we can use this data to develop confidence intervals and start to make some inferences based on our samples.
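A minimal sketch of the comparison plot and the grouped means follows. It rests on several assumptions: the plotting library is taken to be plotnine, the position value is taken to be "dodge" (the transcript only says the bars are placed side by side), the renamed column is written as `sample_means`, the sample data is a random stand-in, and `plot_compare` is a hypothetical helper name.

```python
import random

import pandas as pd

random.seed(1)  # fixed seed so the stand-in samples are reproducible

# Theoretical population: all 36 two-dice totals, labelled "true".
true_df = pd.DataFrame(
    {"outcome": [a + b for a in range(1, 7) for b in range(1, 7)]}
)
true_df["label"] = "true"

# Stand-in samples (names and sizes are assumptions).
data_df = pd.DataFrame([
    {"label": f"sample_{i}",
     "outcome": random.randint(1, 6) + random.randint(1, 6)}
    for i in range(1, 5)
    for _ in range(10)
])

df_concat = pd.concat([true_df, data_df], ignore_index=True)

# One mean per label; groups are sorted by label, so "true" comes last.
df_means2 = (
    df_concat.groupby("label")
    .mean()
    .rename(columns={"outcome": "sample_means"})
)
print(df_means2)


def plot_compare(df):
    """Side-by-side density histogram per label (hypothetical helper)."""
    # Imported here so the data steps above run without plotnine.
    from plotnine import (
        ggplot, aes, after_stat, geom_histogram, scale_x_continuous,
    )
    return (
        ggplot(df)
        + geom_histogram(
            aes(x="outcome", y=after_stat("density"), fill="label"),
            binwidth=1,
            position="dodge",  # assumed value; places bars next to each other
        )
        + scale_x_continuous(breaks=list(range(2, 13)))
    )
```

Using density on the y axis puts the 36-row population and the smaller samples on a comparable scale, which a raw count would not.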