 Hello and welcome to this new lesson on confidence intervals. Before we get into confidence intervals, we're going to talk about sampling distributions because we ultimately perform the confidence interval on the sampling distribution. So we need to talk about how we actually get those sampling distributions. And so in this video series, we're going to be using some simulated dice rolls. And so you can imagine a situation in which you roll to dice, you take the total, and then you record that total. And ultimately we did this four times, but 10 rolls in each time. So we've got four different samples of 10 data points. So with that, let's go ahead and jump in to the video. So we are here in Google Colab. We're going to be using two main libraries in this series of videos. The first is going to be pandas and the second is plot nine. And then as I mentioned, we're going to go over sampling distributions. So I went ahead and preloaded this sampling data. And as I mentioned, we've got four samples here. And we've got 10 dice rolls in each sample and we recorded the total. What we're actually doing here with these curly brackets is essentially creating a dictionary, and then converting it into a data frame right away. So we recall in lesson one, we talked about dictionaries and data frames and how we can convert between the two here we're just doing it all in one step. So we've got our data, our data underscore DF for data frame. However, we're going to convert this into a long form data frames, which will allow us to help with plotting down the line. We're going to be using that melt command that we talked about in lesson three. So I'm just going to rename override the data DF name and say data DF dot melt. And then we need to give it our value bears. So which values do we want to maintain. And in this case, we want to maintain sample. To each of these samples is a column so we want to keep them as they are. So those are going to be our value bears, and I forgot an equal sign right there. And then we also need to give it a bear name so what do we want these new values to be called. We will call them sample. And we give it a value name. What do we want what's in them to be called, and we'll call that outcome. And so then I'll just print the data frame. We'll just show the first five rows using this dot head command. And so we can see here we've got our sample, and we've got our outcomes. And so we can use this to create plots. And so, as we talked about in lesson three, it's often a lot easier to plot using this long form, especially when we're working with gg plot. So our first visualization is to just look at how these different samples relate. And so we're going to do a histogram with filled bars so we're going to have our basic histogram and then we're going to add a variable using the fill command. So I'll start with the gg plot command. So gg plot data underscore df. And then we give our geom. So geom underscore histogram. And then we give our a yes. So this is the requirements so we need an x value, which we will call, which will plot our outcome so our actual dice rolls. And we're going to add a histogram so we don't need a y value because it's just going to give us the count, but we will add a fill based off of sample within the as statement, because we want to attach it to a variable. And then I'm going to specify that I want 12 bins we want one bin per possible dice roll. Each of those bins to be one long. So if we run this, we can see our plot here, we've got each of our samples has been automatically put into a legend. And then we can see sort of the distribution and because we've done fill it stacked each of these samples on top of each other so we can see as the whole group of samples all 40 data points that we ran. The most common was seven across all the samples. And it does tend to follow, you know, vaguely normal distribution. So we can see that, for example, sample four was heavily skewed towards the lower values. It looks like sample two and three might have more of the higher values. And so we do see that there are differences there between our samples. So this is just a good way to get an idea about what your samples exist, but we still need to determine the sampling distribution. So we have two steps to determine the sampling distribution. The first is to find the mean of each sample, and then we're going to plot it. So the sampling distribution is the distribution of statistics. And in this case we're going to say x bar is our statistic. So we need to first find the mean of each sample. So we say data means. And we are going to use a group by function. So we want to group by sample. So we want to group. First we're going to group by sample. Then we're going to take the mean of each of those rows in the now grouped sample. And then we're going to tag on a rename function. So here we just will rename our outcome column to sample mean, sample means, which just is a little bit more descriptive because it will no longer be the outcome. It'll be the means of the outcome. So we can print that. And in theory you can do each of these steps individually. But one of the benefits of working with pandas is that you can just sort of tag on these additional commands using this period or dot right here so we can do dot group by dot mean dot rename and it'll integrate them all in order. So if we print this, we can now see our sample means here. So we've got sample one with a mean of six sample to 6.3, 7.5 and 5.7. So this is our sampling distribution. It's our sample means are our distribution of the sample means. To visualize that will often use what's called a dot plot. And so it will function very similarly to a histogram. But we'll start with just plotting data means. And now the geome is geome dot plot. It requires an AES. Here we want our sample means. There's no other data that we're doing we're not adding a fill no need to specify any y axis, but normally we will specify the dot size, just so that the dots don't overwhelm the plot. So here we can see that it's plotted a single dot for each of our sample means, but it's stacked them into bins the way that histogram one. And so you can use this as a way to see the distribution of data, the way you would histogram but you can also see how frequently certain points are where they are each located. So we wanted to, we can add additional variables to show the fill to show lines that can make this a bit more descriptive. But this is how we can create a sampling distribution from a small subset of samples.