Statistics and Excel. Height statistical inference data practice problem. Got data? Let's get stuck into it with statistics and Excel. Or OneNote in this case, but we'll still talk about Excel too. First, a word from our sponsor. Yeah, actually, we're sponsoring ourselves on this one because apparently the merchandisers, they don't want to be seen with us. But that's OK, whatever. Because our merchandise is better than their stupid stuff anyways. Like our "Crunching Numbers Is My Cardio" product line. Now, I'm not saying that subscribing to this channel, crunching numbers with us, will make you thin, fit, and healthy or anything. However, it does seem like it works for her. Just saying. So subscribe, hit the bell thing, and buy some merchandise so you can make the world a better place by sharing your accounting instruction exercise routine. If you would like a commercial-free experience, consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com. You're not required to, but if you have access to OneNote, we're in the icon on the left-hand side, OneNote presentation, 1013 height statistical inference data Excel problem tab. We're also uploading transcripts to OneNote, so you can use the Immersive Reader tool. You can change the language if you so choose and then either read or listen to the presentation in multiple different languages, using the timestamps to tie them into the presentations. OneNote desktop version here, our data on the left-hand side measuring the heights of individual human beings in units of inches. If you want practice data sets to practice with, we recommend going to Kaggle.com, K-A-G-G-L-E.com. If we reveal some of our data, we're imagining these are individuals in the population, us measuring their heights in inches. So the first one being 75.15 inches, the next 75.12 inches, and so on and so forth.
Now remember that the two major buckets that we think about with statistics are, one, where we know all the data of the population and we are basically trying to show some characteristics of that data using our statistical tools. And two, where we don't know the entire population and we take a sample of the population to see if we can apply our statistical tools to the sample to infer what is true about the entire population. Now when we're thinking about that second group, when we're taking a sample to infer what's happening in the entire population, it is useful to do some tests where we already know the entire population, and then we're gonna act like we don't know it and do our sampling to see whether or not our sampling techniques give us results that are representative of the entire population. And if they do, then we might be more confident in applying those same kinds of statistical tools when we don't know the entire population. So we're gonna imagine that this data actually represents the entire population of heights that we're looking at. It might be a population for a particular location or something where we have all the data of that place. Now in actuality, this data set when we ran the practice problem is quite extensive, but we didn't copy the entire data set over here because it was just too long. So this is actually a longer data set than is represented here. This is just a bit of the data set so we get an idea of what we're doing. But the idea is that this is the entire population. We can of course sort this data set if it were in Excel, so we can easily see the lowest number and the highest number, sorting lowest to highest. But that's probably not gonna tell us as much as we wanna know about the data set. So our typical kind of calculations would then be the average or the mean.
So to calculate the average or the mean, we basically could mathematically add up all of the numbers and then divide by the number of numbers, one, two, three, four and so on. In Excel, we can use the average function, just taking the average of that entire column of numbers, which is quite nice and useful. Remember that the median is the one in the middle, just like Rocky the boxer's coach said: you see three of them out there, you don't know which one is the representative number, try hitting the one in the middle, hit the one in the middle. That's what the median is. And so we could then do that with our formula in Excel and Excel will pick the one in the middle for us. We don't even have to do that ourselves, and that's great. And then the max function, that says can we get the largest number in the population? So if I took this entire population and used the max function, then it would simply give me the largest number, and the min would give me the smallest number, and there's a nice Excel formula, which is equals min, to get the smallest number. So these are just the standard statistical tools. It's nice to know the formulas for them. These are basic kind of formulas. Then if we took a histogram of this data, it would look like this. Now remember, this data actually represents a larger group of numbers than is shown on the left-hand side, a fairly extensive group of data to give us a pretty nicely populated histogram. And it's shaped as we would basically expect when we're measuring things in nature. If you're measuring things in nature, they often tend towards having most of them at this middle point and then they basically taper off. So you would expect to see a shape like this. You might have some outliers out here, but most of the population is kind of in the middle. That's gonna be true with a lot of things that we measure in nature, such as heights of human beings. And that's gonna be our data set.
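For anyone who wants to check these summary numbers outside Excel, here is a minimal sketch in Python. The list of heights is a made-up illustration (only a few of the values are mentioned in the presentation), and the Excel functions named in the comments are the rough equivalents, not a claim about how the workbook was built.

```python
import statistics

# Illustrative heights in inches; a tiny made-up list, NOT the full data set.
heights = [75.15, 75.12, 66.75, 68.40, 67.20, 69.05, 64.90]

mean_height = sum(heights) / len(heights)    # add them up, divide by the count (=AVERAGE)
median_height = statistics.median(heights)   # the one in the middle (=MEDIAN)
max_height = max(heights)                    # largest value (=MAX)
min_height = min(heights)                    # smallest value (=MIN)

print(f"mean={mean_height:.2f} median={median_height:.2f} "
      f"max={max_height:.2f} min={min_height:.2f}")
```

With an odd number of values the median is simply the middle one after sorting; with an even number, `statistics.median` averages the two middle values.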
So now that we have this dataset, we wanna practice thinking about, okay, what if I didn't know the entire population and I wanted to take a sample and then see if the sample that I took is gonna be representative. If I'm able to take the inference from the sample and apply it to the entire population. So there's two goals we wanna think about here. One is statistically, how does that work kind of theoretically? And two, if we were to put this into our tool of like Excel, how can we use the Excel tools to kind of help us to practice these practice problems? What kind of tools can we use in Excel? And we'll just think about them theoretically now. And then you can do the practice problems in Excel and actually punch in these formulas in there. So one thing we can do is we can say, there's a formula in Excel that's a random formula. So I could randomly generate numbers with this formula in Excel. So this is just equals R-A-N-D and then brackets with nothing in the middle. It will give us a decimal, but that decimal is quite long. So if I reveal as many decimal places, it's a very long number. And then now I have this randomly generated number. So note that I could then say, I'm just gonna put this random number, copy this formula down in a table next to my data, the same data that was over here. It's in a different order now, but the same data set, I now have a random generator tool to the left of it. And then if I sort the random generator, then it'll then sort, it'll shuffle up our height data. So we can then, that's a simulation of taking a random group of numbers. Now this random cell over here will actually also recalculate every time we basically click on it. So every time I do something, it will reshuffle, allowing us to basically use this tool to randomly shuffle the order of our height data. 
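The same shuffle-by-random-number idea can be sketched in Python. This is only an illustration under stated assumptions: the population below is synthetic (drawn from a normal distribution with a mean of 68 inches and a standard deviation of 1.9 inches, both assumed), and the variable names are hypothetical.

```python
import random

random.seed(0)  # fixed seed only so the sketch is repeatable

# Synthetic stand-in population (assumption: roughly normal heights in inches).
population = [round(random.gauss(68.0, 1.9), 2) for _ in range(1000)]

# Put a random number in [0, 1) next to every height, like a column of =RAND().
keyed = [(random.random(), h) for h in population]

# Sorting by the random column shuffles the heights.
keyed.sort(key=lambda pair: pair[0])
shuffled = [h for _, h in keyed]

# The first 10 rows of the shuffled list are a simple random sample.
sample = shuffled[:10]
print(sample)
```

Unlike the Excel column, the random keys here are generated once and do not recalculate, so the order stays put after sorting.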
Because remember that if I was to simply take the height data of an entire population, I would have my data set in order of either when I took each measurement, or possibly by name of the individual. And when I'm trying to put the data in order, what I usually want is highest to lowest or lowest to highest, or something like that. But if I'm trying to take a random sample of the data, which is gonna be a key component every time we're trying to create an inference from a sample, then I need to randomly shuffle all of the data. So this is one way that we can take all of the data that we have and randomly shuffle it. And obviously in real life, we have other kinds of problems in terms of, well, how would you take a random sample in order to measure the heights of a population when many people won't come in to the doctor's office to measure their height, or they won't tell you their actual height, or blah, blah, and all that. We'll talk about that kind of stuff later. Right now, we just wanna think about the concept from a statistical and kind of theoretical standpoint, and we'll get into more of the mathematical details later on: what if we have the full population, and now we wanna take a random sample of that population and see if we can get numbers that are representative using this technique, right? And then we'll deal with all the issues of applying this technique in the real world. All right, so then because this one over here is always gonna keep on shuffling randomly, I can use this as my random generator tool, because these are gonna keep on shuffling every time I touch them. So I can then copy these two columns and paste them over here, but paste them one, two, three, paste just the values. Because then I won't have the formulas. So the key is that these have the formula, equals rand, in them. This does not have the formula.
It's just a hard-coded number, as we call it, meaning it's just a number without a formula. So then we can sort it. Once we've sorted it, we can take the sample of just the top so many, because they've been randomly shuffled using the random number generator. So if we took a random sample, for example, we could just take the first, I think this is 10, right? We just took the first 10, which is a random sample because we shuffled them, and we can then try to analyze that sample. We could take the average of the sample using simply our average formula. So this is the average of the sample. And you'll recall that the average for the population was 67.99. So this comes out to about 68.01. So the average of our sample of just 10 of them, which is a fairly small sample, in this case came out pretty close. I can also compare each of these sample values to the average that we saw, so this is the actual average of the population, this is the sample. Now, just remember that if you were taking a sample of heights in a population, we know something about heights in a population, as we do with most natural kinds of things that are similar to height. And that is that we kind of expect it to have this type of shape, with most of the heights being around the center point and then fewer people having taller heights and shorter heights. We kind of expect that to be the case. And if that is the case in the type of data that we are looking at, then if I was to choose, for example, just one individual, it's likely that even that one individual is gonna be somewhere in here. It's not very likely that that one individual is gonna be way out here, although it could happen. We might just randomly have picked the center of the Lakers or something, right?
Right, and all of a sudden we've got this huge skew of the heights. But even if we just picked one person, it might be that we picked one person that was exactly in the average, which is, you know, we wouldn't have known that, of course, because we don't know the, because the whole point here is that we wouldn't know the whole population. But it's more likely that we're gonna be picking someone even with one person kind of in the range just because of the nature of the data set if we randomly pick someone. Now notice that even if we did pick someone way out here on the tail end, a very tall person or a very short person, if we just take two people, then it's likely that we're going to, that we're gonna go towards the middle because it's not likely that we're gonna pick of two people, two people that are way off on either tail, right? It's not likely we're gonna pick two basketball players and a random sample that happened to be centers, you know, of, you know, that happened twice. So just taking the average of two, you would think normally we'll tend towards the middle. That's the whole idea. So then, and then if we take more than two, it's likely that taking the average of more than two is going to tend us towards the middle. So the larger that our sample is that's random, even if we grabbed some of the outliers, when we have a population that's usually populated like this in particular, then it's likely that we're gonna, that we're going to be going towards the middle. That's gonna be the idea. Now, then of course we have questions that are gonna come up then of, well, you know, how many people do we need to pick in order? And then how many people do I need to be, to pick in order to be fairly confident that I'm at least within a certain range? And those are getting into more technical questions that we'll talk about later, but right now just conceptually that's of course the idea. 
If we randomly pick more people, then we're going to tend towards the actual middle number of the population, typically. That's the concept that we're trying to apply. So if I see this here, for example, this was the one we grabbed in the sample. This is the average of the population. So the sample value was higher than the average by 0.39. This is the second person we picked for the sample, 66.75, and this is the actual population average. So notice these two kind of cancel each other out. This one was much shorter than the population average, that one was taller, but the two cancel each other out. And so on down the list: this one is the sample, this is the population, sample versus the population. So you can see that you would think that they would kind of cancel each other out. Now that doesn't have to be the case if you just pick 10 people; you might just randomly have picked 10 people that are all over the average, it's possible. But it's likely that if they're actually randomly selected, they're gonna tend towards the middle, and if we have more people in the sample, you would expect that, just statistically, it would be more likely that we're going to tend towards the middle. That doesn't mean, by the way, that every time, like if I did a sample of 20 versus a sample of 10, the average of the sample of 20 is gonna be closer than the average of the sample of 10. It could be that the sample of 10 was perfectly picked. I picked the sample of 10 and it happened to be completely representative of the middle point and the spread of the entire population, right? And the sample of 20, just randomly selected, was off further. But again, the idea is that the more people, the more likely that you're gonna have data that's representative of the entire population. So that's gonna be the idea that we're gonna be building on.
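To see that "more people, closer to the middle" idea play out, here is a rough simulation in Python. It is a sketch under assumptions, not the presentation's workbook: the population is synthetic (normal, mean 68 inches, standard deviation 1.9 inches, both assumed), and the function name `mean_abs_error` is made up for this example.

```python
import random

random.seed(1)  # fixed seed only so the sketch is repeatable

# Synthetic stand-in population of heights in inches (an assumption).
population = [random.gauss(68.0, 1.9) for _ in range(5000)]
pop_mean = sum(population) / len(population)

def mean_abs_error(sample_size, trials=2000):
    """Average gap between a random sample's mean and the population mean."""
    total = 0.0
    for _ in range(trials):
        sample = random.sample(population, sample_size)  # draw without replacement
        total += abs(sum(sample) / len(sample) - pop_mean)
    return total / trials

errors = {n: mean_abs_error(n) for n in (1, 2, 10, 100)}
for n, err in errors.items():
    print(f"sample size {n:>3}: typically off by about {err:.2f} inches")
```

Any single sample of 10 can still beat a single sample of 100 by luck; the simulation only shows that, averaged over many draws, the larger samples miss by less.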
Now notice that in Excel, we could apply this concept and like what if I wanted to have larger samples and I wanted to do multiple samples, right? Multiple samples of 100. Well, in Excel, you can mirror this. You can mimic this by doing this multiple times, right? I might say this, I'll do the same thing. I'll put my random sample generator just like we did over here. We'll do our random sample generator. I'll just copy that. I just copied that over here. And then we can shuffle it again. So now this one has the formula in it of the random generator and then I can just keep sorting by that random generator. It gives me a constant random generation of the numbers of the entire data set, not just the sample. And then I could just do that multiple times, right? So now I've got two, three, four, five, six, you know, seven, eight. Like I think I did like 10 of these. I've got a bunch of these random generators stacked up on top of each other. And so then I could then take that information and say this is a count that I wanna take a sample from one to, what did I go up to, a hundred. Now notice that these random generators, I stopped, these are actually going all the way down for the entire data set, but I only copied a little bit of it just for one note. So the data set would be much longer in the practice problem, which you'll see if you wanna work this in Excel. But then we're gonna take this data set, I'll just copy then after I have shuffled the data, I'm just gonna copy all of the height data for each of these. I'll just copy all of the height data and then paste it over here. And I can do that in Excel, I can copy all of these non-adjacent cells at one time. So I can copy all of just the data after it has been shuffled. I can paste it over here and then I've got 11 in this case, randomly generated data sets from our data and the actual data would be a lot longer than 100. 
I can just simply delete, trim off everything after 100, and then I will have picked basically 11 data sets, randomly generated, each having 100 data points within it. And so that could be, again, a nice tool to use within Excel when we're trying to understand what is going on, so we can work with larger numbers than people are typically able to when they learn this stuff in a classroom, which could really be helpful when you're trying to figure something out. When I took this stuff, Excel wasn't as applicable at the time, or I wasn't as efficient at it; it wasn't as ubiquitous, it wasn't as popular at the time. So we kind of had to envision this stuff, whereas now you can actually just populate the data sets and get a much better understanding if you run through this, and clearly it has practical uses as well. But so now we've got these generated numbers and now I've taken sample sizes of 100 instead of 10. So if I was to take the average, for example, of the sample of 100, I came out this time to 68.01; it's still not gonna be exact, right? 67.84, 67.79, 68.15, these are the averages of the samples that we pulled in, and we might compare them to the actual average of the population, which we calculated first. Now notice in Excel, this information is now laid out horizontally. I might wanna lay it out vertically, because that might be more useful for comparing. So I could copy those cells and then paste them up top, but transpose the X and the Y. So that's something that we'll practice doing in Excel. So if you wanna work this in Excel, you can do that too. So now we've got it in a vertical column and I can compare it if I want to the average. You'll recall this is the actual average. So now what we have here, I'm not comparing every number in each sample to the average like we did last time.
That's what we did over here, where I took each number from the sample data when we only had a sample of 10, and I compared that to the actual average or mean. Over here, we're taking samples of 100 and then I took the average of each sample of 100. So now we've got one, two, three, four, five, six, seven, eight, nine, 10, 11 samples of 100 which we took the average of, right? So now we've got a difference of only 0.01 between the average of the first sample of 100 and the actual average of the population. You would expect these to be much closer because we're comparing the average of a sample of 100 in this case instead of each individual data point. But it's still acting in a similar fashion, in that we would expect these 11 sample averages, if we compare and contrast them, to kind of cancel each other out as well, right? So we can apply the same kind of concept, and each of them is a lot closer than our individual data points over here. And then the total is only off by a small amount. Like if I took the average of all the averages of the 11 samples, right? This is 67.9876 versus the actual; if I take the decimals out a little further, it's not exact, but it's getting pretty close to the population, and that's kind of what you would expect. That's our general concept when we're trying to infer from something smaller than the population to the entire population: if we can take a random sample, and if we get a larger random sample, generally you would think that's gonna lead to more accuracy. You might hit a point of diminishing returns in that. We'll talk about that more technically later, but that's the general idea, right? We're gonna take random samples, and the more random samples we have, the more likely it is that they're gonna be representative of the population. Now we could then make histograms of this.
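Here is a small Python sketch of that step: 11 random samples of 100, the average of each, and the average of those averages. As before, the population is a synthetic stand-in (normal, mean 68 inches, standard deviation 1.9 inches, both assumed), so the printed numbers will not match the ones read out in the presentation.

```python
import random

random.seed(2)  # fixed seed only so the sketch is repeatable

# Synthetic stand-in population of heights in inches (an assumption).
population = [random.gauss(68.0, 1.9) for _ in range(5000)]
pop_mean = sum(population) / len(population)

# 11 independent random samples, each with 100 data points.
samples = [random.sample(population, 100) for _ in range(11)]
sample_means = [sum(s) / len(s) for s in samples]

# Average of the 11 sample averages.
grand_mean = sum(sample_means) / len(sample_means)

for i, m in enumerate(sample_means, start=1):
    print(f"sample {i:>2}: mean={m:.2f}  difference={m - pop_mean:+.2f}")
print(f"mean of sample means={grand_mean:.4f}  population mean={pop_mean:.4f}")
```

Some differences come out positive and some negative, which is the canceling-out effect: the average of the averages usually lands closer to the population mean than most of the individual sample averages do.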
In this case I took a histogram just of sample 11. So this has 100 data points that we took from the population. We're measuring heights of individuals. We randomly selected 100 individuals and then we made a histogram, meaning how many of those individuals fall within 63.82 to 65.32 inches, how many of those individuals are from 65.32 to 66.82, and so on and so forth. Remember our midpoint here, the average of the entire population, is 67.99, right? So somewhere in here. Now this is a histogram from sample 10. So same thing, we randomly took 100 people, but we randomly shuffled them again, so we would expect that we're gonna get a similar kind of shape, but it's not gonna be exactly the same even though we used the same technique, because we randomly selected a new group. And notice that when you compare these histograms, you've gotta be careful about the number of buckets and the numbering of the buckets here, right? This one starts at 62.53 because that was provided by Excel. Excel does this quite easily, but if you wanted to adjust the x-axis for better comparison, then you can adjust those. And then this one is for sample nine. So here we have the same thing, but we just took sample nine, 100 participants in this one. So you get an idea that the histograms, because we took a pretty decent sample size of 100, are starting to be somewhat representative. You know, they're looking more like the shape of the entire population, and this is the entire population that we're looking at here. Now notice that the easiest thing that we can compare when we look at this is usually the average. We're like, okay, so if I look at this, am I getting that midpoint? But when we're doing this in practice, we also wanna know what's the spread of the data, can the sample help us to tell that, and how confident are we of the results?
And those get into the kind of more technical questions which we'll dive into in more depth in future presentations.
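On the point about being careful with the buckets when comparing histograms, one option is to count every sample into the same fixed bucket edges. The Python sketch below does that with a text histogram. It is only an illustration: the population is synthetic, the helper `bin_counts` is a made-up name, and the starting edge of 59.0 is an arbitrary choice; only the 1.5 inch bucket width echoes the buckets read out in the presentation.

```python
import random

random.seed(3)  # fixed seed only so the sketch is repeatable

# Synthetic stand-in population of heights in inches (an assumption).
population = [random.gauss(68.0, 1.9) for _ in range(5000)]

def bin_counts(values, start, width, n_bins):
    """Count values per bucket [start + i*width, start + (i+1)*width).

    Values outside the overall range are skipped.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts

# Shared bucket edges for every sample: 59.0 up to 77.0 in 1.5 inch steps.
start, width, n_bins = 59.0, 1.5, 12

sample_a = random.sample(population, 100)
sample_b = random.sample(population, 100)
counts_a = bin_counts(sample_a, start, width, n_bins)
counts_b = bin_counts(sample_b, start, width, n_bins)

# Side-by-side text histograms on the same x-axis.
for i in range(n_bins):
    lo = start + i * width
    print(f"{lo:5.1f}-{lo + width:5.1f} | {'#' * counts_a[i]:<40} | {'#' * counts_b[i]}")
```

Because both samples use identical buckets, any difference between the two columns comes from the random draw and not from Excel picking different starting points for each chart.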