Statistics and Excel. Probability, distribution, models, and families. Got data? Let's get stuck into it with statistics and Excel.
Introduction. In prior sections, we've been thinking about how we can describe different data sets using both mathematical calculations, such as the mean or average, the median, and the quartiles, and pictorial representations, like the box and whiskers and the histogram. The histogram is the pictorial representation most used when we're thinking about the spread of the data, how the data is dispersed. We can then use different kinds of language to describe what the histogram looks like, such as saying it's skewed to the left or skewed to the right.
Now we want to spend more time using mathematical models to describe different data sets. In other words, when we're looking at a data set, if we can approximate it with some type of mathematical model, which will give us a line or a curve approximating the data, that will often give us more predictive power over whatever the data set represents.
So, three pillars to describe a distribution: shape, center, and spread. First, the shape. You're envisioning here a histogram of a data set, which will give us an idea of what the shape looks like: is the data centered in the middle, for example, or dispersed to the sides? Second, the center. Where is the center point? It's often represented by the mean, or by some other centering tool like the median. Third, the spread. How is the data spread around that center point, for example around the mean? Those are the characteristics we typically have in mind when we're thinking about a data set, again usually envisioning, say, a histogram.
The shape of the data represents the distribution of the data. Any curve can model a data set, but some shapes are more useful than others.
In other words, if we had a set of data, we could plot those data points as a curve or a histogram. When most people imagine a curve or histogram, the first one that comes to mind is the bell-shaped curve. But it's important to remember that the bell-shaped curve is only one family of curves, one possible shape of distribution. If we take any given data set, it's possible that the data set could trace out any kind of curve. If you looked out of your window at the horizon and saw a mountain, for example, you can imagine some data set that would be represented by the outline of that mountain. It's just a jagged curve. It doesn't need to result in a bell shape.
If that is the case, if we don't see any pattern in the data we're looking at, it's going to be more difficult for us to approximate that data set with some kind of smooth curve or line, which is what we would like to do. Now, of course, when you look at things in nature, or at just about anything, there are oftentimes going to be patterns. If there is a pattern, then it might be the case that the data set can be represented by a smooth line. And if that smooth line can be expressed with some type of formula, that can give us predictive power into the future.
So the way you might want to start thinking about this is to look at the actual data, the thing that you're trying to test, and plot those data points. Once the data points are plotted, you ask: is this something that could be represented by a smooth curve? Because the smooth curve could possibly be represented by some kind of equation or formula. Oftentimes many things can be, and if they can, then we can use that smooth, idealized representation of the jagged line in order to make future calculations.
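As a rough illustration of that plot-then-approximate idea, here is a minimal Python sketch. It is not from the lecture, which works in Excel: the data values, the bin edges, and the `histogram` helper are all made up for the example, and the normal (bell-shaped) model is just one family we could try. It counts the data into histogram bins and compares those jagged counts with the counts a smooth fitted curve would predict.

```python
from statistics import NormalDist

# Hypothetical measurements, invented purely for illustration.
data = [61, 64, 65, 66, 67, 67, 68, 68, 68, 69, 69, 70, 70, 71, 72, 73, 75]

# Fit a normal model to the data. This family of curves is fully
# described by two numbers: a mean (center) and a standard deviation (spread).
model = NormalDist.from_samples(data)


def histogram(values, start, width, n_bins):
    """Count the values in each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:
            counts[i] += 1
    return counts


# Assumed bin layout: four bins of width 4, starting at 60.
start, width, n_bins = 60, 4, 4
observed = histogram(data, start, width, n_bins)

# Counts the smooth curve predicts for the same bins:
# (probability of landing in the bin) * (number of data points).
expected = [
    len(data) * (model.cdf(start + (i + 1) * width) - model.cdf(start + i * width))
    for i in range(n_bins)
]

for i in range(n_bins):
    low = start + i * width
    print(f"{low}-{low + width}: observed {observed[i]}, model {expected[i]:.1f}")
```

If the model counts track the observed counts reasonably well, the formula behind the curve is a plausible stand-in for the jagged data. If they don't, a different family of curve may be a better fit.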
One way to think about fitting a smooth curve to data is Plato's allegory of the cave: you're in the cave, and everything you look at is basically a shadow of the real thing. So the horse that you're looking at is kind of like a shadow of the ideal horse, the essence of "horseness," what a horse really is. In a similar way, when you're looking at a data set that seems to be following a pattern, you're looking at a small sample of the entire pattern. If you were able to extrapolate out to the entire pattern, then you would have that smooth curve representing the pattern that you only have a small snippet of.
So, salaries at a corporation, for example: a skewed distribution. When we're looking at shapes, we're trying to think about the shape of the actual data. If we were to take a look at the actual salary data of a corporation, we can describe the shape as we saw in prior presentations, and it might not be a smooth curve. We're looking at actual data on the histogram, and we could say that the data is skewed to the right or to the left. Here, most employees earn an average or below-average wage, with a few outliers at the top. That's the other thing we want to keep in mind from prior presentations. You might have the CEO, for example, who makes a lot of money, which means there's going to be an outlier at the right end. So you would expect the curve to be skewed to the right.
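To make the salary example concrete, here is a small Python sketch. The lecture itself uses Excel, and the salary figures below are invented for illustration, not taken from any real company. It shows how a single very high salary pulls the mean well above the median, which is a common rough sign of a right-skewed distribution.

```python
from statistics import mean, median

# Hypothetical annual salaries in dollars; the last value stands in for the CEO.
salaries = [42_000, 45_000, 48_000, 50_000, 52_000,
            55_000, 58_000, 60_000, 65_000, 900_000]

avg = mean(salaries)    # pulled upward by the one large outlier
mid = median(salaries)  # middle value, barely affected by the outlier

# How many employees earn less than the average wage?
below_avg = sum(1 for s in salaries if s < avg)

print(f"mean:   {avg:,.0f}")
print(f"median: {mid:,.0f}")
print(f"{below_avg} of {len(salaries)} salaries are below the mean")

# A mean well above the median suggests a long tail on the right.
if avg > mid:
    print("The data looks skewed to the right.")
```

In this made-up data, nearly everyone earns less than the average, which matches the description of most employees sitting at or below the average wage with a few outliers at the top.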