The next step in statistics and exploring data is exploratory statistics, or the numerical exploration of data. I like to think of these as going in order: first you do visualization, then you do the numerical part. A couple of things to remember here. Number one, you're still exploring the data. You're not modeling yet, but you are doing a quantitative exploration. This is also an opportunity to get empirical estimates of population parameters, as opposed to theoretically based ones. It's a good time to manipulate the data and explore the effects of those manipulations: looking at subgroups, transforming variables. And it's an opportunity to check the sensitivity of your results: do you get the same general results if you test under different circumstances? So we're going to talk about things like robust statistics, resampling data, and transforming data.

We'll start with robust statistics. This, by the way, is Hercules, a robust mythical character. The idea with robust statistics is that they are stable: even when the data varies in unpredictable ways, you still get the same general impression. This is a class of statistics, an entire category, that's less affected by outliers, skewness, kurtosis, and other abnormalities in the data.

So let's take a quick look. This is a very skewed distribution I created. The median, which is the dark line in the box, is right around one. And I'm going to look at two different kinds of robust statistics: the trimmed mean and the Winsorized mean (both are sketched in code below). With the trimmed mean, you take a certain percentage of the data off the top and the bottom, throw it away, and compute the mean of the rest. With the Winsorized mean, instead of throwing those scores away, you move them in to the highest non-outlier score. Now, at 0% both are exactly the same as the regular mean, and here that's 1.24. But as we trim off 5%, or move in 5%, you can see that the mean shifts a little; at 10% it comes in a little more; and at 25% we're now throwing away 50% of the data, 25% on the top and 25% on the bottom, and we get a trimmed mean of 1.03 and a Winsorized mean of 1.07. When we trim 50%, that actually means we're leaving just the median, only the middle score, and then we get 1.01. What's interesting is how close we get to that value even while we still have 50% of the data. So that's an example of how you can use robust statistics to explore data even when you have things like strong skewness.

Next is the principle of resampling. That's like pulling marbles repeatedly out of a jar, counting the colors, putting them back in, and trying again; it gives you an empirical estimate of sampling variability. Sometimes you get 20% red marbles, sometimes you get 30%, sometimes you get 22%, and so on. There are several versions of this, and they go by names like the jackknife, the bootstrap, and the permutation test. The basic principle of resampling is also key to the process of cross-validation; I'll have more to say about validation later.
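To make the trimmed and Winsorized means concrete, here is a minimal sketch in Python using SciPy's trim_mean and winsorize. The lognormal sample is just a stand-in I've chosen for the skewed distribution in the example, so the numbers will come out similar to, but not exactly, the ones quoted above.

```python
import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(42)
# Stand-in skewed data (an assumption, not the original example's data):
# lognormal with median 1 and mean around 1.24
x = rng.lognormal(mean=0.0, sigma=0.65, size=10_000)

print(f"regular mean = {x.mean():.2f},  median = {np.median(x):.2f}")
for p in (0.05, 0.10, 0.25):
    trimmed = trim_mean(x, proportiontocut=p)     # drop p from each tail
    winsor = winsorize(x, limits=(p, p)).mean()   # pull each tail in instead
    print(f"{p:.0%} -> trimmed = {trimmed:.2f}, winsorized = {winsor:.2f}")
```

Notice that the Winsorized figure tends to stay slightly above the trimmed one for right-skewed data like this, because the pulled-in scores still sit at the edge of the distribution rather than vanishing.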
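And here is the marble idea as a bootstrap, one of the resampling schemes just mentioned. The jar of 100 marbles with 25 red ones is an invented example; the point is only that resampling with replacement gives an empirical picture of sampling variability.

```python
import numpy as np

rng = np.random.default_rng(7)
jar = np.array([1] * 25 + [0] * 75)   # 1 = red marble: 25 red out of 100 (assumed jar)

# Bootstrap: redraw samples of the same size WITH replacement, many times,
# recomputing the proportion of red marbles each time
props = np.array([
    rng.choice(jar, size=jar.size, replace=True).mean()
    for _ in range(10_000)
])

print(f"proportion in the jar : {jar.mean():.2f}")
print(f"bootstrap average     : {props.mean():.2f}")
print(f"middle 95% of draws   : {np.percentile(props, [2.5, 97.5])}")
```

A jackknife would instead leave out one marble at a time, and a permutation test would shuffle labels between groups; the resample-and-recompute loop is the shared core.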
Then finally, there's transforming variables. Here are our caterpillars in the process of transforming into butterflies. The idea is that you take a difficult data set and apply what's called a smooth function: one with no jumps in it, one that preserves the order of the scores, and one that lets you work on the full data set. So you can fix skewed data, and if a scatterplot shows a curved line, you can fix that too.

Probably the best way to look at this is with something called Tukey's ladder of powers. I mentioned John Tukey before, the father of exploratory data analysis; he talked a lot about transformations. This is his ladder, starting at the bottom with minus one over x squared and going up to the top with x cubed. Here's how it works. This distribution over here is a symmetrical, normally distributed variable. As you start to move in one direction and apply a transformation, say taking the square root, you see how it moves the distribution over to one end; then the logarithm moves it further; and by the time you get to the end, to minus one over the square of the score, it pushes it way, way over. If you go the other direction, for instance squaring the scores, it pushes the distribution down the other way, and cubing goes further still. So you can see how the ladder moves a distribution around in ways that let you actually undo skewness and get back to a more centrally distributed variable (there's a sketch of this at the end of the section). These are some of the approaches you can use in the numerical exploration of data.

In sum, let's say this: statistical or numerical exploration allows you to get multiple perspectives on your data. It also allows you to check the stability of your results, to see how they hold up with outliers, skewness, mixed distributions, and so on. And perhaps most importantly, it sets the stage for the statistical modeling of your data.
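As a closing sketch, here is one way to walk the ladder in Python and watch the skewness change at each rung. The lognormal sample and the use of scipy.stats.skew are my additions for illustration, not part of the original example.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Assumed stand-in data: right-skewed and all positive
x = rng.lognormal(mean=0.0, sigma=0.65, size=10_000)

# Tukey's ladder of powers, bottom rung to top; each rung is a smooth,
# order-preserving function for positive scores
ladder = [
    ("-1/x^2", lambda v: -1 / v**2),
    ("-1/x",   lambda v: -1 / v),
    ("log x",  np.log),
    ("sqrt x", np.sqrt),
    ("x",      lambda v: v),
    ("x^2",    lambda v: v**2),
    ("x^3",    lambda v: v**3),
]

for name, f in ladder:
    print(f"{name:>6}: skewness = {skew(f(x)):+6.2f}")
```

For this right-skewed sample, the skewness passes near zero around the log rung: moving down the ladder pulls a long right tail in, and moving up pushes it out, which is exactly the "undoing" the ladder is for.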