As we move on in our discussion of statistics and exploring data, the single most important thing we can do is exploratory graphics. In the words of the late great Yankees catcher, Yogi Berra, you can see a lot by just looking. That applies to data as much as it applies to baseball. Now, there are a few reasons you want to start with graphics. Number one is to actually get a feel for the data. I mean, what's it distributed like? What's the shape? Are there strange things going on? Also, it allows you to check the assumptions and see how well your data match the requirements of the analytical procedures you hope to use. You can check for anomalies like outliers and unusual distributions and errors. And you can also get suggestions: if something unusual is happening in the data, that might be a clue that you need to pursue a different angle or do a deeper analysis.
Now, we want to do graphics first for a couple of reasons. Number one is they're very information dense, and fundamentally humans are visual. It's our single highest-bandwidth way of getting information. It's also the best way to check for shape and gaps and outliers.
There are a few ways you can do this if you want to. The first is with programs that rely on code. So you can use the statistical programming language R or the general-purpose programming language Python. You can actually do a huge amount in JavaScript, especially in D3.js. Or you can use apps that are specifically designed for exploratory analysis. That includes Tableau, both the desktop and the public versions, and Qlik, and even Excel is a good way to do this. And then finally, if you really want to, you can do this by hand. John Tukey, who's the father of exploratory data analysis, wrote his seminal book, a wonderful book, where it's all hand graphs. And actually it's a wonderful way to do it.
Now let's start the process for doing these graphics. We start with one variable, that is, univariate distributions.
And so you're going to get something like this. The fundamental chart is the bar chart. This is when you're dealing with categories and you're simply counting how many cases there are in each category. The nice thing about bar charts is they're really easy to read. Put them in descending order, and maybe have them vertical, maybe have them horizontal. Horizontal can be nice to make the labels a little easier to read. This is about psychological profiles of the United States. This is real data. And we have the most states in the friendly and conventional, a smaller number in temperamental and uninhibited. And the least common in the United States is relaxed and creative.
Next, you can do a box plot, sometimes called a box and whiskers plot. This is when you have a quantitative variable, something that's measured, and you can say how far apart scores are. A box plot shows quartile values. It also shows outliers. So for instance, this is Google searches for modern dance, and that's Utah at five standard deviations above the national average. That's where I'm from and I'm glad to see that there. Also, it's a nice way to show many variables side by side if they're on approximately similar scales.
Next, if you have quantitative variables, you're going to want to do a histogram. Again, quantitative, so interval or ratio level or measured variables. And these let you see the shape of a distribution and potentially compare many. So here are three histograms for Google searches on data science and entrepreneur and modern dance. And you can see they're for the most part normally distributed, with a couple of outliers.
Once you've done one variable, or the univariate analysis, you're going to want to do two variables at a time, that is, bivariate distributions or joint distributions. Now, one easy way to do this is with grouped plots. So you can do grouped bar charts and box plots. What I have right here is grouped box plots.
I have my three regions, psychological regions of the United States. And I'm showing how they rank on openness. That's a psychological characteristic. And what you can see is that the relaxed and creative are highest and the friendly and conventional tend to go to the lowest. And that's kind of how that works. It's also a good way of seeing the association between a categorical variable, like psychological region of the United States, and a quantitative outcome, which is what we have here with openness.
Next, you can also do a scatter plot. That's where you have two quantitative variables. And what you're looking for here is, is it a straight line? That is, is it linear? Do we have outliers? And also the strength of association: how closely do the dots all come to the regression line that we have here in the middle? And this is an interesting one for me because we have openness across the bottom, so more open as you go to the right, and agreeableness on the vertical axis. And what we see is there's a strong downhill association. The states in the United States that are the most open apparently are also the least agreeable. So we're going to have to do something about that.
And then finally, you want to go to many variables. That is, multivariate distributions. Now, one big question here is 3D or not 3D? Let me actually make an argument for not 3D. So what I have here is a 3D scatter plot of three variables about Google searches. Up the left, I have FIFA, which is for professional soccer. Down there on the bottom left, I have searches for NFL. And on the right, I have searches for NBA. Now, I did this in R, and what's neat about this is you can click and drag and move it around. And you know, that's kind of fun. You can kind of spin it around. And it gets kind of nauseating as you look at it. And this particular version, I'm using plotly in R, it allows you to actually click on a point and see (let me see if I can get the floor in the right place).
You get to click on a point and see where it ranks on each of these characteristics. You can see, however, this thing's hard to control. And once it stops moving, it's not much fun. And truthfully, most 3D plots I've worked with are just kind of nightmares. They seem like they're a good idea, but not really. So here's the deal. 3D graphics, like the one I just showed you, because they're actually being shown in 2D, they have to be in motion for you to tell what's going on at all. And fundamentally, they're hard to read and confusing. Now, it's true they might be useful for finding clusters in three dimensions. We didn't see that in the data we had. But generally, I just avoid them like the plague.
If what you want to do, however, is see the connection between several variables, you might want to use a matrix of plots. This is where you have, for instance, many quantitative variables. You can use markers for group membership if you want. And I find it to be much clearer than 3D. So here I have the relationship between four search terms: NBA, NFL, MLB for Major League Baseball, and FIFA. You can see the individual distributions. You can see the scatter plots. You can get the correlation. Truthfully, this for me is a much easier kind of chart to read and get the richness that we need from a multidimensional display.
So the questions you're trying to answer overall are, number one, do you have what you need? Do you have the variables you need? Do you have the variability that you need? Are there clumps or gaps in the distributions? Are there exceptional cases, anomalies that are really far out from everybody else, or spikes in the scores? And of course, are there errors in the data, where there are mistakes in coding? Did people forget to answer questions? Are there impossible combinations? And these kinds of things are easiest to see with a visualization that really just kind of puts it right there in front of you.
And so in sum, I can say this about graphical exploration of data.
It's a critical first step. This is basically where you always want to start. And you want to use the quick and easy methods. Again, bar charts and scatter plots are really easy to make, and they're very easy to understand. And once you're done with the graphical exploration, then you can go to the second step, which is exploring the data through numbers.
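To make the two simplest univariate charts described above a little more concrete, here is a minimal sketch in Python using only the standard library. It is not the code used in the talk. The function names `text_bar_chart` and `box_plot_summary` are hypothetical, the data are made up, and the 1.5 × IQR fence for flagging outliers is a common convention that is assumed here rather than something the talk specifies. The first function counts cases per category and lists the bars in descending order. The second computes the quartile values a box plot would show and the points that would be drawn as outliers.

```python
from collections import Counter
from statistics import quantiles


def text_bar_chart(labels):
    """Count cases per category; return one text bar per category, largest first."""
    counts = Counter(labels)
    width = max(len(name) for name in counts)
    # most_common() sorts by count, descending, which is the order suggested in the talk
    return [f"{name:<{width}} | {'#' * n} {n}" for name, n in counts.most_common()]


def box_plot_summary(values, k=1.5):
    """Quartiles plus any points more than k * IQR outside the box.

    The k = 1.5 fence is an assumed convention, not taken from the talk.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up illustration data (not the state-level data shown in the talk)
regions = ["friendly"] * 5 + ["temperamental"] * 3 + ["relaxed"] * 2
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

for bar in text_bar_chart(regions):
    print(bar)
print(box_plot_summary(scores))
```

In practice you would more likely draw these with a plotting library or one of the apps mentioned earlier; the text version just shows what each chart is counting or summarizing.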