The next step in our discussion of statistics and the choices you have to make concerns common problems in modeling. I like to think of this as the situation where you're caught between a rock and a hard place, and this is where the going gets very hard. Common problems include things like non-normality, non-linearity, multicollinearity, and missing data, and I'll talk about each of these.

Let's begin with non-normality. Most statistical procedures like to deal with nice symmetrical, unimodal bell curves; they make life really easy. But sometimes you get strongly skewed distributions, or you get outliers. Skewness and outliers happen pretty often. They're a problem because they distort measures: the mean gets thrown off tremendously when there are outliers. And they throw off models, because models assume the symmetry and the unimodal shape of a normal distribution. One way of dealing with this that I've mentioned before is to try transforming the data: take the logarithm, or try something else. Another possibility is that you have mixed distributions. If you have a bimodal distribution, maybe what you really have is two distributions that got mixed together, and you may need to disentangle them by exploring your data a little more.

Next is non-linearity. The gray line here is the regression line; we like to put straight lines through things because it makes the description a lot easier. But sometimes the data are curved. Here you have a perfectly curved relationship, and a straight line doesn't work with that. Linearity is a very common assumption of many procedures, especially regression. To deal with this, you can try transforming one or both of the variables in the equation, and sometimes that manages to straighten out the relationship between them. Also, using polynomials, terms that specifically include curvature like squared and cubed values, can help as well.

Then there's the issue of multicollinearity, which I've mentioned previously. This is when you have correlated predictors, or rather, the predictors themselves are associated with each other. The problem is that this can distort the coefficients you get in your overall model. Some procedures, it turns out, are less affected by this than others. But one overall way of dealing with it might simply be to use fewer variables: if they're really correlated, maybe you don't need all of them. There are empirical ways to deal with this too, but truthfully, it's perfectly legitimate to use your own domain expertise and your own insight into the problem, to use your theory to choose among the variables that would be the most informative.
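To make these fixes a little more concrete, here is a minimal Python sketch on made-up data (the data frame and variable names are purely hypothetical): it log-transforms a skewed predictor, adds a squared term to handle curvature, and uses variance inflation factors, one common empirical check, to flag correlated predictors.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: a right-skewed predictor, a curved relationship,
# and a predictor (x3) that is nearly collinear with x2.
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.lognormal(mean=0.0, sigma=1.0, size=200),
                   "x2": rng.normal(size=200)})
df["x3"] = 0.9 * df["x2"] + rng.normal(scale=0.1, size=200)
df["y"] = 2 + np.log(df["x1"]) + df["x2"] ** 2 + rng.normal(size=200)

# Non-normality: a log transform often tames a right-skewed variable.
df["log_x1"] = np.log(df["x1"])

# Non-linearity: a squared (polynomial) term lets the fitted line bend.
df["x2_sq"] = df["x2"] ** 2

# Fit the regression with the transformed and polynomial terms.
X = sm.add_constant(df[["log_x1", "x2", "x2_sq", "x3"]])
model = sm.OLS(df["y"], X).fit()
print(model.params)

# Multicollinearity: variance inflation factors well above 5 or 10
# suggest correlated predictors; consider dropping one of them.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 1))
```

This is not the only way to do it; it is just one sketch of the transformation, polynomial, and fewer-variables ideas described above.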
Part of the problem we have here is something called the combinatorial explosion. This is where combinations of variables or categories grow too fast for analysis. I've mentioned something about this before. If you have four variables, and each variable has two categories, then you have 2^4 = 16 combinations; fine, you can try things 16 different ways, that's perfectly doable. But if you have 20 variables with five categories each, which again is not too unlikely, you have 5^20, or about 95 trillion, combinations, and that's a whole other ballgame, even with your fast computer. There are a couple of ways of dealing with this. Number one is with theory: use your theory and your own understanding of the domain to choose the variables or categories with the greatest potential to inform. You know what you're dealing with; rely on that information. Second, there are data-driven approaches: you can use something called a Markov chain Monte Carlo model to explore the range of possibilities without having to go through every single one of your 95 trillion combinations.

Closely related to the combinatorial explosion is the curse of dimensionality. This is when you have phenomena that may only occur in higher-dimensional variable sets, things that don't show up until you have these unusual combinations. That may be true of a lot of how reality works, but the project of analysis is simplification. So you can try one or two different things, and mostly that means reducing the dimensionality of your data: reduce the number of dimensions or variables before you analyze. You're essentially trying to project the data onto a lower-dimensional space, the same way you would get the shadow of a 3D object; there's a quick sketch of that projection idea at the end of this section. There are a lot of different ways to do that. There are also data-driven methods, and the same approach here, a Markov chain Monte Carlo model, can be used to explore a wide range of possibilities.

Finally, there's the problem of missing data, and this is a big problem. Missing data tends to distort analyses, and it creates biases if it's a particular group that's missing. So when you're dealing with this, what you have to do is check for patterns in missingness: you create a new variable that indicates whether a value is missing, and then you see whether that indicator is associated with any of your other variables. If there are no strong patterns, then you can impute the missing values: you can put in the mean or the median, you can do regression imputation, or something called multiple imputation; there are a lot of different choices. Those are all more technical topics that we'll have to save for a later, more technically oriented series, but a small sketch of the basic steps also follows at the end of this section.

For right now, in terms of the problems that can come up during modeling, I can summarize it this way. Number one, check your assumptions at every step: make sure that the data have the distribution you need, check for the effects of outliers, and check for ambiguity and bias. See whether you can interpret what you have. And use data-driven methods, but also your knowledge of the theory and the meaning of things in your domain, to inform your analysis and find ways of dealing with these problems.
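As a rough sketch of that shadow idea, here is what projecting onto a lower-dimensional space might look like with principal component analysis in scikit-learn. PCA is only one of many reduction methods, and the data here are made up purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 200 observations on 10 correlated variables
# that are really driven by just two underlying factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.1, size=(200, 10))

# Project onto the first two principal components: the "shadow" of the
# 10-dimensional cloud on a 2-dimensional plane.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                      # (200, 2)
print(pca.explained_variance_ratio_)   # how much structure the shadow keeps
```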
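And to make the missing-data steps concrete, here is a minimal sketch on made-up data: it creates the missingness indicator, checks whether missingness is related to another variable, and then does a simple median imputation. Regression imputation and multiple imputation (for example, scikit-learn's IterativeImputer) would be the more careful follow-ups mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with some missing income values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(20, 70, size=300),
                   "income": rng.normal(50_000, 15_000, size=300)})
# Make income missing more often for younger respondents (a biased pattern).
df.loc[(df["age"] < 30) & (rng.random(300) < 0.4), "income"] = np.nan

# Step 1: create an indicator for missingness.
df["income_missing"] = df["income"].isna()

# Step 2: check whether missingness is associated with other variables.
print(df.groupby("income_missing")["age"].mean())

# Step 3: if the pattern is weak, impute; here, a simple median fill.
df["income_imputed"] = df["income"].fillna(df["income"].median())
```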