As we continue our discussion of statistics and the choices that are made, one important consideration is model validation. The idea here is: if you're doing your analysis, are you on target? More specifically, the model you create through regression or some other method may fit the sample data beautifully, because you've optimized it there, but will it work well with other data? Fundamentally, this is the question of generalizability. You're trying to apply the model in other situations, and if it's tuned too specifically to one sample, that is, if it's overfitted, it won't work well anywhere else.

There are a few general ways of dealing with this and getting some measure of generalizability. Number one is a Bayesian approach, then there's replication, then there's something called holdout validation, and then there's cross-validation. Let's discuss each of these briefly in conceptual terms; I'll also sketch each one in code at the end of this section.

The first is Bayes, and the idea here is that you want what are called posterior probabilities. Most analyses give you the probability of the data given the hypothesis, so you have to start with an assumption about the hypothesis. But it's possible to flip that around, by combining the likelihood with a prior probability, to get the probability of the hypothesis given the data. That is the purpose of Bayes' theorem, which I've talked about elsewhere.

Another way of finding out how well things will hold up is replication; that is, do the study again. Replication is considered the gold standard in many different fields. The question is whether you need an exact replication or a conceptual one that is similar in certain respects, and you can argue it both ways. Either way, when you do a replication, you want to combine the results. What's interesting is that the first study can serve as the Bayesian prior probability for the second study, so you can use meta-analysis or Bayesian methods to combine the data from the two of them.

Then there's holdout validation. This is where you build your statistical model on one part of the data and test it on another; I like to think of it as putting your eggs in separate baskets. The catch is that you need a large sample in order to have enough data to do these two steps separately. On the other hand, it's used very often in data science competitions as a kind of gold standard for assessing the validity of a model.

Finally, there's cross-validation. This is where the same data are used both for training and for testing or validating, but not all at once: you cycle through the data and weave the results together. There are several different versions. There's leave-one-out (LOO), where you leave out one case at a time. There's leave-p-out, where you leave out p cases at each step. There's k-fold, where you split the data into, say, ten groups, develop the model on nine of them, test it on the one you left out, and then cycle through. And there's repeated random sub-sampling, where you draw a new random split at each step. Any of these lets you develop the model on one part of the data, test it on another, and cycle through to see how well the model holds up under different circumstances.
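To make that Bayesian flip concrete, here's a minimal sketch in Python. The prior and the two likelihoods are made-up numbers for illustration, not values from any particular study.

```python
# Bayes' theorem with made-up numbers:
# posterior = likelihood * prior / evidence.

prior = 0.10            # P(hypothesis): an assumed prior probability
likelihood = 0.80       # P(data | hypothesis)
likelihood_alt = 0.30   # P(data | not hypothesis)

# Total probability of observing the data either way.
evidence = likelihood * prior + likelihood_alt * (1 - prior)

# The flip: probability of the hypothesis given the data.
posterior = likelihood * prior / evidence
print(f"P(hypothesis | data) = {posterior:.3f}")  # 0.229
```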
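And here's one way the first study can serve as the prior for the second. This sketch assumes a simple beta-binomial model for a success rate, which is just one convenient choice, and all of the counts are hypothetical.

```python
# Chaining two studies: the posterior from study 1 becomes
# the prior for study 2 (beta-binomial model, hypothetical counts).

# Flat Beta(1, 1) prior before seeing any data.
alpha, beta = 1, 1

# Study 1: say 18 successes out of 30 trials.
alpha, beta = alpha + 18, beta + (30 - 18)

# Study 2: the updated prior absorbs 22 successes out of 40 trials.
alpha, beta = alpha + 22, beta + (40 - 22)

# Posterior mean for the success rate after both studies.
print(f"Combined estimate: {alpha / (alpha + beta):.3f}")
```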
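Holdout validation is easy to sketch with scikit-learn. The dataset here is synthetic, and a plain linear regression stands in for whatever model you actually build; the point is just that the test rows never touch the fitting step.

```python
# Holdout validation: fit on one basket of data, score on the other.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

# Hold out 30% of the cases; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(f"R^2 on training data: {model.score(X_train, y_train):.3f}")
print(f"R^2 on held-out data: {model.score(X_test, y_test):.3f}")
```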
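Finally, a sketch of ten-fold cross-validation, again on synthetic data with a linear regression as the stand-in model. Each case serves in a test fold exactly once, and you look at how the scores hold up across the cycles.

```python
# K-fold cross-validation: fit on nine folds, test on the tenth,
# then cycle through so every fold gets a turn as the test set.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)

scores = cross_val_score(LinearRegression(), X, y, cv=10)
print(f"R^2 per fold: {scores.round(3)}")
print(f"Mean R^2 across folds: {scores.mean():.3f}")
```

The other variants I mentioned are available in the same module: LeaveOneOut, LeavePOut, and ShuffleSplit (for repeated random sub-sampling) can each be passed as the cv argument instead of a number.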
And so, in sum, I can say this about validation: you want to make your analysis count by testing how well your model carries over from the data you developed it on to other situations, because that's really what you're trying to accomplish. It lets you check the validity of your analysis and your reasoning, and it builds confidence in the utility of your results. Thank you.