Finally, we've reached the question that probably brought you here: which models should I try to fit my data with? In part one, we stepped through the process of choosing a model: train all the candidates on the training data, evaluate them on the testing data, and the one with the lowest testing error wins. The goal of this exploratory analysis is to find a concise representation of the patterns in the data, without regard to what form they take.

This process helps us avoid overfitting, that is, finding a model that captures too much noise. However, it doesn't protect against underfitting. It doesn't guarantee that the model we found is the best one possible. There might be other models out there that would capture the signal even better and result in a lower testing error. We have no way to know unless we try them all.

However, we have to balance this against the fact that there is literally an infinity of models to choose from. There's no way to try them all. It's not even practical to try the most common ones. Running them all in all their published manifestations is likely to take so long that we won't care about the answer anymore, and that's not even considering the time it would take to get them all running in the same code base. We're going to have to make some tough decisions.

Despite these practical difficulties, the try-them-all approach is conceptually sound, and the AutoML community is committed to making it feasible to try a lot of models at once. This is a great way to make sure that you're getting a broad selection of different model candidates. But keep in mind that for every model candidate included in an AutoML library, there are a million that aren't. With that one caveat, AutoML is a great way to balance your time against the breadth of your candidate pool.

One way to describe a model's complexity is to count the number of separate parameter values that get adjusted during model fitting.
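To make that counting concrete, here's a small sketch for two familiar model families. The polynomial rule is exact; the network shape and layer sizes are made up for illustration.

```python
def polynomial_parameters(degree):
    """A degree-d polynomial y = c0 + c1*x + ... + cd*x**d
    has d + 1 coefficients to adjust during fitting."""
    return degree + 1

def dense_network_parameters(layer_sizes):
    """A fully connected network has (inputs * outputs) weights
    plus one bias per output in each layer."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(polynomial_parameters(1))                  # a straight line: 2
print(dense_network_parameters([784, 128, 10]))  # a small, hypothetical image classifier
```

Even this toy network, tiny by modern standards, has over a hundred thousand parameters, which is why the data budget discussed next matters so much.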
These are all quantities that need to be inferred from the data. It doesn't make sense to try to learn 50 parameters from 10 data points. For a well-behaved and responsibly fit model, it takes somewhere between 3 and 30 independent measurements per parameter. If we take 10 to be a nice middle-of-the-road guideline, that means a model with 100 parameters would require at least 1,000 data points to fit well. This also assumes that the data points are scattered relatively evenly across the range that you're trying to model. If, as is more often the case, there are dense clusters of data points with vast wildernesses around them containing almost no data, then the amount of data you need might go up much further still.

Using this guideline, a straight line, which has two parameters, would require 10 data points to fit confidently. That's 20 measurements in all, 10 of x and 10 of y. Multivariate linear regressions can have dozens of parameters, random forests can have thousands, and deep neural networks can have millions. The number of data points you have can be a dominant limitation on the types of models available to you. If you only have 100 data points, it just won't be worth your time to try training a deep neural network.

Sometimes we have a good reason to try only one type of model. One such case is when your research community has reached a consensus about which model is the most successful in a certain application. For example, in the problem of image classification, very particular architectures of convolutional neural networks have been shown to be quite successful. It's completely reasonable to repurpose one of these without doing a broad comparison across different models. The number of possible models in the universe is so large that it is vanishingly unlikely that a particular published convolutional neural network is the best solution to your problem.
However, a thousand diligent PhDs trying tens of thousands of variants have failed to come up with anything much better so far, so it's not a bad place to start.

Sometimes it's safe to stick with a single model because we are highly confident that it reflects what's going on in the world. For instance, imagine that we're watching a baseball game and we'd like to predict where a pop fly is going to land. A naive approach would be to measure the position and velocity of the ball many times per second for every pop fly, for many hundreds of games, and then fit all of that to a model predicting the final location of the ball. A shortcut to this process is to rely on all of the people who have already made careful measurements and predictions about objects flying through the air and have found a beautifully concise model to describe them. In its simplest form, a ballistic trajectory model in three dimensions has only six parameters, and a single observation of the ball gives us values for six different measurements: position and velocity in the x, y, and z directions. Depending on our measurement accuracy, two or three observations might be enough to accurately determine the final location of the ball. This is probably why veteran outfielders can catch a brief glimpse of the ball shooting into the air, then turn their back to it and sprint toward where they know it will land. Incorporating domain knowledge wherever we can is a great way to limit the set of models we have to consider, and often it helps us make confident predictions with many fewer data points.

There is another reason to try only one model: testing a hypothesis. If we're looking at our data in order to answer a specific question, as is the case when we're testing a scientific hypothesis, then the model only needs to be focused on teasing out that particular information. In this sense, a model is like a spotlight. Each model is capable of revealing different aspects of the data.
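Before moving on, the pop fly model from a moment ago is concrete enough to sketch in a few lines. This is a drag-free idealization with invented numbers; a real outfield problem would add air resistance and measurement error.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

def landing_point(position, velocity):
    """Predict where a drag-free ballistic trajectory returns to
    height z = 0, from one observation of the model's six
    parameters: position (x, y, z) and velocity (vx, vy, vz)."""
    x0, y0, z0 = position
    vx, vy, vz = velocity
    # Solve z0 + vz*t - 0.5*G*t**2 = 0 and keep the later root,
    # the moment the ball comes back down.
    t = (vz + math.sqrt(vz**2 + 2 * G * z0)) / G
    return x0 + vx * t, y0 + vy * t

# A pop fly observed 1 m off the bat, heading up and out.
print(landing_point((0.0, 0.0, 1.0), (10.0, 5.0, 20.0)))
```

One observation of position and velocity pins down all six parameters, which is why so few measurements are needed compared to fitting a generic model from scratch.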
By choosing a particular model, we're effectively asking a particular question, or set of questions, of the data. When we're testing a hypothesis, that question can be very pointed. For instance, the classic "are the means of these two samples significantly different?" question suggests a very specific model. Other, more complex hypotheses might result in more complex models, but the connection between the form of the model and the hypothesis it is testing remains. Hypothesis-driven testing is well suited to clarifying the details of how our system works. We're using data to help us make a decision.

Hypothesis testing also helps to shed light on a subtle point we've glossed over so far: one person's signal is another's noise. Consider our temperature data set from part one. A climate change scientist may approach the data with one question: are the temperatures of the last 20 years significantly higher than the temperatures of the 100 years previous? In this case, a straightforward model that chops the temperature data into two samples and compares their means would be a reasonable approach. This is the signal they need to extract; all other deviation from it is noise to them. However, imagine another researcher studying the effects of solar radiation on climate. They would be more interested in identifying cyclical patterns in the temperature data. For them, any multi-decade trend would be noise. They would choose a model that isolates repeatable patterns on the 8-to-11-year scale.

As we saw when we incorporated domain knowledge into our pop fly model, making more assumptions allowed us to reach confident conclusions with a lot less data. This is part of a recurring theme in machine learning. Many decisions, of models, of methods, even of the questions we ask and the data we use, boil down to a trade-off between the number of assumptions we are willing to make and how quickly we want to arrive at a confident answer.
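The climate scientist's two-sample question maps onto a very small model. Here's a hedged sketch with made-up temperature numbers, computing Welch's t statistic for the difference of two sample means (scipy.stats.ttest_ind would do the same job with a p-value attached).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical yearly mean temperatures: 100 "historical" years,
# then 20 "recent" years running about half a degree warmer.
historical = rng.normal(14.0, 0.5, size=100)
recent = rng.normal(14.5, 0.5, size=20)

def welch_t(a, b):
    """Welch's t statistic for the difference of two sample means."""
    var_a = a.var(ddof=1) / len(a)
    var_b = b.var(ddof=1) / len(b)
    return (b.mean() - a.mean()) / np.sqrt(var_a + var_b)

# A large t value suggests the recent mean really is higher.
# Everything else in the data is noise to this particular model.
print(welch_t(historical, recent))
```

The solar researcher would replace this model entirely, perhaps with a periodogram that treats the very trend measured here as noise.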
The benefits of making additional assumptions are obvious. Reaching conclusions with just a few data points can save a great deal of effort and money. However, the downside is considerable. Each assumption is a potential point of failure. If an assumption happens to be wrong, then we could end up very confidently reaching a conclusion that's entirely false. That might be embarrassing in the case of an academic publication, or it could be catastrophic in the case of predicting severe weather events. More assumptions get us started fast, but they also put strong bounds on how far we can go and how much we can learn. When we make fewer assumptions, we give the data more latitude to tell us its own story, to reveal patterns that we hadn't anticipated, and to answer questions that we hadn't thought to ask.

Unfortunately, assumptions are a Faustian deal we can't escape. The essence of inference requires some basis for using the observations we've already made to set expectations about the ones to come, and every method for doing this comes with its own collection of assumptions. Some are explicit and some are quite subtle, but no method is entirely agnostic. And it gets worse: in practice, we're further constrained by how scarce our data is. We are often forced to rely on assumption-laden domain models simply because it's prohibitively expensive or entirely impossible to gather additional data. And so we forge ahead with as much grace as we can muster, staying aware of the limitations we've knowingly self-imposed, keeping in mind that our analysis is only as strong as our poorest assumption, and staying vigilant for hints that any one of them is irredeemably wrong.

On that note, let's look at some other assumptions that can prove useful. A lot of statistical models get their power from additionally assuming that the noise has some particular properties: for instance, that it's normally distributed, identically distributed, and independent for every measurement.
With these assumptions in place, statisticians have been able to derive powerful results. For several families of models, you don't need to have testing data at all. Student's t-tests, ANOVA, and other statistical tools are a way to get around having to do separate training and testing. They make assumptions about the underlying model and noise distributions, and then use these to estimate how the model would perform on many hypothetical test data sets. They trade assumptions for inferential power, and they are particularly helpful when you have few data points and can't afford to divide them into training and testing sets. Based on the training data and fitting errors alone, you can state, to whatever level of confidence you want, the range that future testing data should fall into.

Even though this statistics-heavy approach looks very different on the surface, it's just another way to do what we've been doing all along: finding the model that best captures the signal and determining how well it generalizes. The difference here is that the generalization performance is estimated rather than measured directly.

Step carefully, though. Many of the common statistical assumptions, independence, identical distribution, normality, homoscedasticity, stationarity, are routinely violated by the data that we work with. Some violations are consequential and others aren't, but if you turn a blind eye to them all, you're likely to get poor results. So here's a tip: if you want to be a thoughtful colleague, you can help your fellows carefully think through the standard assumptions and figure out whether any of them need to be double-checked in their analysis. And an anti-tip: if you want to be a statistics asshole, point out to everyone that their entire analysis is invalid because their assumptions are not all met.

Sometimes, when incorporating domain knowledge into our models, we can comfortably assume something about the parameter values themselves, in addition to the model structure.
For example, consider a model of regional differences in income, broken out by profession. There are already nationwide income surveys by profession, and it's reasonable to expect that regional wages for, say, a carpenter won't be radically different from the national average, maybe higher or lower by 50 percent, but not by a factor of 10. This gives us a running start. With this bit of prior knowledge, we can craft a distribution showing our best guess for what these wages will be before we record a single measurement. Then, using Bayes' theorem, we can use each new data point to update this distribution. It's useful to think of this as a representation of our beliefs. We start with the vague belief that in any given region, a carpenter makes somewhere in the neighborhood of the national average. Then, as we collect data for each region, this belief gets updated and shifted and takes on the characteristics of the local data more completely.

The magic of the Bayesian approach is that it reaches a confident estimate with fewer data points than it would have taken otherwise. Again, we have the option to trade some assumptions for increased confidence. And again, the pitfall is that if we start out with the wrong assumptions, we can actually make our life much harder and require many more data points to reach the same level of confidence. Those who favor the Bayesian modeling approach and those who don't usually trade good-natured criticisms and jokes, but it's important to remember that they just represent different sets of assumptions, each appropriate for different goals and different data sets.

Another type of structure we can add to our models is causality. This is extraordinarily helpful for answering questions about why something happened and what would have happened if another decision had been made. Causal relationships are in a class of their own. Most statistical and machine learning models don't say anything about causality directly.
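Here's a minimal sketch of that belief update for the carpenter example, assuming a normal prior and normally distributed wage reports with a known spread. All of the dollar figures are invented for illustration.

```python
import math

def update_belief(prior_mean, prior_var, data, noise_var):
    """Conjugate Bayesian update of a normal belief about a mean,
    given observations with known noise variance."""
    post_var = 1.0 / (1.0 / prior_var + len(data) / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(data) / noise_var)
    return post_mean, post_var

# Prior belief: regional wages near a hypothetical national
# average of $25/hr, give or take roughly 50%.
prior_mean, prior_var = 25.0, 12.5**2

# Three local wage reports shift the belief toward the local data
# and shrink its uncertainty.
post_mean, post_var = update_belief(prior_mean, prior_var,
                                    [31.0, 29.0, 33.0], noise_var=5.0**2)
print(post_mean, math.sqrt(post_var))
```

After just three reports, the posterior mean sits near the local average of 31, pulled slightly back toward the national prior, and the uncertainty has collapsed from the prior's wide spread. With the wrong prior, the same machinery would need many more reports to overcome it.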
At best, most models can hint at causality through careful experiment design. But there's a small and growing body of work around models that explicitly represent causal relationships. They let you answer questions like: what if I had not made this decision? What if this event had not happened? These are called counterfactuals, and as you can imagine, this type of analysis lends itself very well to making decisions and evaluating them. If you want to go deeper into this topic, pick up The Book of Why by Judea Pearl. He's a pioneer in the field and gives an excellent in-depth tour.

Causal relationships are similar to the other assumptions that we've discussed. If you're justified in assuming them, they add a lot of power to your analysis and help you reach confident conclusions more quickly. However, if the assumption is unjustified, making it will hurt you more than it helps you, no matter how much you like the results.

So, returning to the question at hand: which model candidates should you try? The answer, as always, depends entirely on what you're trying to do. But the one invariant, your single guiding principle in this process, is to be mindful of your assumptions. You can't get rid of them, and if you ignore them, they'll bite you. But if you pay attention to them and thoughtfully prune them, they'll repay your attention many times over.

Nice work. You made it all the way through. Now you're armed with the concepts you need to choose good models. May they serve you well in your next project.