Let's fit a model to some data. These are the annual temperatures for the last hundred and twenty years in a fictional Midwestern town. There's one point per year, the annual median of the daily high temperatures, and when we look at it, our eye is really good at pulling out a pattern. There's a clear lift toward the right-hand side. We'd like to capture that in a model.

There are a lot of models that can represent this, but a really nice starting point, because it's so simple, is a straight line. Here's what the best-fit straight line looks like. It does a pretty good job. We can see that it definitely captures the upward tilt of the data, but it doesn't capture the bend in it. It's clear when we examine it that a straight line doesn't do quite as well as we would like.

Luckily, we have a lot of other options. A reasonable next candidate is a quadratic, a polynomial with a squared term, not just a linear term. These have some curvature to them. We can see that the best-fit quadratic clearly captures the lift at the right-hand side of the plot and the bend in the middle. But it also imposes a little bit of lift on the left-hand side of the plot, which is not obviously reflected in the data.

So we can try other options. We can try polynomials with cubic terms, powers of three. Or we can look at polynomials with quartic terms, powers of four. We can also fit polynomial models of order five, polynomials of order six, seventh-order polynomials, and eighth-order polynomials, also called octic polynomials. Useful tidbit for filling lulls in conversations at parties.

Now the fit appears to be getting better, but the line is taking on extra personality. It's adopting some wiggles. If we take this to an extreme, we can imagine a model that passes through every single data point perfectly. This model would have zero error, zero deviation from our measured data. So does that make it the best-fit model?
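To make that question concrete, here is a minimal sketch of fitting polynomials of order one through eight and measuring how far each one sits from the data it was fit to. The temperature values are synthetic stand-ins invented for this example (the data from the talk isn't reproduced here), and the names `temps`, `rms_error`, and `errors_by_order` are hypothetical. It assumes NumPy is available.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for the temperature record (an assumption, not the
# data from the talk): 120 years, a gentle upward bend, plus noise.
rng = np.random.default_rng(0)
years = np.arange(1900, 2020)
x = (years - years.mean()) / 60.0
temps = 60.0 + 1.5 * x + 2.0 * x**2 + rng.normal(0.0, 1.0, size=years.size)


def rms_error(model, xs, ys):
    """Root-mean-square deviation between model predictions and data."""
    return float(np.sqrt(np.mean((model(xs) - ys) ** 2)))


errors_by_order = {}
for order in range(1, 9):
    # Least-squares polynomial fit of the given order to all the data.
    model = Polynomial.fit(years, temps, deg=order)
    errors_by_order[order] = rms_error(model, years, temps)

for order, err in errors_by_order.items():
    print(f"order {order}: error on the fitted data = {err:.3f}")
```

Because each higher-order polynomial can reproduce any lower-order one, the error on the data used for fitting can only stay the same or shrink as the order goes up. That's why low error on the fitted data alone can't tell us which model is best.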
Models are useful because they allow us to generalize from one situation to another. When we use a model, we're working under the assumption that there is some underlying pattern we want to measure, but it has some error on top of it. The goal of a good model is to look through the error and find the pattern.

The most common way to do this is to split our data up into two groups. We can use one group to train our model, and then we can test it to see how closely it fits the second group. The first group is the training data set. The second group is the testing data set. There are lots of ways to do this, and we'll revisit them later, but for now we'll randomly sort our years into two bins. We'll put 70% of them into the training data set and 30% of them into the testing data set. Then we can go back to our collection of model candidates and try them one by one.

Here are a few of the models trained on the training data and plotted against the testing data. As the models get to be higher order, we can see that the wiggles they developed may have been helpful for fitting the training data but don't necessarily help them fit the testing data better. We can see an extreme example of this in the full interpolation model, where we just connect all the training data points with straight lines. It really struggles to match the testing data points.

It's helpful to look at the error on the training and testing data sets for each model, lined up side by side. Looking at the errors, a few things jump right out. First is the wide gap between the training errors, the hollow circles, and the testing errors, the solid circles. Right away we can see that there's a substantial difference between the two data sets. Second, there's a precipitous drop in error going from a linear to a quadratic model, that is, from a first-order to a second-order polynomial. This makes sense. When we were eyeballing it, we could see that the linear fit failed to capture the curvature of the data, one of its most prominent features. The quadratic curve captured that just fine.

So which model fits best? When we look carefully at the errors on the training data, it appears that the error on the fifth-order polynomial is the lowest. The differences are subtle, so you might have to squint, but all the other higher-order models have low error too. They're just a little higher than the order-five polynomial.

But as we mentioned, that's not the ultimate test. The error on the testing data is what we really care about. Careful inspection of the testing error shows that the fourth-order model does the best job. At higher orders, the error on the testing data set goes up. The more wiggly the line gets in fifth- and higher-order polynomial models, the more it captures the quirks of the training data rather than the underlying pattern that we're interested in.

Based on this train-and-test approach, we have a clear winner. Of all the models we tried, the fourth-order polynomial is best. Congratulations to us. We chose a pretty good model for our data.

But don't leave just yet. There are some pretty important ideas still to mention. Join me for part two, where we'll talk more in depth about what we want in a model.
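For anyone who wants to try the train-and-test procedure before part two, here is a rough sketch of it. As before, the data is a synthetic stand-in invented for this example, so the winning order it reports won't necessarily be the fourth-order one from the talk. The names `train_idx`, `test_idx`, `train_error`, `test_error`, and `best_order` are hypothetical, and the sketch assumes NumPy is available.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic stand-in for the temperature record (an assumption, not the
# data from the talk).
rng = np.random.default_rng(1)
years = np.arange(1900, 2020)
x = (years - years.mean()) / 60.0
temps = 60.0 + 1.5 * x + 2.0 * x**2 + rng.normal(0.0, 1.0, size=years.size)

# Randomly sort the years into two bins: 70% training, 30% testing.
shuffled = rng.permutation(years.size)
n_train = int(round(0.7 * years.size))
train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]


def rms_error(model, xs, ys):
    """Root-mean-square deviation between model predictions and data."""
    return float(np.sqrt(np.mean((model(xs) - ys) ** 2)))


train_error, test_error = {}, {}
for order in range(1, 9):
    # Fit on the training years only, then score on both bins.
    model = Polynomial.fit(years[train_idx], temps[train_idx], deg=order)
    train_error[order] = rms_error(model, years[train_idx], temps[train_idx])
    test_error[order] = rms_error(model, years[test_idx], temps[test_idx])

# The model we keep is the one with the lowest error on the testing data.
best_order = min(test_error, key=test_error.get)

for order in range(1, 9):
    print(f"order {order}: train = {train_error[order]:.3f}, "
          f"test = {test_error[order]:.3f}")
print(f"lowest testing error: order {best_order}")
```

The choice is made on `test_error` rather than `train_error` because the training error keeps falling as the order rises, while the testing error reflects how well each model generalizes to years it never saw. Changing the random seed changes the split, and with it possibly the winner, which is one reason to look at other ways of splitting the data.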