At this point, we need to take a little detour to talk about autocorrelation. It's a tool we're going to need to dig into the temperature data and find the patterns we need to make predictions. It's not as complicated as it sounds, but it does bear some explanation. To start with, let's talk about linear regression, using some Python code as an example. Assume we start with an array of things that look like temperatures. It's a potentially long array covering a certain number of days, and we also create an array that represents those days. We can plot this with days on the x-axis and temperature on the y-axis. The first day, day 0, has 68.2 associated with it; we can plot that point, and so on for the rest of our temperatures across the rest of our days. Linear regression is just the process of taking a line and fitting it to all of those points. To be a little more specific, this line is described by where it crosses the y-axis, the temperature axis: on day 0, what temperature is estimated? That's called the intercept. And then for each day that passes, how much does the temperature change? That's the slope, the rise over the run, the temperature change per day. These two numbers fully describe the line. Once you know the slope and where it crosses the y-axis, you know exactly what that line is. Once you have that line (for now we'll just assume we have it; we'll come back to how we get it in a minute), there is a deviation: each measured data point won't lie exactly on the line, but will be off by just a little bit. We'll call this the temperature error. If we write a little function to calculate the temperature error, it looks something like this: we calculate our estimate of the temperature along the line, which changes by the slope for each day, and compare it with what we measured.
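A minimal sketch of that error function might look like the following. The exact names (`find_temp_error`, `days`, `temps`) and the signature are assumptions here; the transcript only describes the function in words, and the example data is made up.

```python
import numpy as np

def find_temp_error(intercept, slope, days, temps):
    """Deviation of each measured temperature from the fitted line."""
    est_temps = intercept + slope * days   # points lying exactly on the line
    return temps - est_temps               # measured minus estimated

# Hypothetical example: five days of temperature-like numbers.
days = np.arange(5)
temps = np.array([68.2, 68.4, 68.9, 69.1, 69.6])
errors = find_temp_error(68.2, 0.35, days, temps)
```

Day 0 lands exactly on the line here (68.2 estimated, 68.2 measured), so its error is zero, while day 1 sits 0.15 degrees below its estimate.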
And then our estimate of the temperature on any given day is just the equation of the line: y equals b plus mx, or a plus bx, however you write it. The estimate equals the intercept plus the slope times your x value; in our case, the x value is days. So our estimated temperatures are whatever that equation gives, points lying exactly on the line. The measured temperature on any given day, temps, will be off from that by temp error. This is how we calculate the deviation of our measured points from our line. Now, it is a characteristic of the best-fit line that it takes all of those deviations, all of those temp errors, and minimizes the sum of their squares. If you take them all, square them, and add them up, that quantity is minimized by the best-fit line. If we changed the intercept at all, or changed the slope at all, that sum of squared errors would go up. And this is the Python for doing that: we use polyfit, which fits a polynomial to our data, described by x and y, in our case days and temps. The last number is the order of the polynomial: first order is a straight line, second order would be quadratic, third order would be cubic. Since we are fitting a line, we put a 1 there, and it returns the slope and the intercept of that line. By passing that slope and intercept into the find temp error function, we can find what our errors actually are. Correlation is closely related to linear regression. Once we fit that line and minimize the square of those deviations, correlation helps us put a number on how closely the data points hug the line. If they lie exactly on the line, then the correlation, often written r, is 1. If they're close to the line, it can still be quite high, like 0.9 or above.
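The polyfit step described above can be sketched like this. To keep the example checkable, the temperatures are made-up values lying exactly on a line, so the fit should recover the slope and intercept we built in; with real measurements the recovered line would be the least-squares best fit instead.

```python
import numpy as np

# Made-up temperatures lying exactly on a line: intercept 68.0, slope 0.5.
days = np.arange(10)
temps = 68.0 + 0.5 * days

# Fit a first-order polynomial (a straight line). polyfit returns the
# coefficients highest order first, so the slope comes before the intercept.
slope, intercept = np.polyfit(days, temps, 1)
```

Note the ordering: for order 1, `np.polyfit` hands back `[slope, intercept]`, which is why the unpacking above is written in that order.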
If they're in the neighborhood of the line, that's more moderate correlation, somewhere in the 0.5 range, plus or minus. And if the points don't make any attempt to hug the line at all, the correlation is very low. The lowest it can possibly get is 0, and anything lower than 0.2 or 0.3 is often considered quite low. Now, when I say the lowest it can get is 0, that's actually not true. If the line slants downward as it goes to the right, then the slope is negative and the correlation is negative. Measured points lying exactly on a downward-pointing line would have a correlation of minus 1. So r equals 0 is no correlation at all, r equals 1 is perfect correlation in the positive direction, and r equals minus 1 is perfect correlation in the negative direction. When we go to calculate this in Python, we take our data, temps and days, and use the corrcoef call, and it spits out something that looks like this. What it actually does is take our two quantities, days and temps, and find the correlation between days and days, days and temps, temps and days, and temps and temps. It's a symmetric operation, so the upper right triangle of this square will be the same as the lower left triangle: days versus temps is the same as temps versus days. In addition, the correlation of anything with itself is 1, so the diagonal will be 1s all the way down. The number we really want, then, is at row 0, column 1, which pulls off the correlation between days and temps. This is the number that represents the correlation of the data that we see on the right. So far we've been correlating temperature against days, but we can actually correlate temperature with itself. As we just saw, if we straight up correlate temperature with itself, the correlation is 1; it's perfect. Because temperature is a time series, though, we can do something more interesting.
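The corrcoef call and the row 0, column 1 lookup can be sketched like this; the data is invented for illustration, roughly linear with some wiggle added so the correlation is high but not perfect.

```python
import numpy as np

days = np.arange(30)
temps = 68.0 + 0.1 * days + np.sin(days)   # assumed example data

# corrcoef returns the full 2x2 matrix:
# [[days vs days, days vs temps],
#  [temps vs days, temps vs temps]]
corr_matrix = np.corrcoef(days, temps)
r = corr_matrix[0, 1]                      # row 0, column 1: days vs temps
```

The diagonal is all 1s (everything correlates perfectly with itself), and the off-diagonal entries match each other, which is why pulling either one of them gives the number we want.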
We can do a trick where we shift the series by one day and correlate whatever the temperature is on day i with whatever it was on day i minus 1, so we're always comparing the temperature on one day with the temperature on the day before. We would expect this to be correlated in the case of temperatures, because the temperature does change from one day to the next, but usually not as much as it changes from one month to the next or from one season to the next. So we would expect a reasonable amount of correlation. This, correlating a time series with itself under some amount of shift, is what autocorrelation is. To calculate it, we just use the corrcoef call, pass in the original data set and the shifted data set, and read off the coefficient of correlation between them. We don't have to do just a single-day shift. We could, for instance, shift by four days and calculate that. The full autocorrelation function takes a sequence of these shifts. We know that at a shift of zero the correlation is 1, but we can cycle through the shifts, calculate the autocorrelation for each, and make a sequence of them. With a shift of zero, the correlation is 1; it's perfect. When we shift by one day, it goes down quite a bit: there's still quite a bit of correlation, but it's not 1. Then with each successive day, that correlation gets a little less, until about day five, six, or seven, when it gets down to where it's bouncing around in the noise, just random variation due to happenstance. This is intuitively what happens in a time series like weather, where the weather one day is reasonably similar to the weather the next day, but the further out you get, the larger the swings you would expect. And so this is what it looks like in an autocorrelation plot: the further away you get, the lower the correlation. There's also a very useful related term called partial autocorrelation.
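The shift-and-correlate procedure can be sketched as a small function. The name `autocorrelation` and the synthetic data below are assumptions; the data is generated so that each day stays close to the day before, which is the property that makes the correlation fall off gradually with the shift.

```python
import numpy as np

def autocorrelation(series, shift):
    """Correlation of a series with itself shifted by `shift` days."""
    if shift == 0:
        return 1.0                      # anything correlates perfectly with itself
    # Pair day i with day i - shift, then read off row 0, column 1.
    return np.corrcoef(series[shift:], series[:-shift])[0, 1]

# Assumed temperature-like data: each day close to the day before.
rng = np.random.default_rng(0)
temps = np.empty(200)
temps[0] = 68.0
for i in range(1, 200):
    temps[i] = 68.0 + 0.8 * (temps[i - 1] - 68.0) + rng.normal(0.0, 1.0)

acf = [autocorrelation(temps, shift) for shift in range(8)]
```

The sequence `acf` is the full autocorrelation function described above: 1 at shift zero, still high at shift one, then drifting down toward the noise as the shift grows.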
What that is, is we start again with our temperatures, but we think of them now as errors, residuals that we haven't been able to fit yet. For the very first correlation, with a shift of one, we correlate the temperature on day i with the temperature on day i minus 1, and then we find the residuals: what's left over after we fit, the deviations from the best-fit line. Now, instead of doing what plain autocorrelation would do, finding the correlation between the original temperatures and the temperature on day i minus 2, we take the residuals left over after fitting day i minus 1, the leftover errors, and we plot those against the temperature on day i minus 2. Then we fit a line to that, find the residuals, and repeat. At each shift we try to fit the leftover error, take it out, and pass the residuals on to the next shift to be fit. Concretely, at each step we calculate the coefficient of correlation between the residuals and the temperature that many days back, fit a line, calculate the estimate along all the points on that line, and subtract that estimate from what we're still trying to fit, giving updated residuals. What this means is that at a shift of one, we get our original shift-one autocorrelation, because the residuals we're fitting there are the original temperatures. But after that it falls off very quickly, because we're now using the temperature from two days ago to try to correct for any errors we missed by using the previous day as an estimate. In the case of temperature, once we've used the previous day to estimate today's temperature, there's not a lot of information left in the days that come before. So it falls down to the noise very quickly.
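The fit-and-pass-on-the-residuals procedure above can be sketched as follows. This follows the step-by-step description literally and is only an illustration of the idea: library implementations (statsmodels, for example) compute the partial autocorrelation function with full multi-lag regressions rather than this one-lag-at-a-time loop, so the numbers can differ. All names and the example data are assumptions.

```python
import numpy as np

def partial_autocorr(temps, max_shift):
    """Sketch: correlate leftover residuals with successively earlier days."""
    temps = np.asarray(temps, dtype=float)
    resid = temps.copy()               # start by treating temps as unfit residuals
    pacf = [1.0]                       # shift 0: perfect by definition
    for shift in range(1, max_shift + 1):
        y = resid[shift:]              # leftover error on day i
        x = temps[:-shift]             # temperature on day i - shift
        pacf.append(np.corrcoef(x, y)[0, 1])
        slope, intercept = np.polyfit(x, y, 1)        # fit a line to the residuals
        resid[shift:] = y - (intercept + slope * x)   # take out what we just fit
    return pacf

# Assumed temperature-like data, as before: each day close to the day before.
rng = np.random.default_rng(1)
temps_data = np.empty(200)
temps_data[0] = 68.0
for i in range(1, 200):
    temps_data[i] = 68.0 + 0.8 * (temps_data[i - 1] - 68.0) + rng.normal(0.0, 1.0)

pacf = partial_autocorr(temps_data, 7)
```

The first loop iteration fits the original temperatures, so `pacf[1]` matches the shift-one autocorrelation; after that each entry measures only the correlation left over once the nearer days have already been fit, which is why it drops into the noise so quickly for data like this.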