Regression splines are sometimes used for modeling non-linear relationships. In this video I will take a look at what regression splines are and how a spline regression is estimated. The first time a person normally sees a spline is in exploratory data analysis, in a curve like this. So here we have a dataset from Stata: we have a car's weight and the miles per gallon that the car gets, and we are fitting some kind of curve that is supposed to describe the data. This here is a spline curve. To understand what that curve actually tells us and how it is calculated, we need to take a look at how these splines are defined. The idea of a spline is that this is not actually a regression line or any other function that we can draw based on a single regression model; rather, we are fitting different regression models to different parts of the data. Let's take a look at how Hastie and co-authors explain splines. They have this example of a spline regression with two knots. A knot is a value on the x-axis after which we switch to a different regression model. So here in the first panel we have two knots, and we are simply estimating a line that has a different mean, or different intercept, for each of these parts of the data. In the second panel we are estimating a different regression line for each part of the data. So this is basically running three separate regressions on the same data, splitting the data into three sub-samples, and the first one is simply calculating means for three parts of the data. These knots are estimated from the data, but how exactly that is done we will get to in a moment. Typically when we have a spline we want the spline to be continuous, so we restrict one of these lines to start where the other one ends. For example, here we have a negative slope, then a slope that is negative but not as negative, and then a positive slope, and that gives a curve that is non-linear. This is the most commonly used scenario. 
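The discontinuous case, three separate regressions on three sub-samples split at two knots, can be sketched like this. This is my own Python illustration with made-up data and knot positions, not the figure from Hastie and co-authors:

```python
def simple_ols(xs, ys):
    """Closed-form simple regression: return (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# made-up piecewise relationship with different slopes (and jumps) per region
def true_y(x):
    if x < 3.0:
        return 1.0 + 2.0 * x
    if x < 6.0:
        return 10.0 - 1.0 * x
    return -5.0 + 1.5 * x

xs = [i * 0.25 for i in range(1, 37)]          # 0.25 .. 9.0
ys = [true_y(x) for x in xs]

# split into three sub-samples at the knots (3.0 and 6.0) and
# run a separate regression on each piece
bounds = [(0.0, 3.0), (3.0, 6.0), (6.0, 9.1)]
segments = [[(x, y) for x, y in zip(xs, ys) if lo <= x < hi]
            for lo, hi in bounds]
fits = [simple_ols([x for x, _ in seg], [y for _, y in seg])
        for seg in segments]
# each fit recovers that segment's own intercept and slope
```

Because nothing ties the pieces together, the fitted lines need not meet at the knots; that is exactly what the continuity restriction described above removes.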
We are basically estimating four different parameters here. We have the intercept, which shows where the first spline segment crosses the y-axis, then we have the first slope, then the difference between the first slope and the second slope, and the difference between the second slope and the third slope, and that gives us the line. So we need four parameters to describe how that line goes. A linear spline is simple to understand and simple to interpret, but quite commonly the splines that we see, which are used for data exploration purposes, are so-called cubic splines. So instead of fitting a linear regression model, we fit a regression model that has x, x squared, and x to the third power, or x cubed, and then we estimate a beta for each of those three terms, and we allow each term to have a different value for different parts of the data. And we can adjust the spline to have a different degree of continuity. Here in the first case the spline is discontinuous, so we are just estimating a separate regression model with a third-degree polynomial of x for each part of the data. In the continuous one we are estimating two parameters fewer, which means that we constrain the intercept of the second part to be whatever the point is where the first part ends, and then we can add even more constraints to the model, estimate fewer and fewer parameters, and that gives us smoother and smoother curves. So how exactly do we calculate these curves? Let's take a look at how splines work using the Prestige dataset. This is a dataset that I use in several examples. The observations here are occupations. We have the average years of education of the occupation and then the income of the occupation in Canadian dollars in the 1970s. And we can see that when we fit a spline, the best-fitting spline shows that the increase in income with education is rather gradual until about 14.45 years, and then income starts to increase rapidly. So education pays off, and college education even more so. 
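The four-parameter continuous linear spline described above can be written out directly: y = b0 + b1·x + b2·max(x − k1, 0) + b3·max(x − k2, 0), so the slopes of the three segments are b1, b1 + b2, and b1 + b2 + b3. A small sketch, with knot positions and coefficients that are arbitrary illustration values of my own:

```python
K1, K2 = 3.0, 6.0
b0, b1, b2, b3 = 1.0, -2.0, 1.5, 3.0   # intercept, first slope, two slope changes

def spline(x):
    # the "truncated" terms max(x - knot, 0) switch on only beyond each knot
    return b0 + b1 * x + b2 * max(x - K1, 0.0) + b3 * max(x - K2, 0.0)

def slope_at(x, h=1e-6):
    """Numerical slope, to check each segment's direction."""
    return (spline(x + h) - spline(x - h)) / (2 * h)

# slope in each segment: -2.0, then -2.0 + 1.5 = -0.5, then -0.5 + 3.0 = 2.5
left, middle, right = slope_at(1.0), slope_at(4.5), slope_at(8.0)
```

This mirrors the scenario in the transcript: a negative slope, a less negative slope, then a positive slope, with the curve continuous at both knots because only the slope, not the level, is allowed to jump.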
So how is this kind of line with one knot estimated? Let's take a look at the data. These are the first 12 observations, and if we know that the knot is at 14.45, we can calculate a new variable called education star. The name of this variable does not have any significance; it is just something that I decided to use, so the star is not a convention here. We have education and then we have education star, which tells us how many years of education this particular observation has beyond the knot. If an occupation has less education than the knot value, then education star is zero. For example, the first observation has 13.11 years of education, which is not more than 14.45, so the value is zero, and then this observation here has education that exceeds 14.45, so its value of education star is positive. Then we simply run a regression model where we have education plus education star plus an error term u, and that gives us the regression spline. So in fact what we are doing here is estimating how much the slope will be different: the beta two coefficient estimates the difference in slope, and the interpretation is how much the additional years of education beyond the knot point pay, over and above the baseline slope. In practice these models are estimated by first specifying the regression model and then specifying the position of the knot as an additional parameter, which is then iteratively optimized to find the best-fitting line. Spline regression can also be understood as an interaction model. 
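The education-star construction can be sketched in Python. The transcript works with the Prestige data; the numbers below are made up for illustration, and the regression is solved with plain normal equations so that nothing outside the standard library is needed:

```python
KNOT = 14.45

def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def fit_one_knot_spline(xs, ys, knot):
    """OLS for y = b0 + b1*x + b2*max(x - knot, 0) via normal equations."""
    rows = [[1.0, x, max(x - knot, 0.0)] for x in xs]   # education star column
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return gauss_solve(XtX, Xty)

# noiseless illustration data: slope 2 below the knot, slope 2 + 3 = 5 above it
xs = [x * 0.5 for x in range(12, 40)]          # 6.0 .. 19.5 "years of education"
ys = [1.0 + 2.0 * x + 3.0 * max(x - KNOT, 0.0) for x in xs]
b0, b1, b2 = fit_one_knot_spline(xs, ys, KNOT)
# b2 is the difference in slopes: how much extra each year beyond the knot pays
```

Note that the knot is treated as known here, matching the step in the transcript where 14.45 is taken as given; finding the knot itself is the non-linear part discussed next.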
So we can understand this as an interaction where we have education plus education times c, and c here, again this is my own convention, is a binary variable that indicates whether the x value for that observation is above or below the knot value: if it is below, it receives zero, and if it is above, it receives one. And then we constrain the coefficients for the knot, the beta three among them, with an equation which basically requires that these lines meet at the knot point. In practice these models are estimated using something called non-linear least squares. The idea of non-linear least squares is that we find the spline, including the regression coefficients and the knot value, that minimizes the sum of squared residuals. It is called non-linear least squares because there is no closed-form solution, so we cannot simply apply linear algebra to arrive at the estimates; rather, the computer will actually calculate the sum of squared residuals trying different values for the regression coefficients and the knot, or knots if there are more, and it will iteratively find the values that minimize the sum of squared residuals.
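A crude but transparent version of this search can be sketched as a grid over candidate knots: for each candidate, fit the spline coefficients by ordinary least squares and keep the knot with the smallest sum of squared residuals. Real non-linear least squares routines search the knot iteratively rather than over a fixed grid, and the data and names below are my own illustration, not the actual Prestige data:

```python
def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def ssr_for_knot(xs, ys, knot):
    """Fit y = b0 + b1*x + b2*max(x - knot, 0) by OLS, return the SSR."""
    rows = [[1.0, x, max(x - knot, 0.0)] for x in xs]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    b = gauss_solve(XtX, Xty)
    return sum((y - sum(bi * ri for bi, ri in zip(b, r))) ** 2
               for r, y in zip(rows, ys))

# noiseless illustration data with a true knot at 14.45
xs = [x * 0.5 for x in range(12, 40)]          # 6.0 .. 19.5
ys = [1.0 + 2.0 * x + 3.0 * max(x - 14.45, 0.0) for x in xs]

candidates = [10.0 + 0.05 * i for i in range(161)]   # 10.00 .. 18.00
best_knot = min(candidates, key=lambda k: ssr_for_knot(xs, ys, k))
# the knot with the smallest sum of squared residuals is the estimate
```

The non-linearity is entirely in the knot: conditional on a knot value the problem is ordinary linear regression, which is why each grid step above can be solved in closed form.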