Welcome to this video where we are looking at the prediction of house prices. What you see here is a GitHub page; the link is github.com slash webartifex slash ames-housing. It contains all the code and all the data files you need to follow what I'm about to present. Again, this is a presentation of using some simple machine learning algorithms to predict house prices. First, if you go to this link, you see a couple of files and one folder. Up here you see a data folder, and if you click on it, we see that there are a couple of CSV files in it, plus some other files that contain the data. Then we have four files that start with the numbers 1, 2, 3, and 4. These are so-called Jupyter notebooks, and this is the format in which we present the analysis of this case study. Then there are some other files. This case study is actually based on a scientific paper, and there is a paper.pdf file included as well. You can click on it and open an article from the Journal of Statistics Education where they basically cover everything that we are also looking at, and at some points we will contrast what we find with this paper. You don't have to read it to understand the video, but if you want to dig a little deeper, it may be worthwhile. There is also a map.pdf file which shows you a map of the city of Ames in Iowa, in the United States; this is where the data set comes from. It shows all the different neighborhoods in different colors, and all the houses are from these neighborhoods. As we shall see, the neighborhood is also an important feature for whether the price is going to be high or not so high. So what do you need to replicate what you're about to see? First of all, you need a working installation of Python; I'm using Python 3.7 here. The easiest way for a beginner to get this is to go to anaconda.com. Anaconda is a private company that offers commercial products around the Python software but also has an open source distribution called the Individual Edition. You click on that, scroll down, and you can download it. The nice thing is that there are installers for Windows, macOS, and Linux, so it should work on whatever system you have. And the good thing about Anaconda is that it not only ships the current version of Python but also many so-called scientific libraries, for example Jupyter, which we will need in this case study, but also NumPy and scikit-learn (sklearn), the default machine learning library in Python. Pandas, which stands for panel data, is the default library for dealing with tabular data; it is basically the Excel replacement, if you will. There are some other libraries we don't really need, but it basically ships with everything you would need in a typical data science project. If you don't use Python from anaconda.com, your alternative would be to go to python.org and download a pure version of Python, but then you would have to install Jupyter and all the other third-party libraries on your own. So I think for a beginner, it's easier to just download Anaconda. And, as I said, this is all built in Python.
If you are missing some basics in Python, I am also the author of some Python introduction materials, which you will find at a similar URL: github.com slash webartifex slash intro-to-python. There you have files containing an entire semester course on Python, with exercises as well, and most importantly, for every chapter there is a link to a YouTube video. So you can review the basics of Python if you are still missing some of the background here. For further background on how to make this project work, there is also a help page at jupyterlab.readthedocs.io. JupyterLab is the environment in which we will program Python, so if you have any trouble installing it, or any trouble understanding the keyboard shortcuts you may want to use and what else you can do with it, this is the best resource to look that up. Then there is one more resource, which is the original Kaggle competition. Kaggle is a company where many other companies and organizations can upload data sets and make them available for free, so that individuals around the world can participate in so-called competitions and try to solve some data-driven problem. The Ames house prices data set is actually also distributed on Kaggle, even though it is available without Kaggle as well; Kaggle just hosts the competition. If after this video you are still interested in learning more about this project, what you will find there are many tutorials around this data set, but also the solutions of other groups around the globe that worked on this case study. So if you are not sure whether you have the best solution, or what else could be done in terms of math and statistics, you can take a look at the Kaggle competition page. And if I close this now, here in my web browser we see localhost:8888, and this is the JupyterLab environment running in my local Chromium browser. If you cannot install that for whatever reason, further down on the GitHub repository you have links that open the four notebooks on a service called MyBinder. MyBinder is an interactive service which allows you to open a notebook in a web browser without installing anything, so if everything else fails, you can always fall back on that and try to follow the analysis there. However, you have to know that this is a temporary environment in the cloud, so it is probably better to install Jupyter and Python and everything, as I already said, on your local machine. This is in particular better if you want to use your computer's calculating power, because MyBinder in the free version does not have much computing power available. Okay, so let's go to JupyterLab. I assume that somehow you will find a way to open the project in JupyterLab. The easiest way to download the materials, by the way, would be to click on the green button here where it says clone and say download ZIP; this is how you get all the files that I have here as a ZIP file. But if you are familiar with the so-called git tool, you could also git clone it; this is why this button is called clone. I assume you have knowledge of this already, so I'm not talking about it here. Again, the easiest solution is to just download the ZIP and unpack it; this should make it work. And then, when you open localhost here on your machine, this opens an instance of the JupyterLab environment, and what you see on the left-hand side is all the files that you just downloaded from GitHub.
And then you see this launcher here. What the launcher does is, if for example we click on the notebook for Python 3, we get a new JupyterLab notebook where we can enter any Python code in these code cells, for example one plus one, and execute it. So I can basically create new code files here in this environment and run them. However, for this presentation I prepared the four notebooks already, and we will use them to go over the case study. We start with notebook number one, called data cleaning. Whenever you open a notebook from someone else and you start to replicate an analysis on your local machine, one good practice is to click on Kernel and say Restart Kernel and Clear All Outputs. This gets rid of all the output that may have been there before, because the person that prepared the notebook saved the file with the output. Now that we run the code ourselves, we don't want any old output in there, and because of that I just cleared it; this makes sure that there is nothing left from previous runs. These notebooks that I prepared for you have lots of text in them, so they are also optimized for reading through the materials. I will go over the notebooks rather quickly, so don't be afraid if you cannot read everything as I go over it; that is not the intention. The intention is that if you want to dig deeper into a specific area of a notebook, there is lots of text and documentation that helps you do that. In this video we will only do a high-level overview of how to do data science in the context of house prices. What I do at the beginning of this notebook is what I call housekeeping: we import some libraries, for example NumPy, pandas, and so on, to make them available within this Jupyter notebook. Then, in the next code cell, I say from utils import ..., and utils is not a library that you install, but rather a file, a .py file. Here in the folder you see there is a file called utils.py, and if you click on it, a text file opens that has lots of Python code in it. What I did is I put all the code that is not so relevant to be looked at in a notebook, and also code that is reused across several notebooks, into this file. This is basically how raw Python looks if it is not written in a Jupyter notebook environment. From this module we will import some helper functions and some helper variables to make the notebooks easier to follow. So whenever you don't find some code, or don't understand where something comes from, most likely it is in this utils.py file. Okay, so let's delete this temporary notebook and continue in the notebook. Again, I import all this helper stuff, and then, further in the housekeeping, I tell pandas to show me 100 columns. By default, pandas will not show that many columns, and we set this to 100 because the data set contains a lot of columns.
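To give a concrete picture of what these housekeeping cells roughly look like, here is a minimal sketch; the helper names that would be imported from utils.py are placeholders I made up, not necessarily the notebook's actual ones.

```python
import numpy as np
import pandas as pd

# hypothetical helpers that would live in the local utils.py file
# from utils import load_raw_data, continuous_variables

# show up to 100 columns so the wide Ames table is not truncated on display
pd.set_option("display.max_columns", 100)
```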
So at first we have some code that loads the data file, and it is built such that when the data file, the CSV file, is not in the project already, it goes to the original web page, hosted at amstat.org, which is the official page where the data comes from, downloads an Excel file, prepares it, and then also temporarily stores it in your folder so that you don't have to go to this URL and load the data again and again. We call this caching: the data is temporarily saved in the data folder as well, and it was there to begin with because I put it in the repo, so even if you don't have a fast internet connection, you already have the data here. Then we run some code that puts all the data into what we call a pandas DataFrame. A DataFrame, in the pandas library, is a special data type which you can compare to Excel; it is basically how you would model Excel-like data in Python. And then we look at the first 10 rows with the .head method. .head takes a number, so if I replace the 10 by a five, for example, I only see the first five rows, and if I want to see the first 10 rows of the Excel sheet, so to say, I just say 10 and run the cell again, and then I see the first 10 rows. As I already told you, there are many columns in this data set, and we see that the last column is the sale price. This is the variable, in US dollars, that we want to predict, and in order to predict it we have all the columns available that precede the sale price column. Every sale has an order number, and, I think, a parcel ID or something like that, and then again many, many features over which we will go now.
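A sketch of such a cached loading step could look like the following; the file path, the URL placeholder, and the function name are assumptions, not the repository's exact code, and the real source is an Excel file, which is simplified here to a CSV.

```python
import pathlib

import pandas as pd

DATA_FILE = pathlib.Path("data") / "data_raw.csv"  # hypothetical cache location
SOURCE_URL = "https://example.org/ames.csv"        # stand-in for the original source page

def load_raw_data() -> pd.DataFrame:
    """Return the raw Ames data, downloading and caching it on first use."""
    if DATA_FILE.exists():                # a cached copy is already in the data folder
        return pd.read_csv(DATA_FILE)
    df = pd.read_csv(SOURCE_URL)          # otherwise fetch it from the original page
    DATA_FILE.parent.mkdir(exist_ok=True)
    df.to_csv(DATA_FILE, index=False)     # cache it for the next run
    return df

df = load_raw_data()
df.head(10)   # first ten rows; df.head(5) would show only five
```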
This is usually how a data science project starts: you are given some raw data from some source, and usually some data is missing, some data is messy, it may not be clean. Throughout the first two notebooks we will go over all the features, clean them a bit, clean the data set, and then create features out of them, and only in the last chapter, chapter four, we will do some forecasting, some predictions. Let's go over the next couple of cells rather quickly. As we see, some of the columns have spelling mistakes, and what we do is replace them with unified text strings; this is what these code cells do. Then you will see throughout this notebook that there are many so-called assert statements in the code, where I basically assert that some condition holds for the entire data set. This is how I run quick checks to make sure, for example, that a column is never empty, or that a column only contains integers or data of a certain type, and so on. You will see this quite often: I make sure that an entire column is in the format that I expect it to be. Pandas also provides some other attributes; for example, for every DataFrame, and the variable df is now the DataFrame, the variable that represents the Excel-like data, so to say, saying .shape gives us back the dimensions of the sheet. So the sheet has 2,930 rows and 80 columns. We will remove some of the rows because some of the data is not clean, and we will also create many, many more columns, because some of the existing columns are not really useful for making predictions, so we will create new, more useful columns out of them. The 80 columns that we have here can be grouped into four different groups by their generic type. One of these types is what I call continuous variables: numeric values that come on a continuous scale. Then of course there are discrete variables as well, where you have one, two, three, four, or five rooms in a house, so that would be a discrete column. But first we look at the continuous variables, and what we do here is assert, with some quick tests, that the data really is continuous. These are all the columns that hold continuous numerical data, for example the square footage of the first floor of a house, the square footage of the second floor, and so on; here we have the description as well, and then there are many more, like the garage area, the gross living area, and the lot area. These are all different measurements, but they all come as continuous numbers. If we look at the first five rows of only the continuous variables: continuous_variables here is one of the helper variables that are defined, so whenever I just write continuous_variables, this is what I imported from the utils module, basically a shortcut for all the column names so I don't have to spell them out; this is why we use utils. Here we have all the continuous data in the data set, and we see that most of it really is continuous: the square footage can take basically any value. It is always an integer here, but it is a good approximation of a continuous number because we observe many different realizations of the value; that is what makes it continuous. Then we can look at some basic statistics. Most of the continuous columns are not null, which means they are not missing, but for one column, called lot frontage, we only have about 2,400 available data points, so a lot of data is missing, and we will see how to deal with these missing data in a bit; for now we keep track of the variables that we want to take a look at later on. We do the same type of first check on the discrete variables, so let's quickly go over them as well. These are the discrete columns: for example, the number of bedrooms would be a discrete variable, the number of basements, or basically whether the house has a basement or not, which is a yes or no question, how many cars fit into the garage, and so on. Let's look at all the discrete variables here; we can already tell these are typical discrete variables. We also see the year here; the year is basically the one variable that is close to continuous, and we could argue it is continuous, but the year in which a house was built or sold is to me more of a discrete variable. The reason why I do these checks is that these different groups of variables allow us to do different things with them, and that is why we look at the groups independently.
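As an illustration of these quick checks on the column groups, a sketch like the one below would do; the short column lists are assumptions standing in for the full lists defined in utils.py, and the DataFrame df is the one loaded above.

```python
import pandas as pd

# stand-ins for the column groups defined in utils.py
continuous_variables = ["Lot Frontage", "Lot Area", "Gr Liv Area", "1st Flr SF", "2nd Flr SF"]
discrete_variables = ["Bedroom AbvGr", "Garage Cars", "Year Built", "Yr Sold"]

print(df.shape)                       # (rows, columns) of the whole table
df[continuous_variables].head()       # peek at the continuous columns only
df[continuous_variables].describe()   # a low "count" hints at missing values, e.g. for Lot Frontage

# quick sanity check in the spirit of the notebook's assert statements
assert all(pd.api.types.is_numeric_dtype(df[col]) for col in continuous_variables)
```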
Another group that we have is nominal data. If we look at some of the nominal columns, these are fields that are used as tags: a tag could be that the house is of this style, so a word describing the house, and so on. The neighborhood is what I already told you about with the map that we saw, basically just the name of the local neighborhood within Ames, Iowa, and the street name, for example, would also be a nominal feature, as would the others here. We will look at all the features in detail throughout this video; here is a brief view of the nominal variables, and we can indeed verify that these are nominal features. If we look at the statistics here, we get a fuller picture: there is only one column that has some missing data, and all the other columns are basically always filled in. The fourth category is a category of variables that is related to nominal: what we would call ordinal variables. The difference between nominal and ordinal variables is that for ordinal variables you also have words describing a feature, but these words can be brought into a natural order. For neighborhood or street name there is no order, but certain features are ordinal; let's look at an example. These are usually features that describe the quality of something, how good a shape something is in, for example what the fireplace quality is, how big it is, whether it is new or old, and so on, and these are all abbreviations that the authors of the data set used. If you want to look up what these abbreviations mean, the source where we got the original CSV file from also contains a text file where every column is described, and that is where you would read about all the ordinal characteristics. But we will change the ordinal variables soon, so let's first look at visualization. Oftentimes when we do data science, looking at data in an Excel-like format like this is already quite insightful, but visualizations are usually a lot better to quickly get an overview of the data. In Python, there is a third-party library called missingno, which helps us visualize where in a data set data is missing. What I do here is plot a so-called missingness matrix for each of the four categories separately. This gives us a matrix visualization where we have all the individual columns, with white areas wherever data is missing in a row. We see there is one column, lot frontage, which we already identified above, that has a lot of missing data, and then there is this other column here called Mas Vnr Area, whatever that is, which only has two missing data points, and all the other rows basically always have something filled in. This could still be messy or dirty data, but at least the other rows, the other sales, have all the data available. This is important to know in order to decide what to do with this column. My recommendation here would be to keep things easy and just get rid of this column, because then we don't have to deal with missing values.
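A minimal way to reproduce such a missingness matrix with the missingno library, assuming the df and the column lists from above, is:

```python
import matplotlib.pyplot as plt
import missingno as msno

# one matrix per variable group; white gaps mark cells with missing data
msno.matrix(df[continuous_variables])
plt.show()
```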
For the other column, the one with only two missing data points, we could try to extrapolate the missing values, but because there are so few, the easier way to go about it is to just drop those rows and only keep rows that have data available for all the columns. Now let's look quickly at the visualization of the other three groups. For the discrete variables we see a similar picture: for the column garage year built we don't have data available for all the houses. In most cases, I would guess, the garage is built together with the house in the same year, but sometimes the garage could have been added to the house later on, and maybe sometimes this data is simply missing. So what do we do with this data? Well, I think the year in which a garage was built is not that important; the more important thing for a house price would be whether the property has a garage at all. That is probably a more important property than when the garage was built, so we could basically also drop this column. The other two categories of variables show that almost no data points are missing, and we see that when data points are missing, they occur in different rows. So we probably have, as we saw here, around seven or eight rows that we can just remove, because they always have one missing value in some column, and this is what we will do. So this is what happens here in the cleansing part: we get rid of the two columns that had lots of missing data, so these columns are eliminated entirely, and the remaining columns we keep. Then we build a for loop that goes over the entire data set and casts the columns to the data types int or float, just to make sure that when a column is supposed to hold a discrete number, there cannot be a floating point number in it. This is typical cleaning work. At the end, we quickly print out the shape again: we dropped two columns and a couple of rows, but this saves us a lot of work extrapolating data. Then we store the result as data/data_clean.csv, and this now overwrites what we already had in the data folder. The data folder contains all the data for all the notebooks already, basically hard-coded into the repo, but if the data weren't available, this script number one would go out to the original source up here, get the data set, and at the end store a clean CSV file. That's basically the entire idea of this first Jupyter notebook.
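The cleaning decisions just described could be sketched roughly like this; the exact column names being dropped are assumptions based on the discussion, not a copy of the notebook's code, and discrete_variables is the stand-in list from above.

```python
# drop the two columns with many missing values, then the few incomplete rows
df = df.drop(columns=["Lot Frontage", "Garage Yr Blt"])
df = df.dropna()

# make sure discrete columns really hold integers, not floats
for col in discrete_variables:
    df[col] = df[col].astype(int)

print(df.shape)                                # slightly fewer columns and rows than before
df.to_csv("data/data_clean.csv", index=False)  # cache the cleaned table for the next notebooks
```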
So now that we have a clean data set, what do we do with it? What we do in the second notebook, after some housekeeping, which is basically self-explanatory, is load the clean data set with some helper functions; these are again the helper functions that come from the utils module. Then we start with the now already cleaner data and look at some features. Let's look at the numeric variables here. We sometimes have the square feet of something, for example the square feet of the entire property; where is it, I think, yes, here we have the lot area. This is the entire lot measured in square feet, but then we also have a garage area, a basement area, a first floor area, and so on. The important thing is that the individual square footages add up to a total. What that means, from a mathematical point of view, is that there are linear combinations of columns that add up perfectly to some other column, and whenever we do, for example, linear regression, we don't really want this: we want the columns to be linearly independent. So what I check here, with some quick assert statements, is that some of the columns are exactly the sum of a combination of some other columns; we did this for the living area square feet, but also for the basement, and so on. What we could then do is get rid of some of these columns, because if two columns add up to some other column, this will only confuse a linear regression later on: the regression may lean on the one column for some rows and on the other column for other rows. In other words, the coefficient, the beta that gets estimated in the case of a linear regression, may be a very unstable estimator. So it is always helpful to get rid of redundant columns, and this is what we do here. Then there is another typical transformation, which some of you may know from finance data. When we want to predict prices, or in any kind of financial model, what is often done is that we don't take the prices as a whole but the log of some column or some value. In a more general setting, this is called a Box-Cox transformation. As I said, the easiest version would be to just take the natural logarithm of a number, but what we do here is use an estimation technique that is a standard way of estimating the best such transformation for each individual column. If you want to understand this in detail, you can read through this part a bit more, but here I will just run it. What this code does is go over all the columns, taking only the continuous columns with strictly positive numbers, because a logarithmic transformation only works for positive numbers, obviously. Then it tries to estimate what the best so-called lambda would be, where a lambda of zero basically means we just take the natural log. In other words, what this suggests is that for the gross living area and the first floor area we should just take a log transformation, and for the sale price it also suggests that a plain log transformation would probably be best. So for all of these columns we keep the raw columns as they are, but we add second columns that hold the transformation. For example, here at the end we have the original sale price, and next to it we have the Box-Cox transformation with lambda zero, which is basically the natural logarithm of the value. Later on, when we do the house price predictions, we will train prediction models both on the raw value of the sale price and on the transformation.
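To make the two ideas concrete, here is a small sketch of the linear-combination check and of the Box-Cox lambda estimation; the column names and the exact relation asserted are assumptions for illustration, not necessarily the notebook's own checks.

```python
import numpy as np
from scipy import stats

# 1) redundant columns: the above-ground living area should be the exact sum of its parts
assert (df["Gr Liv Area"] ==
        df["1st Flr SF"] + df["2nd Flr SF"] + df["Low Qual Fin SF"]).all()

# 2) estimate a Box-Cox lambda per strictly positive column;
#    a lambda near 0 means "just take the natural logarithm"
for col in ["Gr Liv Area", "1st Flr SF", "SalePrice"]:
    values = df[col]
    if (values > 0).all():
        print(col, round(stats.boxcox_normmax(values), 2))

# keep the raw column and add the transformed one right next to it
df["SalePrice, box-cox 0"] = np.log(df["SalePrice"])
```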
And then we will check which of the transformations works better, or which prediction is better, because sometimes the prediction works better on the actual data, and sometimes it is better to fit a model on transformed data. We will look at both cases and compare. What I did next is create a section called correlations, and I defined that two variables with a correlation coefficient between 0.66 and 1 in absolute value are what I consider strongly correlated; if two variables are correlated with a coefficient between 0.33 and 0.66, I call them weakly correlated; and if the correlation coefficient is below 0.33, I call them uncorrelated. Then I define a helper function that plots the correlation coefficients, because, as we learned, plotting is often an easier way to look at data. We will calculate two different kinds of correlations. The first one is the classical Pearson correlation. We create a correlation matrix in visual form, and the indication is: the more solid the color, the stronger the correlation. We don't really care whether the correlation is positive or negative; all we care about in data science is whether we find strong correlation in absolute value, because at the end of the day the sign, whether it is minus one or plus one, does not matter so much; we care more about how one feature varies when another feature varies. In particular, we are interested in pairwise correlations between the sale price and the other features, and this already gives us a first graphical indication of which features may be worthwhile to dig deeper into and which not. A color close to white basically suggests that a feature has no correlation with the sale price and may therefore not be helpful to keep in the data set. Then we do the same thing again: I sort the features according to the rules defined above into strongly correlated, weakly correlated, and uncorrelated, and we get a NameError, "weak" is not defined, so I should, of course, run all the cells. This is a common mistake people make in JupyterLab: they just skip a code cell. So I calculate everything all over again; usually this works. Then we create three different sets in Python that only keep the features that are either strongly correlated, weakly correlated, or uncorrelated with the sale price. The reason I do that is that in chapter four, when we talk about prediction, I will not only contrast the effect of taking the logarithm of the price versus not taking it, but I will also contrast models that are allowed to work with all the features against models that only see the features with some correlation that I identified beforehand. What we try to analyze is whether it is worthwhile for me as the statistician to look at this manually, define such thresholds, strong, weak, and uncorrelated, which is just an assumption, so to say, and basically pick by hand the features I think are good predictors of the sale price. What we will see is that this is actually not a good way to do it. I can already give you the result here, but we will see it in more detail in chapter four.
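In code, the bucketing by correlation strength might look roughly like this; a sketch assuming the thresholds of 0.66 and 0.33 described above and an assumed column name SalePrice.

```python
import matplotlib.pyplot as plt

# absolute pairwise Pearson correlations of every numeric feature with the sale price;
# the same call with method="spearman" gives the rank-based variant discussed further below
corr = df.select_dtypes("number").corr(method="pearson")["SalePrice"].abs().drop("SalePrice")

strongly_correlated = set(corr[corr >= 0.66].index)
weakly_correlated = set(corr[(corr >= 0.33) & (corr < 0.66)].index)
uncorrelated = set(corr[corr < 0.33].index)

# visual version: the darker a cell, the stronger the absolute correlation
plt.matshow(df.select_dtypes("number").corr(method="pearson").abs())
plt.colorbar()
plt.show()
```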
And then again, here we have some code that shows us in a list which features are totally uncorrelated with the price. For example, the pool area is uncorrelated: whether a pool is at the house or not may influence the price, but how big the pool is does not seem to have any influence on the total price. In contrast, a strongly correlated variable is the gross living area, and of course this makes sense, because usually when we buy and sell houses and property, we have some factor that we multiply by the size of the house, by the living area, and then we get to the actual price. This is how price calculations are done by real estate agents, and this is why it is not surprising that the gross living area is strongly correlated with the house price. Then we have a lot of what I call weakly correlated fields. How big the first floor is, of course, is also quite correlated with how big the overall house is, and the bigger the house, the more it will eventually cost. Then we do the same with the so-called Spearman correlation. Spearman is a variant where we basically look at the order of the values: the Spearman correlation is a correlation based on ranks, on how ordinal values correlate with each other. Usually you should just go ahead and work with both the Spearman and the Pearson correlation and see what works better, even though Spearman is, of course, better suited to ordinal kinds of data. Then we do the same analysis for Spearman as we did for Pearson, and we see a similar result. However, now we also see, for example, that in the strongly correlated section the garage size and the kitchen quality show up as strongly correlated. This now enables us to also look at the ordinal variables in terms of their correlation with the sale price at the end. Then we save the data here; we haven't really removed anything, but we created some new columns, the log transformations, and of course we store these log transformations in the CSV file as well. So far we haven't done anything fancy; we have basically done what we would consider the dirty work. The data cleaning is the dirty work, because it is usually what you spend most of your time on. And the pairwise correlations are something that I would always recommend you look at in the beginning, so that you not only understand the individual columns of the clean data set, but also see some rough correlations that you can already identify, so that you have an idea of which features you have to spend more time on and for which features it does not really make sense to spend too much time. Now we go into the next chapter, chapter three. I call it descriptive visualizations, because this chapter is basically all about plotting: we will plot lots of graphs and look at individual features, and then we will briefly discuss how good a feature is for the actual forecasting model later on. Again, we do some housekeeping and then we load our cleaned and transformed data. I also always keep a .head() call at the beginning of my notebook, so that when I go over the notebook later on I always see which data I am working with in this notebook; it makes it nicer to read.
And we keep a list called new_variables here that keeps track of all the new features we will create. In this notebook we will not only look at visualizations of features, but we will also create new features out of existing ones. This is called feature generation, and it is a very important process, because sometimes you see a pattern in the data that is not easy for a machine learning algorithm to learn, so you have to prepare the data set a little bit and create new features out of existing ones to make it easier for the model to learn something from it later on. At first we create some derived variables. For example, there is a variable called second floor square feet. Now, how big would the second floor be? Usually the second floor is very much the same size as the first floor, because it is usually built on top of the first floor. So I figure that the size of the second floor itself is not really a strong feature for prediction. However, if I create a feature called has second floor, a yes or no feature which just indicates whether a second floor exists, this may be a stronger feature, because someone looking for a house may pay a premium if there is a second floor, or maybe not. We don't know yet what the structure of the house price is, but we build such yes or no features. For example, the second feature here is has basement: the total size of a basement is usually not as important as the fact that a basement exists at all. The same goes for fireplaces: we don't really care whether a house has one or two fireplaces, we only care whether it has at least one. So this is what we do here: we create new variables which are binary, and we add them to our new_variables list. I also always include a preview of the new feature on the data set, so we see what the new features look like. Again, these are zero or one features: either a place has a fireplace or not, either it has a garage or not, and no other value is possible. Now let's look at the second floor data. What I do here is create some pairwise plots: we have the sale price on the y-axis and the gross living area on the x-axis, and we see that the bigger a house is, the more living space there is, the more expensive it is. This is why the cloud of data points goes from the lower left-hand corner to the upper right-hand corner. And if we use the color here to indicate whether the house has a second floor or not, what we see is that, given a fixed area, a house with a second floor basically comes at a discount. In other words, people in Ames, Iowa seem to be willing to pay a premium if, for a given size of a house, the house has only one floor instead of two. So people in Ames don't really seem to like a second floor; at least they are not paying a premium for it. That is an interesting realization.
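A sketch of these derived yes/no features and of the colored scatter plot, with assumed column names, could look like this:

```python
import seaborn as sns

# binary features: does the house have a second floor / basement / fireplace at all?
df["has 2nd floor"] = (df["2nd Flr SF"] > 0).astype(int)
df["has basement"] = (df["Total Bsmt SF"] > 0).astype(int)
df["has fireplace"] = (df["Fireplaces"] > 0).astype(int)

# sale price vs. gross living area, colored by the new binary feature
sns.lmplot(data=df, x="Gr Liv Area", y="SalePrice", hue="has 2nd floor", fit_reg=False)
```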
Let's look at basements. If the house has no basement, it comes at a discount, but we also see that there are not really many houses without a basement. So in other words, even though in the United States in general it is rather rare for houses to have a basement, in Ames, Iowa the vast majority of houses has one, and therefore having a basement is really not a good indicator of whether the price is going to be high or not, because basically every house has a basement, except that the very few houses that don't have one come at a discount. What we deduce from this picture is that people in Iowa want the house to have a basement; they are not willing to pay a premium for it, but they want the house to have one. Let's look at fireplaces. First, there seems to be a relationship whereby a house has to be rather big in order to have a fireplace. This makes sense: if you have a bigger, let's say a luxury house, there is an increased chance that the house also has a fireplace, and small houses, which are down here on the left, tend not to have one. And if a house has a fireplace, the price seems to increase: given the same living area, having a fireplace yields a premium. This also makes sense, because we could treat the fireplace as an indicator variable for a luxury house, so to say, and for a luxury house you have to pay a premium. So these are some ways of showing this kind of thing. Garages: we see that if a house has no garage, it gets a discount; other than that we don't really see any pattern, and beyond a certain price point every house has a garage. So what we can tell from these variables is that they will take different roles in the prediction model later on. Some of the variables only seem to make sense when taken together with some other variable; for example, only for a cheap house does it make sense to look at the garage-or-not variable at all. This is how we see that there may be some complex underlying relationships between different sets of variables. But it is still good to get an idea, or at least to visualize it, so that we see what is going on. If we look at pools, this variable is quite uninteresting. Why? Because most of the houses don't have a pool, and the houses that do have one, the rare orange dots here, are all over the place. So if a house has a pool, we cannot really say anything about it yet. That's interesting, because without looking at the visualization I would have guessed that a house with a pool must be a luxury house and therefore have a higher price. But it seems this variable is fairly worthless, because we don't have lots of data on it and not many houses have a pool. So a fireplace seems to be a better indicator of luxury in Ames, Iowa; this may be different in other places in the United States. Let's look at porches. Porches seem to have a different effect than a garage: if a house has no porch, it comes at a discount, but other than that it is also not a really good variable, because most houses do have a porch. So it is a variable that may be very hard to learn from. Then let's look at neighborhoods, and here I quote the paper that I showed you in the beginning.
And the authors basically say that they found that the neighborhood plays a very large role. This is of course not surprising, because, also due to the history of redlining in the United States, you have poor neighborhoods and you have rich neighborhoods. So the neighborhood the house is in is probably the most important indicator of the house price of all. Let's visualize it, and a good way to visualize this is with box plots. We have the different neighborhoods on the x-axis, in different colors, and we see that for every neighborhood we have a median, we have the entire span, and of course we have outliers as well. The boxes in a box plot usually disregard the outliers but show you where the bulk of the houses lie. And we see that there are huge differences, not only between the averages, but also in terms of the spread and so on, when it comes to house price versus neighborhood. Now, the neighborhood variable is so far a nominal variable; it can have up to, I think, 28 different realizations. But we cannot really learn anything from a text column, so what we do in the next code cell is use the pandas get_dummies function to translate the nominal neighborhood feature into a so-called factor variable. What this means is that at the end we get 28 columns that are basically a yes or no answer to the question: is the house in this neighborhood? So we get 28 binary variables, and we can check that we did the right thing by verifying that there is exactly one 1 in every row, because a house can only be in one neighborhood, of course. We also see here that our feature matrix will grow tremendously in width, because many of the nominal features we have must be translated into these dummy indicator variables as well, and this is how we do it in Python, using pandas.get_dummies.
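The dummy encoding and the one-neighborhood-per-house check could look like this sketch, with the column name assumed:

```python
import pandas as pd

# one indicator column per neighborhood; every house sits in exactly one of them
neighborhood_dummies = pd.get_dummies(df["Neighborhood"], prefix="Neighborhood")
assert (neighborhood_dummies.sum(axis=1) == 1).all()

# replace the text column with the binary columns
df = pd.concat([df.drop(columns="Neighborhood"), neighborhood_dummies], axis=1)
```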
Let's look at the nominal features other than the neighborhood now, and let's see what role alleys play. An alley in the United States is a small road or small street behind the house, and usually trash collection happens on the back side of the house; this is quite different from how it is done in Europe. We see that not every house has this, and we also see that all the blue dots here are actually labeled NA, for not available. So we don't have lots of information on that, and this is also interesting: when I go back to the data cleaning part, what we did was delete all the rows that did not have all the entries filled in. However, now we see that even that does not mean there is no missing data, because here, in the alley column, the most common value is simply called not available. So we have to be careful: even though physically the data point is not empty, it is practically empty. Because of that, we delete the column, because a feature is really not helpful if it is missing for the majority of the data set. Then let's look at the feature called building type. There are different types of buildings in Ames; the most common one is the one-family home, called 1Fam. Then we have two different kinds of townhouses, in orange and green, then we have a duplex, which is down here, and a two-family condo, which is also rare and also down here. So we see that the type of house does play a role. To make the feature a little easier to work with, we go ahead and merge the two townhouse categories into one. Green and orange here are about the same: it seems the orange ones are slightly bigger townhouses and the green ones slightly smaller ones, but the townhouses themselves are all around here. We get a different slope in the data cloud if we only look at townhouses, and that slope is rather constant. So we lump those two groups together into one to get a stronger signal later on. We also do that for the two-family condo and the duplex, which are both down here, because if we look at the violet and red dots, they are mostly down here, so it makes sense to lump them together: in terms of pricing, the two types of houses really seem to be one. This is also something that requires manual work; a computer wouldn't see that these two categories are basically the same. As people, when we categorize things statistically, we make up some categories, but in this case two categories were made up where, statistically speaking, one category name would have been enough. So this is basically a way of cleaning up what the statistician came up with. Then, of course, we have to create dummy variables again, so we get one/zero variables again, because that is the only thing we can work with here. Let's look at the air conditioning. What we see here is that if a house has no air conditioning, it comes at a discount, so this seems to be a very important variable. And here again, this is a nominal variable which only has two labels, two realizations that it comes in: Y for yes and N for no. So is this really a nominal data type? No, not really: this is really a binary data type. Just because the raw data is provided to us as a text string that only contains Ys and Ns, we still have to get some more meaning out of it by casting it to the correct data type, which in this case obviously is again a one or a zero. So we end up with one column called air conditioning, which has a one if the house has air conditioning. Then we have a category called proximity to various conditions, and this is a very messy column, or really two messy columns: there are two columns called condition one and condition two, and they are basically columns that can contain tags; that is the best way to describe it. Let's look at what the most commonly used tags are. The abbreviation Feedr is one of the realizations that comes up quite often, and what a feeder is: this means the house is close to a feeder street, a street that funnels cars toward the next highway, so to say. So if a house is next to such a street, this may have a bad influence on the price, because no one wants to live close to a big street at the end of the day. Then there are some categories starting with RR, and these are different categories regarding railroads.
You can look up what they mean in detail, but what we do here is go ahead and lump all the different railroad categories together into one new category called near railroad, yes or no. This is the kind of transformation that needs to be done so that we can actually use these labels. Let's look at what this means; here we have a visualization. As I said, if you live close to a feeder street, you get a discount, right? So in this next piece of code I create a new variable called street, a new variable called railroad, and there is also one called park. If you are close to some park, there are several ways of saying that in the data set, and we make it one unified way of saying it: we lump all the different park categories together into one big category just called park. Now, if I run this, it also creates some plots, and we see that the different categories we have result in different slopes in the data cloud, so they will have different effects. This is a very easy kind of plot from the matplotlib and seaborn libraries, a so-called lmplot: it fits a simple linear model per group and draws it on top of the scatter plot, and we see different slopes here. This is what we are looking for as data scientists: we want to see how strongly something is correlated with something else, and also of course the slopes, which is kind of a different way of saying whether something is correlated or not. So here we create the categories I talked about and finally delete the condition columns, because the raw text conditions don't really help us in any prediction. And these are the new features we created here.
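To make this condition-tag lumping concrete, here is a sketch; the tag abbreviations follow the data documentation, but treat the exact sets and the new column names as assumptions rather than the notebook's own code.

```python
# tags from the "Condition 1" / "Condition 2" columns, grouped into coarser categories
railroad_tags = {"RRNn", "RRAn", "RRNe", "RRAe"}
park_tags = {"PosN", "PosA"}

conditions = df[["Condition 1", "Condition 2"]]
df["near railroad"] = conditions.isin(railroad_tags).any(axis=1).astype(int)
df["near feeder street"] = conditions.eq("Feedr").any(axis=1).astype(int)
df["near park"] = conditions.isin(park_tags).any(axis=1).astype(int)

# the raw text columns are no longer needed
df = df.drop(columns=["Condition 1", "Condition 2"])
```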
Let's look quickly at another feature, called the exterior. This is also a tag-like feature: it lumps together, in two columns, the two most commonly used materials out of which the house is built. We can see that there is some pattern in it, right? Some houses are built of materials that are more pricey and some of cheaper materials. So we take a look at this, but what I found is that the category is too diverse, so we couldn't really use it to make any good prediction. I'm already giving away the result we will see later, but of course I played with this data a lot more, and I found that the material out of which the house is built may in theory have an impact, but in practice, for our models, we didn't really see much of a difference, so we delete the column here. Then there is another feature called the foundation, and here we see a strong pattern: depending on how the house is founded, this seems to indicate a price discount or a price premium, depending on how you look at it. So we create dummy variables for the different foundation types as well; what they are in detail, let's not go into here, you can read it up. Again, we only keep these variables because we can really measure a linear relationship and we can see different slopes, so it really makes sense to keep them, in a clean, one-or-zero way. Then we look at some other features that are not so relevant. We look at the feature called garage type, which tells us whether the garage is built next to the house or a little further away, whether it is a carport or a real garage, and so on. What I found is that this really didn't play much of a role, so we just delete it as well; it's more important whether a house has a garage at all than what kind of garage it has. If we look at heating, we see different data points; most of the houses have gas heating, and if there is another heating type, gas water and so on, GasW, whatever that is, we see there is not too much variation here. So this is also not a good feature to learn from, and we get rid of it as well. Let's look at house style. Here we think we see a pattern, but again the slopes are not too different, and when I worked with it I found that this feature, the house style, is not so important either. It says whether it is a one-story or a two-story house, but we already have another feature called has second floor up there, which we are using. The house style is a text field first and foremost, and it is not so important because we already have other variables that carry the same kind of information, and the slopes here are not too different, so we get rid of this feature as well. Land contour: we see there is some sort of pattern here, but there are really not that many data points, so we got rid of this as well. And there are some more. Lot configuration: if we look at this, it is a little messy, there is not too much variation, so we also don't use it. You see that in general I get rid of many of these fields, because at the end of the day these are fields that either are already covered by another field we have, or they are just too messy to deal with. You cannot just take any text field, leave it in the data set, and hand it to the prediction algorithm later on: the feature has to be turned into one/zero dummy columns first, and sometimes that is too hard to do. And at the end of the day, if you have too many one-or-zero columns, you run into another problem, the so-called curse of dimensionality, and we don't want to run into that either. So we have to find a balance and not create too many variables; this is why I'm removing many of them here. Also miscellaneous: this is a column that you will find in any data set of any sort, something called other or misc or miscellaneous, and as we look at this data cloud, it is not really helpful. Let's look at the roofs. There are different kinds of roofs, but most of the houses in Ames have the same kind of roof, so this is also a field that is not really helpful, and we get rid of it as well. Now we come to some more interesting fields. There is one text field called sale info, and among other things it covers abnormal sales, in other words foreclosures: some person goes bankrupt, the house is foreclosed, the bank takes the house, and the sale does not go through a normal, slow sale process but a very rapid one.
So we can expect the house price to come at a discount here. Let's look at this. First, what categories are there? There is a normal sale, then a partial sale, and then the abnormal sale, and again, abnormal means a bank foreclosure. Let's look at a data plot, and we see that the foreclosures all come at a discount. So whether a house was basically auctioned away by a bank or not is a very important detail when we want to predict house prices. For partial sales, this seems to be where someone has a house and is only selling some part of it, some unit within the house, and this seems to come at a premium. So we keep features out of that. We have an lmplot here again, and we see that especially the partial and the abnormal sales result in different slopes, which indicates that we should keep them, and that we should keep the variables in a clean way. So we create clean one-or-zero variables here again, three new features: partial sale, abnormal sale, and new home. New home is a feature for whether a house was sold for the first time or not, and whenever a house is sold for the first time, it also comes at a slight premium. Street names as a feature are not valuable at all. We saw that the neighborhood is important, and we could argue that the street a house is in is related to the neighborhood it is in, but it is too granular: there are probably too many different streets, and then we would again run into the curse of dimensionality, where we end up with a one-or-zero variable for every street, and this will not be helpful in the prediction later on. Some more interesting features we can derive concern the age of a house. In the data set, if you look closely, you will see that we have columns called year remodeled and year built, and we also have a variable for the year the house was sold. The idea is that while it is important when the house was sold, because of inflation, so if one house was sold in 1980, another in 1990, and the next in the year 2000, we would expect the prices to increase due to natural inflation, this is not what we are looking at here. We are looking at how old the house is, the difference between when the sale happened and when the house was built, and this variable does not exist: there is no variable called age. So we just create it, and we create a similar variable for remodeling. Whenever a house gets very old, what usually happens is that it is remodeled by some construction company and then sold again, and then it usually gets a premium because it was modernized; this is what we capture with the variable remodeled, yes or no. Then we have variables called years since built and years since remodeled, so we capture the age and the time, and we see that there is some variation again. Most houses are sold when they are new, and beyond a certain age there are not so many left; this is also due to the fact that the city of Ames is not so old, I guess. So we create some one-or-zero variables here again and plot them, and we see that whether a house has been recently built is also, I forgot to mention this, a useful feature variable.
So instead of only looking at years since built and years since remodeled, which are continuous variables, we can also create feature variables called recently built and recently remodeled, where I build a one-or-zero variable by asking: was the house built within the last ten years? If yes, the value is one; if it was built more than ten years ago, the value is zero. What this does is translate the continuous variable, the age of the house or the time since it was remodeled, into a binary variable, and we keep the binary version as a secondary variable because the continuous variable by itself may not help so much. We take the binary variable as a proxy instead, and in the next chapter we will of course look at both and see which one works better. So again we get rid of some of the variables here, and then lastly, before we end this chapter, we have to take care of some outliers. Now you may wonder why I did not take care of the outliers earlier. Well, now that the data is clean, we can run an automated algorithm to detect outliers, and I chose one called the Isolation Forest; you can read up on what that is, but it is basically a machine learning method that looks at the entire data set and, given some parameters, removes rows that are too different from the bulk of the rows. This is a way of automating the outlier removal; we could do it manually as well, but the interesting thing is that when I plotted the detected outliers, we see they are all houses that are extremely big or extremely pricey, or the opposite, extremely small and extremely cheap, and they are removed. The outliers that were removed by the authors of the paper are all contained in this set as well, so in other words, our automated way of removing outliers detected all of the outliers that the statisticians behind the paper also discovered and removed, and I think we removed one or two more on top due to their statistical properties. At the end, all we do is store the data set, which is now not only cleaned but also contains all the new features we generated. We have come a long way: we started from the raw data set, did a first approximate cleaning, then looked at some correlations, and in this chapter we put in much more manual work, many more steps. The new data set is 109 columns wide; we started with the initial 78 columns, we are now at 109, and we have discarded some more rows, but at the end of the day we have transformed the data set into a feature set that we can actually use for prediction. As I said throughout this chapter, some of the raw features are simply not suitable for statistical models. In the final step we open chapter four, which is called predictive models, and there we will run some machine learning algorithms on the cleaned data set and also on the not-so-clean data set, just to show the contrast. One caveat here: this is not about getting the best possible result on the Ames house price data set. If you want to see what the best solution or maybe the best model for this prediction would be, you should go, for example, to the Kaggle competition page and look up some of the solutions there.
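A minimal sketch of the binary proxies and the automated outlier step, assuming the same hypothetical column names as above; the contamination value and the output file name are also assumptions, not the notebook's settings.

from sklearn.ensemble import IsolationForest

# Binary proxies for the continuous age variables.
df["Recently Built"] = (df["Yr Sold"] - df["Year Built"] <= 10).astype(int)
df["Recently Remodeled"] = (df["Yr Sold"] - df["Year Remod/Add"] <= 10).astype(int)

# Automated outlier removal: rows the Isolation Forest flags with -1 are dropped.
numeric = df.select_dtypes(include="number")
labels = IsolationForest(contamination=0.01, random_state=42).fit_predict(numeric)
df = df[labels == 1]
df.to_csv("data/data_with_features.csv", index=False)  # assumed output name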
What we do here instead is contrast the clean data set with the not-so-clean data set, and also showcase the subset of variables that we manually selected in chapter two by only looking at correlations, and then we will take some learnings from that. Okay, we import some packages, do some housekeeping, and load the original data set, which we call data clean. This is the data set as it was after the first chapter, so it is already a little bit clean: there is no missing data in there, for example, and a few other things have been fixed, but there is no transformation, no Box-Cox transformation, no log transformation, no new features, nothing. We call this data set df1. Then we encode the ordinals; there is a helper in sklearn, the standard machine learning library in Python, that can encode the ordinals automatically, which basically automates all the work we put in in chapter three. However, I will show you that sklearn does a much worse job than we did, because we as humans can see that there is an underlying story, an intuition, behind the nominal values; that is why we dropped some of the columns, lumped several values together into one uniform value, and did all the cleaning in chapter three. The automated approach in scikit-learn cannot do that; it can only create dummy variables, zero-or-one variables, out of the ordinal and nominal characteristics. That means we can now pass this data on to the sklearn models, but the data set is not really good: this is the minimal transformation needed to make it work mathematically, and as we will see, it does not produce a good result. The data set now contains only numbers; all the nominal values are gone. We store it in a feature matrix X1 and a target vector y1, the sale price, because the convention in machine learning is that a big feature matrix X is fitted to a target vector y, so we are no longer using the pandas data frame but real matrices and vectors from NumPy. Next we look at the improved data: this loads the data set from the end of chapter three, so it contains the transformations and the factor variables we created, we apply the same kind of encoding here as well, and we store the result in X2; and because we have a log transformation, we now have y2 and y2 log for the price on the original and on the log scale. Then, for comparison, we do one more import, which is what I mentioned in chapter two: we import only the variables that are strongly correlated with the sale price and also the ones that are weakly correlated. As I told you in chapter two, where we had those correlation matrices, what we could do as humans is drop all the features that are only weakly correlated, or not correlated at all, with the sale price and simply not look at them in the prediction. That is what we do here: we only work with the strongly and weakly correlated features, that is, only the features that have a correlation coefficient of at least 0.33 with the sale price. This is something we could have done as humans, but as we will see, it turns out to be something we should not do.
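A rough sketch of these three data preparations, under the assumption that the files and the "SalePrice" column are named as below; the notebook may differ.

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Original data with only the minimal encoding: every text column becomes integer codes.
df1 = pd.read_csv("data/data_clean.csv")
text_cols = df1.select_dtypes(include="object").columns
df1[text_cols] = OrdinalEncoder().fit_transform(df1[text_cols])
y1 = df1["SalePrice"].to_numpy()
X1 = df1.drop(columns=["SalePrice"]).to_numpy()

# Improved data from chapter three, once on the original and once on the log scale.
df2 = pd.read_csv("data/data_with_features.csv")
y2 = df2["SalePrice"].to_numpy()
y2_log = np.log(y2)
X2 = df2.drop(columns=["SalePrice"]).to_numpy()

# Manual pre-selection from chapter two: keep only columns whose absolute
# correlation with the sale price is at least 0.33.
corr = df2.corr(numeric_only=True)["SalePrice"].abs()
keep = corr[corr >= 0.33].index.drop("SalePrice")
X3, y3 = df2[keep].to_numpy(), y2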
The point of this pre-selected set is just to show you what happens; the learning from it will be that you should always give the machine learning algorithms all the features you have, as long as they are clean, and not try to sub-select them and be smarter than the machine learning algorithm, because once you have a very clean data set, the algorithms are already very good at selecting their own feature set. So we do that here and store the result in the variables X3, y3, and y3 log, and now we have three data sets as matrices and vectors. In the next step I create a helper function called cross-validation, which performs a cross-validation for any machine learning model I pass in; by default it does 10-fold cross-validation. What cross-validation is, in a nutshell: we take a data set and split it into ten equal parts, we take nine of them, so 90% of the data, and fit a model, then we predict on the 10% that we did not use for training and calculate an error measure on that 10%; then we do the same with another set of nine folds, and we repeat this ten times until every 10% of the data has been used for evaluation exactly once. This way we train the model ten times, make predictions ten times on data the model has not seen before, and then average the error, which gives us an unbiased estimate of how the model would perform in the real world on unseen data, because the whole idea of the house price prediction is that we want to predict the price of a house we have not seen before. Inside this helper we calculate different error measures; the most familiar ones for you are probably the root mean squared error, which most of you should know, and maybe the R2, and then there are some others like the bias, the mean absolute error, and the maximum deviation, but we will focus on the R2 and the RMSE here. So we define the helper function; you will often see this in an analysis, where you define one or two helper functions once and then run them over and over again. We also create a Python dictionary called results, where we store all the results so we can compare them at the end. Now we run our first couple of models, starting with a simple linear regression as we all learned in stats 101. How this works in sklearn is the following: we take the algorithm called LinearRegression, which we imported before, we have to initialize it, which we do with the call operator, and we store the model in a variable called lm. Then we pass lm into the cross-validation function as the model variable, and somewhere inside it the code says model.fit, so we call the .fit method on the model, and afterwards the .predict method; this is how scikit-learn fits a model and then predicts on new data, and it is all automated in this cross-validation function. So let's create a new linear regression model and run it on our original data, which we barely cleaned and for which we did not generate any features yet, run the 10-fold cross-validation, and get back results. One thing that already tells us something bad happened is that the R2 is negative, essentially negative infinity.
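A rough sketch of such a helper, not necessarily the notebook's implementation, reusing the hypothetical X1 and y1 from the sketch above:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

def cross_validation(model, X, y, n_splits=10):
    """10-fold CV: fit on nine folds, predict the held-out fold, average the errors."""
    rmses, r2s = [], []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=42).split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(mean_squared_error(y[test_idx], pred) ** 0.5)
        r2s.append(r2_score(y[test_idx], pred))
    return {"rmse": np.mean(rmses), "r2": np.mean(r2s)}

results = {}  # collects all runs for the comparison at the end
results["linear, original"] = cross_validation(LinearRegression(), X1, y1)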
The R2 is usually between zero and one, and only in rare circumstances can it actually be negative; when it is, something has gone terribly wrong in the model. So this already indicates that the linear model on the raw, or rather least cleaned, data set is not a good model, and the root mean squared error is also very, very high, so we should not really trust this model. It is the easiest benchmark we have, but it is a bad one. Now let's use our improved data, which contains all the new features we generated in chapter three. We do that for two cases, once for the normal price and once for the log-scale price. We run the linear regression and immediately see an R2 of 0.92 and a much lower root mean squared error than above. That means all the time we put into cleaning the data set and generating new features was really worth it: we improve the prediction by a lot, and we have a very low bias. A bias of roughly negative 89 dollars means that on average our model predicts a price that is about 90 dollars too low; if we go back to the original run, the bias there was basically plus infinity, so now we understand these measures even better. Now let's run it on the log scale as well. On the log scale the R2 goes up and the root mean squared error goes down a little, and interestingly the bias is, in absolute terms, a little higher than before. In other words, training the model on a log scale, meaning the sale prices we fit the model on are log-transformed, leads to a slightly higher bias, but the overall R2 and root mean squared error improve; by giving up a little bias we get a better model on average. So using a log scale for the prices in this housing data set seems to improve the situation. This does not have to be the case in general; log transformations are most often used for rates of return, most notably in finance, but they help here as well. Now let's do a third scenario for the linear regression: we use the improved data set, but only the variables that have a strong or weak correlation with the sale price, so we basically drop all the columns that appear almost white in the correlation heatmap, all the features that seem to have no correlation with the sale price. If we run this once for the normal scale and once for the log scale, the R2 goes down and the root mean squared error goes up. In other words, by giving the linear regression only the strongly and weakly correlated features and dropping all the seemingly unrelated, or to use the correct word, uncorrelated features, we actually get worse prediction performance. That means that in this case we as humans cannot outsmart the computer: the linear regression model is already better than us at selecting features.
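Continuing the sketch from above, the three runs just described might look like this, again reusing the hypothetical X2, X3, and targets:

results["linear, improved"] = cross_validation(LinearRegression(), X2, y2)
results["linear, improved, log"] = cross_validation(LinearRegression(), X2, y2_log)
results["linear, pre-selected"] = cross_validation(LinearRegression(), X3, y3)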
So how does the linear regression model select features? Basically, a linear regression model has beta coefficients, and a coefficient close to zero means the regression puts almost no weight on that feature; this is how the linear model can effectively get rid of a feature. It is also important to note that the linear regression here is just a simple one, without interaction terms and so on, so it could be improved by making the model a little more complex; however, we will not do that here. Again, if you want to know how that could be done, check the Kaggle competition and the solutions that win it. What we do instead is use another linear model called the Lasso. The Lasso is a regularized linear regression, similar to the next one we will look at, which is called Ridge regression, and both the Lasso and the Ridge constrain the beta coefficients differently than ordinary linear regression does. Roughly speaking, the Lasso sets a coefficient to a hard zero if it does not pull enough weight, so for a coefficient to stay non-zero it has to matter significantly, while the Ridge shrinks the coefficients towards zero. Now, the Lasso has to be calibrated: it takes a parameter called alpha, which has to be optimized, and we use a so-called grid search for that, so we not only do the k-fold cross-validation but also optimize the alpha value. First we run the grid search to find the best possible alpha; at the end of the day this works similarly to cross-validation, we try different alphas and at the end choose the best one. Let me copy the alphas we search over into their own cell: these are all the alphas we look at, and the grid search determined that an alpha of 20 gives the best result. Now we use the best alpha in the Lasso model and do the cross-validation to get an unbiased estimate. We see that the unbiased estimate on the original data is now at least stable: in comparison to the simple linear regression run on the barely cleaned original data set, which resulted in a negative R2, we now get an at least somewhat okay R2 of 0.81. But it is still bad, still much worse than the linear regression on the improved data set above. What we learn from this is that the Lasso, and also the Ridge regression, make the linear regression model more stable in a sense, but if we do not give the model a good data set with well-generated features, we still get bad results. So let's run the Lasso on the improved data set, and also on the log scale. Now we see that the Lasso has an R2 of 0.925, and if we go back up to the normal linear regression, its R2 was a little higher, so the Lasso is a little worse for the improved data set than the plain linear model, and this is something we can only find out by trial and error.
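A hedged sketch of the grid search described here; the candidate alphas below are my assumption (the notebook's exact grid may differ, although an alpha of 20 reportedly came out on top):

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Tune alpha with a grid search, then get an unbiased estimate with the helper above.
grid = GridSearchCV(Lasso(max_iter=10000),
                    {"alpha": [0.1, 1, 5, 10, 20, 50, 100]},
                    scoring="neg_root_mean_squared_error", cv=10)
grid.fit(X2, y2)
best_alpha = grid.best_params_["alpha"]
results["lasso, improved"] = cross_validation(Lasso(alpha=best_alpha, max_iter=10000), X2, y2)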
If we again go ahead with our manually chosen features, the R2 and the RMSE become even worse, so again the general rule is: whenever we manually pre-select the features, we get worse results. We should always give the model all the clean columns we have and let it make the selection in an automated way, not make it on our own. Let's look at the last linear model, the so-called Ridge regression. Here we also have a grid search, because we also have to optimize a parameter called alpha. The Ridge regression is likewise able to work with the essentially unclean data set, but the resulting R2 of 0.85 is still not good. If we use our improved data set again, the improved data set on the log scale yields a very good result, and if I manually pre-select features I get a worse result again, so the rule holds: as I said before, don't try to be smarter than the machine learning algorithm. Now let's look at a different family of algorithms. We have looked at three linear models; now let's look at a tree-based model, most notably the random forest. I can already tell you that I like the random forest very much, because it is a very flexible model in terms of the patterns it can learn, and it is also a model that does not require you to clean the data set to the absolute maximum: for example, instead of one-or-zero dummy variables we could have a yes-or-no text variable and the random forest would still work. Of course I give it the data set with the dummy variables here, but if you need a quick-and-dirty approach to get a first indication of how good a prediction could be, you should always include a random forest, because you do not have to generate all the features as dummies and it can still work with the data. The random forest has one downside, and we can already see it when I run it: it takes some time. Here we run the 10-fold cross-validation, and the random forest regression builds 500 so-called randomized trees within the same forest; each tree makes a prediction, the individual predictions are combined into one overall prediction, and training all these individual trees takes time. Now it is done, and we see that the random forest on the dirty, or unclean, data set from chapter one already reaches an R2 of almost 0.9. This is something the linear models could not do; they only reached about 0.85 in the best case for the original data. The random forest is already able to exploit some of the features on its own, so the learning is: either you spend lots of time generating features manually, or you rely on a somewhat more sophisticated algorithm like the random forest, in which case you unfortunately cannot use the linear model here, but the random forest spares you some of the manual feature generation work. Now let's run the random forest on the normal scale and the log scale for our improved data set; again this takes some time. One way to speed this up would be to use fewer trees, and I think with 50 to 100 trees in the forest we would already get a very similar result. Usually, when you work on a big data set, you start with a forest that does not have many trees, say 100.
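Sticking with the same sketch, the Ridge and random forest runs could look like this; in practice the Ridge alpha would be tuned with the same kind of grid search as the Lasso, and the fixed value here is just an assumption to keep the example short.

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

results["ridge, improved, log"] = cross_validation(Ridge(alpha=10), X2, y2_log)

# 500 trees as described; fewer trees (say 100) usually score similarly and run faster.
rf = RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
results["forest, original"] = cross_validation(rf, X1, y1)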
Then you increase the number of trees and run the same model over and over again until you see that the additional benefit is gone; at that point you stop growing the forest and use that number of trees for all the other models you build, so this is a parameter you have to optimize manually as well. Now let's look at the result: we get a better result with our improved data set, although here the normal scale is slightly better than the log scale; we can neglect that, it is not that big a deal. But the linear model was even better, which tells us that even though the random forest spares us from doing all the feature generation to the maximum, doing the feature generation manually, together with a linear model, pays off, because the house prices seem to be better explainable with linear models. To get the best result we have to put in the manual work; there is no way to get the best result without it. Let's run the last two cells here just to be complete: this runs the random forest with the manually pre-selected features from chapter two, using the strongly and weakly correlated features, and we see the result is not as good as when we give it all the features. Of course the random forest is also very good at selecting features, it does so implicitly, so again the learning is: if you have data in a clean format, just give all of it to the machine learning model and let the model select the features. Now let's look at the overall results from another angle, starting with the two most common error measures. First the root mean squared error: if we compare all the results on the original, unclean data set, the random forest is the best model, and the raw linear model is not stable, it basically does not work; if we use a Ridge or a Lasso regression we can make it work, but if we do not want to process our data, if we do not want to put in the manual cleaning and feature generation work, a random forest may be the best model in terms of root mean squared error. If we use our improved data set, the one we spent so much time cleaning, we see that we get a much better result with a linear model, not the random forest; however, we had to spend all that time cleaning, so this is the improvement we buy with manual work. The logarithmic transformation gives an even better result, so in this situation the best approach is probably to put in the manual work, use a logarithmic transformation of the price, and then use some sort of linear model. Our pre-selected feature set lands somewhere in between, so once we have a clean data set, let's not try to be smarter than the machine learning algorithm. Finally, let's also compare everything with the R2 measure. With R2 we of course get the same ordering: for the original, unclean data set the random forest would be the best and explains roughly 90% of the variation in the sale price, and for the improved data set with the logarithmic transformation the Ridge regression is the best model.
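One simple way to do this kind of side-by-side comparison with the results dictionary from the sketch above; this is an illustration, not the notebook's code.

# Turn the results dictionary into a table and sort by each error measure.
comparison = pd.DataFrame(results).T
print(comparison.sort_values("rmse"))
print(comparison.sort_values("r2", ascending=False))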
And if we look at the pre-selected data set, we are somewhere in the middle again, and that is the big learning. So let me summarize what we looked at in this case study. We first looked at how to open a data set and examine it at a very high level; remember that in my mind I grouped all the individual features into four big groups, continuous, discrete, nominal, and ordinal. This is something you should always do, because variables of these four types require different treatment later on. We did some very rough cleaning by getting rid of rows that were obviously missing data and columns that were barely filled in, and the visualization was a nice help there. Then we realized that this was not enough: in chapter three, where we did all the manual work, we saw that some of the fields, for example, were not empty but contained a value like "not available", so they were effectively empty even though the data did not say so, and we had to spend manual time to notice this and then automate the fix. The correlations I do not want to disregard here: even though they were not helpful for choosing a better selection of features for the prediction, they are still worthwhile to compute in the beginning to get a rough idea of which features could work. If we are time-constrained in a real-world scenario and cannot go over all the features and put in manual work for each, then it may be a worthwhile idea to start with the features that have a strong correlation with the sale price, put in the manual work for those first, and see how far we get; if we are happy with our predictions, we keep it at that, and only if we are not happy do we include more and more of the features that are seemingly unrelated, or uncorrelated, to the sale price. In this light we can also explain some of the work we did in chapter three in the feature generation part: for features that are seemingly uncorrelated with the sale price, we discovered patterns that the computer would not discover, and by lumping different categories together, by creating one-or-zero variables out of nominal variables, or by deriving new variables from existing ones, we did manual work that the computer these days cannot yet do. Finally, we ended up with what is basically the easiest part of machine learning, which is just running the models; this is the part that requires the least manual work. We learned that it does not pay to try to be smarter and do a manual pre-selection; automated feature selection is the way to go. And we saw that in this case it is clearly worthwhile to use a linear model, and a linear model requires clean data, so in order to get the best possible result you have to put in the manual work; there is no way around it. What we did not do in this case study is use any deep learning methods, and that makes sense: to use neural networks and deep learning methods you need a much bigger data set, with the number of samples at least in the tens of thousands, maybe in the hundreds of thousands, and we only had roughly 2,300 rows, 2,300 samples of data, so doing deep learning here does not make sense.
That is one reason why we could not do that. What I would do now, if I had more time and wanted a better result, is probably try some other machine learning models; in particular I would stick to linear models and try to get some interaction terms into them, to do, let's say, a quadratic linear regression, something like that. That is where I would now spend some time to see if I can get better results, but we are not doing this here. Okay, so this is it, this is the case study. I hope you liked it, and again, here is the link: github.com slash web artifacts slash ames housing. This video will also be made available, and if you have any comments I would appreciate receiving them, for example as an issue where you raise a question like: why did you do it this way, couldn't this be done better? Or, if you come up with a better solution to this problem, maybe just clone my project, make some improvements, and open a pull request to merge your improvement into this project; that would be a nice contribution. Other than that, I hope you liked it and learned a whole lot, and I will see you soon on the channel.