I'm in the last year of my PhD in statistics in the US, so I'm teaching one of the sessions. He works with R day in and day out, and he'll be conducting the second session, on plotting — basically data cleaning and a bit of modeling, with modeling and data visualization tomorrow morning. So I'll let Kasia take the stage; please feel free to ask questions at any point. This session is very important for your final projects: this afternoon we walk through a project end to end, which is essentially what you will do in your final project — how to think about processing your data in a realistic use case. Hopefully you will learn something useful. Enjoy! Okay, so before we start, let me talk about the process and sequence of data analysis. There are many different kinds of data. If you do finance-related work, you will encounter time-series data. If you do business or customer-related work, you will have relational data, like the tables in SQL that you join together. When you are faced with such data, how do you proceed? There are several steps. The first step, of course, is to load your data: you have to get the data into the computer before you can analyze it — whether it comes from MySQL or another database, from a CSV or TXT file, or from a so-called NoSQL store, eventually it gets loaded into the program. The second step, after loading, is to look at your data, and there are several different ways to do that.
You can describe your data and make a summary: how many variables there are, what each variable is, its variance, its mean — the basic characteristics of the data. The data can even be an image; I'm currently analyzing brain MRI scans, so it can be a data volume. It can be data of all sorts. After that first look, you have an initial understanding of your data, and then you have to do data cleaning before you analyze it. You clean because the data set may have problems: recording errors, wrong values, missing values. After cleaning, with a basic understanding of the data set, you propose a model. For time-series data you might propose smoothing models, regression models, tree models — or the fancier recent models such as neural networks and deep learning. After you propose your model, you feed your data to the model. In machine learning this is called training the model; in statistics it is called fitting the model. They mean the same thing: you propose a mathematical structure and feed your data into it so as to minimize a certain kind of error. For example, in linear regression — which you just learned — you have a line and a lot of dots around that line, and somehow you minimize the distance of all those dots to the straight line you are fitting. After you fit your model, you do model diagnosis. Every model comes with model assumptions, and if the diagnostics show that some assumption is not satisfied, that model should not be used — you may have to change the model.
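The propose-fit-diagnose loop just described can be sketched with a toy linear regression in R. The data here is simulated purely for illustration, not the course data set:

```r
# Simulated data: a known straight line plus noise
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 3 * x + rnorm(100)

# "Fitting" (statistics) / "training" (machine learning): least squares
fit <- lm(y ~ x)
coef(fit)  # estimated intercept and slope, close to the true 2 and 3

# Model diagnosis: residuals should look like mean-zero noise
mean(residuals(fit))
# plot(fit)  # the standard diagnostic plots (residuals, Q-Q, leverage)
```

If the diagnostic plots showed, say, a curved residual pattern, the linearity assumption would fail and you would go back and change the model, exactly as described above.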
The difference between a newcomer and a university professor is that the professor knows more models — more tools in the toolbox. When you first start, you have linear regression and a few other very basic models; you have fewer tools. That is essentially the difference, but the process is almost the same. Once you have fit the model and the diagnostics look fine, you can proceed to prediction. Because we are learning R, which is very powerful and has a lot of different packages, the package you choose usually provides a prediction function already — no need to write the prediction yourself. Then you have achieved your purpose. Why do machine learning, statistical learning, any kind of computer-based data analysis? For most of you, the eventual objective is to predict something from the data set. For example, in finance there is something called technical analysis — that is also machine learning. Image processing also predicts something: you learn from human faces and then, given another face, you can recognize whether it is smiling or crying. Detection. Now we are going to use a very small data set to demonstrate this process. We will run through the whole pipeline so you get an idea of what happens. The first step is to load your data. I think you already learned this command: setwd, where WD stands for working directory. You have to set your working directory. (To make the text bigger in RStudio, by the way: Tools, Global Options, Appearance, and increase the font size — say to 24.) So, how do you set the working directory?
If you are running under Windows, there is a very simple method. You save your files into a certain folder; to load the data into R, you go to that folder, open the file's Properties, and there you see its location — the working directory. Copy it (Ctrl+C), close the dialog, and paste it into setwd() between quotation marks. But you have to change each backslash into a double backslash, otherwise R will return an error. (You don't have the file yet? No problem, that's my mistake — I'll upload it; one minute. There, it's in the Session 3 R Part 2 folder.) So you set the working directory to the folder where you saved your file, and then you load the file. I'll also put the code into a file called "code and notes", so you can copy and paste instead of typing — that will be easier for you. Let me explain. The file is analysis.txt. It could be a CSV — then you would just use read.csv instead — but here we use read.table with the file name. (On the shared drive: the file was there and disappeared — someone must have cut the file instead of copying it. Please copy it to your computer; if you cut it out of the folder, it disappears for everyone. Normally you should see something like this.)
Inside your folder you should have one file called analysis.txt and another called "code and notes". I'll paste the code into that second file so you don't need to type everything each time — it will be simpler for you. Go to the Google Drive, find the Session 3 R Part 2 folder, download analysis.txt to your computer, put it into a folder, and then do what I did in RStudio: set your working directory to that folder. To find the directory of a file, right-click it, open its Properties, and there you have the directory to copy. (If you cannot get into that folder on the Google Drive, let me check the sharing settings — otherwise I'll offer you another solution.)
As a fallback I'll also drop the required files into the Session 1 folder, since everyone has access to that one. Okay, everyone, please go to the session folder and download just the one file called analysis.txt — it's the 42 KB text file. Only copy it; don't delete it and don't put anything new inside the shared folder. Has everyone got it? If you have any problems, please raise your hand; if there's no problem, we'll continue with the lesson. Are we on the same page — is everybody able to load it into their own directory now? Can I have a reply, please? Yes? Perfect, then we'll move on. Don't worry — after all, this is just the first step, and I'll keep writing everything into the "code and notes" file so everyone can follow. Okay. So the first step is to set your working directory: go to the folder where you saved the file, highlight the file, open Properties, and copy the full directory — you won't miss it.
You copy the full directory, put it inside double quotes, and change every backslash into a double backslash. Then you call setwd, and then you load your file with this command. The command does several things. First of all, read.table takes your file name, with the .txt extension. It can be another file type — CSV, for instance; someone just gave me an Excel file, and Excel files can actually be read as well, but it's more complicated. Then header: do you have a header or not? In our situation we have no header, meaning the first line of analysis.txt is data, not names. Let's open the analysis file. This is its content: a lot of data — runs of X's, numbers, some categorical variables. What do we mean by a header? The header is the first line; if you do have one, the first line contains the names of the features — for example, the first one here should be "season". We have no header, so we have to attach the names ourselves to what is called a data frame. The previous lecture already talked about this: by reading the file you create a data frame, the basic data object in the R environment. All your manipulations and calculations operate on data frames. You can think of it as a big matrix — in this case a two-dimensional one: each row is one recording, each column one feature of that recording. It's like a spreadsheet, where each row is a record and each column holds one characteristic of that record. So the first argument is the file name, then the header; we don't have one, so we set header = F, and so forth.
Next, the decimal separator: sometimes the decimal place is represented by a comma, but here we specify dec = "." so R knows what the decimal mark is. Then col.names, the feature names of the data: season, size, speed, mxPH, then the other chemical names, and then the algae columns. Finally na.strings tells the computer what counts as NA. A string is a series of characters; just now in the file we saw runs of X's where nothing was recorded — a problem with the data set itself. The standard name for such a value is NA, "not available". In other data sets the marker can be something else entirely, but here it is a run of X's, so we tell the computer: whenever you encounter that string, it is NA. R then replaces every such run with NA, which makes later manipulation of the data set much easier. Once you run this, you have loaded your data: 200 observations of 18 variables. After that, let's look at the data — say the first rows. Whenever you have a data frame loaded, you can get a direct visual impression with head() on the name of your data frame, which here is called algae. (The code is in the "code and notes" file.) The second argument is how many rows you want — it can be 1, 10, or 200; if you don't write anything, the default is 6. So you can look at the data.
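Put together, the loading step looks like the sketch below. To keep it self-contained it first writes a two-row stand-in for analysis.txt; the column names are abbreviated, and the exact NA marker (a run of X's, written here as "XXXXXXX") is an assumption — match whatever marker appears in your copy of the file:

```r
# A tiny stand-in for analysis.txt: no header line, X-runs mark missing values
tmp <- tempfile(fileext = ".txt")
writeLines(c("winter small medium 8.00 9.8 60.8 6.2 578 105 170 50.0 0.0",
             "spring small medium XXXXXXX 8.0 57.8 1.3 370 428 558 1.3 1.4"),
           tmp)

# Same pattern as in class: no header, dot decimals, our own column names,
# and na.strings telling R which marker means "not available"
algae <- read.table(tmp, header = FALSE, dec = ".",
                    col.names = c("season", "size", "speed", "mxPH", "mnO2",
                                  "Cl", "NO3", "NH4", "oPO4", "PO4", "Chla", "a1"),
                    na.strings = "XXXXXXX")
head(algae)           # first rows of the data frame (default is 6)
is.na(algae$mxPH[2])  # TRUE: the X-run became NA
```

With the real file you would skip the writeLines() part and pass "analysis.txt" directly, after setwd() has pointed R at the right folder.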
Let me explain this data set a little. It contains measurements of 200 water samples from different rivers. Inside you find a lot of chemical values. Rivers contain algae, and among them are harmful, toxic kinds — a1 through a7, seven kinds of toxic algae — whose quantity in the river you want to monitor. Sampling the river and running a full analysis every time would be very troublesome, so you want to automate the process. Chemical analysis by hand is difficult, but it is simple to put a monitor in the river that measures the chemicals in the water, and from those readings you can tell whether the probability of a certain kind of algae being abundant is high or low, so that you can take measures. That is the scenario behind this data set. Looking at it: these columns are the quantities of the seven kinds of harmful algae. This one is the season when the sample was taken — winter, for example. Then the river size: small, medium, or large. Then the water speed: high, medium, or low. mxPH is the pH value — acid or alkaline — and mnO2 and the rest are the other chemicals. These are exactly the values a detector placed in the water can read automatically, so you don't have to go there personally, take a water sample, and wait a week or a month for the lab result. From this we can produce a model that predicts the quantity of harmful algae automatically. That is the background of this data set. Now we have it.
We have loaded it into the computer and are ready for the next step: looking at the data to get a first understanding of it. How? A very simple step: just type summary, and R will automatically give you a lot of results about each variable. For example, for season it counts the levels: autumn, spring, summer 45, winter 62, and so on. For the numeric variables you get the minimum, first quartile, median, mean, third quartile, and maximum. You all know minimum and maximum; the median is the middle number by index, the mean is the average, the first quartile is the value below which 25% of the data falls, and likewise the third quartile. From this you can already see whether the data is skewed or not. But it's not that clear from numbers alone, so you can make some plots. Just now you learned how to do a histogram: algae, dollar sign, the variable. You apply what you learned in the previous session. The data frame name comes first, the dollar sign is the selector, and after it you choose whichever variable you want to plot — in our case mxPH, but you can plot other things as well. If you look at the left side of the histogram, by default it shows the frequency. If you don't want frequency, you can switch to the probability scale by adding prob = TRUE, and the left-hand side will display the density instead. Which one you want is up to you. You can also add some lines — say, a density line — to the plot you just made.
That draws the density curve over your histogram. The na.rm = TRUE argument removes all the NA values, because NAs cannot be calculated with: when you compute a density, R is essentially doing a smoothing fit behind the scenes, performing real calculations, so the missing values have to be removed first. And the density is computed on which variable? algae$mxPH here, but you can do the same for the others — we have a lot of variables; try PO4, Chla, and so on. We can add one more thing: rug() plots something very interesting — the actual data points along the bottom of the plot, so you can see where each data point is located. Now, you all know what a histogram is, right? Think of a histogram as a series of bins. For example, mxPH ranges from a minimum of 5.6 somewhere here to a maximum of 9.7 somewhere there, with the median in between. A histogram chops that axis evenly into segments, each segment being a bin — like a dust bin — and drops every number into the bin it falls in. Count the numbers in each bin and you get the histogram; that's the layman's description. If your bins become small enough — infinitesimally small — the outline approaches a curve.
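Those three commands together look like this, on a synthetic pH-like variable standing in for algae$mxPH (the values are simulated, with a few NAs mixed in to mimic the class data):

```r
# Synthetic stand-in for algae$mxPH: roughly bell-shaped, with missing values
set.seed(2)
mxPH <- c(rnorm(195, mean = 8, sd = 0.6), rep(NA, 5))

hist(mxPH, prob = TRUE, main = "Histogram of mxPH")  # density scale, not counts
lines(density(mxPH, na.rm = TRUE))  # density() must have the NAs removed
rug(jitter(mxPH))                   # each data point marked along the x-axis
```

Note that hist() silently drops the NAs on its own, but density() errors without na.rm = TRUE — which is exactly why the argument appears in the class code.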
With infinitely many values in infinitesimally small bins, you get a smooth curve like this; in mathematics that's a limit — it goes to infinity. Let's continue. Which argument? prob = T: T is TRUE, F is FALSE. If it is FALSE, the axis shows the frequency. Let's go back: without it, we get a plot where, say, 70 means there are 70 numbers in this bin, around 60 in that bin, around 30 in another. If you set prob = TRUE, the axis instead runs from 0 to about 0.7 and shows the density scale. Are you okay with histograms? A histogram is nothing mysterious: you put your data into bins and read off how high each pile is. So we continue with other plots. We can draw box plots: algae$oPO4, a box plot of another variable. oPO4 is orthophosphate — chemically it contains the PO4 group, so it is also a phosphate. Just now in summary you saw oPO4 as bare numbers, which are not easy to understand, so you make a plot to get a visual understanding of the data. This is the box plot of that particular variable. In the same way you can display all the data points along the side with rug: side = 1 puts them along the bottom, on the x-axis; side = 2 puts them on the y-axis. jitter means that if two points fall in the same place, a little random noise is added to separate them — that's all it does.
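Here is a sketch of the box plot with the data rugged along the y-axis, again on synthetic right-skewed values standing in for algae$oPO4 (the numbers are made up for illustration):

```r
# Synthetic stand-in for algae$oPO4: right-skewed, with two extreme values
set.seed(3)
oPO4 <- c(rexp(198, rate = 1/40), 350, 420)

boxplot(oPO4, ylab = "oPO4")
rug(jitter(oPO4), side = 2)  # side = 2: data points along the y-axis
                             # (side = 1 would put them on the x-axis)
```

In a right-skewed variable like this the mean sits above the median, which is exactly the kind of thing the box plot lets you see at a glance.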
It's good to have. You can also continue and add a line to this plot: the mean of this particular variable, again with na.rm = TRUE to drop the NAs. No need to type any of this — I'll put the code into the notes file, so just copy and paste, and listen while I explain. What's the good point of a box plot? You can easily see whether the data is skewed or not, and you can easily detect the outliers. Here is the median, here the mean, here the first quartile and the third quartile, and all those extreme points outside the whiskers are the outliers — later on you may want to remove them or examine them further. So now let's talk about extreme values: how to find them, how to identify the so-called outliers. Here I plot another variable: algae$NH4, the ammonia in that water body. Obviously this point is an outlier, but you don't know which observation it is — and you want to find out so you can remove it, right? There are two ways. One is the visual way. Before we do that, we can draw some reference lines to play with: you can add the mean of all the values, and you can add the standard deviation. Mean plus one standard deviation is already high, but that one point is way higher still.
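The scatter of observations with mean and mean-plus-one-SD reference lines can be sketched like this; the values are simulated, with a 24,064 point appended to play the role of the ammonia outlier from the class data:

```r
# Synthetic stand-in for algae$NH4, with one huge outlier appended at the end
set.seed(4)
NH4 <- c(rlnorm(199, meanlog = 5, sdlog = 1), 24064)

plot(NH4, xlab = "observation index", ylab = "NH4")
abline(h = mean(NH4, na.rm = TRUE), lty = 1)                          # mean
abline(h = mean(NH4, na.rm = TRUE) + sd(NH4, na.rm = TRUE), lty = 2)  # mean + 1 sd
# identify(NH4)  # interactive: click points, press Esc, indices are returned
```

The identify() call only works in an interactive graphics session, which is why it is commented out here.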
So you want to identify that particular point — the extreme one. There is a function called identify(). Let's run it: your cursor becomes a crosshair; you click the points — click, click, click — and you have already identified three points; let's click that last one too. You see "locator active, ESC to finish", so you press Esc to finish, and the indices of the points you clicked are stored in the variable clicked. These are the points you clicked: 20, 35, 88, and that big one is the 153rd row. This is very useful: when your data set has extreme values, you can remove them this way — visually, very easy to use. So how do you then remove those points? The variable clicked holds the indices of the extreme values — basically the vector 20, 35, 88, 153, a vector of four elements. You can treat algae like a two-dimensional matrix: put clicked in the row position of the square brackets and it takes those rows out, displaying all the extreme values. If you put a minus sign in front instead, you directly remove those rows from the data set — which we don't want to do for the moment. That is the visual way to identify extreme values. There is another, more computational, more technical way: a selection. You write algae with square brackets containing a comma; the first value is the index on the rows, the second the index on the columns. So basically you select all the rows that satisfy a condition.
I've put it in the notes. The condition means: the value is not NA first — is.na() returns TRUE if it is NA and FALSE otherwise, and you negate it, so it's a real number — and that value is bigger than 19,000. 19,000 is around here, so this obviously catches that point. By doing this you also select row 153: recording 153, ammonia 24,064. You detect that same outlier. Are we on the same page? Any questions? No? Okay, so far it should be very simple. Now, since the data contains a lot of NAs, we want to deal with them — do something about those problematic data points. There are several ways. First way: simply remove them. That works when you have a very big data set — a thousand or two thousand observations — and only a handful of problematic data points; you can simply drop them. But that won't always be possible: sometimes you just have a very small data set, so you have to use other ways. Second, you can fill the holes with the most frequent value — make an intelligent guess. Where you have an NA, you fill that particular hole with the most frequent value of that column, or with the mean of that column, or the median. Third, more advanced, you can exploit the correlation between different columns. In our situation, as we'll see later, oPO4 and PO4 are both phosphate-related values,
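The same selection works on a toy data frame, which also shows why the is.na() guard matters (the column name and numbers here are illustrative, not the course data):

```r
# Toy stand-in: an NH4 column with missing values and one extreme value
df <- data.frame(NH4 = c(120, NA, 550, 24064, 80))

# Rows where NH4 is known AND greater than 19,000 -- only the outlier row
df[!is.na(df$NH4) & df$NH4 > 19000, , drop = FALSE]

# Without the guard, NA > 19000 evaluates to NA, and an NA in a row
# index pulls in a junk all-NA row -- so always mask the NAs first.
```

The empty slot after the comma means "all columns", exactly as described above for the row/column positions inside the brackets.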
so you expect a correlation between these two columns. You can model their linear relationship, for example, and then whenever one value is present and the other is missing, you use the model to fill in the missing one. That is what we'll do. Fourth, most advanced, you can exploit the similarity between cases. In more serious data analysis this is the usual choice: among your 200 observations, each with all its characteristics listed, you find which observations are most similar to the one with the hole, and use the values of those similar recordings to fill it. Understand? Let me repeat. We actually have four ways to deal with unknowns. First: remove them — delete, simple. Second: a simple guess — fill the hole with the mode, mean, or median of that particular feature. Third: exploit the correlation between different characteristics — for example variables that are all phosphate-related or all nitrate-related; build a small model, say a linear model between the two, so that given x you predict the missing y and fill the hole. Fourth, the most sophisticated: exploit the similarities between data points. It's like having a Mona Lisa with one arm missing, and then you find another picture that looks like the Mona Lisa and has the arm,
and you paint in the missing arm with that full picture as a reference. That is exploiting the similarities between data points. So let's start with the first method: simply remove. Again there are several ways. First you find the incomplete cases with which(), a very powerful function — I don't know if the previous session covered it. which() tells you the indices of the elements in a data set that satisfy a certain condition; here, the indices of the rows that are not complete. complete.cases() returns TRUE for a row with no NAs inside — remember, you have 200 observations, each represented by a row, and "complete" means every value on that row is properly recorded. We want the indices of the rows that do have NAs inside, so we negate it. With this you find quite a few rows — 12, 13, 15 of them, almost 18. Then you can remove them; you know how, right? You keep the indices in a variable, an ID, and then algae with the negative ID inside the brackets drops those rows. Don't assign and press Enter yet, because your algae object would shrink and become something else, and you'd have to load the data again. You can also display all those recordings first — rows full of NAs. And there is a function that does the same thing in one call — very nice — na.omit on algae: it omits all the rows with NAs.
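Here is a minimal sketch of both removal idioms on a toy data frame with missing entries (the data is made up; with the class data you would use algae instead of df):

```r
# Toy data frame: rows 2 and 3 each contain an NA
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c("x", "y", NA, "w"))

bad <- which(!complete.cases(df))  # indices of rows that have any NA: 2 and 3
clean1 <- df[-bad, ]               # drop them by negative indexing...
clean2 <- na.omit(df)              # ...or in one call; same surviving rows
```

Assigning clean1 or clean2 back to df is the destructive step warned about in class — do it only once you are sure you no longer need the dropped rows.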
That is, all the records containing NAs. Don't press Enter on the assignment: if you do, your algae data frame becomes a smaller set and you will have to reload the data for the later steps. You can also remove particular rows. For example, rows 62 and 199: you create a vector containing their indices, put a minus sign in front, and subset, and those two rows are removed from the data set. I think your coach already showed you in the previous session how to remove certain observations this way. Now let me show you a more advanced thing. This line finds the rows where the proportion of NAs is at least 20%. Say a row has 18 variables; 20% of 18 is around four variables, so it finds the observations with roughly four or more NAs, the ones with a lot of NAs. Why do we care? Because there are two situations. Some records have just one NA: out of 20 characteristics, one was not recorded. That is fine; you just fix it. But some records have, say, 10 out of 20 characteristics missing. Half the observation was not recorded, and that you cannot fix. It is a very bad record, so we remove it. You can treat these two situations with two different measures, but for the bad ones, removal it is. You do that with apply(), which is a very powerful function: it applies a function over all the rows of the data frame. The argument 1 means "rows", and algae is the name of the data frame. And what is the function? It is a Boolean function, written as function(x).
Inside, it computes the sum of NAs divided by the total number of columns, and checks whether that is greater than or equal to 20%. If it is, it returns TRUE; otherwise FALSE. You put the whole thing inside which() and you get the indices. So let's run it. You get two rows with a lot of NAs. Look at those two observations: row 62 has a lot of NAs, and row 199 also has a lot of NAs. These are considered very bad, a lot of unknowns, so most probably we remove them entirely; almost nothing was recorded, so we cannot use them. For the rows with only one NA, you most probably fill in the missing value and can still use that data point. So that is one way: you remove it. Another way is to fill the hole with a representative value, which is a bit more involved. You find a particular missing entry and fill it one by one. For example, this line finds the variable mxPH at row 48 and fills that particular hole with the mean of that column, computed with all the NAs removed. (I didn't reload the data; you can do that.) Or you can replace it with the median instead; I put the code on the Google Drive. Someone asked: can you undo that? Normally, no, because the assignment destroys that particular value. You replace it, you overwrite it with something else, like memory being overwritten, so you cannot trace back. But yes, of course, you can set up a dummy variable first: assign algae to another name, and you keep the original record. That is a very good point.
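The two steps just described, flagging rows that are at least 20% NA and mean-filling a single remaining hole, might be sketched like this (again on invented data; in the session this is done on the algae data frame):

```r
df <- data.frame(
  mxPH = c(7.0, NA, 8.1, NA,  7.3),
  mnO2 = c(9.8, NA, 10.1, 9.5, 11.0),
  Cl   = c(60,  NA, 57,   55,  53),
  NO3  = c(1.2, NA, 2.2,  3.1, 2.8),
  NH4  = c(110, NA, 120,  130, 140),
  oPO4 = c(45,  NA, 48,   60,  47)
)

# Rows where at least 20% of the values are NA (row 2: all 6 missing)
bad <- which(apply(df, 1, function(x) sum(is.na(x)) / ncol(df) >= 0.2))
df2 <- df[-bad, ]        # drop the very bad rows

# Row 4 of the original data still has one missing mxPH (1/6 < 20%,
# so it was kept); fill it with the column mean, NAs removed
df2$mxPH[is.na(df2$mxPH)] <- mean(df2$mxPH, na.rm = TRUE)
```

Note the threshold: with six columns, a single NA is under 20%, so such rows survive the filter and get filled instead of deleted.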
A very good point, because some operations in R are destructive: they destroy a variable you may want to keep, so you keep a copy aside. In our situation it is okay, because we have the original data text file and can just load it once more. But what your friend said is very important: if you are about to do an operation you are not sure about, keep a copy aside first. Then if something goes wrong, you can go back and do it again. So, as I just said, you can fill with the mean or with the median. I already put the code on the Google Drive: the first line fills a particular value with the mean, and the second fills all the NAs of a particular column with the median. Which one should you choose? Well, since in the last session you learned how to write functions, I also wrote a function for you, a "frequent fill", that does exactly this. You give it a data frame; if a column is numeric, it fills the NAs with the median, and else, for a factor, a categorical variable, it fills them with the most frequent level. You can load the function into R and then use it; you know how to load a function, since you just learned how to write one. This one is a little more complicated, because it has some logic in it. The data set is a two-dimensional table, and seq() creates a sequence from 1 to the number of columns of the data, here around 20, so we loop over the columns from 1 to 20.
Then, for each column: if the data in the column is numeric, you fill the NAs with the median of that particular column. If it is not numeric, the else branch: you treat the column as factor data. A factor has levels, and you ask which level is most frequent. table() on a factor gives you a frequency table, which tells you, for each category, how many counts there are. You find the maximum of that table, take the corresponding level, and put it into the NA entries. Then you return the data. That is what the function does, automatically. You will get this function and you can play with it. Any questions? It should be fairly easy to understand. Okay, the next method is more complicated: we fill the unknowns by exploring the correlation between variables. How? We first have to find out which variable is correlated with which, so we compute the correlation matrix. The raw matrix is not so easy to read, but there is another function that produces a sparse, symbolic version of it: you find a star where there is something significant. Looking at it, you find that PO4 and oPO4 are correlated. You know what a correlation matrix is, right? It is a bit mathematical; Wikipedia has an equation for it, and you probably saw it somewhere back in high school. If you don't understand the formula, that's okay.
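The fill function described above might look roughly like this; `frequent_fill` is my own name for it, and the session's exact version is in the Google Drive file:

```r
# Fill NAs column by column: numeric columns get the median,
# categorical columns get the most frequent level.
frequent_fill <- function(data) {
  for (i in seq(1, ncol(data))) {
    if (is.numeric(data[[i]])) {
      data[[i]][is.na(data[[i]])] <- median(data[[i]], na.rm = TRUE)
    } else {
      tab <- table(as.factor(data[[i]]))       # frequency per level
      mode_level <- names(tab)[which.max(tab)] # most frequent level
      data[[i]][is.na(data[[i]])] <- mode_level
    }
  }
  data
}

df <- data.frame(
  size = c("small", "large", NA, "small"),
  mxPH = c(7.0, NA, 8.1, 6.9),
  stringsAsFactors = FALSE
)
filled <- frequent_fill(df)
filled$size[3]   # "small", the most frequent level
filled$mxPH[2]   # 7.0, the median of 7.0, 8.1, 6.9
```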
If you put your data vectors into this function, with use = "complete.obs" so that only the observations with no NAs are used, it returns a matrix, and you look for the entries that are very big in absolute value, close to 1 or to -1. A value of 1 means the two variables are perfectly positively correlated: when one variable increases, the other will definitely increase at the same time. A value of -1 means they are negatively correlated: one variable increases while the other decreases. So in this matrix you look for the entries with a large absolute value; those pairs of variables are highly correlated. For example, in the matrix computed here, for the two variables PO4 and oPO4 you get the number 0.9, very close to 1, so they are highly positively correlated. (If it were -0.9 they would be negatively correlated; but 0.9 is very big.) The diagonal entries are exactly 1, because each variable is compared with itself: the oPO4 column against the oPO4 column gives 1, since you are definitely perfectly correlated with yourself. Against the other variables, 0.9 is about as big as it gets. So we want to exploit this correlation: we can use the value of oPO4 to predict the value of PO4. That is what exploring the correlation between variables means. How do we do it? We still have 200 observations, and we first remove those problematic points.
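A small sketch of this step, with synthetic columns where one pair is constructed to be highly correlated; symnum() is the base-R function that prints the sparse symbolic version:

```r
set.seed(1)
oPO4 <- runif(50, 10, 100)
PO4  <- 40 + 1.3 * oPO4 + rnorm(50, sd = 5)  # strongly tied to oPO4
NO3  <- runif(50, 0, 5)                      # unrelated noise
df   <- data.frame(oPO4, PO4, NO3)

# Correlation matrix, using only the complete observations
cm <- cor(df, use = "complete.obs")
cm["PO4", "oPO4"]   # close to 1

# Sparse symbolic view: the big correlations stand out as symbols
symnum(cm)
```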
That is the line I just explained: I put a negative sign in front of the indices and remove them. Those rows are not useful because they have so many NAs. That leaves 198 observations, and then we fit our linear regression model. Yes? You mean the dots? The dots are fine; the display is just not ideal. Each symbol, dot or star, represents a range of correlation values. The full matrix is very big, and it is not easy to spot that 0.9 in it, so this function gives you a sparse version of it. You can reproduce it yourself; the code is already there. cor() gives the correlation coefficient, from -1 to +1, and you want the entries whose absolute value is close to 1, meaning highly correlated. And you find that these two really are correlated. So you create a model: PO4 is your y, oPO4 is your x, and you fit a linear model, which gives you the coefficients, the intercept and the slope. You know this one, right? No problem; the coach covered it in the previous session. It is just a very simple linear model. Now, PO4 at row 28 is NA, and we know PO4 is related to oPO4. So what do we do? We take the intercept, add the slope times algae$oPO4 at row 28, and we get PO4 at row 28. By doing this we fill that gap. Any questions? All clear? Okay, very nice. The last method is exploring the similarities, which is a little bit complicated.
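Filling the hole from the regression line might look like this; the numbers are synthetic, but the steps (fit, take intercept and slope, substitute the observed oPO4) follow the session:

```r
set.seed(2)
oPO4 <- runif(200, 10, 100)
PO4  <- 40 + 1.3 * oPO4 + rnorm(200, sd = 5)
df   <- data.frame(oPO4, PO4)
df$PO4[28] <- NA                  # pretend row 28 is the missing one

fit <- lm(PO4 ~ oPO4, data = df)  # lm() drops the NA row by default
a <- coef(fit)[1]                 # intercept
b <- coef(fit)[2]                 # slope

# Fill the gap: intercept + slope * oPO4 at row 28
df$PO4[28] <- a + b * df$oPO4[28]
```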
Let me give you a function first. It is called k nearest neighbours. What is that? Look at this plot: you have a lot of points, and for a given point you find the k points nearest to it. That is what we mean by similarity, and we measure it by the Euclidean distance between points. You know what Euclidean distance is, right? Take a two-dimensional plane: this is x, this is y, this is the origin, and there is a point. The Euclidean distance is the length of the straight line connecting the origin to the point: for a point (x, y), the distance is the square root of x squared plus y squared. Very simple geometry. In higher-dimensional cases you have more variables: in three dimensions you just add a z squared under the square root. In our case we have 18 variables, so we are in 18-dimensional space and you have 18 terms inside the square root. That is Euclidean distance in high-dimensional space; it keeps the same mathematical name. So: you find the k points nearest to your centre point. I am not going to walk through the code, it is quite long; I will paste it today and you can simply use it. The function is called knnImputation. You copy and paste it, run it, and it is loaded into your session; you will see a function called knnImputation, and that is good enough. k = 10 means you find the ten nearest points among all the points. As I explained just now: you have a point with some problematic holes in it, you find the k points nearest to it, and then you average those ten points.
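The pasted knnImputation function is the one to actually use; purely to illustrate the idea, filling one hole by averaging the k nearest rows might look like this (`knn_fill_one` is my own toy name, not the session's function):

```r
# Fill one missing entry (row, col) with the mean of that column
# over the k rows nearest in Euclidean distance.
knn_fill_one <- function(df, row, col, k = 10) {
  others <- df[-row, ]
  keep <- setdiff(seq_len(ncol(df)), col)  # distance uses the other columns
  target <- as.numeric(df[row, keep])
  d <- apply(others[, keep], 1, function(x) sqrt(sum((x - target)^2)))
  nearest <- order(d)[seq_len(k)]          # indices of the k closest rows
  mean(others[nearest, col])               # average their value for 'col'
}

set.seed(3)
df <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
df$x3[5] <- NA
df$x3[5] <- knn_fill_one(df, row = 5, col = 3, k = 10)
```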
Then you put the average value for that particular feature into the hole. You see what I mean? Let me explain a little more. You have a point, with features x1 up to x18, and a hole in one of the features. You find the ten points nearest to it, its most similar points. You look at their values on that particular dimension, average those numbers, and fill the hole with the average. That is basically what the function does. If you want to explore a bit more, read the function and you will understand it. So that is the fourth method, and that gives you the full logic: we now know how to clean our data set. So we proceed to the analysis. We want to propose a model for our data; now we enter the real analysis part. The first model I am going to talk about is linear regression, multiple linear regression. Since we already learned lm(), I am going to use it. I have another auxiliary function; I will copy it onto your file and you can run it there. Our data is clean, so how do we do multiple linear regression? It follows the same logic as the simple x-and-y case, but with more dimensions. We predict a1, one of the seven outcome variables in the data, from the algae data. LM stands for linear model, and we write lm.a1 with the formula a1 ~ ., where the dot means "all the other variables". The data argument is the data set from the first column up to the twelfth: we use all of those variables.
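In code, the general fit just described might look like this, with a synthetic data frame standing in for algae[, 1:12]:

```r
# Synthetic stand-in for the algae columns used in the session
set.seed(9)
n  <- 198
df <- data.frame(
  mxPH = runif(n, 6, 10),
  NO3  = runif(n, 0, 5),
  NH4  = runif(n, 5, 500),
  PO4  = runif(n, 1, 500)
)
df$a1 <- 45 - 0.07 * df$PO4 + 3 * df$NO3 + rnorm(n, sd = 5)

# a1 ~ . : predict a1 from all the other columns in the data
lm.a1 <- lm(a1 ~ ., data = df)
summary(lm.a1)
```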
By doing this you create the most general linear model for your data set. Let's see what we get, and then I will explain. Basically, you have a bunch of points, and you want to create a model that minimizes the total distance from the points to the fitted line. This distance has a name: it is called a residual. The error you are going to minimize is the total sum of squared residuals: you square the residuals, because they can be positive or negative, sum them all up, and minimize that quantity. That is exactly what linear regression does: it minimizes all the residuals, all the errors, and gives you a line. Multiple linear regression is not just a line. With two predictors, y = alpha + beta x1 + gamma x2, you are fitting a plane in three-dimensional space instead of a line. The points sit around the plane, and for each point you measure a distance to the fitted plane; you can picture the point projected onto the plane. In even higher-dimensional spaces, say 12 or 18 dimensions, the fitted object is called a hyperplane. You cannot visualize it; it becomes more difficult. And what does "linear" in "linear model" mean?
What is a linear model? A basic example: y = alpha + beta f(x) + gamma g(x). The model is linear here, in the coefficients: each term is a number multiplying a function, and the terms are added together. What is inside the functions need not be linear; f(x) can be x squared, sin(x), cos(x), or anything else. So the model is linear in the coefficients, not in x itself. If you don't fully follow this, that's okay. So here we actually create the model. (Someone sent me a message... okay.) That was a little explanation of the linear model. After we have fitted it, we can run another analysis, the analysis of variance. This helps you choose a good model, because how you construct the model is also very important: which variables are actually related to the outcome? You have a lot of variables, and usually not every variable is related to the end result, right? You want to find the particular variables that are. That is what this does. Here, one variable's p-value is not very good, so you want to remove it. The p-value comes from a statistical test; roughly, the test asks whether the coefficient for that variable could just be zero. Anyway: after you create the model, you run anova(), analysis of variance. You put your model inside, it gives you a table like this, you find the parameter with the largest p-value, and you can remove that variable. How? You update the model. So, second step: update(lm.a1, . ~ . - season). lm.a1 is the most general model, where we did not specify any particular structure, and "- season" means you remove the season variable.
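The elimination loop, anova() on the full model, update() to drop the weakest variable, then an F test comparing the two fits, might be sketched like this (synthetic data; season is built to have no real effect, mimicking the situation in the session):

```r
set.seed(4)
n <- 100
season <- factor(sample(c("spring", "summer", "autumn", "winter"), n, TRUE))
NO3 <- runif(n, 0, 5)
PO4 <- runif(n, 10, 300)
a1  <- 30 - 0.05 * PO4 + 2 * NO3 + rnorm(n, sd = 4)  # season plays no role
df  <- data.frame(a1, season, NO3, PO4)

full <- lm(a1 ~ ., data = df)
anova(full)                              # season's p-value should be large

smaller <- update(full, . ~ . - season)  # drop it and refit
anova(smaller, full)                     # joint F test on the difference

# step() automates this backward elimination using the AIC
final <- step(full, trace = 0)
```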
Then you compare the difference between the two models. This is a joint F test, and the p-value you get is actually very big, about 0.7. A big p-value here means the two models are not significantly different, so the dropped variable was not contributing; normally you would look for something smaller than 0.05. You don't have to do this step by step, though: R is very good here, and the step() function performs it automatically. step() computes the AIC, the Akaike Information Criterion; the theory behind it is a little complicated, but the function automatically runs several elimination steps and gives you a final model, your final linear model. Then you call summary() on it. The multiple R-squared is not so great, but this is the final model, and its structure contains the size variable, mxPH, nitrate, ammonium, and phosphate. So basically the workflow is a few steps. The first line, which I put in the Google Drive file, fits the full model; then you can go directly to the final step: call step() on that first model, and it automatically reduces it to the best linear model and gives you the final result. In the middle you can check the intermediate models with anova() or with summary(). I will copy the model code onto the file. After you have produced the model, you have to do model diagnosis. There is a standard set of diagnostic plots, and it is good to know them. Let me create a dummy model, a toy model: this is x, and y is built from x.
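The toy model and the diagnostic panel might be set up like this, matching the session's description (100 points, intercept 1, slope 2, normal noise):

```r
set.seed(6)
x <- runif(100, 0, 10)        # 100 random points between 0 and 10
y <- 1 + 2 * x + rnorm(100)   # true intercept 1, true slope 2, noise

fit <- lm(y ~ x)
coef(fit)                     # intercept near 1, slope near 2

# The four standard diagnostic plots in one 2-by-2 frame
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))          # back to one plot per frame
```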
This is a toy model; it is not related to our data set, it is for explanation purposes only. I create 100 data points, random numbers ranging from 0 to 10, and I build y with an intercept of 1 and a slope of 2, plus rnorm() noise. Then I fit a linear model, lm(y ~ x), and it returns the formula with intercept about 0.68 and slope about 2.0-something. The slope is pretty good: the real slope is 2. And if the number of points grows bigger, say to 1,000 points, more observations, your estimates become better: the intercept moves towards the real intercept 1 and the slope towards the real slope 2. With 1 million points they are almost exactly 1 and 2. Now, model diagnosis looks like this: a panel of four plots. The par(mfrow = c(2, 2)) command creates a frame where you can put four plots in a 2-by-2 layout; if you don't run that command, you get one plot at a time. When you do multiple linear regression, you will usually look at these four plots for model diagnosis; since this is our toy model, let me explain each one. The first is residuals against fitted values. If the model really is linear, you will see a roughly flat line in the centre, with the dots evenly distributed above and below, just like this. So that checks one assumption: the model is linear. Another assumption is normality: the residuals should be normally distributed. Because we generated the model ourselves with normal noise, ours are. That is the Q-Q plot: the standardized residuals against the theoretical quantiles of the normal distribution, and the feature you are looking for is that all those
points fall on a straight line. If that holds, the model is good: the normality hypothesis is satisfied. The third plot is the scale-location plot; if the model is good, you will again see a roughly flat line. It is effectively testing the stability of the variance; if you know what white noise is, that is the idea. If it is not satisfied, you will see the variance becoming bigger and bigger: the points fan out in one region of the plot, and then that constant-variance hypothesis fails. Ours is good, no problem. The last plot is standardized residuals against leverage, and this one is mainly for spotting outliers. Here we have no outliers. Normally you would see a dashed line somewhere on this plot; that dashed line marks the Cook's distance, and if you find dots beyond it, those are outliers and you have to take a look at them. That is the standard procedure for model diagnosis. I will give you a reference, a web link with a lot of explanations: a good residuals-vs-fitted plot versus one from a non-linear model; a normal Q-Q plot versus one where the normality hypothesis is not satisfied; the fanning-out pattern I described, which is bad; and a leverage plot where a point falls beyond the Cook's distance line, meaning you have outliers to deal with. Our plot has no dashed line because we have no outliers. You can read it at home; it is more theoretical, but good to know. So that is multiple linear regression, covered here in twenty minutes; in a proper statistics course they would teach at least three
classes, minimum, of two hours each, on this topic alone. So that is multiple linear regression. Now we want to do more than that. The problem with multiple linear regression is that it is a global model: we use all the data at the same time, but the structure of the data might not be global. Sometimes one part of the data behaves differently from another part, right? That happens often; not all data follows one global pattern. So let me introduce another model: the decision tree. Decision trees come in two categories: classification trees, for classification, and regression trees, for regression. They follow the same procedure; the difference is the outcome. If the outcome is discrete, it is classification; if the outcome is continuous, it is regression. Either way, you are basically fitting a function. To use regression trees we need the rpart package. You know how to load a library, right? You load the library called rpart. Is it on your computer? If you cannot load rpart, you can run install.packages() to download and install it; normally it downloads the package automatically. After installing, you do library(rpart). You also need another package, called rpart.plot, which helps you visualize the tree. So what is a decision tree?
A decision tree is actually very simple. For example, suppose you are in HR and you have a bunch of candidates, 100 students, and you want to choose one. You can set a lot of criteria: maths result, work experience or not, knows machine learning or not, tall or not, and so on. Following these criteria, you partition the people into subgroups, a lot of subgroups, and you choose the subgroup that satisfies all your criteria: those are your candidates. Here it is the same, but with criteria on the water variables. You can say ammonia bigger than a certain value or not, and you get two groups; then phosphate smaller than a certain value, and each group splits in two again; then you go further down, say nitrogen bigger or smaller than a certain value, down to a certain level. And there is a criterion measuring the error of those trees, because they are fitting something. So you build a decision tree for your data, and then when a new data point arrives, you trace the tree: the point has its feature values, you go through the tests, you go down, and you reach a particular leaf that corresponds to your data point. You then use that leaf to predict the result. Any questions? Does everyone have rpart.plot installed? Those who haven't: all done? Okay. So we have our data set, and this is how to use it. We call the model rt, for regression tree, because we are not doing classification, we are doing regression: our outcome is a continuous variable. Continuous means it can be any real number within a certain range; discrete means it can only take certain values, for example
one, two, three, four, five, six, like six categories. Our a1 is continuous, so this is a regression tree. This is how you do it: rt.a1, with the data being the first columns of algae, as before. You run it and you have rt.a1; that is your model. Let's plot it. (Why is it so small? Let me plot it again. Ah, okay, yes, I'm doing that.) So this is the tree model, built automatically. You have yes/no tests: is PO4 bigger than 44, yes or no? If yes, you test another variable: is chloride bigger than nine or not? Yes goes one way, no goes the other, and further down there are more tests on other variables, another threshold here, another there. That is the tree. Later, when you have a new point, or you want to make a prediction, you go to this tree and just test the variables one by one, see where your point falls, and from that you can already read off your a1. This number here tells you how much of the variance is explained by the tree: here we reach about 40%, which is several points better than our previous linear model, which was 0.3-something, 35% or 37%. Now, this controls the iteration, where the algorithm stops. For example, we can build another tree with cp = 0.08: it means that when the improvement at a split drops below 0.08, the algorithm stops. Call it rt2, and let's see. The cp value is about the decrease in error at each split; that goes into the mathematical machinery. You are setting a threshold, and you can set that threshold smaller.
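A minimal sketch of fitting and drawing a regression tree, on synthetic data constructed so that PO4 and Cl drive the outcome (the session fits the algae data instead, and uses rpart.plot for the nicer picture):

```r
library(rpart)   # install.packages("rpart") if it is missing

set.seed(7)
n   <- 200
PO4 <- runif(n, 1, 500)
Cl  <- runif(n, 0, 80)
NO3 <- runif(n, 0, 5)
a1  <- ifelse(PO4 > 44, 5, 40) + ifelse(Cl > 9, -3, 3) + rnorm(n, sd = 2)
df  <- data.frame(a1, PO4, Cl, NO3)

rt.a1 <- rpart(a1 ~ ., data = df)  # regression tree: a1 is continuous
plot(rt.a1); text(rt.a1)           # base plot; rpart.plot::prp() is prettier

# A coarser tree: stop when the fit improvement per split drops below cp
rt2 <- rpart(a1 ~ ., data = df, control = rpart.control(cp = 0.08))
```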
If you set the threshold smaller, the algorithm stops later, runs more iterations, and your tree becomes more precise. But that also means you run a risk of overfitting: if you make that error threshold very, very small, there is a risk of overfitting, which means big variance and low bias; make the threshold big and you get the opposite. Question: since the resulting tree can vary from one setting to another, which one do you choose? Wait a minute, good question: for that you have to test the prediction error. The very standard method, which we will get to later, is to use each tree to make predictions on test data and then look at, for example, the mean squared error. You have different trees, you compare them, and you choose the tree with the smallest prediction error. This is usually the silver bullet for comparing models of any kind. Now, going further: this is just one tree, one decision tree. Above the decision tree there is something called the random forest. From tree to forest means you have more trees. What's the difference? Random forest is a so-called ensemble method, an ensemble algorithm. What you do is build a lot of trees from your data. How do you build a lot of trees? Each time, you sample: you have a training data set, and instead of training one tree on all of it, each time you sub-sample a small subset of your big training set. (The other question I will answer later.) With that subset, you build a tree. And you do that many times.
You do that many times — you build, say, 20 trees. Then, when you make a prediction, there are two ways. If you are doing regression — a regression tree, because your outcome is continuous — you predict with all 20 trees, you get 20 values, and you take the average. That is one method. If you are doing classification, you predict with the 20 trees and then you take a vote: the class with the maximum votes is the predicted result. A majority vote. This is the so-called random forest, and the two methods are closely related. How to calculate it is very simple — I'll show you later — but the logic is this.

Let me explain it again. We now have a tree model, but it is just one tree, and the structure is very simple. You can tune the performance of a single tree by setting cp smaller so that it runs more iterations. But one tree is perhaps not enough, so to enhance this, you build more trees — that is the random forest. In real-life data mining problems, random forest is what is actually used; this method is usually used by industry. The simplest version is this: you have a training set, and you take a subset from this training set with replacement — you sample with replacement — and you use this small subset to train a new tree. You do that, say, 20 times. Instead of just one tree, you build many trees, and then you have the so-called ensemble model — an ensemble is a collection of trees.
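The sample-with-replacement-and-average procedure just described can be sketched in base R. As a simplification, the per-resample "tree" here is replaced by a small linear model on mtcars, so the sketch needs no extra packages; the bagging structure is the point.

```r
# A base-R sketch of the bagging idea behind a random forest:
# resample the training set with replacement, fit one model per
# resample, and average the predictions.
set.seed(42)
n_models <- 20

models <- lapply(seq_len(n_models), function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)   # bootstrap sample
  lm(mpg ~ wt + hp, data = mtcars[idx, ])       # stand-in for a tree
})

# Regression-style ensemble prediction: average the 20 outputs.
preds    <- sapply(models, predict, newdata = mtcars)  # n x 20 matrix
ensemble <- rowMeans(preds)
```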
So you use this collection of trees to make predictions. If it is a regression problem, you feed in your test data, you get 20 numbers, 20 results, and you take the average. If it is a classification problem, you take a vote. You get 20 results, each one a class, right? It can be class one, class one, class one, class two, class three, class one, class one, class two... The majority, normally, would be class one, and then you predict class one. It doesn't matter which individual tree produced it: if ten of them vote for class one, this particular test point is predicted as class one.

Any other questions? [Student:] Each tree uses different thresholds for each parameter to make its decision, so each model will be different, right? — Yes. — Then in practice, if we want to know how much impact PO4 has, can we get anything like that from the model? For a particular parameter, how much impact does it have, and which threshold should we choose? — You mean in terms of one parameter, PO4? So you have a PO4 value and you want to say something about the final result — can the decision tree tell you that? I understand your question: you have PO4 equal to, say, 50, so at this node you go down the right side, but you want to know how likely that point is to land in the yes group or the no group. If that is the question, you can calculate it yourself — you don't really need the tree. You can estimate it directly from the data.
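The majority vote for classification described above is a one-liner in base R. The votes here are made-up examples standing in for the per-tree predictions:

```r
# Majority vote over an ensemble's classification outputs: the class
# with the most votes is the ensemble prediction.
votes <- c("class1", "class1", "class1", "class2", "class3",
           "class1", "class2", "class1")

majority_vote <- function(v) names(which.max(table(v)))

majority_vote(votes)   # "class1" (5 votes out of 8)
```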
You want to know how likely it is that PO4 is bigger than 50, right? Very simple: you have the data set, so you can calculate it directly. How? You take the subset with PO4 bigger than 50 and divide by the total number of data points — that gives you the probability.

[Student:] But won't the threshold be different? In the tree the cut is at 44, right? So why would we use 50? — The threshold here is calculated by the algorithm itself. It is not set by me; there is an internal algorithm that chooses it. — So each tree can have a different cut point? — Yes, it can differ; it won't always be the same. — So it's a black box: we don't know what happens in the middle, we only see the result. — Yes. The only probabilistic part, for some tree algorithms, is the choice of which variable is tested first and which second — that can be random, depending on the exact algorithm, because there are many different algorithms that minimize different kinds of error. But the threshold itself is fixed: it is calculated by the algorithm, not predefined by you. It's automatic.

Any other questions? [Student:] Can the model tell us which parameter is more important for the classification? — Which parameter is more important — for that there is a particular tool called PCA, principal component analysis. That is another technique, and there is another tree-based workflow built on it.
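The direct probability estimate from a moment ago is literally one line in R. The PO4 values below are made up for illustration, since the algae data isn't loaded here:

```r
# Empirical probability that PO4 exceeds 50:
# (size of subset with PO4 > 50) / (total number of observations).
po4 <- c(12, 44, 61, 50, 88, 30, 75, 9, 55, 40)

p_above_50 <- mean(po4 > 50)
p_above_50   # 0.4  (4 of the 10 values exceed 50)
```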
The first step of that workflow is PCA, so-called principal component analysis. With PCA you can find which variable is more important — more important meaning how much of the variance of the result is explained by that particular variable. So you do PCA, you select several variables, and then you run the tree algorithm on those variables. That is a known tree workflow, and it corresponds to what you asked. But PCA is a much larger topic of its own. If you want to learn PCA, I'll send you some links later. Here, the point is that the tree's thresholds are calculated automatically.

[Student:] Compared with a linear model — a static linear model — in which conditions will the decision tree and the linear model be similar in efficiency or performance? — Okay, let's compare the advantages. The advantage of the tree model is this. First, a tree model is easy to understand, and it is not a global model — just as I said, at the very beginning it splits the space into two parts, then into more parts, so it is not a global fit. Second, it can incorporate more data types: it can take categorical data directly — yes/no, categorical variables. For linear regression, if you want to use categorical data, you have to plug in dummy variables; for a tree model you don't need to — you just plug the data in. Third, it is robust with respect to holes: even if you put NAs inside, it still works. Not a big problem, so it's more robust. [Student:] So does it just delete the rows with NAs? — No, it puts them into a category.
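The dummy-variable point about linear regression can be seen directly in base R. This is a small illustration with made-up data: model.matrix() shows the 0/1 dummy columns that lm() builds internally from a factor.

```r
# Categorical predictors in a linear model: lm() encodes a factor as
# dummy (0/1) columns, which model.matrix() makes visible.
df <- data.frame(
  y     = c(3.1, 4.0, 5.2, 6.1, 4.8, 5.5),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# One intercept column plus one dummy per non-reference level.
mm <- model.matrix(y ~ group, data = df)
colnames(mm)   # "(Intercept)" "groupb" "groupc"

fit <- lm(y ~ group, data = df)  # lm builds the same dummies internally
```

A tree model would split on the factor levels directly, with no such encoding step.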
In real life, I don't use this one; I use other things. Okay, we can do prediction, so let's do some prediction. Linear model prediction: lm.prediction.a1 — just call predict. And this is your tree model prediction. With these predictions, as I told your classmate, you can calculate the errors. The model error can be the mean absolute error: you take the difference — the prediction of a1 minus the true a1 — take the absolute value, and then take the mean. That gives 12.9-something. Now let's see the tree model: this error is visibly smaller. The previous one is your linear model, and this one is your regression tree, your tree model: the linear model error is 12.9-something and the tree error is 8.47-something.

The error can be other types too. You can use the mean squared error: instead of the absolute value you take a square — raise the difference to the power of two and then take the mean. The tree's mean squared error is 161, and the linear model's is 295. By comparing errors you can choose which model is better. Sometimes the error measure is defined by the task: whoever assigns the task will give you an error function, you plug in your prediction and the real data, and they use that particular function — but normally it will be some form of this: mean squared error, mean absolute error, percentage errors, this kind of thing. Anyway, they'll give you the formula. Some other error forms are in your file, so you can try them yourself.
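The two error measures just computed are one-liners in base R. The numbers below are hypothetical stand-ins for the course's predictions, not the 12.9/8.47/161/295 values from the demo:

```r
# Mean absolute error and mean squared error, base R.
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 35)

mae <- mean(abs(predicted - actual))   # mean absolute error
mse <- mean((predicted - actual)^2)    # mean squared error

mae   # 3
mse   # 10.5
```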
Okay, it's on the Google Drive. Now, just now I talked about the random forest, so let's try the random forest as well. To do that you have to install the package: install.packages("randomForest"). After you install it, you load it into the environment with library(randomForest). Prediction is also very simple: you fit the random forest first — the fitting is handled automatically by the R package — and then you make a prediction. Let's see the error performance of that random forest: mae.rf... where is that number? Okay, I overwrote that value; let me run the linear model again so you can see the comparison.

So: this is your original linear model, which is lousy — very bad. Its mean squared error, prediction against a1, is 295. This one is the regression tree, the single-tree model: the error is 161. And now we change to the random forest — the forest, the ensemble model — and the prediction error is 53. So you get a better model. This structure is more or less the same for other prediction models you'll meet later. For example, a neural network follows the same structure: you train the network, you predict, and you get the result. The algorithm inside is mathematically very complicated.
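The random-forest workflow just shown can be sketched as follows. This is an illustration under assumptions: mtcars stands in for the algae data, and the randomForest package — which is not part of base R — is guarded so the sketch degrades gracefully if it isn't installed.

```r
# Fit a random forest and measure its mean absolute error, mirroring
# the lm -> tree -> forest comparison from the lecture. mtcars is a
# stand-in data set; randomForest must be installed separately.
if (requireNamespace("randomForest", quietly = TRUE)) {
  rf   <- randomForest::randomForest(mpg ~ ., data = mtcars, ntree = 20)
  pred <- predict(rf, mtcars)
  mae  <- mean(abs(pred - mtcars$mpg))   # mean absolute error
  print(mae)
} else {
  message("Run install.packages(\"randomForest\") first")
}
```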
If you want to explore that, you can Google it, or if you get stuck you can email me and I can give you some material. Theoretically, there is a very good book, a reference used largely by academics, called The Elements of Statistical Learning. A simpler book is Pattern Recognition and Machine Learning, by Bishop — also a very famous book in the machine learning community. So if you want to explore the theory, the mathematical formulas inside, those are where to look. If you just want to use the methods, use them.

What I explained today is just the linear model, the regression tree, and the random forest — three models. As I said, university professors are more advanced, and why are they more advanced? They know more models. Whenever you encounter a new model, you can go to R — R is very powerful and has a lot of models inside — and look for help. For example, for randomForest there is package documentation, where you can read the exact details and often find examples; you can run the examples yourself and explore a little more.

There is another resource, but it's in Python: scikit-learn. It has very good documentation — a user guide — with a whole lot of models under supervised learning. What we are doing today is called supervised learning. What is supervised learning? It means that for every input, the data set gives you an output, so you are supervised — guided by the result. All of these are supervised learning.
What we learned today is the linear model, but above that you have the generalized linear model, linear and quadratic discriminant analysis, ridge regression, support vector machines, k-nearest neighbors, Gaussian processes — many models. Neural networks have a supervised version and an unsupervised version. And then you also have unsupervised learning: clustering — just now, the k-means; the k-nearest neighbors method is closely related to the k-means method. So you have a whole bunch of methods, a lot of other things to explore, and the scikit-learn material is all in Python, if you want to know how to work with it there.

In terms of models, the explanations there are very good. For example, ordinary least squares: it shows the equation — the error we are minimizing. This is what I said earlier: you have a line, you have dots, and you minimize the errors, the distances between the dots and the line. That is the equation, and by adding more terms you get different models. This gives you an idea of what is happening in industry — this is actually what is used in industry. Of course, there are more advanced models, like deep learning, which uses more sophisticated, structured neural networks, or reinforcement learning, where models train against each other.

So that's basically what I wanted to talk about today. Do you have any other questions? Is it understandable? [Student:] In practice, how do we choose how to handle problematic data? — That's more of a case-by-case scenario. As I said, if you have a very big data set, you can simply remove the problematic points. Or, if you don't want to remove them, you try to fix the data — you do treatment.
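The "simple guess" treatment for holes mentioned above — replacing each missing value with the column mean — is a few lines of base R. The vector below is a made-up example:

```r
# Mean imputation: fill each NA in a vector with the mean of the
# observed (non-missing) values.
x <- c(1.0, NA, 3.0, 4.0, NA, 6.0)

fill_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}

fill_mean(x)   # the NAs become 3.5, the mean of 1, 3, 4, 6
```

More complicated guesses (regression-based imputation, for instance) follow the same pattern: estimate the missing entry from the values you do have.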
If a column has too many holes, you remove it — definitely, this one you have to remove. If it doesn't have too many holes, you try to fill them in. There are a few methods for filling in holes: either you use a simple guess, or you use a more complicated guess, with different functions. It all depends on you. [Student:] What is the cutoff for removing a column — say, more than 30% missing, or more than 10%? — That really depends on you: 20%, 30%, it's up to you.

Other things? [Student:] What's the difference between lm, rt, and rf? — lm means the linear model, rt means the regression tree, rf means the random forest. [Student:] SK learn? — You Google "scikit-learn". It's a Python package, so to use it you have to use Python instead of R, but it has very good documentation, and you can find a lot of models there. You can use it like a dictionary: there is a short explanation of each model if you don't understand it, plus demos and examples. But all the models there are a subset of what R offers — R is much, much bigger.

What else can R do? I can show you some other things I'm working on. For example, recently — this is my research — this is one way of representing data. Tree models can do other things too. This one is brain data: you see, this is a human brain, medical data, and this is also done in R. R can do much more complicated things than the very simple examples here. This one evaluates whether the brain has a disease or not — Alzheimer's or not. And this one shows how to represent high-dimensional data.
This is something like 30-dimensional data. You can represent it with curves — the different characteristics plotted as curves — where the red is one class and the black is another class. It's another way of representing data. R can also do quantitative finance — option pricing calculations and technical analysis — marketing data analysis, neural networks, linear regression, many things, with a lot of packages. In R you have two or three thousand packages, or more than that, because after university professors publish a paper, they normally write an R package for that algorithm, for that paper. So if you want to evaluate the most advanced, most cutting-edge techniques, there are two ways: either MATLAB or R.

Any other things? If not, we can stop here.