Hello, everyone. I'm Anoop. I work with the Freiburg Galaxy team in Germany. In this talk, we will learn about regression in machine learning. The previous talks gave an introduction to machine learning and then covered classification. In this talk, we will briefly see what regression is, and then we will look at different approaches for regression.

Regression, as we have discussed before, is a kind of supervised learning: we have a label. In contrast to classification, the labels for regression tasks contain real numbers. For example, on the right there are two tables. In the top table, we can see that all the features are real numbers, and the target is a real number as well. Using this top table we can train our model, because we know both the features and their targets. With the bottom table, where we know only the features, we can use the trained model to predict the target.

Regression tasks require an error function. Error functions are mathematical functions which give the difference between the true and the predicted values. There are different examples of such cost functions, which we will see in the next slides.

There are different approaches to regression which can be used depending on the data. There are linear models, such as linear regression; support vector machines, or more precisely support vector regressors, which come in both linear and nonlinear variants; k-nearest neighbors approaches; and tree and ensemble approaches. We will see examples of all these algorithms.

Some examples of real-life regression tasks are predicting gene expression patterns, where each gene has a real-valued expression level. Regression can also be used for estimating DNA copy number: some segments of a genome can be replicated, sometimes doubled, sometimes tripled, sometimes quadrupled in different people, and since the copy number is a number, regression algorithms can be used to estimate it. Another task could be identifying drug responses from changes in gene expression patterns.

Now, cost functions. As we discussed before, cost functions are mathematical functions used for computing the error between the true and the predicted targets. From this error we can tell whether our algorithm is doing well or not: if the error is too high, we assume the regressor is not performing very well, and if the error is very low, we say it is performing very well. Some common examples of cost functions are mean squared error, mean absolute error, the coefficient of determination, and so on. One example of an error computation: if the true target is 9 and the predicted target is 3.4, we can apply one of these error functions, for example mean absolute error, and take the absolute difference between 9 and 3.4 to find the error.
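As a side note, here is a minimal sketch of that error computation in Python with scikit-learn, the library underlying the Galaxy tools we use later; the numbers are just the illustrative ones from the slide.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative values from the slide: one true target and one prediction.
y_true = [9.0]
y_pred = [3.4]

# Mean absolute error: the average of |true - predicted| over all samples.
print(mean_absolute_error(y_true, y_pred))  # |9.0 - 3.4| = 5.6

# Mean squared error: the average of (true - predicted)^2 over all samples.
print(mean_squared_error(y_true, y_pred))   # (9.0 - 3.4)^2 = 31.36
```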
These are the different models or algorithms used for regression tasks. The first one is a linear model. In the image at the bottom, the black dots are the plotted targets; here we assume that we have only two features, so the plot can be drawn easily. A straight line is fit through these targets, and this straight line is the curve learned by the model.

Since linear models give straight lines, the equation of the line can be written as y = w0 + w1*x1 + w2*x2 + ... + wn*xn, where y is the predicted target, the quantity that we want to know, and w0 is the intercept of the straight line. For each feature, a coefficient is learned, and these coefficients w1, w2, up to wn, one per feature, can be merged together and called the weights, a weight vector. With the bias included in the weight vector, its size becomes n + 1. x is the input and it has n features. The quantity we see here on the right is the mean squared error, which is what is minimized for learning in a linear model: we solve this equation and find a set of weights which minimizes it. There are different examples of linear models. Linear regression, ridge regression, and elastic net work on similar principles, but the minimization equation we just discussed differs a bit between them. These linear models are simple to understand and they run fast. The disadvantage is that when the data contains nonlinear relations and we need a nonlinear boundary, these models do not work well. For those data sets, we need to use nonlinear models.

Support vector machines, or for regression support vector regressors (SVRs), can be both linear and nonlinear. On the right, we can see the dark line, which is the decision boundary. Support vector machines in general are maximum-margin algorithms: the decision boundary has the maximum distance from the samples belonging to different classes. We need only the support vectors, the dark circles that we see here, to classify a new sample; the rest of the data can be thrown away, which makes support vector machines highly memory efficient, because the support vectors are few compared to all the data points. One advantage of support vector machines is that they work well with high-dimensional data, by which we mean data where the number of samples is small compared to the number of features. But one thing we need to remember is that they can be prone to overfitting when the data is high dimensional, so we need to use good regularization techniques to avoid that. They are highly memory efficient because only the support vectors are used and the rest of the data is thrown away. A disadvantage of support vector machines is their large runtime, which increases with the data. Examples of support vector machines are SVR and NuSVR, which are the nonlinear variants, and LinearSVR, which is the linear variant.
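To illustrate the memory-efficiency point, here is a small hedged sketch with scikit-learn's SVR variants on made-up data; it shows that only a subset of the training points end up as support vectors.

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

# Synthetic one-feature data, purely for illustration.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

# Nonlinear variant: SVR with an RBF kernel (NuSVR works similarly).
svr_rbf = SVR(kernel="rbf").fit(X, y)

# Linear variant.
svr_lin = LinearSVR(max_iter=10000).fit(X, y)

# Only the support vectors are needed for prediction; the rest of the
# training data could be thrown away.
print(len(svr_rbf.support_), "support vectors out of", len(X), "samples")
```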
Now we discuss k-nearest neighbors. The k-nearest neighbors approach is a very simple approach which is easy to understand: it finds the k nearest neighbors of a data point to predict a new sample. One advantage of k-nearest neighbors is that it is non-parametric. By non-parametric we mean that this algorithm can learn any form of decision function; it is not restricted to a fixed set of parameters. The effective number of parameters actually grows with the data, which is also one of the disadvantages of using k-nearest neighbors. Since it depends on a number of parameters that grows with the size of the data, the runtime increases, and because it needs to keep all the neighbors of each data point, it has high memory requirements. It is insensitive to outliers, and one of the good features of k-nearest neighbors is that it can learn irregular boundaries, so it is good for data sets with a high degree of nonlinearity. On the right, we can see a k-neighbors regressor in action: the target points are shown in yellow and the blue line is the prediction. The plot at the bottom shows the same algorithm with a slightly different parameter, and we can see that it is actually overfitting, because it is just trying to fit all the data points and not generalizing well.

Decision trees learn simple decision rules based on the features in the data set. Let's suppose we have two features in the data set, x1 and x2. On the right, all the data points are plotted, with x1 on the y axis and x2 on the x axis. We first divide the data set into two parts by the simple decision rule x2 < 0.30: all the samples are split into two sets, one on the left and one on the right. In the next node, we again make a decision, x1 < 0.8, and the samples there are again divided into two parts. Similarly, the rule x1 < 0.88 further divides the data set into two parts, and so on as we move downwards. The leaves give the categories that we have learned using all these decision rules, and counting them we can see that there are seven different parts in the whole data set. Whenever a new sample comes in, a decision is evaluated at each node and we need to follow only one of the paths, which makes it an efficient learning algorithm. So a decision tree learns simple rules, and for prediction we need to follow only one path. The advantages of a decision tree are that it is very easy to interpret, as we saw in this simple example with two features, and that for prediction we follow only one path, so the cost of predicting a new sample is logarithmic. However, there are a few disadvantages to using decision trees. They are very sensitive to variations in the data: if the training and test data differ somewhat, a decision tree becomes prone to overfitting, and it may give very good accuracy on the training data but perform poorly on the test data. Decision trees are also sensitive to imbalanced data sets: if one class is dominant, the data is not balanced, and the tree gives a very biased model. It is therefore very important to balance the data set before using decision trees.
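To make the decision rules and the k-nearest-neighbors behavior concrete, here is a small sketch on synthetic data; the feature names and the thresholds the tree prints are made up and will differ from the figure.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.RandomState(42)
X = rng.uniform(0, 1, size=(50, 2))          # two features, x1 and x2
y = np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.randn(50)

# A shallow tree learns simple threshold rules such as "x2 <= 0.30".
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))

# k-nearest neighbors: a moderate k smooths the prediction, while k = 1
# memorizes every training point, which is the overfitting case.
smooth = KNeighborsRegressor(n_neighbors=10).fit(X, y)
overfit = KNeighborsRegressor(n_neighbors=1).fit(X, y)
```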
To avoid these problems with decision trees, we have ensemble models. An ensemble model is nothing but a combination of many different trees. There are two approaches to ensemble models: bagging and boosting. In the bagging approach, independent trees are built using the same data, each learning on the data in an independent way, and then the average prediction is taken over all of these decision trees. Examples of bagging are random forest, the bagging regressor, and the extremely randomized trees regressor. The other ensemble approach is boosting. In boosting, we take a few decision trees and improve these models sequentially: weak models are combined to form a robust ensemble, and then an average prediction is taken from these models. Examples of boosting are AdaBoost, gradient boosting, extreme gradient boosting, and so on. On the right, we can see an ensemble model with 30 different trees. Each tree makes a prediction, and we take the average prediction as the final prediction for unseen data. Ensemble models generally give better accuracy than single decision trees, but one of their disadvantages is that they are computationally expensive: if the data grows, or if you use a lot of decision trees as the number of estimators, the runtime increases.
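As a hedged sketch of the two ensemble approaches just described, here they are side by side on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic regression data, for illustration only.
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)

# Bagging: independent trees built on the data, predictions averaged.
bagging = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

# Boosting: weak trees improved sequentially into a robust ensemble.
boosting = GradientBoostingRegressor(n_estimators=30, random_state=0).fit(X, y)

print(bagging.predict(X[:3]))
print(boosting.predict(X[:3]))
```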
In this talk, we learned about regression in general, and then we talked about different models for regression, some linear, some nonlinear. Now we will go to the hands-on section, where we will be using the regression techniques we discussed here on our biological data.

Hello everyone. In this session, we will be doing a hands-on with a biological data set, and the task will be regression. In the previous session, we briefly discussed what regression is and different techniques for doing regression. Before going into the hands-on tutorial, let's look at the Galaxy training website. The link to the website is training.galaxyproject.org, and we can see there are several tutorials for different analyses. The tutorial that we will be using for the hands-on session is under "Statistics and machine learning". We open this category and find "Regression in machine learning" at the bottom. Let's open the tutorial. In this tutorial, we will see some background about regression and about the data sets that we will be using. Then we will see how the different models and algorithms of regression are used, and at the end we will look at the visualization techniques for evaluating the performance of each of these regression algorithms. We will also see how to optimize the hyperparameters. In the introduction part, we discussed one of the hyperparameter optimization techniques, grid search; we will use the same grid search approach to optimize the hyperparameters of one of the regressors.

Let's start the tutorial. We have already discussed what regression is: it is one of the tasks of supervised learning, where the targets are real numbers. These targets can be anything: a DNA copy number, gene expression patterns, or biological age, which is the target in this hands-on session. To learn, we try to minimize a cost function, which gives the error between the true and the predicted targets. There are different examples of cost functions, such as mean squared error, mean absolute error, the coefficient of determination or R squared, and so on.

In this tutorial, we will be using a data set based on the Jana Naue et al. study, which does chronological age prediction using DNA methylation data. In this data set, the biomarkers are genes containing CpG sites that are DNA methylated, and these DNA-methylated CpG sites have the highest correlation with biological age. There are 13 such biomarker genes, which have the highest correlation with age, and that is why they are used in the data set for learning and predicting biological age. You can learn more about the original analysis in the paper, which is linked in the tutorial. If you are interested, you can also learn more about DNA methylation and CpG sites; these are linked as well. In DNA methylation, a methyl group gets attached to one of the nucleotides; in the case of CpG sites, the methyl group gets added to the cytosine nucleotide. Because of that, the gene expression pattern changes, and the DNA methylation pattern can also be measured against age, to see how these patterns change with age. Using this data set, we will apply different regressors and see how each of them performs.

Just to reiterate: we will be using regression techniques to analyze a DNA methylation data set, obtained from blood cells. First, we will download the data sets and upload them to Galaxy. We will use a linear model, which is a simple model, and then use the trained model to predict on an unseen data set. We will also visualize the predictions to see how well we are doing. Then we will use an ensemble method for the regression task, and after that we will optimize the hyperparameters of the ensemble method.

We have already discussed regression a couple of times. In the plot, we can see the targets, which are blue circles, and the fitting curve, which is the red one. This fitting curve is plotted across the targets and is learned by a regressor. When we have targets of the kind shown in figure 2, we can say that there is a linear relationship, and if we use just a linear regressor, we can already find a very good fit. But such data sets are not always available: in real life, many data sets have nonlinearities. Therefore, it is good to try out nonlinear algorithms as well, to see if the performance improves.

We have briefly discussed what a cost function is; let's look at it once more. The blue dots are the targets of the data set and the black line is the curve learned by the algorithm, which gives the best fit through these target points. A cost function determines the error between the true value and the predicted value: the predicted value lies on the black straight line, the true value lies somewhere off it, and we try to minimize this distance by learning the curve. This error is computed for each of the data points and then averaged. The straight line that gives the lowest error is the best straight line explaining the targets.

As discussed before, we will be using a DNA methylation data set for this tutorial, and we will apply a couple of scikit-learn algorithms, which are available in Galaxy, to predict biological age using DNA methylation patterns. First of all, we need to download the data sets into Galaxy. Before doing that, let's go to Galaxy. This is the homepage of Galaxy. On the left side there are lots of tools, which are different algorithms for different analyses; here we will find our different regressors.
On the right is the Galaxy history, where all the data sets are uploaded. We did the same for classification. For any new analysis, like the one we will be doing here, it is always good to create a new history. Let's click on the plus button to create a new history and give it a meaningful name. Now we upload the data sets. There are three of them. The first is the training data set, the train rows. The second is the test rows with the true labels, which we will use for evaluation. The third contains only the test rows, without the labels. The second and third data sets are actually the same; the only difference is that the second one contains the true labels. We will use the third data set for prediction, and then compare the predicted targets with the true targets from the second data set.

These data sets are also available on Zenodo. We can go there and have a look: all the data sets are available, and you can also download them and view them. To copy the data sets, we can either select and copy the links there, or alternatively copy the links from the tutorial. To upload the data sets into Galaxy, we click on "Upload Data", go to "Paste/Fetch data", and paste all the links. We need tabular data sets, so it is always good to set the data type to tabular here. We start the upload process, everything turns green, and then we close the window. Now we can see all three data sets queued in the history; soon they will be uploaded, once all of them turn green. We can wait a bit here. The steps that I followed are also written in the tutorial: copy all the links, go to "Paste/Fetch data", and start the process.

We should also rename the data sets. As we see here, all these data sets have their links appended to their names, which is not very meaningful, and the names are very long, so it is good to have short and meaningful names. Once they turn yellow, once they start getting uploaded, we can change the names. We should also check that the data type of all three data sets is tabular: if it is, for example, comma-separated or a plain text file, our algorithms will not work. Our data sets are uploaded; let's now rename them. We click on the edit attributes pencil icon, and this screen is shown to us. We remove the link from the name and save it, and do the same for the remaining two data sets. Now we have meaningful names for our data sets and we can browse through them. We can see that the test data set has 105 rows and 13 columns; the columns are the genes, and the numbers are the recorded DNA methylation patterns. The second data set has the same test rows, but it also has age as one column, which holds the true targets. Then there are the train rows: 209 lines in total, where the first row holds the column names, so we will use only 208 rows for training, and the target, the age, is defined for each row. Going back to the tutorial, we have done all these steps; the number of rows is given there as well, and as already discussed, we have 208 rows corresponding to individual samples.
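For readers following along outside Galaxy, the same loading and feature/target separation could look roughly like this with pandas; the file names are hypothetical stand-ins for the three uploaded tabular data sets.

```python
import pandas as pd

# Hypothetical local file names for the three tabular data sets.
train = pd.read_csv("train_rows.tabular", sep="\t")              # features + age
test = pd.read_csv("test_rows.tabular", sep="\t")                # features only
test_labels = pd.read_csv("test_rows_labels.tabular", sep="\t")  # features + true age

# All columns except "age" are the features; "age" is the target.
X_train = train.drop(columns=["age"])
y_train = train["age"]
print(X_train.shape, y_train.shape)  # expected: (208, 13) and (208,)
```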
So the data comes from different people, and for each row there are 13 features, with the last column, age, defining the target. Similarly, the test set contains 104 rows coming from 104 different people, with age defining the target, the biological age.

First of all, we will use a linear model for training on the DNA methylation data set and then predict the age. In the hands-on section, we find the required tool in Galaxy: we select its name and search for it in Galaxy. It's right here; we just make the screen a bit smaller. This is a generalized linear model tool, so it contains both classification and regression models; we will be using the regression model here. Let's go back to the definition. We need to specify that we want to train a regression model, and then we choose the linear regression model, which is right here. Our data is tabular, so we select tabular data. Now there are two sections: first we need to select all the features, and then the target. To do that, we first select the training sample data set, the train rows. Our data set contains a header; we have seen this, these are the headers with our gene names, so we answer yes, our data set has a header. Since our features and targets are contained in the same data set, we need to separate them, and this is where the next option comes in handy. We need only the features, only the gene columns, so we choose the option "all columns excluding some columns by header name", which selects everything except the header name we type. We type "age", so it takes all the columns except age, which are our features. To select the target, we choose the same data set, because the targets are also present there, and it contains a header. Now we need to select only one column, the target column, so we choose the option "select columns by header name" and use the same header name, age. We can check that we have done the same things as specified in the tutorial, and then we run the tool.

We can see that a new data set has been created but is still queued in Galaxy; the job has not started running. Meanwhile, we can go back to the tutorial and try to answer this question: what is learned by a linear regressor? As we saw in the presentation, a linear algorithm learns a straight line, so it learns the function of that line: the coefficients, the weights for each feature, are what the regressor actually learns, and using those weights we can predict the target for a new sample.

We can see that our job is running now, so meanwhile we can move on to predicting age using the test data set. The test data set does not contain any age information; we will predict it using the model that we have just learned. This is our learned model, and we use the same tool as before, but now in a different mode: first we used "train a model", and now we use "load a model and predict". For that we need two input data sets: the first is the model, which is automatically selected, and the second is the data set that we want to predict on, the test rows. Our test rows data set contains a header, so we set that to yes, and we want to predict class labels; here the class labels are the real-number targets.
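Internally, the Galaxy tool corresponds roughly to the following scikit-learn sketch, continuing the hypothetical data frames from the earlier snippet; the learned intercept and coefficients are what the question above referred to.

```python
from sklearn.linear_model import LinearRegression

# Train the linear model on the 13 methylation features.
model = LinearRegression().fit(X_train, y_train)

# What is learned: the intercept w0 plus one weight per feature.
print(model.intercept_)  # w0
print(model.coef_)       # w1 ... w13, one coefficient per gene

# Predict the age for the unseen test rows.
predicted_age = model.predict(test)
```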
Let's just verify that we have done the same things as the tutorial; yes, we have done it right, and then we execute it. This will create predictions for all the rows, and these predictions are the age. The job is already running. To compare the performance, we need some visualization tools. Galaxy has visualization tools that we can use to evaluate and visualize the performance of all the tools we are using. Before using the visualization, we need to remove the header from one of the data sets. Our prediction job is already finished: it is a tabular data set and we can visualize it. We can see that the last column is predicted based on all the features and our trained model; this column is the predicted age. Now we will use the data set with the true ages and see how well we are doing. Before that, we need to remove the header from the true labels data set, because the predicted data set does not contain a header. So we go to the test rows labels and do that. We should also rename our predicted file, so that in the future we remember which data set we are dealing with, and we rename the data set without the header as well, by clicking on the pencil icon.

Now that the data sets are renamed, we can easily use the visualization tool. Let's copy the name of the tool and search for it in the Galaxy tool search. The tool definition is now open, and it has two parameters: the input data file and the predicted data file. The input data file contains the true targets, which are present in the test rows labels without header, and the predicted data file is the predicted data from the linear model. Then we run this job. It will generate three different plots. Until these jobs turn green, we can already look at the results in the tutorial.

The first plot is a comparison of the true and predicted values, drawn as points. The blue points are the true values and the predicted values are in orange, and on the x axis we see the index of each data point. We have 104 rows in the test set, and for each row we get a true and a predicted value. We can see that for most of the rows in the test set the true and predicted values are close to each other, which says that the performance is good. If the blue and orange points differed strongly for each point on the x axis, we would say that our performance is not good, because the true and predicted values would be far from each other.

Then we have the scatter plot, which plots the predicted against the true values. For a good performance, most of the points should stay along the orange line, the x = y curve: true and predicted values lying on the x = y curve means they are very close to each other. If the points are scattered far from the x = y curve, our prediction is not good. In this plot we can also see that the root mean squared error is 4.1, which says that we are predicting the biological age with an average error of 4.1 years. For example, if the true target is 50, then we are predicting around either 46 or 54 years. The R2 score is 0.93; we will discuss the R2 score soon, but just as a brief introduction, if it is close to 1.0, our prediction is very good.
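Outside Galaxy, the same numbers and the scatter plot could be sketched like this, continuing the earlier hypothetical snippet:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

y_true = test_labels["age"]

# Root mean squared error (about 4.1 years here) and the R2 score.
rmse = np.sqrt(mean_squared_error(y_true, predicted_age))
r2 = r2_score(y_true, predicted_age)
print(rmse, r2)

# Scatter plot of true vs predicted age; good predictions hug the x = y line.
plt.scatter(y_true, predicted_age)
lims = [y_true.min(), y_true.max()]
plt.plot(lims, lims)  # the x = y reference line
plt.xlabel("true age")
plt.ylabel("predicted age")
plt.show()
```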
If it is a negative number or a very small number, then our regression prediction is not good. R2 is one of the error functions used in regression tasks. The third plot is the residual plot, which plots the residuals against the predicted targets. The residual is the predicted target minus the true target, so it can be a negative or a positive value. How do we analyze this plot? If the points are randomly scattered around the y = 0 line, we say that our prediction is good; if the plot shows any kind of pattern, we can say that our model is suffering from some problem. So the points should be distributed along the y = 0 line and should not show any pattern. Using these three plots, we can judge the performance of our algorithm in different ways.

Our jobs are now finished. Let's collapse the tool panel so that we have a bigger viewing area, and click on the actual-versus-predicted curve. In this interactive plot, we can see that for one row the true value is 66 and we are predicting around 59. For another point, the true value is 68 years and we predict 65.33. There are some not-so-good predictions, such as 69 against 60, but there are also some very good ones; for example, for one particular row in the test set, 65 is the true target and 65.53 is predicted. Let's open the scatter plot as well. Here we can see that most of the data points lie along the x = y curve, which says that the predicted and true values are close to each other. There are some predictions which are far away from the true targets, but many are close to the x = y line, and that is what matters: most of the points in this plot should lie along the x = y line. The third plot is the residual plot, and we see that the points are scattered around the y = 0 line and do not show any distinguishable pattern. We will see the same plots for the other regressors as well.

Now let's go to the tutorial and learn a bit more about the coefficient of determination, also called R2 or R squared, a cost function popularly used as a performance metric for regression tasks. In figure 8, all the points are scattered far from the x = y curve, and we see that the R squared score is negative, which indicates a bad performance. If most of the data points lie very close to the x = y curve, then our performance is very high; here we see almost the best performance, because the maximum value R squared can reach is 1.0, while the minimum is any negative number, which can go down to a very large negative value. We have one question here to answer: having inspected the plots, what can we say about the predictions? From figures 5, 6 and 7 we can say that the prediction is good and acceptable. We have a high R squared score, 0.93, and the predicted and true ages for most of the samples in the test set are very close to each other, which gives an idea of a good performance.
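Before we move on, here is a residual plot in the same sketch style:

```python
import matplotlib.pyplot as plt

# Residual = predicted - true; a good model scatters randomly around y = 0.
residuals = predicted_age - y_true
plt.scatter(predicted_age, residuals)
plt.axhline(0)  # the y = 0 reference line
plt.xlabel("predicted age")
plt.ylabel("residual (predicted - true)")
plt.show()
```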
Now we have used a linear regression model for predicting biological age from DNA methylation data. A linear model is always a good starting point, but it is not necessarily the best approach; we should also try nonlinear models to see if we can do better. Therefore, in this section of the tutorial, we will use an ensemble method for the regression task of predicting biological age from the DNA methylation data set. One example of nonlinearity can be seen in figure 10, where the targets are scattered like a curve: if we tried to fit all these target points with a straight line, we would not get the best results. To have all these target points explained, we need a nonlinear algorithm that finds the right curve through most of them.

In the ensemble suite, there are several algorithms that can be used as regressors, for example the random forest regressor, the AdaBoost regressor, and the gradient boosting regressor. Here we will use the gradient boosting regressor. The paper whose data we are using here used a random forest, which is an ensemble-based regressor, as I mentioned before, and later we will compare the performance of both of these regressors.

Let's go to the hands-on section. We need a tool named "ensemble method for classification and regression"; we copy the text, go to Galaxy, and find this tool. We open the tool definition and see which options we need to set. We want to train a model. The form has not loaded yet, so we wait a few seconds. Then we need to select the ensemble method, the gradient boosting regressor. We need to be careful here, because the same drop-down also offers the gradient boosting classifier; choosing that one instead would give a failed run. The form is now open: we select "train a model", and then the gradient boosting regressor. Our data is tabular, and all the other options remain the same as before. We select the train rows; they have headers, and we want only the features, so we exclude the age column from this data set, which gives us all the feature columns. Then we select the target values: we pick the same data set, because it contains the targets as well, it has headers, and we want to select a single column, so we choose "select columns by header name", put age there, and execute the job.

The output data set has been created but the job has not started running yet. Meanwhile, we can go back to the tutorial and answer this question: what is learned by a gradient boosting regressor? A gradient boosting regressor has several attributes which are learned internally by the algorithm from the data. One of them is the feature importances. We have several features in the training data, and for each feature this regressor gives an importance score saying how important that particular feature is. If the importance is high in magnitude, the feature is very important and highly correlated with the age, or with the target in general; if the magnitude is small, it is not so correlated. This is useful for feature engineering: if there are a lot of features, for example, we can remove all those features having small feature importance values, which will improve the runtime of the algorithm.
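As a rough scikit-learn equivalent of this step, again using the hypothetical data frames from before, the feature importances can be read off the fitted model:

```python
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# One importance score per feature; higher means more strongly
# correlated with the target, the age.
for gene, importance in zip(X_train.columns, gb.feature_importances_):
    print(gene, round(importance, 4))

predicted_age_gb = gb.predict(test)  # predictions on the unseen test rows
```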
Other learned attributes include the number and kind of estimators, and there are many more, such as the OOB improvement. The OOB improvement stores the incremental improvement: gradient boosting follows a boosting approach, in which a few trees, which are weak learners, improve themselves sequentially and incrementally to finally form a robust ensemble model, and the OOB improvement stores that change.

Our job has finished and we have the trained model here. Now we will use this trained model for predicting on the test rows. Gradient boosting is an ensemble method, so we use the same ensemble tool, but now in a different mode, "load a model and predict". It needs two data sets: the model, which is the gradient boosting regressor, and the test rows without the true age. Our data set contains a header, and we want to predict the targets. Then we execute it. We copy the suggested name from the tutorial so that we can rename our new data set: we go to the pencil icon and rename it to predicted data using gradient boosting. Using this predicted data set, we can draw the different visualizations to see how well we are doing with the ensemble model. In the plotting tool, we need to select two inputs: the input data file, which is the test rows labels (we just confirm this from the tool definition; yes, it is the test rows labels), and the predicted data file, which is our predicted data. Then we execute it. It will give the same kinds of plots, and we will see how well we are doing with the ensemble model.

While these jobs are running, we can look at the plots in the tutorial. In the scatter plot, most of the data points lie along the x = y line, which is good, and the root mean squared error is 3.85. If you remember the root mean squared error with the linear model, it was 4.1; we can confirm that by looking back at the linear model results. So with the linear model it was 4.1, and with the ensemble model it is 3.85. Let's look at the data set which was just created: the root mean squared error is 3.85 years, which means that we are predicting with an average error of 3.85 years. Let's look at the residual plot; these files are a little bigger, so they take some time to load. The residual plot shows the residuals, the differences between the predicted and true values, against the predicted values. The points are scattered all across the y = 0 line and do not show any kind of pattern, which is good and shows that the performance is good. We can also look at the actual-versus-predicted curve: here too the predicted values lie close to the true values for most of the points in the test data, which says that our performance is reasonably good.

We saw a slight improvement in performance using the ensemble model in comparison to the linear model. We can also use a hyperparameter search technique to see if we can do still better, because every machine learning algorithm has many hyperparameters, which need to be optimized for better performance. To do this, we need to build a pipeline, which packages pre-processing steps and estimators together. In this particular tutorial we are not using any pre-processing steps; in the pipeline we are just using the algorithm, the gradient boosting algorithm. But in real life we should apply some pre-processing techniques to the data before the estimator, because raw data contains a lot of noise and outliers which should be removed to get a very good performance.
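The pipeline builder corresponds roughly to scikit-learn's Pipeline. Here is a hedged sketch; the scaling step is included only to show where pre-processing would go, since the tutorial itself uses no pre-processing step.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The tutorial's pipeline holds only the estimator; the scaler below is
# an illustrative placeholder for a real pre-processing step.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("regressor", GradientBoostingRegressor(random_state=42)),
])
pipe.fit(X_train, y_train)
```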
Let's find the pipeline builder tool. In this tool, as I mentioned before, we can use pre-processing steps: we can choose different types of transformations on the data, which are applied sequentially, and we could add several pre-processing steps one after another, but in this step we are not doing that. We will just set the final estimator, which is our regressor. Since gradient boosting is an ensemble method, we go to the scikit-learn ensemble category and find the option for the gradient boosting regressor. We set the random state to 42, which is there simply to get repeatable results every time we run the experiment. We set "output parameters for searchCV" to yes; we want this so that we can set the hyperparameters, as we will see in the next step. Then we execute the pipeline builder. It returns two data sets: the first is a tabular list of all the hyperparameters of the gradient boosting regressor, together with their default values, and the second is a zip file, the pipeline containing the estimator algorithm. We have finished this step and are waiting for the results.

Next we will use another tool for actually searching the hyperparameters. In this tool there are two options: one is grid search and the other is random search. In grid search, we specify discrete values for each hyperparameter we want to optimize, while in random search we give a range from which the algorithm samples a value to use for a particular hyperparameter. The hyperparameter that we want to optimize here is the number of estimators. Its default value for the gradient boosting regressor is 100, but we don't know whether 100 is the optimal value, so we will choose some values below 100 and some above, to see if we get a better performance, or to be more specific, a lower root mean squared error or a higher R squared score. The search over the hyperparameters is done through a fivefold cross-validation, which we discussed briefly: the entire training data set is divided into five equal parts, four parts are used for training and one part for validation. As shown in the example image, the validation set keeps shifting, so that each part, and therefore each sample, is used both for training and for validation. We get an accuracy for each fold, for each iteration, and then average the accuracies. Similarly, we could do a threefold or tenfold cross-validation as well.
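The fivefold cross-validation just described can be sketched like this with the pipeline from above:

```python
from sklearn.model_selection import KFold, cross_val_score

# Five equal parts: four train, one validates, and the validation part
# shifts so that every sample is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # arbitrary seed
scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="r2")
print(scores, scores.mean())  # one R2 score per fold, then the average
```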
Our jobs have finished, including the pipeline builder tool that we ran a few minutes before. Let's look at the tabular data set and what it contains. We can see different parameters: these are actually the hyperparameters of the algorithm, which can be set by the users, and the performance of the algorithm varies depending on their values. Default values are already set, but we need to find the right values for our data using the hyperparameter search tool. First, we find this tool in Galaxy's tool suite. Here is our tool, and the tool definition is open. As I explained before, there are two approaches to hyperparameter search, grid search and randomized search; in this tutorial, we will use grid search. The tool takes several files and parameters, and we will fill in the right values with the help of the tutorial. We have already selected GridSearchCV, and our pipeline, the estimator object data file, is already loaded here. The option asking whether the estimator is a deep learning model should be set to no. Now we need to select the file containing the parameter names; when we select this file, the drop-down with the parameter names is populated automatically. Now we look for the particular hyperparameter we want to optimize: we want to optimize n_estimators, so we select it, and its default value is 100. We give a search list containing the different values that n_estimators should take instead of 100: in the first iteration it takes n_estimators equal to 25, in the next 50, then 75, then 100, and then 200. For each of these iterations, an accuracy is reported, and the best one is picked. Since there are many other hyperparameters that could be tuned, we could insert another parameter into the search and choose, let's say, alpha, and specify different values of alpha in the same way, but we are not optimizing it, so we will not add more parameters.

Let's go back to the tutorial and see which advanced options we need to set. We need to select the primary metric for scoring, which is R squared, so we select R squared. Next we select the cross-validation splitter: we select k-fold cross-validation, and we will use five splits, so the data is split into five equal parts. If we specified 10 here, the data set would be divided into 10 equal parts. The number of splits should be at least two; we should not specify fewer than two, because that would not make any sense. By default, each fold divides the data in a contiguous way; we want to shuffle the data before splitting, so we set shuffle to yes, and we specify a seed number to get repeatable results. Then we look at which other options need to be set. Another one is "raise fit error", which is already set to no here. If it were yes and some value in the search list caused an error, our program would stop; with no, if an error is encountered, execution keeps running and that value is simply skipped, which is good for us, because we don't want one wrong value, which would not give any results anyway, to waste the whole run. That finishes the advanced options. Now we select the input data type: as before, our data is tabular. We want to select all the features from the train rows, so we select that data set, say that it contains a header, and exclude the age column from the features list. In the second part, we want to select only the target column, so we select that column by header name. Then we go back to the definition and check that we are doing it right: testing separately is set to no, and the hold-out option stays as it is. And we want to save the best estimator: as we discussed before, for each tested value of n_estimators there will be a different model, and we want to keep the best estimator, the one that gives the best performance.
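Putting these grid search settings into a scikit-learn sketch (the tested values mirror the tutorial: n_estimators in 25, 50, 75, 100, 200, R squared as the scoring metric, shuffled fivefold splits, and failing values skipped rather than raised):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold

# "regressor" is the name of the estimator step in the pipeline above.
param_grid = {"regressor__n_estimators": [25, 50, 75, 100, 200]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="r2",                                     # primary metric
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    error_score=np.nan,                               # skip failing values
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```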
We will then use this best trained estimator to predict on the test rows. Now we execute the tool. It will generate two data sets: the first is the best estimator for the best hyperparameter set, and the second is the detailed result of the hyperparameter search, which will be a tabular file. It has all the information: which hyperparameter values were tested, what their ranks are, what the accuracy is for each tested value, and how long the job took for each value. The time matters too, because if you specify a high number of estimators, it takes longer to finish.

We used an ensemble method as the estimator, so for predicting the age we again need the "ensemble method for classification and regression" tool; we go to the tool panel and find this tool again. In the meantime, our search job has not started yet. We scroll through the list, and the tool is at the top; we have found it. Now we need to run this tool in prediction mode. Going back to the tutorial: in the ensemble method tool, we use the "load a model and predict" option. The model that we want to use is still not finished, but its data set entry has already been created, so we can select it and the job will simply be queued. The prediction data is the test rows, the data set contains a header, and we want to predict the class labels, the age, which are real numbers; then we execute the tool. We can execute it already, but it will not produce any results until the upstream jobs are finished. Meanwhile, we can see that the jobs have started running. This tool returns one file: the test rows with the predicted age. We can use the same plotting tool to plot the performance of this ensemble method with the best hyperparameters: the input data file is the file with the true age information, and the predicted data file is the file that is being generated now. Then we execute the tool. As in the previous plotting examples, we will get three plots.

We can see that the prediction job has finished and has given us the predicted age: the predicted column is the age, the predicted age, and we will compare it with the true age. Another data set we should look at is the hyperparameter search result. This data set contains all the information about what was done in the hyperparameter search operation. We can see the n_estimators hyperparameter that we tried to optimize, with the different values that we used listed here, together with the accuracy, the mean test score. The n_estimators value with rank one has the highest accuracy in this particular hyperparameter search: the default value gives an accuracy of 0.9091 and the best value gives 0.9146, which is slightly higher. These scores are actually R squared scores; the closer to 1.0, the better. There are other columns that can be explored: since we specified fivefold cross-validation, the accuracy is given for each fold, fold zero, fold one, up to fold four, five folds in total. We can also see that when the number of estimators is smaller, the fit time, the training time, is less, and as the number of estimators increases, the fit time increases as well.
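The same details can be inspected outside Galaxy through cv_results_, continuing the sketch:

```python
import pandas as pd

# Rank, mean test score (R squared) and fit time for each tested value.
results = pd.DataFrame(search.cv_results_)
print(results[["param_regressor__n_estimators", "rank_test_score",
               "mean_test_score", "mean_fit_time"]])

# The saved best estimator can predict the test rows directly.
best_predicted_age = search.best_estimator_.predict(test)
```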
Let's look at the plots. First, the residual plot: again we see that the points are scattered around the y = 0 line with no visible pattern, so we expect the performance to be good. Then we look at the scatter plot: most of the points lie along the x = y line, the root mean squared error is 3.76, and the R squared is 0.94. With the ensemble method without hyperparameter optimization we had 3.85 years as the root mean squared error, and now we have improved it to 3.76, which says that it is important to optimize the hyperparameters of an algorithm to see if we can do better. We also look at the actual-versus-predicted values: the points show the true and predicted values, and we see that most of them are close to each other, as the prediction error is only 3.76 years on average.

Let's go back to the tutorial. We see the same plots there after doing the hyperparameter search: the residual plot has no patterns, most of the data points lie along the x = y line, and the true and predicted values for each sample in the test set are close to each other. In comparison, the Jana Naue et al. study, which used a random forest as the regressor, achieved 3.93 years as the root mean squared error, and with our hyperparameter search and the gradient boosting algorithm we achieved 3.76 years. So we did slightly better than the paper, which again shows that it is very important to optimize the hyperparameters of the algorithm. This tutorial also shows that with the machine learning tools in Galaxy we can achieve state-of-the-art predictions, using machine learning regressors as well as classifiers.

With this, we reach the conclusion of this tutorial. In this tutorial and the presentation before it, we learned what regression is and its basic concepts; the different techniques for doing regression, for example linear models, which try to fit a straight line across the targets, and nonlinear models, which learn the nonlinearities in the data; the different visualization plots that can be used to evaluate the performance of a regressor; and, at the end, how hyperparameter search algorithms can improve the performance of an algorithm, because there are many hyperparameters, their ideal values are not fixed, and they vary from data set to data set and problem to problem. Therefore it is important to optimize the hyperparameters. I hope you learned something new from this tutorial. You can try out different regressors on the same data, or alternatively use a different data set and try different algorithms on it. For example, we did not use any pre-processing technique: you can try out different pre-processing techniques on a raw data set, use a hyperparameter search technique, maybe the random search, and compare the results across different algorithms. Thank you. At the end of this video tutorial, you will find a feedback section. Help us improve the content of the tutorial by giving your invaluable feedback. Thanks a lot.