Hello, everyone. I'm Anup. I work with the Freiburg Galaxy team in Germany. In this talk, we will discuss classification and the different approaches used for it. In the previous talk, the introduction to machine learning, we briefly discussed what classification is; here is a short recap. In classification tasks, a class is a category, and a data set can contain multiple classes. These classes, which can also be called categories, are represented as integers such as zero, one, two, three, and so on. We also saw in the previous talk that machine learning has two main variants, supervised and unsupervised learning. In supervised learning, we have an output or class for each data point in our data set, so we have supervision through the classes, and our classifiers learn to map the features to the classes. When a new sample comes in, the model tries to predict the class of that sample. There are two kinds of classification. One is binary, where there are only two classes. For example, the classes can be cancer and no cancer: cancer can be represented as zero and no cancer as one. Similarly, we can have spam and no spam classes for a spam-filtering task. In addition to binary classification, we can have multi-class classification, where a data set has more than two classes. For example, a handwritten-digit recognition data set has 10 classes, one for each digit: 0, 1, 8, 7 and so on. On the right, we can see two images of a binary classification problem. The red crosses belong to one class and the blue circles to another. In the first image, we learn a boundary with a linear classifier and we see the two classes are nicely separated. In the second image, there are some outliers in the second class and the decision boundary gets tilted a bit. Let's look at the linear model.
In a linear model, or linear classification, we learn a straight line as the decision boundary. The input data point x in the image can be represented by two features, x1 and x2. The linear classifier learns a weight for each of these features, which we can call w1 and w2. There is another weight component, called the bias, represented as w0. W0 is the intercept of the straight line, and w1 and w2 are the feature coefficients. Using all of this information, the feature vector x, the weight vector, and the intercept, we can define a function y that describes a straight line. This gives the decision boundary: if y is greater than zero, the input data point x is assigned to class one, and if it is less than zero, it is assigned to class two. In real life, not all data sets can be separated by linear classifiers; we need more complicated classifiers. One of the nonlinear classifiers is the support vector machine. The support vector machine has a linear variant as well, but also nonlinear variants for learning nonlinear curves. The support vector machine is a maximum-margin classifier: it learns a decision boundary that is maximally separated from the nearest samples of the two classes. In the top image, we can see that y = 0 is the decision boundary, and we have three data points: two data points on the line y = 1, which belong to class one, and another data point on the line y = -1, which belongs to the other class. The decision boundary y = 0 is equally separated from both sets of points and has the maximum margin between them. These nearest data points to the decision boundary are called support vectors, and to classify a new data point we need only these support vectors; the other data points can be thrown away.
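The decision rule just described can be sketched in a few lines of Python; the weight values here are made up purely for illustration.

```python
# Hypothetical weights for illustration: w0 is the bias (intercept),
# w1 and w2 are the feature coefficients learned by the classifier.
w0, w1, w2 = -1.0, 2.0, 0.5

def predict_class(x1, x2):
    """Evaluate y = w0 + w1*x1 + w2*x2 and assign a class by the sign of y."""
    y = w0 + w1 * x1 + w2 * x2
    return 1 if y > 0 else 2

print(predict_class(1.0, 1.0))  # y = -1 + 2 + 0.5 = 1.5 > 0, so class 1
print(predict_class(0.0, 0.0))  # y = -1 < 0, so class 2
```

Training a linear classifier means finding values of w0, w1, and w2 that make this rule agree with the labeled data as much as possible.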
This is one of the advantages of support vector machines. For example, when a new data point, colored blue here, comes in, we say that it belongs to class one because it falls to the right of the y = 1 line, which belongs to class one. In the bottom image we can see the nonlinear variant of support vector machines. The dark black line gives a nonlinear curve separating the blue circles and the red crosses, and the circled points are the support vectors belonging to the two different classes. There are other nonlinear classifiers, such as nearest neighbor classifiers. In these classifiers, we need to define how many neighbors a data point can have, for example five, seven, or 15. These neighbors are computed based on a distance metric, which can be Euclidean distance, Manhattan distance, or some other distance measure. When a new data point comes in, its class is determined by the class shared by the maximum number of its neighbors. In the image, the green data point is the new data point; three of its neighbors have the red class and one is blue, so the class of this new data point would be red triangle. One example of a nearest neighbor method is K nearest neighbors, where K defines the number of neighbors, which we need to set before learning the model. One of the advantages of nearest neighbors is that it can learn any kind of boundary, however irregular, that separates the classes. One of the downsides of nearest neighbor approaches is that they store the training data, and storing training data when it is very large becomes intractable; therefore, with a very big data set, it is hard to use nearest neighbor approaches. Another classifier is the decision tree. On the right, we can see a small decision tree plotted.
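The neighbor-voting idea can be sketched with scikit-learn's K nearest neighbors classifier; the toy points and the choice of K = 3 here are invented for illustration.

```python
# A minimal k-nearest-neighbour sketch with scikit-learn; the toy points
# and the choice k=3 are made up for illustration.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1],   # cluster for class 0
           [8, 8], [8, 9], [9, 8]]   # cluster for class 1
y_train = [0, 0, 0, 1, 1, 1]

# n_neighbors is K; metric="euclidean" is one possible distance measure.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# The new point sits inside the class-0 cluster, so its 3 nearest
# neighbours all have class 0 and the majority vote returns 0.
print(knn.predict([[1.5, 1.5]])[0])  # 0
```

Note that `fit` here mostly just stores the training data; the real work of finding neighbors happens at prediction time, which is why the method struggles with very large data sets.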
This data set defines whether a person is fit or not. It has three features: age, whether the person eats a lot of pizza, and whether they exercise in the morning. We define simple decision rules, for example age less than 30: all samples in this category fall on the left side, and all people with age greater than 30 fall on the right side. Then we use another feature, the pizza-eating habit, and separate the training samples again into two parts: if a person eats a lot of pizza, those samples are labeled unfit, and if not, they are fit. Similarly, the other feature is exercising in the morning: people older than 30 who exercise in the morning are fit, and people older than 30 who do not are unfit. These are simple decision rules that the decision tree learns from the data itself. The splitting based on each feature is done at each internal node, as we saw in this example, and the labels are present at the leaf nodes, which we see at the bottom. Let's discuss the advantages and disadvantages of decision trees. As we saw in the example on the right, they are very easy to understand: we have simple decision rules, for example age less than 30, and then we divide our training data into two parts. The tree follows a logical order and is very simple to interpret. Another advantage of decision trees is that they can be used with categorical data for predicting classes, and also with numerical data. In addition, prediction with decision trees is logarithmic in the number of data points: once we have a trained model, prediction takes little time because we only need to follow a single path from the root to a leaf to classify an unseen sample. Along with these advantages, decision trees have some disadvantages as well.
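The fit-or-unfit example can be mimicked with scikit-learn's decision tree; the six sample people here are entirely invented for illustration.

```python
# A toy version of the "fit or unfit" tree; the six samples are invented.
# Features per person: [age, eats_lots_of_pizza (0/1), morning_exercise (0/1)].
from sklearn.tree import DecisionTreeClassifier

X = [[25, 1, 0], [22, 0, 1], [40, 0, 1], [45, 0, 0], [35, 1, 0], [28, 0, 0]]
y = ["unfit", "fit", "fit", "unfit", "unfit", "fit"]

# max_depth limits how deep the learned rules can go, which also
# helps against overfitting (discussed below).
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Predicting follows one root-to-leaf path of learned decision rules.
print(tree.predict([[24, 0, 1]])[0])
```

`sklearn.tree.plot_tree(tree)` would draw the learned rules, similar to the diagram shown on the slide.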
One of the disadvantages is overfitting: decision trees can become overly complex, which makes them prone to overfitting. This means the tree performs very well on the training data, with high accuracy, but performs poorly on unseen data. Another disadvantage is that their performance can have high variance: if the data set changes a little, the entire decision tree can be different. For example, if the training and test data sets have slight differences, we can get very different results. Another disadvantage appears when our data set is imbalanced in the number of classes: if one class is dominant and another is not, the decision tree can be heavily biased towards the dominant class. To avoid overfitting, we can use simple techniques such as pruning the decision tree or setting a maximum depth for the tree. To avoid creating biased trees, we should always balance our data set before using decision trees on it. To deal with the high variance in the results, we can use not just one decision tree but multiple decision trees, which gives rise to our next model, the ensemble model. An ensemble method takes several models into account and makes a prediction by combining the outputs of all the models. As we can see in the image, the data is passed into several different tree estimators, each model gives a prediction, and then we take a majority vote to predict the class of a new sample. There are two approaches to ensemble methods: one is bagging, the other is boosting. In the bagging method, several estimators are trained at the same time, independently of each other. Because of this independence, these estimators can be trained in parallel, which gives a huge performance boost in terms of runtime. Examples of the bagging approach are the bagging classifier and the random forest classifier. The other ensemble approach is boosting.
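The bagging idea, many trees trained independently with a majority vote at the end, can be sketched with a random forest; the data set here is synthetic, generated only for illustration.

```python
# Sketch of bagging via a random forest: many trees trained independently
# on bootstrap samples of the data, final class chosen by majority vote.
# The dataset is synthetic, generated only for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Because the trees are independent, n_jobs=-1 trains them in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
forest.fit(X_tr, y_tr)
print(round(forest.score(X_te, y_te), 2))
```

A single deep tree on the same data would typically vary more from run to run; averaging 100 trees is precisely what reduces the variance discussed above.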
In boosting, we have a few estimators called weak learners, and these estimators are improved sequentially. Since these estimators do not work independently of each other, there can be no parallel execution. Examples of the boosting approach are the AdaBoost classifier, gradient tree boosting, and extreme gradient boosting (XGBoost). In this talk, we learned about classification in general and about the different techniques used for it, from linear to nonlinear ones. For the linear ones, we saw that the model learns a straight line with two kinds of attributes: the intercept and the weights, which are also called coefficients. The nonlinear methods include the SVM, which has linear and nonlinear variants. Then we have nearest neighbor approaches, and then decision trees, which learn simple decision rules but have some disadvantages that can be overcome using ensemble methods. Ensemble methods use different models that work together to produce the prediction for a new sample. Thank you. Hello everyone. In the presentation about classification, we learned the basics of classification and different techniques for doing it. Classification is the task of assigning a data point to one of several separate classes, for example cancer or no cancer, spam or no spam. There are different ways of doing classification through different algorithms. They can be linear or nonlinear, depending on the data. Sometimes the data can be classified using linear classifiers, but often the data cannot be separated by straight lines and needs nonlinear curves to differentiate between the classes. In this session, we will be doing a hands-on on classification. We will take a data set, classify its samples into different classes, and visualize the results. Before doing that, let's make ourselves familiar with Galaxy's training materials website.
Let's go to training.galaxyproject.org. Here we will find a variety of tutorials. To find the relevant tutorial for this session, we go to the statistics and machine learning section, and inside it we will find different training materials related to machine learning and deep learning. In this list, we will use the classification in machine learning tutorial, which we find under the hands-on category. We just click on it, and it opens the tutorial that we will be using for the hands-on session. For the hands-on session, we need the Galaxy application. Let's open usegalaxy.eu in another browser tab. It looks like this: there are different tools on the left side and the history on the right side. The history contains all the data sets that we will use and create while doing the hands-on session. On the left side, in the tool section, we have different machine learning tools, and we will be using some of them. We have already learned what classification is. In this tutorial, we will learn what kind of data set we can use for classification. We will use one specific data set from cheminformatics, and on this data set we will apply different machine learning algorithms: logistic regression, which is a linear classifier, then the K nearest neighbor classifier, then support vector machines, and then ensemble algorithms such as random forest and bagging. After each algorithm run, we will create different plots to visualize how well we are doing and how we can analyze the classification results. As we already know, classification is a supervised learning approach, and therefore the data set we will be using has at least two classes defined. In this particular tutorial, we will cover the data sets, what they mean, and why they are useful. Then we will fetch these data sets from Zenodo.
Then we will apply several classification algorithms to this data set. First of all, we will apply logistic regression: we will train our model, predict on the test data set, visualize the results using different plots, and see how well we are doing. Then we will apply the K nearest neighbor classifier, then support vector machines, then random forest. In the introduction to machine learning part, we saw that hyperparameter optimization is very important for finding optimal accuracy; therefore, we will also learn how to apply one hyperparameter search technique using the same data. With that, we will conclude the tutorial. Classification, as we have discussed several times, is about finding a curve, which can be a straight line or nonlinear, that differentiates between the different classes of data. In the image we can see a straight line dividing the data into class one and class two. In real data, the boundary may instead need to be a nonlinear curve. The classification we will be doing has two steps: first we build the classifier, and then we apply the trained model to a data set and try to predict the classes. The data set that we are using here comes from chemistry. It follows the principle of quantitative structure-activity relationship (QSAR) with biodegradation. What does that mean? A QSAR data set contains different molecules, and we try to map each molecule to its biodegradable nature. QSAR in general tries to find a mapping between the chemical structure of a molecule and the biological effect it produces. The chemical structure is described using different molecular descriptors, for example molecular weight, number of nitrogen atoms, number of carbon-carbon double bonds, number of hydrogen atoms, and the number and position of different elements.
These descriptors are the features of this data set, and using these features the structure is defined. The features are then mapped to the biodegradable nature, that is, whether a particular structure makes a chemical compound biodegradable or not. This data set has 1055 molecules. To apply classification to this data set, we will be using the scikit-learn machine learning algorithms that are available in Galaxy, as well as different plotting tools for visualizing the results. Before going further, let's open Galaxy; we will be using the usegalaxy.eu application for our hands-on tutorial. Before doing anything, let's first create a new history in Galaxy. For any analysis, it is always good to create a new history so that the whole analysis is saved in one place; if you want to look at it later, it is easier to review the whole analysis. Let's give it a meaningful name: classification. Now we see that it is an empty history and no data sets are present. We go back to the tutorial and see what kind of data sets we need to import. In the section "Get train and test data sets" there is information about the different data sets needed for this tutorial. These files are given as links, as we can see here: three links to three different files. The files are also available on Zenodo; let's open the link, and we can see all three files there. To make downloading easier, we just copy these links, either by selecting them or by clicking on copy. Then we go to Galaxy and click on upload data. Using this, we can upload all the required data sets into our history. Then we go to paste/fetch data: since we have the links, Galaxy will pull the data from the internet. We go to the text area and paste all the links there.
Then we check whether we need to set the data types of the data sets. Since we don't have any other information, we will just import them as they are. If you need to set the data type of each data set, you can do it here: the different data sets appear in this dropdown, and we can set the type either individually or for all the data sets we are uploading. Here, auto-detect is already selected. We just start the upload. We see that, based on the three links, three data sets are being created. It takes some time to upload them; meanwhile, we go back to the tutorial and see what else we need to do. The third point says that we should rename these data sets. As you can see, the data set names include the full links, which is not very meaningful, so it is good to rename them. The data sets have just turned green, which means they are uploaded and we can use them. Before turning green, the data sets were yellow, meaning the upload jobs were running; now that they are green, the jobs have finished. To rename a data set, we go to it and click on the pencil icon, which says edit attributes. This brings us to a page where we can click on data types. We could choose a different data type there, but since our data is already in tabular format, we don't need to. We go to attributes again, remove the link from the name, and give our data set a meaningful name. We do this for all three data sets. Now we have renamed all our data sets; the names are meaningful and we will remember them easily. Here we can see a helpful tip for renaming data sets. Next, we will use a linear classifier to learn a model. This linear classifier learns a straight line that differentiates between the two classes.
To apply this linear classifier, we go to Galaxy and find the corresponding tool. Looking at the hands-on section here, the tool's name is generalized linear models. We copy this name, go to Galaxy, and paste it into the tool search box. Galaxy finds the tool, and we click on it. When we click on it, we see the description of the whole tool. Among the linear models, we need the logistic regression model, so we select it here. First, we are doing a model-training task: we first need to train a model, and then we can use the trained model for prediction. There are two modes here, train a model and load a model; right now we use train a model. We go back to the tutorial and check that we have chosen the right parameters: we choose logistic regression as the linear method, our data is of tabular type, and we need the training samples data set. We look in the history to see which data set matches this option: it is train rows, so we select it here. That's why it is very important to give meaningful names to our data sets, so that it is easier to map them to the tool's options. The next option is whether the data set contains a header. We look at the data set and see whether headers, that is, column names, are defined. We can see that there are column names in this data set, so we need to tell the algorithm that header names are present; as we know, machine learning algorithms do not work on text, so the algorithm will exclude the header row. We say yes, headers are present. Now we need to divide the training data set into features and labels. This data set contains all the columns, and the last column is the label.
To select all the features, we need only the first 41 columns; the 42nd column is the class, which we don't need as a feature. Since we are selecting all the features, we choose the option all columns excluding some columns by column header names, which takes all columns except the excluded ones. The excluded column name is therefore class. We go back to the tutorial and check that we are doing the right thing: we select the train rows, the data set contains a header (yes), and we have chosen the right options for selecting all the features. Next, we need the label information from the training data set. We choose the same data set, which again contains a header. Now we need only one column, the class column, so we choose the option select columns by column header name and use the column name class, which selects just that one column, the target. We check against the tutorial again that we have chosen the right options, and then we execute the tool. It takes some time for this job to finish. As long as the job is gray, it is queued; once it becomes yellow, it starts running. We also see that, because we chose the tool generalized linear models with the algorithm logistic regression, the output already has a reasonably meaningful name. Alternatively, we can rename this model to logistic regression model once the job is finished. To refresh the history, we can just click on the refresh button. We see the job has not started running yet. We can wait a bit, and meanwhile we can try to answer this question: what is learned by the logistic regression model? We look at the solution: the logistic regression model is a linear model, and it learns the coefficients of the straight line that marks the boundary between the two classes.
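The feature and label selection we just configured in Galaxy corresponds roughly to the following scikit-learn sketch; the tiny table and its column names here are a made-up stand-in for the real QSAR training data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up stand-in for the training table: feature columns plus a "class"
# column, mirroring the 41-features-plus-label layout described above.
train = pd.DataFrame({
    "MolWt": [100.0, 120.0, 300.0, 310.0, 95.0, 305.0],
    "nN":    [0, 1, 4, 5, 0, 4],
    "class": [1, 1, 2, 2, 1, 2],
})

X_train = train.drop(columns=["class"])  # all columns except the "class" header
y_train = train["class"]                 # only the "class" column (the target)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
# model.coef_ holds the learned coefficients, model.intercept_ the bias.
print(model.predict(pd.DataFrame({"MolWt": [110.0], "nN": [0]}))[0])
```

The drop/select pair is exactly what the "excluding some columns" and "select columns by header name" options do inside the Galaxy tool.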
We go to Galaxy and see that our job has started running, as it has turned yellow. Moving on: in this section, we will use the test data set to predict classes using the model we will get very soon. To do that, we use the same tool, generalized linear models, which we used before for training. But this time we use a different option, load a model and predict, which loads the trained model, data set number four here. Then we select the test data, which has no class information, and run the job. We verify that we are doing the right things, and we see that we are: we selected the right model and the right data set. Our data set still contains a header; we can verify that here. The test rows again have header information, but now there are only 41 columns and the class column is not present. We choose the option predict class labels and execute the tool. Once this task is over and we have predicted the classes for the samples, we will visualize the performance of the logistic regression classification results. For that, we need some plotting tools. The prediction job has not started running yet. Before using the plotting tool, we will use the remove beginning tool, which removes the header from a data set; the result is still the test data set, but without headers. We see that the prediction job has started running. We should rename its output, because we will be using several algorithms and we don't want to lose track of which result is which. So we go to the pencil icon as before and rename it to logistic regression results. We see that the data set got renamed. The prediction has finished; we will come back to it in a minute. First, we find the remove beginning of a file tool and specify the test row labels data set.
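The train-a-model and load-a-model-and-predict split can be mimicked in plain scikit-learn; the four training points below are invented, and pickle stands in here for the model data set that Galaxy stores in the history.

```python
import pickle

from sklearn.linear_model import LogisticRegression

# Invented 2-feature training data: first two points class 0, last two class 1.
X_train = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]]
y_train = [0, 0, 1, 1]
model = LogisticRegression().fit(X_train, y_train)

blob = pickle.dumps(model)   # the "train a model" step produces this artifact
loaded = pickle.loads(blob)  # "load a model and predict" reuses it later

print(loaded.predict([[0.1, 1.0]])[0])  # the point sits near the class-0 cluster
```

Separating training from prediction like this is what lets us train once and then score any number of new test tables with the stored model.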
We specify removing the first row from the data set and run the tool. Meanwhile, let's look at the prediction results table. It is still a table of data, and we see that it now has a 42nd column, which is the class information, the predicted class. To measure the performance of the logistic regression classifier, we will compare the true classes with the predicted classes, count how many of the true classes have been predicted correctly, and compute the accuracy based on that. Now the remove-header job has finished, and we rename its output as suggested in the tutorial. Next we use a plotting tool, plot confusion matrix, precision, recall and ROC and AUC curves, to verify the performance of our classifier. We find this tool, and we see that it has several options. The first is the input data file, which contains the true class information. The second is the predicted data file, containing the predicted classes. The third is a trained model, which the tool uses for computing the area-under-the-curve plots. We go back to the tutorial: the input data file is test row labels, no header, the data set containing the true labels; the predicted data file is our logistic regression results; and the model is selected accordingly. Now we execute the tool. After this job, we will get three different plots. The first is the ROC and AUC curve, the second is the precision-recall curve, and the third is the confusion matrix. While this job is running, we can already look at the results in the tutorial. First is the confusion matrix: a confusion matrix is a square matrix of true class versus predicted class labels. Using it, we can see how many samples of each true class were predicted correctly or incorrectly. For example, in this particular plot we have two classes, zero and one.
The predicted class labels are on the x-axis and the true class labels on the y-axis. If we look at one particular block, which is red in color, it shows how many samples with true class zero are predicted as zero. So each block here represents a number of samples: this block holds the samples that belong to class zero and are also predicted as zero, and the block over here holds the samples that are truly class one and also predicted as one. These two diagonal blocks should contain a large number of samples compared to the two off-diagonal blocks; if that is the case, we can say that the accuracy is high. Let's go to Galaxy: all our jobs have finished, so let's open the confusion matrix plot. These plots are interactive, and we can inspect the different values. I have increased the size of the screen so that we can see the plots better. The plots are big, so they take a few seconds to open. Now it has loaded, and we see it is the same plot. If we hover over the blocks, we can see the actual numbers. Here, x and y are both zero in the black tooltip, and z is 508, which means that 508 samples with true class zero are predicted as zero. Hovering here, we see that 228 samples that actually belong to class one are predicted as one. In one off-diagonal block, the true class is zero and the predicted class is one, with 45 such samples, and in the other block, true class one predicted as zero, there are 55. Let's look at the relation between the confusion matrix and the precision-recall curves. First, what do the precision-recall curves look like? There are three lines in this plot: precision, which is blue, recall, which is red, and F-score, which is green. These scores are given per class, and the class labels, zero and one, are on the x-axis; the y-axis is the score.
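From the four counts read off the interactive plot we can already compute the overall accuracy as correct predictions over all predictions; the numbers are the ones from the tooltips described above.

```python
# Counts read off the confusion matrix tooltips described above.
correct = 508 + 228     # the two diagonal blocks: true 0->0 and true 1->1
mistakes = 55 + 45      # the two off-diagonal blocks

accuracy = correct / (correct + mistakes)
print(round(accuracy, 2))  # 0.88
```

This single number hides how the two classes differ, which is exactly why the per-class precision-recall plot discussed next is worth looking at.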
These scores lie between zero and one, where zero is the worst score and one is the best. Precision can be defined as the fraction of the results predicted for a class that are actually relevant, recall as the fraction of all the truly relevant samples that are retrieved, and F-score as the harmonic mean of precision and recall. The F-score also varies between zero and one; if it is zero, it means that either precision or recall is zero. We go back to the confusion matrix and see how we can compute precision and recall. To compute the precision for a class, we first need the true positives: the samples that are in class zero and are also predicted as class zero. Then we need the false positives: samples that are predicted as zero but whose true class is one. Using these two values, we can compute the precision as the fraction 508 divided by (508 plus 55), which gives a precision of 0.90 for class zero. Similarly, we can compute the precision for class one, which is 0.83, as we can see in the plot; we can also conclude that the precision is a little higher for class zero than for class one. Similarly, we can compute the recall. The recall formula is true positives divided by (true positives plus false negatives). We take the true positive box, 508, then the box that gives the false negatives, and compute the fraction 508 divided by (508 plus 45). This gives the recall for class zero, and in the precision-recall plot we can see that the recall is about 0.91. Similarly, we can do it for class one. For the F-score, we compute the harmonic mean of precision and recall, which is given by the green line. It is important to look at this plot to see the accuracy for each class.
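The two fractions above can be checked quickly in Python; TP, FP, and FN here are the class-zero counts used in the calculation just described.

```python
# Counts for class 0 as used in the calculation above.
tp = 508  # true class 0, predicted 0
fp = 55   # predicted 0 but actually class 1 (false positives)
fn = 45   # actually class 0 but predicted 1 (false negatives)

precision = tp / (tp + fp)                          # 508 / 563
recall = tp / (tp + fn)                             # 508 / 553
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2))  # 0.9
print(round(recall, 2))     # 0.92
print(round(f1, 2))         # 0.91
```

Swapping which off-diagonal count plays FP and which plays FN is what distinguishes precision from recall: both use the same true positives but penalize a different kind of mistake.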
This matters because in real life the data is imbalanced in many cases, and we should always strive for similar accuracy on each class. If we get good accuracy on one class but not on the other, it will be difficult to predict reliably on new samples. Then we get the third plot, the area under the curve plot, which gives another measure of accuracy. The ROC curve is shown by the blue line here; for a good classifier, this curve should be as far towards the top left as possible, that is, near a true positive rate of one. That is the case here, and we also see that the AUC, the area under this curve, is 0.94, so we can say the accuracy by this measure is 94%. When the ROC curve is close to the red diagonal line, which represents chance and gives 50% accuracy, our classification is not good or reliable. Therefore, we should always aim for an ROC curve that is as far towards the top left of the plot as possible. It is a plot of true positive rate versus false positive rate, and we want high true positive rates at low false positive rates. We have now seen all three plots for logistic regression, and we will produce the same plots for the other classifiers as well. Having discussed these plots, we can say that the classification is acceptable: we are getting around 90% accuracy. Using further algorithms may improve the classification accuracy. Maybe the data is not linearly separable, and we need more complicated, nonlinear algorithms such as SVMs or K nearest neighbor algorithms; we will look at these soon. We have discussed the K nearest neighbor algorithm briefly. The K in the name defines the number of neighbors considered for a data point. We cannot make K very large, otherwise it will slow down the algorithm; and if we make K very small, it may not give reliable results.
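As a quick sketch of how such an ROC curve and its AUC come about, scikit-learn can compute both from true labels and classifier scores; the labels and scores below are invented for illustration.

```python
# roc_curve sweeps a decision threshold over the scores;
# roc_auc_score integrates the resulting curve.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 0, 0, 1, 1, 1, 1]                    # invented true labels
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]   # invented scores for class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.9375: most positives score above most negatives
```

AUC equals the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one; 1.0 is perfect ranking and 0.5 is the chance (red) line.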
Therefore, we need to find the right K for each dataset we try. Let's do some hands-on with this algorithm. First we need to find the right tool: nearest neighbor classification. We open the tool section again, search for this tool, and find it; it has a similar definition to the tool before. We go to the parameters and select what we need: nearest neighbors as the classifier type, then the tabular training data, and then the train rows, just as before. We have the training samples dataset and choose the train rows. Our dataset contains a header, so we select yes. Our dataset contains feature columns as well as the class column. First, we select all the features, excluding the class: we pick the class header here so that it is excluded and only the features are taken. Next, we need to select only the class labels. We select the same dataset, which contains the class labels; it has a header, so we select that, and then we pick the class column using the option to select columns by header name, with the header name "class". We have done the same things as before. We want to use K nearest neighbors as the algorithm, so we choose that option, and then we execute the tool. While the job is running, we can look at the advantages and disadvantages of this algorithm. It is a simple algorithm: it just finds the neighbors of a sample and classifies it based on them. It is a nonlinear algorithm, which is useful if we have nonlinearity in the data, and it can be used for regression problems as well as classification. It works well with low-dimensional datasets. Since it keeps all the training data, if the dimensionality of the data becomes very high, this algorithm may pose problems because of the memory required.
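The Galaxy tool wraps scikit-learn, so the same step can be sketched directly in Python. This is a minimal illustration, not the tutorial's actual dataset: the toy points and the choice of three neighbors are made up.

```python
# Minimal scikit-learn sketch of the nearest-neighbor step the tool performs.
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated toy clusters, one per class (made up for illustration).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

# n_neighbors defaults to 5 in scikit-learn; we set 3 for this tiny dataset.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]
```

Note that `fit` mostly just stores the training data; the real work (finding the K closest stored points and taking a majority vote) happens at prediction time, which is exactly why memory usage and dimensionality become the algorithm's weak points.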
We can see one of its disadvantages: high memory usage. It also suffers from the curse of dimensionality: if the dimensionality is too high, it is very hard to find the right neighbors, and hence the right classes, for new samples. The curse of dimensionality arises when you have too many features compared to the number of samples. We see that our job is completed, so our model is ready, and we can use it to predict on the dataset. We use the same tool again, but in a different mode: load model and predict. We have the nearest neighbor classifier model we just obtained, and we select the test rows. We select the test rows here, the dataset has a header, and then we execute the tool. Once the job is finished, we will rename the output. A question here: what is the value of K for the model? In this run of the tool, we have not set K, the number of neighbors, at all. As discussed before, we are using scikit-learn algorithms, and scikit-learn uses five as the default number of neighbors, so our model uses five as well. Our job is running, so meanwhile we can rename this dataset as nearest neighbor results. The default value may not be ideal for a given problem; therefore, it is important to find the right number of nearest neighbors for the problem and the data. For simplicity, we have not set this parameter, but it is advisable to try different values, depending on the data, to find the optimal accuracy. Our prediction is finished, so we use the same plotting tool: we provide the test row labels, the predicted data file, and the trained model, and we just execute it. We will get the same three plots we got for the logistic regression approach, but with different numbers in them.
While the job is not yet finished, we can look at the expected results in the tutorial itself: the confusion matrix, the precision-recall curve and the ROC-AUC curve. Our jobs are running now; let's hope they finish soon. Now they are finished, and the confusion matrix, precision-recall and ROC curves are all present. Let's hide the tool section so that we have a bigger viewing area. We see slightly different results with K nearest neighbors. As we remember, for class zero we had 508 correctly predicted samples, but now we are getting 494. Let's look at the precision-recall curve: in this case as well, the precision is a bit higher for class zero and a bit lower for class one, which is why the overall precision looks a bit smaller. Now let's look at the ROC and AUC curves. The pattern of the curve remains similar, and the AUC is 0.95; in the last plot we got an AUC of 0.94, so I'd say the performance is quite similar. Now let's move to a different algorithm: SVM. We have discussed before that SVM is a maximum margin classifier: it learns a boundary which is equidistant from the nearest samples of the different classes. SVM has both variants: it can be linear, and it can be nonlinear as well. SVM is actually a very good classifier for binary classification problems, and also when the data is nonlinear in nature. Here we will use a linear version of support vector machines. First we need to find SVM in our Galaxy tool suite. Let's find SVM here, go to the tool definition, and again choose to train a model. Now we use the linear support vector classification algorithm. There are also two other SVM classifiers here which are nonlinear in nature.
These can be used as well, but for simplicity we use linear support vector classification. We have the same options: tabular data and the training samples dataset, for which we again use the train rows. Our training data has a header, and it contains feature columns as well as a target column. We exclude the target column and use only the feature columns; therefore, we need to pick the right header for the target so that it is excluded from the list of features. We use the same dataset for the class labels as well: the dataset contains a header, and we select the class column by header name. Let's check that we have chosen the right options. Yes, now we run this. As we discussed, SVM learns the maximum margin boundary: in the training phase we learn the coefficients of the line with the maximum margin, and using these we classify the new samples. Our job is running now, so while it runs we can set up the classifier for prediction. We use the same tool again in predictor mode: we load the trained model we already have, and the data is the test rows. We select the test rows here, the dataset contains a header, we want to predict class labels, and then we run this. Our job is running, and we can already rename the results coming from linear SVM. To see the performance of SVM, we use the same plotting tool. Our results are ready: the input data file contains the true class labels, the predicted data file is the results from SVM, and the trained model is the linear support vector classifier; then we execute it. It will again generate the three plots. While they are not ready, we can look at the plots in the tutorial; we have one plot here. We see that the accuracy is 0.93, which I would say is still similar to what we got before from logistic regression and K nearest neighbors. So we just verify the results. Okay.
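Since the Galaxy tool wraps scikit-learn, the same training step can be sketched in plain Python. This is a rough illustration, not the tutorial's actual data: the toy points below are made up, and the model corresponds to scikit-learn's `LinearSVC`.

```python
# Sketch of the linear support vector classification step (scikit-learn's
# LinearSVC, which is what the Galaxy tool uses under the hood).
from sklearn.svm import LinearSVC

# Two linearly separable toy clusters (made up for illustration).
X_train = [[-2, -1], [-1, -2], [-1, -1], [1, 1], [1, 2], [2, 1]]
y_train = [0, 0, 0, 1, 1, 1]

svm = LinearSVC()          # learns a maximum-margin linear boundary
svm.fit(X_train, y_train)  # training finds the line's coefficients
print(svm.predict([[-1.5, -1.5], [1.5, 1.5]]))  # [0 1]
```

After training, `svm.coef_` and `svm.intercept_` hold exactly the weights w1, w2 and the intercept w0 of the decision boundary discussed earlier for linear models.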
So we are getting a result of 0.94, whereas in the tutorial it is 0.93. I guess we are not setting the random seeds; if we set the seeds in the algorithms, we would get the same results. But the results we are getting for SVM, KNN and logistic regression are all roughly the same. Why is that? In the introductory ML course, we discussed hyperparameter optimization. Each algorithm has several parameters, and we need to tune them to get optimal results. For example, if we tune the SVM hyperparameters, maybe we get better results. We have a task for the participants: tune the parameters of the support vector machine and see if you get better results than 0.93. We will be using hyperparameter search techniques shortly, so that you become familiar with those tools, and then you can do this task on your own to verify and learn. We also see the precision-recall curve: the precision for class zero has decreased and the precision for class one has increased. It used to be 0.83 and is now 0.88, while the other used to be 0.90 and has decreased to 0.87. We can see how different classifiers perform differently on the same data. Next we will use the random forest algorithm to classify the same dataset. Until now we have used linear and nonlinear models for classifying the data; now we will use an ensemble method for classification: random forest. We discussed briefly that random forest creates different decision trees in parallel, and the prediction is the combined prediction of all these trees. That's why it is a forest and not a tree: it includes many decision trees, created independently, and takes the average or majority vote of their predictions.
By doing this ensemble prediction, it reduces the variance in the predictions. To apply the random forest algorithm to our data, we find the ensemble methods tool suite in Galaxy. We have this tool here, ensemble methods, and again we need to train on the dataset. We use the random forest classifier; in this list there are many different classifiers. On ensemble methods, we discussed that there are two kinds: one based on boosting and another based on bagging. Random forest is based on bagging, while AdaBoost and gradient boosting are based on boosting. We again use the tabular data options, and we have the training samples dataset. We use the same parameters: we exclude the class column from the feature set, and for the labels we again use the same dataset, which has a header, and select the class by column name. There are also advanced options here: how many trees to have in the forest, for example, where the default value is 100, and other parameters such as the depth of the trees and how many samples are needed at each node. But we are not changing these options now. Let's run this tool and get the prediction model from the ensemble methods tool. While it is running, we can look at this question: what are the advantages of the random forest classifier compared to KNN and SVM? Random forest is an ensemble method: it takes predictions from many different classifiers, which are decision trees, and therefore reduces the high variance of each individual decision tree. In other words, we are taking the opinion of a committee rather than just one person, so it tends to do better than using just one estimator or classifier. The random forest algorithm also has a feature importance attribute, which assigns an importance value to each feature we are using.
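The training step and the feature importance attribute just mentioned can be sketched with scikit-learn directly, which is what the Galaxy ensemble tool uses under the hood. The data here is scikit-learn's built-in iris set, standing in for the tutorial's dataset purely for illustration.

```python
# Sketch of random forest training plus the feature_importances_ attribute.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators=100 trees is also scikit-learn's default, as in the tool.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Each tree votes; predict() returns the majority class.
print(rf.predict(X[:2]))              # predictions for the first two samples
print(rf.feature_importances_)        # one importance value per feature
print(rf.feature_importances_.sum())  # importances are normalized to sum to 1
```

Features with importance values close to zero are candidates for removal, which is exactly the pruning idea described next.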
Therefore, using that knowledge, we can exclude features that are not relevant for our learning task. If some features have very low importance values, for example close to zero, we can remove them and use only the features with high importance values. Our dataset then becomes smaller, with fewer but more important features, so the running time improves and the accuracy may improve as well. It is worth trying the feature importance values given by the random forest algorithm. Our job has finished, and we have a trained random forest model. Now we can do prediction using the same tool: we use load model and predict, choose the random forest classifier model, and use the test rows as the test data. The dataset contains a header; I copied the actual name of the output file, and we just run this tool. This gives a predicted file with the predicted classes, and we can again do the prediction analysis using the plots. For that we use the same plotting tool in Galaxy. The tool is open now; sorry, I forgot to rename this dataset before, which I will do now. I have renamed the dataset, so I open the plotting tool again: the input data file is the true labels, the predicted data file is the random forest result, which just finished, and the trained model is the random forest classifier. Let's run this tool. We will again get the three plots we have seen with the other classifiers. We can already see the ROC plot in the tutorial itself, and the accuracy is 1.0, which is 100% accuracy. So we have improved the accuracy with the ensemble method compared to the linear, KNN and SVM models. Our jobs are running, and we see that we have improved the performance using the random forest algorithm. Let's look at the confusion matrix.
For that we collapse the left section. In the confusion matrix we see, as expected, high numbers of true positives on the diagonal, and zeros in the off-diagonal boxes: the box where the predicted label is one and the true label is zero, and the opposite box, both contain zero, whereas previously we had some positive number, around 40 or 50, of wrongly classified samples. We can also look at the precision-recall curve: since we are getting the best possible accuracy, the three lines are all merged here. And again we can see the ROC-AUC curve; it sits right in the top-left corner, which gives the best possible accuracy of one. This is an ideal prediction; you will see this kind of prediction rarely, but it can happen. Until now, we have seen three or four different kinds of algorithms, all predicting on the same data, and we are getting different accuracies with each one. We have used all these algorithms with the default values of all the hyperparameters; we have not optimized any hyperparameter. To see whether we can improve the performance of an algorithm by tuning its hyperparameters, we need to use search techniques such as grid search or random search. For that, we first need to create a pipeline with the pipeline builder. A pipeline is a package which contains some preprocessing steps together with the classifier. In this tutorial, we are not discussing any preprocessing techniques; we use the data as it is, so our pipeline contains only the classifier. We are using a slightly different classifier here: a bagging classifier.
It is also based on the bagging approach: it creates random decision trees, if not specified otherwise, and takes a majority vote over the predictions of the individual trees. To use the hyperparameter search technique, let's first build this pipeline. As I said, the pipeline builder can also package preprocessing steps sequentially, but since we are not covering that in this tutorial, we set all the preprocessing steps to none. We use the data as it is and just select the final estimator. The final estimator comes from the scikit-learn ensemble suite, because the bagging classifier is one of the ensemble algorithms, and then we choose the bagging classifier. It is very important to choose the bagging classifier and not the bagging regressor: the bagging regressor is used for regression tasks, and since we are doing classification, we must use the bagging classifier. We are not setting any other parameter except "output parameters for search CV". Setting this option to yes makes the tool extract all the hyperparameters of the bagging classifier into a tabular format, which we can use during hyperparameter optimization. We have all the options set and now we run this tool. We will get two datasets: first the pipeline, and second the tabular dataset containing all the hyperparameters of the bagging classifier. Our job is running now. To do the hyperparameter search, we need the hyperparameter search tool. Before going into the tool definition, let's see what the different hyperparameters of the bagging classifier are. These are the hyperparameters we can see here: the number of estimators, i.e. how many trees to use, and n_jobs, how many cores to use for parallel processing.
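In plain scikit-learn, the pipeline we just built corresponds roughly to the sketch below: no preprocessing steps, just a `BaggingClassifier` as the final estimator. The toy data and the step name "classifier" are made up for illustration.

```python
# Rough scikit-learn analogue of the Galaxy pipeline builder output:
# a Pipeline whose only step is the final estimator, a BaggingClassifier.
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([("classifier", BaggingClassifier(random_state=0))])

# Two separable toy clusters (made up for illustration).
X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = [0, 0, 0, 1, 1, 1]

pipe.fit(X_train, y_train)
print(pipe.predict([[0.2, 0.3], [5.8, 5.2]]))  # [0 1]
```

If the pipeline did include preprocessing steps, they would simply appear as additional named entries before the final estimator, and the search tool would be able to tune their parameters too.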
There is also the base estimator, whether decision trees or something else, and several others. These are the hyperparameters we can optimize, and their default values are written here; the default number of estimators is 10. Now let's open the hyperparameter search tool. There are two search techniques: random search and grid search. Briefly, grid search uses discrete values of each hyperparameter, while random search takes a range and samples values from that range for each hyperparameter. For this tutorial, we will use grid search. We go to the tool definition, choose grid search, and then we need the dataset containing the pipeline estimator object, which we have. Our model is not a deep learning model, so that option should be set to no. Now, for the search parameters builder, we need the dataset containing all the hyperparameter names, so we use that dataset here as is. Once we have chosen it, we get a list of all the hyperparameter names in this drop-down, and whichever hyperparameter we want to optimize, we can select from here. We want to optimize the number of estimators, so we select n_estimators. The default value of this hyperparameter is 10. Now we specify a search list: n_estimators will take these values one at a time. If we used the default, it would use only 10, but since we are optimizing this hyperparameter, we specify a list, and in each iteration the search uses one value as n_estimators and finds the accuracy based on it, reporting whichever is best. The input type is tabular data, and these options are the same as before: we have the training samples dataset with the train rows, and our dataset contains a header.
We exclude the target column to take out only the features. Then we use the train rows here again; the dataset contains a header, we select columns by header name, and we copy the same name, class, here. Then we look over the tool settings to check whether we have set all the parameters the right way; it looks right. We also want to save the best estimator: this tool finds the best value of each hyperparameter we are testing, and corresponding to that best hyperparameter setting, it also saves the best estimator, which we can use later for prediction or other tasks. Since this tool has many different values to set, we should verify it once more: the pipeline estimator here is right. If we wanted to optimize some other hyperparameter as well, we would select it here as another entry and set different values for it, for example for max_features. Since we are optimizing only the n_estimators attribute here, we just delete the extra entry. The data here is the train rows; yes, it contains a header; all columns, excluding the column with header name class; and the train rows again, with a header, selecting the class column by header name to include. This option stays at none, fine; save best estimator, yes. Now we execute this tool. Our jobs have started running. The tool will give two datasets: one is the fitted estimator, the best model corresponding to the best hyperparameter, and the other is a table of data giving the performance for each combination of hyperparameter values that we set. It has a performance accuracy for each of the values of n_estimators, and from that table we can see which performs best. To do prediction again, we need to use the ensemble methods tool for classification and regression.
In the pipeline we used a bagging classifier, which is an ensemble method; therefore, we need the ensemble methods tool for prediction with this model. Since this job is still queued, shown in yellow, we can queue up one more job using the still unfinished datasets. The job has now finished, so let's look at the table of data returned by the hyperparameter optimization tool. There are many different columns in this dataset: for example, the mean fit time, the scoring time (the prediction time), and the mean test score, which is the accuracy for each hyperparameter value. We used n_estimators as the hyperparameter to be optimized, and for each value we get one test score, or accuracy score. The values are also ranked by accuracy: n_estimators equal to five gets the best accuracy, followed by the default value of ten, and then 20 and 50, which get the same accuracy, so increasing the number of trees makes no difference here. There are further columns, for example those referring to the accuracy on the individual cross-validation splits. We can see that we get better accuracy when we change the hyperparameter value rather than using the default, so it is always worth checking whether we can get better accuracy by optimizing the hyperparameters. We can use different combinations of hyperparameters; for simplicity we used just one, but with several hyperparameters you get accuracy scores for all-versus-all combinations. Since we used only four values, we have only four accuracy scores. We also got the best model, corresponding to n_estimators equal to five, which we can use for prediction. For that, we will use the same tool.
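The grid search the tool just performed can be sketched in plain scikit-learn with `GridSearchCV`, which is what the Galaxy tool wraps. Iris stands in for the tutorial's dataset, and the list of n_estimators values mirrors the ones tried above; the actual best value will depend on the data.

```python
# Sketch of the grid search step: try several n_estimators values for a
# BaggingClassifier with cross-validation and keep the best fitted model.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {"n_estimators": [5, 10, 20, 50]}  # values tried one at a time
search = GridSearchCV(BaggingClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)                      # the winning hyperparameter value
print(round(search.best_score_, 2))             # its mean cross-validated accuracy
print(sorted(search.cv_results_)[:3])           # the results table's column names
```

`search.cv_results_` is the same kind of table the Galaxy tool returns (fit times, per-split scores, mean test score, rank), and `search.best_estimator_` is the saved best model that is used for prediction next.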
I think we don't have the tool open right now, so we find the ensemble methods tool again. Now we want to predict with a model, so we use the best estimator that we received from the hyperparameter optimization tool, and then the test data. Our test data contains a header; we just check that we have applied the same values. Our model is actually a zipped file, so we put it here, we have the test rows, and our dataset contains a header. Now we predict using the new model. While it is not yet finished, we can look at the results in the tutorial, in particular the confusion matrix. If you remember, we previously saw a bluish value in this particular box, but now we see a yellowish color here; based on the color bar, this means the number has increased, which is a good sign, and we are getting better accuracy compared to the previous algorithms. Our job has just started. Let's look at the precision-recall curve: precision has improved for both classes, class one and class zero, and the recall is also higher, which gives a much better overall score. Now we have the predicted dataset; to plot it, we use the same tool. Let's find the name of the tool here; we find it, our input data is the test row labels, this is our predicted dataset, and we have the best model here. We just run this tool. Again, we get the same three plots we saw here. We can see from the ROC curve that we are getting the best accuracy: this is the ideal ROC curve, and the accuracy is 100%. Our jobs are running now. In the previous steps we did not rename the datasets, but when you do an analysis, it is always good to rename your datasets rather than keep the default names, to avoid confusion.
We have seen these plots, so once the jobs finish, we will see similar plots in Galaxy. With this, we come to the conclusion of the tutorial. We have seen many different classifiers working on the same data but producing different results, which means these algorithms treat the data in different ways. Therefore, it is always important to try out different algorithms on your dataset to see which works best: maybe you need only a linear classifier, or maybe you need a more complicated one, like SVM, random forest or some other algorithm. It is always good to start with a simple algorithm, see how well it performs, then move on to more complex ones and see if the results improve, and then do hyperparameter optimization of the algorithm and see if they improve further. This is a good general suggestion: use simpler algorithms first and then move on to more complicated ones. We have used a dataset with only two classes, but you can use a dataset with multiple classes as well, which becomes a multi-class classification problem, and see how well you do for each class. Using the visualization tools, you can see how well your algorithm performs overall and per class, which is also very important: if you do very well for one class and not for the other, then the data probably needs to be balanced in some way. We have shown only three or four algorithms, but the machine learning suite in Galaxy contains many other algorithms, which can also be tried on the same data to see if we can do better. Meanwhile, the job has finished, and we can see the plots. As I said, the plots are bigger in size, around 3 MB, so they take a few seconds to load. We see the same plot here: this value, which I remember used to be 508 for logistic regression, is now 549.
And this one, which I guess used to be around 240, has also increased. We can see there are a few misclassified samples: six samples with true label one are predicted as zero, and in this box, the true label is zero but the prediction is one. But they are very few; the prediction rate is really high. The minimum recall per class is around 0.98, or 98%. This is good accuracy, and the ROC curve is very good, at 100%. These are tools you can use on many datasets: try out different algorithms, and also different preprocessing techniques, which we have not used here, in the pipeline tool, mix them with different estimators and see if you do better. With this, I will conclude the classification tutorial; I hope you have learned something. At the end of the tutorial, you will find a feedback section; your invaluable feedback will help improve the content of the tutorial. Thanks a lot.