Hi everybody, my name is Fosjfa Blus, and this is going to be a tutorial introducing machine learning using R. In this instance we will be using the Galaxy interface to do the analysis, but specifically we will be using the RStudio interactive environment that is available through Galaxy. Just before we start, there are a few things that you need to be aware of. The first is that, for those who want to see how to start RStudio, there is a dedicated tutorial on the Galaxy Training Network that explicitly describes this, and you can find it under the "Using Galaxy and Managing your Data" section. Additionally, as we will be using a lot of R in this tutorial, you can always find more information on how to write code in R; the Galaxy Training Network also has some lessons around R, specifically under "Introduction to Galaxy Analyses", where you can find both basic and advanced R in Galaxy. I also need to highlight that the tutorial you're going to be going through now is available on the Galaxy Training Network as well, and it's largely based on material that has been produced by ELIXIR in order to address machine learning needs for the life sciences. So, without any further ado, let's get started. As you can see, I have created a new history in Galaxy called "Introduction to machine learning in R", and it has only one entity, which is blinking because it's running, called RStudio. To provide some quick context: if I search for RStudio here, you can see there is this tool that you can click Execute on, and this means that you're going to find this particular entry here. But if you want to access it, you need to go to User and then Active Interactive Environments, and you'll see that RStudio is actually running here; by clicking on the link, a new tab opens, which is the actual RStudio server. This is where we're going to be doing our analysis today. A very quick overview of the RStudio environment: it comprises three panels, and I'll open yet another one by going File, New File, R Script. Now this is the more traditional perspective of RStudio. Here is where our script, our code, will live; this is the console, where the output of R is going to be printed at all times; this is our environment, where all our variables and objects are going to be listed; and here is a listing of the files. In the same panel as the files, any additional plots will be viewable as well. The first thing to be discussed is that we will be using a few libraries that are necessary for us to install, and all the commands are available on the material itself. It might take some time to actually run those commands in your own interface, so I'm going to highlight all those lines and click Run, and you'll see some red text being printed here. At this point it might be nice to pause the tutorial until all the libraries have been successfully installed. You can verify that the libraries are successfully installed by actually loading them. So I'm going to be loading all those packages here; again, I'm going to click Run, and you'll see some output printed. Let me zoom this a bit more. The top one is ggbiplot.
Then we have tidyverse, GGally, caret, gmodels, rpart, randomForest, dendextend, and then mlr3, edgeR, and limma. As you can see, all the red messages are mostly informative; there are no errors. So if you can reach this point, it means that you're good to go, and we can continue from here onwards. Now that you have successfully loaded your libraries, let me press a few Enters here. Actually, before we start: as you can see here, this is still untitled, so I'm going to save this as a script. I'm going to name it intro to ml R and save it, and you can see here that the actual script is now available. We don't need to worry about losing this; you can always save it locally by exporting it to your local folder, and then you can always run it in your own RStudio instance. The first thing we're going to do is actually load some data. For this purpose we are going to be using the breast cancer dataset, which is a diagnostic dataset, and I'm going to retrieve it from the UCI Machine Learning Repository. In order to do that, I'm going to be absolutely explicit and, first of all, load the library again, the tidyverse, which we're going to be requiring. Then, let me resize this a bit more, I'm going to use the read_csv function, in which I will provide the complete URL of the UCI dataset. As you can see, it's the Breast Cancer Wisconsin (Diagnostic) dataset, and so I can load it up. I'm also going to use a secondary dataset that has the actual column names, so that we have a more efficient way of dealing with them. Right now it's running in the background, retrieving the dataset, and now you can see that we have these two new objects here. You should have 569 observations across 32 variables, and we're going to be talking a bit more about this; the other object is a single-column data frame with 32 observations. So basically what we have is, on the one hand, the data itself, this is the breast cancer data; the other is called the breast cancer data column names. The first thing I need to do is essentially copy the column names into our breast cancer data so that we can actually use those names. If I look at the names of the columns, the variables, of the original breast cancer dataset, they are X1, X2, X3, X4, which is not really informative, as we have no idea what any of this is about; the column names object actually has more relevant names. In this case, as you can see, I've just executed this by pressing Ctrl-Enter, or Cmd-Enter if you're on a Mac. Now we can have a look at what our data actually looks like. I'm going to run this, and let me resize this a bit more. So now we have a table, basically a very short snippet of our dataset, and we see that we now have more meaningful column names: the first, the id, is our first column, and then we have diagnosis, radius mean, texture mean, perimeter mean, and so forth. All of our variables now carry information that we can utilize. So what we have done so far is essentially load the data.
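As a quick recap, here is a minimal sketch of the loading step. The UCI URL below is the standard location of the wdbc.data file; the column-names file path is a hypothetical stand-in for the secondary file mentioned above, so check the tutorial material for the exact commands.

```r
library(tidyverse)

# Breast Cancer Wisconsin (Diagnostic) dataset; the raw file has no header row.
breast_cancer_data <- read_csv(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
  col_names = FALSE
)

# Secondary single-column file with the 32 variable names (hypothetical path).
breast_cancer_data_col_names <- read_csv("wdbc.colnames.csv", col_names = FALSE)

# Copy the column names onto the dataset and take a quick look.
colnames(breast_cancer_data) <- breast_cancer_data_col_names$X1
head(breast_cancer_data)
```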
An interesting point to highlight: as you can see, the columns that contain numbers are identified as double, so these are the numerical ones, while the diagnosis column is a character one. A good practice, usually, is to convert our target column into a factor. In our case, we want to eventually create a model that will rely on the diagnosis as our target; basically, we want to figure out a diagnosis based on this information. For this reason it is useful to change our diagnosis column from a character to a factor. In order to do that, I'm going to take the diagnosis column of the breast cancer data and transform it with as.factor. Nothing visible has happened, there's no error, but if I rerun the previous command, showing the table with the top part of the data frame, we will be able to see that the diagnosis column has indeed been changed; it is now a factor. Right, so we have successfully loaded our data, and now we are ready to go to the next step. The first step in a machine learning process is usually to do a quick exploratory analysis of your data. Before thinking about modeling itself, it is usually quite useful to better understand what your data is all about; there is no point in throwing a thousand-layer convolutional neural network, whatever that means, at your data before you even know what you're dealing with. The first thing we're going to do is remove the first column of the data, which is the unique identifier of each row. So I'm going to subset our original dataset by selecting all the columns except the first one, from the second up until the end, and I'm going to save this into a new data frame, and we're going to have a quick look to see what it looks like; you see that it now starts with diagnosis. The question is, why do we need to do that? The short answer is that the unique identifier is, by definition, unique to its particular entry. If I try to create a model based on that, while I only want to target the diagnosis, it's quite possible that the model will try to figure out a motif, if such exists, between the identifiers and the diagnosis; but the identifier is nothing more than a unique entry for every line, and it should technically have no effect whatsoever on the diagnosis. So, if I want to create a model, I need to ensure that the columns I know a priori to be completely irrelevant are removed. This is a fresh approach. The second point is that we have a lot of variables in this dataset, and for argument's sake I'm going to be working on the first five. So I'm going to load the GGally library, and I'm going to do a quick plot of all the different pairs of these five columns, using the diagnosis as the color and printing some correlation values as well. This is what is actually produced. This pairs plot is a really neat function, because it provides a very quick overview of all the different aspects of our data. Note also that the features have widely varying centers and scales, meaning their averages and standard deviations differ a lot. So, in order to make them more comparable, what we might need to do is figure out how to center and scale all those variables, so that it is easier and more meaningful to compare them.
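A hedged sketch of these two steps, the factor conversion and the pairs plot, under the object names assumed so far:

```r
library(GGally)

# Convert the target column from character to factor.
breast_cancer_data$diagnosis <- as.factor(breast_cancer_data$diagnosis)

# Drop the unique identifier (column 1); keep everything else.
breast_cancer_data_no_id <- breast_cancer_data[, 2:ncol(breast_cancer_data)]

# Pairwise overview of the first five columns, colored by diagnosis.
ggpairs(breast_cancer_data_no_id, aes(color = diagnosis), columns = 1:5)
```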
Have a look here: in radius mean, for example, you have a scale between zero and 30, but if you go to area mean, you go from zero to about 2500. So even if the curves look a bit similar, the actual underlying numbers, the standard deviation and the mean, are quite different. To address this, we're going to use the caret library, specifically the preProcess function, so that we can scale and center our data. So I'm going to load the caret library and use, first of all, the preProcess function. I'm going to provide as input the no-ID dataset, the one from which I've already removed the identifier, and I'm going to apply both centering and scaling to it. This runs fast enough. Then, based on that, I'm going to transform my original dataset into a centered and scaled version of it. This will take a second, and as soon as it's done, it creates a new dataset. The next part is to check whether there has been any impact at all on our underlying data. In order to do that, we will again use a summary of the first columns and see what the impact is. This is the original data: we see that in the diagnosis we have 357 benign and 212 malignant cases, and then we have the basic statistics of the first four columns. If I do the exact same thing on the transformed data (as you can see, I'm now using the transformed table), we see that, although the diagnosis stays the same, everything else has now been adjusted. All of the variables have a mean of zero, they have been centered to zero; the min and max are roughly around the same numbers, at least scale-wise, and the standard deviation has been adjusted to accommodate that. Now, how about we actually visualize this and see whether there is any visible impact? Again I'm going to use the pairs plot, but instead of the no-ID dataset, I'm going to use the transformed one. So I'm going to run the exact same thing, and we're going to see a new plot here as soon as this is done; it's taking a few seconds to process. We can visually inspect what the difference is, if there is one, in the plotted information. As you can see, switching back and forth between the figures, visually at least all the information appears to be the same. Even the correlations and the numbers have not changed, because the point is not to change the actual distribution of the values, but rather to ensure that any comparison performed between them is a bit more reasonable; and as you can see now, the scales are much closer in terms of numbers. So this is essentially a preprocessing step, ensuring that the machine learning process we will apply is applied to a correct dataset. All right, so we have done a quick exploratory data analysis, and we figured out what we should apply in order to make the variables more comparable.
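Here is a minimal sketch of the centering and scaling step with caret, again under the same object names:

```r
library(caret)

# Learn the centering/scaling parameters from the data...
preprocess_params <- preProcess(breast_cancer_data_no_id,
                                method = c("center", "scale"))

# ...and apply them to produce the transformed dataset.
breast_cancer_data_transformed <- predict(preprocess_params,
                                          breast_cancer_data_no_id)

# Compare the first columns before and after.
summary(breast_cancer_data_no_id[1:5])
summary(breast_cancer_data_transformed[1:5])
```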
Let's start discussing the next step. The first thing that I'd like to try out is unsupervised learning, and before we get to that, let me give a very brief overview of what machine learning actually is. Essentially, machine learning is the science, and kind of the art, of giving computers the ability to learn and make decisions from data without being explicitly programmed. That's the basic concept of machine learning. Unsupervised learning, in essence, is the machine learning task of uncovering hidden patterns and structures from unlabeled data. The keyword here is unlabeled data: we don't care whether a particular entry has an annotation, if you like, or not. For example, a researcher might want to group their samples into distinct groups based, let's say, on their gene expression data, without knowing in advance what these categories are. Clustering, for example, is one branch of unsupervised learning. We'll talk about supervised learning in a second, but I will continue with unsupervised learning and talk about another aspect, which is our next step: dimensionality reduction. In the UCI dataset that we are using right now, there are many features to keep track of. Granted, in the life sciences, and if we go a bit more into omics, 32 or 31 columns of variables is not really that much of a deal; if we're talking about genes, for instance, we might get from a few hundred to a few tens of thousands of features. So, as we have a lot of features to keep track of, what if we could reduce the number of features while at the same time maintaining much, if not all, of the information that is captured in them? This is what dimensionality reduction is all about. Principal component analysis, or PCA, is one of the most commonly used methods of dimensionality reduction, and what it does is extract the features with the largest variance. What PCA does, and we are going to apply this directly in a second, is essentially a two-step process. The first step of PCA is to decorrelate the data; in other words, it applies a linear transformation of your data within the space that your data lives in. What is really happening is that we transform the features we already have, our 31 columns in this instance, into new and hopefully uncorrelated features. In the second step, we rank these uncorrelated features and choose the ones that carry the most information about the data. So let's actually apply this directly. We're going to use a function, let me put some space here, called prcomp, for principal components. In this instance, I'm going to use only the numerical variables, so essentially I'm going to exclude the first two columns, the first one being the unique identifier and the second the diagnosis; so, again, I start from column three up until the end. And please do note that I'm using the original dataset. I want to highlight that prcomp also has inherent options for centering and scaling the data, so I could either remove those options, set them to FALSE, and use our transformed data from before, or I can set them directly here. Just to make sure that you are aware of all the different options, I explicitly set them here so that you know this can work as well. As soon as this is done, the next step is to try to get some information about the principal components themselves, and for that I'm going to use summary.
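A minimal sketch of the PCA step, assuming the same breast_cancer_data object; note the explicit center and scale options:

```r
# PCA on the numerical columns only (drop id and diagnosis).
breast_cancer_pca <- prcomp(
  breast_cancer_data[3:ncol(breast_cancer_data)],
  center = TRUE,
  scale. = TRUE
)

# Importance of each principal component.
summary(breast_cancer_pca)
```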
This is the information we actually get, so a few words about it. The resulting table shows the importance of each principal component: the standard deviation, the proportion of the variance captured, as well as the cumulative proportion of the variance captured by the principal components. The principal components themselves, PC1 to PC30 down here, are essentially the underlying structure in the data; they are the directions where there is the most variance, the directions along which the data is most spread out. This means that we try to find the straight line that best spreads the data out when the data is projected along it, and this, for example, is the first principal component: the straight line that captures the most substantial variance in the data. As you can see, in terms of variance, the first principal component captures about 44% of the variance itself, and it all goes down from there. PCA, as we said, is a type of linear transformation on a given dataset that has values for a certain number of variables. Without going into too much detail about how the method works internally, let me provide some more information about the different pieces we can extract from it. So let's have a deeper look into the actual PCA object. To do that, I'm going to ask for the structure of the principal component object that we've just created. As you can see, it comprises five different elements. Some of them are the center point, the scaling, and the standard deviation of each of the original variables; this is essentially information describing what our original variables look like. Another is the relationship, correlation or anti-correlation, between the initial variables and the principal components, and this is the rotation table. As you can see, it is 30 by 30, because 30 are our original columns and 30 are also the principal components that have been produced. The final one is the values of each sample in terms of the principal components, so x is essentially our transformed table; each column of the transformed table is now a principal component, the first column being principal component one, then principal component two, and so forth. So let's try to visualize the results. In order to do that, I'm going to use the ggbiplot library. I'm going to put some space here, and what I'm going to be using is our principal components analysis object. I'm going to leave this without explaining too much for the time being: I'm going to use as labels the row names of the breast cancer data, so I'm going to be using these as identifiers, and I'm going to highlight the different groups based on the diagnosis. Let's see what this actually looks like. This seems like a very dense plot, and it actually is, because the label of each point is the entry, the unique identifier, of each row of our data. If you have a quick look here, and I'll see if I can zoom this a bit more so it's clearer, there are also additional arrows that show how the different original features are aligned with respect to the principal components: principal component one is this axis, and principal component two is this axis. As you can see, there are several features that are almost fully aligned with principal component one, and some that are almost fully aligned with principal component two.
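The plot just described comes from a call along these lines; a hedged sketch, since ggbiplot's exact arguments can vary between versions:

```r
library(ggbiplot)

# Biplot of PC1 vs PC2, labeling points by row name and
# coloring by diagnosis; ellipse = TRUE outlines each group.
ggbiplot(breast_cancer_pca,
         choices = c(1, 2),
         labels  = rownames(breast_cancer_data),
         groups  = breast_cancer_data$diagnosis,
         ellipse = TRUE)
```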
Additionally, you might be able to see that a lot of the original columns, the features, are very aligned with each other, at least looking at this particular projection across principal components one and two. As an exercise, what I would recommend is changing the choices parameter of this particular plot. In this instance, I've set choices to one and two, and some of you may have already realized that one and two correspond to principal component one and principal component two. So it may be worth checking, if you change those to different values, how the overall structure of the features looks. It might also be worth figuring out what happens if you remove the labels or the ellipses, and how this all works. The second exercise that I would recommend trying out, and you can find these exercises in the Galaxy Training Network version of this tutorial, is the following: so far we've been using the entire table of the data, right? How about we constrain ourselves, let me resize this once more, there we go, how about we constrain ourselves to only the columns that have "mean" in their name? It might be useful to highlight at this point that we have radius mean, but if I scroll a bit down, we also have radius standard error, and we also have radius worst, and I think that's it. So for each of the numerical features of our dataset, you have the mean, the standard error, and the worst value. It may well be, we don't know, that those are correlated with each other one way or another. In that sense, it may be worth trying out an exercise where we select only the mean values, and these are essentially columns three to 12, in case that is helpful. I'm not going to do that here; I leave it as an exercise, sketched just below.
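For these exercises, a hedged sketch of the kind of variations worth trying; the column range 3:12 for the mean-value features is taken from the text above:

```r
# Different pairs of components on the same biplot.
ggbiplot(breast_cancer_pca, choices = c(3, 4),
         groups = breast_cancer_data$diagnosis, ellipse = TRUE)

# PCA restricted to the "mean" columns only (columns 3 to 12).
breast_cancer_pca_means <- prcomp(breast_cancer_data[3:12],
                                  center = TRUE, scale. = TRUE)
summary(breast_cancer_pca_means)
```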
Now let me move on to the next part, which is clustering. Clustering is one of the most popular techniques in unsupervised learning. As the name itself suggests, clustering algorithms group a set of data points into subsets, or clusters, and the goal is primarily to create classes that are coherent internally but clearly different from each other externally. In other words, entities within the same cluster should be as similar as possible, and entities in one cluster should be as dissimilar as possible from entities in another. Broadly speaking, there are two ways of clustering data points, based on the structure and operation of the algorithm itself: the agglomerative and the divisive ones. The agglomerative approach begins with each observation, each record, each data point, in a distinct cluster, called a singleton cluster because it contains a single data point, and it successively merges clusters together until a particular stopping criterion is satisfied. A divisive method is the other way around: it starts with all data points in one single big cluster, and it cuts it, splits it, into multiple clusters until, again, a stopping criterion is met. Essentially, this is the task of grouping your data points based on something about them, and that something could be closeness in space or any other metric that can be used. Also bear in mind that clustering is more of a tool to help you explore a dataset, and by no means should it be used as an automatic way of classifying data; you may not always deploy a clustering algorithm in a real-world production scenario. In other words, a single clustering alone might not always be able to give you the information that you need from a dataset. We're going to try two algorithms here, one for each approach, and we will see how they work. The most common one is called k-means, so we're going to use k-means as our method here. Let me create some space and put a comment here: clustering. The first thing I'm going to do is set the seed to one; the reason is that, if you use the exact same seed, any randomized output will produce the exact same results between my execution here and what you're going to do in your own environment. K-means tries to cluster the data so as to minimize the variance within the clusters; the basic idea behind k-means clustering consists of defining the clusters such that the total intra-cluster variation, also known as within-cluster variation, is minimized. Of the several variants that are available, I'm going to use the standard one, which defines the total within-cluster variation as the sum of the squared Euclidean distances between the items and the corresponding centroid. Again, we're going to ignore the classes, so I start from column three onwards; I'm going to specify two centers; and nstart, let me make this run so that we have it running, is an option that attempts multiple initial configurations and reports the best one from within the kmeans function. So let's have a look at what the output that has been produced actually contains; again, I'm going to use the structure function. As you can see, it again holds a particular set of information. cluster is a vector of integers, ranging from one up to k, where k is the number of clusters that we requested; it indicates, for each data point, for each row of our dataset, to which cluster it has been allocated. The second one is centers: this is a matrix that contains the cluster centers, so in this particular case we have the coordinates of the two centers. Then we have withinss, a vector of the within-cluster sum of squares with one component per cluster, and tot.withinss, the total within-cluster sum of squares. Finally, we have size, which is the number of points in each particular cluster; as you can see, we have approximately four hundred and something in the first cluster, and 131 in the other one. So we have produced the clusters, and the question is, would it be possible to view them? Let me rephrase that slightly: to view the clusters as compared to the principal components that we identified before. What I'm going to do now, and this is a bit of a combination of what we did earlier with the principal components and the clusters, is plot the different data points across the two components, principal component one and principal component two. I'm going to use the color of each point for the cluster, so that entries of the same cluster are drawn in the same color, while the shape is going to be connected to the diagnosis. So let me run this; I'm going to resize it a bit. Essentially it's the exact same layout as before: we have PC1 and PC2.
But now we see that cluster one is basically all the red points, and cluster two all the green ones. Interestingly, we have diagnosis B as the round shapes, which are mostly, but not only, green, and the M ones as triangles, most of them again, but not all; you can see there are some circles here in cluster one. In other words, it kind of works, but it's definitely not the best possible case. So, now that we have the clusters and we've done a visual inspection, if you like, we can check how well they coincide with the labels that we do know; eventually, our goal is to somehow create a clustering that will correlate, or at least connect well enough, with the diagnosis. We do have some labels, so let's check against them. To do this we're going to use a method called cross-tabulation. A cross-table is a table that allows you to read off how many data points in clusters one and two are actually benign or malignant, respectively. I'm going to use the gmodels library, so I'll load it, and I'm going to create this cross-table, which is the one you see down here. It has a cell-contents legend, so it provides a bit of an index for people to understand what it's all about; basically, each cell contains how many elements fall into it, and its contribution as a fraction per row, per column, and of the table in total. You see that in cluster one there are 356 that are benign and 82 that are malignant; cluster one, if you look back, is basically the red one, and you get this picture. The second cluster relates a lot to what we saw earlier: we only have one that is benign here, and most of them are malignant, so in the second cluster we have primarily malignant cases. All in all, though, it's not the best distribution. Ideally, what we would have liked is for the percentage of this cell to be close to 100, the same as here, and the off-diagonal ones to be close to zero, if not exactly zero. So this is a more numerical way of evaluating and understanding how the overall clustering works. A question that needs to be addressed is: how well did the clustering work? The short answer is, well, it kind of worked, but it's a way off from an ideal scenario. Bear in mind that k-means, by definition, does not decide on the number of clusters to be produced; in other words, you need to define k, the number of clusters. In our instance, what we did is define k as two, because we have two groups of diagnosis and we would like to have an equal number of clusters if possible. Here, there is one technique that can be used to identify the optimal k, if you like, and this is a method called the elbow method. What it does is use the within-group homogeneity, or within-group heterogeneity, to evaluate the variability; essentially, the percentage of the variance explained by each cluster. You can expect the explained variability to increase with the number of clusters, or, alternatively, the heterogeneity to decrease: the more clusters you create, the more homogeneous the individual clusters, and the less of the variance is left unexplained. So our challenge is to find the k beyond which we have diminishing returns, where adding clusters does not improve the variability explained, because very little information is left to explain.
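Before moving on, here is a hedged sketch of the k-means step and the cross-tabulation; the nstart value and the base-plot style are assumptions, the rest follows the narration above:

```r
library(gmodels)

set.seed(1)

# k-means on the numerical columns, two centers, multiple random starts.
km <- kmeans(breast_cancer_data[3:ncol(breast_cancer_data)],
             centers = 2, nstart = 10)
str(km)

# Clusters plotted over the first two principal components:
# color = cluster, shape = diagnosis.
plot(breast_cancer_pca$x[, 1], breast_cancer_pca$x[, 2],
     col = km$cluster, pch = as.numeric(breast_cancer_data$diagnosis),
     xlab = "PC1", ylab = "PC2")

# Cross-tabulation of cluster assignment against the known diagnosis.
CrossTable(km$cluster, breast_cancer_data$diagnosis)
```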
In order to do that, I'm going to first create a function in R called kmean withinss. What it does is calculate, and return, the total within-cluster sum of squares. So we try this with k equals two, and I run this; I'm sorry, I have to run the function definition first, and this is the error that you see here: I called the function before it existed. If I now run kmean withinss, it gives me the value that you see here. What you need to do next is test this n times, and to do that we're going to use sapply and run the same function across a rather wide range of k. Let's say that we want to test up until k equal to 20, so I'm going to run this from two up until this maximum k. I've already done this; it's done. What I can do now is create a data frame with this information, the WSS: essentially it will contain the actual k value and what the within-cluster sum of squares from the kmean withinss function was. Now that we have the data, we can do some plotting. I'm going to use this new data frame, called elbow, provide the aesthetics, and create this particular curve. What it says is that with k equals two we have a high value of WSS, and, as we expect, the value drops as we increase the number of clusters and divide our initial dataset further and further. At some point, by adding more clusters we don't gain anything, either in homogeneity or in anything else. From the graph, what we can see is that the point k equals 10 is basically where we hit diminishing returns: we increase the number of clusters, but we don't get a lower value of WSS, at least not by a significant amount. So, what we can do, and I will leave this as an exercise, is rerun the clustering step, now with the new k, which is k equals 10, try again the cross-table of the clusters against the labels, and see whether there has been an improvement. The second exercise that might be useful to try out is to think of alternative metrics that could be used as a distance measure. Here, k-means is using the default Euclidean distance, but it might be possible that, in this particular instance, another metric could be used in place of the Euclidean one and provide a better result for the k-means.
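A minimal sketch of the elbow method as described; the exact function body in the tutorial material may differ slightly, and the function name follows what is audible in the recording:

```r
library(ggplot2)

# Total within-cluster sum of squares for a given k.
kmean_withinss <- function(k) {
  cluster <- kmeans(breast_cancer_data[3:ncol(breast_cancer_data)],
                    centers = k, nstart = 10)
  return(cluster$tot.withinss)
}

# Try it once, then sweep k from 2 to 20.
kmean_withinss(2)
max_k <- 20
wss <- sapply(2:max_k, kmean_withinss)

# Put k and WSS side by side and draw the elbow curve.
elbow <- data.frame(k = 2:max_k, wss = wss)
ggplot(elbow, aes(x = k, y = wss)) +
  geom_point() +
  geom_line()
```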
So, k-means clustering essentially requires us to specify the number of clusters, and, as we've just seen, determining the optimal number of clusters is often not trivial. Hierarchical clustering, another clustering method, is an alternative approach, which builds a hierarchy, in the most common variant from the bottom up, and it doesn't require us to specify the number of clusters beforehand; it does, however, require extra steps to extract the final clusters. Roughly put, the algorithm works as follows: first of all, put each data point into its own cluster; then identify the closest two clusters and combine them into one; and then repeat this step until all data points are in a single big cluster. When this is done, the result is usually represented by a tree-like structure called a dendrogram. There are a few ways to determine how close two clusters are. You can use complete linkage clustering: find the maximum possible distance between points belonging to different clusters. Single linkage clustering: find the minimum possible distance between points belonging to different clusters. Mean, or average, linkage clustering: find all possible pairwise distances for points belonging to two different clusters, and then calculate the average. And centroid linkage clustering: find the centroid of each cluster, and calculate the distance between the centroids of two clusters. What we will do here is apply agglomerative hierarchical clustering to our dataset and see what the result might be. Do remember that our dataset has some columns with nominal, which is to say categorical, values, in our instance the identifier and the diagnosis, so we will need to make sure that we only use the columns with numerical values. We also need to make sure that there are no missing values in the dataset, which might create problems and would need to be cleaned up before the clustering itself. And, again, we will need to do the scaling of the features; we need to normalize them. Here I'll show you yet another method by which the scaling can be done. Remember, so far, if I scroll a bit back up, we've seen that we can use the preProcess function of caret, or we can do it directly within the principal component analysis. Now we're going to use another function, called, simply, scale. What we're going to use is our original data frame, starting from column three onwards, converted to a data frame and scaled; I'm going to run this. Now I can get a quick summary, and you can see basically the exact same picture as before: the mean is now zero for all of the variables, and the variance and the extreme values are all roughly comparable to each other. Now that we've done that, the next step for any hierarchical algorithm is to actually calculate the distances, and to do that I'm going to use, again, the Euclidean method as before. Do bear in mind that the distance function can use a whole set of different distance metrics, so please feel free, if you'd like, to play with them; just to mention a few, aside from the Euclidean we have the maximum, the Manhattan, the Canberra, the binary, and the Minkowski, and these can all have an occasionally significant effect on how the overall clustering performs. The next step is to actually perform the hierarchical clustering, and for that I'm going to use the hclust function. Again, I'm going to use the average method; as I said earlier, the average method finds all possible pairwise distances for points in two different clusters and then calculates the average. There are additional linkage methods, like ward, single, complete, mcquitty, median, centroid, and so forth. The average method, for those who are more familiar with phylogenetics, for example, is closest to the UPGMA approach. So I'm going to run this; it runs quite fast because it's a rather small dataset, so it doesn't have a lot of computation to do. Now I can do a very simple plot by calling, basically, plot on this tree, and this is what is produced. We do see the hierarchy; it's a bit of a mess here, and if you zoom in you will be able to see more information, but essentially you see that every single element starts as its own cluster, and then, moving upwards, they merge again and again, until you get to the point where you have one big single cluster.
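A hedged sketch of the scaling, distance, and clustering steps just narrated:

```r
# Yet another way to scale: base R's scale() on the numerical columns.
breast_cancer_scaled <- as.data.frame(
  scale(breast_cancer_data[3:ncol(breast_cancer_data)])
)
summary(breast_cancer_scaled[1:4])

# Pairwise distances (Euclidean here; see ?dist for the other metrics).
dist_mat <- dist(breast_cancer_scaled, method = "euclidean")

# Agglomerative hierarchical clustering with average (UPGMA-like) linkage.
hc <- hclust(dist_mat, method = "average")
plot(hc)
```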
What we can, and should, do now is try to figure out what the desired number of clusters is. By looking at the actual dendrogram here, we can see that, depending on where we want to cut this tree, we might get completely different clusters: if I cut it up here, I'm going to have two clusters; if I cut here, I'm going to have, at a quick assessment, about six, give or take. So what we're going to do is use the cutree function, and I'm going to specify that I want to cut this in such a way that we create two groups; so I define, again, that I want two clusters. If I run this, nothing will be shown, because it runs in the background, eventually. But what I can do is request to plot this again; I'm putting some more commands here that can be used. If I run this, it draws two different boxes, a green one and a very narrow red one that you might be able to see; if I zoom in slightly, you might see a very small red box here. Plus, it draws this dotted line that is approximately where the tree is cut; I have predefined this at 18, because, visually speaking, that is roughly where the tree has to be cut in order to achieve that. So we see how the two clusters sit in differently colored boxes, but, as you can imagine, this is not the best way to visualize the overall information. What I can do instead is use a dedicated library called dendextend, which is actually dedicated to creating dendrograms. What I'm going to do here, to make this easier to see, is plot this again, but now coloring the branches based on the actual clusters. You see that all branches corresponding to the second cluster are now green, and, again, the very few that are in the first cluster are colored in red. So this also provides the context of: this is the first cluster, and this is the second cluster. We can also change the way the branches are colored, for example by having a look at how the diagnosis actually fits in there. So, instead of coloring the branches based on k equals two, I'm going to color the branches based on the diagnosis. I'm going to run this again, and it creates yet another dendrogram right here. This time around, it's interesting to see that the two different types of diagnosis are indicated in different colors. It is also interesting to see that, although there is a slightly redder area here and a slightly greener area over here, in this particular hierarchical clustering there is not a distinct splitting of the diagnosis into the two different categories. Essentially, this is an indication that we might need to figure out a different way of performing the hierarchical clustering, maybe by using different metrics or different linkage methods, in order to produce the final clusters.
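A hedged sketch of the cutting and coloring steps; color_branches is the dendextend helper assumed here, and note that its clusters argument expects values in dendrogram leaf order:

```r
library(dendextend)

# Cut the tree into two groups.
clusters <- cutree(hc, k = 2)

# Draw boxes around the two clusters and the approximate cut height.
plot(hc)
rect.hclust(hc, k = 2)
abline(h = 18, lty = 2)

# A clearer view: color the branches by cluster...
dend <- as.dendrogram(hc)
plot(color_branches(dend, k = 2))

# ...or by the known diagnosis instead (reordered to leaf order).
diagnosis_by_leaf <- as.numeric(breast_cancer_data$diagnosis)[order.dendrogram(dend)]
plot(color_branches(dend, clusters = diagnosis_by_leaf))
```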
As a final point before we move on from clustering, it might be useful to do the same exercise as earlier, where we utilized the principal components for plotting, but instead of using the k-means output as our driver, using the output of the hierarchical clustering. So I'm going to run this again: the background is, again, principal components one and two, as we created before, with the diagnosis mapped to the shape, so benign and malignant. You can see that cluster one is all of those points, and cluster two is basically the couple of elements that we see up here. Again, we could do the cross-table and see how these fit, but I'm sure everyone will agree that this is not the best possible hierarchical clustering. This leads to the exercises that I would urge you to try out based on everything we've seen here. We only used two methods, Euclidean for the distance and average for the linkage, so I would really recommend that you experiment with different methods and see whether the final results improve. The second point is that, after selecting whichever methods feel more appropriate, the cutoff selection we made for k equals two was obviously not the optimal one; so try using different cutoffs to ensure that the final clustering can provide some answer to the original question, and the original question here is: what would be the appropriate clustering, so that we can find a better split of our data points that reflects the diagnosis? So, what we've seen so far is essentially loading the data, doing some exploratory data analysis, and then trying a bit of clustering. Let's move to the next part, which is supervised learning. Supervised learning is the branch of machine learning that involves predicting labels, such as survived or not survived, malignant versus benign, and so forth. Such models learn from labeled data, which is data that includes the label of each entry, for example whether a passenger survived; this process is called model training. Then, based on this model, you try to make a prediction on data for which you have no label. These are generally called training and test sets, and they are the ones we are going to be using, because what you want to do, and this is a key element, is build a model that learns some patterns on a training set, and then use this model to make predictions on the data points of the test set, which the model has not seen at all. Based on this prediction, you can calculate the percentage of the test set labels that you actually got correct, and this is known, roughly, as the accuracy of your model. As you probably gathered from the introduction so far, a good way to approach supervised learning is this: do some exploratory data analysis on the dataset, and create a quick and dirty model, or, if you like, a baseline model, which can serve as a comparison for the later models that you will build. Then you iterate this process: you might need to do some more exploratory data analysis and get another model, and at some point you need to engineer features, so either take the features that you already have and combine them, or extract more information from them, or create a subset of them; and eventually you come to the last point, which is getting a model that performs better.
As you can see, it's not a "run something and let it go" situation; it's more of a really iterative and investigative process. A common practice in all of supervised learning is the construction and use of these train and test datasets. What this process does is take all the input and randomly split it into the two datasets, the training one and the test one. The ratio is, of course, up to the individual researcher doing the machine learning and can be anything from an 80% training set and 20% test, to 70/30, 60/40, or even 50/50; it is completely up to the researcher, and at some point it might also be dictated by the distribution of the classes within the dataset. Looking at supervised learning, we'll take classification as our first task. There are various classifiers available: we have decision trees, we have neighbour-based classifiers, and we have support vector machines. What we're going to show here are tree-based approaches, so we'll start with decision trees. A decision tree is a type of supervised learning that can work with any type of variable, both numerical and categorical ones. What it does is split the population into two or more homogeneous sets based on particular features, and it splits on the most significant differentiator, the splitter, if you like, among the input variables. The decision tree is a very powerful nonlinear classifier; it eventually builds a tree-like structure that captures the relationship between the various features and the possible outcomes, and every time it makes a decision it creates a branch, which gives the tree-like structure. There are two types of decision trees. We have the categorical ones, and this is basically the case of classification, where the decision tree has a categorical target variable; in other words, survived or not, malignant or benign, so different types of labels, if you like. And then there are the continuous ones, where the decision tree has a continuous target, so the outcome is the value of your label, and the usual case here is regression; I'm going to get to regression in a bit. For now, I'm going to create a classification tree, where we aim for the diagnosis in our case. The first thing I'm going to do, let me create some space first, let's start with classification now, is create the training and test datasets. I'm going to use set.seed with 1000; again, feel free to use the exact same seed, so that we get the exact same results. In the first row here, I'm going to create an index of whether a particular entry, a data point in my original dataset, will go to group one or group two, with a probability of 70% train and 30% test. Based on these indices, I'm going to create the two datasets, and we can find them here as well; there we go, we have the test and the train, and we see that the train set has about 400 observations, and about 180 are in the test one. So we have created our main partition.
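A minimal sketch of the random 70/30 split, using the sample-based indexing described above; the dataset object names are the ones assumed throughout:

```r
set.seed(1000)

# For each row, draw group 1 (train) or 2 (test) with probability 70/30.
index <- sample(2, nrow(breast_cancer_data),
                replace = TRUE, prob = c(0.7, 0.3))

bc_train <- breast_cancer_data[index == 1, ]
bc_test  <- breast_cancer_data[index == 2, ]

nrow(bc_train)  # roughly 400 observations
nrow(bc_test)   # roughly 170 to 180
```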
What I'm going to do now is use the rpart library, which will help me create a decision tree, so I'm going to load it. The first thing I need to do is define a formula; in other words, how do I want to evaluate, what are the key elements that I find relevant, in order to identify the diagnosis? In this case, I specify a formula that says that I want to classify my diagnosis label based on the radius mean, area mean, and texture standard error columns of my dataset. Of course, this is something that I hard-coded in, and it's quite easy to create anything: for example, if I wanted to take into account all the columns, I could equally well put a dot, so diagnosis as a function of all my columns, where the dot represents all the columns. But, mostly so that we can keep the example short and understandable for discussion, I'm going to use this particular constrained formula; so let's run this, and now I have my formula here. I'm going to now run the modeling part. I'm going to use rpart, I'm going to provide my formula as the way I want to do the classification, I'm going to use the train dataset as input, and I'm going to specify some additional parameters. These are, specifically: minsplit, which is the minimum number of data points in a node for a split to be attempted; minbucket, the minimum allowed number of instances, of data points, in each leaf of the tree, so when we get to a leaf, at the very bottom of the tree, the minimum number of data points that can end up there; maxdepth, which is essentially the maximum depth the tree can reach; and cp, which is the complexity parameter. Intuitively, the larger the value of cp, the more probable it is for the tree to be pruned, so branches and splits will be cut off. I'm going to run this, and it's already been executed. What would be nice to do now is, first of all, print the cp table and get information about its content. Here you see the different values of cp, what the error is, and the number of splits that have been done at each iteration, up until we get to the last row here. More interesting, I think, is to actually see the tree itself, so what I'm asking here is to plot the tree, and this is what we get. This is quite an interesting plot. Essentially, at each node there is a particular question; for example, the question here is: is the area mean value less than 606? If yes, we go to this subtree; if no, we go to that subtree. The color of each node corresponds to the two classes, B and M, and depending on how predominant each value is, you see whether the node leans towards B or M. It's not the best tree in itself, so what we can do is select a tree with the minimum prediction error. In order to do that, I can use an optimization step where I select, from the cp table that we've just printed out, the row that contains the minimum error. Here, you might find that the last row is not the best one; if you actually look at the table, you see that there is one option up here where the error is lower than the other values. So I'm going to use this one; I'm going to identify its cp, and we see that the cp value is indeed the one we saw here. Then we're going to prune the tree based on this particular value, and I'm going to print it out again.
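A hedged sketch of the rpart call and the pruning step; the underscore-style column names, the specific hyperparameter values, and the rpart.plot package for drawing are assumptions in the spirit of what is described above:

```r
library(rpart)
library(rpart.plot)  # assumed here for drawing the tree

# Classify diagnosis from three hand-picked columns.
fitted_model <- rpart(
  diagnosis ~ radius_mean + area_mean + texture_se,
  data = bc_train, method = "class",
  minsplit = 10, minbucket = 3, maxdepth = 10, cp = 0.01
)

# cp table: error and number of splits per iteration.
printcp(fitted_model)
rpart.plot(fitted_model)

# Pick the cp with the minimum cross-validated error and prune.
best_cp <- fitted_model$cptable[which.min(fitted_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(fitted_model, cp = best_cp)
rpart.plot(pruned_model)
```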
Technically, although this is a pruned tree compared to what we saw before, the predicted accuracy is much better. And again, this is with the understanding that I'm only using these three variables to do a prediction, which is a very limiting assumption from the get-go; I'm just showing how this can be performed. I'm also going to create a table here, a confusion table if you like, representing how the different classes are classified by the model. There are eight instances of malignant that have been misclassified as benign, and 35 cases of benign that have been classified as malignant, based on these particular variables only. So this is our model, right? We haven't tried it out yet. The next step, now that we have the model, is to check how the prediction works on our test dataset; now we actually want to make a prediction. It is a prediction because the model has not yet seen the test dataset we are using here; we know the labels, so we can assess it, but it's not something the model knows. So I'm going to run this, and we get this plot and the actual confusion table as well. In the confusion table we get pretty much what we expect from the model, because it was not the best one; the plot is a visual representation of the confusion table, so it's easy to see how one maps onto the other. Only two of the malignant cases have been mis-assigned as benign, and only 15 of the benign have been mis-assigned as malignant. So overall it is a good model, let's say, but it definitely has room for improvement. Some quick questions that I would like to ask you, as kind of an exercise: what are the key parameters that will have the most impact here? So far we've played basically with cp only, as a way to minimize the error, but there are other parameters here, like maxdepth, minsplit, and minbucket, that we haven't tried at all; I definitely urge you to try those a bit more. A second point is that, so far, we've been using only three of the variables for our entire model. If we put all of the features together, what would be the impact: would our prediction be improved, or would it be worsened? And the follow-up question: would this be a good plan or a bad plan? What I would really urge you to do is go back and try to remember the things we discussed at the beginning of this tutorial, which relate to feature engineering and the principal component analysis, as well as to identifying which might be the relevant features to select. These exercises are also linked in the Galaxy Training Network tutorial, and you can find more information there as well.
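Before moving on, here is a hedged sketch of the test-set prediction step described above, reusing the pruned model from the previous sketch:

```r
# Predict classes for the unseen test set and cross-tabulate
# predictions against the known labels.
test_predictions <- predict(pruned_model, bc_test, type = "class")
table(predicted = test_predictions, actual = bc_test$diagnosis)
```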
Let's now move on to another type of decision tree, and it's probably one that you've already heard of a few times in your work: random forests. A random forest is an ensemble learning technique; ensemble means that it constructs multiple decision trees, where each tree is trained with a random sample of the training dataset and on a randomly chosen subspace, essentially randomly chosen features, of the dataset. The final prediction is derived from the predictions of all the individual trees; you then take either the mean, if you're talking about regression, or a majority vote, if you're talking about classification. The advantage is that it usually has better performance and is less likely to overfit your data than a single decision tree. However, a key disadvantage of a random forest is that it has much lower interpretability: as opposed to a tree like the one we just built, where we can actually see why a choice leads from here to there (whether it makes sense or not is a different question), in a random forest this is not really an option. There are several libraries that provide a function for random forests. In this instance, we're going to be using the randomForest library. It has some advantages: it is quite fast and can work with limited memory. On the other hand, it cannot handle data with missing values, and it has a limit of 32 as the maximum number of levels of a variable if it's a categorical one. It does have some extensions, though; extendedForest and gradientForest are some of them. As for additional functions that do random forests, I can recommend the cforest one from the party package, which does not have these limits, but at the same time is rather slow compared to randomForest and requires a bit more memory. So, again, I'm going to load randomForest, set the seed to 1000, and train my random forest; let me create some space here. I'm going to use randomForest, and here I'm going to be using all my variables, because it doesn't make sense to limit them beforehand: as I said, what a random forest tries to do, by definition, is take a random sample of the training data and then a randomly selected subset of the columns, so I'm going to use all of them in this particular instance. Let me run this, and as soon as it's done, the next step is a quick confusion matrix to see what the output is. As you can see, it is slightly improved compared to the decision tree earlier; off the diagonal we have only the 8 and 11 misclassified points, so it is, again, slightly better than the previous one. We can also check what the actual content of the random forest object is. If I run this, let me push this up a bit, it gives us information about the forest itself: we used 100 different trees, and at each split of a tree only five of the variables were considered. You see the classification error, which is 3% and about 10% respectively, so there is still room for improvement. We can also see the overall performance of the model: by plotting this information, you can see how the error changes across the number of trees. Remember that this tries a lot of different trees, with several subsets of the training set and of the features. In all cases we see that the error ranges between 0.05 and 0.10, with the lines showing the per-class and the overall performance. One of the perks of random forest is that it gives us the option of reviewing which of the variables had the highest importance, where importance, in our instance, is the impact on the performance of the model. This is the importance function of randomForest, and it prints out, down here, the whole list of all the features that have been used, each with its corresponding importance.
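A hedged sketch of the random forest training, assuming the 100 trees mentioned above and dropping the id column before fitting:

```r
library(randomForest)
set.seed(1000)

# All predictors (minus the id); randomForest samples the
# feature subsets at each split by itself.
rf_model <- randomForest(diagnosis ~ .,
                         data = bc_train[, 2:ncol(bc_train)],
                         ntree = 100)

# Confusion matrix, per-class errors, and error-vs-trees plot.
print(rf_model)
plot(rf_model)

# Variable importance, as a table and as a plot.
importance(rf_model)
varImpPlot(rf_model)
```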
So in this particular plot (let me try to resize this a bit), we again have the features, and for each of them the corresponding importance. Because the space is short, not all of the variables are listed and there is some overlap, so only some of the labels are visible, but all of the variables are actually plotted as points. And here you can basically see which are the most important features among them. So this is also one way that we can select the most interesting features, in the sense of the ones with the greatest impact on the model.

So let's actually take this model that we just created, the random forest one, and do a prediction again. I'm going to use predict, and I'm going to create a table, which is out here. We can see that on the testing dataset it actually improved quite a bit: there were no misclassifications of the benign cases, and only four of the malignant ones were misclassified. And if I plot this information, we can see the predictions for the different diagnosis classes across the different data points.

We can also try to evaluate models with reduced numbers of variables, based on the ones that are ranked highest by their importance. In order to do that, I'm going to use the random forest again, but with cross-validation, and this time I'm going to use a cross-validation fold of three. Now I'm going to ask it to show the results, and we can see that this time around the error is much lower: with cross-validation it has improved. Basically, cross-validation means that each time a different part of our data is used as training and test, and this iteration is repeated multiple times in order to produce the final model. So the bottom line here is that you can also use the random forest for feature selection: based on the features that have been selected as most important, you can then move further with different types of models for the classification itself.

So, the final part of this tutorial is going to discuss regression, and we're going to be looking at linear regression. Linear regression is essentially a model that tries to predict the response as a linear function of the predictors, the predictors being our particular columns. The most commonly used function in R for this is lm, and it is essentially used in machine learning as well. In our dataset, let's try to investigate the relationship between three of our variables: let's go for radius mean, concave points mean and area mean. We can try to get a quick understanding of how these variables are correlated with each other by using the cor function. If I run this, it gives me the correlation values, and essentially what it says is that they are somewhat correlated. So let's first create a smaller subset of our dataset, the BC one, where essentially we select only these three variables. As you can see, I'm creating a new subset here called BC, and I'm using only those three variables out of the data. And now let's build a linear regression model with the function lm, using this particular dataset, the BC one.
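As a rough sketch of these last steps, and of the model we are about to inspect, this is how the correlation check, the BC subset and the lm call could look. The data frame name (breast_cancer_data) and the exact column names (radius_mean, concave_points_mean, area_mean) are assumptions based on the narration, and the two values in the new sample follow the narration and may be transcription artifacts.

```r
library(dplyr)

# Keep only the three variables of interest
BC <- select(breast_cancer_data, radius_mean, concave_points_mean, area_mean)

# How strongly are they correlated with each other?
cor(BC)

# Linear regression on the full (unsplit) data:
# radius_mean as a linear function of the other two
lm_model <- lm(radius_mean ~ concave_points_mean + area_mean, data = BC)
lm_model          # coefficients of the linear equation
summary(lm_model) # standard errors, p-values, R-squared

# Predicting the mean radius of a made-up new sample
new_sample <- data.frame(concave_points_mean = 2.72, area_mean = 0.0964)
predict(lm_model, newdata = new_sample)

# Truth versus fitted values, with an intercept-0 slope-1 reference line
plot(BC$radius_mean, predict(lm_model),
     xlab = "actual radius_mean", ylab = "predicted radius_mean")
abline(0, 1)
```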
And as you can see, I'm using the full dataset; I haven't yet split it into training and test, mostly because I want to highlight how this works. I can print the contents of the model itself, and it gives us information about the overall structure of the linear model. What it tells us here are the coefficients of concave points mean and area mean in the linear equation that connects them to radius mean.

And let's see now if we can predict the mean radius of a new sample. Let's say that we create a new sample that doesn't exist in the data at all, with a concave points mean of 2.72 and an area mean of 0.0964. In order to do that, I'm going to take the model we had and do a prediction with it. I can then plot it and see how well this works. So I'm going to do a plot, and this is what is being produced: these are all the different points, and I've also added a line from zero with slope one, to give an indication of how well they could potentially fit onto a single line. I can also have a better look at what the regression model actually contains by using summary. The summary provides some more context on the linear model itself, including standard errors and p-values, with the significance codes listed here. And you can see that roughly all of these points fall along the same line, and the probability of them not being on the line is quite low, so on average it is a good enough fit.

This, however, only provides an evaluation on the whole dataset, which is the same data we used for the training; we don't know how well this linear model will perform on an unknown dataset. So let's go back to the original plan of splitting our dataset into a training and a test set, create the model on the training set, and then visualize the predictions. Again, let me scroll down a bit. I'm going to set the seed again, to 123, and again I'm going to create indices; this time I'm going to use different probabilities for training and test, 75% and 25% respectively. And I'm going to create two different datasets, the BC train and the BC test. Both of them use only the three columns that we discussed before and that we're playing around with, if you like. Based on that, let's now create the linear regression model, this time using only the training dataset. We can also get a quick summary of the model itself and see how it behaves. Let me scroll out a bit again, and you can see that it doesn't differ much from the previous one, where we were using the entire dataset.

And we can do a visual evaluation. So let's do the predictions and put them in a different object. I'm going to run this and save the predictions for the breast cancer training data, and let's now plot the truth against the prediction. It actually follows pretty much the same line, and this is quite similar to when we were using the whole dataset, so this works fairly well. Now, let's go ahead and actually use the test dataset, which we haven't tried at all. I'm going to run this and try to review the result. The error that has been produced is because I haven't yet done the prediction on the test set; I have done the prediction on the train set, but not the test.
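Here is a minimal sketch of the split-and-retrain sequence above, including the test-set prediction and the error metrics that come next. The object names (BC_train, BC_test) and the exact splitting mechanism are assumptions; the 75%/25% probabilities and the seed follow the narration.

```r
set.seed(123)

# 75% / 25% split into training and test sets
train_idx <- sample(c(TRUE, FALSE), nrow(BC), replace = TRUE,
                    prob = c(0.75, 0.25))
BC_train <- BC[train_idx, ]
BC_test  <- BC[!train_idx, ]

# Fit the linear model on the training set only
lm_train <- lm(radius_mean ~ concave_points_mean + area_mean,
               data = BC_train)
summary(lm_train)

# Predictions for both sets; plot truth against prediction, 0-1 line
pred_train <- predict(lm_train, newdata = BC_train)
pred_test  <- predict(lm_train, newdata = BC_test)
plot(BC_test$radius_mean, pred_test,
     xlab = "actual radius_mean", ylab = "predicted radius_mean")
abline(0, 1)

# Root mean square error on the training set, compared with the standard
# deviation of the outcome; repeat with pred_test to check for overfitting
res_train <- BC_train$radius_mean - pred_train
sqrt(mean(res_train^2))   # RMSE
sd(BC_train$radius_mean)  # SD of the actual outcome
```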
So I'm going to run this. Now I have the prediction for the test set, and I can run this and we get a new plot. Again, as we can see, it actually follows the line rather well. I can also use the root mean square error and the R squared metrics to evaluate our model on the training and the test sets. So let's go ahead and do that. Let's create the residuals first: I'm going to run this function on the training set. And let's calculate the root mean square error; that's done here, together with the standard deviation of the actual outcome. These are the results that you can see down there, and as we can see, our root mean square error is very small compared to the standard deviation, so basically we can say that this is a good enough model.

In closing this tutorial, I would also recommend that you try the same process and compute the corresponding root mean square error for the test data, and check whether the model is overfit or not. And again, what we want to check here is the R squared value, and to see whether it is close to one; that's the important element. Another option you can try here is using different variables from the original dataset, to see whether the model can be improved overall and to get different perspectives. Overall, linear regression is an effective way of predicting a real value, as opposed to a categorical one for classification.

So with that, I would like to close this tutorial. I recommend that you check out the material; all the commands are listed explicitly in the Galaxy Training Network tutorial, so you can go through it at your own pace and refer back to this video at any point for clarifications. I would like to thank again both the Galaxy Training Network team and the Galaxy project for supporting this, and of course ELIXIR, which put together the original material on machine learning in R that this is based on. Thank you, I hope you enjoyed this, and I'll see you in one of the other videos.