Hi everybody. I have to say that Lauren and I somehow prepared the content for a course in machine learning, an introduction to machine learning, as opposed to a one-hour lecture. So I will try, but my goal is to give you information that you can take and use in your own practice, and also information that helps you understand the results when others apply these methodologies and hand you the output. So please feel free to stop me and ask questions, and we'll go at a pace that is hopefully comfortable for most of you.

Briefly, there are many objectives, but first we'll talk about the data integration technique that we have developed, which is broadly used, called similarity network fusion. Then we'll talk about cluster analysis and the many techniques used in cluster analysis, and then, if there is time — that would be around slide 60 — building classifiers. I think Han said she had shared the slides, so if you have questions about building classifiers and we don't get to them, I'll be happy to go over them with you.

The reason we developed this data integration technique, similarity network fusion, is that more and more often a lot of different and very heterogeneous kinds of data are being collected on the same set of patients. So you have one set of patients, and you might have methylation data on them, methylation and mRNA and microRNA and other epigenetic data, but you can also have non-gene-related, non-omic data such as clinical data — diet, for example, which was the case in our IBD study, the inflammatory bowel disease study. So one concern is that the data we get on the same set of patients is very heterogeneous. The other two concerns are, first, that we have a lot of measurements but very often very few patients. In classical statistics we really like having few measurements and lots of patients, but that is unfortunately not the case in omics data: we have lots of measurements and few patients, and we have to take that into account. And second, the different data types provide different kinds of information — sometimes one confirms the other and sometimes it contradicts the other — so how do we take advantage of all of that difference in the signal?

So this is what we have developed. It is a method with two steps. The first is to generate a network for each data type, where the network represents the similarity between individuals according to that data type. So however many types of data you have, you will have that many networks. Each node in the network represents a patient, and each edge represents how similar two patients are: the stronger (the more colorful, in this particular representation) the edge, the more similar they are, the closer they are to each other. Here, actually, is an example from the TCGA glioblastoma data set — these three networks come from the real data. Once you have constructed these networks, we have a nonlinear approach, which I will describe in more detail as we go, that combines all of these networks. At that point it no longer cares where the data came from; what it cares about is that everything is in a space of similarities.
And once we're in the space of similarities, we can actually combine all of this data. That is the key point of this approach.

So here's an example — and we have lots of examples, because we work with lots of different kinds of data. This one is non-omics data, but you can imagine the same for your data, whatever data you have; we were just trying to give very different types of examples with very different types of data. Here we had a clinical cohort of 80 samples, all of them diagnosed with OCD, and what we wanted to identify were subtypes of OCD from this data. We had the clinical data and two kinds of neuroimaging-derived data: measures of brain structure and metabolite concentrations. So for each individual we had these three different types of measurements. Sometimes this works and sometimes it doesn't.

This is a schematic of what the data looks like: a matrix where each row represents a patient and the columns are the measurements corresponding to one particular type of data — here it was brain structure. So we had brain structure, we had the metabolite concentrations, and we had the clinical data, three different matrices represented like this, and the patients are the same in all of them. What we do is essentially integrate out the information about what type of measurement it is: we look at the distance, or the similarity, between patients — and we'll go over the different distance metrics you can use in this context. Once we have the patient-by-patient similarity, that is, in a sense, equivalent to a network. If you were just to look at, say, the correlation between all pairs of patients, you could construct such a matrix, but it would be a full matrix and your network would be fully connected. So what we usually do is sparsify this matrix to get the network you see here, and the sparsification is simply based on who is most similar: you remove the links between patients that are not similar enough. We did this for all three types of data, we got three different networks, and then we integrated them.

I want to give you a sense of what the integration does. Imagine this example, where we have just two clusters of points and we are trying to figure out the true clustering in the data. The true classes are what we don't see; what we see is the version on top, where some portion of the points is relabeled so that they look like part of the first cluster when they really are not, and in the other type of data another portion is relabeled from the second cluster. So in some sense you are getting partial information from both types of data, which is quite common — this is simulated data designed to mimic what we commonly experience. When you have such information, you can use similarity network fusion to combine these networks into a joint clustering, which recovers essentially the original clustering that you want to find. This plot is just the performance of the methods as a function of how much noise there is in the data, that is, how many points are relabeled, and these other curves are competing approaches that do not perform as well.
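To make the two steps concrete, here is a minimal Python sketch of the network-construction step (a radial-kernel similarity on Euclidean distances, sparsified to each patient's nearest neighbours) applied to several data types measured on the same patients. The fusion step shown here is only a simple stand-in — averaging the networks and spectrally clustering the result — not the actual SNF cross-diffusion procedure; the matrix sizes, the choice of k, and the bandwidth are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering

def knn_affinity(X, k=20, sigma=None):
    """Patients-by-features matrix -> sparsified patient-by-patient similarity.
    Radial (Gaussian) kernel on Euclidean distances, keeping only each
    patient's k strongest edges, as in the network-construction step."""
    D = squareform(pdist(X, metric="euclidean"))
    if sigma is None:
        sigma = np.median(D[D > 0])          # crude bandwidth choice
    W = np.exp(-D**2 / (2 * sigma**2))
    keep = np.zeros_like(W, dtype=bool)
    nearest = np.argsort(-W, axis=1)[:, :k + 1]
    for i, cols in enumerate(nearest):
        keep[i, cols] = True
    return np.where(keep | keep.T, W, 0.0)   # keep the network symmetric

# one matrix per data type (e.g. methylation, expression, clinical), same 80 patients
rng = np.random.default_rng(0)
views = [rng.normal(size=(80, 200)) for _ in range(3)]
networks = [knn_affinity(V, k=20) for V in views]

# stand-in for the fusion step: average the per-type networks, then cluster.
# (the real SNF iteratively cross-diffuses the networks so that strong edges
#  supported by any one data type can propagate into the fused network)
fused = np.mean(networks, axis=0)
labels = SpectralClustering(n_clusters=3, affinity="precomputed",
                            random_state=0).fit_predict(fused)
```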
So in the resulting network, what that means is that even if we have information from just one type of data, and it is not captured in the other types, if that information is fairly strong — if the similarity is fairly strong — it can propagate into the joint integrated network without having to be present in all the others.

Here's another example with a different kind of noise in the data, also very common. Again, the ground truth is not seen. In the top perturbation there is Gaussian noise; in the bottom perturbation, gamma noise. All you need to know is that these are different types of noise: one is a normal, white-noise kind of perturbation, while the gamma noise has a tail, so there are a lot of extreme values. You can see that both of them provide information about the original clustering, but there is a lot of this off-diagonal noise, which you don't want in your data. Once you use similarity network fusion, it integrates the data and removes a lot of this noise. If the noise were reinforced — identical in both data types — then of course fusion would give you back the same measurements. But if the noise is different across data types and weak compared to the signal, then the noise drops out and a much stronger signal pops up. That is the example here: this edge, the similarity between these two individuals, was present in only one data set and was fairly weak, and so it disappeared when we combined the three networks.

We applied this approach to several different examples. One, published last year, was on the pancreatic cancer TCGA consortium data. Another, a completely different one also using imaging data, was subtyping of a psychiatric disorder, where we combined clinical and imaging data. Here we combined methylation data and clinical data, and we are also doing similar work in inflammatory bowel disease. Basically, this is just to tell you that there are many different types of data you can combine to arrive at the similarities and the clustering.

So now I will talk about clustering, but first I will talk about distances. I want to make a distinction before I move to clustering: similarity network fusion gives you the similarities across all patients — you don't have to cluster that data. Very often you do want to cluster: you want to identify discrete subtypes in your patient population, or, if you are working with genes, you might want to identify gene modules, so you cluster your data. But depending on your goals you don't necessarily have to cluster. That said, clustering is a very important and integral part of the analysis of omics data, so that is what we'll discuss next. You can cluster on single or multiple types of data, and you have to figure out what signal you care about — that is about the preprocessing of your data and deciding what, for example, you will base the similarity on.

So let me switch to the distance metrics. When clustering, you have to figure out how similar one instance is to another, and whether they should be clustered together or not. Here is the most commonly used metric, the Euclidean distance, shown here in 2D.
And you can see that it looks at the coordinates — back to early school — where you take the difference between the coordinates on the x axis, the y axis, and so on. The generalized form, the square root of the summed squared differences, is for when you have n dimensions; it is still a sum across all of those dimensions. This is by far the most commonly used metric for continuous variables.

I want to bring up another one you might or might not have heard about, the Mahalanobis distance (which I always struggle to pronounce). The point is that this distance is a generalization of the Euclidean distance, and it is very often used where you would normally use the Euclidean distance. Why? Because of this term S, which is the covariance matrix of all your measurements. When your measurements are all on the same scale — suppose you measure, I don't know, only weights — that's all fine. But if you are measuring height, weight, and maybe other clinical measures, blood tests, whatever, then these are not on the same scale. People do some kind of normalization or standardization to account for that, but another way is to use the Mahalanobis distance directly and estimate this covariance matrix. The ultimate distance will then depend on your data; it is constructed for your data. It is very often used in machine learning, for those of you who use those kinds of approaches. And you can see that incorporating this covariance S allows you to model the distribution of the data better than the Euclidean distance, which is shown here as the red concentric circles.

But of course you might have categorical rather than continuous variables — high, medium, low, say, three categories for one of your measures. In that case the Hamming distance is the number of mismatches; here it is shown for binary data, and that is what people most commonly use. Which metric you use for binary or categorical data really depends on what you are trying to achieve. For example, if you are looking at somatic mutations and there are only a few mutations you actually care about, then sharing the same somatic mutation between two patients is the strong signal for you, so you count similarity only in terms of somatic mutations that are present in both. If you counted matches and mismatches indiscriminately, you would also count the positions where the mutation is absent in both patients — and there are large stretches of the genome with no somatic mutations, so those shared absences would dominate your metric. So very often you have to stop and think about which distance metric to use, and this happens a lot in any clustering we do in our lab as well, to make sure the distance metric ultimately captures what we care about.

Finally, and more recently, there is a lot more longitudinal data becoming available. Longitudinal or time series — in my mind it is called longitudinal when you have a few measurements over time and time series when you have lots, hundreds, of measurements over time, but that may just be my own definition based on my experience. One of the most common metrics there is the Fréchet distance.
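As a quick illustration of these metrics, here is a small Python sketch using SciPy; the vectors and the toy data set are made up for the example.

```python
import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, hamming, jaccard

x = np.array([1.80, 72.0, 5.1])   # e.g. height (m), weight (kg), a blood measure
y = np.array([1.60, 90.0, 4.7])
print("Euclidean:", euclidean(x, y))

# Mahalanobis rescales by the inverse covariance of the whole data set, so
# features on very different scales no longer dominate the distance
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * [0.1, 15.0, 1.0] + [1.7, 75.0, 5.0]
VI = np.linalg.inv(np.cov(X, rowvar=False))
print("Mahalanobis:", mahalanobis(x, y, VI))

# Binary vectors: Hamming counts all mismatches (as a fraction of positions),
# Jaccard ignores positions where both are 0 (shared absences, e.g. no mutation)
a = np.array([1, 0, 0, 1, 1])
b = np.array([1, 1, 0, 0, 1])
print("Hamming:", hamming(a, b))   # 2 mismatches out of 5 positions -> 0.4
print("Jaccard:", jaccard(a, b))   # 2 mismatches out of 4 non-(0,0) positions -> 0.5
```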
And this is a cartoon that Lauren found somewhere, which I think is great. You are trying to compute the distance between two curves — between measurements over time of two different types — and the Fréchet distance is essentially the minimum leash length that allows the dog and the owner to each walk along their own curve from start to end. Formally, it is the minimum, over all ways of traversing the two curves, of the maximum distance between the paired points on the two trajectories.

One important point: sometimes methods require a distance, which means the bigger the number, the more dissimilar the patients or genes or whatever you are measuring are. But sometimes they require a similarity, which is in a sense the inverse. You can see here we have the Euclidean distance computed as before, and the corresponding radial kernel similarity is obtained by exponentiating the negative scaled squared distance, so the smaller the distance, the higher the similarity. In a similarity, the higher the value, the more similar two items are — the opposite of a distance. Just to make sure this sinks in: we actually get a lot of questions about this for SNF, because we do cluster at the end, and there were lots of users saying "I get exactly the opposite kind of clustering," and it turned out they were feeding in a distance where we were using a similarity. So it does matter.

Okay, are there any questions before I switch to the clustering approaches? I don't have a clock here, so I have no idea how we're doing on time. Excellent — we might be able to go through it all. Questions? All right.

So, common clustering approaches. The first approach, which you have already heard about from Andre as I understand, is hierarchical clustering; I will then talk briefly about k-means, Gaussian mixtures, and spectral clustering. Hierarchical clustering: the idea is to cluster all the way up to one cluster. The way it works is that you start with every instance, every item, in its own cluster, and then you keep merging them, so at the end everything is merged into one big cluster. You compute your distances — in hierarchical clustering these are often called linkage functions — and here it looks like A and B are closest to each other, and E and D are closest to each other, so we merge A and B into one cluster and D and E into another. In the second step we identify that C is closest to the A-B cluster, so now A, B, C is one cluster, D, E is another, and the remaining point is still on its own. The dendrogram at the bottom corresponds to this merging procedure. The next merge brings in F, and ultimately everything is merged into a single cluster. In some sense this is, I would guess, the most commonly used clustering approach, because it is so unassuming in a way. But it does depend on the distance metric, and different metrics will give you different clusterings. And at some point you have to decide how many clusters there are, because you have to cut this dendrogram somewhere — but it gives you the option of making that decision after the fact, unlike k-means, where you have to decide on the number of clusters before you actually cluster.
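Here is a minimal sketch of agglomerative (hierarchical) clustering with SciPy on a made-up matrix, including how you would cut the dendrogram after the fact to get a chosen number of clusters; the linkage method and the data are assumptions for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# toy patients-by-features matrix with two obvious groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(20, 5)),
               rng.normal(4, 1, size=(20, 5))])

# merge the closest items/clusters step by step
# (here: average linkage on Euclidean distances)
Z = linkage(X, method="average", metric="euclidean")

# the dendrogram records the whole merge history ...
tree = dendrogram(Z, no_plot=True)

# ... and you only commit to a number of clusters when you cut it
labels_2 = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
labels_4 = fcluster(Z, t=4, criterion="maxclust")   # or 4, after the fact
```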
So, the steps of k-means — and most of you will have heard of k-means; it is maybe the second most commonly used clustering technique. You choose the number of clusters and you set the centers, essentially completely randomly. Then for each point you compute its distance to each of the centers and assign the point to the cluster whose center it is closest to. Here, the red points got assigned to one cluster and the yellow points to another. Once you have assigned all the points, you recompute the centers and repeat the procedure. So you recompute the center — you now have the evidence that all these red points are in the cluster, so the center moves to the central point of that cluster — and you repeat until convergence. Intuitively this procedure obviously converges, and there is also a proof. You have the new centers, you recompute the distances to the centers, you reassign the points, points now move between clusters, you recompute the centers again, and you end up at a final configuration where the centers stop moving and the assignments stop changing, and you are done. That is your final clustering for this particular data set.

Gaussian mixtures are, quite nicely, a generalization of this. K-means is a hard assignment procedure: you assign each point either to one cluster or to the other. Gaussian mixtures allow a probabilistic assignment, but it is exactly the same kind of procedure, as you can imagine, just done with expectation-maximization because we are now talking about probabilities. So once the procedure converges, some of these points will be assigned to one cluster with probability 80% and to another cluster with probability 20%. But essentially it is a generalization, and it is also very commonly used.

And finally, spectral clustering. Spectral clustering in some sense makes fewer assumptions: with Gaussian mixtures you have to assume that the underlying distribution of your data is a mixture of Gaussians and then find the assignment, whereas in spectral clustering all you do is compute your similarities, do a PCA-like eigendecomposition of that similarity space, take just the top P components, and then run k-means on those top components. That is spectral clustering. It is used very commonly for graphs, and in SNF, for example, that is what we use. You can see the difference here: while in k-means the closest points are the ones that get clustered together, if you have two clusters that lie on a manifold, like this Swiss roll, spectral clustering is able to capture them more accurately. So if the clusters are not linearly separable — in the sense that you cannot just draw a line between them clearly and easily, as in the Swiss roll example — then spectral clustering may be the better choice.

For all of the clustering approaches I have mentioned, you have to choose the number of clusters — not beforehand for hierarchical clustering, but you do have to make that decision at some point; none of these procedures makes it for you. And Gaussian mixtures also have that probabilistic aspect: for new points you will be able to say probabilistically which cluster, which mixture component, they belong to, which is a nice thing.
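A small sketch of the three approaches side by side in scikit-learn, on a two-moons data set standing in for the Swiss-roll-style example where the clusters are not linearly separable; the data set and parameter choices are illustrative.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture

# two interleaving half-circles: clusters you cannot separate with a straight line
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means: hard assignments based on distance to the cluster centers
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Gaussian mixture: same idea, fit by EM, giving soft (probabilistic) assignments
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gmm_proba = gmm.predict_proba(X)       # e.g. 80% cluster 1, 20% cluster 2 per point

# spectral clustering: similarity graph -> top eigenvectors -> k-means on those
sp_labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0).fit_predict(X)
# on this data, k-means and the GMM tend to split the moons incorrectly,
# while spectral clustering typically recovers the two curved clusters
```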
I don't talk about nonparametric clustering approaches here, but they exist. Actually, two things I will say. First, there is a generalization of k-means called X-means, and X-means derives the number of clusters for you. It is not commonly used — I have not seen it used much, even though it actually came out of the lab where I did my PhD and we all like it — so I figured I would not introduce it here. Another one is actually used, but they all make some kind of assumption. There is Dirichlet process clustering, where the assumption is that the rich get richer: if one cluster already has a lot of members, then a new member is more likely to be assigned to that cluster than to the others. That is an assumption one makes, and unfortunately, even in that case, it has been shown that the procedure overestimates the true number of clusters, simply because of the stochastic process underlying the estimation. So deciding on the number of clusters is not a solved problem, and I think that makes sense: it is not a solved problem because it really depends on your particular objective for doing the clustering. Sometimes you can see it by eye, it is very clear, and that's great. But a lot of the time you want to see which small clusters combine into a larger cluster, or something like that, and it makes sense to explore the full space.

There are, however, metrics that help you assess what the best number of clusters is, though the decision is still up to the user. The silhouette metric is very common, the eigen-gap is another, and there are of course others. The silhouette statistic was introduced in 1987, and it looks at two measures, which makes a lot of sense: a_i is the average distance of an individual point to the other points within its own cluster, and b_i is its average distance to the points in the nearest other cluster. You want a_i to be small, so that things within a cluster are tight, and b_i to be big. The silhouette value, (b_i − a_i) / max(a_i, b_i), goes between minus one and one, and you obviously want it to be positive, because that indicates b_i is greater than a_i. Here's an example of a particular silhouette plot: there are three clusters, and in this one cluster it looks like there are a few points that are actually closer to other clusters than to their own. This, again, is done for clustering techniques such as k-means or hierarchical clustering, where you don't have the ability to get a probabilistic assignment of clusters; for those kinds of clusterings this is useful.
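A minimal sketch of how one might use the silhouette to compare candidate numbers of clusters with scikit-learn; the blob data is made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# average silhouette for each candidate k: higher (closer to 1) is better
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))

# per-point values in [-1, 1]; these are what a silhouette plot is drawn from,
# and negative values flag points closer to another cluster than to their own
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
s_per_point = silhouette_samples(X, labels)
```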
I saw a very nice exploration where people looked at various simulated distributions and at how the different metrics compare in identifying the right number of clusters. There are four scenarios, and in the results I will show you all four. Scenario A is three clusters in two dimensions, which you can see plotted; scenario B is three clusters in ten dimensions, which I can't really plot. Scenario C is four clusters in ten dimensions, and scenario D is six clusters in two dimensions. Notice that in scenario D, while you can easily distinguish the three groups, distinguishing the clusters within each of those two very tight clumps is the hard part. So: you have these four scenarios, and this is the silhouette measure, which really is the most commonly used. You can see that in scenario A the silhouette has done pretty well, pretty much on top of all the other metrics, whereas in scenario D it has done the worst: it was not able to distinguish the clusters within the tighter groups. This is an important thing to keep in mind — there is no single best measure. In the SNF package, SNFtool, we provide two measures: the silhouette and the eigen-gap, which is a statistic based on the gaps between consecutive eigenvalues.

An important thing to consider when you're clustering your data — and I think it is a very useful thing — is the stability of your clusters. The problem with clustering is that if you say there are three clusters in the data, every method will find you three clusters; whether there really were three clusters in the data is not something the method concerns itself with. The way to get closer to assessing what is actually happening in the data is to resample: take 80% of your data — I say 80, but it can be anything, 70 or 90, it doesn't really matter — cluster it, and record, for every pair of points that was sampled, whether they ended up in the same cluster or not. You build a pairwise matrix whose entries tell you, for individuals i and j, how often they were in the same cluster when they were sampled together. If your data is really one uniform cloud, then in one resample of 80% of the data some individuals will cluster together by chance, but in the next resample they will not; so if you resample often enough you will get a uniform matrix, and a uniform matrix tells you there is no real structure — there is one big non-cluster in your data. This approach also lets you deal with outliers. You will see that some of the patients you are trying to cluster go back and forth between two clusters: 50% of the time they are in cluster one and 50% of the time in cluster two. That means you simply don't have enough data to make that assignment, and maybe you treat those as outliers and set them aside, saying: I don't have the information to confirm whether this individual is subtype one or subtype two. Other individuals will always cluster together, and that is the core of your cluster — something you can carry forward in the analysis and keep.

How are we doing on time? Very well, excellent. Questions on the clustering? No questions is not always a good thing. There was a question about the resampling fraction: there is really not much of a difference. The whole point is to give you enough stochasticity in your sampling to see all the different possible clusterings, but also to give the individuals, the patients, enough of a chance to be sampled together. That's all. It can be 80%, it can be 60%, or 50%. Do I have a favorite number? No.
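A sketch of this resampling idea in Python — cluster repeated 80% subsamples and track how often each pair of samples co-clusters; k-means and the parameter values here are just placeholders for whatever clustering you actually use.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_clustering_matrix(X, n_clusters, n_resamples=200, frac=0.8, seed=0):
    """For each resample, cluster a random fraction of the samples and record,
    for every pair drawn together, whether they landed in the same cluster.
    Returns the pairwise co-clustering frequency (NaN for never co-sampled pairs)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))
    co_sampled = np.zeros((n, n))
    for _ in range(n_resamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        co_sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    with np.errstate(invalid="ignore"):
        return np.where(co_sampled > 0, together / co_sampled, np.nan)

# frequencies near 1 mark the stable "core" of a cluster, values around 0.5 mark
# individuals that drift between clusters, and a flat matrix suggests no real clusters
```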
Any other questions? Yes — the question is how to choose the distance metric. I think it really depends. A lot of the time you can use prior information about which distance makes the most sense. If your data is continuous but you are pretty sure it is not normally distributed, then something like the Euclidean distance is not the right one to use, and if people are using the Euclidean distance just because it is the most common choice, you might suggest the Mahalanobis distance instead. But I think the biggest problem is that a lot of the time this clustering is done for discovery purposes, so we don't know what is in the data, and it is really hard to pick. What you should know is that if you use very different distance metrics and get completely different clusterings, something is up with the data — maybe it is very high dimensional and too sparsely sampled. Usually the cores of the clusters should be roughly the same regardless of the metric. If the concern is that there may be a lot of outliers, I would definitely recommend the core-clustering approach, and then you can apply different thresholds for how many outliers to set aside: you keep just the individuals that cluster together, say, 50% of the time, or 90% of the time; the higher the threshold, the tighter the clusters and the more outliers.

Yes? Moving on to classifiers. Very good. The goal of classification is really to build a map from the set of predictors — the set of measurements you have — to the outcome you care about. It is just a map; it can be linear or nonlinear. The point is that you want a predictor to which you can feed the measurements of a new, previously unseen item or individual or patient and have it predict which class they belong to. That is the basic premise behind a classifier. The particular purpose may be predicting tumor grade, predicting disease subtype, or even imputing missing data: if you figure out the model behind your measurements, you can use that model to predict the measurement that is missing. I want to highlight "find the most important metabolites for predicting disease status", because that is a feature selection question, and feature selection is actually a different question from classification. You can do unsupervised feature selection, you can do supervised feature selection, and in some cases — it will come up here — a classifier does feature selection automatically, like lasso. But in some sense you want to decouple these questions when you are addressing them.

There are many common classifiers, and you have probably heard of most of them: support vector machines, logistic regression, lasso, random forest, k-nearest neighbors, and naive Bayes. I will only talk about logistic regression and its derivatives, random forest, and k-nearest neighbors; I will not touch on SVMs or naive Bayes today.

All right. Logistic regression is basically linear: it constructs a linear relation from the predictor space to the outcome, and what it tries to estimate is the probability of the class being one — that is the key. So here y is the probability of the class being one, and the model maps the continuous predictor space onto that probability. If you look at this — you might have seen it — this is a linear regression, which would predict a continuous value directly.
Logistic regression transforms that linear predictor, via the logistic function, so that the output goes between zero and one. So here you have multiple points. Take the example where you have weight and you are trying to predict obesity — zero for no, one for yes. Some of these points were labeled not obese and the others obese, and weight is the predictor. Then you fit this logistic curve to the data so that you can classify new points. Unlike linear regression, the fitting of the curve is done by maximum likelihood, because there is no closed-form formula you can apply to derive the beta coefficients. And depending on where you draw the line — maybe your classes are skewed, but if not — you would expect that if the estimated probability of class one is below 0.5 you call it class zero, and if it is above 0.5 you call it class one. That threshold can change depending on your problem.

Lasso is a regularization of regression — a sparse regularization. Here you have logistic regression with an L1 penalty: it adds the sum of the absolute values of the coefficients to the likelihood function, and it performs variable selection at the same time as estimation. That is the very important part. There is usually a lambda parameter that you can either set or estimate, and depending on how much you regularize, it may leave you with very few predictors of your outcome. Of course those few will not be as predictive as more would be, but you get feature selection in addition to the prediction, at the same time. To give you the intuition: imagine your beta coefficients — the coefficients for the importance of the variables — ordered, say, from smallest to largest on a line; this is where your coefficients would lie in an ordinary regression. If you apply the lasso (L1) regularization, some of them are set exactly to zero, and the rest stay essentially where they would have been.

Compared to that, there is ridge regression, which is an L2 penalty and does not set anything to zero: unlike lasso, ridge shrinks the coefficients. Where that helps is if you have, for example, two predictors that are both equally predictive of the outcome. Lasso would drop one of them — set one of them to zero, arbitrarily — whereas ridge regression will give each of them half of the initial weight. That is important when you don't want to drop any of the predictors but want to understand how they relate to each other.

Elastic net was, in some sense, developed as the best of both worlds. It uses a combination of both penalties, L1 and L2, and optimizes them simultaneously; there is an alpha parameter that controls the mix. I wrote here that if alpha is set to zero you get lasso and if alpha is set to one you get ridge, but it depends on the implementation — sometimes it is reversed — so be careful and check the implementation you are using. The point is that at one extreme you get the sparsification completely and at the other extreme you get the shrinkage completely. The reason elastic net is used is that it gives you the opportunity not to drop one of two correlated predictors arbitrarily but to shrink them, while still setting some of the really small coefficients to zero. So it keeps that lasso flavor, and it is very useful.
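Here is a small scikit-learn sketch of the three penalties on a made-up wide data set (many more features than samples). Note that scikit-learn parameterizes the regularization strength as C, roughly the inverse of the lambda discussed above, and its l1_ratio plays the role of the mixing parameter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# n = 120 samples, p = 500 features: the "lots of measurements, few patients" regime
X, y = make_classification(n_samples=120, n_features=500, n_informative=10, random_state=0)

models = {
    # L1 (lasso-like): many coefficients driven exactly to zero -> built-in feature selection
    "l1": LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    # L2 (ridge-like): coefficients shrunk toward zero but kept
    "l2": LogisticRegression(penalty="l2", C=0.1),
    # elastic net: a mix of both penalties
    "elasticnet": LogisticRegression(penalty="elasticnet", solver="saga",
                                     l1_ratio=0.5, C=0.1, max_iter=5000),
}

for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf).fit(X, y)
    coefs = pipe[-1].coef_.ravel()
    print(f"{name}: {(coefs != 0).sum()} nonzero coefficients out of {coefs.size}")
```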
All right, I will go through all of them and then take questions at the end; they are all different, so it is not the same story for each. Before we talk about random forest, it is important to mention decision trees, because a random forest is a collection of decision trees. Here's an example of a decision tree — a very old technique; I call it machine learning because I come from machine learning, but it is really a statistical technique. The question here is: should the baseball game be played? The leaf nodes give you yes or no, and the decisions are based on the variables that were collected: outlook, which has three categories (sunny, overcast, rainy), humidity (high or normal), and windy (false or true). There is an optimization, for decision trees usually a greedy optimization, that decides which variable to split on: outlook is the most predictive, so we split on outlook first. If it is sunny, we then test humidity — humidity is the next best split, so high humidity gives no and normal humidity gives yes — and if it is rainy, we test windy. The approach is actually very useful, and it can be used with small data sets, which is also very helpful in a lot of clinical practice. The problem with decision trees is that they tend to overfit; they tend not to be robust enough as a classifier.

So what people came up with is the random forest, which is basically a collection of these classifiers. The way you get different trees is that for every tree you get a bag of features — a subset of features — and a subset of samples. So you can grow a thousand trees and they will all be different, because each one was grown on a different sample of features and samples. Then the trees are combined, sometimes by majority voting, sometimes a bit more sophisticatedly, so you get an ensemble method that puts all these decision trees together. It is a very nice and very robust algorithm. It also gives you feature selection, or at least a ranking: at the end of the day you can ask, how often was this feature, when it was sampled and given the chance, actually selected as an important feature for splitting? So you get a feature importance ranking at the end. In our own work, where we tested various methods for feature selection, this was one of the most robust, so I would definitely recommend random forest as one of the classifiers to try. Another thing it does better than logistic regression is that it is nonlinear. We can never guarantee that the relation between our predictors and the outcome is linear, and random forest gives you the option of modeling nonlinear relationships, compared to the other methods out there.
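A short scikit-learn sketch of a random forest with its built-in feature importance ranking; the data set, the number of trees, and the cross-validation setup are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=50, n_informative=8, random_state=0)

# each tree sees a bootstrap sample of patients and a random subset of features at each split
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print("cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean().round(3))

# how often (and how usefully) each feature was chosen for splitting, across all trees
rf.fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print("top 10 features by importance:", ranking[:10])
```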
K-nearest neighbors is also one of the simplest classifiers. It works well sometimes, but it is not used as often, again because of overfitting, and it is not very stable: if you get new samples, your decision can change. K-nearest neighbors, as you can imagine, is quite intuitive: for a new example, you look at the labels of the examples nearest to it. But depending on how many neighbors you take, you might get a different classification. Say with k equal to one the nearest neighbor is labeled plus one, so that is your label; with k equal to three, the majority of the three nearest neighbors may belong to the other class, and you would assign the new example to that class instead. And as you add more and more neighbors whose labels you are less sure are relevant, you may keep getting different classifications, which is not optimal. (There is a small k-nearest-neighbors sketch after the comparison below.)

Here is a table comparing them. Logistic regression, lasso, and ridge regression are in some sense derivatives of the same logistic regression, and they all encode linear relationships. The "n much less than p" column is the case where the number of measurements is much higher than the number of samples; plain logistic regression you simply cannot run in that setting — lots of features, few samples — whereas the other methods you can. Random forest needs a bit more data, because it actually exploits the variance in your data, but it still works on fairly small samples as well. For generalization, again, the regularized methods work a lot better, and, as I mentioned before, random forest also works a lot better. And if you can make parametric assumptions — Gaussian assumptions, say — then methods that use them work better, provided the assumptions hold. A lot of the time our predictors live in such a vast, high-dimensional space that they behave almost as if they were independent and the relationship as if it were linear. So in my lab we always try to start with the linear classifiers to see how well we can predict, and then we also try random forest and SVMs to see whether we can improve by introducing nonlinearity into the mapping.
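A minimal k-nearest-neighbors sketch showing how the choice of k changes the behavior; the data and the values of k are made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=0)

# k = 1 memorizes the training data (overfits); larger k smooths the decision
for k in (1, 3, 15):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k = {k:2d}: cross-validated accuracy = {acc:.3f}")
```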
And finally, how do you evaluate the classifiers? The most standard tool is the ROC curve, the receiver operating characteristic curve, which comes from engineering. Basically, if you were randomly guessing — with balanced classes — you would get this diagonal line, and the further the curve of your predictor is from that line, the better the model. What does the curve represent? It plots the false positive rate against the true positive rate, and you obviously want the most true positives with the fewest false positives; that is why this blue model is really, really good. Now, in my experience, if your model looks like this right away, on the first try, it usually means something is wrong with the data, because a curve like that looks too good to be true for most of the real, practical data sets we see — but it does happen. One way to think about how the curve is constructed is by moving the threshold: remember how in logistic regression there was a threshold at 0.5? If you move it, you will get more true positives but also more false positives, or vice versa, and tracing that out as the threshold changes is what gives you the curve and tells you how good your model is.

The precision-recall curve is used more often in, for example, clinical practice: when we report results, we are usually asked for the positive predictive value, which is precision. Precision is the number of true positives compared to all the examples that were classified as positive, whereas recall, which is the sensitivity, is how many true positives we captured out of all the positives, captured or not. An area under this curve of 0.82 would be an incredibly good model; with precision-recall we usually get around 30%, and that seems to be where the majority of these kinds of biological models land.

And this is it — I went through all of it, which is amazing. So, questions?
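To close the loop on evaluation, here is a minimal scikit-learn sketch that computes both curves from a classifier's predicted probabilities; the simulated, class-imbalanced data set and the logistic regression model are just stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)
from sklearn.model_selection import train_test_split

# imbalanced two-class problem, as is common with disease outcomes
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC: sweep the decision threshold and trace true vs. false positive rate
fpr, tpr, roc_thresholds = roc_curve(y_te, probs)
print("area under ROC curve:", round(roc_auc_score(y_te, probs), 3))

# precision-recall: positive predictive value vs. sensitivity across the same sweep
precision, recall, pr_thresholds = precision_recall_curve(y_te, probs)
print("average precision (area under PR curve):",
      round(average_precision_score(y_te, probs), 3))
```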