Welcome to this MOOC on Introduction to Proteogenomics. In today's lecture, Dr. David Fenyo will talk about training and testing models. It is a continuation of the previous lecture: he will briefly discuss testing error and training set size, and also low variance and high variance. He will give a detailed idea of regularization and how regularization helps in training a model. He will then talk about how to divide a data set between training and test sets, and how to deal with the situation where the data set is small. I hope some of these discussions and points will be important for you to consider when you are planning a big clinical study for your own project. So let us welcome Dr. David Fenyo for today's lecture.

So another way to say this is that if you have a data set, you should divide it into a test set and a training set. The problem for us, as is always the case, is that we do not have very much data. If we separate out a large chunk into the test set, we do not have much left for training, and that is not good because, as we said before, the larger your training data set, the better the model. We will come back to this, but what people do is cross validation, where they do the separation one way, then another way, and so on. We will get back to that a little later. So now, if we separate out the training set (this is the same data, but the y-axis is on a log scale), we saw that the training error can go down; by making the model more complex we can make it go to 0. But if we then compare to our test set, the test error goes up as we make our model more complex. One way to get around that is to have a lot of data. So in this case we are looking at the test error as a function of the training data set size (sorry, the y-axis is not labeled).
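The behavior just described (training error falling as model complexity grows, while the test error does not) can be sketched with synthetic data; this is an illustration with made-up numbers, not the lecture's own data, and it assumes NumPy is available.

```python
# Sketch: fit polynomials of increasing degree to noisy linear data and
# record the mean squared error on a held-out test set. Synthetic data only.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 10)
y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, 0.2, 10)   # noisy line, 10 points
x_test = rng.uniform(0.0, 1.0, 200)
y_test = 2.0 * x_test + 1.0 + rng.normal(0.0, 0.2, 200)    # larger test set

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (given by coeffs) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_err, test_err = {}, {}
for degree in range(1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = mse(coeffs, x_train, y_train)
    test_err[degree] = mse(coeffs, x_test, y_test)
```

The training error can only go down as the degree increases, while the test error typically starts rising once the polynomial begins fitting the noise.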
The numbers shown are the numbers of data points we have. You see that when we have few points, like 10, the test error goes up as soon as we increase the model complexity a little. But if we have a large number of samples (the other two curves, which are not labeled and have low errors, are 1000 and 3000 points), the test error stays low. So the best way of doing what we will talk about later, regularization, to get around this problem is simply to have a very large data set. I am going to just skip this. Okay, so far we have had linear data; now we are going to take a slightly more complex data set and do the same analysis. Here it is good to increase the complexity a little, because, as you can see, we cannot fit that function well with a line; there is no way we can get a good approximation. That is why a first degree polynomial, which is a line, and also degree 2, give us very high errors. The error then goes down, but as in the earlier example, when we continue increasing the complexity the training error goes down and down, and eventually it is very low, but at some point the testing error goes up. So we should choose the complexity somewhere close to the minimum. And it is usually better to be more conservative: probably better not to choose 5 or 6, but rather 3 or 4 in this case. That is a rule of thumb: since we have this flat region, it is better to be more conservative. Can I ask a few questions? It is a little complicated: when your training error is going down with increasing polynomial degree, why should the testing error go up? We are training it well, but when we test it the error goes up. Oh yes, the reason is that you overtrain it.
So you overfit your model: you train it to the particular noise that is in your training set. It learns something that is not relevant to the process; just because you have a finite training set, it will by chance contain some noise, you learn that noise, and that does not generalize. We will talk a little more about that. Overfitting is always a worry even if you take these things into account, and usually we deal with it through cross validation, but we will get back to that later, and I am going to skip this. So now, regularization, which Mani also mentioned. Here is a slightly more formal way to describe what we do when we train a model; I do not know how familiar you are with mathematical notation. We have w, our parameters (w here is bold, meaning it is a vector, so we have many parameter values w1, w2, and so on), and we have a function L, which we call the loss function. In the case of linear regression we mentioned that the sum of the squared deviations is a good loss function, but it is not the only one; you do not need to take the squares, you could add up the absolute values instead. So we choose some kind of loss function (actually the L should not be bold, sorry about that, because it is a function that takes a vector as input but whose output is a scalar), and then we try to find the w that minimizes the loss function. That is why we call it a least squares fit. But again, we saw that if we do this with large vectors of many w's, we run the risk of overfitting. So what we can do is add an extra term here, some function of the magnitude of w.
This means we are going to force w to be pretty small, and the parameter lambda governs how strongly. Two penalties are used pretty often: you either add in the square of the length of w, which is called ridge regression, or you add in just the absolute length of w, which is called lasso. Those are two ways to regularize and minimize the risk of overfitting. But remember, even if we do all these things to try not to overfit, there is still a risk that we overfit, and we should always be worried about it. So let us have a look at what happens here. These are just 10 points. Again we do a linear regression, but with a polynomial of degree 9, and it gives a widely oscillating curve that is cut off over here. We see the coefficients: for the linear case it would be these first two that give us a line, but the other coefficients are very high. Now, as I mentioned, the best way to regularize is to have lots of data, and if instead of 10 measurements we have 100, we get a reasonably good fit, with a little wiggle, mainly dominated by the first two coefficients, which would give a line; with even more data it looks even better. But outside the region where we have data, of course, anything can happen. Now we look at the same thing with regularization added: instead of just minimizing the loss function, we minimize the loss function plus lambda times, in this case I believe, the sum of the squares of the parameters. The insets show what we looked at previously, without regularization, and with regularization we get a much better fit, dominated by the first two parameters in all the cases.
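Ridge-style regularization as just described (squared loss plus lambda times the squared parameters) can be sketched with gradient descent on a one-variable linear model. `fit_ridge` and all the numbers here are illustrative, not the lecture's own code.

```python
# Sketch: minimize  L(w, b) = sum_i (y_i - (w*x_i + b))^2 + lam * w^2
# by gradient descent. lam = 0 recovers plain least squares.
def fit_ridge(xs, ys, lam=0.1, lr=0.01, steps=5000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # gradients of the squared loss plus the ridge penalty term
        gw = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) + 2 * lam * w
        gb = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys))
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]                  # exactly y = 2x + 1
w, b = fit_ridge(xs, ys, lam=0.0)          # unregularized fit
w_r, b_r = fit_ridge(xs, ys, lam=1.0)      # stronger penalty shrinks the slope
```

With `lam=0` the fit converges near slope 2 and intercept 1; with a larger lambda the slope is pulled toward zero, which is exactly the shrinking of coefficients seen on the slides.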
Sorry, the large plot is the regularized one, and the inset is the same as the unregularized fit on the previous slide. Another approach is nearest neighbor regression. In this case, the red points are the ones we are interested in, and if we take, for example, the three nearest neighbors, we take the average of their values to approximate where the red ones would be. With linear regression we have a very fixed model; here we are just looking for data points that are similar, so it can become very flexible. But often, in high dimensions, there are no points that are similar. You should try to think about how a very high dimensional space looks, and it is not easy: instead of the two dimensions we have here, try to imagine a hundred dimensions. In two dimensions you often have points that are reasonably close, but in a hundred dimensions your points are spread out in a much bigger space, and nothing is near anything else unless you have an enormous amount of data, which we usually do not have. So even if we know that, say, a linear model is not right, it is often still better to assume a linear model, because in most of our cases we do not have that much data. Okay. So we looked a little at model complexity; now I will switch to how we train the model. We already mentioned that we define a loss function, which gives us an energy landscape in which we try to find a minimum. Usually we have our function defined, and most often we start at a random place.
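The nearest neighbor regression described a few sentences back (average the values of the k closest training points) can be sketched in one dimension; `knn_regress` is a hypothetical helper name and the data are synthetic.

```python
# Sketch: k-nearest-neighbor regression in 1D. To predict at a query point,
# average the y-values of the k training points closest in x.
def knn_regress(x_query, xs, ys, k=3):
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / k

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0]             # y = x exactly
pred = knn_regress(2.1, xs, ys, k=3)       # averages y at x = 1, 2, 3 -> 2.0
```

Note there is no fitted model at all; the flexibility comes entirely from the data, which is why the method breaks down when nothing is near anything else in high dimensions.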
Let us say we start here: first we just randomly assign our parameters, and then what we want to do is go from our randomly assigned point to the minimum. But we may have a 10,000 dimensional space to walk around in, and what we know is only the local environment. What we can calculate is the slope where we are: in which direction should we go to at least get further down. So we calculate the derivative locally and take a small step in the downhill direction. Then maybe we end up there, we repeat this, take another step, go even further down, and continue. But what can happen when we get closer to the minimum is that we take too big a step and overshoot. We are going to see that it is often good to start by taking big steps and then, towards the end, take smaller and smaller steps. Another problem we can run into is starting in a region that is very flat, where there is almost no gradient, or none at all, and we end up stuck there, since we always take a step in the direction of the gradient and the size of the step is proportional to the gradient. That is a problem. Another thing that can happen is that we get stuck in a local minimum. Those are, I think, the main problems we run into. So let us look at how this error landscape looks for linear regression. As you probably remember from undergrad, for linear regression we do not actually need gradient descent, because we can solve it analytically, but since it is such a simple case I still wanted to walk you through how it would look if we did need gradient descent. So we have a few points here, and we have the slope and the intercept.
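The loop just described (random start, steps along the negative gradient, shrinking step size near the end) can be sketched on a toy one-parameter problem; the function, schedule, and constants are illustrative choices, not the lecture's.

```python
# Sketch: gradient descent with a decaying step size on f(w) = (w - 3)^2,
# whose derivative is f'(w) = 2 * (w - 3). The minimum is at w = 3.
def gradient_descent(w0, lr0=0.4, steps=100):
    w = w0
    for t in range(steps):
        lr = lr0 / (1 + 0.05 * t)     # big steps first, smaller ones later
        grad = 2 * (w - 3)            # local derivative is all we can see
        w -= lr * grad                # step in the downhill direction
    return w

w = gradient_descent(w0=-10.0)        # converges to ~3 from a random-ish start
```

If `lr0` were too large, the iterates would overshoot and bounce across the minimum, which is exactly the failure mode mentioned above; the decay schedule is one common way to avoid it.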
If we just look at the slope: this is the optimal position, where we have the minimum, and that is the line shown; if we change the slope, the sum of the squared errors increases. We can look at this in different ways. This is the three dimensional view, looking at both slope and intercept, and rotated we can look at it from above in two dimensions, with the intercept here and the slope here; that is the optimal place, the best solution for the linear regression. We can also look at it like a map, showing the minimum here and the gradient. If we have two different lines with the same number of points and the same amount of variation, we get slight variations in how this energy landscape looks. But if we have really a lot of points, the energy landscape is well defined, with nice concentric circles, which of course is helpful. We mentioned that we do not actually need to use the sum of the squared errors; we can also use the sum of the absolute errors. The energy surface becomes a little more jagged and not as round, but it is also a possibility, and especially when we have outliers it can be a better choice. Another thing: if we have very little variation, so the points define the line well, we get a very sharp minimum; but when we have more error, we get a much less well defined minimum. So let us go through a case where we walk down this surface. We randomly start here (again, the axes are intercept and slope). We have our data here, and this is our randomly assigned line; you see that it is not great, partly because we are pretty far away from the optimal solution. Now we take a small step downhill along the gradient. The gradient is perpendicular to these height lines, so the first step goes perpendicular here, and depending on what we choose the step size to be, we take a small step.
Then we take another step; because the curvature changes, we are going to curve in, and we continue going down following the gradient. Eventually, as you see on the side there, when we get close to the middle, the line fits very well. We can randomly start from different places and we end up in the same location, because for linear regression this is a very nice surface. Now, switching to classification: in one case we can set a threshold here that gives us a lot of true positives and very few false positives. In another case, if the two distributions are much closer to each other, we cannot make that distinction, but we can still use this to select where to set our thresholds. And then we can see what happens if there is an uneven distribution. The other thing we can do, given the false positive rate and the true positive rate, is create what is called the receiver operating characteristic, an ROC curve. How many of you have made ROC curves? It is a very common way to evaluate classification and to make comparisons. In this case we have good separation, so the ROC curve starts here and goes almost up to this corner: the true positives are separated from the false positives. We can use, for example, the area under this curve as a characteristic of how well we are doing. If the distributions were completely overlapping and random, we would just get a line along the diagonal. And these plots are for the other case, where the distributions are much closer, and you see that the ROC curve is much closer to the diagonal. Okay. So one of the conceptually easiest methods is the nearest neighbor method. Here, we just look at what the nearest neighbor is and what class it is in. And of course, if you evaluate this on the training set, you get that the error is 0, by construction.
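The ROC construction discussed above can be sketched directly: sweep a threshold down through the scores and record the false positive and true positive rates at each step. `roc_points` and `auc` are hypothetical helper names, and the scores are synthetic.

```python
# Sketch: build an ROC curve from classifier scores and compute its area.
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs; labels are 1 (positive) or 0 (negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    # sort by score, highest first, and lower the threshold one point at a time
    order = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in order:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# perfectly separated scores trace the top-left corner and give an AUC of 1
pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

A random, completely overlapping classifier would instead trace the diagonal, with an AUC of about 0.5.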
Here is one example that shows why one should definitely not use the training set for evaluation. We have two groups, and if we use one nearest neighbor we get a good separation between them. But in another case, where the groups are more intermingled, taking the nearest neighbor gives us a very complex decision surface, and no one would claim that this is really what is going on; this is a very clear case of overfitting. We can of course average over a few nearest neighbors. With two nearest neighbors it gets a little more plausible, but there is still, for example, an island here in the middle of the blue. As we go to more and more nearest neighbors, the decision surface becomes more plausible. But again, with nearest neighbors the problem is often that when we have many dimensions, nothing is really near anything else. Okay. Now, a method that we can often start with: logistic regression. If you remember, it looks very similar to linear regression. The inputs are the different protein measurements, and we have our parameters, the weights; we multiply each value by its weight, add them up, and add a constant. So far that is linear regression, but in logistic regression that sum becomes the argument of the logistic function, which we call sigma, and it looks like this. So we introduce a non-linearity, as we see here. For different parameters we get this transition from 0 to 1, but with different sharpness. What we want is classification, so we have two cases: in the extremes, at low values the output is 0 and at high values it is 1, and in between we have this transition region.
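The logistic function just described, and the simple form of its derivative that comes up shortly, can be written down directly; this is a standard identity, sketched here with synthetic evaluation points.

```python
# The logistic (sigmoid) function and its derivative. The derivative has the
# simple closed form  sigma'(z) = sigma(z) * (1 - sigma(z)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

mid = sigmoid(0)                 # 0.5, the middle of the transition
slope_mid = sigmoid_derivative(0)   # 0.25, the maximum slope
slope_far = sigmoid_derivative(20)  # essentially zero: flat far from the transition
```

The last value illustrates the flat regions mentioned later: far from the transition the gradient is nearly zero, which is exactly where gradient descent gets stuck.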
That is why we can use the logistic function for classification, and this slide just compares the two. Linear regression with one x value: we have the slope and intercept, and that is linear regression; logistic regression has the same expression, but with the non-linearity applied. We also looked at the shape of this function going from 0 to 1, and since we are going to do gradient descent we need its derivative; there is actually a very simple expression for it, and it looks like this. Again, it is very flat out here, far away from the transition, where the derivative is 0, and if you remember, for gradient descent that is not good: if the gradient is close to 0 we get stuck there. So we want to make sure that we do not start too far away, otherwise we will not find the minimum. This is just an example of plain logistic regression, and if you remember, with nearest neighbors on the same data set we got pretty close to a straight line; that is also what we get with logistic regression. Now let us look at the energy surface. Remember that for linear regression, when we used the sum of squared errors, it behaved really nicely; here we see something completely different. It really does not behave well when we use the sum of squared errors, and it is probably easier to look at it here. This is our minimum, in there. We have a huge mountain behind it, with very steep gradients and very shallow gradients, so we have to somehow find our way in through very shallow gradients. What this means is that this is a bad choice of loss function. These are just some other ways to look at it: again, if we approach the minimum from here it is very shallow, but then it is sharp, and looking the other way we have this plateau where we can also get stuck.
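The appropriate loss for logistic regression mentioned next is the cross-entropy loss, for which the gradient takes a very simple form: the prediction error times the input. A minimal sketch with synthetic one-dimensional data; `fit_logistic` and all hyperparameters are illustrative.

```python
# Sketch: logistic regression trained by gradient descent on the
# cross-entropy loss. For p_i = sigma(w*x_i + b), the gradient of the loss
# with respect to w is  sum_i (p_i - y_i) * x_i  (and sum_i (p_i - y_i) for b).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys))
        gb = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys))
        w -= lr * gw / len(xs)
        b -= lr * gb / len(xs)
    return w, b

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]                  # separable one-dimensional classes
w, b = fit_logistic(xs, ys)
```

Unlike linear regression there is no analytical solution here, so this iterative descent is genuinely necessary; after training, points on the positive side get probabilities near 1 and points on the negative side near 0.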
Now, you do not need to remember this, but there is an appropriate loss function for logistic regression, this one, though I am not going to go into detail. When we apply that loss function, the surface becomes much more manageable and we can do gradient descent. It is still shallow in one direction and sharp in the other, so it is not as nice a surface as for linear regression, but it is reasonably good. The other thing is that for logistic regression we do not have an analytical solution, so here we have to do gradient descent. If we have the same number of points and the same distribution (one class up here at one, the other class at zero), we see that the surface varies a little, and with fewer points we get much larger variations. We can also do gradient descent on this: we start out here, then we walk down, but again we have to be careful not to take too large a step when we come close to this very steep hill, because then we get thrown far away from the minimum. So for both logistic and linear regression we have these hyperparameters to decide on: the learning rate, how we schedule the learning rate (usually how we decrease it), and whether we want to remember some momentum and have some friction built in. Now, looking at regularization, again we want to guard against overfitting. The same as for linear regression, we can also add in polynomial terms; we saw the expression was the same, so for logistic regression we can do exactly the same thing. Here there is no linear surface that can separate the yellow and the black points.
If we add in higher degree polynomials we can do a better separation, but in this case our surface is a little too jagged and it is probably overfitting. We can fix that with the same type of regularization, either lasso or ridge regression. And how does that look? For logistic regression with no regularization we had this case, and when we add in regularization it actually also helps the speed of learning: you see that with regularization the gradient here is much higher, more comparable to this, so we will be able to find the minimum faster. So then, a few examples. You have probably heard about neural networks and deep learning. Each of these nodes here is very similar to one logistic regression unit: we have the inputs, the different protein measurements; we multiply each protein measurement by its weight, sum them up, and add an offset; and we have some kind of non-linear function, which can be the logistic function but can also be other things. We always have one input layer, at least one hidden layer, and an output layer. This illustrates what is called a fully connected network, where each node in each layer is connected to all the nodes in the next layer. This only shows one hidden layer, but nowadays it is very popular to have many hidden layers, and that is why it is called deep learning: because you have many layers. Right now this is probably the most popular method that people use, but unless you make a very small neural network, at least for the large ones that people build, you need a lot of data, and in most cases in proteogenomics we do not have enough data to build neural networks.
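A forward pass through such a fully connected network, with one hidden layer of logistic units, might look like the sketch below; all the weights and inputs are made up purely for illustration.

```python
# Sketch: one forward pass through a tiny fully connected network. Each node
# takes a weighted sum of its inputs plus an offset and applies a
# non-linearity (here the logistic function).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """weights[j][i] multiplies input i for node j of the layer."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = layer(x, w_hidden, b_hidden)    # input layer -> hidden layer
    return layer(hidden, w_out, b_out)[0]    # hidden layer -> single output

x = [0.5, -1.0, 2.0]                         # e.g. three protein measurements
w_hidden = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]   # hypothetical weights
b_hidden = [0.0, 0.1]
w_out = [[1.0, -1.0]]
b_out = [0.0]
y = forward(x, w_hidden, b_hidden, w_out, b_out)
```

Every hidden node here is literally one logistic regression unit; "deep" learning just stacks more such layers, which is why the parameter count, and hence the data requirement, grows so quickly.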
So my recommendation is, even with all this hype around deep learning, for proteogenomics it is best not to get into it unless you have a very good reason. Probably 10 years ago support vector machines were what everyone did; I would say support vector machines 10 years ago were what neural networks are now. There is always a fashion in methods, but support vector machines are very useful. What they do is find a plane that separates the data, but they also try to find the largest margin, and the support vectors are the data points that lie on these margins. I think Mani showed this slide on tree based methods, and those are also very powerful, especially in this case showing that you can have a highly non-linear function that classifies all these points even though they are quite intermingled. Each node in the tree is a decision, whether some measurement is larger or smaller than some value, and you go in different directions accordingly. I would say that right now people have the most success with either support vector machines or tree based methods, but I would recommend starting with a simple method like logistic regression first, and including those as well. There is actually a theorem called the no free lunch theorem; in the 90s it was shown that when you start a new project with a new data set you have no experience with, there is no way to tell which method will work best. Sometimes a tree based method like random forest works best, sometimes logistic regression, sometimes a support vector machine. And of course all the methods have lots of parameters that need to be adjusted, so what people often do is try all possible methods.
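Trying many methods and hyperparameters has to happen inside cross validation, as discussed next, so that the test set is only touched once. A minimal k-fold sketch; `k_fold_cv`, `train_fn`, and `error_fn` are hypothetical placeholder names for any model and error metric.

```python
# Sketch: k-fold cross validation. Split the data into k folds, hold out one
# fold at a time for validation, train on the rest, and average the errors.
def k_fold_cv(data, k, train_fn, error_fn):
    folds = [data[i::k] for i in range(k)]          # round-robin fold assignment
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train_fn(train)
        errors.append(error_fn(model, held_out))
    return sum(errors) / k                          # mean validation error

# toy example: the "model" is just the training mean, the error is the
# mean squared deviation of the held-out fold from that mean
mean_model = lambda train: sum(train) / len(train)
sq_err = lambda m, fold: sum((x - m) ** 2 for x in fold) / len(fold)
cv_error = k_fold_cv([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3, mean_model, sq_err)
```

The point is that method and hyperparameter selection use only these validation errors; the real test set stays untouched until the very end.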
Of course, what you could do is train one method on your training data set and test it on the test data set, train another method and test it too, and repeat this many, many times, both for the different methods and for the different hyperparameters. But you should only use your test set once, so you need to do this exploration within cross validation, and I think we are almost getting to cross validation; we have said we would get there soon. The other thing is marker selection, which we mentioned earlier. We do all these measurements, and we know that most of the proteins or most of the transcripts are not going to be related to our phenotype. It would be much better to build the model using only the ones we know are related, but of course we do not know which ones those are to start with. So why do we do marker selection? Having few features makes the model easier to interpret. We have talked about building these predictive models, and we want to predict something, but if we can also understand the model, that is of course much better, and often when we build very complex models we do not understand them and may never have a chance to. With few features the model is easier to interpret, we can start thinking about biological function, and it is also less likely to overfit because there are fewer parameters, but usually we get somewhat lower prediction accuracy. That is something to balance, and that is what we use to decide how many features. As opposed to that, if we have many features the model is difficult to interpret, we do not know what is going on, and it is of course more likely to overfit because we have an enormous number of parameters. And as we add in more and more features we get higher prediction accuracy, but we are not sure whether that is really real. Dr.
Fenyo provided a very good overview of how separating your data set into training and test sets gives a better evaluation. We also learned that as the degree of the polynomial increases, the training error goes down but the test error eventually goes up. We learned that it is better to have a large data set, as it helps in evaluating the model. Finally, we saw how to minimize the risk of overfitting with regularization, why we should avoid overfitting, and two regularization strategies that can be used, ridge and lasso. In the next lecture, Dr. Fenyo will talk about association and marker selection. Thank you.