Welcome to the MOOC course on Introduction to Proteogenomics. In today's lecture, Dr. David Fenyö will talk about association and marker selection. Marker selection helps to build a model that is easy to interpret by taking only a few features into account. Most of the protein candidates may not be directly related to the phenotype, so marker-assisted selection, or MAS, helps us to filter those candidates and build a good predictive model. Dr. Fenyö will also discuss how many features need to be considered for building a reliable model, why many features make a model complex, and what the optimal number of features is. We will briefly discuss the different kinds of methods that are available to help with feature selection, and we will also highlight what data snooping is. So, let us welcome Dr. David Fenyö for his last lecture.

The other thing is marker selection, which we already mentioned earlier. So, now we do all these measurements, and we know that most of the proteins or most of the transcripts are not going to be related to our phenotype. It would really be much better to build the model using only the ones that we know are related, but of course, we do not know which ones those are to start with, so we need to find them. So, why do we do marker selection? Having few features makes the model easier to interpret. We have talked about building these predictive models because we want to predict something, but if we can also understand the model, that is of course much better. Often when we build very complex models, we do not understand them, and maybe we will never have a chance to understand them. With few features the model is easier to interpret, we can start thinking about biological function, and the model is also less likely to overfit because there are fewer parameters, but usually we get slightly lower prediction accuracy. So, that is something to balance, and that is what we use to decide how many features to keep. As opposed to that, if you have many features, the model is difficult to interpret, we do not know what is going on, and it is more likely to overfit because we have an enormous number of parameters. Of course, as we add in more and more things we get higher prediction accuracy, but we are not sure whether that gain is really real.

So, there are a few different ways to do this. One set of methods is called filtering methods. In these we look at the predictive power of each protein individually and then select the ones with the highest predictive power. We look for proteins that have high correlation with the target variable, let us say tumor subtype; we also do not want a lot of predictors that are correlated with each other, we want them to be somewhat uncorrelated; and we want them to carry a lot of information, which is pretty much the same as high correlation with the target value. So, here we evaluate each feature individually.
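To make the filtering idea concrete, here is a minimal sketch in R (not from the lecture; the data are simulated and the variable names are made up) that scores each protein individually by its correlation with the phenotype and keeps only the top-ranked ones:

```r
# Minimal filter-method sketch on simulated data (illustration only)
set.seed(1)
n_samples  <- 60
n_proteins <- 2000
X <- matrix(rnorm(n_samples * n_proteins), nrow = n_samples)  # samples x proteins
y <- rbinom(n_samples, size = 1, prob = 0.5)                  # e.g. tumor subtype (0/1)

# score each protein on its own: absolute correlation with the target
scores <- apply(X, 2, function(protein) abs(cor(protein, y)))

# keep only the proteins with the highest individual predictive power
top        <- order(scores, decreasing = TRUE)[1:20]
X_filtered <- X[, top]
```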
Now, the other class is the wrapper methods, where we look at the predictive power of the features jointly. The idea is that the features are not independent of each other, so we will probably get a better result if we evaluate them together. There are two ways to do this. One is forward selection: we start with, let us say, the feature that carries the most information, then we add a second one and evaluate the two together; by evaluating we mean that we check whether there is any additional benefit of adding the second one, whether it improves our results. We continue adding features until the results do not improve anymore. And of course, we have to do this within cross validation, because otherwise it will overfit. That was forward selection; backward selection is the same idea except that we remove features instead. People have also combined the two, and one very popular way of doing this, probably the most widely used one, is called recursive feature elimination. Some methods, like the lasso, give us feature selection explicitly: the lasso was when we regularized by adding the absolute value of the parameter vector times a constant, and there some of the parameters become exactly zero, so those variables simply fall out of the model.

So, with marker selection the question is: what is the optimal number of features? We usually want two things: we want the model to be as simple as possible so that we can interpret it, but we still want good prediction. One thing that people talk about is the curse of dimensionality, which we definitely always have. We measure few samples — even within CPTAC we measure around 100 samples, which is still quite few — but for each sample we measure tens of thousands of things: maybe 10,000 proteins, 30,000 transcripts, and another 30,000 phosphorylation sites. So, we have many more measurements per sample than we have samples. This makes it very hard, first of all, not to overfit, but it also means that when we find signatures they are often not unique; there are many signatures that would make equally good predictive models.

So, now finally, cross validation. We have all these hyperparameters that we need to decide on. We divide our data set into training and test, but we need to choose these hyperparameters, and we are not allowed to use the test set to do that. So, what we do is further divide the training set, and we divide it many times. Here we have taken the blue region of the training set as the part we actually train on, and we use the yellow part as the validation set, meaning that we use it to choose, for example, the learning rate, the regularization rate, and so on. Then we do this many times for different subsets; 5-fold or 10-fold cross validation are commonly used. And since we do have limited data sets, we often do not do a single division into training and test but do what is called nested cross validation: we do one cross validation down here, and then we do a similar cross validation on top of it. This is very important to do well, but a lot of the software packages have it built in.
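As a rough illustration of this nested scheme, here is a sketch in R that assumes the glmnet package and simulated data: the inner cross validation (done inside cv.glmnet) chooses the lasso regularization strength, and the outer folds are used only to estimate prediction error.

```r
# Nested cross-validation sketch with the lasso (simulated data, illustration only)
library(glmnet)

set.seed(1)
n <- 100; p <- 1000
X <- matrix(rnorm(n * p), nrow = n)
y <- rbinom(n, size = 1, prob = 0.5)

outer_folds <- sample(rep(1:5, length.out = n))  # outer 5-fold split
outer_error <- numeric(5)

for (k in 1:5) {
  train <- outer_folds != k

  # inner cross validation picks the regularization strength lambda
  # using only the training part of this outer fold
  inner <- cv.glmnet(X[train, ], y[train], family = "binomial", nfolds = 5)

  # with the lasso, coefficients that are exactly zero mean those features
  # have been dropped from the model: coef(inner, s = "lambda.min")

  # the held-out outer fold is touched once, only to estimate performance
  pred <- predict(inner, X[!train, ], s = "lambda.min", type = "class")
  outer_error[k] <- mean(pred != y[!train])
}

mean(outer_error)  # error estimate with hyperparameters chosen entirely inside the CV
```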
Okay, so there are a few more things I wanted to mention; I will only take a few more minutes. One is sampling bias: the machine learning method will only give us what we train it for. A classical example is the Truman election. The way the polling companies did their polls was to call people on their home phones, and — I forget exactly, this was in the late 40s — only rich people had phones, and they mostly voted Republican, so the polls got it completely wrong. Here is a newspaper that actually printed the result in advance because they were so sure that Truman would lose.

And this is something that happens to us a lot in biomarker discovery. You will have the slides, so this is definitely worth reading. David Ransohoff has written several papers on this problem, and this slide is just a list of what can go wrong, but one should spend quite a lot of time thinking about it. For example, if one were to develop a blood test for early detection, one should not collect the normal samples in a different clinic than the samples from the people who have cancer. It is worth reading these papers; there are a lot of things to think about.

Okay, and then — I have said this several times already — the test set data has to be independent; if you test your model on something you trained it on, it is really not going to tell you how good the model is.

We also talked a little bit about how, when we have very complex models, it is difficult to know whether the model tells us something about reality. With images this is easy to illustrate. Here we have some images, and the neural network in this case was able to say that this one is an electric guitar, that one an acoustic guitar, and that one a Labrador. That sounds pretty reasonable, but we can also get cases like this one, where the image was classified as a wolf, but what was actually used for the classification was the snow in the background; probably, in the training set, all the wolves had snow in the background. This can easily happen in proteogenomics too: because we measure so many things, if we build a very complex model it can latch onto something that is irrelevant.

Another thing, still with image analysis and despite all its success: here we have three original images that were classified the way you would see them, but if you add a little bit of perturbation, so small that you can barely see any difference, then all three images are classified as ostriches. So, especially with complex deep learning methods, there is a lot that we do not understand. It is actually even worse: those were quite complex images, but the same thing happens for simple handwritten digits. These are classified correctly — the handwritten image is on top and below it is what the neural network says — but if you add a little bit of noise, which barely disturbs us at all and we can still see clearly what the digit is, then this nine, for example, becomes a three according to the network. The people who developed this classification network know about the problem and have tried to fix it, but they did not succeed. Even worse, these are all classified as zero even though there is nothing there.

So, a few books. These are very easily accessible books, I would say: An Introduction to Statistical Learning and Applied Predictive Modeling both give you a good starting point for doing predictive modeling and feature selection. The other thing I recommend — and then we are going to start your hands-on session — is that you really need to learn how to program; there is no way around it. R is probably a good place to start, and since you now all have RStudio installed, you should go home and continue using it. There is a PDF of this book available online, and I think all of these books are available as PDFs online as well.
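As a small illustration of the earlier point that the test set has to be independent, here is a sketch on simulated data, using a simple nearest-neighbour classifier as an arbitrary choice: the same model looks perfect when tested on its own training samples and no better than chance on held-out samples.

```r
# Why evaluating on training data is meaningless (simulated, illustration only)
library(class)  # simple k-nearest-neighbour classifier

set.seed(2)
n <- 40; p <- 500
X <- matrix(rnorm(n * p), nrow = n)
y <- factor(rbinom(n, size = 1, prob = 0.5))  # phenotype is pure noise

train <- 1:30
test  <- 31:40

pred_train <- knn(X[train, ], X[train, ], y[train], k = 1)
pred_test  <- knn(X[train, ], X[test, ],  y[train], k = 1)

mean(pred_train == y[train])  # 1.0 -- each training sample is its own nearest neighbour
mean(pred_test  == y[test])   # ~0.5 -- the model has learned nothing real
```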
So, I hope that you have learnt a little bit about how to train predictive models and how to test them to avoid overfitting.

I have a more general question. You showed this example of the 1948 US presidential election. We have come a long way since then — it is 2018 now — but we still seem to be getting this wrong time and time again. Yes, 2016 was another example. That was a good example, and right now it is election season here in India, so I am sure many of the polling outfits will get it wrong. So, what is your take on that? Why do we get it so wrong?

I do not know that much about predicting elections, but in general, in, let us say, biomarker discovery, we get it wrong partly because we do not think it through well enough and we take shortcuts, but it is also genuinely very difficult to do it right. That is the main thing. And it is not easy to collect enough samples, so we have to collect them across many hospitals, and it gets very complicated; it is not easy to think it all through.

You mentioned that the normal tissues should not be collected at a different hospital from the one where the cancer samples are collected. One issue is that there could be ethical problems if you want to get both from the same hospital, and the other is that the scenario where the normal tissue and the cancer tissue are both available in the same place is generally not seen. There could be another institute where, for other reasons such as infections, there are tissues that are not clearly cancerous — other disease tissues that we could consider as not cancerous, let us say not really normal but not cancerous — which could be used. In that kind of scenario, what about the solutions Marnie talked about, like trying to do batch correction and things like that?

Again, as I said, it is better to avoid that if possible. One should definitely try to set it up so that, for example for a blood test, one could imagine a clinic where people come in and, at that point, they do not even know whether they have cancer and the doctors do not know either. That is a good situation: during the sample collection no one knows, everyone is treated the same way, and only after the samples are taken are they tested. If one can do that, that is the ideal situation, I think. But usually we do have to make compromises; we should just try to make as few compromises as possible.

Two comments. The political example that was brought up is a lot more complicated, because human psychology and how people behave are involved. I think the example David gave was for showing bias, but the one we brought up about the 2016 election is much more complicated: who actually goes out to vote is also part of the prediction, and the polling does not take that into account in a proper way. Say I hate Hillary Clinton, so I get all my friends to come out and work for Trump — that dynamic is not captured when the polling is done. So, I think that is a much more complex example that involves human behavior and psychology, and that is how the forecasts got it so wrong. The other comment I wanted to make is that David brought up feature selection, and he also talked about keeping your test data set separate and using cross validation.
So, one of the common mistakes that people make — and one of the mistakes I made too, early on — is to use your entire data set to do feature selection and then split your data set. That is very bad. If you are working in the business world and you have millions of data points, it probably does not make that big a difference, but in biology, where you have only 50 or 100 samples, if you do that then your answer is already baked into your features, so you will do very well on your test set, but the next set of new samples that come from the hospital will fail badly. I used to work in the telephone industry, where you get hundreds of thousands of records, but when I moved to working on biological samples I found that you really have to pay attention to not contaminating your training and test sets, and to keeping them separate from the beginning.

My question is this: say I have developed a model on a set of features selected from a large number of features, so my model is already built. Now, when I go to the test data set, it turns out to be missing one of those features — a feature that could not be measured in those samples at all. In such a scenario, is the only way forward to go back and redevelop the model without that feature, or can the model handle it in some other way?

So, some machine learning methods are better at handling missing data like that, especially some of the tree-based methods. You can set them up so that even if a feature is present in your training set but not in the test set, the model can still work well.

In tree-based methods, like David mentioned, especially random forests, when you build your model you can keep storing surrogate features, so when your main feature is missing you use a surrogate feature in its place. If you have, say, five surrogates for each main feature, then if one or two of your main features are missing you use the surrogates. Obviously there is some degradation in the model, because if the surrogate were as good as or better than the main feature it would have been the main feature, so you lose some performance, but you can still do your prediction.

Is this the same thing as interpolation, filling in some more data points? No, that is missing value imputation, which is different — I spoke about it yesterday — and that is another thing you can do, but I think using surrogates is a more robust way of dealing with it. When you impute, it is one thing to look at all your data and impute, but when you have test data which is just a few samples, how do you impute? Do you use only the test data, or do you combine your entire training data and test data to impute? I would not recommend combining everything, because then you are making your test data look like your training data by design, and that is not what we want. If you have a large test data set you can try imputing, but usually you only get a few samples for prediction, and in that case I think using a method that can deal with surrogates or missing features is the best bet. I know only random forests that can do that; for some of the other methods you can calculate variable importance, but I do not know of other methods that would use surrogates automatically on their own.
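Here is a sketch of the surrogate idea using rpart, which builds CART-style trees with surrogate splits; the protein names and data are made up, and this is only meant to illustrate how a tree can still predict when a main feature is missing at test time.

```r
# Surrogate splits in a CART-style tree (rpart), illustration only
library(rpart)

set.seed(3)
n <- 200
protein_A <- rnorm(n)
protein_B <- protein_A + rnorm(n, sd = 0.3)              # correlated: a natural surrogate
outcome   <- factor(ifelse(protein_A > 0, "tumor", "normal"))
d <- data.frame(outcome, protein_A, protein_B)

# keep surrogate splits when the tree is built
fit <- rpart(outcome ~ protein_A + protein_B, data = d,
             control = rpart.control(maxsurrogate = 5, usesurrogate = 2))

# a new sample where the main feature was not measured at all
new_sample <- data.frame(protein_A = NA_real_, protein_B = 1.2)
predict(fit, new_sample, type = "class")  # still predicts, falling back on protein_B
```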
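And going back to the mistake mentioned at the start of this exchange — selecting features on the entire data set before splitting — here is a small simulated sketch of how misleading that can be: the phenotype is pure noise, yet the cross-validated accuracy looks convincing.

```r
# Feature-selection leakage on pure noise (simulated, illustration only)
library(class)

set.seed(4)
n <- 50; p <- 5000
X <- matrix(rnorm(n * p), nrow = n)
y <- factor(rbinom(n, size = 1, prob = 0.5))  # no real signal at all

# WRONG: rank features using all samples, then cross validate on the selected ones
scores   <- apply(X, 2, function(f) abs(cor(f, as.numeric(y))))
selected <- order(scores, decreasing = TRUE)[1:10]

folds <- sample(rep(1:5, length.out = n))
acc   <- numeric(5)
for (k in 1:5) {
  train  <- folds != k
  pred   <- knn(X[train, selected], X[!train, selected], y[train], k = 3)
  acc[k] <- mean(pred == y[!train])
}
mean(acc)  # well above 0.5 even though y is random

# The right way: repeat the feature ranking inside each training fold only,
# or keep a truly untouched test set for the final evaluation.
```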
So, today Dr. Fenyö provided a brief idea about different biomarker selection methods and how they can help in the optimal selection of features. We also learned that cross validation, forward selection, and backward selection play an important role in feature selection. The lasso is a very good method to explore for this kind of feature selection. We should choose features keeping two important things in mind: the model should be as simple as possible, so that we can interpret it easily, and at the same time the model should provide good prediction. Today we also learned about the curse of dimensionality: in simple words, with a data set having a very large number of dimensions or features, the target function frequently becomes very complex and the model may overfit. Finally, Dr. Fenyö talked about sampling bias in biomarker discovery. He also mentioned data snooping, which refers to statistical inference that the researcher decides to perform after looking at the data. We should avoid sampling bias and stick to pre-planned inferences. For example, if a group of researchers plans to compare three dosages of a drug in a clinical trial, but after looking at the patient records they regroup the patients and perform a different comparison, that is an example of data snooping. The next lecture will be a hands-on session on WebGestalt by Dr. Bing Zhang. Thank you.