Welcome to the MOOC course on Introduction to Proteogenomics. Today's lecture, by Dr. David Fenu, is about predictive analysis. He will explain supervised machine learning in detail and how it is linked with predictive analysis. Dr. Fenu will briefly discuss the various parameters that are important for training a model and for testing a predictive model. He will also talk about how predictive analysis can be used in the treatment of cancer, especially in deciding on treatment strategies. Dr. Fenu will also talk about image classification and how it can be used for skin cancer diagnosis. So let us welcome Dr. David Fenu for today's lecture.

Now we are going to talk about machine learning, and specifically about predictive analysis. You heard Mani's lecture, where he talked about unsupervised and supervised machine learning. What we are going to talk about today is purely supervised, which means that you need a set of labeled data. Mani gave a very short introduction, and I plan to go a little deeper and give you more details.

What I would like you to learn this morning is, first of all, how one trains a model. There we are going to talk about gradient descent, which is a quite general method for finding the parameters of a model. Then we are going to talk about regularization, which is a method to protect us from overfitting; we will define all of these terms in detail.

The other thing we will talk about is feature selection. One characteristic of proteogenomics is that we measure a lot of things: let us say tens of thousands of transcripts, maybe 10,000 proteins, maybe 30,000 phosphorylation sites. That is a lot of measurements on different molecules, but most of them will not be relevant to, say, predicting what happens to a tumor. So what we want to do is focus in on the important ones, and that is why we do feature selection: we select the genes that are important to, and closely related to, what we want to predict.

We will also briefly touch on the fact that people have developed a lot of different machine learning methods and approaches, so we will talk a little bit about how to choose the right method for the problem you want to solve. That is another thing that is quite important. Then, very importantly, after we have trained our model we need to test it: we need to evaluate how good it is and how well it generalizes. There we are going to talk about overfitting and underfitting.

I showed this slide two days ago; it is one example of predictive modeling. When, for example, the surgeon cuts out the primary tumor, we analyze it: we do RNA-seq and proteomics. From those measurements we want to build a predictive model that can tell the oncologist which combination of drugs to give to cure the cancer. This will of course depend both on the individual and on the type of tumor they have. The slide shows one example: for the patient in the top panel, treatment A is what we want, but for the patient in the lower panel, treatment B works much better. Currently, as you probably know very well, that is not how it works: there is a standard of care that is given to everyone, and it is only in a few specific cases that we can make this kind of decision.
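To make the gradient descent idea mentioned above concrete ahead of the detailed discussion, here is a minimal sketch in Python, assuming NumPy is available. It fits a one-variable linear model by repeatedly stepping each parameter against the gradient of the mean squared error; all data and variable names are invented for illustration, not taken from the lecture.

```python
import numpy as np

# Toy data: one measured variable x (say, a protein level) and a
# response y (say, months of survival). Purely illustrative numbers.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.5 * x + 1.0 + rng.normal(scale=2.0, size=50)

# Model: y_hat = w*x + b. Gradient descent minimizes the mean squared
# error by stepping each parameter against its gradient.
w, b = 0.0, 0.0
lr = 0.01                           # learning rate (step size)
for _ in range(5000):
    err = (w * x + b) - y
    grad_w = 2 * np.mean(err * x)   # d(MSE)/dw
    grad_b = 2 * np.mean(err)       # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w = {w:.2f}, b = {b:.2f}")  # should approach 2.5 and 1.0
```

The same update rule generalizes to many inputs and to much more complex models; only the gradient computation changes.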
But of course the hope is that, by doing research in proteomics, in the future we will be able to make these kinds of decisions in a more general way. You have probably read in the newspapers that machine learning has improved a lot. People have been working on machine learning for several decades, but in the last few years it has really exploded, and things work much better than they have in the past.

One area that has been very successful, for example, is image classification. Companies like Google and Facebook have a lot of images, so they have put large efforts into automatically annotating and classifying them, and it is actually amazing how well it works. What you see here is one big data set that is often used for training. It contains a very large variety of images, which also vary in how easy it is to see what is in them. For several years there were competitions for who could develop the best algorithm to classify these images, but people have now actually given up on image classification competitions, because it works too well; it is not worth doing much more, so they have moved on to more complicated problems.

Of course, we can apply this in our field. For example, there was a Nature publication last year on skin cancer diagnosis. The authors took cell phone images of moles on people's skin and had dermatologists look at these images and classify them as benign or cancerous. They collected a lot of images, I think close to 130,000, built a model with that data, and could then show that their machine learning model actually worked better than at least the average dermatologist. That is quite incredible. You can imagine the implications: if you are worried about some mole, you take a picture of it with your cell phone, upload it to some web service, and you get an answer back right away, with high accuracy.

Another thing that has been very successful is teaching algorithms to play games. Quite a while ago, chess machines became very good at playing chess, and then became better than any human. Still, for a while, a collaboration between an algorithm and a person was better than any algorithm on its own; that is not the case anymore, and now the human does not add anything extra in chess. More complicated games like Go, and also Jeopardy, have since been conquered as well. The advantage in games, which we do not have in our case, is that you can have slightly different algorithms play against each other, so in general you can generate as much training data as you want. In our case we have a certain number of tumors that we analyze, and analyzing more tumors is expensive, so the data is limited. In the game case, if you have large computers, you can have the algorithms play against each other and learn from that play, so in principle you can generate any amount of training data. People are trying to do something similar in proteogenomics too, but it is dangerous: to generate more data we have to build some kind of model of how our data behaves, and then the algorithm will probably mainly learn what we think the data looks like, not anything real.
Another example from the general field is language translation. This was in the New York Times two years ago. It is a passage from Hemingway's "The Snows of Kilimanjaro". One of the versions is Hemingway's original; the other was produced by taking an author's Japanese translation of the English and translating it back into English using Google Translate. So which is which? Yes, it is mainly the phrase "the dead body of a leopard" that gives it away, and then there are maybe some other nuanced things, but there is only one small grammatical error. This is quite amazing. I am showing these general examples as inspiration: we should be able to do the same kinds of things for proteogenomics. But of course, as I said before, the advantage in all of these cases, with image analysis, translation, and games, is that there is a very large labeled data set. That is really what we need, and unfortunately our data sets are usually limited; we would always want them to be larger to be able to achieve things like this.

So let us look a little more closely at the details of supervised learning. As Mani already mentioned, there are two main tasks: one is regression, the other is classification. What supervised learning does is build a very general model with lots of parameters. There is no biological knowledge in this model; it is just a very generic model that can pretty much approximate any type of function, and we will look a little at what that means. We then want to learn the parameters that best fit the data.

We are going to look first at regression. In regression we have some variables that we measure; we usually call them x. In this case, for illustration purposes, we show only one x axis, and we want to predict the value y. Here x could be, for example, the level of a transcript that we measure with RNA-seq, or the level of a phosphorylation site that we measure with mass spectrometry. With real data we have many measurements, so even though I only show one, you should always imagine that there are 10,000 or 100,000 axes. It is very difficult to imagine what happens there, and methods that work in low dimensions can behave very differently when you go to high dimensions.

What regression does, in this case with one x and one y, is try to find the function that describes their relationship; it is quite straightforward that way. In classification, we instead try to find the boundary between two classes. In this case there are two measurements: x1 would be, let us say, the level of one protein and x2 the level of another protein; the yellow circles could be patients with long survival and the black ones patients with short survival. We want to find the boundary so that, once we have done the measurements, we can answer the question: will this patient survive for a short or a long time? If we go back to the regression case, there the y could instead be how many months the patient will survive.

We are going to start with linear regression. Here the axes, and we can have many of them, are the different quantitative measurements that we make, and for each of them we have a different weight.
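As a toy illustration of the two tasks just described, the sketch below, assuming scikit-learn and NumPy are installed, fits a regression model that predicts survival in months from two hypothetical protein levels, and a classifier that predicts a short-versus-long survival label from the same measurements. All names and numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)

# 100 hypothetical patients, each with two measured protein levels
# (the x1 and x2 of the lecture). The numbers are made up.
X = rng.normal(size=(100, 2))
months = 24 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=3, size=100)

# Regression: learn a function from the measurements to a continuous y
# (months of survival).
reg = LinearRegression().fit(X, months)
print("predicted survival (months):", reg.predict(X[:1]))

# Classification: learn a boundary between two classes
# (short vs. long survival).
labels = (months > 24).astype(int)   # 1 = long survival
clf = LogisticRegression().fit(X, labels)
print("predicted class:", clf.predict(X[:1]))
```

The same x values feed both models; the difference is only whether the quantity being predicted is continuous (regression) or a class label (classification).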
So the output y is computed by taking each measurement, multiplying it by its weight, summing these products, and adding a constant: y = w1*x1 + w2*x2 + ... + b. You will recognize that, if we have just one x, this is the familiar linear regression case. One thing to point out is that we can also use an arbitrary function of x; we do not have to limit ourselves to x itself. For example, we can use a polynomial that takes x as its input, and it is still linear regression, because what makes it linear is that it is linear in the w, the parameters that we are learning.

One issue is that we have limited data, but we can make these models arbitrarily complex, so we have to choose how complex to make them. In this case, given these data points, it could be quite reasonable to use just a linear regression, not a very complex function, and that could work well. But one could also fit a much more complex function, and then we would have the lower case.

A question from the audience: can you please give an example of some particular data for linear regression? Yes. The output could be, for example, how many months a person will survive, which is what we want to predict. The inputs would then be the levels of several different proteins, and what we want to learn are the weights. So the analysis will give us an estimate of how long that person will survive? Yes. And what would x1, x2, x3, and so on be? x1 would be one protein, x2 would be another protein, and so on. Let us say we measure the levels of 10,000 proteins; that is a lot of parameters, and we probably do not have enough samples to support such a complex model. We are going to talk about how to select which proteins are important a little bit later.

So we have these two cases. Which one is correct? Who thinks the top one is correct? Please raise your hands. The answer is that there is no way to tell: if you only have your training data, there is no way to tell. Yes, I agree that the top one is more likely, and that is what we would guess, but it is just a guess; we really do not know. We need to collect more data, and we always need to train on one data set and then test our model on an independent data set.

So let us say that we measure more data. The black points are the same as before, and the grey ones are our new, independent test set. Now we can say that we trust the linear regression. But if, for some reason, the new data had instead looked like the other case, we would choose the more complex model. The main point is that when you only have your training data set, there is no way to tell how good the model is, and that is probably the most important thing in my lecture today.

Here is another way to show this, on the same data set as before, those 12 points I think it was. We see that when we increase the degree of the polynomial, that is, increase the complexity, we can make the error go down, in this case to zero, once the polynomial has as many coefficients as we have data points. One thing about the error, which you are probably familiar with, is that both for training and for testing we have to choose a function: one that we minimize in training and then use in testing to evaluate.
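The degree-versus-error behavior just described can be reproduced in a few lines. In the sketch below, invented data stand in for the 12 points on the slide: we fit polynomials of increasing degree by least squares, then measure the sum of squared errors on the training points and on an independent test set.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

# 12 noisy training points plus an independent test set; the true
# relationship is linear. All numbers are invented for illustration.
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(scale=0.2, size=12)
x_test = rng.uniform(0, 1, size=50)
y_test = 2 * x_test + rng.normal(scale=0.2, size=50)

def sse(y, y_hat):
    """Sum of squared errors: square each residual, then add them up."""
    return np.sum((y - y_hat) ** 2)

for degree in (1, 3, 11):
    p = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    print(f"degree {degree:2d}: "
          f"train SSE = {sse(y_train, p(x_train)):10.3f}, "
          f"test SSE = {sse(y_test, p(x_test)):10.3f}")

# At degree 11 there are 12 coefficients for 12 points, so the training
# SSE is essentially zero, but the SSE on independent test data
# typically grows sharply: the model has overfit the training noise.
```

The training error alone is monotonically non-increasing with degree; only the independent test set reveals which complexity actually generalizes.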
For linear regression, does anyone remember what we use as the loss function? For each data point you take the distance to the line, square that error, and then add the squared errors up; you may remember this from high school. It is the sum of squared errors. Usually you just take the sum; you can take the mean instead, but you do not have to, and it does not matter for the fit. It is really that simple: take the error for each data point, square it, and add them all up. That is the most common choice, the sum of squared deviations, and I am sure you all know it.

Going back to this slide: we know that on a training set, if we make our function complex enough, we can make the error go down to zero. But of course this is meaningless, because we have just made an overly complex fit to all the noise in our training data. Because of this, a long time ago John von Neumann said that with four parameters he could fit an elephant, and with five he could make it wiggle its trunk. What he meant was exactly this: if you evaluate your model on your training data, the result is not meaningful. Of course, that was a long time ago, when there was much less data, so he worried about four parameters; nowadays, when people build deep learning models, they have hundreds of thousands of parameters and worry much less than von Neumann did.

I hope today you learned how supervised machine learning, regression, and classification play a role in predictive analysis. Dr. Fenu also showed how predictive analysis can help in skin cancer diagnosis, where it was found to be superior to the average dermatologist's image-based diagnosis. We also learned how overfitting and underfitting relate to model capacity. Finally, we understood that the capacity of a model describes how complex a relationship it can represent: you can expect a model with higher capacity to be able to model more relationships between more variables than a model with lower capacity. In the next lecture, Dr. David Fenu will talk more about predictive analysis, with more emphasis on training a model and testing a model. Thank you.