Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture by Dr. David Fenyö, you were introduced to the concept of proteogenomics and its ability to provide expression information at multiple levels. In today's lecture, Dr. Fenyö will introduce you to a few more capabilities of proteogenomics and its applications to various clinical problems. So, let us welcome Dr. David Fenyö for today's lecture.
So, one thing that people observed a lot in The Cancer Genome Atlas and in many cancer genomic studies is that there are a lot of changes when you look at either whole-exome sequencing or RNA-seq. There are so many changes that it is difficult to say which ones are interesting and which ones are important. So, one thing we can use proteomics for is to focus in on the changes that actually have an effect on the proteome. In general, a copy number change that does not lead to any change in the protein is probably less interesting than one that actually changes the proteome.
So, one thing we can then look at, and this is from the CPTAC breast cancer paper, is where we see consistency between the different measurements. This shows copy number, transcript, protein and phosphoprotein, and this is again for ERBB2 here on top. Usually, when we have an amplification, that is, a copy number change, we also see that the transcript levels are high, the protein levels are high and a lot of the phosphorylation levels are high. So, when we look for this consistency between the different data types, we do see ERBB2, which is well known to be an important driver in a subset of breast tumors.
So, that comes out, and then we see this for a few other kinases in other samples. PAK1, for example, also shows this consistency, and there are a few others. So, we can definitely focus on things where all the different data types are correlated and tell us the same thing, but if we only do that, it will be very limiting. Still, it is one way to get started: to see what is consistent between the different data levels.
Another thing we can look at is correlations between different genes. This is one example, again ERBB2, now compared to GRB7. We do this at the DNA, RNA and protein levels, and in this case we have a very consistent result: they are highly correlated in all the measurements. The reason why the copy number changes are so highly correlated is that the two genes are very close to each other on the same chromosome. But if we compare ERBB2 to ERBB4, we see that ERBB4 does not have any copy number changes. The transcript levels are not correlated, but we have a rather high correlation at the protein level, and this is quite common for proteins that work together.
So, if we look at ERBB2 in general, again in breast cancer, and rank which genes are highly correlated with it, now only at the protein level, we see some high correlations. These are the two that we looked at, GRB7 and ERBB4, which have the two highest correlations. But there are others that also have high positive correlations, and there are some that have rather high negative correlations; this one has a correlation coefficient of minus 0.54. So, by looking at this, we can start to see which other proteins each protein works with. So, the question is what this then means.
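The gene-pair comparisons described above can be sketched with synthetic data. This is a minimal illustration, not the analysis from the CPTAC paper: the sample count, noise levels and variable names are all assumptions, and the values are simulated rather than measured. One pair mimics two neighboring genes that share a regional copy number change; the other mimics two interacting proteins whose levels are coupled only at the protein level.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100  # hypothetical number of tumors

# Pair 1: neighboring genes. A shared regional copy number change drives
# both genes, with a little gene-specific noise on top.
region_cn = rng.normal(size=n_samples)
cn_a = region_cn + 0.1 * rng.normal(size=n_samples)
cn_b = region_cn + 0.1 * rng.normal(size=n_samples)

# Pair 2: interacting proteins. Their transcripts vary independently,
# but their protein levels follow a shared (assumed) complex abundance.
rna_c = rng.normal(size=n_samples)
rna_d = rng.normal(size=n_samples)
complex_level = rng.normal(size=n_samples)
prot_c = complex_level + 0.3 * rng.normal(size=n_samples)
prot_d = complex_level + 0.3 * rng.normal(size=n_samples)


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


r_cn_neighbors = pearson(cn_a, cn_b)
r_rna_partners = pearson(rna_c, rna_d)
r_prot_partners = pearson(prot_c, prot_d)

print(f"copy number, neighboring genes: r = {r_cn_neighbors:.2f}")
print(f"RNA, interacting proteins:      r = {r_rna_partners:.2f}")
print(f"protein, interacting proteins:  r = {r_prot_partners:.2f}")
```

With real data, the same `pearson` call would be applied across tumors to each gene pair at each data level, and ranking all genes by their protein-level correlation with one gene of interest gives the kind of ranked list described in the lecture.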
So, the simple case we saw was that if we see correlation at the copy number level, it is usually just because the genes are close to each other on the genome. When the copy number changes, it usually changes across a larger region, so if the genes are close to each other, they will both change.
But then it gets more involved. We saw that at the protein level, if proteins work together in a complex, they may be regulated at the protein level through the formation of the complex. If one of the components is on its own, left over without a binding partner, it may get degraded at a high rate, for example. There could be other mechanisms, and we also see correlations at the RNA level.
So, then we can start thinking about how to use this information. One way to do it, if we have a large enough experimental data set, is to look at networks and pathways to see if there are any differences. This is just one example where we have three genes: A affects B, A also affects C, and C affects the phenotype. In this case, A affects the phenotype through C. So, if we inhibit C, we also block that effect, and A can no longer affect the phenotype. If we have one subtype of tumors with this structure, then inhibiting C would be a good way to change the phenotype, in this case to treat the tumor. But if we have another subtype where there is also a connection from B to the phenotype, that changes things. In that case, a drug against C will stop this pathway, but A can still act through the other pathway via B. So, that is one thing one can start thinking about when one has these correlation data between different genes at different levels.
So, we are going to talk a lot about signatures, and we have already talked about different signatures.
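The three-gene network example above can be written down as a small reachability check. This is only a sketch of the reasoning, assuming each network is a plain directed graph and that inhibiting a node removes it entirely; the function name `can_reach` and the node labels are hypothetical.

```python
def can_reach(edges, source, target, inhibited=frozenset()):
    """Return True if `target` is reachable from `source` along directed
    edges without passing through any inhibited node."""
    if source in inhibited:
        return False
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen and nxt not in inhibited:
                seen.add(nxt)
                stack.append(nxt)
    return False


# Subtype 1: A -> B, A -> C, C -> phenotype
subtype_1 = {"A": ["B", "C"], "C": ["phenotype"]}
# Subtype 2: the same network plus an extra edge B -> phenotype
subtype_2 = {"A": ["B", "C"], "B": ["phenotype"], "C": ["phenotype"]}

for name, network in [("subtype 1", subtype_1), ("subtype 2", subtype_2)]:
    blocked = not can_reach(network, "A", "phenotype", inhibited={"C"})
    print(f"{name}: inhibiting C blocks A -> phenotype: {blocked}")
```

In subtype 1 the only route from A to the phenotype goes through C, so inhibiting C blocks it; in subtype 2 the route through B remains open.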
So, for breast cancer we have the PAM50 signature, which is a set of 50 genes used to determine the four major subtypes of breast cancer. Here we have just listed the 50 genes, and we are looking at the correlation between them. What we see, and this is very common with signatures, is that the selected genes are quite highly correlated, but not perfectly. We see these regions of high correlation, and then we see another group that is quite highly correlated within the group but anti-correlated with the other group. So, some of the genes go up and some go down, and that is how we build a signature.
So, the question is then how we get a signature. Mani talked about unsupervised analysis, and that is one way we can try to extract signatures. As he mentioned, there are many different ways to do unsupervised analysis. One example is independent component analysis. It was initially developed for the situation where you have a room with several people talking and a microphone that picks up an overlay of the different voices; the whole point is to separate out the different sources. The way it is done is that we take the composite of all the sources and separate it into two matrices: one is the signal source matrix, and the other is the mixing matrix that describes how the sources are mixed.
We can do the same thing for tumors. The idea is that there are several biological processes going on: some are basic biological processes that do not have anything to do with the cancer, and some are related to cancer. What we measure is a sum of these different processes. So, we separate them out, and the first step is completely unsupervised.
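In matrix form, the separation into a signal source matrix and a mixing matrix is commonly written as below. The symbols X, A and S, the dimensions n, p and k, and the choice of samples as rows are notation assumed here for illustration; they are not taken from the lecture.

```latex
\[
  X \approx A\,S, \qquad
  X \in \mathbb{R}^{n \times p},\quad
  A \in \mathbb{R}^{n \times k},\quad
  S \in \mathbb{R}^{k \times p}
\]
\[
  x_{ij} \approx \sum_{c=1}^{k} a_{ic}\, s_{cj}
\]
```

Here X holds the measurements for n tumors and p genes or proteins, each row of S is one of k source signals (a candidate signature), and a_{ic} is the weight with which source c contributes to tumor i. Independent component analysis chooses the sources so that they are as statistically independent of one another as possible.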
[Audience question about the mixing scores, partly inaudible.]
Yeah. So, these could be either the protein levels that we measure or RNA. We could do it for either, or we could even combine them. And we are actually going to talk about how, in an analysis like this, one could potentially combine the protein and the transcriptomic measurements. So, the signal source matrix contains the signatures of potentially different biological processes, and the mixing scores tell us how large the effect of each of those signatures is on the overall measurement.
So, as I said, the first step is completely unsupervised. But since we get signatures both from basic cellular processes and from processes related to cancer, we then go back and look at which of these signatures are correlated with different clinical data types. One example here is that one of these signatures is highly correlated with the luminal A subtype. What we see in that signature is that there are a lot of genes associated with the cell cycle, and most of them are down. Most of the genes are blue in this case, so these cell cycle genes are much lower in this signature. This is actually a well-known signature of luminal A, but we got it through an unsupervised analysis and were able to recover it without making any initial assumptions.
If we then look at this signature and again look at the correlation between the different genes in it, we see the same thing: most of the genes are, in this case, positively correlated. There is still quite a bit of variation between the genes, but they are quite well correlated.
And then we have another group that is internally positively correlated but anti-correlated with the rest.
So, another thing we are going to talk about is how we use this for predictive modeling. This is just a cartoon showing a patient with cancer who can be treated with either treatment A or treatment B. For this patient, treatment A is preferable, but at the time we start treatment we do not know that. So, we would love to be able to do a measurement at this point and predict which of the two treatments would be best. For another person, of course, it could be the opposite, and treatment B would be preferable. This is one example, but there are of course a lot of other things we want to predict. We are going to talk more on Saturday about how to do this prediction; Mani started to introduce it, but we will talk about it in more depth.
The biggest problems with this kind of analysis are, first, that we can have bad luck, especially if we do not have enough samples, which we almost never do. We can get random matches, and that gives us predictive models that only work for the patients we trained them on and do not generalize. The other serious problem is bias. For dealing with random variation we have quite well-established methods, as Mani already started introducing. But bias is a very problematic thing, and one really has to be careful, because when we build these predictive models we use machine learning algorithms that we train, and they will only give us back answers that reflect the data we used to train them. So, if we choose the control samples in the wrong way, for example, the model will not generalize. Bias especially is really problematic, and one should spend a lot of time thinking about it.
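The "bad luck" problem with few samples and many measured features can be illustrated with pure noise. The sketch below is not a method from the lecture: the sample and feature counts, the feature-selection step and the simple voting classifier are all assumptions chosen to keep the example short. By construction no feature carries any information about the label, so any apparent predictive power comes from random matching.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features, n_selected = 20, 200, 5000, 20

# Pure noise: no feature is related to the label.
X_train = rng.normal(size=(n_train, n_features))
y_train = np.array([0, 1] * (n_train // 2))
X_test = rng.normal(size=(n_test, n_features))
y_test = rng.integers(0, 2, size=n_test)

# "Training": keep the features whose class means differ most
# in the training samples.
mean0 = X_train[y_train == 0].mean(axis=0)
mean1 = X_train[y_train == 1].mean(axis=0)
diff = mean1 - mean0
top = np.argsort(np.abs(diff))[-n_selected:]
midpoint = (mean0[top] + mean1[top]) / 2
direction = np.sign(diff[top])


def predict(X):
    """Predict class 1 when the summed, direction-adjusted score is positive."""
    score = ((X[:, top] - midpoint) * direction).sum(axis=1)
    return (score > 0).astype(int)


train_accuracy = float((predict(X_train) == y_train).mean())
test_accuracy = float((predict(X_test) == y_test).mean())
print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The training accuracy should come out close to 1 while the held-out accuracy stays near chance (about 0.5), which is the pattern of a model that only works for the patients it was trained on. Note that this sketch addresses only random variation; it says nothing about bias from how samples or controls were chosen.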
I hope today you learned how proteogenomic analysis can provide very useful and novel insights. In addition to providing RNA-to-protein correlation, proteogenomics can also provide information on the association between two or more genes or proteins. Network and pathway analysis can provide information on how the presence of a protein or gene influences the expression of other proteins or genes. A comprehensive understanding of concepts such as predictive analysis, pathway enrichment, mutation and signaling, as well as marker selection, will help you appreciate the role of proteogenomics in accelerating clinical research. Thank you.