Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture you were introduced to the effect of mutations on gene expression and how they can alter signaling pathways. You were also taught how these mutations can modify phosphosites and play a role in downstream signaling processes. Today's lecture by Dr. Karsten Krug is a continuation of the last lecture, and he will cover the use of two important tools, ActiveDriverDB and MIMP.
So, we are going to go quickly through the different steps here. The idea is to get the principle of how it works, so if you do not understand all the details, that is not crucial for our hands-on session. This is again very similar to what we have just looked at. You have two collections of phosphosites, and this is actually how the model is built. One is the set of positive phosphosites, which are known substrates. Then, to get a background data set, or a negative distribution, you choose random negative phosphosites. Then you construct your model and use a Bayesian approach to determine whether the phosphosite you are interested in comes from the negative distribution or from the positive distribution. Mani was talking about mixture modeling yesterday, and this is a related approach. In this case, amino acid frequencies are calculated by what they call position weight matrices, which again is very similar to what we have looked at: it is the relative frequency of each amino acid at each position. So again we have the positions on the x axis and the amino acids on the y axis, and they add an error term here. That is basically the only difference. So, that you can already calculate from these sequence windows here.
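The position weight matrix just described can be sketched in a few lines of Python. This is only an illustration of the principle, not MIMP's actual code: the function name `build_pwm`, the toy sequence windows and the size of the error term (`pseudocount=0.01`) are all assumptions made for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pwm(windows, pseudocount=0.01):
    """Relative frequency of each amino acid at each window position.

    Returns a dict mapping amino acid -> list of frequencies (one per position).
    `pseudocount` plays the role of the small error term; its value here is an
    assumption for illustration.
    """
    if not windows:
        raise ValueError("need at least one sequence window")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")

    pwm = {aa: [0.0] * width for aa in AMINO_ACIDS}
    for pos in range(width):
        # Count only standard residues; padding characters are ignored.
        counts = Counter(w[pos] for w in windows if w[pos] in pwm)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        for aa in AMINO_ACIDS:
            pwm[aa][pos] = (counts[aa] + pseudocount) / total
    return pwm


# Made-up 7-residue windows, centred on the phosphosite at index 3.
toy_windows = ["RKRSSLP", "RRRSTVA", "GRKSSFE", "RSRTSPP"]
pwm = build_pwm(toy_windows)
```

Each column of the resulting matrix sums to 1, and the pseudocount keeps amino acids that were never observed at a position from getting a frequency of exactly zero.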
So, in order to calculate the score for a specific phosphosite, they came up with this matrix similarity score, which is a concept that was introduced for genomic sequence analysis. Again, the principle is that you want to calculate a score: given a phosphosite sequence, how similar is it to my position weight matrix? You have one matrix for each kinase, and for each phosphosite you calculate a score for how similar it is to that motif, and it is based on information content. Again, we do not have to go into too much detail here, but what you do is calculate the score for your sequence, subtract the minimum given your matrix, and scale it by the range.
So, in order to build up these Bayesian models, the software calculates exactly this score for all known binding sites. This is one distribution, the blue distribution here. Then, to get a negative distribution, it takes randomly sampled phosphosites from the human proteome and again calculates all of these scores, and you end up having a negative distribution. Then, for a given peptide, let us say the peptide that you measured, you get a score, which might be here, and using Bayesian theory we can calculate the probability of whether this score is more likely to be derived from this distribution or from that distribution. That is the entire concept. So, with this approach we can calculate kinase substrate specificity: given a phosphopeptide, how likely is it that it has been phosphorylated by this particular kinase, or maybe by another one?
So, now how do we connect that to mutations?
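Before moving on to mutations, the scoring and Bayesian steps described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the software's implementation: the function names, the tiny four-letter toy matrix, the information-content weighting formula, the use of a single normal distribution per class and the prior of 0.5 are all choices made for this example.

```python
import math
from statistics import NormalDist


def information_content(column):
    """Information content of one matrix column (dict: residue -> frequency)."""
    k = len(column)
    return sum(f * math.log(k * f) for f in column.values() if f > 0)


def mss(window, pwm):
    """Matrix similarity score: (current - min) / (max - min).

    `pwm` is a list of columns, one dict per window position.
    """
    weights = [information_content(col) for col in pwm]
    current = sum(w * col.get(res, 0.0)
                  for w, col, res in zip(weights, pwm, window))
    lowest = sum(w * min(col.values()) for w, col in zip(weights, pwm))
    highest = sum(w * max(col.values()) for w, col in zip(weights, pwm))
    return (current - lowest) / (highest - lowest)


def posterior_positive(score, pos_scores, neg_scores, prior=0.5):
    """Bayes' rule: probability that `score` comes from the positive set.

    Assumption: each score distribution is approximated by one normal
    distribution fitted to the sample scores.
    """
    pos = NormalDist.from_samples(pos_scores)
    neg = NormalDist.from_samples(neg_scores)
    p = prior * pos.pdf(score)
    n = (1 - prior) * neg.pdf(score)
    return p / (p + n)


# Toy matrix over a reduced alphabet (made-up numbers).
toy_pwm = [
    {"R": 0.7, "K": 0.1, "S": 0.1, "A": 0.1},
    {"R": 0.1, "K": 0.1, "S": 0.7, "A": 0.1},
    {"R": 0.25, "K": 0.25, "S": 0.25, "A": 0.25},  # uninformative position
]
known_site_scores = [0.80, 0.85, 0.90, 0.95]   # made-up "positive" scores
random_site_scores = [0.10, 0.20, 0.30, 0.40]  # made-up "negative" scores

score = mss("RSA", toy_pwm)
probability = posterior_positive(0.9, known_site_scores, random_site_scores)
```

The score is scaled so that the best possible match to the matrix gives 1 and the worst gives 0, which makes scores comparable across kinases before the two distributions are compared.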
So, we basically calculate these specificities for both versions. One is the wild type phosphosite, and then, if a mutation happened in this sequence window, we calculate the same specificity using the mutated sequence. Again, these can be known cancer mutations from TCGA, or you can upload your own mutations; both are possible. Then you basically look for differences in these probabilities. So, for your wild type, how likely was it that kinase A or kinase B phosphorylated the wild type sequence, and now, after the mutation, how likely is it that A or B can still phosphorylate it? This is how MIMP predicts the effect of mutations on these kinase binding motifs. That is the entire concept of MIMP, and I hope that we will make it work during the hands-on.
So, this is one example output that you get from MIMP. This is the mutation that I fed in: an arginine at 1555 which has been mutated into a lysine in this particular gene here. And here you see the sequence window. This is the wild type sequence window; it has this arginine at position minus 3, and after the mutation this arginine has been replaced by a lysine. In this case we see that the predicted effect is a loss of phosphorylation. The motif is gone. The wild type sequence has a probability of 0.97 of being phosphorylated by AKT1, which apparently recognizes the arginine at minus 3; after the mutation, AKT1 is no longer able to phosphorylate this site. This is how we read these kinds of plots. And then for the same site you have other kinase motifs. What I probably forgot to mention is that these kinase motifs are very loose and not very specific, so many kinases share the same motif, or at least parts of the motif.
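The wild type versus mutant comparison can be sketched as follows. This is a hypothetical illustration, not MIMP's interface: the function names, the toy sequence window, the mutant probability of 0.02 and the 0.5 cut-offs are assumptions for this example. Only the wild type probability of 0.97 and the arginine-to-lysine change at position minus 3 come from the example in the talk.

```python
def mutate_window(window, offset, ref, alt):
    """Apply a substitution at `offset` relative to the central phosphosite."""
    centre = len(window) // 2
    idx = centre + offset
    if not 0 <= idx < len(window):
        raise ValueError("mutation lies outside the sequence window")
    if window[idx] != ref:
        raise ValueError(f"expected {ref} at offset {offset}, found {window[idx]}")
    return window[:idx] + alt + window[idx + 1:]


def classify_effect(p_wildtype, p_mutant, high=0.5, low=0.5):
    """Label a kinase/site pair from the two posterior probabilities.

    The cut-offs are illustrative assumptions, not the tool's defaults.
    """
    if p_wildtype >= high and p_mutant < low:
        return "loss"
    if p_wildtype < low and p_mutant >= high:
        return "gain"
    return "no predicted change"


# Made-up 7-residue window; the phosphosite is the central S (index 3),
# so offset -3 is the first residue.
wildtype_window = "RSLSPEA"
mutant_window = mutate_window(wildtype_window, offset=-3, ref="R", alt="K")

# 0.97 is the wild type probability quoted in the talk; 0.02 is made up.
effect = classify_effect(p_wildtype=0.97, p_mutant=0.02)
```

In practice both probabilities would come from scoring the two windows against each kinase model, so one mutation can be a loss for one kinase and a gain for another.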
As you can see here, CAMK2A is also able to recognize an arginine. Loss means AKT1 cannot phosphorylate this phosphosite anymore. That is just an example; we will go through more examples during the hands-on, I hope. So, here you see the scores that we calculated a couple of slides back. For the wild type sequence it is 0.6, and for the mutated sequence it is almost 0, and this difference kind of determines the probability. Further questions before I move on?
This is actual data from a patient. The question was whether this is real data or some example data; this is a TCGA sample that I have used here. Can you use the mic, please? So, what you are saying is that it is very unlikely that this is a loss, because the motif is too unspecific. The data that goes into this analysis is real data, but keep in mind these are all predicted events. Using this computational framework, which I just tried to explain to you, we calculate, given this data, the probability that this might happen under our statistical framework. That is not experimental proof that this phosphorylation site is actually lost. You must also know that we do not know these sequence motifs for all kinases. This model has been trained on 120 kinases or so, for which we have highly curated, high-quality substrate sites. So, it might very well be that this phosphorylation site, after it has been mutated, can be recognized by another kinase whose motif we do not know. Again, this is all computational, it is all predicted, but it is a way to approach these relationships between mutations and signaling. Sure, but I am not the one who is going into the wet lab, so that is your part. Sure.
So, what goes in here are phosphorylation sites that have been measured in the lab, in a patient sample in this case. We also have genomics data that comes from the same patient: somatic mutations in that patient. These are the data that go into this analysis, as I already mentioned. You can access MIMP on a web server. I encountered a couple of difficulties while trying to run analyses on the web server, so that is why I decided to use the R package. OK, let us move on.
The second tool I briefly want to talk about in the next 10 minutes or so is more like a database than a tool, but this database also enables you to upload your own mutations and see what the effect of these mutations on phospho-signaling events could possibly be. Again, these are all predictions; there is no validation whatsoever. ActiveDriverDB is a very well developed, very complete proteogenomic database that annotates disease mutations, but also population variants, and relates them to PTMs. We have two major types of omics data which have been integrated here: post-translational modification sites, more than 380,000 of them, and more than 3.5 or 3.6 million SNVs. It also predicts the network-rewiring impact of mutations, and for that it uses the MIMP software we just talked about; it comes from the same lab. There are three types of human genome variation data sets that go in here: one is TCGA, the others are disease mutations that come from the ClinVar database and mutations from the 1000 Genomes Project. So, again it uses these publicly available data sets, but it also enables the user to upload their own mutations, and again this is something that we are going to try during the hands-on.
In terms of proteomics and interaction data, it uses data that is available in several different databases, like PhosphoSitePlus, Phospho.ELM and the Human Protein Reference Database. In terms of PTMs, we are not only looking at phosphorylation here; we are also looking at acetylation, ubiquitylation and methylation. It also pulls in information on drug-enzyme interactions and on PTM site-specific protein interactions. All of this information is parsed from publicly available databases and gathered in one place, and it has a very nice, very intuitive user interface, which enables everybody who does not have the computational skills to do these kinds of analyses.
What you do is very similar to what we have learned in the previous part of the talk. You combine genome variation with PTM data: you basically map SNVs and PTMs to your protein sequence, look at what the non-synonymous events in your protein are, and calculate the effect of mutations on the motif around the site. But again, here we are not only looking at phosphorylation, but also at other modifications.
This entire web page is very interactive. There are many different ways to view your data. One is the sequence view, where you can look at your protein sequence along the x axis, with all mutation events and PTM events highlighted in this protein sequence. You can look at the distribution of PTMs along the sequence, and it also has a way to build up interactive network views between proteins, kinases, drugs, PTMs and so on. This is something we are again going to explore during the hands-on session.
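The mapping step described above, where SNVs and PTM sites are placed on the same protein sequence and the non-synonymous changes close to a modification are kept, can be sketched like this. It is a minimal illustration and not the database's code: the class names, the function name, the flank of 7 residues and all positions other than the R1555K example are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    position: int  # 1-based position in the protein sequence
    ref: str       # reference amino acid
    alt: str       # amino acid after the mutation


@dataclass(frozen=True)
class PtmSite:
    position: int  # 1-based position of the modified residue
    kind: str      # e.g. "phosphorylation", "acetylation"


def mutations_near_ptms(mutations, sites, flank=7):
    """Return (mutation, site, offset) for amino acid changes near a PTM site.

    `flank` is the number of residues considered on each side of the site;
    the default of 7 is an assumption made for this sketch.
    """
    hits = []
    for mut in mutations:
        if mut.ref == mut.alt:
            continue  # no amino acid change, so the motif is unaffected
        for site in sites:
            offset = mut.position - site.position
            if abs(offset) <= flank:
                hits.append((mut, site, offset))
    return hits


# Toy input: R1555K sits at position -3 of a phosphosite at 1558,
# matching the example from the talk; the other entries are made up.
toy_sites = [PtmSite(1558, "phosphorylation"), PtmSite(200, "acetylation")]
toy_mutations = [Mutation(1555, "R", "K"), Mutation(50, "A", "V"),
                 Mutation(198, "L", "L")]
hits = mutations_near_ptms(toy_mutations, toy_sites)
```

Each hit can then be passed to a motif-based predictor to estimate whether the modification is gained or lost.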
So, there is a web server available which works pretty well, and there is also a GitHub page where you can access the tool: you can download it, fork it, modify it and run it locally on your computer.
So, to sum up: mutations affect signaling networks. I don't think there is any doubt that this is happening. There are millions of mutations that have been identified that are associated with cancer, but we still don't know the exact molecular mechanism: what is the association between genotype and phenotype? I think that the integrative analysis of these mutations and PTMs has great potential to shed light on these kinds of relationships. Here I just pointed out two of these tools; there are many out there, and there are also many ongoing efforts. I know that Bing's group is also working on a very nice tool that uses deep neural networks to look into these kinds of relationships. I picked these two because they are easy and intuitive to use and the output is very easy to interpret. This does not mean that these are the best tools. And here I put a lot of references for all of the papers and tools that I have used in my talk.
Today you were introduced to two tools, namely MIMP and ActiveDriverDB, which have been used to explore mutations and their effects in a few cancers. You were also shown how MIMP predicts whether a mutation causes a gain or loss of phosphorylation by calculating a probabilistic score for a given sequence window. ActiveDriverDB is another proteogenomic database that annotates disease mutations and population variants and relates them to PTMs. It also provides information about various molecular interactions from publicly available databases.
There may be more tools freely available, and you are encouraged to try them out; exploring their many features could really help you gain a much better and deeper understanding. In the next lecture, we will look at pathway enrichment using a few widely used software tools. Thank you.