Welcome to the MOOC course on Introduction to Proteogenomics. Having understood the two approaches to pathway enrichment and the basic differences between them, we will now listen to Dr. Karsten Krug, who will explain the use of GSEA for pathway-level analysis of different PTMs, taking phospho data as an example. Dr. Krug will talk about how one can take the analysis of the available data from the gene level to the PTM level. He will also talk about recent work, an initiative at the Broad Institute to build a database curated at the PTM-site level. This involves scoring each phosphosite by mapping it against the database. So let us now welcome Dr. Karsten to talk more about how GSEA can be used to map PTM pathways for the analysis.
So now I am going to talk about phospho data, but this is actually true for all kinds of PTMs. There we measure different phosphorylation sites on proteins, and it might happen that we have multiple phosphosites on the same protein, or that we have different isoforms of the same gene or protein on which we measure different phosphosites, and so on. But the pathway databases are all gene-centric. As I said, a pathway is a list of genes, and right now all of the databases that we have are curated at the gene level. That means if you want to do any pathway-level analysis of phospho data, you first have to collapse all of these phosphosite measurements to the gene level. We are basically throwing away a lot of information, but this is something we have to deal with right now, because there are no databases that are curated at the PTM-site level, or at the protein or isoform level. You would do that, for example, by taking the average or the median per gene, or by picking the most variable site across your samples. Variance means information, so that is why you would pick that one.
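The two roll-up strategies just mentioned (median per gene, or the most variable site per gene) can be sketched in a few lines. This is only an illustration under stated assumptions: the site IDs, gene names and values are made-up toy data, the function names are hypothetical, and missing values are not handled.

```python
from collections import defaultdict
from statistics import median, pvariance

# Hypothetical toy matrix: site ID -> (gene symbol, log-ratios in three samples).
sites = {
    "GENEA_S10": ("GENEA", [1.0, 0.5, 1.5]),
    "GENEA_T25": ("GENEA", [0.0, 0.5, 0.5]),
    "GENEB_Y7": ("GENEB", [-0.5, 1.0, 2.0]),
}

def rollup_median(sites):
    """Collapse sites to genes by taking the per-sample median over all sites of a gene."""
    by_gene = defaultdict(list)
    for gene, values in sites.values():
        by_gene[gene].append(values)
    # zip(*rows) walks the samples column by column.
    return {gene: [median(col) for col in zip(*rows)] for gene, rows in by_gene.items()}

def rollup_most_variable(sites):
    """Keep, for each gene, the single site with the highest variance across samples."""
    best = {}
    for site_id, (gene, values) in sites.items():
        variance = pvariance(values)
        if gene not in best or variance > best[gene][0]:
            best[gene] = (variance, site_id, values)
    return {gene: (site_id, values) for gene, (_, site_id, values) in best.items()}

gene_medians = rollup_median(sites)
gene_top_sites = rollup_most_variable(sites)
```

Either result is a gene-by-sample table that a gene-centric pathway tool can consume. The median smooths over all sites of a gene, while the most-variable-site rule keeps one real measurement per gene.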
This is one example where we have employed this approach. This is again from the breast cancer study published in Nature two years back, where we started with roughly 6,000 phosphoproteins and performed the type of analysis I just described. We used single-sample GSEA to map these phosphoproteins, which we of course first collapsed to genes using a median ratio. We projected those into the space of pathways, 900 pathways, and then we performed consensus clustering on this data matrix. Interestingly, we saw a unique cluster that appeared only in pathway space; we would not have seen it at the phosphosite or phosphoprotein level. So it is one example where this kind of analysis, where you project your data onto a higher level of annotation and then analyze it there, can give you new insights that you would probably have missed otherwise.
Can you please use the mic, so that everybody can understand you?
No, here we are not looking at phosphosites anymore. Each row is a pathway, and the data is the enrichment score I was describing a couple of slides back. So we already combined all phosphosites to phosphoproteins to genes, and then we performed our signature projection method. These are the 77 breast cancer samples that we are looking at here, and in each sample we have an enrichment score for this particular pathway. Red means this pathway is more active, blue means this pathway is less active in this particular sample. In this case we have then done an unsupervised cluster analysis on that.
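To make the "one enrichment score per pathway per sample" matrix concrete, here is a minimal sketch of a GSEA-style running-sum score applied to each sample separately. It is a simplification under stated assumptions, not the implementation used in the study: the function names, the toy values and the choice to report the signed maximum deviation of the running sum are all illustrative, and the actual single-sample GSEA tool may summarize and normalize the running sum differently.

```python
def enrichment_score(values, gene_set, weight=1.0):
    """GSEA-style score for ONE sample.

    values: dict gene -> expression value (for example a log-ratio) in that sample.
    Genes are ranked from high to low; walking down the ranking, the running sum
    goes up at genes in the set (weighted by |value|) and down at all other genes.
    The signed maximum deviation from zero is returned.
    """
    ranked = sorted(values, key=values.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    hit_norm = sum(abs(values[g]) ** weight for g in hits)
    if not hits or n_miss == 0 or hit_norm == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked:
        if gene in gene_set:
            running += abs(values[gene]) ** weight / hit_norm
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def project(samples, pathways):
    """Build the pathway-by-sample matrix of enrichment scores."""
    return {
        name: {sample: enrichment_score(vals, genes) for sample, vals in samples.items()}
        for name, genes in pathways.items()
    }

# Hypothetical toy data: two samples, four genes, one pathway.
samples = {
    "sample1": {"A": 3.0, "B": 2.0, "C": -1.0, "D": -2.0},
    "sample2": {"A": -2.0, "B": -1.0, "C": 2.0, "D": 3.0},
}
pathways = {"pathway_AB": {"A", "B"}}
scores = project(samples, pathways)
```

In this toy case the pathway scores positive in sample1, where its genes sit at the top of the ranking, and negative in sample2, where they sit at the bottom. A matrix like `scores` is what would then be clustered or drawn as a red and blue heat map.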
A very convenient way to combine multiple phosphosite measurements that map to the same gene to a gene-centric level is Morpheus, which is a very versatile matrix visualization tool. We are going to use it again in the hands-on session. It is very powerful, but we are only going to use it for this particular purpose here. Just to repeat what I have just said: we start with a data matrix, which can be proteins, phosphosites, genes or transcripts, but the first step that we always have to do is roll up our expression values to the level of genes, because all of the databases that we use for pathway analysis are gene-centric. After we have done that, we can continue with our pathway analysis employing GSEA, single-sample GSEA or DAVID. There are so many different tools and approaches out there, and it is a matter of taste which one you like more. But we highly recommend and prefer some sort of GSEA, gene set enrichment type of analysis, because you do not have to filter your list beforehand.
As I just told you, all of the databases that are available now are curated at the gene-centric level. I want to introduce you to a project that we are doing at the Broad together with many other collaborators, where we try to come up with a pathway database that is curated at the site level, at the phosphosite level. This involves many people and many resources. Let us stay here for a moment. We call it the PTM signatures database (PTMsigDB), and the goal is to be able to score each and every phosphorylation site directly against the database. That is a large curation effort, as you can imagine. It is already very difficult to curate pathways at the gene level, but here you want to go deeper and curate every single phosphorylation site: what does it do in this pathway?
Does it go up or down, and is it involved at all? So it is a lot of curation effort. Most of the signatures are human; we started to do that for mouse and rat too, but it is mostly human. We teamed up with other database curators, from PhosphoSitePlus, from NetPath and also from WikiPathways, as I mentioned earlier, and with the help of all of the people involved here we started to curate this database.
Another very important aspect when you want to do something like this is how to represent a PTM site robustly, which might sound trivial but is actually a big problem. Gene symbols have been standardized in a way: there are the HUGO gene symbols, which try to harmonize and standardize human gene symbols. In protein databases, the same protein might have a different accession number in each database. So this is the UniProt database; if you look up the same protein in RefSeq, it will have a completely different, unrelated ID, so it is very difficult to cross-reference those. And this is an even more severe problem for PTM sites. So how do you robustly represent a PTM site?
There are different ways we try to approach this problem. One is UniProt-centric. UniProt is a highly curated protein database, so we picked UniProt, and we represent the site as the UniProt ID, the modified residue and the PTM type. In the database we also have the information whether the site goes up or down in a specific pathway or perturbation. Another way to represent a phosphosite is by its flanking sequence; this is what we looked at this morning. We look at plus or minus six or seven amino acids around the phosphorylation site. This is already a pretty unique identifier within the human proteome.
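Both site representations can be illustrated with a short sketch. Everything here is an assumption made for illustration: the separator characters in the UniProt-centric string, the `_` padding at protein termini, the accession `P00000` and the protein sequence are all made up, and the function names are hypothetical.

```python
def uniprot_site_id(accession, residue, position, ptm="p", direction=None):
    """UniProt-centric identifier: accession, modified residue, PTM type,
    and optionally the direction ('u' for up, 'd' for down)."""
    site = f"{accession};{residue}{position}-{ptm}"
    return f"{site};{direction}" if direction else site

def flanking_window(protein_seq, position, flank=7, pad="_"):
    """Sequence window of 2*flank+1 residues centred on a 1-based site position.
    Windows that run past a protein terminus are padded."""
    if not 1 <= position <= len(protein_seq):
        raise ValueError("position lies outside the protein sequence")
    idx = position - 1
    left = protein_seq[max(0, idx - flank):idx]
    right = protein_seq[idx + 1:idx + 1 + flank]
    return left.rjust(flank, pad) + protein_seq[idx] + right.ljust(flank, pad)

# Hypothetical isoforms: isoform B carries two extra N-terminal residues,
# so the same serine is residue 16 in isoform A and residue 18 in isoform B.
isoform_a = "ACDEFGHIKLMNPQRSTVWY"
isoform_b = "GG" + isoform_a

site_id = uniprot_site_id("P00000", "S", 16, direction="u")
window_a = flanking_window(isoform_a, 16)
window_b = flanking_window(isoform_b, 18)
```

The residue number differs between the two isoforms, but the flanking window is identical, which is why the window is the more robust identifier.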
PhosphoSitePlus is also trying to come up with an unambiguous way to group sites across or within protein families. Keep in mind that the residue number might change between isoform A and isoform B: it might be the same site, but the residue number might be completely different. The site group ID tries to harmonize those kinds of cases.
Okay, so I will quickly go through here. Right now we have pathways, we have kinase-substrate signatures, which we talked about, and we have a lot of perturbations by growth factors and small molecules in this database. Most of the pathways have been completely manually curated by curators from NetPath and WikiPathways, and we also extracted signatures semi-automatically and fully automatically. So we tried to come up with a standardized way to automatically derive these kinds of signatures from the published literature. Let me quickly go through these slides here.
In order to extract these signatures, we came up with a method to extract consensus signatures. Let's say you have one perturbation or one pathway that has been studied by different labs. One paper might report that a site goes up upon this perturbation; another study reports the same site, but it goes down. This kind of inconsistency you find all over the place. It might be because the labs used different cell types, experimental conditions or protocols. So what we try to do is find a consensus between at least two independently published signatures before we include a signature in our database. We use an approach very similar to GSEA; we actually extended that scoring scheme in order to look for these signatures.
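The idea of requiring agreement between at least two independent studies can be sketched with simple vote counting. Note that this is a deliberately simpler stand-in, not the GSEA-like scoring scheme described in the talk: the rule used here (a direction needs at least two supporting studies and more support than the opposite direction), the study names and the site names are all assumptions for illustration.

```python
from collections import Counter

def consensus_signature(reports, min_support=2):
    """reports: dict study -> dict site -> 'u' (up) or 'd' (down).

    A site enters the consensus only if one direction is reported by at least
    `min_support` studies and by more studies than the opposite direction.
    """
    votes = {}
    for sites in reports.values():
        for site, direction in sites.items():
            votes.setdefault(site, Counter())[direction] += 1
    consensus = {}
    for site, counts in votes.items():
        up, down = counts["u"], counts["d"]
        if up >= min_support and up > down:
            consensus[site] = "u"
        elif down >= min_support and down > up:
            consensus[site] = "d"
    return consensus

# Hypothetical reports for one perturbation from three studies.
reports = {
    "study1": {"site1": "u", "site2": "u", "site3": "d", "site4": "u"},
    "study2": {"site1": "u", "site2": "d", "site5": "d"},
    "study3": {"site3": "d", "site5": "u"},
}
signature = consensus_signature(reports)
```

Here `site1` and `site3` are kept because two studies agree on their direction, while `site2` and `site5` are dropped as contradictory and `site4` is dropped because only one study reports it.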
We tested that against a very well-studied data set of EGF-stimulated HeLa cells, which was published a couple of years back, in 2014. It is a very good systems biology data set for testing computational tools. We picked it because the authors used nocodazole to mitotically arrest the HeLa cells and EGF to stimulate phosphorylation signaling in general, and both of these signatures are in our database. So this was our benchmark data set: if we can pick up these signatures, I think we are on a good track and what we are doing is the right way.
What you are looking at here is the heat map of enrichment scores. Each row is the enrichment score of a signature, and here are the different experimental conditions: this is DMSO, blue is EGF, green is nocodazole, measured in different numbers of replicates. So here are four replicates, four replicates, and here six replicates. And luckily, for the nocodazole signature, we observe the highest enrichment in the nocodazole-treated samples, here in green, so that's a good thing, and the same is true for EGF. And in the control samples we don't see any consistent enrichment.
So this was our global approach to show that we are on a good track, and then we specifically focused on a clustering metric. How well do these enrichment scores cluster our data, and how does that compare to a gene-centric approach? Here on the left is how clean our clustering is with a site-centric approach, and this is how clean it is if we first project to genes and then do the same type of analysis. We see that in the site-centric analysis we get a very clean clustering. The higher these bars are, the better the clustering represents the data.
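The talk does not name the clustering metric behind the bars. As an assumption, one common way to quantify how cleanly samples group by treatment is the average silhouette width, sketched below with Euclidean distances on made-up two-dimensional score vectors. Higher values mean that samples lie closer to their own treatment group than to the nearest other group.

```python
from math import dist
from statistics import mean

def silhouette(points, labels):
    """Average silhouette width of `points` grouped by `labels` (needs >= 2 groups)."""
    groups = set(labels)
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:  # a group with a single sample gets width 0 by convention
            widths.append(0.0)
            continue
        a = mean(same)  # mean distance to the sample's own group
        b = min(        # mean distance to the nearest other group
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in groups if other != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)

# Hypothetical enrichment-score vectors for four samples.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
clean = silhouette(points, ["EGF", "EGF", "DMSO", "DMSO"])
mixed = silhouette(points, ["EGF", "DMSO", "EGF", "DMSO"])
```

With the labels matching the two tight groups the width is close to 1, and with the labels mixed across the groups it turns negative. This is the kind of difference a bar chart of clustering quality would show.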
What we can also see in the gene-centric space is that samples from the control and the [unclear] are clustering together, which is not what we would expect. And if you just look at the EGF and nocodazole signatures alone, and compare the signature scores across the different treatments between site-centric and gene-centric, we definitely see a higher enrichment in the site-centric approach than in the gene-centric approach, for EGF and also for nocodazole. That means we get a better signal of the biological pathways if we do it site-centric compared to gene-centric.
I think I am going to wrap up here. These tools are available on GitHub and GenePattern. There are many other tools that do similar kinds of things, like PHOXTRACK, which specifically look at kinase-substrate interactions, but there is no other database that can do pathway or perturbation analysis at the site level. There are other databases that do kinase-substrate analysis, as we talked about this morning. And tomorrow I think you are going to hear about WebGestalt and so on. With this I want to end my talk. Here I have put some references, and I am open for one or two questions before Bing is going to talk. Any questions?
We recommend using the sequence windows, which I know you don't have in your data set. But if you represent a phosphosite by a sequence window, by a flanking sequence, it is a much more robust identifier.
Yes, you don't have to, but you are on the safer side if you do so.
So today, in conclusion, you learnt about GSEA and how it can be used for pathway analysis for all the different kinds of PTMs. We also learnt that all the pathway databases are gene-centric, curated at the gene level, which may dilute your inference. Hence we need a database curated at the PTM-site level for proper analysis of PTM data. We also heard why handling different IDs and different isoforms during curation is important but very challenging. We also heard about the three signature categories of PTMsigDB: perturbational signatures, signatures of molecular signaling pathways, and signatures of kinase-substrate interactions. Signatures of PTM sites can be curated manually, semi-automatically or fully automatically. Based on this they have also developed PTM-SEA, a modified version of GSEA, to look at the signatures of PTMs. We also saw with an example that site-centric analysis groups the samples more cleanly than the gene-centric one. The next lecture is going to be a hands-on exercise by Dr. Karsten to show us how one can use GSEA for pathway enrichment. Thank you.