Welcome to the MOOC course on Introduction to Proteogenomics. We have finished a couple of hands-on sessions on mass spectrometry data analysis using the Pro-TG software. You have also gone through many of the basic concepts of looking at mass spectrometry data. Today we have another distinguished scientist, Dr. Debasis Dash, who is a professor at CSIR-IGIB in Delhi. Dr. Dash is going to talk about integrative proteogenomics approaches for understanding human proteoforms. As you know, proteoforms are the various forms of a protein molecule that arise after different modifications in a living system. In this lecture Dr. Dash will talk about the importance of proteoforms in the human system, with reference to a research paper by Dr. Ruedi Aebersold and by sharing some of the work done in his own lab. Dr. Dash will also tell you about some of the repositories available for looking at proteoforms. So let us welcome Dr. Debasis Dash to tell us about integrative approaches, what proteoforms are, and their role in clinical biology.
The topic I chose to share with you today is an integrative proteogenomics approach to unravel human proteoforms. This title has three terms that need attention. One, of course, is proteogenomics; the conference is on that. The second is proteoform; we will try to understand what these proteoforms are. The third is the integrative approach. The integrative approach is what I mostly work on, so my major thrust will be the approach by which we identify proteoforms. The term was very nicely covered in Nature Methods in 2013 and later, in 2014, in Expert Review of Proteomics.
Proteoforms are all the alternate forms of a protein. They can arise from alternative splicing of the mRNA, from sequence variation, from translational errors, or from amino acid modifications; all of these put together can generate a variety of proteoforms of the same protein. Now, all this while we have been talking about the missing proteins. We had a catalog of human proteins, and we were looking for proteins that have transcript evidence but no protein evidence. From there we move on to identifying all the proteoforms. There are expected to be around 1 lakh proteoforms; the number can vary, but this is what people guess. Which of those proteoforms are active, which are functional, and which are involved in diseases? Discovery of these proteoforms can give a better understanding of how tissues function. Ruedi covered this area very nicely in Nature Chemical Biology this year, 2018: how proteoforms will be implicated in the future of biology. So first of all we need to understand where these proteoforms are expressed and what their possible roles are. But before we go there, a tissue-wise atlas of proteoforms needs to be built. That is the research topic my student Anurag, who is in the audience, works on, and some of the work I will present was done by him.
Now, I have kept a couple of basic slides for those in the audience who are new to this area. How do we detect peptides in a shotgun proteomics mass spectrometry experiment? The left-hand side, the experimental part, which I never do, is protein extraction, digestion, injection into the machine, and acquiring the spectra. My lab starts from there and does the right-hand side job.
The first step is creation of a database, which is very important: unless we create the right kind of database, we will not get the answer. I will delve a little into database creation later. Next is generation of theoretically digested peptides. This is rule based, so there is nothing much here. Then comes simulated fragment generation for each peptide, which gives us a theoretical spectrum. What is needed after that is matching the theoretical spectra with the experimental spectra and giving each match a score.
Here is an example. I have written three words here: allergy, gallery and largely. They are made of the same letters. Read as peptides, their amino acid composition is the same, but the peptides are different. In that case the answer lies in the MS/MS fragmentation pattern, the fragments that will be generated from these three peptides. The matching is done by a peptide-spectrum match scorer. This differs between algorithms: how Mascot works, how SEQUEST works, how X!Tandem works; all these scorers differ in the way they assign a score. However, every match will get some score or the other. So it depends on us, or on the method, to say who has passed and who has not. What is the passing score? Nobody knows, and in fact that can be a debate.
To decide this, people create a decoy database. A decoy database is a falsified database, one that does not contain the natural proteins. It may contain the proteins read from right to left, or randomized, shuffled sequences, and it is searched alongside the target sequences. When you draw a threshold, that threshold divides the true positives from the false positives. Many search engines are available to do this, and you are familiar with most of them. MassWiz was developed in my lab; all the others are available in the public domain, and many more have come along.
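The allergy, gallery, largely example can be sketched in a few lines. This is only an illustration, assuming standard monoisotopic residue masses and singly charged b and y ions; the function names and the shared-peak count used as a score are hypothetical and are not the scoring function of Mascot, SEQUEST, X!Tandem or MassWiz.

```python
# Monoisotopic residue masses (Da) for the six amino acids in the example words.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "L": 113.08406,
    "E": 129.04259, "R": 156.10111, "Y": 163.06333,
}
WATER = 18.01056   # mass of H2O added to the intact peptide and to y ions
PROTON = 1.00728   # charge carrier for singly charged ions


def precursor_mass(peptide):
    """Neutral monoisotopic mass of the intact peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def b_ions(peptide):
    """Singly charged b-ion m/z values (N-terminal prefixes)."""
    masses, running = [], 0.0
    for aa in peptide[:-1]:
        running += RESIDUE_MASS[aa]
        masses.append(round(running + PROTON, 4))
    return masses


def y_ions(peptide):
    """Singly charged y-ion m/z values (C-terminal suffixes)."""
    masses, running = [], WATER
    for aa in reversed(peptide[1:]):
        running += RESIDUE_MASS[aa]
        masses.append(round(running + PROTON, 4))
    return masses


def shared_peak_count(candidate, observed, tol=0.01):
    """Toy score: number of theoretical fragments found in the observed peaks."""
    theoretical = b_ions(candidate) + y_ions(candidate)
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)


peptides = ["ALLERGY", "GALLERY", "LARGELY"]

# Pretend the instrument recorded a perfect spectrum of GALLERY.
observed = b_ions("GALLERY") + y_ions("GALLERY")

for p in peptides:
    print(p, round(precursor_mass(p), 4), shared_peak_count(p, observed))
```

All three peptides have the same precursor mass, so MS1 alone cannot separate them; only the fragment ladder singles out one candidate.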
What they do is give us a lot of peptides, each identified with a score. Some of them are true hits and some are false. In a decoy database, on the other hand, we know for sure that the peptides we got, and the score distribution we got, are all false. A comparison between the target and the decoy results, with a proper threshold, will tell us exactly what the passing score is. The measure we generally use is the false discovery rate. One incorrect identification in every 100 is a 1% false discovery rate; at a 5% false discovery rate, 5 incorrect are allowed in every 100. This was again very nicely covered by Nesvizhskii in Journal of Proteomics, 2010. If you can, go and read this article; it is very nicely written. In the score-sorted list you see target, target, target, and all of a sudden a decoy comes. So the FDR is the number of decoys divided by the total number of peptides identified from the target. This is how we get the FDR at the peptide level, at the PSM level. The next challenge is the protein-level FDR, which we will see. Now, the decoy database, as I said, is a reversed or randomized alteration of the original sequences, so that we can keep the amino acid composition intact. We keep all the other properties of the protein while shuffling the amino acids. This I have already covered. So this is 0% FDR: no red, that is, no decoy hit, is above this score. But this is too purist an approach. What we generally do is lower the bar in such a way that we allow a few reds and get some more greens into our search results. That is how the PSMs are obtained, and from there the story begins: we get peptides, we map these peptides back to the proteins, and from this we infer which proteins are true for our experimental data.
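The FDR definition above, decoys divided by target identifications at or above a score, can be sketched as follows. The scores are made-up numbers for illustration, and the function names are hypothetical; real pipelines differ in how they handle ties and in the exact estimator they use.

```python
def fdr_at_threshold(psms, threshold):
    """FDR = decoy hits / target hits among PSMs scoring at or above threshold.

    psms is a list of (score, is_decoy) pairs.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_passing_score(psms, max_fdr=0.01):
    """Lowest score cutoff whose FDR is still within max_fdr (None if no cutoff passes)."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, score) <= max_fdr:
            best = score
    return best


# Illustrative PSM list: True marks a decoy ("red"), False a target ("green").
psms = [
    (95, False), (90, False), (88, False), (85, True), (80, False),
    (78, False), (70, False), (65, True), (60, False),
]

print(fdr_at_threshold(psms, 88))       # purist cutoff: no decoy above it
print(lowest_passing_score(psms, 0.2))  # relaxed cutoff that admits a few decoys
```

Lowering the cutoff from 88 to 70 keeps three more target PSMs at the cost of one decoy hit, which is the trade-off between the purist and the relaxed threshold.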
There are two ways to do the FDR calculation: one is a concatenated search, the other a separate search. In a concatenated search you merge the target and the decoy into one database, whereas in a separate search you search the target separately, you search the decoy separately, and then you apply the formula I just described: FDR is the ratio of decoy to target hits. What I have said so far is very generic; it was a brief narration for those who may not have been familiar with FDR. In the proteogenomics case, you take the genomic sequence, translate it computationally to create protein sequences, and thereby inflate the database size. When the database is inflated, the chance of false discoveries increases enormously. Earlier we used to take, say, only the known proteins from Swiss-Prot and search our mass spec data against a limited number of proteins. In proteogenomics I take all the theoretical ORFs in the case of prokaryotes, or the translated transcriptome in the case of eukaryotes, and inflate the database size. Why does the chance of false discovery increase when we inflate the database? Suppose you are looking for a place, let us say Bhubaneswar, and I have given you only the map of Orissa. Once you find it, it will be correct. Now I give you a world map and ask you to search for Bhubaneswar. There are chances that you will find another city with a similar name, with one letter changed here and there, and then you start wondering which one is right and which one is wrong. The same thing happens here: as soon as you increase the database size, your false discovery rate increases. But you have to do proteogenomics, so the database size has to increase. So you have to find out how to limit the false discovery rate even though you search a larger database. Any suggestions?
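A concatenated target-decoy database of the kind described above can be sketched like this. The protein names, the sequences and the `DECOY_` prefix are hypothetical; the only points illustrated are that reversed or shuffled decoys keep the amino acid composition of the target, and that a concatenated search uses one merged database.

```python
import random


def reverse_decoy(sequence):
    """Decoy made by reading the protein from right to left."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Decoy made by shuffling residues; the composition stays identical."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def concatenated_database(targets, decoy_prefix="DECOY_"):
    """Merge targets and their reversed decoys into a single search database."""
    database = dict(targets)
    for name, seq in targets.items():
        database[decoy_prefix + name] = reverse_decoy(seq)
    return database


# Toy target proteins (made-up sequences).
targets = {
    "protein_1": "MKTAYIAKQRQISFVK",
    "protein_2": "MSLLERGYALLK",
}

database = concatenated_database(targets)
for name, seq in database.items():
    print(name, seq)
```

For a separate search, the same decoy sequences would instead be written to their own database and searched in a second run.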
I need to increase my database size because I have to do proteogenomics, but at the same time I want to reduce my false discoveries. What do I do? What is the way out? The answer is on the next slide, but for the sake of interaction, please participate; you can be wrong, no problem. Select the peptides with very high scores, that is, a purist model. The chances of being wrong will be less. That is one way, but you will definitely lose many correct peptides in the process. Any other way? Replication, ok. So in two different experiments you look for the same peptide being identified. That forces me to go to the biologists and ask them to do a replicate study. Of course that is always a good idea, but we can also increase the number of search engines. We can use different search engines, take their results, and hope that multiple search engines will not simultaneously fail and give the same wrong result. This is one way we thought of in the computational lab to improve our search results: include results from multiple search engines. So what did we do? We created a pipeline that automates this whole process and takes results from various search engines, one in-house and others from other sources, then analyzes the data, in the hope that the chance of being wrong is less. What happened? When we did this we got a scenario like this, and another question is coming your way. The world is not that simple: all the search engines gave different results. Now what do I do? Whom do I trust? What do you do if you get results like this? Same experiment, same database, same search parameters, yet the search engines are giving you different results. You are a PhD student and you have to take a call now.
Take the overlap, somebody said; consensus, from here. I think both of you are saying the same thing. If you agree with this idea there is no need to repeat it, but is any other radical idea coming? Elimination? On what basis? On their scoring values, ok. So a different idea is coming: compare their scores and then eliminate the weaker ones. Now, my problem is this. A student coming out of IIT Bombay, even with 70 percent marks, may be a smart student, while a student from an unknown university in a remote place may get 95 percent and still not be as smart. Our evaluation processes are not standardized. In the same way, scores from different search engines are not on a common scale, so relying on the raw score was probably not a smart way of doing it. But of course we were thinking along that line: can we re-score them? Can we create an entrance examination that all of them have to reappear for and pass? We were thinking of something along that line, but our first thought was what you suggested: take the consensus. It is the easiest and probably a bit safer, but definitely not the smartest. So we went ahead with this: we took all the peptides that were identified by two or more algorithms, built our story, and published. That is generally what you do under pressure. But we were not happy with the way we, as computational biologists, had handled the problem. So we started observing the behavior of the FDR. Look at this curve and try to understand how the FDR behaves as the score decreases. The score decreases to the right and the FDR is on the y-axis. The FDR was 0; all of a sudden a red bullet comes and the FDR shoots up. Then more and more greens come and the FDR goes down, and then another red hit comes and it goes up again. So this is how the FDR jumps up and down.
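The consensus rule mentioned above, keeping peptides identified by two or more algorithms, is simple enough to sketch. The peptide lists below are invented for illustration, and `consensus_peptides` is a hypothetical helper, not part of any published pipeline.

```python
from collections import Counter


def consensus_peptides(results_by_engine, min_engines=2):
    """Peptides reported by at least min_engines different search engines."""
    counts = Counter()
    for peptides in results_by_engine.values():
        counts.update(set(peptides))  # count each engine at most once per peptide
    return {peptide for peptide, n in counts.items() if n >= min_engines}


# Made-up identifications from three engines run on the same data.
results_by_engine = {
    "MassWiz": ["ALLERGYK", "GALLERYR", "LARGELYK"],
    "OMSSA": ["ALLERGYK", "GALLERYR"],
    "X!Tandem": ["ALLERGYK", "SIMPLEK"],
}

print(sorted(consensus_peptides(results_by_engine)))
print(sorted(consensus_peptides(results_by_engine, min_engines=3)))
```

The rule is safe but wasteful: a correct peptide found by only one engine is discarded, which is why the talk moves on to putting all engines on a common FDR-based scale.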
Now, the problem for us was that a peptide identified with a higher score could have a higher FDR than a peptide identified with a lower score. This is not acceptable. How can a candidate with a higher score still have a higher false discovery rate? So what we did was create a step function, joining the points with a line at the base of the next FDR value. This was fairly ok with us, though the FDR was still the same for this peptide as for that peptide, for this score as for that score. The best result came when we joined these points through a linear regression line. Now we have a curve where the FDR goes upward as the score goes down. What was interesting is that for all the methods, irrespective of whether it is MassWiz, SEQUEST, OMSSA or X!Tandem, the behavior of this green line was preserved. So it was easier for us to define a cutoff on this FDR score and use it for all the methods to choose the peptides. We took the E-value, the P-value, whatever scoring metric each method had, converted it this way for all the methods, and at a given cutoff selected from all these algorithms the peptides that satisfied the cutoff criterion. So that part is over. Integrating search results from multiple algorithms and getting a pool of peptides from there is what we could achieve. But the main problem was to identify proteoforms. How do I now get the proteoforms? We have created a translated transcriptomic database, we have run multiple algorithms, and we have rationalized the results from them. Now, can we have an end-to-end proteogenomics solution for a mass spectrometry person coming with data? To do that we needed a bridge, and we constructed this bridge.
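The two corrections described above can be sketched as follows: first a step function that never lets a higher-scoring PSM carry a higher FDR than a lower-scoring one, then a straight-line join between the corners of the steps so that each score gets its own value. This is an interpretation of the spoken description, not the lab's published code; the data and function names are hypothetical.

```python
def raw_fdr_curve(psms):
    """Running decoy/target ratio down the score-sorted list of (score, is_decoy)."""
    targets = decoys = 0
    curve = []
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        curve.append((score, decoys / max(targets, 1)))
    return curve


def monotone_fdr(curve):
    """Step function: each PSM gets the lowest FDR seen at its score or any lower score."""
    stepped, running_min = [], float("inf")
    for score, fdr in reversed(curve):
        running_min = min(running_min, fdr)
        stepped.append((score, running_min))
    return stepped[::-1]


def interpolated_fdr_score(steps, score):
    """Join the step corners with straight lines and read off the value at score."""
    corners = {}
    for s, q in steps:
        corners[q] = min(s, corners.get(q, s))  # lowest score reached at each level
    points = sorted((s, q) for q, s in corners.items())  # ascending score
    if score <= points[0][0]:
        return points[0][1]
    if score >= points[-1][0]:
        return points[-1][1]
    for (s_lo, q_lo), (s_hi, q_hi) in zip(points, points[1:]):
        if s_lo <= score <= s_hi:
            frac = (score - s_lo) / (s_hi - s_lo)
            return q_lo + frac * (q_hi - q_lo)


psms = [
    (95, False), (90, False), (88, False), (85, True), (80, False),
    (78, False), (70, False), (65, True), (60, False),
]
steps = monotone_fdr(raw_fdr_curve(psms))
for score, q in steps:
    print(score, round(q, 4), round(interpolated_fdr_score(steps, score), 4))
```

Because the interpolated value depends only on decoy and target counts, it can be computed the same way for any engine's native score, which is what makes a single cutoff usable across engines.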
We named it GenoSuite. The rest of the talk will be a little boring, because I will beat my own drum: this is what we have done. But nevertheless, see what we did. For prokaryotes it was much easier for us; six-frame translated database creation was a cakewalk, and we could get the genome re-annotated with newly identified ORFs. For eukaryotes, however, we had a lot of difficulty, because we had to create the three-frame translated transcriptome and then incorporate all the possible alternative splice forms into our proteome. To do that we created this pipeline by taking the best tools available elsewhere. We did not write any of these codes. We took SRA, Trimmomatic, STAR, Cufflinks, whatever was available for analyzing RNA-seq data. All we require is a set of protein sequences that represents the transcriptome, and this is our search space. Then, using multiple search engines, we wanted to get the peptides and from the peptides infer the proteins. I am using the word infer because this is a bottom-up approach: what we actually get are the peptides, but we present it as if we now know which protein was there. So from these blocks we infer which proteins we probably had. From the first part, the prokaryotic story, we published several papers in which, using this pipeline, we could identify new translated regions in Shigella flexneri, Bradyrhizobium japonicum and Methylobacterium extorquens. If there are students in this audience who are computationally oriented and want to do something, here is some low-hanging fruit for you as a researcher. Take mass spectrometry data from the internet, take genome and proteome data from the internet, use some of these tools, and start re-annotating the genome using the experimentally available mass spec data and the static information of the genome.
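Six-frame translation, the step called a cakewalk above, can be sketched with the standard genetic code. This is a minimal illustration rather than the GenoSuite implementation; the function names are hypothetical, and a real pipeline would also split the translations at stop codons and apply a minimum ORF length.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    """Sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frame_translation(dna):
    """Translations of three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

For the three-frame translation of a transcriptome, only the forward frames are needed, because an assembled transcript already fixes the strand.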
You can do wonders sitting in one place with a computer and an internet connection. I could get some of this done through the trainees who come to my lab; we could re-annotate genomes and identify novel translated regions. The resources for that are already available. Browse the internet and you will find many places from which you can download mass spec data. For our purposes, since the prokaryotic part was already done, we wanted more challenging jobs. So we wanted to go to human proteoforms. We looked at the resources, and with a lot of effort and difficulty we could arrive here, although it looks pretty simple to you. From the resources it was difficult for us to identify which projects would give us brain-specific, blood-specific, lung-specific and other tissue-wise mass spec data, because you cannot download the entire data and then annotate and segregate it. We wanted to create a pipeline that would talk to the PRIDE database, the MassIVE database and other resources, so that once you type brain it fetches all the brain-related mass spec metadata and gives it to you for analysis. For that we created this human tissue skip, and these are the statistics of the PRIDE projects, where you can see how many projects we have per tissue. We grouped them on the basis of their group identification, DOI, or publication date. It took us several months to analyze, tissue by tissue, where the proteoforms are, and after doing the analysis we would realize: oh, this is not the human tissue I was looking for, it is just a cell line.
So we had to do a lot of back and forth, and we learned many lessons while doing all this. It is not that straightforward: you type lungs, and what you get may be some cell lines that people have already analyzed. Right now my student Anurag, who is here, is focusing only on brain, and this is only a handful of data sets that we have analyzed using this EuGenoSuite strategy. Using neXtProt, Swiss-Prot and GENCODE, we could identify several proteoforms in various tissues, and all these proteoforms have now been... Ok, this is interesting. I have run out of time; I need another 5 minutes. Ok, sorry. This is something very interesting, another puzzle that is yet to be solved in a computational pipeline manner; right now a lot of manual involvement is required. See, these are the isoforms and these are the peptides. It was much easier for us only when we had a unique peptide for a particular proteoform, one that is not shared with other proteoforms. Such proteoforms could be identified, and they have been put into the database: these are all the proteins, their functions, their gene names, and the number of proteoforms for each of them. For example, if you look at the tau protein, you see the various proteoforms of tau, in how many different projects they were seen, and how many distinct peptides were identified for that particular protein. But then this came up. What is this? These are different proteoforms of tau and these are different peptides. This puzzle is for you, not for me: which proteoform is present? If you see data like this, what would be your answer?
Somebody took a stand first. Peptide 3, is that what you are saying? These two.
Sir, peptide 5 is mapping to 3 different proteoforms.
Yeah.
And 6 is mapping to 2.
Yeah.
Sir, Tau-F is there.
Yes. Based on this data you will say Tau-F is there. Ok.
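The puzzle posed to the audience, shared versus unique peptides, can be sketched with toy data. The isoform names and sequences below are invented and are not real tau proteoforms; `map_peptides` and `unique_evidence` are hypothetical helpers that only show why a peptide mapping to a single proteoform is the easy case.

```python
def map_peptides(proteoforms, peptides):
    """For each peptide, list the proteoforms whose sequence contains it."""
    return {
        peptide: sorted(name for name, seq in proteoforms.items() if peptide in seq)
        for peptide in peptides
    }


def unique_evidence(mapping):
    """Proteoforms supported by at least one peptide that maps nowhere else."""
    return {names[0] for names in mapping.values() if len(names) == 1}


# Toy isoforms differing at one internal region (made-up sequences).
proteoforms = {
    "isoform_A": "MKTAYIAKQRQISFVK",
    "isoform_B": "MKTAYIAKQRGGSFVK",
    "isoform_C": "MKTAYLLKQRQISFVK",
}
peptides = ["MKTAY", "QRQISFVK", "QRGGSFVK"]

mapping = map_peptides(proteoforms, peptides)
for peptide, names in mapping.items():
    print(peptide, names)
print("unambiguous:", unique_evidence(mapping))
```

Here only isoform_B has a peptide of its own; the presence of isoform_A or isoform_C can be argued only from shared peptides, which is the kind of ambiguity being debated.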
So there is no simple solution to this problem. Even on the answer I will give you, we could spend another half an hour debating, but this is how we chose to do it. We mapped all these peptides onto their transcripts. As you can see, for every match we make a statement: for the pink peptide there is evidence for Tau-D, fetal tau, E and F, but not G. For the green peptide, which is here, there is evidence for this one but not this one. So: this but not this, this but not this, through a series of statements like that. Because all of this could not be shown on a single slide, I have broken it across another slide. So there is evidence for Tau-D and fetal tau. Comparing all of them together, we gave the answer that these are most likely the three proteoforms present in my sample; even the answer is only most likely. As I told you very clearly, we can debate for another hour, or overnight, on why this is possible and why that is not. So we have to go back to the data and look at the peptides. This particular peptide is a unique peptide, which clearly shows that after E the AE comes, which is very difficult to read from here. That tells us only one proteoform is possible; no other proteoform can explain such a peptide sequence.
So, I hope from this lecture by Dr. Debasis Dash you got a glimpse of how one can process proteomic samples and prepare a database to facilitate understanding and analysis of mass spectrometry data. You also learnt about the acceptable limits and the role of the false discovery rate in mass spectrometry data analysis. We also learnt about the hurdles that come with having multiple algorithms available for data analysis, as Dr. Dash explained possible ways to filter and select the identifications reported by different algorithms.
We also learnt about various resources for database searching, such as UniProt, neXtProt and GENCODE, which can be used to build a customized database for a study. Use of hubs prot for accessing already reported proteoforms of a gene could be another valuable resource. In the next lectures we are going to shift gears, and Dr. Mani will take you through workflows for automated data processing. Thank you.