Welcome to the MOOC course on Introduction to Proteogenomics. You have studied genomics, proteomics and the various tools involved in proteogenomics. We thought we would give you some case studies and examples of how proteogenomics can be used for some applications. In today's proteogenomics case study we have a TA of this course, Ms. Shalini Agrawal, who is a research scholar at IIT Bombay. She will talk to you about how proteogenomic analysis has been applied to some clinical applications. She will talk about malaria and the ways in which proteogenomics research has contributed to understanding the parasite Plasmodium vivax, which is a less studied, less known parasite but is now causing many more problems in India and many other parts of the world. She will talk about how proteogenomics can help in better understanding the parasite, as proteogenomics is now emerging as a broad tool for solving various problems, and especially in the clinical field it has started showing its applications. So let us welcome Shalini to give her lecture about proteogenomics case studies on malaria.
Hi all. My name is Shalini Agrawal and I am a research scholar at IIT Bombay. In today's lecture I am going to tell you about proteogenomics and infectious diseases such as malaria. In this lecture we are going to cover three things: advancements in proteogenomics, problems related to Plasmodium vivax, and then the evolution of the proteome database of Plasmodium vivax using a proteogenomics approach. For reference, first of all I would like to explain what we mean by proteogenomics and how one can use it to understand the biology of an organism better.
In a paper by Ruggles et al. in 2017, with the title advancements in proteogenomics methods, tools and current perspectives, they have explained the three major parts of the central dogma, that is DNA, RNA and protein, and how these three components can be used for sequencing and analysis of an organism to understand the organism better. As you can see in this slide, in the first case, DNA, the DNA can be analyzed by two different methods: one is WGS and the other is WXS. WGS stands for whole genome sequencing. Whole genome sequencing is done when you do not know anything about the organism and you are trying to study that organism for the very first time. So you extract the genome of the organism and you go for whole genome sequencing. This helps us in de novo read assembly and it gives a reference genome for all the further studies which one can do in future. Then comes whole exome sequencing, where one can sequence all the exons specifically, leaving out the introns, but this is only applicable once you have a reference genome sequence, which is obtained by whole genome sequencing. So whole exome sequencing helps in the alignment of genes on the reference genome.
Similarly, in the case of RNA-seq, one extracts the RNA and goes for RNA sequencing. RNA sequencing adds value by giving us information regarding the introns which are also involved in the production of a protein. There are many proteins which are not formed only by exons but also include some part of introns. So if we do whole genome sequencing or whole exome sequencing, we will miss those introns which are contributing to the production of a protein, but in the case of RNA-seq we will be able to identify those introns which we would miss otherwise.
After sequencing, we align the sequenced genes on the reference genome and then we can understand two things. One is genome assembly, that is how the genes get assembled on the reference genome, and the second is variant calling, which includes SNVs, that is single nucleotide variants, indels, that is insertions and deletions, gene fusions, splice junctions and RNA editing. We will be able to understand all these things if we use RNA-seq, WXS or WGS. But after that one more component is very important, that is protein. One can extract the proteome of an organism and subject it to mass spectrometry to obtain the possible spectra for all the proteins which are present and processed. Then the data which we have obtained from mass spectrometry can be analyzed using a reference database, which we can make by using the information obtained from whole genome, whole exome and RNA-seq data.
Here one more component plays a very important role, that is FDR control. FDR stands for false discovery rate, and it helps in removing false identifications, including all the contaminants which can possibly come in anywhere from the very first step of sample collection to the last step, which is subjecting the sample to mass spectrometry. For this we need to make a decoy database, so let me show you how we can make a decoy database. There are two methods by which one can make a decoy database: one is randomization and the second is reverse. Let us take an example of a peptide sequence. Let us say this is the sequence of a peptide we have. If we go for randomization, it will randomize all the amino acids except K; because mostly we do trypsin digestion, we keep the K constant. So it randomizes all the amino acids while keeping one amino acid constant. The next one is reverse; in this one it reverses the whole sequence of the peptide.
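The two decoy strategies described above can be sketched in a few lines of Python. This is a minimal illustration, not the content of the lecture slide: the example peptide, the function names and the random seed are all hypothetical. The last function uses the common target-decoy convention (decoy hits divided by target hits above the same score threshold) as an assumed way of turning decoy matches into an FDR estimate.

```python
import random


def reverse_decoy(peptide: str) -> str:
    """Reverse strategy: read the whole peptide sequence backwards."""
    return peptide[::-1]


def shuffled_decoy(peptide: str, fixed: str = "K", seed: int = 0) -> str:
    """Randomization strategy: shuffle the residues, but leave every residue
    listed in `fixed` (here lysine, K) at its original position."""
    rng = random.Random(seed)  # seeded so the decoy is reproducible
    movable = [aa for aa in peptide if aa not in fixed]
    rng.shuffle(movable)
    shuffled = iter(movable)
    return "".join(aa if aa in fixed else next(shuffled) for aa in peptide)


def estimate_fdr(n_target_hits: int, n_decoy_hits: int) -> float:
    """Assumed target-decoy estimate: decoy hits / target hits, both counted
    above the same score threshold."""
    if n_target_hits == 0:
        return 0.0
    return n_decoy_hits / n_target_hits


if __name__ == "__main__":
    peptide = "PEPTIDEK"  # hypothetical tryptic peptide ending in K
    print(reverse_decoy(peptide))
    print(shuffled_decoy(peptide))
    print(estimate_fdr(n_target_hits=1000, n_decoy_hits=10))
```

Because the decoy sequences should not exist in the sample, any spectrum that matches a decoy is a known false match, and the number of such matches gives a handle on how many of the target matches are likely to be false as well.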
So if this is the peptide, we will not be getting this peptide if we cut it using trypsin digestion; it will give a peptide which is this way, because trypsin is a C-terminal digesting enzyme. Now let us see how this decoy database can be useful in understanding FDR control. Let us assume that we have these two circles, where one represents the mass spec data and the other one represents the decoy database. In the mass spec data, let us assume the positives stand for the peptides which belong to the sample, whereas the red ones, the negatives, are the ones which are coming from contamination while processing the sample or from any other source. What we do now is overlap these two circles and then remove all the negatives which are present in the mass spec data. After removing them, we will only have the peptides which are useful for us in understanding the proteome of a particular sample. So this is how one can apply FDR control, remove all the contaminants from the sample and then use the proper sample peptides to analyze the results.
Now we know about the gene sequence, RNA sequence and proteome data, and we can use all this information for various purposes: for genome annotation, where we can map the peptides on the genome; for mutation analysis, where it will tell us about all the possible mutations which are leading to a particular clinical condition like malaria, cancer or any other disease; and also for metaproteomics. Proteogenomics has been studied extensively in the case of cancers and many people have already taken it to a very high level. So I would like to take an example of a study which was done by Philipp Mertins et al. in 2016. The title of the paper is Proteogenomics Connects Somatic Mutations to Signalling in Breast Cancer, where they have disproved the conventional thinking that a DNA mutation leads to an RNA mutant and then the RNA mutant leads to a protein mutant which causes a disease.
In this paper, as you can see here, they have taken approximately 90,806 mutations, out of which 84,667 were DNA mutations, and out of those mutations only 40,697 led to RNA variants, whereas the others were present only as DNA variants and were not carried forward to the RNA or protein level. Out of these RNA variants also, very few have translated on to the proteome level. So this tells us about the variations which one can face at different levels. That is why studying only one platform, RNA, DNA or proteome, will never give you a perfect answer for understanding the biology behind any disease.
From this paper I have taken one part of their study, the example of basal-like cancer. They have confirmed the cancer by using the PAM50 test and they had also checked for ER and PR status. Each box represents a patient and these different rows show the different levels of analysis they had done. One is mutation; they had checked mutations for this particular gene, that is TP53, and here on the right hand side you can see the different colors which represent different types of mutations. Green represents missense mutation, blue represents nonsense mutation or frameshift mutation. CNA stands for copy number aberration, which at the bottom is represented by different colors: dark blue represents deletion and dark red represents amplification. Similarly, for RNA-seq and for the mass spec data they have given a scale from minus 3 to 3, which shows the upregulation or downregulation of the protein. As you can see here, these 3 patients are having a missense mutation which is leading to loss of heterozygosity, but still the amount of protein which is produced is upregulated. So one cannot tell directly or for sure, by looking at a mutation at the gene or RNA level, that this particular protein will be upregulated or downregulated. Hence the study at all the 3 levels is very important.
Now I would like to take you towards Plasmodium vivax and why this particular parasite is very harmful and very important to study. Plasmodium vivax is a parasite which causes malaria and it has a life cycle which includes 2 hosts: one is the mosquito and the other one is the human. Malaria can be caused by 5 major parasites, which are falciparum, vivax, ovale, knowlesi and malariae. Out of these 5, Plasmodium falciparum and Plasmodium vivax contribute to 90% of the malaria cases across the world. According to the WHO 2017 malaria report, approximately 3.5 million cases were found. The major issue with the parasite is that the vector, that is the mosquito, can cross borders without any restriction if it is not handled or taken care of properly.
The problems related to Plasmodium vivax are many. One of those is diagnosis and differentiation. Plasmodium vivax infects only reticulocytes, that is immature RBCs. The number of reticulocytes in whole blood is very low, and if the parasite is infecting only reticulocytes, the chances of finding an infected reticulocyte become very low. Because of this, the parasitemia level is almost always very low in the case of Plasmodium vivax, which makes it very difficult to diagnose using microscopy. The other method is RDT, that is the rapid diagnostic test. We have two types of protein in RDTs. One is PfHRP2, which is specific for Plasmodium falciparum, whereas the other ones are aldolase and lactate dehydrogenase, which tell us about all the remaining parasites except falciparum. So the RDTs which exist can only tell us whether a person is suffering from malaria and whether the parasite which is causing it is falciparum or not. They will never tell us whether the patient is infected with vivax or not.
Apart from that, it also has a major issue with the life cycle. When the mosquito bites an individual and causes malaria, if it is falciparum, the parasite goes into the liver and then it infects the blood, whereas in the case of vivax it infects the liver and stays there in the form of hypnozoites, which are the dormant form of the parasite, and it can relapse whenever the conditions are suitable. It can relapse even up to 6 months after the infection without any further mosquito bite. Because of this, malaria caused by Plasmodium vivax is now occurring throughout the year rather than being seasonal, which was the case earlier.
It also has another problem, which is the lack of a continuous culturing capability for the parasite. In the case of Plasmodium falciparum one can continuously grow a culture in vitro, whereas in the case of vivax one cannot take the culture beyond 48 to 72 hours. Because of this we lack a proper proteome database and we poorly understand the dormancy of the parasite, because one cannot diagnose the dormant parasite in the patient. We also do not have any rapid diagnostic methodology available for diagnosing the infection. Also, G6PD deficiency, that is glucose-6-phosphate dehydrogenase deficiency, in an individual may lead to severe hemolysis if primaquine is given. Primaquine is the only drug which helps us in reducing or removing the dormant stage of Plasmodium vivax, that is the hypnozoites, but if a person is G6PD deficient and primaquine is given to that person, it may lead to severe hemolysis and the patient may die. So understanding the parasite is very important, and having a diagnostic kit is also very important. Apart from that, we also do not know what the criteria are for a vivax infection to be severe or non-severe.
WHO has provided criteria for differentiating severe vivax infection from non-severe vivax infection, and these are exactly the same as for falciparum, because vivax is still not studied well enough to define criteria specific to vivax, except for one condition: the parasitemia level is not directly correlated with severity in the case of Plasmodium vivax, whereas in the case of Plasmodium falciparum, if the parasitemia level is high, one can say that the person will go towards severity. So at present we do not know whether the severity in the patient is because of the patient or because of the parasite level. In this case proteogenomics can help us in understanding what is leading to the severity.
For this I would like to take the example of a paper, Proteogenomic analysis of the total and surface-exposed proteomes of Plasmodium vivax salivary gland sporozoites, where they have tried to incorporate proteogenomics in the understanding of the sporozoites of Plasmodium vivax in the salivary glands of the mosquito. They had collected four different Plasmodium vivax infected blood samples, and then they had fed a hundred mosquitoes on the Plasmodium infected blood for 5 to 7 days. The fed mosquitoes were taken further for culturing for 14 days, and the salivary glands were dissected out of those mosquitoes. Then the lysate of the salivary glands was taken and run on SDS-PAGE and further processed for mass spectrometry. Then the data was analyzed, and they had considered oxidation of methionine and carbamidomethylation as the modifications which are contributed by the sample preparation.
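To make the role of these modifications concrete, here is a small sketch of how a modification changes the peptide mass that a search has to match. It assumes that the second modification named above is cysteine carbamidomethylation, and that, as is common practice, carbamidomethylation is applied to every cysteine while methionine oxidation is optional; the lecture does not state how the study configured this. The function names and the example peptide are hypothetical; the masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in daltons (rounded to 5 decimals).
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056            # added once per peptide for the free termini
OXIDATION = 15.99491        # mass shift of methionine oxidation
CARBAMIDOMETHYL = 57.02146  # mass shift of cysteine carbamidomethylation


def peptide_mass(seq: str, oxidized_met: int = 0,
                 carbamidomethyl_cys: bool = True) -> float:
    """Neutral monoisotopic mass of a peptide with the given modifications."""
    if not 0 <= oxidized_met <= seq.count("M"):
        raise ValueError("oxidized_met must be between 0 and the number of M")
    mass = sum(MONOISOTOPIC[aa] for aa in seq) + WATER
    mass += oxidized_met * OXIDATION
    if carbamidomethyl_cys:
        # Assumption: every cysteine carries the modification.
        mass += seq.count("C") * CARBAMIDOMETHYL
    return mass


def candidate_masses(seq: str) -> list:
    """All masses to consider when methionine oxidation is optional."""
    return [peptide_mass(seq, k) for k in range(seq.count("M") + 1)]


if __name__ == "__main__":
    for m in candidate_masses("MCMK"):  # hypothetical peptide
        print(round(m, 4))
```

A peptide with two methionines therefore has three candidate masses, which is why every optional modification enlarges the search space.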
The way they have taken the help of proteogenomics in understanding the parasite is this: they have taken 19 Thai isolates and extracted the DNA sequence reads, and then they have taken RNA sequence reads from 13 Thai isolates. Then they have aligned these sequence reads, aligned the resulting customized proteome database against the existing Plasmodium proteome database, removed the duplicates and made a customized database for the Thai population. Why have they done this? At the very start I told you that they have taken 4 P. vivax samples from 4 different origins. To cover the maximum variation, they had taken the different DNA and RNA-seq isolates.
When they tried to correlate the Plasmodium falciparum and Plasmodium vivax data, they were able to see that most of the proteins are correlating. These blue proteins on the diagonal show the maximum correlation, the ones which are towards the x-axis are more related to Plasmodium vivax, whereas the ones which are more towards the y-axis are more related to Plasmodium falciparum.
The challenge with Plasmodium vivax and with this study is that a wide dynamic range of protein concentrations was present, because of which the proteins which are present in abundant quantity are mostly picked up and the proteins which are present in lesser quantity are mostly neglected. This is a shortcoming of shotgun proteomics: it is biased towards the highly abundant proteins, and because of that one can miss out on the low abundant proteins. An alternative is that one can go for correlation studies using proteogenomics, because of which one can also include the low abundant proteins and the genes which are having variations related to the particular protein.
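The database customization step mentioned in this part of the lecture, combining isolate-derived protein sequences with the existing proteome database and removing duplicates, can be sketched as below. This is only an illustration under stated assumptions: the FASTA records are made up, the function names are hypothetical, and duplicates are defined here simply as identical amino acid sequences, which may not be the exact rule the study used.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def merge_databases(reference: Iterable[Record],
                    variants: Iterable[Record]) -> List[Record]:
    """Append variant proteins to the reference, skipping any sequence
    that is already present (reference entries take priority)."""
    seen = set()
    merged: List[Record] = []
    for header, seq in list(reference) + list(variants):
        if seq in seen:
            continue
        seen.add(seq)
        merged.append((header, seq))
    return merged


if __name__ == "__main__":
    # Hypothetical toy data: var1 duplicates ref1, var2 carries one substitution.
    reference = parse_fasta(">ref1\nMKTAYIAK\n>ref2\nMSLLNK\n")
    variants = parse_fasta(">var1\nMKTAYIAK\n>var2\nMKTAYVAK\n")
    for header, seq in merge_databases(reference, variants):
        print(header, seq)
```

Keeping the variant entry that differs by a single residue is the point of the exercise: a peptide covering that position can only be identified if the variant sequence is present in the search database.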
The future perspectives of proteogenomics research can be accurate diagnosis and proper treatment, and understanding of the severity of the parasite Plasmodium vivax. Genes related to the development of resistance can also be studied by using proteogenomic analysis, and so can the variations of vivax, as India itself contains the maximum variation of Plasmodium vivax as compared to the rest of the world. So we need to understand the variations in vivax as much as possible to treat the patients in India effectively. With this I would like to end the lecture, thank you.
I hope in this case study you learnt how proteogenomics is emerging rapidly to solve the intricacies of various diseases, and how this approach can also help to study parasites or organisms which are not very well understood, like Plasmodium vivax. Major issues like the lack of a proteome database can also be complemented by transcriptomics and exome sequencing data, as we have seen in today's case study. We will also provide you a few more case studies in the context of cancer, and that will probably give you a much better idea of how proteogenomics has started making its impact in the actual clinical field. Thank you.