Welcome to the MOOC course on Introduction to Proteogenomics. After getting a glimpse of proteogenomic concepts from Dr. David Fenyo, we will now move on to the next step: understanding sequence-centric proteogenomics with Dr. Kelly Ruggles. Dr. Ruggles will talk about the basic workflow and requirements of sequence-centric proteogenomics. She will also talk about reference databases such as RefSeq, UniProt and Ensembl, about gene annotation, and about whether genomic data can facilitate the search for novel peptides or the identification of functional proteins. So, let us welcome Dr. Kelly Ruggles to answer some of these interesting and important questions and to help us understand sequence-centric proteogenomics approaches.
We are going to talk now about sequence-centric proteogenomics. So, what do we mean by this? It just means we are really focusing on the sequencing data, in terms of the information we get from it, like the SNVs, the indels, the fusions and the splice junctions that I discussed yesterday, and using this information to get more out of our proteomics data. That can mean genome annotation, which I will talk about in detail, or mutation analysis, specifically in tumor samples. You can also use this for metaproteomics; that is something we do work on in the microbiome, but it is outside the scope of this talk.
So, there are a couple of requirements. You of course need DNA or RNA sequencing; if you have both, you can use both in different ways. You need some sort of high-resolution mass spec data, and then the actual tools to combine these; we will talk about some of those later in the session. And just as a review, I know you have heard this a bunch of times, but I want to really focus in on the importance of the protein sequence database, so I want to spend a couple of slides on that.
So, when we are doing protein identification and quantification by mass spec, we have our sample, fractionation and digestion; you have peptides, you run them on the mass spec, and then in order to actually identify them you need a protein sequence database. Or you can be like Carl and do it by hand, but let us assume that we want to have the protein sequence database. So, from our database, which is something like RefSeq or UniProt, the algorithm will pick a protein, do an in silico digestion, pick a peptide, generate the fragment masses, compare them with the observed spectrum and test for significance, and it will continue to do this over and over again. But of course, if your peptide or your protein is not in this database, you are never going to find it. So, this is a very important place where sequencing can help us make sure we have the right sequences in our database.
So, peptides whose sequences are missing from the database will fail to be identified. And if we make our database too big (sometimes people will say, why not just put in everything you possibly could?), then we are going to lose sensitivity. So, we do not want to do that either. Really, we want to make sure the database is small but complete. Ideally, it would contain all of the proteins that you expect to see in the sample, but nothing else. Obviously, that is usually not going to happen, but you want to get as close to it as you can.
So, the examples of reference databases I already mentioned are RefSeq, UniProt and Ensembl; there are more. I do not know how many of you know about the New York City Marathon. I was just watching it recently, so I wanted to make a comparison between the database and the marathon. So, this is what the marathon looks like: there are 50,000 people running in the streets, and I was looking for one person, a friend of mine who really wanted me to see him. This is very hard to do.
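As a rough illustration of the database search loop described above (pick a protein, digest it in silico, compare masses), here is a minimal Python sketch. It is not how any particular search engine works: it matches only the intact peptide mass rather than scoring fragment masses, it uses a simplified trypsin rule, and the protein names, sequences and the helper names `tryptic_digest`, `peptide_mass` and `candidate_peptides` are made up for the example.

```python
import re

# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056

# Hypothetical toy database; the protein names and sequences are invented.
REFERENCE_DB = {
    "toy_protein_1": "MAGTKLECDRWVTFK",
    "toy_protein_2": "GSHMPEPTIDEK",
}


def tryptic_digest(sequence, min_length=4):
    """Simplified trypsin rule: cleave after K or R unless followed by P."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if len(p) >= min_length]


def peptide_mass(peptide):
    """Neutral monoisotopic mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def candidate_peptides(observed_mass, database, tolerance_ppm=10.0):
    """Return (protein, peptide) pairs whose mass is within the tolerance."""
    matches = []
    for protein, sequence in database.items():
        for peptide in tryptic_digest(sequence):
            error_ppm = abs(peptide_mass(peptide) - observed_mass) / observed_mass * 1e6
            if error_ppm <= tolerance_ppm:
                matches.append((protein, peptide))
    return matches


# A variant peptide (C -> Y) that the reference database does not contain.
observed = peptide_mass("LEYDR")
print(candidate_peptides(observed, REFERENCE_DB))    # -> []

# Once the variant protein is added, the same mass has a candidate.
customized_db = dict(REFERENCE_DB, toy_variant="MAGTKLEYDRWVTFK")
print(candidate_peptides(observed, customized_db))   # -> [('toy_variant', 'LEYDR')]
```

The first search returns nothing because the variant peptide is simply not there to be matched, which is the point made above: a peptide that is missing from the database cannot be identified, however good the spectrum is.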
So, I am searching for Mike; he is the peptide, and the marathon is our database. So, will I find Mike? One, is he running the marathon? That is important, because if he is not, then I am wasting my time. So, is the peptide in the database at all? Two, how many people who look exactly like Mike, or very close to Mike, are also running in the marathon? The answer to that was a lot; it was very hard for me to find him. So, as you add more and more people, can I find the right peptide if there are too many unrelated ones in the database? The ideal database here would be just Mike running through the streets of New York; I would obviously find him. This just happened to me, so I thought it was a nice example of why we have to make our databases the best that we can to find the peptides of interest.
So, the first example I am going to talk about is using sequencing data for genome annotation. This is not cancer specific, but it is another use that I think a lot of people in the room may end up needing in their work, so I wanted to mention it. So, what is genome annotation? It is the process of identifying genes and assigning functions to them. The human genome has been fairly well annotated, as have the mouse genome, Drosophila and all of our favorite model organisms, but there are tons of organisms out there that have not been annotated. So, if you are working on one of those, this may be a really good way of helping to further annotate your genome. Historically, software and models have been developed to try to predict genes from genomes. Those are okay, but they are not perfect, and we will talk a little about how some of them can fail; I am not going to go into too much detail.
Also, RNA transcript analysis is of course really important here, but to really understand what is making it to the protein level, you have to do proteomics. So, for really good genome annotation, proteomics and proteogenomics are really important. Proteomics has been used in several instances to supplement sequence analysis and get the best genome annotation we can. We can use mass spec data to confirm gene models, to correct gene models, and also to identify novel genes and splice isoforms.
So, here is an example, and I am sorry you cannot read this. Oh, you cannot hear? Great. Okay. So, the green here would be our predictive models, where the software takes the sequence and predicts what is actually an exon, essentially. And here is the actual annotation, so you can see that the prediction is missing a whole lot of things. It does not have the right transcriptional start site, it only has one transcript, it does not have the 5 prime exon, and it does not have the UTR; so it is missing a lot of information. Now, if we add in RNA-seq data, we find at least that there are two isoforms here. But then when we add in proteomics, we figure out what is actually making it to the protein. So, then we can find out where the start codon is, and we can better understand where the UTRs are. So, merging all this information together is really necessary to correctly annotate genomes.
So, how do we do this? We have our reference database, or whatever we have available. There are a lot of databases for understudied organisms that exist but are incomplete, so you can use that. Then you can do whole genome sequencing of whatever organism you want, do a six-frame translation, which we will talk about, and add that in. And from that you can find new peptides that will supplement what we already know about the organism's annotation.
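To make the six-frame translation concrete, here is a minimal Python sketch. It assumes the standard genetic code and a plain A/C/G/T sequence; the function names and the toy sequence are invented for the illustration, and a real pipeline would more likely rely on an established library and handle ambiguous bases properly.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon from position 0; '*' is a stop, 'X' an unknown codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def candidate_sequences(frames, min_length=7):
    """Split every frame at stop codons and keep the stretches of at least min_length."""
    return sorted(
        {seg for protein in frames.values() for seg in protein.split("*")
         if len(seg) >= min_length}
    )


frames = six_frame_translation("ATGAAATAG")
for name, protein in frames.items():
    print(name, protein)
```

Every stop-to-stop stretch from all six frames becomes a candidate database entry, and only one frame at most is real at any position, which is why this approach inflates the database so quickly.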
So, here we have our sequence. You do a positive-strand three-frame translation: you start at ATG and you go from there, then you start at TGA and you go from there, and so on. So, you do every frame, and then you go in the negative direction and you get the other three frames. Then you have six frames, and you can use this to supplement your reference database. This is only the best way to go if you do not have RNA-seq data, because you are really blowing up your database; you are making it enormous if you do this. So, it is not necessarily the best thing to do if you have other data available, which we will talk about, but it is one option, especially when you are working with an understudied organism and you only have whole genome sequencing. If you have RNA-seq data, as I mentioned, then you can add in splicing information, which will help you annotate even further.
And let us say you want to study an organism that has no genome sequence. Zebras did not have a sequenced genome when I checked last night, and there are a whole bunch of organisms that do not. But let us say you want to study zebras and you do not have anything on them. Well, you could use a horse; it is close enough. You can see what you find using the horse sequence, to see if you are able to find interesting related proteins, and then you can do some de novo sequencing as well to supplement this. So, this is an option. Is anyone studying zebras?
So, one recent example of this type of method was in the pig genome. In 2017 there was a paper that came out; you can find it and go look at it if you are interested. The pig genome had been recently sequenced, but the actual annotation of the genome was not complete. So, they used mass spec.
They did mass spec on nine organs during different stages of development, and they were able to improve the annotation for over 8000 protein-coding genes. So, I think this is the perfect example of how you can use these two things together to better annotate genomes. And here is a list of their databases. They used all sorts of sequence databases. They used prediction models, they used six-frame translations of the genome, and they used transcriptome data. So, they kind of did all of the things we already talked about, and I think it is a really nice example of how you can use this data for annotation. Any questions about annotation? Go ahead.
I am working on a species for which I do not have the whole genome sequence, but I do have the whole genome sequence for a related species. So, I am identifying proteins. How useful will it be to take whatever genome information is there from the related species and use it for my target species?
So, you have a similar situation to the zebra.
Yeah. To give one example, I am working on one fish which is closely related to the zebrafish. For zebrafish the whole genome sequence is available, but for this species, which is closely related to it, the whole genome sequence is not available. So, I am trying to identify proteins using the relative's database. Then again, I am relying on close relatives. [inaudible]
So, I mean, can you do sequencing on your species? That would be the best thing to do. But if you cannot, have you tried to use the related species database?
Yeah, that is what I am doing. Then how much does it matter how related they are?
It depends on how related they are. We can talk about it offline; this is interesting, and we could chat about this exact question. But it really depends, right?
I know David Fenyo was working on rat and mouse work that was similar to this, so he may also be a good person to talk to about it.
Yeah. Anyway, the most common model is the zebrafish, and most of the information available is for this related species. So, I wonder how much it matters how related they are if we do proteomics using the database of another species like zebrafish. You are still identifying proteins using a relative, you know.
It is not perfect, right? You are going to be missing a lot of information, but I think it is worth trying. I do not know how similar your species are, so it is hard for me to say, but it sounds like you are already going to do it, so you may as well try. You can let me know how it works and then we can talk about it. It is very case by case, right?
Conservation scores on the protein-coding region, is that what you said? What about them?
So, you can predict what would be protein coding.
Yes, correct. I do not know what the question was, but I agree with what you said. Okay. Other questions? Let us move on.
So, I am going to spend most of the time talking about variant peptide identification. So, what do we mean by this, the novel peptide identification? The question we want to answer here is whether or not genomic SNPs are translated into functional proteins. So, when we have a single nucleotide polymorphism, does it make it to the protein? And if it does, do we see it at the protein level? And also, do we see novel protein expression? I am going to be specifically looking at tumors in this case, but you could apply this to whatever you are looking at. It is just that, typically, in tumors you see more of this novel expression, so it is something that we pay a lot of attention to.
So, this would be looking at different RNA-seq splice junctions, which we talked about yesterday. We will talk about all of the different combinations of splice junctions that are novel and how we deal with them in terms of proteomics. There are a couple of different kinds of SNPs. So, these are our codons. If we have no mutation, we get a lysine. We can have a synonymous SNP, where there is a G to A change but it does not actually change the amino acid. We can have non-synonymous SNPs where the codon turns into a stop codon, so it will just stop the protein synthesis early. Or we can have a missense mutation, where we actually get a change in the amino acid. We will talk about these in more detail. So, we are really going to focus on how we get these peptides into our database so that we can even find them.
One of the reasons this is really important is that most proteomic studies, especially previously (we have gotten better about this), usually used a reference database, so either RefSeq or UniProt, to model whomever they were studying. But as we talked about yesterday, a reference database is just trying to represent the population; it does not have all of the different variation that occurs in a population. And there was the 1000 Genomes Project, which we also talked about yesterday, that really uncovered how much variation there is from person to person. These are not necessarily disease-causing SNPs; they are just SNPs that exist. So, if we model everyone using a reference database, we are going to miss a lot of information. And also, in cancer, there are somatic mutations, so mutations that occur only in the tumor. So, if you are trying to measure proteins in a tumor and you do not include these somatic SNPs, you may actually miss those peptides.
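The SNP categories described above can be illustrated with a short Python sketch. It assumes the standard genetic code and a single base change within one codon; the function name `classify_snp` is invented for the example, and the label "nonsense" is the usual term for a change to a stop codon rather than a word used in the talk.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, position, alt_base):
    """Classify a single-base change in a codon.

    position is 0-based within the codon. Returns a tuple of
    (reference amino acid, altered amino acid, category).
    """
    codon, alt_base = codon.upper(), alt_base.upper()
    if len(codon) != 3 or position not in (0, 1, 2) or alt_base not in BASES:
        raise ValueError("expected a 3-base codon, a position 0-2 and a base in ACGT")
    if codon[position] == alt_base:
        raise ValueError("the alternate base is the same as the reference base")

    mutated = codon[:position] + alt_base + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    if alt_aa == ref_aa:
        category = "synonymous"
    elif alt_aa == "*":
        category = "nonsense"    # premature stop codon
    else:
        category = "missense"    # different amino acid
    return ref_aa, alt_aa, category


# AAG encodes lysine (K); three different single-base changes in it:
print(classify_snp("AAG", 2, "A"))   # AAG -> AAA
print(classify_snp("AAG", 0, "T"))   # AAG -> TAG
print(classify_snp("AAG", 1, "G"))   # AAG -> AGG
```

Only the last two categories change the protein sequence, so for database building they are the ones that produce peptides a reference database cannot contain.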
These somatic mutations are really interesting because many of them are very much involved in disease progression. If we use a reference, we cannot find these SNPs in our data. So, we really need to figure out how to make sure we include them in our database, so that we can uncover both patient-specific and, if we are looking at cancer, tumor-specific variation. So, for example, in our mass spec workflow we have germline mutations, which are just mutations occurring in people, and somatic mutations, which occur in the tumor itself. We have to figure out how to get these into the database so that we can actually find them. This is just a representation of the same thing.
So, the VCF file format is the most commonly used format for looking at these variants. These are the columns again: we have the chromosome, the position and the SNP identifier (sometimes there is not really anything there, it is just a dot), then the reference nucleotide, then the new nucleotide, then a quality score, and then some other information about the SNP itself.
Many times when you look at variant calling, a call is reported as a true variant, but when you actually do PCR verification you find that it is actually false. So, what are the criteria by which you can pick out true variants? I mean, the best way is to do PCR verification.
So, there are a lot of ways of validating it, and that is one of them. It just depends on the study and how much work a person is willing to do to validate. If you have hundreds and hundreds of variants, you cannot do that. That is why I mentioned yesterday that there are several SNP callers. Right now, what a lot of people do is use a whole bunch of them, look at the overlap, and trust the overlap rather than just using one caller, because with one you are going to have a lot of false positives. That seems to work fairly well.
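The "trust the overlap" idea can be sketched in a few lines of Python. This is an illustration under simple assumptions, not a recommended pipeline: it reads only the first five tab-separated VCF columns (CHROM, POS, ID, REF, ALT), ignores quality and filter fields, and the three call sets and the helper names are made up.

```python
from collections import Counter


def parse_vcf(text):
    """Collect (chrom, pos, ref, alt) keys from VCF text, skipping header lines."""
    variants = set()
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _identifier, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):      # ALT may list several alleles
            variants.add((chrom, int(pos), ref, allele))
    return variants


def consensus_variants(call_sets, min_callers=2):
    """Keep the variants reported by at least min_callers of the call sets."""
    counts = Counter(variant for calls in call_sets for variant in calls)
    return sorted(v for v, n in counts.items() if n >= min_callers)


def vcf_line(chrom, pos, ref, alt):
    """Build one made-up VCF record; ID, QUAL, FILTER and INFO are placeholders."""
    return "\t".join([chrom, str(pos), ".", ref, alt, "50", "PASS", "."])


HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"

# Invented output from three hypothetical variant callers.
caller_a = "\n".join([HEADER, vcf_line("chr1", 100, "A", "G"), vcf_line("chr2", 200, "C", "T")])
caller_b = "\n".join([HEADER, vcf_line("chr1", 100, "A", "G"), vcf_line("chr3", 300, "G", "A")])
caller_c = "\n".join([HEADER, vcf_line("chr1", 100, "A", "G"), vcf_line("chr2", 200, "C", "T,G")])

call_sets = [parse_vcf(text) for text in (caller_a, caller_b, caller_c)]
print(consensus_variants(call_sets, min_callers=2))
print(consensus_variants(call_sets, min_callers=3))
```

Raising `min_callers` trades sensitivity for fewer false positives. Where to set it is a judgment call for each study, much like the decision about how much validation to do.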
In conclusion, I hope that today you have learned how one can use gene annotation and genome sequencing to create proteome databases for unexplored organisms. I would like to emphasize that it is very crucial to learn this, because many times you are working on unknown organisms for which databases are not available, and therefore your searches are going to come back with unknown or hypothetical proteins. So, refining databases is very crucial, especially if you are not working on human or other model systems. You may first have to establish good databases before searching your proteome data. You also learned why it is better to know your targets while searching for SNPs (single nucleotide polymorphisms), as you do not want to sequence non-pathogenic SNPs in this process. We also heard about how one can make personalized protein databases for specific studies. The next lecture is about variant analysis and its effect on RNA and protein expression in clinical conditions, and it will be continued by Dr. Kelly Ruggles. Thank you.