Welcome to the MOOC course on Introduction to Proteogenomics. After understanding sequence-centric proteogenomics, we will now listen to Dr. Kelly Ruggles talk about variant analysis and its effect on other biomolecules, leading to various clinical conditions. She will also talk about how to read a .vcf file and about informatics tools for creating customized databases. She will discuss how one can create a variant peptide manually using the Integrative Genomics Viewer, or IGV, to understand the basics. Once the basic mechanism is understood, one can use different software to generate variant peptides. Let us now welcome Dr. Kelly Ruggles again to tell us more in depth about creating and analyzing variant peptides.
I'm going to walk you through doing this by hand, even though we now have tools to do this, because I think it's good to do these things by hand sometimes. So what you have is your reference genome. This would be hg38 or hg19. You have the sequence, and you add your variants into the sequence within these exon boundaries. You do an in silico translation, you throw these into your database, and then you can search against your database to find them in your mass spec data. These are examples that David actually just showed, but we might as well look at them again now that we understand a little more about it. Here you can see there is a SNP occurring at this location in this protein. It goes from a valine to an isoleucine, and then you can identify this within your data. Or in some cases you can have a stop codon introduced. So instead of a full protein sequence, you'll actually have only part of it. As David mentioned, this is a lot harder to validate, right? Because we don't always have coverage in proteomics. So are we not seeing it because it's not there? Or are we not seeing it because we're just not able to measure it?
Or in some cases, it can go from having a stop codon to having an amino acid. Here translation would just continue, and you would get a novel protein that extends past the original stop codon. There are a couple of different tools for this. I mention two because they were created by people who are in this room. His group was responsible for customProDB, and David and I worked on QUILTS. You can use both of these; I've linked to them. You're able to put in VCF files and/or BED files and get back a database that has all of your information in it. So it sort of sits here: after you process your next-generation sequencing data, you get the splice junction BED files and your variant VCF files. QUILTS and customProDB can then create databases that you use to do your peptide identification. And here's QUILTS; this is the web page. You're able to just upload your data here and get your databases out of it. Right now you can only do human on the website, but if you're interested in other species, we can chat.
Okay. So this is where I wanted to walk through what we're going to do in the hands-on before we actually do it. We'll see if that helps with the actual hands-on. You're welcome to try to follow along, but let's wait for questions until the actual hands-on. I just want to show you what we're going to be doing, and we're going to use a different example for the hands-on itself. If you haven't downloaded IGV, please do so. We're going to create a variant peptide by hand. Once you have IGV open, you can zoom in and out on different genes pretty easily. I just picked a gene so that you could get zoomed in: I typed ERBB2 in the search field.
Then you're able to zoom in on one gene. You can change these coordinates either by picking a gene or by typing in different coordinates, and it will move you along the genome. The reason I wanted to show you this is that there is this line you can drag up and down once you've zoomed in enough. You don't have to do this right now; we're going to do it during the hands-on. But if you drag it down, you actually get the sequence information of the genome, and that's going to be really useful for the hands-on itself. You'll end up getting something that looks like this. What's up here is the genome sequence, and this is the three-frame translation. So these are amino acids, and we'll zoom in a little more so you can see. It actually does the translation for you. You can zoom in and see that this is the genome sequence, and here are the amino acid translations from the genome sequence. It has the annotation for the actual ERBB2 exons here as well. You can click this arrow to switch between the negative strand and the positive strand. This is really important. You can flip back and forth, and it will flip the three-frame translation for you.
Okay, so when we're creating this variant peptide by hand, this is the first entry from the VCF file; for the hands-on, we're going to do the second entry. You can upload your own files into this and then actually visualize them. Here you would upload your SNP VCF file. So it's the sequence PG-SNP.VCF. In IGV, you go to File and then load the file. Then, if you go to the position on chromosome three given in this VCF file, you're able to see that it labels your variant.
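As a rough sketch of what a VCF entry carries and how a single-base substitution changes a translated peptide, the Python below reads one tab-separated VCF data line and applies it to a short reference window. The record, the window start coordinate, and the reference codons are all made up for illustration (only the final GAC to CAC codon change mirrors the lecture's by-hand example), and the helper names `parse_vcf_line`, `apply_snp`, and `translate` are hypothetical; they are not part of IGV, QUILTS, or customProDB.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def parse_vcf_line(line):
    """Return (chrom, 1-based pos, ref, alt) from one VCF data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return chrom, int(pos), ref, alt


def apply_snp(window, window_start, pos, ref, alt):
    """Substitute one allele in a forward-strand reference window.

    `window_start` is the 1-based genomic coordinate of window[0].
    """
    offset = pos - window_start
    if window[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the window")
    return window[:offset] + alt + window[offset + len(ref):]


# Hypothetical record and in-frame, forward-strand reference window.
record = "chr3\t1000022\t.\tG\tC\t.\tPASS\t."
window_start = 1000001
reference = "CTGACCCATGGTGATTCTGTGGAC"  # invented codons for L T H G D S V D

chrom, pos, ref, alt = parse_vcf_line(record)
variant = apply_snp(reference, window_start, pos, ref, alt)
print(translate(reference), "->", translate(variant))
```

For a gene on the negative strand, the window would have to be reverse-complemented before translating, which this sketch does not attempt.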
So you have the variant labeled here. Then you can zoom in by entering the exact position into the search field, and you see the variant location in this gene here. What we'll do is the in silico translation, using the sequence information provided here, so that we can write out a peptide that has this variant within it, and then change that amino acid to the correct amino acid based on the SNP that's there. So we would start here. The variant, again, is here. It's a G to C, so this is a G. Everything before the variant is the same, right? So we don't have to do anything; we can just copy exactly what's there in the reference. So it's L, T, H, G, D, S, V, and then we get to the variant, and we have to figure out what it is. We know that it was GAC and now it's CAC, and you can go up to your handy codon table and figure out that it was a D and now it's an H. So you can add that in, and now you have your in silico translation of your SNP at the protein level.
The output for QUILTS and customProDB will look something like this: you'll have a FASTA file with a header that has information about the SNP that was incorporated into the sequence, and I've bolded here the full tryptic peptide that includes the SNP. The blue is what I just showed in the demo. I didn't include the whole peptide, but you could scroll in and look at it. The H is where the actual SNP is. We're going to do a similar example in the demo, with a different SNP, and actually going in the negative direction. Any questions on this? Okay, yeah, go ahead.
Yeah, there are a couple of different tools. I like this one the best; that's why I use it. But there's UCSC also; you can use that. There are a couple of different genome browsers.
So you can use your favorite. This is just the one that I decided was the most user-friendly, so I use it. Yeah. Say that again?
For the variant file information, can you only give it the VCF file format, or any other file format?
So you're asking, can you use a different input format? Yeah. You may be able to use MAF files, but I haven't tried. What kind of file format were you thinking?
Like a file in FASTA format where the header contains all the variation information.
Yeah, you'd have to parse it out and make it into a VCF file; that's what I would recommend doing. It's definitely particular about its file inputs. I think all of them are, though. Other questions?
So we can also look at novel expression identification, meaning novel alternative splicing or fusion genes. Here are a couple of examples. Say we have two known exons, but they're combined in a way that is not annotated. That's a new alternative splicing event that doesn't involve novel expression in terms of new exons; it's just a new way of connecting exons. That's one example. You can have an example where one exon is connected to the middle of another exon, so it chops off the beginning of it. Or maybe one exon is connected to an intronic region, so it's actually adding on some sequence before the annotated exon. Or maybe it's in an intergenic region, so it's past where we think the gene ends. Or maybe it's just completely novel, where there are no annotated exon boundaries at all. So there are a lot of different ways that these things can be combined. If you're doing genome annotation, this is the same kind of problem that you'll face, so this applies to either of those questions. And also fusion genes. We talked yesterday about fusion genes, right?
If one of the exons of gene X is connected to an exon in gene Y, you're going to have a totally new gene or sequence that you'd also have to add into your database in order to find it. So we can take our RNA-seq data and the information we get from it, including the junctions and the fusion genes, and throw those into our database. Then we're able to find these tumor-specific proteins as well. So yeah, this is kind of the same: you could have new alternative splicing or completely novel expression.
Okay, so this really requires a BED file. I also included an example BED file in your zip folder. It has a very specific format, where you have the chromosome and your gene start and end. In this case it would be a junction start and end, but you can treat it the same way. Then a name, a score, the strand, and some display info. You'll notice if you open the BED file that I changed some of the colors specifically so that during the demo we can point things out, so there are RGB numbers in there that have been altered for that reason. Then the number of exons, or blocks, the sizes of these exons, and the starts of these exons. I showed this yesterday; I'm just going to show it one more time because we're going to be thinking about this a lot. In our BED file, we have the start of the gene, right? The first block start will be zero, so it starts where the gene starts. The exon is 126 nucleotides long, so that takes you to 126. And then we continue doing this. You always take the gene start, add the block start to get the start of the exon, and then add the block size to get the end of the exon. You continue doing this for each of the different exons. That's how you parse the BED file to get the actual genome annotation from it.
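The block arithmetic just described can be sketched in a few lines of Python. This assumes a standard 12-column, tab-separated BED line with 0-based starts; the example line is invented (only the first block's start of 0 and size of 126 come from the lecture), and `bed12_exons` is a hypothetical helper name.

```python
def bed12_exons(line):
    """Return (chrom, strand, [(exon_start, exon_end), ...]) for a BED12 line.

    Coordinates follow the BED convention: 0-based start, exclusive end.
    """
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, strand = f[0], int(f[1]), f[5]
    block_count = int(f[9])
    # blockSizes and blockStarts are comma-separated and may end with a comma.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("block count does not match the block lists")
    exons = []
    for rel_start, size in zip(starts, sizes):
        exon_start = chrom_start + rel_start  # gene start + block start
        exon_end = exon_start + size          # exon start + block size
        exons.append((exon_start, exon_end))
    return chrom, strand, exons


# Invented two-block record: a 126-nt block at the start, a 100-nt block later.
line = ("chr1\t1000\t1500\tjunction1\t0\t+\t1000\t1500\t255,0,0"
        "\t2\t126,100\t0,400")
print(bed12_exons(line))
```

In a two-block record like this one, the spliced-out region is whatever lies between the two blocks (1126 to 1400 in this made-up example).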
And the junction files: if you run RNA-seq on the sample, the junction files just indicate where the boundaries are between the exons that are spliced together. If exon one and exon two are spliced together, you'll have a junction read here. If exon one and exon three are spliced together, you'll have one for that, and so on and so forth. You'll see how these junctions are visualized in IGV, and hopefully it will be clear. What we do to create these splice junction databases is take the junction BED file and compare it to the known annotation. We take everything out that's known. It's not that we don't care about that; it's that it's already included in our reference. So we take those out and keep just the things that are new. Then we take the new BED file with the new mappings and figure out what kind of mapping each one is. Is it just an unannotated alternative splicing, where we already know the exons but they're spliced in a new way? Does one end map to a known exon while the other end maps to something new? Or is it completely new? The way we deal with it changes based on what kind of novel splicing it is. Here would be an example of alternative splicing with two exons that are known, right? This is the peptide data, and this is the exon structure. You can see that there is actually evidence at the peptide level, where you're connecting this exon with another exon that's known, but that connection did not already have annotation in the database. Or a connection between an exon and some intergenic region. Here's an example where we have an annotated exon connected to this intronic region here.
And there's evidence at the peptide level for this connection, because we added this into the database. Or you could have, as I mentioned, these completely novel peptides, where it's just either an intronic or an intergenic region. You can sort of see that, in this case, it was in the middle of an exon and then at the end of an exon. These are less likely, especially in a really well-annotated database, but you do find them.
Okay, so again, you can put your BED file into these tools and just create your database using them. But we're going to do one by hand. You upload your BED file the same way you would upload your VCF file, so you load it here. I made you a very small BED file. If you have the full BED file, you'll have things that look like this, with every single junction connecting all sorts of different exons. What I gave you should look like this once you upload it. There are six different junctions included. The purple ones are annotated, and the red ones are novel. Full disclosure: I made up the novel junctions just for this purpose. They may occur in reality, but I made them up for the demo. When we open this up, and you can open it now, but we'll open it in the demo, you should see this. What we'll do right now is just walk through how to create this novel splice junction peptide by hand. What you can see here is that the purple connects annotated exons: this exon connects to this one, and this exon connects to this one. Then there's this red junction that connects this exon to something in the middle of an intron. So there's novel expression here. What we're going to do is figure out how to make the peptide that bridges the known exon with the novel expression.
If we zoom into the known exon here, for which I've included the boundaries, you'll see again that this is the genome sequence and this is the three-frame translation. Right here is the actual annotated gene. Since this novel junction is downstream of a known exon, we can keep this sequence as is, right? It's not changing frame, and we know that this exon is still being translated in the same way. So we can just take this chunk of sequence and keep it, and then we're going to add the novel sequence to the end of it. But what we have to pay attention to here is where this sequence ends. The G shown here assumes that the next thing that comes is the annotated exon. What we actually have is just two G's hanging after the annotated D, and we're going to add something new to the end. So we have to keep that in mind: these two guanines are left hanging, and they're going to attach to this new boundary. We keep the amino acid sequence of the peptide from before this, and we remember that the two G's are here. Then we look at the other side of the junction: we take the two G's and add on the actual genome sequence from the other side of the junction, from the novel expression. Then we're able to use this information to work out what this new novel expression would look like. We can get our new amino acids and add those in. This G would be the boundary between the known and the unknown. Then we throw these into our database as well, so that we're hypothetically able to find this new boundary in our data. These are really hard to find.
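The by-hand junction step can be sketched as string concatenation followed by translation. Everything here is an assumption made for illustration: the sequences are invented so that the known exon ends with an aspartate codon plus two dangling G's, as in the walk-through; both pieces are taken on the forward strand; and `junction_peptide` is a hypothetical helper, not a description of how QUILTS or customProDB work internally.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def junction_peptide(known_exon_cds, novel_dna):
    """Join in-frame known-exon sequence to novel sequence and translate.

    `known_exon_cds` starts on a codon boundary and may end with one or two
    leftover bases; the start of `novel_dna` completes that codon.
    The result is cut at the first stop codon, if there is one.
    """
    protein = translate(known_exon_cds + novel_dna)
    return protein.split("*")[0]


known = "ATGGCTGACGG"  # invented: ATG GCT GAC plus two dangling G's
novel = "TCTGAAACGT"   # invented sequence from the far side of the junction

print(translate(known))                # known exon alone, partial codon dropped
print(junction_peptide(known, novel))  # peptide bridging the junction
```

Because every GGN codon encodes glycine, the residue at the junction is a G whichever base the novel sequence contributes, which is consistent with a G marking the boundary in the walk-through.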
To prove that this is happening, you really need to be able to identify this boundary between the known exon and the new exon. You have to find that one peptide that proves it's actually occurring, which is not so likely. So if you don't see it, it doesn't mean it's not happening; it's just that if you do find it, it's exciting. We'll talk about the likelihood of finding these kinds of things in real data.
In addition to this, as I mentioned, you can also include these fusion genes in your database. This is an example of what different outputs for fusion genes look like. I like this because I think it looks kind of like art. If you zoom in here, these are actually sequences. Everything up to this point is RNA-seq reads from one gene, and then RNA-seq reads from another gene, and it's showing how the two are fused together in this cancer data set. You would really want to take this boundary and add it into your database. So you would take the boundary here, find the consensus sequence of this boundary, do a six-frame translation, and add that into your database as well. Again, it's very hard to identify this in real data, but worth trying if you have it.
Okay, so this is specifically important in cancer because, as you've heard many times, there's a lot of altered expression, with SNPs and mutations causing changes in protein expression. We really want to understand whether these are found at the protein level and, if they are, what their effect is on protein function. This is just showing that there's a lot of variability in how much variation occurs in different tumors, even if they're the same type of tumor. Here are human breast tumors, and these Circos plots are showing the rearrangements within each of the tumors. You can see that they're really different and highly variable.
Some of them don't have a lot and some of them do. There are two studies that I wanted to talk about. One is a study you've already heard a lot about: the CPTAC retrospective breast study, where we looked at a lot of different things within these 77 tumors. I'm going to talk a little about how we looked at the effects of somatic mutations, and also whether we were able to find mutations at the protein level within the proteomics data. The other is these patient-derived xenograft tumors. These are tumors that are injected into immunodeficient mice and are able to grow in these mice. This is a very cool system, because you can have many, many mice that have the same tumor and treat them with different things, or you can just grow lots and lots of the same tumor for all sorts of QC experiments, or just to better understand that tumor. These were two different tumors that were used as quality control within experiments, and they are actually still being used for the studies that are ongoing. What's cool about these is that they were measured over and over and over again. We used that fact to really understand where we're at in terms of our depth of discovery of these protein variants. So we'll talk about that one first. We had these two tumors, a basal and a luminal tumor. Proteomics was completed, as we've discussed, with iTRAQ. Then genome and transcriptome sequencing was done, and these were incorporated into the protein database using QUILTS. Then we did novel peptide identification, filtered out the mouse proteins and the normal proteins we expect to see based on a reference database, and looked at things that were novel, so either novel junctions or SNPs. So what we found, if you look, these are the two different tumors.
The blue is all of the DNA variants that we identified in the genomics data; the RNA level is the purple, and the protein level is the orange. This is for the basal tumor and the luminal tumor. What you can see from this is that only about 3% of the predicted genomic SNPs were actually found at the protein level, and about 10% of those at the RNA level were found at the protein level. There are a lot of reasons this could happen, right? Maybe a SNP causes the protein to be degraded, so we're not going to find it. Maybe a SNP causes the peptide to be homologous to another part of the proteome, so we assume that it's normal even though it's not. Maybe we just don't have the coverage to find these. So just because we only find 3% doesn't mean that those are the only 3% that make it to the protein level. This was just a way of assessing where we're at in terms of our ability to discover these. And these are some examples of somatic SNPs that were identified that are cancer related.
We also looked at whether we're able to identify these novel junctions in our data. Again, the two different tumors. The purple is all of the novel junctions that were identified by RNA-seq. The blue is the ones that had at least five reads. Some of them have just one read and are probably garbage, so we wanted to make that clear as well. But you can see these tiny little dots here, which are the novel junctions that we were able to identify at the peptide level. Very, very few. Again, this may not be because they don't exist, right? To prove that these novel splicing events are occurring, you have to find the peptide that sits right at that junction.
And maybe the peptide at that junction is too big, or too small, or homologous to something else. So it's not that these are the only ones that exist; it's just that they're the only ones we were able to find using this method. I think this also shows how well the human genome is annotated. If this were a less well-annotated species, we'd find a whole lot more new splice sites.
And then the last one I will discuss, just quickly, is the CPTAC data, so again, the 77 breast tumors. Here is an analysis where we combined all 77 and looked at them together. It's a similar kind of Venn diagram, where we have all of our DNA variants and our RNA variants, and only 4% of them were found at the protein level. So it's similar to what we showed in the last data set. The red is just our somatic variants. Most of these had been identified previously, so they were either in the dbSNP database or the COSMIC database, and only a small percentage of them were completely novel. And there are a couple of examples here of SNPs that were identified within this data.
Okay, so the last thing I wanted to quickly talk about is proteogenomic mapping. Let's say you find new and cool peptides, and you want to map them back onto the genome. If you have a lot of them, this really requires automation. The reason you would do this is to visualize them alongside your genomic data. Let's say you want to put them up in IGV and look at them alongside your junctions, to make sure it makes sense and that the peptide is actually proving that your junction exists, and to have it all in one view, in one browser.
So there's this tool, PGx, where you can put in your peptides and your sample-specific database, and it will map them onto genomic coordinates. Here is a schematic where you could have your copy number, your methylation data, your RNA-seq data, and your peptides all mapped to a chromosomal location. Then you can look and see how things are quantitatively changing. What happens is that you can use all of this data to create BED files, and the BED files, again, are what you then load into IGV, as we did with the junction data. This is just an example. This is a spectrum from a variant that was identified in a tumor; I think this is probably from the CPTAC breast data. I can't actually read what that is; a T to an A here. This is a track showing where the variants are, and you could just map your tumor peptide and make sure it maps to the same place. Then you could have your novel peptide data where you have your junction. This is the RNA-seq data, and in green you see a peptide that spans the exact same novel junction that we found at the RNA level. So you can visualize them together to check that you're seeing the same boundary that was predicted by your RNA-seq. And then you can throw everything up there and look at it all at once: the reads from the RNA-seq data, the annotation, the proteomic mapping, all of your variants. So I just want to thank everyone in my lab, the Fenyö lab, and of course CPTAC, where most of this work has been done.
I hope today you have learned why it is important to know and understand variant peptides. You've also seen how the Integrative Genomics Viewer, IGV, can be operated and accessed to understand your data.
You also heard how IGV helps us find what kinds of mutations are present in a given gene, using a VCF file containing details of all the SNPs in the data. Using the details of the mutated genes and the type of mutation, one can create variant peptides, as Dr. Kelly Ruggles just mentioned. IGV can also be used to visualize novel expression due to splicing. You also learned about BED and junction BED files, which contain information about the possible splicing events involved in the expression of a particular protein. The next lecture will be by a mass spectrometry scientist, Dr. Suman Thakur, who will talk about proteomics in clinical studies. Thank you.