Hello there, my name is Tim Griffin. I am from the University of Minnesota, and I will be presenting the introduction to proteogenomics as part of the GTN Smorgasbord and the Global Galaxy course. Before I move on, I want to acknowledge a few people who have contributed to this presentation, as well as those who will be giving the more detailed tutorials that make up the proteogenomics overview and tutorials: Pratik Jagtap, who is a co-leader of the Galaxy for Proteomics (Galaxy-P) project, along with Subina Mehta, Andrew Rajczewski, and James Johnson. Subina, Andrew, and James will present, in greater detail, the three modules of the proteogenomics tutorials that I will introduce in this talk.

What I hope to do is give some background on the informatics challenges related to proteogenomics, an overview of the core components of proteogenomic workflows, and then a brief preview of the tutorials that will be part of the Global Galaxy course.

As a primer: proteogenomics is based on mass spectrometry proteomics data generated by shotgun proteomics. In this method, we take protein samples that are complex mixtures of proteins and digest them into peptides. These peptide mixtures are then fractionated and separated by liquid chromatography, such that the peptides elute from the column and are introduced directly into a mass spectrometer, where they are ionized and put into the gas phase. The peptides entering the mass spectrometer are made of different amino acid sequences and so have slightly different sizes and molecular weights.
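The digestion step described above follows a simple rule that search engines also apply in silico: trypsin cleaves after lysine (K) or arginine (R), but typically not when the next residue is proline. Here is a minimal sketch of such a digest; the example protein sequence is made up purely for illustration:

```python
def trypsin_digest(protein, missed_cleavages=0):
    """In silico tryptic digest: cut after K or R, but not before P
    (the classic rule used by most search engines)."""
    # collect cleavage positions
    sites = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))
    # build peptides, optionally spanning missed cleavage sites
    peptides = []
    for i in range(len(sites) - 1):
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(sites))):
            peptides.append(protein[sites[i]:sites[j]])
    return peptides

print(trypsin_digest("MKWVTFISLLFLFSSAYSRGVFRR"))
# ['MK', 'WVTFISLLFLFSSAYSR', 'GVFR', 'R']
```

Allowing missed cleavages (peptides spanning one or more uncut K/R sites) is a common search-engine option, which is why it is parameterized here.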
The peptides are first detected by the mass spectrometer: the masses of the different peptides entering the instrument at any given time are recorded. Each of these peptides is then, in turn, isolated in the mass spectrometer and fragmented by collision with an inert gas, which breaks it down into smaller amino acid fragments. The masses of those fragments are detected and recorded in what is called a tandem mass spectrum. That second round of mass measurement is why this is called tandem mass spectrometry: after recording the peptide masses, we record a fragmentation spectrum for one of those peptides. The mass-to-charge values of the resulting peaks form a fingerprint that corresponds to the amino acid sequence of the selected peptide. The instrument does this extremely quickly, in a matter of milliseconds, and it remembers that other peptides were also entering the mass spectrometer just a brief moment ago. So it goes back, selects another peptide, and repeats the fragmentation process, over and over, as fast as it can, recording the fragmentation spectra for each of these peptides. The goal is to use the fragmentation spectra to annotate and ultimately match an amino acid sequence to each tandem mass spectrum, or MS/MS spectrum. The way we do that is to take the raw fragmentation spectrum, the MS/MS spectrum, through a computational routine called protein sequence database searching. We take a database of known protein sequences, or putative sequences predicted from genomic DNA sequences, RNA sequences, whatever it might be, and the spectra are matched against the potential sequences in this database that may be present in the sample.
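The matching at the heart of database searching can be illustrated with a toy example. For each candidate peptide, a search engine computes the theoretical masses of its fragment ions (conventionally called b- and y-ions) and compares them against the observed peaks; real scoring functions are far more sophisticated, so the naive peak-counting score below is only a sketch of the idea:

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def fragment_ions(peptide):
    """Singly charged b- and y-ion m/z values for a peptide:
    b-ions are N-terminal prefixes, y-ions are C-terminal suffixes."""
    b, y = [], []
    for i in range(1, len(peptide)):
        b.append(sum(MASS[a] for a in peptide[:i]) + PROTON)
        y.append(sum(MASS[a] for a in peptide[i:]) + WATER + PROTON)
    return b, y

def count_matches(theoretical, observed, tol=0.02):
    """Naive score: how many theoretical ions fall within
    `tol` Da of some observed peak."""
    return sum(any(abs(t - o) <= tol for o in observed) for t in theoretical)
```

In practice the candidate peptide whose theoretical fragment pattern best explains the observed MS/MS spectrum, by a statistically calibrated score, is reported as the peptide-spectrum match.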
In this matching process, we match amino acid sequence fragments to the different fragment masses that were recorded, and ultimately we find the full peptide sequence that best matches the recorded fragmentation pattern. Once we have a confident amino acid sequence, we can relate it back to the protein it came from, because we started by purposely digesting the intact proteins into these smaller peptides with trypsin, an enzyme. We then ask where in the database this particular sequence occurs, so we can infer a protein identity from a peptide sequence identified from the mass spectrometry data. This process is automated and can identify thousands of proteins from complex mixtures.

So what is proteogenomics? Proteogenomics is a twist on that idea. It tries to overcome one of the limitations of this approach: because we base the identification of peptide sequences on what is in the database, the results we can get are really only as good as the database we use. If there are novel protein sequences present in our sample that have not been identified or sequenced by biochemical means in the past, and they are not in our database, we will not find them with this sequence database search process. What proteogenomics does is take advantage of what has now become a much more routine sample analysis, where we can, say, record RNA sequences from the same sample.
So we can do RNA-seq on the same sample that we are using for proteomics. From the RNA-seq data we get a full transcriptome sequence, and we can then use that RNA as a template, applying the rules of the codons that encode amino acid sequences, to translate these transcript sequences in silico into the possible proteins in our sample. This could also be done from DNA sequences (it gets a little more complicated because of the two strands, but it is possible as well). The end result is the ability to create a more comprehensive database of sequences. This would include the known sequences encoded in the RNA, but if your specific sample contains mutations or isoforms in its RNA sequences, you would also capture those in the translated protein sequences. So we create a much more comprehensive, sample-specific protein sequence database, which we then use in a similar way to match our MS/MS spectra to.

This slide walks you through that: we take the MS/MS spectra recorded from the peptides in that same sample, take this customized protein database, and do the matching, so we ultimately get peptide sequences that match the MS/MS spectra. We can then use bioinformatics to map these peptides to the genomic regions that code for them, and use that to understand the nature of the proteins expressed in our sample. We may start to see proteins expressed from regions of the genome that were not predicted to be coding, so we can detect some novel protein products in that case.
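The in silico translation step just described can be sketched as follows: build the standard codon table, translate the transcript in each of the three forward reading frames, and keep stop-to-stop stretches as candidate protein entries. Real database-generation tools (like those used in the tutorials) handle strands, ORF selection, and sequence headers much more carefully; the `min_len` cutoff here is an arbitrary illustration:

```python
# Standard codon table, built from the compact NCBI ordering of codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[i] for i, (a, b, c) in enumerate(
    (a, b, c) for a in BASES for b in BASES for c in BASES)}

def translate(seq):
    """Translate an RNA/DNA sequence (U is mapped to T) in frame 0;
    stop codons are rendered as '*'."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def three_frame_orfs(transcript, min_len=7):
    """Stop-to-stop stretches from all three forward frames, as
    candidate entries for a sample-specific protein database."""
    orfs = []
    for frame in range(3):
        for chunk in translate(transcript[frame:]).split("*"):
            if len(chunk) >= min_len:
                orfs.append(chunk)
    return orfs
```

Because the transcript carries whatever mutations and splice isoforms the sample expresses, the translated entries inherit them, which is exactly what makes the resulting database sample-specific.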
We will match a lot of these sequences to known protein-coding regions, and that is no surprise. But within those known protein-coding regions, if we look closer at some of these matches to amino acid sequences, we can start to pick up things like novel splice isoforms. That is what this part of the figure is trying to show: if there are, say, three exons in a particular gene, we will start to see peptide amino acid sequences that span different junctions of those exons. So if there are novel splice isoforms, we now have peptide-level evidence that they are expressed as proteins, along with other things such as single amino acid changes in these protein-coding genes and short indels (insertions or deletions); anything that causes a change to an amino acid sequence can now potentially be identified using proteogenomics.

This has a lot of power and many different applications. You can confirm the translation of variant amino acid sequences into at least stable proteins, which may have functional effects, say in disease or other contexts. You can use this approach for neoantigen discovery, finding novel protein sequences that might be recognized by the immune system, with applications in immuno-oncology and other areas. Proteogenomics has been a pillar of the CPTAC program, which is looking to do proteomic characterization of tumors; they have used this sort of multi-omic, proteogenomics-based approach for much of that work. So there are many places where it has application and power.
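A filtering step implied by the discussion of novel sequences above is deciding whether an identified peptide is actually novel, i.e. absent from the reference proteome. A minimal sketch of that check, treating isoleucine and leucine as equivalent since they have identical masses and cannot be distinguished by standard MS/MS (the reference sequences below are arbitrary examples):

```python
def is_novel(peptide, reference_proteins):
    """True if the peptide does not occur as a substring of any
    reference protein. I and L are treated as interchangeable, a
    common convention in proteogenomics filtering, because they
    have identical masses."""
    probe = peptide.replace("I", "L")
    return not any(probe in prot.replace("I", "L")
                   for prot in reference_proteins)

reference = ["MKWVTFISLLFLFSSAYSR", "MEEPQSDPSVEPPLSQETFSDLWK"]
print(is_novel("FISLLFLF", reference))   # present in the reference, so not novel
print(is_novel("FISMLFLF", reference))   # single amino acid change, so novel
```

Real workflows also check peptides against all isoforms, contaminants, and decoys before declaring them novel, but the substring test is the core of the idea.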
So how do we do proteogenomics? In the end, the technologies to acquire the data, deep proteomics data using mass spectrometry and RNA-seq data on the same sample, have become relatively mature. Doing proteogenomics, though, is a bit of an informatics challenge: many software tools have to be pulled together to make this work. We need to generate the databases from, say, the RNA-seq data; we need to match our MS/MS data; we need to filter those results to make sure they are of the highest confidence; and then we need to visualize and interpret what comes out of the analysis. This poses lots of challenges, lots of pieces to the puzzle. How do you assemble and call the variants in the sequencing data? How do you generate those customized protein databases? What is the best way to match MS/MS spectra to those databases? How do you filter to make sure you have high-quality results, and then interpret them, so that beyond a list of possible variant sequences we understand what those sequences are and what their functions might be? And how do you make all of this more accessible to the larger research community, so it can be adopted by more research groups?

Our group at the University of Minnesota has been working on this for a number of years, and the solution we turned to was the Galaxy ecosystem. Galaxy gave us a platform ideally suited to this type of problem, because it provides a workbench to integrate different software in a place where bench scientists can access the tools, and it already has a thriving community around genomic and transcriptomic tools. So we started bringing proteomic tools into Galaxy, which gave rise to the Galaxy for Proteomics (Galaxy-P) project, and proteogenomic
applications have been one of our main focal points. Our philosophy was not to reinvent the wheel, but to take many of the software pieces that are already out there and bring them together in Galaxy, to make a cohesive workflow platform that can automate this and make it accessible to the greater community.

On that point, what Galaxy gives us is a workbench, an ecosystem well suited for multi-omic applications like proteogenomics. We can bring datasets, whether proteomic, genomic, or transcriptomic, into a central place, together with the software tools needed to analyze those different types of omic data. This can then be implemented on diverse computing resources, including very powerful resources that provide memory and processor power, as well as cloud platforms. All of this comes together within the Galaxy interface and environment, so we can combine the tools with the data on this powerful infrastructure and create workflows that can be validated, tested, and then used by others. It gives you a web interface, or a programmatic interface if that is desired. The overall idea is that Galaxy is an environment in which you can integrate all of these pieces to do this data science, and it is really well suited for multi-omics. I would add that the Galaxy community has a very strong training network, the Galaxy Training Network, which is a key piece here: once these tools are in place, it is a way to train users in these sophisticated tools so they can take them to their research and actually utilize them.

So what are the core pieces of proteogenomics? There are several, really, but I am going to boil this down to basically three main areas that will
be a part of the tutorials in this training course. What is being shown here is that we bring together RNA-seq data and tandem mass spectrometry proteomics data from the same sample; that is the starting data, and we need tools and workflows that integrate the two. One of the first things that happens is that, based on the RNA-seq data, we use tools within Galaxy to call the variants present in that data, and then utilize other tools to generate a protein sequence database that includes both the known reference proteins and the potential variants expressed in your sample, using the RNA-seq as a template. That is one core component of an overall proteogenomic workflow. The other, once you have this database, is bringing in the tandem mass spectrometry data and utilizing tools to do sequence database searching, matching the MS/MS spectra to these potential sequences, so that we can directly confirm at the protein level which of these interesting variant sequences have been expressed, and then move on toward understanding the potential impacts of those protein sequences. Those are two core pieces of an overall proteogenomics workflow.

Then there is the question of what you do next. Now that we can generate a list of possible variant proteins present in a sample, what do you do with it, in terms of characterizing and understanding the importance of those sequences and the nature of those variants? The last piece is visualization, which is also a key part of Galaxy, where you can visualize, interpret, and characterize your results. We have developed some tools where we take the outputs from this workflow, and these become the inputs to
visualization software that can do some quality control filtering and characterize the nature of the variants we have identified, as a first step toward interpreting the potential functional aspects of these expressed protein variants.

With that, here is the basis of the three modules that make up the proteogenomics tutorial. The first is database creation, which will be given by James Johnson (JJ). That will be followed by Andrew, who will talk about how you can do the database search, matching the MS/MS spectra to the putative proteins in the database. Then there will be a module on the interpretation and analysis of the results: how do you visualize and examine these results to start making an assessment of what you have found, and what is important and most interesting? Those will be the three modules of the proteogenomics tutorial.
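The quality control filtering mentioned above is typically done by target-decoy false discovery rate (FDR) estimation: decoy sequences (for example, reversed proteins) are searched alongside the real database, and matches are kept down to the score at which the fraction of decoys exceeds a chosen FDR. A minimal sketch of that thresholding, assuming each peptide-spectrum match is represented as a dict with a score and a decoy flag:

```python
def filter_at_fdr(psms, fdr=0.01):
    """Target-decoy FDR filtering: sort peptide-spectrum matches by
    descending score and keep matches down to the lowest score at
    which (#decoys / #targets) still stays within the requested FDR."""
    psms = sorted(psms, key=lambda p: p["score"], reverse=True)
    best = targets = decoys = 0
    for i, psm in enumerate(psms):
        if psm["decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            best = i + 1  # deepest cutoff seen that satisfies the FDR
    # report only the target matches above the chosen cutoff
    return [p for p in psms[:best] if not p["decoy"]]
```

This is especially important in proteogenomics, because the enlarged, variant-containing databases increase the chance of spurious matches, and novel peptides are usually held to an even stricter standard than this global filter.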