Hello, my name is Andrew Rajczewski and I'm a graduate student in the Griffin lab at the University of Minnesota and a part of the Galaxy Proteomics Group. As a part of this year's Galaxy Training Network Smörgåsbord, I'd like to present you with the second of three tutorials in a series on proteogenomics applications in Galaxy. In this tutorial, I will take you through how to search mass spectrometry data against a database to perform proteogenomic analyses. A prominent approach in modern biological research is systems biology, wherein a plurality of biomolecules in a system are measured simultaneously to ascertain the response of that system to various stimuli. This is accomplished through multi-omics technologies, each of which measures a different class of biological molecule. The contents of a genome, for example, are determined through sequencing in a process known as genomics. Similarly, the degree of genomic methylation and chromatin architecture is determined through epigenomics. The totality of gene transcription can be examined through mRNA sequencing, a process known as transcriptomics. And finally, the phenotype of a system can be more directly ascertained through omics technologies such as proteomics and metabolomics, which measure the proteins present in a system and the small molecules produced by these proteins, respectively. Our lab's focus is on proteomics and on the analysis of proteomics data in Galaxy. Modern proteomics approaches utilize mass spectrometry to identify and at times quantify the proteins in a biological sample. In a conventional proteomics experiment, proteins are digested enzymatically into shorter constituents called peptides for ease of downstream analysis, a technique called bottom-up proteomics. This results in an exceedingly complex mixture. To analyze them all at the same time would be akin to trying to judge the makeup of an entire crowd all at once.
Just as this crowd could be separated using queues and turnstiles, the peptide mixture is separated using liquid chromatography, resulting in a few peptides entering the mass spectrometer at a time. Within the mass spectrometer, the mass-to-charge ratio of the peptides is first measured before the peptides are then fragmented into smaller pieces. The mass-to-charge ratios of these pieces are then measured in a separate event called an MS2 spectrum. Once MS2 spectra are collected, the mass spectra can then be searched against a reference database using bioinformatics to determine the peptides, and therefore proteins, present in the sample. When a peptide fragments in the mass spectrometer, it breaks up in a predictable way along the peptide backbone, giving a series of ions made up of the peptide fragmented at discrete locations on the backbone. Within the proteomics software, these ion series are treated as a sort of fingerprint and compared against the theoretical peptides within the database, looking for a match that would correspond to these ion series. Once a spectrum is annotated with a peptide sequence, it is designated a peptide spectrum match, or PSM, and assigned a score depending on the quality of the match. These PSMs can then be assembled to identify proteins. In this way, thousands of proteins can be identified and potentially quantified in a single sample. Ideally, identifying all the peptides in a sample would be as simple as optimizing the collection of data by the mass spectrometer. However, it is important to note that there could be spectra within the dataset that originate from an unannotated portion of the proteome. By using only canonical reference proteome databases in bottom-up proteomics, it is theoretically possible that an enormous amount of biological information is lost. This can be corrected for by making custom databases that contain the canonical reference proteome in addition to experiment-specific extra sequences.
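To make the fragment-ion "fingerprint" idea above concrete, here is a minimal sketch of how theoretical singly charged b- and y-ions can be computed for a candidate peptide. This is an illustration only, not code from the tutorial: the peptide "PEPTIDE" and the trimmed residue-mass table are assumptions for the example, and real search engines handle charges, modifications, and many more residues.

```python
# Monoisotopic constants (Da); RESIDUE is trimmed to the amino acids
# needed for this example peptide only.
PROTON = 1.007276
WATER = 18.010565
RESIDUE = {
    "P": 97.05276, "E": 129.04259, "T": 101.04768,
    "I": 113.08406, "D": 115.02694,
}

def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values for a peptide."""
    n = len(peptide)
    prefix = [0.0]  # cumulative residue masses from the N-terminus
    for aa in peptide:
        prefix.append(prefix[-1] + RESIDUE[aa])
    total = prefix[-1]
    # b ions: N-terminal fragments; y ions: C-terminal fragments (+ water)
    b = [prefix[i] + PROTON for i in range(1, n)]
    y = [total - prefix[i] + WATER + PROTON for i in range(1, n)]
    return b, y

b_ions, y_ions = fragment_ions("PEPTIDE")
```

A search engine compares ion ladders like these against the peaks observed in each MS2 spectrum to score candidate PSMs.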
For example, any sequence variants that are not present in the reference proteome can be identified using a six-frame translation of the genome, a three-frame translation of cDNA, or a protein database derived from RNA-seq experiments. This leads to the identification of single amino acid substitutions, frameshift mutations, and alternative splice isoforms, also known as proteoforms. The use of transcriptomic and/or genomic data to supplement proteomic analysis and identify novel proteoforms is termed proteogenomics, and this is the approach my colleagues and I hope to communicate to you in these tutorials. However, proteogenomic analysis, as with most multi-omics analyses, has historically been somewhat taxing, requiring considerable time and computational finesse on the part of the analyst. Fortunately, many tools used in proteogenomic analyses have been uploaded into Galaxy, where even the most novice bioinformatician can readily use them in a simple graphical user interface. What's more, the tools can be utilized together in workflows, allowing analyses to occur automatically when given the starting datasets, ultimately saving the analyst time. This graphic represents a hypothetical workflow containing all the requisite steps in a proteogenomic analysis. For this tutorial, I will be focused on the section highlighted in red, where mass spectrometry data is searched against custom databases to identify putative novel peptides. The other steps highlighted here are covered in the other tutorials done by my colleagues. The searching of mass spectrometry data against a custom database for proteogenomic analysis has been isolated and painstakingly simplified into a compact, straightforward workflow, shown here.
For the remainder of this session, I will be going through the individual nodes in this workflow so that you may better understand what they do, after which I will walk you through the process of importing data for this tutorial and running the workflow itself so that you may be able to practice on your own. To begin, let us discuss the input files needed to run the proteogenomics database search workflow, as this workflow cannot be run without all the requisite inputs. To run this workflow, you will need three input files, each in the correct format. The first input file, or files, needed are the raw mass spectrometry data for the experiment you mean to analyze. It is important that the files be in the Mascot generic format, or MGF. If they are not, the files can be converted using open-source tools like msconvert, which can be found right in Galaxy. The second input file needed is a custom database for this experiment, generated from suitable RNA-seq data to reflect the alternate proteoforms not found in the conventional proteome. This database is in the FASTA format, with all the proteins in the sample expressed in one-letter amino acid codes with a unique identifying header on each. The generation of this database is covered in the tutorial created by my colleague James Johnson, which I encourage you to watch before this one. The final file needed is a second FASTA database containing the reference proteome accessions for the system you are analyzing, which will be utilized near the end of this workflow. In addition to the conventional proteome, this database can also contain common protein contaminants, such as keratin from human and animal sources, to avoid misattribution of these contaminants to alternative proteoforms. As with the custom FASTA database, this is generated by the first proteogenomics workflow.
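As a small aside on the FASTA format described above, here is a minimal sketch of reading a FASTA protein database into a dictionary keyed by header. The two records shown are illustrative placeholders I made up, not entries from the tutorial's actual database.

```python
# Sketch of parsing FASTA: each record is a ">" header line followed by
# one or more sequence lines; sequences may wrap across lines.
def parse_fasta(lines):
    records, header, seq = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        records[header] = "".join(seq)
    return records

example = """>sp|EXAMPLE1|demo protein
MKTAYIAKQR
QISFVKSHFS
>sp|EXAMPLE2|another demo
GELARVTGKH
""".splitlines()
db = parse_fasta(example)
```

The unique header on each record is what lets downstream tools, like the Query Tabular steps later in this workflow, distinguish reference accessions from novel sequences.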
Now that we've established the requisite files needed to run this workflow, let's discuss the first node in the workflow, where we use SearchGUI. The SearchGUI engine is arguably the heart of this workflow, as this is the node that searches the raw data against the custom FASTA database. SearchGUI was developed by the Martens group to effectively perform searches of protein mass spectrometry data against FASTA files. Over the years, many different proteomics search algorithms have been developed, each of which has its own unique advantages and disadvantages. Ideally, multiple of these algorithms would be run; however, that can result in considerable time and computational power being needed. SearchGUI allows for multiple search engines to be run at the same time, maximizing the ability to interrogate the data without the time commitment needed for sequential searches. In addition to being able to perform multiple searches simultaneously, users can also adjust the settings in SearchGUI to accommodate differences in experiments. Specific digestion options can be chosen, selecting from several potential enzymes as well as varying the number of missed cleavages. In addition, the peptide precursor options can be adjusted to account for different mass spectrometers' resolutions and optimize the ability to identify PSMs. Finally, post-translational modifications such as oxidation, acetylation, or phosphorylation can be adjusted in SearchGUI so that chemical modifications of the amino acids can be accounted for when searching through your data. Once PSMs have been identified using SearchGUI, the results go to the next node, PeptideShaker. PeptideShaker is in many ways a companion piece to SearchGUI, having also been developed by the Martens group. The SearchGUI results include all potential PSMs generated from your data, regardless of the quality of the match of the spectra to the putative peptides.
To account for this, PeptideShaker will filter out the PSMs that do not meet a certain false discovery rate, or FDR, threshold set by the user, leaving only the highest-quality PSMs. PeptideShaker can also output data files in the form of simple tabular lists, as well as the mzIdentML file, whose use in subsequent analyses we will discuss later. Having identified PSMs in the data, the next step in the workflow involves the use of two Query Tabular steps, which will filter out peptides belonging to conventional proteoforms, leaving novel peptides for our analysis. While we are able to identify high-confidence PSMs using SearchGUI and PeptideShaker, with proteogenomics we are interested in the novel peptides not found in the normal proteome, which are invisible to conventional bottom-up proteomics approaches. At this node, the first Query Tabular step removes all peptides corresponding to proteins in the reference proteome, such as normal proteins and contaminating peptides. This part of the workflow will therefore leave behind only those novel peptides that are unique to this sample. The second Query Tabular step will filter out all those peptides that are either too long or too short to be seen reliably in the mass spectrometer. This workflow also includes nodes that are necessary for downstream analysis. The first is a node denoted mz2sqlite. One of the outputs of PeptideShaker is an mzIdentML file, which stores peptide and protein identification data. This node converts the mzIdentML file produced by PeptideShaker into the mzSQLite format, which is needed for the interrogation of peptide spectra using the Multi-omics Visualization Platform, or MVP. As shown here, MVP uses a spectrum viewer known as Lorikeet, wherein the quality of the spectra can be manually ascertained by the user. This is important for the generation of figures and for confirming that the peptides are indeed real.
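The FDR filtering mentioned above is typically based on a target-decoy strategy. The sketch below shows the general idea, not PeptideShaker's exact algorithm: PSMs are ranked by score, and we keep the largest score-sorted prefix whose decoy-to-target ratio stays at or below the chosen FDR. The PSM tuples are made-up example data.

```python
# Each PSM is a (score, is_decoy) pair; higher score = better match.
# Decoy hits estimate how many of the accepted target hits are false.
def filter_by_fdr(psms, fdr=0.01):
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    best_cutoff, targets, decoys = 0, 0, 0
    for i, (score, is_decoy) in enumerate(ranked, 1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Accept the longest prefix where the estimated FDR is acceptable.
        if targets and decoys / targets <= fdr:
            best_cutoff = i
    # Report only target (non-decoy) PSMs that survived the cutoff.
    return [p for p in ranked[:best_cutoff] if not p[1]]
```

Tightening the FDR threshold shrinks the accepted list, trading sensitivity for confidence, which is exactly the user-set knob referred to in the narration.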
As with the previous node discussed, the final node in this workflow is necessary for the novel peptide analysis workflow, as presented by Subina Mehta in her tutorial. In the third proteogenomics workflow, used for novel peptide analysis, there is an initial BLAST-P step that is used to further analyze the peptides identified in this workflow. To do this, the data output from the previous Query Tabular filtering steps must be converted to the FASTA format, which is done at this node. For this to work properly, it is worth checking, before you run the workflow, that the title column is set to column 1 of the previous output, and the sequence column is set to column 2. Now that we've discussed the components of this workflow, let's go through how to run the workflow in Galaxy, specifically the Galaxy Europe instance. Okay, hello again. Here we are in the Galaxy Europe instance, and we're going to go through how to perform the database search for proteogenomics workflow. Starting at the very beginning, let's go ahead and create a history for this, simply by clicking here, and then we'll give it an appropriate name. We'll call it ptn-proteogenomics-database-search, and then I like to put the date in as unambiguous a pattern as possible. We're obviously going to need a few things to run this. We're going to need the appropriate input files, as we discussed before. We're also going to need the workflow itself. To start, let's add in our RNA-seq custom database, as well as our reference annotations. Now, when you do this, ideally you would go ahead and run through the tutorial as performed by James Johnson, or JJ, but to save time, I'm just going to import the results of that workflow in here for our own use, like so. So that takes care of these two aspects of our workflow, but we still need the raw mass spectrometry data in the MGF format for this experiment.
Now, to find that, we're going to go into Shared Data, and we're going to go to Data Libraries. Since this is for the Galaxy Training Network, we want GTN material, down to proteomics here and the proteogenomics database search, and we're going to go to this link here. These are all assorted files that are used for the various proteogenomics workflows, so we're going to focus just on these four fractions here; we want to select all four of these and export them to a history. Now, it's very important that we import them as a single collection, so that they can be run and give a single output for all four files together. We run them as a collection here, from the current selection, and we're going to import them into the history that we just made. This is fine here as a list, so go ahead and add that. So, these are all the files that we want, and we will give the collection a name like gtn-mgf, and we'll just go ahead and go back to our workflow here. And, behold, we have our dataset here, with all four items in it, in the collection, and now we want to go ahead and import our workflow, so that we can do our database search. To do that, it's fairly straightforward. Again, we go to Shared Data, we go to Workflows, and of course this applies to any and all shared data and shared workflows, but for us, we want this workflow here, the GTN proteogenomics 2 database search. We're going to import this so that we can use it at any time, and we'll go here to start using this workflow. So, this is my workflow menu, and here's the imported proteogenomics workflow, so let's open it up and take a look. All right, so it's just as I went through in the PowerPoint slides: you've got your three input files here, including the custom database and the reference protein accessions from proteogenomics workflow one, as well as our MGF collection.
So, this is all set up to be run as a collection, all the more reason to import the files as a collection. You can, of course, convert the individual datasets into a collection once they're in your history, but it's just as easy to import them as a collection. Then it goes through SearchGUI and PeptideShaker, as before, then mz2sqlite to generate our mzSQLite file for data visualization, then our final nodes: the filtering steps and the conversion to FASTA. One thing I'd like to note for this exercise: this is currently using version 1.1.0 of Tabular-to-FASTA, and we're going to want to change that to the most recent version here, so we'll just save that, save our changes to that. Once you have your workflows, you can of course add tools and take them away, and as you change settings, you can save them or copy them to a new file if you want to preserve the original workflow. So, without further ado, we can go ahead and run this workflow. That is going to take all the datasets in our history and populate them into the relevant fields here. Let's just go ahead and go through the individual settings one by one, or the individual tools and nodes, to make sure that everything's correct. Here we've got our custom database made from proteogenomics workflow one; that is where it needs to be. We've got our MGF collection here in the right spot, and then our reference protein accessions, so that's all good. Now, each of these here represents an individual tool or node in the workflow, so we can just go ahead and click on the expand/collapse control to look at more specific things. As I alluded to before, you can modify the digestion enzymes depending on the parameters of your experiment. I know that for this dataset trypsin with two missed cleavages is appropriate, and those are arguably the most common conditions for bottom-up proteomics, so you probably won't even have to change that.
One thing I think is worth pointing out here is the protein modifications. You've got fixed modifications, and you've got variable modifications. Fixed modifications are the sorts of modifications you expect to be there, because more than likely you put them there yourself. In this case, we've got carbamidomethylation of cysteines selected. This is an extremely common part of any bottom-up proteomics workflow, where before you digest your proteins, you add a reducing agent and then an alkylating agent to get rid of any disulfide bonds and then cap your cysteines to prevent the disulfide bonds from reforming. This just aids the digestion. Importantly for this experiment, we've also got this modification here, iTRAQ 4-plex of lysine, as well as iTRAQ 4-plex labeling of the peptide N-terminus. iTRAQ is what's called an isobaric tag. It's useful for quantitation across different samples that are then combined together. This is something that would have been deliberately done in the sample processing step, so we're going to include it in the fixed modifications. At the same time, we've got a few variable modifications. These are things that might occur as a part of a biological process within the cells, something like a phosphorylation or an acetylation, or, in the case here, chemical reactions that can be considered side reactions, or things that just happen. For example, processing proteins invariably introduces a degree of oxidation to methionines, so it's important to include that. As well, iTRAQ is generally meant to react with primary amines, such as lysine and the peptide N-terminus, but it can also sometimes react at tyrosine, and it's worth including that as well. In addition, we've got PeptideShaker here, where we've got it creating an mzIdentML file, which is important for our workflow, and otherwise just the default settings here.
Then down here, we've got our first of two Query Tabular steps, where we manipulate our data. Essentially, it's going to take the outputs from PeptideShaker and then filter them based on the contents of our reference accession numbers. What this SQL code means is that if a peptide's protein is found within the reference proteome accession numbers, it's going to be removed from the PeptideShaker outputs. This is all fine and good. Then this is a similar Query Tabular step, where any peptides fewer than six amino acids long or more than 30 are going to be removed, just because those tend not to give as good spectra within the mass spectrometer. It's easier to remove them; you'll spend less time chasing those in the long run. Finally, here at the bottom, we've got our Tabular-to-FASTA step, which we need in order to do a BLAST-P analysis down the road. Now, if you look here, you'll see that there's a title column and a sequence column, just like we talked about in the PowerPoint portion of the tutorial. As we said, you want to make sure that's populated so that the title column is column 1 and the sequence column is column 2. That's pretty much everything you need. We're going to go ahead and queue up this workflow. Now, those of you familiar with Galaxy will recognize this as the steps are queued and jobs begin to happen here. I didn't want to make you sit here and wait for this all to be done, so I went ahead and ran this workflow in advance just to make sure everything worked. Let's go ahead and take a peek at the finished product. We will go into our histories and switch to the previous one that was run successfully. Here is our successfully run database search workflow. First, we have the output of SearchGUI. It's just a rather large file that has all the information in it. That feeds directly into PeptideShaker, which gives us these outputs here.
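To illustrate the filtering logic of the two Query Tabular steps just described, here is a small sqlite3 sketch. The table and column names are made up for the example; they are not the actual schema the Query Tabular tool builds, and the rows are placeholder data.

```python
import sqlite3

# One query combines both filters: drop peptides whose protein is in the
# reference accessions, and drop peptides shorter than 6 or longer than
# 30 amino acids.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE psm (sequence TEXT, protein TEXT)")
con.execute("CREATE TABLE reference (accession TEXT)")
con.executemany("INSERT INTO psm VALUES (?, ?)", [
    ("PEPTIDEK", "sp|P12345|KNOWN"),  # in the reference: filtered out
    ("NVLPEPK", "novel_orf_001"),     # novel and 6-30 aa: kept
    ("AGK", "novel_orf_002"),         # too short (<6 aa): filtered out
])
con.execute("INSERT INTO reference VALUES ('sp|P12345|KNOWN')")

novel = con.execute("""
    SELECT sequence, protein FROM psm
    WHERE protein NOT IN (SELECT accession FROM reference)
      AND length(sequence) BETWEEN 6 AND 30
""").fetchall()
```

Only the genuinely novel, analyzable peptides survive, which mirrors what you will see when the workflow cuts the PSM report down to a handful of lines.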
You've got your mzIdentML file, which will be useful down the road, our parameters for the analysis, as well as our PSM report. Now, this is where things start to get interesting. You've got here all the peptides that were identified in the system. Here are the peptide sequences and the proteins that they come from; you can see that some of them are common to multiple proteins. Then there's a bit more information about the modified sequence, the variable modifications, and the fixed modifications. This proceeds on to all this other information: charges of the peptides, theoretical masses of the precursors, et cetera, et cetera. From there, we get this mzSQLite file here, which is useful for visualizing the data. We can do that here just by way of example. We open up this MVP application; it can take a minute. This will give us the potential spectra here that you can visualize down here. This is maybe not the best one. Back to this, we see our first Query Tabular step here. We remove those peptides that are found in the reference database, and it cuts the list down from, I think, several hundred or thousand lines here. Let me just see. We've gone from over 5,000 lines to just nine that are unique to our system and not found in either the reference proteome or in the contaminants. You can see here they're annotated by these more unusual accession numbers. From there, you can go ahead and see that some of the ones that are too long or too short are removed, where we have just our simplified ID and the sequence. Then finally, in our last step, we get this FASTA file here, where the column on the left is used as the header and the sequence is used as the sequence. This is a successful invocation of this workflow. You've got six sequences of novel peptides that can be used in the subsequent proteogenomics workflow for analyzing novel peptides. That's pretty much it on how to run this. Now let's just jump back into the slides for a minute.
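The final Tabular-to-FASTA conversion described above, where column 1 becomes the header and column 2 the sequence, can be sketched like this. This is the idea behind the step, not the Galaxy tool itself, and the two input rows are invented examples.

```python
# Convert two-column tab-separated rows (title, sequence) into FASTA
# text, as the Tabular-to-FASTA node does with columns 1 and 2.
def tabular_to_fasta(rows):
    lines = []
    for row in rows:
        title, sequence = row.split("\t")[:2]
        lines.append(">" + title)
        lines.append(sequence)
    return "\n".join(lines)

rows = ["novel_pep_1\tNVLPEPK", "novel_pep_2\tSEQVENCER"]
fasta = tabular_to_fasta(rows)
```

The resulting FASTA is exactly what the downstream BLAST-P step in the novel peptide analysis workflow expects as input.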
With that, I'd like to remind you that the other members of the Galaxy-P team at the University of Minnesota are also giving tutorials today. On the subject of proteogenomics itself, there is an introductory presentation from Dr. Tim Griffin. In addition, James Johnson and Subina Mehta have created excellent proteogenomics tutorials on the generation of custom databases and novel peptide analysis, respectively. Beyond proteogenomics, Dr. Pratik Jagtap has given a talk on metaproteomics applications in Galaxy, which is not to be missed. The introductory tutorials for proteogenomics can be found at the Galaxy Training Network at the address shown below. I hope you found this tutorial useful and hope that you will embark on your own proteogenomic experiments in the future. Thank you for your attention, and enjoy the rest of the Smörgåsbord.