Hello, and welcome to the GTN Smörgåsbord, a global Galaxy course. I am Subina Mehta, a member of the Galaxy for Proteomics (Galaxy-P) team based at the University of Minnesota. To learn about us and our ongoing research, visit us at galaxyp.org. In this video, I am going to cover the third part of the proteogenomics tutorial, which is the novel peptide analysis. In the previous proteogenomics tutorials, we covered the generation of a protein sequence database from RNA-seq data. We then performed proteogenomic database searching using mass spectrometry data, which led to distinct peptide identifications that are not present in the reference database. In the third workflow, we will go over these peptides, identify the novel proteoforms, annotate them and visualise them. This is what the complete workflow looks like. For the purpose of the tutorial, we have separated it into three different workflows. In the first tutorial, you learnt about RNA-seq to variant FASTA database creation. In the second tutorial, you learnt about database searching using MS/MS data. And now in the third tutorial, you will learn about identifying novel variants and visualising them. The workflow we have developed for the novel peptide analysis involves some important steps, so let's go through them. In the database search tutorial, we ended with a list of peptides converted to a FASTA file. This FASTA file is then searched against the NCBI nr database to find any matching peptides. Once this step is completed, a tabular output with the alignments to the reference sequences is displayed. We combine the information that we got from BLASTP with the PSM report from PeptideShaker to obtain all the information about these detected peptides. Now that we have all the information about these peptides, we perform a filtering step, wherein the peptides that fulfil the criteria of not being present in the existing NCBI repositories are called novel peptides.
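To make the filtering step concrete, here is a minimal Python sketch of the kind of rule applied to the BLASTP results. It assumes the criteria stated later in this walkthrough (identity below 100%, gaps present, or an alignment shorter than the query). Reading "gaps present" as at least one gap opening is an assumption, as are the class and function names and the example hits, which are invented for illustration. In the workflow itself this step is done with the Query Tabular tool, not with Python.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class BlastHit:
    """One row of BLASTP tabular output (field names follow BLAST+ column specifiers)."""
    qseqid: str    # query identifier; assumed here to identify the peptide
    pident: float  # percent identity of the alignment
    length: int    # alignment length
    gapopen: int   # number of gap openings
    qlen: int      # length of the query peptide


def looks_novel(hit: BlastHit) -> bool:
    # A hit counts as novel if it is not a perfect, gap-free, full-length match.
    return hit.pident < 100.0 or hit.gapopen >= 1 or hit.length < hit.qlen


def novel_peptides(hits: Iterable[BlastHit]) -> List[str]:
    """Return distinct query IDs with a novel-looking hit, in first-seen order."""
    seen = set()
    result = []
    for hit in hits:
        if looks_novel(hit) and hit.qseqid not in seen:
            seen.add(hit.qseqid)
            result.append(hit.qseqid)
    return result


# Hypothetical hits, invented for illustration only.
example_hits = [
    BlastHit("PEP_A", 100.0, 10, 0, 10),  # perfect full-length match: not novel
    BlastHit("PEP_B", 90.0, 10, 0, 10),   # identity below 100%
    BlastHit("PEP_C", 100.0, 8, 0, 10),   # alignment shorter than the query
    BlastHit("PEP_D", 100.0, 10, 1, 10),  # contains a gap opening
    BlastHit("PEP_B", 90.0, 10, 0, 10),   # duplicate row, reported once
]

print(novel_peptides(example_hits))
```

The rule is applied per hit, which mirrors a row-wise table filter. A stricter variant would require that none of a peptide's hits is a perfect match.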
We can visualise these peptides using our Multi-omics Visualization Platform. It will give you information about the peptide spectra and the peptide's localisation on the reference protein, and we can view its genomic localisation. These novel peptides are then mapped to the genome using the genomic mapping file that you got from the first workflow and the mz to SQLite file that you obtained from the second workflow. Using these three as the inputs, we run the Peptide Genomic Coordinate tool. This tool gives you an output with a .bed extension, which can be used as a track in your IGV browser. To perform some strand analysis and to get more information about the genomic localisation, we have used a new tool called PepPointer. For PepPointer, the main inputs are your GTF file, which is an output of your first workflow, and the peptide genomic coordinate BED output. Once you have this information about your peptides, we use Query Tabular to get a complete summary of your novel peptides, which is basically the information about the peptide, the protein it comes from, the file it is present in and a link to the UCSC browser. Now, if you have completed the second proteogenomics tutorial, this would be your current history. It will have all the tools that you have run, and the last output should be the peptides for BLASTP analysis. This is what the output should look like. You are left with the FASTA file that has to go through the BLASTP analysis. If you do not have that file, do not worry, we have already shared this history with you. Go to Shared Data, click on Histories, then go to the Proteogenomics 2 Database Search history and import it. This history will then become your active, or current, history. Once you have imported the history, the next part is to obtain the workflow. To make your life easier, we have published this workflow.
To get it, go to Shared Data and select Workflows, search for GTN Proteogenomics 3 Novel Peptide Analysis by the owner Galaxy-P, and import this workflow. When you import it, select "Start using this workflow", and you are ready to run your workflow. Your workflow looks something like this. The arrows shown here mark the inputs that are required to run your workflow. As I mentioned before, these are the peptides for BLASTP analysis, the PSM report, the mz to SQLite file, the genomic mapping SQLite file, and the edited Mus musculus GTF file, which you obtain from running the first tutorial workflow. Now let me do a hands-on overview of the workflow. The first thing I will do is log in to the Galaxy server. Now I have this open. Let's create a new history like this. This is what your Galaxy interface should look like. There are a few places we can get this data from. If you go to Shared Data and then Data Libraries, all the data required for the tutorial is present under GTN Material. So select that. Once you select that, you can go to the proteomics part and select Proteogenomics 3 Novel Peptide Analysis. Once you do that, select the Zenodo link there and the few files that we need to run the third workflow, which are the genomic mapping file, the Mus musculus GTF file, the mz to SQLite file, the peptides for BLASTP analysis, and the PSM tabular report. Once you have selected them all, you can export them to a history as datasets and name your history "novel peptide analysis". Then import. If you go to Analyze Data now, all the files that you have imported from the shared data libraries will be right there. If you have run the second workflow and got its output, then your history will have more results. For the sake of running this third part of the tutorial, we have also shared histories, as I mentioned before.
For that, you can go to the shared histories, search for the Proteogenomics 2 Database Search history, select it and import the history. You just click on the plus sign over here and select Import. Once you do that, it will become your active history right here. The next thing we do is get the workflow. For that, you go to Shared Data, then Workflows, and search for GTN Proteogenomics 3 Novel Peptide Analysis. If you click on the dropdown, you can import it or run it directly, but I have imported it just to show you. You can then click on "Start using this workflow". The workflow that you have just imported is now in your Workflow tab right here. As you can see, the workflow is right here, and if I click on the dropdown and select Edit, here is what your workflow looks like. Each and every tool involved in your workflow will be right here. The inputs are the Mus musculus GTF file, the peptides for BLASTP, the PSM report, the mz to SQLite file and the genomic mapping file. And these are the tools involved: NCBI BLASTP, a Query Tabular step where you extract the PSM information of the novel peptides, a Query Tabular step that produces a novel-peptides-only file, the Peptide Genomic Coordinate tool, the PepPointer tool, and the final summary Query Tabular output. The next thing we do is run the workflow. If you select the play button, you run the workflow. Here, the main thing you have to pay attention to is selecting the correct inputs. It says here "edited Mus musculus GRCm38.86", so we have to select that file from the history. Next, the peptides for BLASTP, which is right here. Then the PSM report, so we will select the PSM report. The mz to SQLite file is already pre-selected, because the workflow looks for the SQLite file extension and this is one of the SQLite files. The next input is also looking for an SQLite file, so we do the same and select the genomic mapping SQLite database file.
Once we select that, we have all the correct inputs required for running this workflow. Please make sure that you have checked that each and every name matches before you run the workflow. Once you have confirmed that, we will start running the workflow. Galaxy will tell you whether the workflow has been invoked or not, and the tools will load up here in the history panel. Right now they are gray in color, which means that all these tools are queued in your history and will run once resources are available. Once a tool starts running, it will turn orange, like this, and then it will turn green once completed. While this is running, I have already published a completed history for the sake of the tutorial. Let me just import that. To import it, I'll go to Shared Data, again Histories, and I'll search for Proteogenomics 3 Novel Peptide Analysis right here. Now I will select that. As you can see, all my tools have finished running here, and I'll import it. This will become my active history. As you can see, there are various outputs. The last workflow, part two, ended at the peptides for BLASTP analysis right here. These are the tools that were present in the current workflow, part three. The first step, as I mentioned, is that we performed the NCBI BLASTP analysis, and we have results for the six peptides. Now, let's view these results. If I click on the eye icon, I can view the results. As you can see, I have the information from BLASTP. It gives me the peptides, the proteins and the identity information, and it has all 25 columns that BLASTP provides, as you can see. This is the information that I acquired from BLASTP. Now I also want to know the PSM information for these peptides. For that, I run Query Tabular.
If you click on the rerun button here, it will show you what criteria I chose to extract the PSM information of the novel peptides. For this, I selected distinct PSMs right here, where the PSM sequence is equal to the BLAST query sequence. The selection criteria for a novel peptide are that the BLAST identity is less than 100%, or there is at least one gap, or the length of the BLASTP-aligned sequence is less than the query length. If a peptide meets these criteria, it will be included in the output of novel peptides. So the PSM information of only those peptides that fulfil the criteria is displayed here. As you can see, there were initially six peptides, and now I have only four, because these two peptides are the same. So out of six peptides, I have four that are considered novel according to the criteria we just mentioned. What I did next was extract these distinct novel peptides, which are right here. So we have four peptides now. Now that we have our novel peptides, how can I be sure of these peptides? Are the spectra good? Does the peptide match a reference? And what does it look like on the genomic sequence? To answer these questions, I want to view the peptides. For that, we have our Multi-omics Visualization Platform. To open the Multi-omics Visualization Platform, we need to select the mz to SQLite output that we got from running our second tutorial. Select that, and go to "Visualize this data" right here. Once you select that, it will give you two options, Editor and MVP Application. Select the MVP Application. Once you select it, I am going to open it in a new window. So now I have a new window with only the MVP platform. I can go back to my history if I select the previous tab. So let's open the MVP viewer.
Here is the information about whatever is present in the mzIdentML file produced when you run the PeptideShaker tool. These are all the entries present in your peptide identification file. But all I am interested in is looking at the novel peptides. For that, I just want to load the novel peptides input. So select "Load from Galaxy" and search for the novel peptides, which is right here. I'll select that and use it for filtering. Now we have the filtered sequences right here, so I will select that. As you see here, I have four novel peptides. What information is this giving? The peptide overview tab gives me information about the sequence, the spectra related to that sequence and the proteins that are assigned to the sequence. The second thing I want to do is look at the spectra, the PSMs associated with these sequences. I can select them one after the other, or I can select all at the same time. It's your choice. For the purpose of the tutorial, let me just select all and show you what the PSMs look like. For that, select "PSMs for selected peptides" right here. As you can see, I should be getting information for six spectra, which is right here. The first thing I want to do is look at the quality of the spectrum. As you can see, this is the Lorikeet viewer. The Lorikeet viewer gives me information about the spectrum that I have just clicked. It gives me the peptide information, the file it came from and the theoretical mass, all the information that is present in your PSM report. For me, I cannot accept this spectrum because it does not have a continuous b-ion and a continuous y-ion series, but it depends on the user and the researcher what they want the spectra to look like. This Lorikeet viewer is very interactive, and we can select multiple ion types, or we can select MH+ or internal ions.
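To make the b- and y-ion series mentioned above concrete, here is a small Python sketch that computes singly charged b and y fragment m/z values for an unmodified peptide and checks which of them are matched by observed peaks within a mass tolerance. This is not how the Lorikeet viewer is implemented. The function names, the default tolerance of 0.5 and the omission of modifications (such as the iTRAQ and carbamidomethyl modifications on the tutorial peptides, which would shift these masses) are all simplifying assumptions.

```python
# Monoisotopic residue masses in daltons for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, running = [], 0.0
    for mass in masses[:-1]:            # b1 .. b(n-1), from the N-terminus
        running += mass
        b_ions.append(running + PROTON)
    y_ions, running = [], 0.0
    for mass in reversed(masses[1:]):   # y1 .. y(n-1), from the C-terminus
        running += mass
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def matched(theoretical, observed_peaks, tolerance=0.5):
    """For each theoretical m/z, report whether any observed peak lies within tolerance."""
    return [any(abs(t - p) <= tolerance for p in observed_peaks) for t in theoretical]


# Unmodified version of a peptide named in this tutorial.
b_series, y_series = fragment_ions("ELGSSTLTAR")
print([round(mz, 4) for mz in b_series])
print([round(mz, 4) for mz in y_series])
```

A series can be called continuous when `matched` returns an unbroken run of `True` values. How long that run needs to be is the researcher's judgment call, as noted above.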
We can also select the neutral losses. We can change the mass tolerance. For example, I can set it to 0.5 and update it, or I can uncheck Peak Detect to show more peaks. Peak Detect will just give you the refined peaks. If I don't want the ions to be displayed and instead want the m/z values to be displayed, I can click on that. So you can do anything you like with this. It's your choice. For the purpose of the tutorial, I will just uncheck all of these and show you what the spectrum looks like. If I like the spectrum, I can click on the thumbs up icon, and if I do not like it, I'll click on the thumbs down icon. For now, let me click on thumbs down. That means I did not like the spectrum. I can do the same with the other peptides. As you can see, when I clicked on the next peptide, its spectrum appears on top of the previous one. Again, I can manipulate this spectrum as I want. I am happy with this spectrum, so I will give it a thumbs up. I will look at the next spectrum, and again it doesn't satisfy my criteria, so I will give it a thumbs down. For this peptide, there are three different spectra assigned to it, so I can select one spectrum at a time. If I like the spectrum, I can give it a thumbs up and then move to the next spectrum assigned to it. I don't like this spectrum for that peptide, so I'll give it a thumbs down. And again, for the next spectrum, I can like it or not, but because I am only concerned with one of the peptide's spectra, I'll just keep it the way it is. Now you see here, there's something called Export Scans. As I've selected the two peptides and the spectra that I liked, I can export these scans to my own Galaxy history so that I have all the information about those particular peptides. Basically, it is going to give me a tabular output with all the information that is already shown here. So if I click on that, the count goes to zero.
That means it has been copied to your current history. Once that has run, it will also show me the information about that peptide. Now let's go back to the MVP application. What you saw just now was the PSM detail of the selected peptides, that is, the Lorikeet viewer. The next thing I want to look at is the peptide-protein viewer, which tells me the location of that peptide on the reference protein. If I click on the peptide-protein viewer, I can select the peptide of interest and it will be shown right here. This is telling me that this is the reference protein and here is the peptide. If I drag the gray box, I can get to the peptide of interest. As you can see, this is the peptide and this is the reference protein. There are various color codes shown here. The N and the K are color coded differently compared to the C and the other residues. The orange means they match the reference protein. What is the difference between the C, K and N, even though they match? The color coding is because of their modifications. Basically, the N and the K carry modifications, as does the C. To understand more about this, if you look at the peptide overview here, you can see that there are modifications. If you hover over the K, you can see it is modified with iTRAQ 4-plex. The N here is modified with the N-terminal modification iTRAQ 4-plex, and the C here carries the carbamidomethylation of C. That's why they are color coded differently. This is the protein viewer. Now that we have the peptide aligned to the reference protein of interest, the next step is to look at the genomic localization. The viewer tells you the information about that peptide right here in the heading. It's telling you that it's coming from the StringTie output.
It's in an intron and it's coming from chromosome 4 of mm10, which is the Mus musculus mm10 genome build, along with the location where it's present and the strand, which here means it's on the positive strand. What I will do is click on the arrows right there on top of the reference protein. If I click on that, it shows me the IGV browser right here. I will give it a few minutes to load. As you can see, this is the IGV browser. Now I want to look at this peptide right here. I will click on the wheel right there, perform a three-frame translation and zoom in. Once I zoom in, I can find this peptide right here. As you can see, this is the location of the peptide. As with any IGV browser, you can edit it as you want. You can change the chromosome location. You can add the cursor guide to it. You can add the center line, or you can show the track labels like this or hide them, and you can also save this image as an SVG. You also have the opportunity to set a track name, a height, a color or anything else you want for this track. We have also added a feature called Add Track, where if you click on it you will find all the files that can be added to this browser. You just have to select the files you want loaded and click Load Track. If you load a track, it will just keep piling up below. This is how MVP works. It gives you three main outputs. It gives you the peptide overview along with the PSM detail, the Lorikeet spectra of your peptide of interest. If you scroll up, it gives you the peptide's position on the reference protein and also the IGV browser. Now that I have all the information regarding my novel peptides, I will go back to my history. The next step we have added is the Peptide Genomic Coordinate tool, as I mentioned before. The Peptide Genomic Coordinate tool will look at the peptides that have the better spectra.
As I already saw that there were two peptides with good spectra, it will show those peptides. If you click on the eye icon, which is the view, it gives me the two peptides with good spectra. It gives the chromosome, the chromosome start and stop positions, the strandedness, the score, the thick start, the thick end and some other information regarding the peptide. This is a BED file. As I mentioned before, MVP can open BED files, so you could load this as a track in MVP. Now we have more information regarding the peptide. If I click on the PepPointer output, it gives me information about the peptide again, along with its annotation. Here it's telling me that the peptide ELGSSTLTAR is on chromosome 2 and overlaps two regions, but it wants us to look manually. By looking manually, it means looking at MVP's IGV browser. We will be updating this in the next version, so that it tells you all the information about the peptide. The last thing in our workflow is the summary of these novel peptides. For that, again I will click on the eye icon, and here's the summary. It's giving me the peptide sequence, the spectral count related to it and the protein it's coming from. E means exon and I means intron right here. It's giving me the chromosome, the start and end, the strandedness, whether it's reverse or forward, the annotation, the genomic coordinates and a link to the UCSC browser. What I can do is copy this link and open it in a new tab. It will give me the location of that peptide on the reference genome in the UCSC browser. That concludes our workflow. All of the information that I've just mentioned is already present in the Proteogenomics Novel Peptide Analysis tutorial in the Galaxy Training Network. For that, you can go to training.galaxyproject.org/training-material, go to the proteomics section, and you will find the tutorials I mentioned right here.
You can select the hands-on portion and view the step-by-step walkthrough of the tutorial. There is also an easier way to access this tutorial. If you're already on this Galaxy page, you can click on the hat icon right here, and that will open up the Galaxy Training Network so you can repeat the steps again. Select proteomics and select the required tutorial right here. The take-home message from proteogenomics tutorial three is that we did not just extract the novel peptides, but we also looked at the quality of the spectra and the genomic localization to confirm that each peptide is of good quality and to confirm its identification and its presence in our sample. In the Multi-omics Visualization Platform, or MVP as we call it, we saw the different aspects of the platform: the PSM Lorikeet viewer to look at the spectral quality, the peptide-protein viewer to align the peptide to the reference protein, and the genomic localization using the IGV browser. The three main outputs from the workflow, apart from MVP, are a BED file from the Peptide Genomic Coordinate tool, a tabular output with the peptide annotation from the PepPointer tool, and the final summary tabular output containing all the information regarding the peptides. Now the main part is this. You can copy and paste the information in that column into a new tab to open the UCSC Genome Browser, which should look something like this. Now that you have opened the genome browser, you can see that it has similar functions to the IGV viewer. This is just a sample of what you can view. You can look at the chromosome and the location of the peptide, the nucleotide codon corresponding to the variant amino acid, and the different variants found in different species. It will also tell you what kind of variant this is. This completes our proteogenomics tutorial three. The tutorials are published on the Galaxy Training Network under the proteomics section.
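As a rough illustration of how a BED record from the Peptide Genomic Coordinate output relates to the UCSC link in the summary table, here is a short Python sketch. The function name, the example record and its coordinates are hypothetical, and the workflow itself builds the link with Query Tabular, so its exact formatting may differ. The sketch assumes the mm10 build named in this tutorial and converts the 0-based BED start to the 1-based position that the browser displays.

```python
def ucsc_link(bed_line, db="mm10"):
    """Build a UCSC Genome Browser URL from one tab-separated BED record."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom = fields[0]
    start = int(fields[1])  # BED start is 0-based
    end = int(fields[2])    # BED end is exclusive, so it equals the 1-based end
    # UCSC position strings are 1-based and inclusive, hence start + 1.
    return (
        "https://genome.ucsc.edu/cgi-bin/hgTracks"
        f"?db={db}&position={chrom}:{start + 1}-{end}"
    )


# Hypothetical BED record (chrom, start, end, name, score, strand), for illustration only.
example = "chr4\t100\t130\tEXAMPLEPEPTIDE\t255\t+"
print(ucsc_link(example))
```

Pasting the printed URL into a new browser tab opens the genome browser at that region, which is the same action as copying the link column from the summary output.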
There are three training materials for proteogenomics, as you have seen: the database creation, the database search and the novel peptide analysis. The training network also has information about the input files and the workflows, as well as the Galaxy instances you can run the workflows on. Thank you for listening to our tutorial. We would truly appreciate your feedback to improve our content. A feedback form is present in each and every training network hands-on tutorial. Please go to the bottom of the page to find the feedback Google form. Thank you for attending the Galaxy Global Workshop. This is Subina Mehta. If you have any questions or concerns, please contact us through the Slack channel, Gitter or GitHub. Thank you.