Welcome to the MOOC course on Introduction to Proteogenomics. In the last lecture, Dr. Pratik Jagtap talked to you about the workflow going from RNA sequencing to FASTA database creation. In today's lecture, he is going to continue the discussion of bioinformatics solutions for big data analysis. He will talk more about the output files and the data analysis needed to unravel the questions at hand. He will also be talking about three workflows: RNA sequencing to a variant FASTA database, database searching using MS/MS data, and identification and visualization of novel variants. He will explain SearchGUI, a set of freely available algorithms for matching MS/MS data to peptide sequences. He will also talk about the Lorikeet viewer, a spectrum viewer showing all the b and y ions identified in a mass spectrum. I hope it will also refresh the previous lecture by Dr. Karl Clauser, where he gave you an understanding of manual interpretation of data sets. So let us welcome Dr. Pratik Jagtap for today's lecture.

So these two tools, both of these workflows, this one as well as this one, generate a protein sequence FASTA file as well as a mapping file. This is a protein sequence FASTA file: as you can see there is the Ensembl identifier here, along with the name of the protein if the function is known, and this is another one. The genome mapping file gives you again the identifier, or the accession number, of the protein, which chromosome it comes from, what the start site is, what the end site is, and it also gives you the length of the exon. Now, one of the things you will observe here is that this is a 174 base pair sequence, and then you can see that there is some gap between this and this, which shows that there is an intron present there.
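The exon bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the Galaxy tool's actual file format: it assumes a tabular mapping of one exon per row (accession, chromosome, start, end) and flags any gap between consecutive exons as an intron, just as the lecturer infers an intron from the gap after the 174 bp stretch.

```python
# Hypothetical mapping-file rows: (accession, chromosome, start, end) per exon.
# Gaps between consecutive exons of the same protein are reported as introns.
from collections import defaultdict

def find_introns(rows):
    """Return {(accession, chrom): [(intron_start, intron_end), ...]}."""
    exons = defaultdict(list)
    for acc, chrom, start, end in rows:
        exons[(acc, chrom)].append((start, end))
    introns = {}
    for key, coords in exons.items():
        coords.sort()
        # A gap between the end of one exon and the start of the next
        # (more than one base apart) implies an intervening intron.
        introns[key] = [(a_end + 1, b_start - 1)
                        for (_, a_end), (b_start, _) in zip(coords, coords[1:])
                        if b_start - a_end > 1]
    return introns

rows = [("ENSP_X", "chr1", 1000, 1173),   # a 174 bp exon, as in the example
        ("ENSP_X", "chr1", 2000, 2100)]   # the gap in between is an intron
print(find_introns(rows))
```

Real Ensembl mapping files carry more fields (strand, frame), but the gap logic is the same.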
So this kind of helps you to map your regions of the protein, and later it actually also helps you, because if you have identified a peptide from this protein, one can use these coordinates to go back and find out what the coordinates of the peptide are as well. So let us imagine that we have run that workflow, the one which started with the FASTQ files and the FASTA file as well as the GTF file, ran through all of this, and you got a protein sequence file. This is how your history looks once you go and access the Galaxy instance and go through the documentation: after the first workflow you have this as a history, and one of these here would be a protein FASTA file that you can use for your analysis. The second part is this: you have your protein FASTA generated from your RNA-seq data, you have your MGF files which you have acquired from the same sample, and now you search it. We have a tool called SearchGUI, which has at least nine different search algorithms in it. It helps you to identify the peptide spectral matches, and then it uses PeptideShaker to perform both FDR analysis as well as protein grouping, and then there is another tool called mzToSQLite that is used to perform further analysis. It also generates peptides so that you can perform BLAST-P analysis, and BLAST-P analysis is important because you want to go back to the NCBI nr database and ensure that these peptides do not match what is already there. There have been some instances where something that we found to be a novel peptide just a month before was no longer novel because somebody had annotated it. So you have to ensure, and even report in your manuscript, that in December of 2018 this still was a novel peptide. Now in January maybe it is not, but at least you have covered your requirements by saying: this is the database I searched against on this date.
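Going from a peptide's position inside a protein back to genomic coordinates, as described above, can be sketched for the simplest case: a forward-strand, single-exon protein. This is a deliberately simplified illustration; the actual tools in the workflow (PepPointer and Peptide Genomic Coordinate, covered later) also handle introns and the reverse strand.

```python
# Simplified sketch: forward strand, no introns. Each amino acid spans
# three genomic bases, so a peptide's genomic span follows directly from
# its amino-acid offset within the protein.
def peptide_genomic_coords(protein_start, pep_offset_aa, pep_len_aa):
    """protein_start: genomic coordinate of the protein's first codon base.
    pep_offset_aa: 0-based amino-acid offset of the peptide in the protein.
    Returns the (inclusive) genomic start and end of the peptide."""
    start = protein_start + 3 * pep_offset_aa   # 3 bases per residue
    end = start + 3 * pep_len_aa - 1            # inclusive end coordinate
    return start, end

# A 9-residue peptide starting at residue 11 of a protein whose first
# codon begins at genomic position 1000:
print(peptide_genomic_coords(1000, 10, 9))
```

Once an intron falls inside the peptide, the span has to be split across exons using the mapping file's exon coordinates, which is exactly why the workflow keeps that file around.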
So I might actually skip this slide, but I use it a lot to explain the complexity of the process: how we start with proteins, digest them into peptides, put them through LC-MS/MS, and identify them. This captures everything, and you have perhaps seen it many times, so I will not go through it, but it is a really good article; it comes from Jürgen Cox's lab in Germany, and it covers most of the contemporary proteomics research that is going on, so it is something I would suggest you go and read if you are interested in the basics or even the contemporary status of proteomics. So again, you take your spectra, match them against your protein database, and you can identify your peptides. SearchGUI, as I said, is freely available software; you can also run it as a standalone GUI, but we have it in Galaxy so it runs in the workflow that I mentioned. Then there is PeptideShaker, which does protein inference, does FDR analysis, and identifies not only peptides and PSMs but also proteins from your sample. It also generates an mzIdentML file, which you can use for further processing. So we are in the second workflow now: we have our protein FASTA from the RNA-seq data, we have our MGF files, and we want to search the data. It takes those two inputs, searches the data, uses PeptideShaker to generate multiple outputs, and then this goes into a tool we call mzToSQLite. What it does is take the tabular outputs and generate a SQLite database out of them. That helps you to perform more complex analysis on your tabular inputs: instead of processing these tabular files in multiple separate steps, you can have everything in one place and then run more complex queries on it to get answers.
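The mzToSQLite idea, turning several flat tabular files into one queryable database, can be shown with Python's standard sqlite3 module. The table and column names below are invented for the example and are not the tool's actual schema.

```python
# Illustration of the "tabular files into SQLite" idea: once PSM-level
# results live in one database, multi-file questions become single queries.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE psms (peptide TEXT, protein TEXT, score REAL)")
conn.executemany("INSERT INTO psms VALUES (?, ?, ?)",
                 [("PEPTIDEK", "ENSP_A", 45.2),
                  ("VARIANTR", "ENSP_B", 61.7),
                  ("VARIANTR", "ENSP_B", 58.0)])

# One query instead of several passes over flat files:
# spectral counts and best score per peptide.
for row in conn.execute("""SELECT peptide, COUNT(*) AS n, MAX(score)
                           FROM psms
                           GROUP BY peptide
                           ORDER BY n DESC"""):
    print(row)
```

The same pattern scales to joining PSM reports against a table of known peptides, which is how "show me only the novel ones" becomes a single SQL statement later in the workflow.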
Eventually, what comes out of this is that you identify peptides we think are novel, in the sense that they are not present in your reference proteome, and then you can take those peptides and subject them to BLAST-P analysis. So that covers these two workflows: the first one generates a protein FASTA file; the second gives you two outputs, the peptides that are going to be subjected to BLAST-P analysis and an mzToSQLite file, and I will talk about what that means. The third workflow is, I think, the most interesting, because for the first two you could use other software: if you have an MGF file and your RNA-seq generated protein FASTA file, you can use any search algorithm to get your outputs. Still, these two workflows, with the work we have done, hopefully help you get it done more easily on your datasets. By the end of this, your history would look like this: it will have 54 items, because you have added outputs from each of these tools. And just as a backup, in case your search has not run, because it got stalled or for some other reason, you can always go back to that Galaxy instance I talked about and download the history for the third workflow. So again, the same process: you take your active history and run it with the third workflow. What does the third workflow have? It takes in inputs from workflow 2, which are the peptides for BLAST-P, and it also takes the PSM report, the peptide spectral match report from PeptideShaker, generally at 1% global FDR, and then it also has this mzToSQLite file. This contains all the information in your mzIdentML file.
The mzIdentML file is the file that is generated by a protein or PSM search. All of that mzIdentML information gets into the SQLite database, and this is really important because it also has information about continuous b ions, continuous y ions, all the spectral annotation information, and that is something you can use to process your data. The MGF file is just your peak list; it does not have any peptide annotations, while the mzIdentML file has those and a lot more information. If you want to use the next tool, which gives you the ability to view your spectra, then you should have an mzIdentML output, and there are many software packages that produce one: Scaffold does, ProteinPilot does, and I am sure Proteome Discoverer does too, though I have not used it. So mzIdentML is quite a standard output nowadays that most software generates. If your software does not produce it, that will prevent you from doing the spectral visualization, but it does not mean you cannot do the rest of the things in this workflow. And if you run it in Galaxy, since SearchGUI and PeptideShaker produce it, you can use that. Then we also have the genomic mapping file from workflow one. All of these are used in the last workflow, and this tool here is the Query Tabular tool; remember, I talked about taking tabular files and generating a SQLite database out of them. The SQLite database can now be queried, and you can say: show me only those peptides or peptide spectral matches which are novel.
So you are matching against the known database to find those, and these novel peptides would be a result of the NCBI BLAST that you have done. These novel PSMs can now be converted to just peptides, and these peptides can be used to generate an output which gives you each peptide, its genomic coordinates, and other information. I will talk a little bit about that; there are two tools used here, one is Peptide Genomic Coordinate and the other is PepPointer. Looking into the details: we had the BLAST-P peptides from workflow two, this also from workflow two, and this one from workflow one, and then for the NCBI BLAST-P you apply certain rules. The rules that we used are: if the BLAST-P identity is less than 100%, or there is at least one gap present, or the alignment length is less than your query length, then you call it a novel peptide. We have used this on multiple datasets, and these three features seem to be enough to identify a novel peptide. That is the information used here to identify novel peptides, and then PepPointer is used to generate a tabular output that you can use for further analysis. But before that, I would like to talk about this tool called mzToSQLite, which is used as an input to visualize your data. If you are really interested in spectral visualization, I would strongly encourage you to go through the Galaxy instance and the documentation, because this really does not take much time; I think the first workflow takes 12 minutes and the others take just two minutes each, but it will help you to go through it. What the mzToSQLite tool does is take the mzIdentML file that you have and generate a list of all the peptides that have been identified, with associated information.
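The three novelty rules stated above translate directly into a small predicate. The field names follow NCBI BLAST's tabular output conventions (pident, gapopen, length), but treat this as a sketch of the rules from the lecture rather than the workflow's exact implementation.

```python
# Novelty rules from the lecture: a peptide is flagged novel if its best
# BLAST-P hit (a) has identity below 100%, (b) contains at least one gap,
# or (c) aligns over fewer residues than the query peptide.
def is_novel(pident, gapopen, align_len, query_len):
    """pident: percent identity of the hit; gapopen: number of gap openings;
    align_len: alignment length; query_len: length of the query peptide."""
    return pident < 100.0 or gapopen >= 1 or align_len < query_len

print(is_novel(100.0, 0, 9, 9))   # exact full-length match: not novel
print(is_novel(88.9, 0, 9, 9))    # a mismatch in the alignment: novel
print(is_novel(100.0, 0, 7, 9))   # only 7 of 9 residues align: novel
```

In practice this is applied to the best hit per peptide; a peptide with no hit at all is trivially novel.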
Now you can select novel peptides. Remember, we actually identified some novel peptides earlier, which are already in the history, so you could say: show me just the subset of novel peptides. If you do that, the list goes down, and then by using the various tabs that are open here you can visualize the spectrum using Lorikeet. This is all within Galaxy, so you can use the Lorikeet visualization tool to look at which peaks are annotated with b and y ions, how many of these are continuous b and y ions, and so on. Basically, you look at the spectral quality, because even if you have done 1% FDR and you are getting a really great score for your peptide, it is not guaranteed that the spectral annotation is good. I think it is important, if you are identifying a novel peptide, that you at least have the means to convince yourself, and then the reviewer, that this peptide is indeed novel and definitely needs further interrogation. One can also perform genomic localization using the MVP tool that I mentioned. In the tutorial there is a peptide that we looked at, and as you can see it is annotated quite well. The Lorikeet viewer is also interactive, in the sense that you can check b and y ions on and off to select them. If you were to do this manually, it would maybe take 20 minutes to half an hour for somebody to look at each of these, while this takes a few minutes or even a few seconds. It also gives you an option to mark a spectrum as good or bad, so you can look at the list of all your novel peptides, let us say you have 200 of them, assess the spectral quality, and then select only those that are important for you.
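The "continuous b and y ions" criterion used when judging spectra in Lorikeet can be made concrete as the longest run of consecutive annotated fragment ions. This is a rough sketch of that one metric, with illustrative ion indices, not a full spectral quality score.

```python
# annotated_ions: sorted indices of annotated fragment ions of one series,
# e.g. observing b2, b3, b4, b5, b7, b8 gives [2, 3, 4, 5, 7, 8].
def longest_run(annotated_ions):
    """Length of the longest run of consecutive ion indices."""
    best = run = 1 if annotated_ions else 0
    for prev, cur in zip(annotated_ions, annotated_ions[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

print(longest_run([2, 3, 4, 5, 7, 8]))  # longest stretch is b2..b5
```

As the lecturer notes, the threshold itself is subjective: one reviewer may accept three continuous ions where another wants more, so a metric like this supports the judgment rather than replacing it.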
The next part is genomic localization. Again, by using various buttons and features, when you look at the protein view report on the mzToSQLite file you can open the Integrative Genomics Viewer, and then you can look at the variations that you see. You can lay down tracks and look at the RNA-seq data, the three-frame translation of the DNA sequence, the annotated peptide sequences, and so on. So it gives you a pretty good way of looking at the peptide that you identified, and you can also scroll and look at adjacent peptides. The eventual output that comes out of this tells you the chromosome number, the start and stop sites, the peptide that came out of it, and the annotation, that is, where these peptides lie. The ones that you found in the CDS were basically the ones that were single amino acid variants, and there were some that we found in the 5 prime untranslated region, and so on. So if you have, let us say, hundreds of them, it helps you to identify or even classify these peptides based on the variants that you found. It also gives you something more interesting: you can sort by chromosome, and then you might even see a pattern wherein these peptides are lying around a particular region, and that gives you an idea about a mutation that could be occurring in that particular phenotype. You can take the link in the output that you generated and open it in the UCSC genome browser, and then you can look at more details, for instance the homology of that particular gene in other organisms, and convince yourself that the mutation you see is indeed novel or is similar to some other organisms.
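The classification and sorting step above can be sketched as a simple grouping: count novel peptides per genomic feature class (CDS, 5' UTR, and so on) and collect their positions per chromosome to spot clusters. The record layout here is invented for illustration.

```python
# Hypothetical per-peptide records: (chromosome, genomic start, feature class).
from collections import Counter, defaultdict

peptides = [("chr1", 1500, "CDS"), ("chr1", 1620, "CDS"),
            ("chr1", 1710, "CDS"), ("chr5", 90000, "5'UTR")]

# How many novel peptides fall into each feature class?
by_feature = Counter(feat for _, _, feat in peptides)

# Where do they sit on each chromosome? Clusters of nearby positions
# on one chromosome hint at a mutated region.
by_chrom = defaultdict(list)
for chrom, start, _ in peptides:
    by_chrom[chrom].append(start)

print(by_feature)
print({chrom: sorted(pos) for chrom, pos in by_chrom.items()})
```

Three CDS peptides within a couple of hundred bases on chr1, as in this toy data, is exactly the kind of pattern the lecturer suggests following up in a genome browser.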
So again, there is information here that you can use through the UCSC browser, and in this case, for that particular peptide I talked about, that is the localization: we found that there was a mutation in this particular region, a G had gotten converted into an A. You can also go beyond Galaxy and use some other tools to look at which conserved domains it occurs in. So basically, I would strongly encourage you to go ahead and use this if you are interested in proteogenomics or in using Galaxy for proteogenomics, and I am sure most of you are; it gives you a sense of how it can be used. Now, it is set up for a small data set, something that runs in maybe an hour or two, but you could also think about scaling it up to larger data sets, your own data sets for example. I just wanted to mention that Galaxy-P is not only about proteogenomics; we also work in the areas of metaproteomics and metabolomics, so we are developing tools and workflows to enable those analyses as well. So this is what I would encourage: go try it out, look at this site; all you need is a registration, a login and a password, and the documents are really detailed; we have used them for at least three workshops this year, so feel free to try it out. Again, these are the same documents; as I said, as a backup, if this fails you can always try this other instance, and as I was mentioning to somebody here, this is an instance that will be there for longer, so if you want to just skip the first and start using that one, that is fine too. I just wanted to acknowledge the people responsible for making this happen: Praveen Kumar and Professor Tim Griffin.
Professor Tim Griffin is the PI of the Galaxy grant, and Subina Mehta was also one of the people who made this happen: Praveen developed a few of the software tools along with James Johnson, and Subina Mehta tested them and also worked on the documentation that you can go and have a look at. Lastly, we work with multiple researchers around the world, not only users but also developers, because the tools are being developed as the field emerges, not only in proteogenomics but in other fields as well. Many tools are being developed, so we try to work with the best tools available, package them in Galaxy, and integrate them into workflows. You can go to galaxyp.org to find more information; there is also galaxyp.org/contact if you want to contact us. With that, I will be happy to take any questions.

Yes, the mapping file that is generated from the second workflow would have that information. So yes, for a particular dataset it will have it; you need to run your own RNA-seq data to get it. For this example, yes, we do have it, and for the other datasets we have run, we have it as well. So it is possible; that is why we have those two different workflows, one for single amino acid variants, which uses a particular set of tools, and one for junctions, which involve a somewhat larger genomic reorganization. And it is not only captured in your genome mapping file but also in the protein FASTA file: you will have your genomic coordinates and the fact that it is a junction, a novel junction.
I would encourage you to do a hands-on session later with the documentation, but I can show you what I was talking about, show it in action and where things are, and if there are more questions I can take them. So this is the Galaxy instance, this is where you log in and register; I have already registered and logged in. This is the tool panel, this is the history, and this is the central pane. The way you start is you go to Shared Data and then Histories; click on a history, and as you can see there is an input history here. If I click on that, it shows these are the five files, or ten files in fact, that I have here, and I can explain why there are ten files and not five, though we can see only six of them. Then I go to this place called Import History, and you can name it anything, so I can call this Mumbai. Once I import it, it shows up here. This is just one way, for the tutorial, but you could also have all your data sets somewhere else, on an FTP site or on your computer, and you can upload those as well; there is an upload function available. So here, as you can see, there is a FASTQ file, a GTF file, a FASTA file, MGF files, and so on, and there are icons here so I can look at the data. If I click on this, the central pane shows me which chromosome and what the format of that particular file is, and that is true for your FASTQ file as well, and so on, and the BAM file. On the left side are the tools available, and you can also search for tools; for example, if I search for SearchGUI, it shows me that there is this tool called SearchGUI and the associated PeptideShaker tool. But again, for the sake of the workshop, because you can spend a lot of time looking at this, and then if you go
to the Shared Data section, you can go to Workflows now. What we are trying to do is convert this FASTQ file into a protein FASTA file, so I go to the workflows and select the first workflow, the second one listed here, which is workflow one, RNA-seq database generation. I go here and say import, and once I import I can say start using this workflow. There is one feature which I did not talk about: the workflow is now in your domain, you have transferred it, and you can modify it, so let me show you that. There are many things here, and I would strongly encourage you to go and explore them. I go to this place called Edit, and hopefully very soon it will open the workflow layout that we were seeing earlier. You can see here that this shows the structure; there is a way of reducing the zoom, though I do not know how to do it on this computer, but you can reduce the size and look at it. Here you can actually see that there is the FASTQ file and the GTF file as inputs, and there are the various tools that we talked about. Now, for example, if I wanted to go into this tool, and, just giving an example, let us say it says run individually but I want to change that to merge, I can do that and save it, though I might name it differently; I can change anything here and save it. So you can modify these workflows; you can also add inputs or outputs and add tools to them, so it is quite flexible that way. I am not going to save this, or maybe I will just leave it; from here I can now run the workflow, so maybe I will save it and run the workflow, or run it from the earlier place that I
showed you. When I am running the workflow now, all I need to ensure is that I select the right inputs. Here it says the first one should be FASTQ but it is showing a FASTA file, so I go and select the right one, which is the FASTQ file; here I need to select the GTF file, so I select that; and thirdly this one is right, this is the protein FASTA file that you have. Then I go ahead and click, and each one of these, as you can see, is a tool in that workflow, and there are 22 such tools. Then I just say run workflow, and what starts happening, and you should be able to see this very soon, is that the outputs start piling up: the initial inputs are here, the first file has started getting converted by the first tool, and this will happen for the rest of them; all these outputs will start getting generated. So this is how you generate a history. Maybe they just wanted a more dramatic way of saying I generated history, but that is why it is called a history. At the end of it we should have these outputs coming out. I am going to jump and skip to the next history, so let us just imagine that this history has run, which, as I showed, takes 12 minutes; we have checked that. I am going to go to history 2 and import that history. Sometimes I actually put in dates so that I know on which dates these workflows were run, but in this case I am just saying Mumbai, because if I store the same name I might get confused later when I have to select it. What is history 2? History 2 is an output from the first workflow, so the items that were gray right now, once you run them, give you history 2. So I am kind of doing this like a cooking show, I mean
just imagine this is cooked now, and let us look at the second one. Yes, this will be the input files; yes, from the first workflow; and the third one would be from the second workflow; history 1 should be the result of our first workflow. Okay, I am just going to import it the way it is, so that it has everything that was run, and I will show you what we do next. I wanted to mention that if you have a completed history, you can convert it into a workflow. I know this is a little advanced, but I will just show you. For example, this is your history after running the first workflow, and you generally do this in a stepwise manner: you can go to this wheel and say Extract Workflow, and now, having run a history, I can name it and say create workflow, and it will generate a workflow. Now I can share this workflow with you; you can take your input files and run it on them; I can share the workflow with anybody. So, the ability to share: let us say you are on the same Galaxy instance, I should be able to take this workflow and share it. All I have to do is generate a link and send it to you by email; if you click on it and log in, you should be able to open that workflow. That helps you to use the same parameters without regenerating anything. So coming back: we did the same thing, we searched the database and generated outputs, and what I will show you is the last output that is generated, which is basically everything together: your input files and the outputs from the first, second, and third workflows, which is the one here. Then we can very quickly look at some of the features. Galaxy has been in use in genomic studies for quite a few years now; we are basically adapting it, using the fact that it is really strong in
genomics and transcriptomics, for proteogenomics, because then it is easier to merge these analyses. You can go through this in a really step-by-step manner; it also gives you information on the basics of Galaxy, what histories are, and all that, so it really is a very easy way of learning Galaxy if you are interested. Those who are interested, please go and use it, and later, once you get a good feel for it, you can also contact us at galaxyp.org; at the end we have the contact information. We can also make suggestions: there are some sites in Europe right now which have a lot more infrastructure, so you can run your data sets on those, or you could contact some researchers who are running Galaxy locally to run some of your data sets there. So this is the final summary after running all three workflows. As you can see here, it generates this output with the peptide, the number of spectra it was identified with, the localization, the chromosome, the start site, and it also gives you a link that you can use to open the UCSC browser, which is an alternative to the IGV browser. This is the mzToSQLite MVP output that I talked about; in fact, if I click on this Visualize in MVP Application, it opens the application. No, these ones here are not novel; these are all the peptide sequences from the mzIdentML file, but in your history you have a place where it says novel peptides. This item 58 in the history is the list of novel peptides, and since this was a small data set there are only about eight of them here. The MVP application helps you to load that from Galaxy, and then here you would see only those eight peptides. If I click on one, it opens this, and it gives you this peptide here, and now I can select whatever internal ions, but it has
not been annotated right now. So if I add this, and you are right, if I start doing this, you might agree with it or you might not; it is a very subjective thing. I might think three continuous ions is good but you might not, and at least it gives you the ability to look into that. It helps you to do that, and you can also do it for other spectra, and so on. So if I go to Load from Galaxy and do that, I can get those eight, and I can also go to View and Protein, which I might not cover right now. I can then select this peptide; because we went from RNA-seq data, it comes from your protein FASTA, but it gives you the ability to look at a genome-centric view. And now if I go here, there are ways that you can look into this, or you can expand it and look at the three-frame translation around that particular peptide; you can see that the peptide is here, in the third frame. Again, this is for one peptide, but if you had 200 peptides or many peptides you would be able to see those too. So I think with that I will end.

In today's lecture, I hope you have learned about the two output files from the RNA-seq to FASTA database creation workflow, namely the FASTA file and the genome mapping file. You also got a glimpse of how one can search for novel variants using the three basic steps shown by Dr. Pratik Jagtap. We also learned how one can take Galaxy output files and use them in different online software, such as the Integrative Genomics Viewer (IGV), to understand the mutations in a gene. In the next lecture you will listen to another speaker, Dr. Ratna Thangudu, who will talk about large-scale data science. Thank you.