Welcome to the MOOC course on Introduction to Proteogenomics. Today's lecture will be delivered by Dr. Pratik Jagtap, who is a Research Assistant Professor in the Department of Biochemistry, Molecular Biology and Biophysics at the University of Minnesota, Minneapolis, USA. His current research interests include developing accessible workflows on the Galaxy platform in the areas of proteomics, metaproteomics, proteogenomics and data-independent acquisition data analysis. Dr. Jagtap is going to talk about bioinformatic solutions for big data analysis. As you are aware, big data is being generated by a variety of technology platforms such as NGS, mass spectrometry, RNA sequencing and many others, all of which contribute to the need for big data analysis. Dr. Jagtap will talk to us about how to analyze RNA-seq files by first converting them into protein FASTA files and then locating the identified peptides on the genome. He will also provide a demo of the Galaxy software and tell us about its application to understanding multi-omics data sets. So, let us welcome Dr. Pratik Jagtap to tell us about bioinformatic solutions for big data analysis.

I am going to talk about using the Galaxy platform for proteogenomics. We have been working in this field for the last five years, wherein the researchers at the University of Minnesota, some of whom you see here, have been trying to put proteomics tools within the Galaxy framework. The idea was not only to have the Galaxy framework and its tools perform standard mass spectrometry searches, but to be able to do something more complicated, like proteogenomics analysis. We also work in the area of metaproteomics, but that is not what we are going to cover here. One of the things we have realized while working in this field is that it is very important to work with the user, or with a project, to make these things possible. So, the structure is going to be more like a demonstration of using Galaxy for proteogenomics.
And I will obviously be talking about a few concepts as we go along. So, I will briefly introduce proteogenomics and multi-omics studies, though I may not spend much time there. What I will mostly focus on are the next three bullet points: if you have RNA-seq data, how do you convert it into a protein FASTA file and then use that protein FASTA file to search your mass spectrometry data? And then, once you have peptides identified from that, how do you make use of them to, first, look at the spectral quality, second, localize them on the genome, and then try to draw some biological conclusions? Going through multi-omics: each of these fields has its own strengths, transcriptomics, proteomics, genomics, metabolomics. But the real strength lies in making the best use of the features available in each of them to help answer the questions that you as a researcher have put forth. There are many technologies available, and many newer ones coming up, given that instruments are now much more sensitive and also have really fast scan speeds. So, they are not only able to go deeper, but newer mass spectrometers can also handle much more complex data sets, which helps you approach transcriptomics-level sensitivity in most analyses. I will not cover much of this, except to say that because we now have tools that work really well in each of these domains, we are also developing tools that let you make correlations across disciplines. I am sure this has been covered, but I just wanted to reiterate: if you have many mass spectra and you search them against a reference protein database, you end up identifying proteins that are annotated, that is, known proteins.
However, you can actually expand your number of identifications by searching against, let us say, what was used earlier: a six-frame translation of the genomic DNA sequence, or even a three-frame translation of the cDNA sequence. But nowadays, with the amount of RNA-seq data that is available, and the ability to generate both RNA-seq data and mass spectrometry data for the same sample, one can actually use RNA-seq data, and I will talk a little bit about that. Researchers have also used repositories, for example COSMIC and others, which let you go and get data from somebody else's research. It could be a protein FASTA file, or it could be RNA-seq data from representative clinical samples, and you could use that to search against your mass spectrometry data, with the understanding that it might not match completely, because your sample might express some unique sequences. All of this leads to the identification of peptides corresponding to novel protein sequences, which is what we think proteogenomics is all about. Again, we talked about single-time-point comparison; that is obviously challenging, but there are methods in place and people are doing it. Eventually, though, the field is going to move, and I am sure has already started moving, towards looking at time points of both RNA-seq data and proteomics data and merging them, so that you can make a time-dependent, or temporal, analysis of the expression of RNA and proteins. That also means there is going to be a lot of data, and a lot of analytical power will be required. So, what I am going to start with is an instance that we have set up. It is a Galaxy instance, and I will come back to what Galaxy is; I would not encourage you to go on it right now, because I am going to demonstrate on that particular instance.
But I will definitely encourage you to go to that instance later: it is z.umn.edu slash galaxyp mumbai. It is a Galaxy instance for which step-by-step directions have been provided in this documentation, so you should be able to use the instance just like I am going to use it right now. So, let me take you through this. All you need to do is first go to that website and register; all you need is a login and a password, and once you register you go to this place called Histories. [In response to an audience question] Yes, there is a tool called getorf that can do that. You can take the cDNA; you can go to Ensembl, for example, which has links to genomic DNA and cDNA. Download the cDNA onto Galaxy, use this getorf tool, and then you should be able to use the output. And if that fails, and I am just keeping this as a backup, you can also go to this site: z.umn.edu slash proteogenomics gateway, and this is the document that goes along with it: z.umn.edu slash pg nov 18. That workshop was done last month, in November, which is why it is named that; in any case, if you go to the Mumbai slides you should be able to see all of this. So, you have to go to this site and get registered, and anybody can register. This is again on a cloud instance at Indiana University. Galaxy is a web-based platform, so there are tabs available, and you can go to what is called History and import a history there. A history is basically a collection of your inputs, or of any data that you have processed, and I will talk a little more about that later. So, just remember the words 'histories' and 'workflows' at this point.
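The getorf step mentioned above, taking a cDNA sequence and extracting open reading frames, can be sketched in a few lines of Python. This is only a conceptual sketch, not the EMBOSS implementation: it scans just the three forward frames and splits translations at stop codons.

```python
# Conceptual sketch of what an ORF-finding tool like EMBOSS getorf does:
# translate a cDNA sequence in the three forward frames and report the
# peptides found between stop codons. (getorf itself also scans the
# reverse strand and has many more options.)

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(seq):
    """Translate a nucleotide sequence codon by codon ('X' if ambiguous)."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def three_frame_orfs(cdna, min_len=5):
    """Return (frame, peptide) pairs for ORFs in the three forward frames."""
    orfs = []
    for frame in range(3):
        for pep in translate(cdna[frame:]).split("*"):
            if len(pep) >= min_len:
                orfs.append((frame, pep))
    return orfs
```

For example, `three_frame_orfs("ATGGCTGCTGCTGCTTAA")` includes the frame-0 peptide `MAAAA`, which ends at the TAA stop codon.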
And so, in that you would actually find quite a few published histories, and the first one would be a mouse proteogenomics input history. Once you go there, you should be able to import the history. You are basically bringing that history into your browser, and it gets loaded onto what is called the history panel. The panel on the right here is the history panel, and in it you have these input files: there is a protein FASTA file, which we are going to use for the database search; there is a Mus musculus GTF file, which is used for estimating genome coordinates; there is another protein FASTA file; and then there are MGF files that you have generated from your mass spec data. Sorry, this one is actually a FASTQ file, not a FASTA file; the FASTA file was somewhere here. And then you also have a BAM file and a BED file, and I will talk later about why we need those. So, these are the inputs that one uses. Just to mention, the dataset used here is representative data that was published in 2014. It is a dataset about B cell development: the researchers were interested in comparing two different types of B cells during development, and they had both RNA-seq data and proteomics data for the samples. We have taken a part of this to demonstrate the use of Galaxy. So, Galaxy is the web-based interface; you can select the history. We are selecting the input files, so remember the five files that I talked about earlier. Once you import the history, you can start using it, which means your history becomes active, and then you select a workflow. So, what is a workflow? A workflow is basically a set of tools that one uses in sequence to process the data.
So, just imagine you have your input files, and one of them needs to be converted from one format to another; that would use one tool. You asked about converting cDNA into a three-frame translation: there would be one tool for that. But the process does not end there; you want to use the result in your next step. So maybe adding contaminant sequences would be the next step, and there could be a tool which uploads those contaminant sequences, and then a tool which merges everything together, and so on and so forth. This becomes a workflow, wherein you are not just taking one tool, running it, and moving to the next one; you are taking the input and running a whole workflow of multiple tools on it. So, you select a workflow from one of the tabs, you import the workflow, and you start using it, and once you run the workflow you get the output that you want. A workflow could be as small as two tools, or as large as 20 or 30 tools, depending upon how complex an analysis you would like to do. So, I will come back to the Galaxy interface. This is the Galaxy interface, and if you go to z.umn.edu slash galaxyp mumbai you should be able to come to this. A few things to know about Galaxy. On the left side we have what is called the tool pane, in the sense that there are tools that somebody has developed and implemented in Galaxy. Let us say there is a tool like getorf: it would have a place to specify an input file, so it is basically a graphical front end for a command-line tool. It shows you where to provide an input and what parameters need to be used, and once you have an input file and you use that tool, the processed output gets added to your history. So: input file, then processed data.
So, you start building a history, and that is why this is called the history pane: you start with the input and keep generating processed files. The central pane is a place where you can view your data, look at the parameters, and do other things. So, that is the Galaxy interface. In this particular demonstration, we are going to cover RNA-seq to variant FASTA database conversion: if you have RNA-seq data, how do you convert it into a protein FASTA file, then search that protein database against your mass spectrometry data, and once you have identified your peptides, process them to either visualize the spectra or identify their localization on the genome. All of this is possible because somebody has worked hard to put these tools into Galaxy and also assemble them into workflows. If you observe closely, once you go and look at the slides, you will see each of these is a workflow, and they are connected to each other. The objective is to show that if you have genomics data, you can generate a customized database to use in your proteomics experiment, or you can also use RNA-seq data to do that. And then, once you get results from that, you should be able to modify a gene model: you might find peptides identified in regions where you would not have found them earlier, as a result of some genomic rearrangement, or even just a single amino acid variant, or an insertion or deletion. So, what this first workflow, which takes in RNA-seq data and generates a protein FASTA file, does is: it generates a protein FASTA file, but it also generates genomic mapping information.
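The genomic mapping information just mentioned is what lets you place an identified peptide back on the genome. A sketch of that arithmetic, under the simplifying assumption of a single-exon coding sequence (real proteogenomic tools also walk across splice junctions using the GTF), looks like this:

```python
def peptide_to_genome(pep_start_aa, pep_len_aa, cds_start, cds_end, strand):
    """Map a peptide (1-based amino-acid position within its protein)
    to genomic coordinates, assuming a single-exon CDS. Coordinates are
    1-based and inclusive, as in GTF files."""
    nt_offset = (pep_start_aa - 1) * 3   # nucleotides before the peptide
    nt_len = pep_len_aa * 3              # nucleotides spanned by the peptide
    if strand == "+":
        start = cds_start + nt_offset
        return (start, start + nt_len - 1)
    else:
        # Minus strand: the protein N-terminus sits at the CDS end,
        # so count backwards from cds_end.
        end = cds_end - nt_offset
        return (end - nt_len + 1, end)
```

For example, the first two residues of a protein whose CDS spans 100-199 on the plus strand map to bases 100-105, while on the minus strand they map to 194-199.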
So, this is the input file set that I described earlier. It has the FASTQ file, which comes from your RNA-seq data. There is a GTF file, a gene transfer format file, which lists genes along with their genome coordinates, which chromosome they come from, and so on. This is what helps you connect your protein FASTA file, or the protein accession numbers, to genome coordinates. We also have a known protein FASTA file, generally from UniProt, and then we have the mass spectrometry files, the MGF files. This GTF file is available from Ensembl; Ensembl has a GTF file for each organism, so you should be able to get that. So, the FASTQ and MGF files are things you generate through experiment; the known protein FASTA file is publicly available from UniProt, and the GTF file is also publicly available and gets updated, I think, every three to six months. [In response to an audience question] If you are doing metatranscriptomics, microbiome analysis, a two-organism sample, or you are worried about contamination: you can still use this, but you might need to do some QC filtering. You would do some amount of mapping onto your genome to select only those sequences that are of interest to you. But if you are doing a multi-organism analysis, then obviously you want to retain all of that. Does that answer the question? Sure, okay. So, again, we start with the input files, the five files that I talked about, and then we start using the first workflow, which is the database creation workflow. This is where you go to Workflows once you are on that site, download this workflow, import the workflow, and start using it, wherein you can say Run.
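The GTF file described above is a nine-column, tab-separated format; the last column is a semicolon-separated list of key "value" attributes. A minimal parser for one record might look like this (the sample line below is an illustrative Ensembl-style mouse record):

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine tab-separated fields and
    unpack the attribute column into a dict."""
    (chrom, source, feature, start, end,
     score, strand, frame, attrs) = line.rstrip("\n").split("\t")
    attributes = {}
    for item in attrs.strip().split(";"):
        if item.strip():
            # Each attribute looks like:  gene_id "ENSMUSG..."
            key, _, value = item.strip().partition(" ")
            attributes[key] = value.strip('"')
    return {"chrom": chrom, "feature": feature,
            "start": int(start), "end": int(end),
            "strand": strand, "attributes": attributes}
```

The `gene_id` in the attribute dict is exactly what lets a workflow tie a protein accession back to a chromosome and coordinate range.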
So, once you import this workflow, it shows up in your list of workflows, and I really wish I could have shown you this on the screen. It gives you three input files to select to run your workflow. What you see here is the FASTQ file as the first input, the GTF file as the second input, and the FASTA file as the third input, and then there is a series of tools that are used to eventually convert these into a protein FASTA file. [In response to a question] Yes, we can keep this instance available for three months, but as I mentioned there is another website, the proteogenomics gateway, also linked from the slides, and that will be available for a much longer time. Both are Galaxy instances. So, once you have these inputs, you click Run. Imagine there is a Run button here; for people at the back, maybe you cannot see it. If you click Run, it starts running, and what it does is add the outputs from each of these tools; it starts building up the history. Now, when a job is queued, the dataset is grey; when the job starts running, it turns yellow; and when the job is successful, it turns green. Each of these tools will have some output generated in the workflow. It is generally bad news if a dataset turns red, which means some error was generated early in your analysis, and that is when you go back to your developer or your infrastructure specialist to ensure that nothing went wrong on the side of the network, or with the tool version, and so on. But let us imagine that it all works; we have actually tested this data set multiple times to ensure that it does.
Then, let us talk about this particular workflow in general. If you look at that workflow and expand it, it is made up of two parts. It starts with the FASTQ file that I talked about, but it also takes the GTF file and the third input, the FASTA file. It is divided into two different branches, and I will show you the details later: the first one looks at single amino acid variants as well as indel variants, while the second one looks at novel junctions and so on. So, these are just different kinds of variants you can identify; for example, single amino acid variants and indels are identified by the first branch, while the second branch identifies the rest. If you look at the details, and hopefully this is a little clearer, you have the FASTQ file, you have the genome coordinates, and you have the GTF file. The first tool takes the RNA-seq data; HISAT is the tool used to align your RNA-seq reads. So, let us say these are all your RNA-seq files: it maps them to the genome, and then you can see where your RNA-seq reads fall. Based on these alignments, you can now perform variant calling, and that is where the second tool, FreeBayes, comes in: it takes in all the aligned files and starts finding anything that differs from your reference genome. It is trying to find variants in your sequences.
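Conceptually, a variant caller piles up the aligned reads over the reference and flags positions where the evidence supports a non-reference base. The toy sketch below uses simple majority voting; FreeBayes itself is far more sophisticated (haplotype-based, Bayesian, quality-aware), so this only illustrates the idea:

```python
from collections import Counter

def call_variants(reference, reads, min_depth=3, min_frac=0.8):
    """Toy pileup variant caller. `reads` are (start_pos, sequence)
    pairs already aligned to `reference` (0-based, no indels). Report
    positions where most overlapping reads disagree with the reference."""
    pileup = {i: Counter() for i in range(len(reference))}
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1
    variants = []
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth >= min_depth:
            base, n = counts.most_common(1)[0]
            if base != reference[pos] and n / depth >= min_frac:
                variants.append((pos, reference[pos], base))
    return variants
```

With four reads all carrying an A over a reference T at position 3, this reports a single substitution, which is exactly the kind of evidence pattern the green marks in the alignment view represent.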
The green dots that you can see here are where it starts identifying those variants. For example, if you look at this region here: these are your RNA-seq reads, and you can see that the sequence in the reference genome has a G at this position, while in the RNA-seq data you see a green mark, so you are actually finding a variant. That is what the tool does: it captures this information. So, HISAT aligns the RNA-seq data, FreeBayes finds the variants in it, and then the next tool, CustomProDB, this one here, looks at the reference genome and the variants identified by FreeBayes. It translates not only the original sequence, where there is no variation, but also the variant sequence, and it records in the accession number what the variation is, along with the genomic coordinates. So you have all the information right there, and that is the part we talked about: it identifies your single amino acid variants as well as your indel variants. Now, this other part is where you identify changes in junctions and in your assembly. For that, again, you start with alignment. These are your aligned files, and then a tool called StringTie assembles your reads into transcripts; in this case these would be the exons that you are looking at. Once it identifies the exons, it assembles them into an assembled transcript, and that assembled transcript is subjected to three-frame translation to give you a FASTA sequence.
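The variant-recording step that CustomProDB performs can be illustrated, at the protein level for brevity, as follows. The real tool applies the DNA variant to the coding sequence and then translates; the accession and header formats here are made up for the example, not CustomProDB's actual output format:

```python
def apply_variant(protein, accession, pos, ref_aa, alt_aa):
    """Build a variant protein FASTA entry: substitute the amino acid
    at 1-based position `pos` and record the change (e.g. G3D) in the
    header, so a peptide hit on this entry is traceable to the variant."""
    if protein[pos - 1] != ref_aa:
        raise ValueError("reference mismatch at position %d" % pos)
    variant = protein[:pos - 1] + alt_aa + protein[pos:]
    header = f">{accession}_{ref_aa}{pos}{alt_aa} variant of {accession}"
    return header + "\n" + variant
```

Because the substitution is encoded in the accession, any peptide-spectrum match against this entry immediately tells you which single amino acid variant it supports.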
Now, one of the advantages of this workflow is that you do not have to take all of those transcripts and convert them into protein FASTA sequences. It runs GFFCompare, which compares the GTF file, the list of all your known gene coordinates, with what StringTie has found, and only what differs from the known annotation is retained and then converted into a protein FASTA file. So, this output that you see here contains only novel transcripts from your assembly.

In conclusion, today you have learned that genomic and proteomic data, as well as transcriptomic and proteomic data, can be integrated, and that making sense of these multi-omics data requires a broad skill set, a lot of experience, and many software tools. You got a glimpse of how one can process multi-omics data sets in today's lecture. In the next lecture, Dr. Jagtap will continue and talk more about bioinformatics solutions for big data analysis, in which he will show how to build a complete workflow and run it to completion. So, the next lecture will also be by Dr. Pratik Jagtap, and he will finish this whole module on bioinformatics solutions for big data analysis. Thank you.