Welcome to Galaxy Training. Today's topic is proteomics, and today's tutorial is Proteogenomics 1: database creation. Note that the training material is available in multiple languages. Now let's go to the training.

First, a little background on proteomics. Mass spectrometry is used to identify proteins in biological samples. Proteins from the sample are digested with enzymes to produce peptides, which are then separated using liquid chromatography. The peptides are ionized, and each peptide is then fragmented into randomly sized pieces. Those fragments travel through an electromagnetic field, which spreads them across a detector based on their mass-to-charge ratio. The detector records the intensity across that mass-to-charge axis in what is called a spectrum. The spectra from mass spectrometry must be compared to theoretical spectra that would be produced from a protein FASTA sequence database. The peptides are matched by comparing the intensity peaks between the theoretical and the measured spectra. Often we compare spectra against a reference protein database, but sometimes we want to include novel proteoforms that wouldn't be in a reference. In proteogenomics we use next-generation sequencing, and variant analysis of those sequences, to produce putative proteins that we can add to a customized protein database. There are numerous methods for identifying novel proteins from sequencing data. We will look for single amino acid variants, small indels, and novel isoforms.

This is the workflow that we will use in the tutorial. We have our inputs. We will need to change the chromosome names in the GTF. We will map the reads to the genome. We're going to use FreeBayes and CustomProDB for single amino acid variants and indels. Then we will use StringTie and GFFCompare to look for novel splice variants. We will need to modify the accession names coming from CustomProDB, and we'll finally build a list of reference accessions. These are the outputs we expect to generate from our workflow.

Now we begin our hands-on tutorial. First we need to upload our data. We start by creating a new history. Since we're working on the Galaxy Europe server, we can get our input data from a shared data library. We look for the GTN material, the proteomics section, and our tutorial, and then we select the FASTQ read sequences, a GTF gene feature file, and the FASTA database. Those are long and messy names, so as the tutorial suggests, we will edit those datasets and shorten the names: we just highlight the URL prefix and delete it. We also need to set the reference genome that these apply to; we're using the mouse reference mm10 from the University of California, Santa Cruz (UCSC). UCSC references prefix all of their chromosome names with the letters "chr".

Our GTF file came from Ensembl, and Ensembl does not use a "chr" prefix, so we need to change that so the GTF aligns with our UCSC genome reference. We're going to use a column regex find-and-replace tool, and we will replace the chromosome names in this GTF file by prefixing each one of them with the "chr" characters. So we select the right GTF file, and we're changing column one. Now we find all of the chromosomes whose names consist of just numbers in our GTF file. This pattern selects those, and we prefix them with "chr"; the \1 backreference in the replacement inserts whatever the pattern matched above.
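For anyone following along outside Galaxy, here is a minimal Python sketch of what this first replacement does; the Galaxy tool handles it for us, and the GTF record shown is a made-up example, not tutorial data.

```python
import re

def add_chr_prefix(gtf_line: str) -> str:
    """Prefix a purely numeric chromosome name in GTF column 1 with 'chr'."""
    fields = gtf_line.rstrip("\n").split("\t")
    # Pattern: the whole field is digits; \1 re-inserts whatever was matched.
    fields[0] = re.sub(r"^(\d+)$", r"chr\1", fields[0])
    return "\t".join(fields)

# Made-up Ensembl-style record on chromosome "1"; its name becomes "chr1".
line = '1\thavana\texon\t1000\t2000\t.\t+\t.\tgene_id "example";'
print(add_chr_prefix(line).split("\t")[0])  # -> chr1
```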
We also need to do this for the X and Y chromosomes. The pattern is: start of line, a group matching X or Y, end of group, end of line. Again we replace it by putting "chr" in front of whatever the pattern found. We have one last thing to do, and that's the mitochondrial chromosome. It's called MT in Ensembl, and in UCSC it's just chrM. That's how it looked before; now we have prefixed each chromosome with "chr". Let's change the name of this dataset to something a little more memorable.

Our next step is aligning the reads in our FASTQ files to the reference genome. We're going to use the Galaxy tool HISAT2 to do that alignment. First we make sure we select the correct reference genome, the mouse genome. We select our FASTQ files and we proceed to align the reads to the reference genome. Again we give the output a name that's a little easier to recognize.

Now we're ready to begin variant analysis. The first kinds of variants we will look for are single amino acid changes and small indels. We're going to use FreeBayes to search for those. We select the alignment data that we generated with HISAT2, we choose the correct reference genome, and the rest of the arguments can remain at their defaults. Again we give this an easier-to-remember name. Our next step is to translate these identified variants into protein sequences. We're going to use CustomProDB to do the translation. We use a mouse reference, and we have the HISAT2 BAM and FreeBayes variant call format (VCF) files. We select the additional outputs, since those will be used with the Multi-omics Visualization Platform in a later tutorial.

Now let's move on to transcript assembly. We're going to use StringTie to assemble all of the reads that aligned to the genome into transcripts. We want to make use of the reference GTF file, because known transcripts will be the preferred assemblies. Again we change the name. We'll take a quick look at the output of this step: it is a GTF-formatted file. Next we compare the output from our StringTie assembly back to that reference GTF feature file. We select the StringTie GTF, and we compare it to the reference in our history, which is our chromosome-renamed GTF file from Ensembl. If we look at the output from GFFCompare, we have all of our entries from StringTie, but with something called a class code added. This tells us how each transcript compares to the reference GTF file. In our next step we convert that GTF file to BED format, but in doing so we only keep certain classes of compared transcripts; in particular we want the novel splice junctions and a few other classes that can produce novel proteins. The output is a BED file, which gives us the genomic locations of the novel splice variants that we identified.
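To make the class-code selection more concrete, here is a small Python sketch of the idea behind that GTF-to-BED filtering. The class codes shown ("j" and "u") and the file name "annotated.gtf" are only illustrative placeholders; the Galaxy tool and the class codes chosen in the tutorial are what should actually be used.

```python
import re

# Illustrative codes only: "j" (potentially novel isoform with a novel splice
# junction) and "u" (intergenic/unknown). Use the codes your analysis requires.
KEEP_CLASS_CODES = {"j", "u"}

def is_novel(gtf_line: str) -> bool:
    """True if a GFFCompare-annotated line carries one of the wanted class codes."""
    match = re.search(r'class_code "([^"]+)"', gtf_line)
    return bool(match) and match.group(1) in KEEP_CLASS_CODES

# "annotated.gtf" is a placeholder name for the GFFCompare output in the history.
with open("annotated.gtf") as gtf:
    novel_lines = [line for line in gtf if is_novel(line)]
print(f"{len(novel_lines)} annotated lines kept for the novel-proteoform BED file")
```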
Our next task is translating those locations into protein sequences. We start with our BED file and we get the bases from our mouse reference again, and we do one other thing: PeptideShaker always wants a source for the protein accessions you enter, and since these are novel we use "generic" as the source, so we add that. That gives us a BED file showing all of the novel peptide locations, and here we have the FASTA that was generated, with the "generic" source placed in front of each entry. In the last step for this part, we take that BED-format file, which gave the genomic location of all of the peptide sequences we're adding to our reference database, and we change the format into one we can use later to generate a SQLite database for the Multi-omics Visualization Platform. So there's our proBed format, and now we've created something that will work much better as a relational database table.

Next we start assembling our FASTA databases. We begin by merging together the three FASTA databases that were generated by CustomProDB. We insert each of those databases in order: we pick the one closest to the reference proteins first, and then we add our variant ones after that. That is because we want to give preference to reference proteins; we only want to identify variant peptides if they are not included in a reference protein. Now I'm highlighting one of the accessions, because it demonstrates a problem we will run into later with the names generated by CustomProDB. If we look closely, most of these have a space character after the accession name, before the pipe separator. We also note that some of them, where there are indels, have a greater-than sign, and there are commas in some of the amino acid changes. That's not going to work very well with PeptideShaker: PeptideShaker cannot handle greater-than signs, commas, or spaces in protein accession names.

So the first thing we do is convert our merged FASTA file into a tabular file. This creates a tabular file with two columns: the first column is the identifier, the second column is the sequence. Now we need to modify those accession names so that they will work with PeptideShaker. We will use the tool Column Regex Find And Replace, making the changes in the identifier column, which is column one. First we copy in the pattern that gets rid of the greater-than signs in the accession names; the replacement substitutes an underscore where the greater-than sign was. Next we modify any accession names that have a comma character in them: we put in the pattern that finds commas and replace them with a period character. Finally we need to get rid of the space at the end of the accession name. This regex pattern accomplishes that, and while we're doing the replacement, remember that PeptideShaker requires a source prefix for the accession, so we use "generic", which is what PeptideShaker requires for anything that's not from a known protein source database. Here we can see that our accessions now start with "generic" and we no longer have any space characters after the accession name. Finally we convert from the tabular file back to a FASTA file: our title (identifier) column is column one, our sequence column is column two, and we give the result a more recognizable name.
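For readers who want to see what this clean-up amounts to, here is a minimal Python sketch of the three replacements plus the "generic" prefix. The example identifier is made up in the general style described above (not taken from the tutorial data), and the exact regex strings entered in the Galaxy tool may differ slightly from these.

```python
import re

def clean_accession(identifier: str) -> str:
    """Make a CustomProDB-style FASTA identifier acceptable to PeptideShaker."""
    acc = identifier.replace(">", "_")   # greater-than signs in indel names -> underscore
    acc = acc.replace(",", ".")          # commas between amino acid changes -> periods
    acc = re.sub(r" +\|", "|", acc)      # drop the space before the pipe separator
    return "generic|" + acc              # PeptideShaker wants a source prefix

# Made-up identifier for illustration only:
print(clean_accession("ENSMUSP00000000001_45A>G,47T>C |variant protein"))
# -> generic|ENSMUSP00000000001_45A_G.47T_C|variant protein
```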
So we have put together the three FASTA files from CustomProDB, but we also need to change the accession names in the genome mapping file and in the variant annotation file. We'll start by modifying the accession names in the genome mapping file. CustomProDB outputs that mapping as a SQLite database, so the first step is to convert that SQLite database into a tabular file. To do that we just use this query, which pulls out the information in the right format. Then we rename the result so it's easier to find. Next we have to modify all of those accession names, very similarly to what we did with the FASTA file. As a shortcut, I go back to where I did the conversion on the FASTA file and rerun that same Column Regex Find And Replace with one change: of course we run it on the correct file, but we also no longer need the source prefix, so we get rid of the "generic" term. The tutorial suggests we rename the result "savindel".

We also have a genome mapping file that came from our StringTie/GFFCompare path, so we concatenate these two genome mapping files together: the "savindel" genome mapping we just made, plus the BED-to-protein map that we generated from our StringTie data. There's our merged result. Finally, the Multi-omics Visualization Platform (MVP) needs this data in a SQLite database so it can look up entries quickly. We use a tool called Query Tabular, which takes tabular files as input and can produce a SQLite database from them. We input our merged genome mapping tabular file. MVP will look for a specific table name in that database, so we use that as the table name, featurecdsmap, and it expects particular column names in that table, so we copy those and paste them into Query Tabular. Also, for MVP to look things up quickly, we should put an index on the table so it can quickly find a record from an accession name. Finally, we want to save the SQLite database itself, not just use it to create another tabular file. Now we've created the genome mapping SQLite database; let's give it a name we can find in a later tutorial.

Now we follow a similar procedure with the variant annotation database. Again, CustomProDB outputs this as a SQLite database, so we use SQLite-to-Tabular to convert it to tabular output, with this query to write things out in the format we need; again we don't want any column headers on the tabular output. Our next step is the same Column Regex Find And Replace to change all of the names; since we've already set that up for the genome mapping file, we can simply rerun that tool on our variant annotation tabular file. Our final step is to convert this back to a SQLite database, and we use the Query Tabular tool for that. We insert a database table and give it the correct name that MVP will eventually look for, so variant annotation is our table name. These are the columns MVP expects, and we say to load only those columns in case there are any extra fields. Finally we need an index again, so MVP can locate the correct information quickly, and we save this as a SQLite database. Once again we give this a memorable name for the next tutorial.
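Outside Galaxy, both of these Query Tabular steps boil down to loading a tab-separated file into a named SQLite table and indexing the accession column. The sketch below uses the featurecdsmap table name from the narration, but the column names and file names are invented placeholders; in Query Tabular you paste in the exact column list the tutorial provides.

```python
import csv
import sqlite3

# Placeholder column names for illustration only; use the tutorial's column list.
COLUMNS = ["name", "chrom", "cds_start", "cds_end"]

conn = sqlite3.connect("genomic_mapping.sqlite")           # placeholder output name
conn.execute(f"CREATE TABLE featurecdsmap ({', '.join(c + ' TEXT' for c in COLUMNS)})")

with open("genomic_mapping.tsv") as tsv:                   # the merged mapping file
    rows = [row for row in csv.reader(tsv, delimiter="\t")]
conn.executemany(
    f"INSERT INTO featurecdsmap VALUES ({', '.join('?' for _ in COLUMNS)})", rows
)

# Index the accession column so lookups by protein name are fast.
conn.execute("CREATE INDEX idx_featurecdsmap_name ON featurecdsmap (name)")
conn.commit()
conn.close()
```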
Now we're ready to produce the final output protein FASTA database. We merge together all of the databases that we made in each individual part. We start with the UniProt reference database; we put it first because we give it priority. Next we add our CustomProDB database, and finally our StringTie FASTA database, and again we give the output a name that's easy to find in a future tutorial.

One other thing we'll use in later tutorials is a way of eliminating any matches to known proteins. We do that by generating a list of reference protein accessions. We start by using FASTA-to-Tabular on the RPKM database from CustomProDB. Then we use a tool called Filter Tabular on that tabular file. As our first filter we select only the first column, which is the identifier column, not the second, which is the sequence column. Then we use a regular expression filter: we look for this pattern in the identifier column and keep only what was matched within the parentheses of the pattern. If we look at what the identifiers looked like before and after, we've pulled out just the accession part. We need to do a similar thing with the UniProt FASTA. First we convert it to a tabular file, then we use Filter Tabular again to select just the accession portion of the identifier; I point Filter Tabular at that output and change the expression we're looking for. Now we've pulled out just the protein accessions. Finally we merge those two sets of reference protein names: we use Concatenate multiple datasets again, select our reference output and the other filtered output, and concatenate the two lists together. We give the result a memorable name for the next tutorials.
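As a final illustration, the Filter Tabular step is essentially a regex capture over the identifier column. The sketch below assumes the standard UniProt "db|accession|entry name" header layout and uses made-up identifiers; the actual pattern entered in the tool may differ, and this is shown only to clarify what "keep what was matched within the parentheses" means.

```python
import re

# Made-up identifiers in UniProt's "db|accession|entry name" layout (not real data).
identifiers = [
    "sp|Q00000|EXAMPLE1_MOUSE",
    "tr|X00000|X00000_MOUSE",
]

# Keep only what the parenthesised group matches: the accession between the pipes.
pattern = re.compile(r"^(?:sp|tr)\|([^|]+)\|")
accessions = [m.group(1) for i in identifiers if (m := pattern.match(i))]
print(accessions)  # ['Q00000', 'X00000']
```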