Hello and welcome. My name is Mallory Freberg, and today I'm going to walk us through a Galaxy tutorial for reconstructing a transcriptome without a reference using RNA-seq data. We are going to analyze RNA-seq data by aligning the reads to a reference genome and then reconstructing the transcripts without a reference transcriptome, which is called de novo reconstruction. We'll then quantify transcript abundance using our assembled transcripts in two different cell types, and finally compare gene expression between those two cell types. So we're going to walk through the tutorial here, and we'll start with the first step, which is uploading data to our Galaxy instance. Today I'm going to be using the public Galaxy Europe instance. You can use any publicly available Galaxy, or even an instance that you've installed locally. I've already logged in here, but if you aren't logged into your Galaxy yet, you'll want to click the login button in the top toolbar and log into your account. So we're going to start by loading the data into our Galaxy instance that we'll use to reconstruct the transcripts and do the differential gene expression testing. The data come from a publication that sequenced two different cell types in mice. To make the tutorial run more quickly, we're going to use a subset of these data, which has been made available on Zenodo; there's a link here to those datasets. In the tutorial, the URLs to each of these files have also already been written out, so we're going to click the copy button here and copy all of these URLs. You'll notice there are nine files. The first four represent the first cell type, G1E; the next four represent the second cell type, megakaryocytes; and then there is a reference transcript annotation file, which we'll use in one step later in the tutorial.
You'll also notice that each of our two cell types has two replicates, indicated by rep1 and rep2, as well as a forward and a reverse file. These are paired-end sequencing files, so there's a forward read and a reverse read. So I've now copied these URLs, and back in our public Galaxy instance, we're going to upload these files. I'm starting with a new history, and actually the very first thing we'll want to do is name it. I'm just going to call this "de novo"; you can name your history whatever you'd like. Then in the left-hand toolbar, I'm going to click the upload data button, and in the pop-up that comes up, I'll click "Paste/Fetch data". Here I'm going to paste those URLs that I copied from the tutorial. I'm actually going to delete this last one; we'll upload the reference transcripts file separately. So what we have left are the eight FASTQ files of sequencing reads. I'm going to set the type to fastqsanger for these eight files, set the reference genome to the mouse genome version mm10, and then click start to begin the upload. Then I'll reset and do this step again, this time keeping only that ninth reference transcript file, set the file type to GTF, set the reference to mm10, and click start to import this file. And then I'll close this box. You'll see on the right-hand side in our history that these data files are now being uploaded; they're indicated by the yellow color. While we wait for those to finish uploading, we'll go back to the tutorial. Now, before we start analyzing these reads, we're going to do a quality control check, and we're going to do this by running a tool called FastQC. We'll search for this tool in the tool panel by typing FastQC, and we'll see it comes up here as "FastQC Read Quality reports".
What this tool does is give us a general assessment of the quality of the reads we're going to be using. We could run this tool one by one on all eight FASTQ files, but if we click the "multiple datasets" button here, we can actually select all eight files at once. So these are the eight read files, and we're going to keep all other parameters at their default values and click execute to launch these eight jobs. In the meantime, you'll see that all of the read import jobs completed; they're green. I'll just mention briefly that, as I said, these are subsets of the full data files that were originally published. They've been subset down to one of the smaller mouse chromosomes, chromosome 19, as well as some regions of interest, so that we'll have some interesting results later. If you want, you can view the data by clicking the eye icon here, and you can see a snippet of what the reads look like. While we wait for FastQC to run, we're going to jump ahead and start the second step so that it can run in the background. This is a step to trim low-quality bases from the ends of the reads. There are a couple of different tools that can do this, but we're going to use the tool Trimmomatic. So again, we'll go to the search bar and start typing Trimmomatic, we'll see it pop up, and we'll click on it. Now we're going to set some parameters. In the tutorial, if there are parameters to set for a particular tool, they'll be indicated here. For this tool, we're going to make sure it's set to run on paired-end sequencing, which is the type of sequencing we have in this example. This will ask us to select the forward (R1) and reverse (R2) reads. We'll also make sure that the "Perform initial ILLUMINACLIP step" option is set to no. So again, back in the tool, we'll select paired-end.
And we will pick the two paired-end read files that we want to analyze together. We're going to first grab the G1E cell type replicate 1 forward reads, which is the first item in our history, and the G1E replicate 1 reverse reads, which is the second item. We see here that the ILLUMINACLIP step is already set to no, so we don't have to change that, and we'll click execute for the job to run in the background. Now we want to repeat this tool for the other three pairs of reads. So we'll click the Trimmomatic tool on the left again, set it to paired-end, grab the G1E replicate 2 forward reads and the G1E replicate 2 reverse reads, and click execute. Then we'll do this for the third pair of files, the megakaryocyte replicate 1 forward and reverse reads, and one final time for the megakaryocyte replicate 2 forward and reverse reads. While those jobs are executing, we'll go back and look at our FastQC results in the history. We'll go down to the first FastQC job, and you'll see that the tool outputs two results, a web page and raw data. We're just going to view the web page results first. I'm also going to hide the tool panel so I can have a wider view of my screen. The first output from this tool is a table of some basic statistics about our file, including the file name. Importantly, you'll see here that the sequence length is 99; pre-trimming, the length of each read is 99 bases. The other output we'll look at is the per-base sequence quality. This is a graph with the position in the read on the x-axis, from 1 to 99, and a quality score on the y-axis, with high quality highlighted in green, okay quality in gold, and poor quality in red.
The box plot here shows the range of quality scores per base over all of the reads in the file. One thing you'll note is that as we go towards the 3' ends of the reads, the quality tends to drop off into the gold or even red areas. These are the bases that we want to trim off using the Trimmomatic tool. We can look at another one; this is the G1E replicate 1 reverse reads, and we'll see the same pattern, with lower-quality bases at the 3' ends of the reads. Now that we've run Trimmomatic, you can see there are a lot of items in our history, and I'm going to do a bit of cleaning up so we're less confused about which items we need to bring to the next step. We'll remove every history item labeled as unpaired reads, because we're only going to focus on reads that remain paired for the following mapping steps. So I'm going to delete the unpaired outputs; you can delete them by clicking the X in the corner. If you make a mistake here, it's actually okay; you can always go back and retrieve deleted items from your history. So now we're left with the eight paired output files from our Trimmomatic step. Just to check that we actually have trimmed off low-quality bases, I'm going to run the FastQC tool again. This time, for the sake of brevity, I'm just going to check one of the files, so I'll scroll down and pick the first Trimmomatic output, keep all the parameters at their default values, and click execute. This should run fairly quickly, and then we'll view the results. In the table, you'll now see that the sequence length ranges from 1 to 99, so we have in fact done some trimming. And the per-base sequence qualities are now almost entirely in the green, high-quality region. So we have successfully trimmed low-quality bases from the ends of our reads.
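As a brief aside on what those quality scores actually are: the fastqsanger type we chose earlier stores one quality character per base, using the Phred+33 (Sanger) encoding. Here is a minimal sketch of how those characters decode; the example quality string is made up:

```python
def phred33_scores(quality_string):
    """Decode a Sanger/Phred+33 FASTQ quality line: each character's
    ASCII code minus 33 is that base's Phred quality score."""
    return [ord(ch) - 33 for ch in quality_string]

def error_probability(q):
    """A Phred score Q corresponds to a base-call error probability
    of 10^(-Q/10), so Q40 means a 1-in-10,000 chance of error."""
    return 10 ** (-q / 10)

# 'I' (ASCII 73) decodes to Q40; '#' (ASCII 35) decodes to Q2,
# the kind of low-quality 3' base that Trimmomatic removes.
scores = phred33_scores("IIII##")
```

This is why Trimmomatic can trim on a numeric quality threshold even though the FASTQ file itself only stores characters.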
So if we go back to the tutorial: we've now done some quality assessment, trimmed the reads, checked with FastQC that we have trimmed off the low-quality bases, and run Trimmomatic on all of the read pairs. The next step is to take these trimmed reads and align them to a reference genome. We do this because we need the genomic coordinates of the aligned reads so that we can then reconstruct the transcripts. One important thing to note is that because we are aligning RNA reads to a genome, we have to use an aligner capable of spliced alignment, because we'll be mapping across introns that have been spliced out of the transcripts. For this, we're going to be using the tool HISAT2. There's also a box here describing how to set some parameters based on the library type. If you remember, we have paired-end sequencing here, and the authors performed the type of experiment shown in the first row, so we're going to set the HISAT2 strandedness parameter to forward-reverse, as indicated here. Some other experimental types are also listed, so you can use the appropriate parameters for the type of data you're aligning. First, we'll find the HISAT2 tool in our tool panel; we want HISAT2 here. We have quite a few parameters to set, so we'll go through them one by one. First, we want to use a built-in reference genome. Again, these are reads from mouse cells, so we'll use the reference genome mm10. The built-in genome option has already been selected, and we'll search here for mm10. Now, if the reference genome you need to align to is not available in the Galaxy instance you're using, you can actually import it as its own history item and align to it by choosing the "use a genome from history" option here. If you remember, we're also using a paired-end sequencing library, so we'll change this to paired-end.
And we want to pick the read pairs that are the result of the Trimmomatic step. So here we want to make sure we're choosing not our originally loaded FASTQ reads, but the ones that resulted from the trimming step. We'll start with the G1E cell type replicate 1, choosing the forward reads as file number one and the G1E replicate 1 reverse reads as file number two. We also need to specify the strand information, and as I noted from that table, we are using the FR setting, so we'll choose that here. We also have some advanced options that we need to set, specifically to make sure we're capturing reads that span splice sites. So if we go down to the advanced options, we'll click to expand them, and then we want to set some parameters for the spliced alignment, so we'll choose "specify spliced alignment options". We'll leave the first item at its default value, but we're going to change the penalty for non-canonical splice sites to 3. We're also going to use the penalty function f(x) = B and set it to 0, and do the same here. The reason we're doing this is that we don't want to penalize mapping over splice sites, whether canonical or non-canonical. One of the main goals of doing this de novo transcriptome reconstruction is to identify either novel transcripts or novel isoforms of known transcripts, so we really want to find as much as possible, without penalizing based on whether a particular transcript or isoform has been identified before. So we're lowering all of these penalties. The last item we'll set is the transcriptome assembly reporting.
So we want to set this to report alignments tailored for transcript assemblers, specifically StringTie, which is the assembler we'll use in a later step. Then we'll just double-check all of the options: built-in genome mm10, paired-end, FR, and the advanced items we changed. Yep. So now we'll click execute. Again, this is going to align the forward and reverse reads from the G1E cell type, replicate 1, to the mouse genome, without penalizing mapping across known or novel splice sites. Now we want to do this alignment step for the other three sets of reads, but those were a lot of parameters to set, and I want to make sure I don't mistakenly set or forget some of them. So I'm going to expand the HISAT2 job that's currently running and click the "run this job again" button, the two arrows, which brings up the form from the job I already ran, with all of the parameters set the way I chose. The only thing I'm going to change now are the input files: I'll pick the G1E replicate 2 forward reads and the G1E replicate 2 reverse reads, the second replicate, and keep all the other options the same. You'll even notice that under the advanced options, it has all of the splicing items I changed. I'll click execute, and then I'll do this two more times for the remaining megakaryocyte reads, replicate 1 and replicate 2. Normally, if I had run these alignment jobs on the full data set from the publication, they would have taken a very long time, but luckily we've downsampled, so I'll just let these run in the background; they shouldn't take more than a few minutes to finish. So we'll move on to the next part of the tutorial while those alignments are happening.
And that is the actual reconstruction of transcripts from these aligned reads. Again, we're doing this de novo, which means we're constructing the transcripts without a known reference of what transcripts are expected in these particular cell types or in this organism (mouse, in our case). The benefit of this is that we can detect novel isoforms or novel transcripts that may be present in our cell type but have not yet been detected in the common cell types that make up the reference transcriptomes. We're going to be using a transcript assembler called StringTie, so I'm going to go back to my Galaxy and search for StringTie. Here we have the StringTie transcript assembly and quantification tool. We're going to run this on each of the HISAT2 alignment outputs separately, so we'll run StringTie four times, once for each cell type and replicate. Again, we'll need to set the strand information to forward-reverse (FR), so we'll set that now. I could run this tool individually on each of the HISAT2 outputs, but I'm actually going to use a Galaxy feature called batch mode: I'll click the "multiple datasets" button, which lets me select all four files I want to launch jobs on, and this will launch four StringTie jobs, each on one of the input files. We'll keep all other options at their defaults and click execute. You'll see that our HISAT2 jobs completed. We can expand them and see some information, a little snippet or header of the output we would see if we had run the tool on the command line. We can see that the output has a non-zero size, which is good; I always check this, because if it's zero, you know something has gone wrong. While the StringTie jobs are running, we'll skip again to the next step, which is to merge the assemblies across those four different StringTie runs.
What we're going to do is take the individual outputs from running StringTie on the four alignments and combine them to produce one reference that includes all isoforms and all transcripts across the four experiments. For this we'll use StringTie merge, a function of the StringTie program. The input is essentially the four StringTie outputs, and in this case we're also going to provide the reference annotation of known mouse transcripts, which is one of the files we imported into our history at the very beginning. This step isn't required for the StringTie merge function, but because we have this information, we can annotate all of the isoforms and transcripts identified in our four experiments as being known or novel. That's one of the benefits of having an annotation, but again, the reference transcripts are not required for completing this tutorial. So we'll find StringTie merge in our tool panel, and we'll select the four StringTie outputs. Importantly, do not select the reference transcripts that we imported here, just the four StringTie outputs. Then, under "reference annotation to include in merging", we do want to select the reference annotation, which is our imported history item. We'll keep all other parameters at their defaults and execute the tool. You'll see that our StringTie outputs have completed. You can view these files if you want to get an idea of the transcripts that have been identified. This is a standard GTF file, so there's a chromosomal location (the chromosome plus start and end positions), and the features are annotated as either full transcripts or individual exons. They've also been given IDs; these are the unique StringTie IDs for the transcripts, along with an estimate of the coverage.
That is, how many reads aligned to that part of the reference genome; a rough quantification is done here, although we won't be using it moving forward. We can also now see that our StringTie merge tool has completed, and we can view its output to get a sense of the different transcripts that have been identified, again, across the four alignment outputs. So we'll then run the next step, GFFCompare, which is going to annotate our identified transcripts using the reference transcript file, again to label known and novel transcript isoforms. We'll search our tool panel for GFFCompare and find it here. We're going to select the output of the StringTie merge step, which is just the first file here. We're going to use a reference annotation from our history, and that's the GTF file, so under "reference annotation" we'll choose yes, from my history, and make sure we select that imported GTF file. Then under "use sequence data" we'll choose yes, from locally cached, and we'll use mm10, which has already been selected for us here. We'll just double-check these GFFCompare inputs: the output of StringTie merge, the reference annotation GTF, and sequence data set to yes. And we will click execute. You'll see that this one tool actually produces six different output files, and we'll let it run. There's a bit more information here about the output that GFFCompare produces, which we won't go into in too much detail, including the types of transcript classifications it makes. You can see, for example, a transcript that is annotated in the reference, or a novel isoform of a known reference transcript, and you can read more about these. The output of this step (and this is actually a good place to take a break while the tool runs) is going to be our list of transcripts identified in our four experiments.
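If you'd rather inspect these GTF outputs programmatically than through the eye icon, the format is nine tab-separated columns, with the IDs living in the free-form attributes column. Here is a small sketch of pulling out the fields just described; the example line is invented, but mimics the style of StringTie's output:

```python
def parse_gtf_line(line):
    """Split one GTF record into its nine tab-separated fields and
    parse the `key "value";` attribute column into a dict."""
    chrom, source, feature, start, end, score, strand, frame, attrs = \
        line.rstrip("\n").split("\t")
    attributes = {}
    for item in attrs.strip().split(";"):
        item = item.strip()
        if item:
            key, _, value = item.partition(" ")
            attributes[key] = value.strip('"')
    return {
        "chrom": chrom,
        "feature": feature,      # "transcript" or "exon"
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }

# A made-up record in the style of StringTie's GTF output.
example = ('chr19\tStringTie\ttranscript\t3638391\t3652290\t1000\t+\t.\t'
           'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "7.5";')
record = parse_gtf_line(example)
```

The same parsing works for the reference annotation and for the GFFCompare output, which differ mainly in which attributes each record carries.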
So: two cell types, two replicates per cell type; the reads have been aligned to the reference genome, and the transcripts have been assembled de novo, without a reference. We've merged all the transcripts from the four different experiments to get a single set of identified transcripts. And finally, we've done an optional step of comparing these to a known set of transcripts to get further annotations, although this step is optional if a reference transcriptome is not available for the particular organism you're using. At this point, we're going to move on to the last steps of the tutorial, which are to quantify these transcripts across the four different experiments and then, finally, to do differential expression testing between the two cell types. We'll start with the quantification step. There are a number of tools you can use to quantify transcripts using aligned reads and a set of reference transcript features, and also multiple tools for differential expression testing. The two tools we're going to use here are featureCounts, for counting the reads per transcript, and then DESeq2, for the differential expression testing. These two tools work very nicely together and are used in other tutorials as well. So we'll start with the quantification using featureCounts. In our Galaxy, we'll go over to the tool panel and search for the featureCounts tool. Meanwhile, you'll notice that our GFFCompare tool has finished running, and you can feel free to look at the different output files. The one we'll focus on for the remainder of the tutorial is the annotated transcripts file, which again is our complete set of annotated transcripts and isoforms from the four experiments.
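To make concrete what featureCounts hands to DESeq2: each run yields a two-column table of transcript IDs and read counts, and the per-sample tables get combined into a single counts matrix with one column per sample. Here is a sketch of that bookkeeping with invented numbers; the sample and transcript names are hypothetical:

```python
def build_count_matrix(sample_tables):
    """Merge per-sample {transcript_id: count} dicts into one matrix:
    {transcript_id: [count_per_sample, ...]}, in a fixed sample order,
    filling 0 for transcripts missing from a sample."""
    transcripts = sorted(set().union(*(t.keys() for t in sample_tables.values())))
    samples = list(sample_tables)
    matrix = {
        tx: [sample_tables[s].get(tx, 0) for s in samples] for tx in transcripts
    }
    return samples, matrix

# Hypothetical raw counts for our four experiments
# (two replicates per cell type).
tables = {
    "G1E_rep1":  {"STRG.1.1": 120, "NM_001001": 5},
    "G1E_rep2":  {"STRG.1.1": 98,  "NM_001001": 7},
    "Mega_rep1": {"STRG.1.1": 3,   "NM_001001": 210},
    "Mega_rep2": {"STRG.1.1": 1,   "NM_001001": 188},
}
samples, matrix = build_count_matrix(tables)
```

DESeq2 then normalizes these raw counts for library size before testing, which is why we feed it raw counts rather than pre-scaled values.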
So, back to featureCounts: we're going to run featureCounts on each of the alignment outputs, and again, rather than running it separately, we can use the multiple datasets feature here and select the four alignment output files. Again, we'll set the strand information to forward. The last thing we need is to provide the annotation file that tells the tool which transcripts to quantify. We want to choose this from our history, and we want to make sure that we're running this on the GFFCompare output. We have multiple GTF files here, but importantly, we want to use the GFFCompare output. The last item we want to update is under the advanced options: instead of quantifying on the gene IDs, we actually want to quantify on the transcript IDs, so we'll change this to transcript_id here. I'll just double-check that we've set all of the parameters: alignment files, strand information, annotation file, and transcript ID. And we will click execute. You'll notice that featureCounts has been initiated four times, and each featureCounts run will produce two output files, a summary table as well as the actual counts, that is, the quantification of each transcript in each of the four experiments. Those counts are what we'll take to the next step. So now we come to the differential expression testing part. We're going to use the DESeq2 tool, which accepts featureCounts outputs as one of its input formats. DESeq2 also performs a normalization step, which is important for removing biases in the input files, such as differences in sequencing depth. So in our Galaxy, we'll go to the tool panel and find DESeq2. We're going to set a few parameters here: basically, we're going to tell it the input files for the first set of cell types, the G1E cells.
And the second set, the megakaryocytes, because we want to compare expression in the G1E replicates against the megakaryocyte replicates. The factor name here is basically the cell type, and for the first factor level we will call this G1E. Actually, before I do this, we're going to change the names of these files in our history, because as you'll see, the featureCounts output names aren't really indicative of which cell types and replicates they belong to; they just use the history numbers. We want to make sure we're selecting the correct files, so I'm going to rename the counts files. This first one was featureCounts on G1E replicate 1, and we'll click save. The next one is G1E replicate 2; save. If you want, you can also optionally delete the summary tables to clean up your history a bit. So now, back to DESeq2: cell type is our factor, and we'll first add G1E and select the featureCounts outputs for both replicates, which can be done at the same time. Then I'm going to call the megakaryocytes "Mega" for short and select the two megakaryocyte replicates. We'll also want to make sure these two parameters are set to yes: visualizing the analysis results, which will give us some graphical outputs that are nice to take a look at, and outputting the normalized counts table, which we'll use in a later step. So we now have all the parameters set, and we'll click execute. Now that our differential expression testing is underway, we're going to take a look at the output in a couple of different ways. First, if you remember, we chose to output visualizations of the results, so we're going to look at the DESeq2 plots item in our history. This produces a couple of useful visualizations of our results. The first is a principal component analysis of the different replicates.
We can see that the first component, on the x-axis, separates nicely between the G1E cell types on the right and the megakaryocyte cell types on the left, as expected. There's also a clustering diagram here showing that the G1E replicates cluster most closely with each other in terms of transcript expression, and similarly the megakaryocyte replicates also cluster together, which is great. There are a few other plot types: a dispersion estimates plot, and a histogram of p-values. Again, there are quite a few significantly differentially expressed genes because we specifically downsampled the input files from the original publication for regions of interest. There's also an MA plot of the fold change against the normalized counts. Those are all useful to look at. Often we'll also want to subset all of the transcripts we've compared down to the ones that are significantly differentially expressed. For this we'll use the tabular output file from DESeq2, which has seven columns, indicated here. In particular, we'll look at the seventh column, which is an adjusted p-value for the difference in expression between the G1E cell type and the megakaryocytes. That file is here, and we can take a quick look at what it looks like. To filter for significantly differentially expressed transcripts, we're going to use a filtering tool. In the tool panel on the left, under filter and sort, we'll choose the "Filter data on any column using simple expressions" tool. We'll select the DESeq2 result file, and we want the adjusted p-value, which if you remember is in column 7, to be less than 0.05 (you could use whatever p-value threshold you'd like here). And we will click execute. While that's running: that was the first step here, filtering for significantly differentially expressed transcripts.
As a second step, we'll also want to separate the differentially expressed transcripts into those that are upregulated in the G1E cell type and those that are downregulated relative to the megakaryocytes, which will be another run of the filtering tool. But first, I'm going to rename this file to "differentially expressed transcripts, adjusted p-value < 0.05". Then I'll rerun the filtering step, but on my differentially expressed transcripts, keeping those where the transcripts were upregulated in the G1E cell type. This uses the third column; if you remember, the third column is the log2 fold change, and to be upregulated in G1E, we want a positive log2 fold change in G1E compared to the megakaryocytes. So we'll run that. Then we'll rerun this for transcripts that are downregulated in G1E, which is where the log2 fold change is negative, that is, less than 0. Now, if we take a look at all of the transcripts, we can see there were just shy of 800 transcripts for which we actually tested differential expression. Of those roughly 800, when we run our filtering, we can see that 115 are differentially expressed with an adjusted p-value less than 0.05, and you can see those listed here. It looks like the rows were actually sorted by adjusted p-value, from most significant to least significant, so we have the top 115 here. In the first column, under gene ID, you'll see different types of accessions. The ones labeled with NM (or sometimes you'll see NR) are previously known, annotated transcripts that we pulled in from our reference GTF file. The ones labeled with MSTRG are results of the StringTie merge that are novel, that is, previously absent from our reference transcript annotation file. You can also see the log2 fold changes here.
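The two filter runs just described boil down to simple row filters on the DESeq2 table: keep rows where column 7 (the adjusted p-value) is below 0.05, then split those by the sign of column 3 (the log2 fold change). Here is a sketch with a few invented result rows:

```python
def split_by_direction(results, alpha=0.05):
    """Given (transcript_id, log2_fold_change, adjusted_p) tuples, keep
    the significant rows and split them by direction of change in G1E."""
    significant = [r for r in results if r[2] < alpha]
    up = [r for r in significant if r[1] > 0]    # positive log2FC: up in G1E
    down = [r for r in significant if r[1] < 0]  # negative log2FC: down in G1E
    return up, down

# Made-up rows: (transcript_id, log2FC, adjusted p-value).
rows = [
    ("NM_001001", -4.2, 1e-12),   # significant, down in G1E
    ("STRG.1.1",   3.1, 2e-8),    # significant, up in G1E
    ("NR_002002",  0.4, 0.62),    # not significant
]
up, down = split_by_direction(rows)
```

This mirrors the Galaxy filter expressions c7 < 0.05, then c3 > 0 or c3 < 0 on the filtered file.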
Negative values mean the transcript was downregulated in G1E, as opposed to positive fold changes, which indicate upregulation in G1E. If we take a look at our filtered outputs, I can see the first one; again, we'll rename this so it's more clear. These are differentially expressed transcripts that were upregulated in G1E, and it looks like there were 57 of these. The second one is differentially expressed transcripts that were downregulated in G1E, and it looks like there were 58 of those. So roughly 50% up- and 50% downregulated in G1E. These, again, are the plots output from DESeq2 that you can take a look at. Finally, this leads us to the very last, optional step of this tutorial, which is to do some visualization of basically everything we've done so far. This means looking in a genome browser; you can use the built-in Galaxy browser called Trackster, or you can do these steps and then import the result files into a genome browser such as the UCSC Genome Browser or IGV, which you can also run locally. In these steps, we produce coverage files in the bigWig format, which show the read coverage across the regions we've aligned to. You can also import our annotation files, the GTF files of the assembled and merged transcripts from our experiments, to show those transcript models in the browser. I'm not going to do this step live for you now; it does take some time to load the tracks. But you can see a screenshot here of what you might expect to see. This is a region on chromosome 11, using the Galaxy Trackster browser. The first track we've loaded here is the output of the GFFCompare step, so these are the assembled and annotated transcripts from our experiments, alongside the reference track here, so you can see what was actually in our reference file. This first transcript was a reference transcript.
But this second transcript here was actually unannotated in our reference file, so this is a novel transcript. The remaining sets of tracks are the reads and assembled transcripts from the G1E cell type at the top and the megakaryocytes at the bottom. These tracks are also color-coded: in blue are reads or transcripts from the plus strand of the genome, and in red are reads and transcripts from the minus strand. In this particular example, you can clearly see an annotated, known transcript that is expressed in the G1E cell type but almost completely lacking from the megakaryocyte cell type, as well as this transcript in the middle, which has multiple isoforms detected in our experiments; it was not annotated in our reference transcriptome, but has quite a large number of reads aligned to it from the G1E cell type. So you can use this visualization to actually see the expression and the corresponding transcripts from your experiments. And with that, that concludes this tutorial on de novo transcriptome assembly from RNA-seq data. You've seen that we took our initial raw reads, which for this tutorial we downsampled from a published data set, and did some quality control and trimming of low-quality bases, followed by aligning or mapping these reads to a reference genome. From those alignments, we did transcript reconstruction de novo, that is, without a set of reference transcripts, and merged the results into a single set of transcripts. We then did an optional annotation step against a known set of transcripts, followed by quantification of these transcripts and differential expression testing to look at the differences in their expression between the G1E cells and the megakaryocytes.
And at the different steps, we can also do some visualization, for example to look at the results of DESeq2, or to look at our reads and transcripts in a genome browser. So I hope you found this tutorial useful. If you have any feedback, perhaps to help us improve the tutorial in the future or to correct any mistakes, we would very much appreciate it. I hope you enjoyed the tutorial, and I hope you have a wonderful rest of your day.