Hi everyone. Today we will be discussing a bit about transcriptomics and how to do RNA-seq data analysis using Galaxy. This is going to be a very short introduction to transcriptomics, so that we can cover a few of the basic concepts before we move on to the hands-on parts of the day. Before going through the details here, which you can find in the Galaxy training page, it would be nice also to have an introduction to how Galaxy analysis works, and to overall sequence analysis; there are very good slides about quality control and mapping that you can find there. First and foremost, a key question to address is: what actually is RNA sequencing? For this, we'll first do a quick introduction to what RNA is. This might be very familiar to a lot of you, so I'll just give a very brief context here. If you look at DNA at this level, what you actually have are a lot of different parts that comprise a gene. You might have enhancers and promoters, you have the main part of the gene with the open reading frame, which actually leads to the protein, and you have additional elements on the right and left sides of the gene. Transcription transforms your DNA into the pre-mRNA, which comprises only the exon parts and the intron parts of the gene; the rest of the area around the gene is cut off, if you like, and the main part is left here, the red parts. Post-transcriptional modification of the pre-mRNA then leads to the mature mRNA, which contains only the exon parts; the intron parts, the grey parts of the gene, are cut off, and only the protein-coding region of the mRNA is kept. This part is what can eventually be translated into a protein, which carries out the actual function. In any case, RNA is, in a nutshell, the transcribed form of the DNA, and it is what is used to produce the proteins that will carry out the activities within your cells later on.
When we talk about RNA sequencing, it's basically the part where we take the mRNA and try to quantify it. What it achieves is RNA quantification at single-base resolution. So starting from the whole DNA, as we said earlier, we have the pre-mRNA, and the mRNA is what is retrieved during the initial step of RNA sequencing, the library prep. This is subsequently fragmented into RNA fragments, reverse transcription takes place, and the cDNA that is produced is usually sequenced on high-throughput sequencers; these reads are what you actually get at the end. So RNA sequencing is a cost-efficient way to analyze the whole transcriptome of a particular cell or a particular sample in a high-throughput manner. If you look at the process in a bit more detail, asking where your data is actually coming from: you take your cells, you extract the RNA, and depending on whether we're talking about mRNA or small RNA, you have different processes. In all those cases, what you eventually end up with is a library containing the fragments of the RNA listed here, and these are going to be sequenced; the sequenced reads are what you get as the output of your experiment. So, moving away from the biological part and going directly to RNA sequencing: what is the main principle of RNA sequencing? This is one of my favorite comic strips, if you like. You have the scientist here, who wanted a transcript of everything, all the different cells in all the different samples. So what actually happens is you take the transcriptome and you basically shred it. Your mRNA is fragmented into millions, literally, or even hundreds of millions or billions of small pieces. And what you get from the machine, from the sequencing, is this whole mess of colored strips.
The bioinformatics side of RNA sequencing, the computational approach here, is to take all those shredded pieces of paper and try to reconstruct the original piece of paper. And as you can imagine, because it's definitely not an easy process, there might be a few mismatches here and there, which means that what you get at the end might have some inconsistencies, like these yellow and blue parts in the red area. So part of the computational process is also to ensure that such errors are identified, or at least kept in mind. So what are the actual challenges of RNA sequencing? There are three main points. The first is that when you do the sequencing, your sample might be completely different, or have some very notable differences, from the reference genome that we will be using to do the mapping and to quantify those RNA sequences. The second is noise. In other words, you might not have a clean extraction of your RNA; you might have additional fragments of RNA that are present during this process and are consequently sequenced. And finally, this is all a bit of chemistry, which means that there might be some sequencing biases: PCR over-amplification, errors, or anything else that comes into play during the preparation step. So all of those are challenges that one needs to be aware of before doing the actual RNA-seq analysis. But beyond the challenges, there are also a lot of benefits. The main one is the one I mentioned earlier: it's cost-efficient and high-throughput, and it allows us to get a lot of information in a relatively short time and at a low enough cost.
It allows us to have a very good understanding of the quantities of RNA that exist in a particular sample, to identify splicing points, novel transcripts, and gene fusions, and, all in all, to have a better understanding of what is happening at the molecular level. If you look at the actual questions that are being addressed using RNA-seq, you have two main applications, if you like. The first addresses the question: what are the actual RNA molecules in my particular sample? This is the transcript discovery part. In this case, the primary goal of RNA-seq is to identify novel isoforms, alternative splicing points, fusion genes, potentially single-nucleotide variations, and so forth. So the main focus here is to identify and annotate the RNA molecules that you find. The second question is: what is the concentration of the different RNA molecules in my sample? Here the point is to quantify the RNA, and we aim either for absolute expression, where we look within a particular sample and want to understand the differences in expression between different genes, for example, or, if we're looking between different samples or different groups of samples, for the differential expression of genes. So these are the two main applications of RNA-seq. What we are going to be doing is mostly the RNA quantification part, but at some points I'll be highlighting how transcript discovery can also be applied in the same context. So, how do we analyze RNA-seq data when aiming for RNA quantification? Roughly, the process is as follows. What you get from the sequencing is basically a lot of fragments of the different molecules. After you sequence those, you try to map them onto a particular genome, and you have those black lines here corresponding to the reads.
And you might identify cases here where one part of a particular read is mapped to a particular gene, or a particular exon of a gene, and another part of the same read is mapped to a different part. I will be covering this in a moment. But eventually what you do is map your reads onto a reference, a genome in this instance, and then you start counting: you count how many reads of a particular gene you find. So for example, for the purple gene here, you find basically one read. In the yellow one you find one, two, three, or two if you only count these straight lines. And in the blue one here you have one, two, three. So counting how many reads you have on a particular gene is one way of quantifying. There are different ways, and hopefully I'll address a few of them here. So the data processing pipeline is basically these five steps. You have a set of reads, either single-end, so only the forward part, or forward and reverse if you have paired-end data, and you have multiple sets of reads for multiple samples, for the control, for example, and for the treatment. And although there is no standardized workflow for RNA-seq, like a gold standard that everyone uses, there are a lot of best practices and some standard ideas that can be applied to every dataset, and these are basically the steps corresponding to them. So after you get these files from the sequencing facility, the first thing is to do some quality control. This is already covered in a different lesson in Galaxy; you've already seen this yesterday, on day one of this event, so I will not be covering it in more detail. And then, after mapping onto a reference genome, given some annotation, some information about how your transcripts match a particular feature like a gene, you can do the read counting.
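The counting idea described above can be sketched in a few lines. This is a toy illustration only, assuming reads are represented by their mapped positions and genes by simple intervals; the gene names and coordinates are invented, and real tools such as featureCounts or htseq-count handle strandedness, overlaps, and paired reads far more carefully.

```python
# Toy sketch of read counting: given gene intervals on a reference and
# the mapped positions of reads, count how many reads fall inside each
# gene. Names and coordinates are made up for illustration.

def count_reads_per_gene(genes, read_positions):
    """genes: {name: (start, end)}; read_positions: mapped start positions."""
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos <= end:
                counts[name] += 1
    return counts

genes = {"purple": (100, 200), "yellow": (300, 450), "blue": (500, 700)}
reads = [150, 320, 340, 400, 510, 600, 650]
print(count_reads_per_gene(genes, reads))
# {'purple': 1, 'yellow': 3, 'blue': 3}
```

This mirrors the example from the slide: one read on the purple gene, three on the yellow, three on the blue.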
And eventually, from this process, what you get for every one of those samples is a count table. Having those multiple tables at the end, per group for example, you can address different questions, and one of the most common ones is differential expression analysis. If you look at the data preprocessing, the part right here, this is a single step where you try to refine your data and ensure that whatever information comes from there onwards is clean enough, minimizing the noise and any potential errors that may come in from the sequencing part. The first thing is to do some adapter clipping: if there are any adapters from the sequencing that have been left over, this is a good step at which to remove them. And also to do some quality assessment: if there are any low-quality reads or low-quality bases, you can remove or trim them, depending on the read lengths you're going to be using. The key part, and this is the one I will focus on a bit more, is how to do the annotation of these RNA-seq reads. You have a lot of different fragments, these small black lines, and the question is: how can we figure out where those came from in our reference genome? It's basically a mapping process, but it's not always a straightforward or easy approach. Keep in mind that what we have as input is basically fragments of this mRNA, of this whole orange piece of information. But if we try to map this onto the reference genome, we actually have some blanks, if you like, some black areas: these are the introns that have been removed before sequencing the mRNA. So if we map those reads to the mRNA, you expect something like this: everything is going to be mapped across the entire sequence of the transcript.
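The quality-trimming step mentioned above can be illustrated with a minimal sketch: drop bases from the 3' end of a read whose Phred quality falls below a threshold. This is an assumption-laden simplification; real tools (Trimmomatic, Cutadapt, and others) use more elaborate sliding-window and adapter-matching schemes.

```python
# Toy 3'-end quality trimming: walk back from the end of the read while
# the Phred quality score is below the threshold, then cut there.

def trim_3prime(seq, quals, min_q=20):
    """seq: read sequence; quals: per-base Phred scores (same length)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

seq, quals = trim_3prime("ACGTACGT", [38, 37, 36, 30, 25, 12, 8, 3], min_q=20)
print(seq)  # ACGTA
```

The last three bases fall below Phred 20, a common default cutoff, so they are trimmed away.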
However, if you try to do this on the genome, you might have cases, the ones highlighted in red here, that fall between two different exons, so that when you map them at the genome level they appear to have a gap: one part of the read is in one exon, then you have a big gap, which corresponds to the intron, and the second part of the same read is mapped to the next exon. This is a useful piece of information, but it's also a challenge, and one of the things that the aligners and different mappers need to address. It's something to always consider, especially if you are thinking about identifying novel transcripts, alternative splicing points, fusion genes, and so forth; this is one way that those things can be identified. So if you look at the mapping, going back to this particular process, there are three main ways of dealing with it. The first approach is to map directly to the transcriptome: having this particular piece of information itself, we map our reads directly onto it. This is straightforward. The second strategy is to map to the genome: if we don't have, or don't want to use, a transcriptome, we can map those reads onto the genome itself, again with the challenges that we identified earlier. A third strategy is to do a de novo assembly: take the reads, try to reconstruct a transcriptome, and use that as the basis for counting. I'll go through each of those in a bit more detail. Starting with transcriptome mapping, this is the easiest one to achieve, because what you have is a transcript: you have basically exon 1, exon 2, exon 3 of a particular transcript, and what you do is take your reads and align them directly to the transcriptome. Even with paired-end data, you can see how the pairs can be aligned to different parts. So this is easy enough to achieve, but it has two main disadvantages, if you like. The first is that you really need to have reliable gene models.
What you use as a reference transcriptome needs to be reliable enough. These exist for a lot of the reference organisms, the different species out there, but if you're working with a not-so-common species, that might be a bit more difficult to achieve. Also, detecting novel genes is not possible here, because what you're doing is aligning your reads to an existing transcriptome, to already known transcripts of genes; novel genes and novel isoforms are not going to be identified through this process. The second strategy, as I said, was to use genome mapping. This is a bit more difficult to achieve, more challenging. As I said, if you have a paired-end read, the first one would be easy to align in this instance, because it maps completely within an exon, within exon 1. But the second read actually spans three exons: a bit of exon 1, the entirety of exon 2, and a little bit of exon 3. So this one would be a bit harder to align. But genome mapping has a very distinct advantage: you are able to identify splicing points and also to detect potentially new genes and new isoforms. Both of these strategies, as you can understand, have a very common theme: in both cases, you require a high-quality reference genome or reference transcriptome, ideally in FASTA format. And in order to have a good annotation of the regions of this reference genome or transcriptome, you also need the annotation of the known genes, usually in GTF file format, although there are additional formats that are equally compatible here. Both of those pieces of information are relatively easy to find, especially if you're aiming for well-studied organisms like human, mouse, and so forth. Some of the projects and organizations that actually produce and maintain annotations like that include EMBL-EBI, NCBI, UCSC, RefSeq, Ensembl, and so forth.
So you can look into these projects and organizations and retrieve those files. If neither of those two strategies works for you, or if you don't have a reference genome or don't need one, then the third strategy is the one that might work. In this case, what you do is assemble your reads: you do a de novo assembly into transcripts, and then, whatever is produced, you use it as a reference to map your reads back and actually do the quantification. If you aim only at identifying the transcripts, at putting together the list of the individual molecules that you found, then step one is sufficient. But if you want to do quantification as well, which is our goal here for this particular introduction, then you need to map your original reads back onto your assembled transcripts, so you can quantify the reads against the transcripts themselves. So these are the strategies for the mapping part. The next step, once you've done the mapping, is to actually do the quantification, which is again addressing the question: what is the expression level of the genomic features that we're looking at? If we want to count the number of reads per feature, that is relatively easy: if we have the features and we have the reads mapped, we count them depending on how they are structured. But there are also a few challenges. The first is the one that we have already touched upon a few times: if we have reads that map to multiple places, what will happen? If you have, for example, repeat regions, you might have a read that aligns to multiple locations, so the read could be coming from multiple different regions, and the aligner, the mapper itself, will propose that this particular read can be mapped here and here and here. How to address this is a question that needs to be decided upon during the analysis.
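One common way of deciding what to do with the multi-mapping reads just mentioned, not prescribed by the lecture but widely used in counting tools, is either to discard them or to split them fractionally across their candidate locations. A toy sketch of fractional assignment (the read and gene names are invented):

```python
from collections import defaultdict

# Fractional counting: a read that maps to k locations contributes 1/k
# to each, so every read still contributes exactly 1 in total.

def fractional_counts(alignments):
    """alignments: {read_id: [gene, ...]} listing every gene a read maps to."""
    counts = defaultdict(float)
    for read, genes in alignments.items():
        for gene in genes:
            counts[gene] += 1.0 / len(genes)
    return dict(counts)

aln = {"r1": ["geneA"], "r2": ["geneA", "geneB"], "r3": ["geneB"]}
print(fractional_counts(aln))  # {'geneA': 1.5, 'geneB': 1.5}
```

Discarding multi-mappers outright is the more conservative alternative; which choice is appropriate depends on the biology of the sample, for example how repeat-rich the genome is.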
Also, a different question is, if we want to do a quantification of features, how do we want to distinguish the different isoforms, for example? Are we going to do this at the gene level, at the transcript level, or at the exon level? All of those are questions that need to be addressed before doing the quantification itself. Given that we have the quantification done, and we're going to be seeing a few tools in the hands-on later in this tutorial, we can move on to the differential expression. We've done the quantification in a single sample, but we may want to identify the differences in the numbers, in the concentrations of RNA, across different groups, different conditions, or different samples. Essentially, what we are going to be producing per sample is a sort of distribution, if you like, across the same reference. But then, if we want to do this differential expression analysis, we need to account for the variability of expression across both the biological replicates and the technical replicates, again with the help of the counts. The first step usually is normalization: in other words, trying to make the expression levels comparable across the different groups. There are different ways of doing that. You can do it by features, at the gene level or the isoform level, for example, or you can do it by samples, so that you ensure that all the samples are comparable. There are multiple methods that achieve this, and every method usually corresponds to a different tool that implements it. There are the FPKM and RPKM methods of normalization across different samples; there is the TMM method, which is available in the edgeR package; and there is also the DESeq2 method, available through the R package of the same name, which is the most commonly used approach.
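The RPKM normalization mentioned above (reads per kilobase of feature per million mapped reads) can be written down directly as a formula: count, divided by feature length in kilobases, divided by total mapped reads in millions. A minimal sketch, with invented counts and gene lengths:

```python
# RPKM = count / (gene length in kb) / (total mapped reads in millions).
# This corrects counts for both gene length and sequencing depth.

def rpkm(counts, lengths_bp):
    """counts: {gene: raw read count}; lengths_bp: {gene: length in bases}."""
    total = sum(counts.values())  # total mapped reads in the sample
    return {
        g: counts[g] / (lengths_bp[g] / 1_000) / (total / 1_000_000)
        for g in counts
    }

counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 1_000, "geneB": 2_000}
print(rpkm(counts, lengths))
# geneB has the same raw count but twice the length, so half the RPKM
```

Note that RPKM only rescales within a sample; as the lecture goes on to say, methods like TMM and DESeq2 are preferred for between-sample comparisons because they are robust to differing library sizes and compositions.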
It's important to highlight that, so far, the methods shown to be the most robust are DESeq2 and TMM, because they cope better with different library sizes and different library compositions. So if you're comparing different sets of samples, TMM and DESeq2 are the most relevant ones. It is also important to keep in mind that the number of replicates used in differential gene expression analysis, as well as the sequencing depth of each individual sample, are critical aspects and have an effect on which genes, and how many of them, are identified as differentially expressed, that is, as significantly expressed. As you can see, this is from a study by Conesa et al. in 2016, and it shows how the number of replicates per group and the sequencing depth have an impact on the probability of detecting differential expression at a significance level of 5%. By increasing the number of replicates, you significantly increase the statistical power, and having a sufficient depth of sequencing also increases the probability of identifying those differentially expressed genes. So the rule of thumb is to have at least three biological replicates per group, in order to have sufficient power for the statistical analysis that is done at the end, when you identify the differentially expressed genes. In closing, after doing the differential expression, the next step is usually to do some visualization, and there are different visualizations for different parts of the process. For example, if you're looking at the aligned reads using the BAM files, which we've seen earlier, you can use IGV or Trackster to visualize those aligned reads. And you can also make Sashimi plots, again through IGV or other tools, to see what the read coverage along exons and splice junctions looks like.
At the end, after having the counts, you can also do a more efficient visualization of the counts, their differential expression, and the fold change, for example, using packages like CummeRbund, which was designed to connect with Cufflinks as part of the Tuxedo pipeline a few years ago. There are a lot of tutorials available on how to do that. We will now be going through the reference-based RNA-seq pipeline, leading up to the R part, how to do the analysis of counts using R. I would like to acknowledge the Galaxy Training Network, and particularly Bérénice, Anika, and Markus, for putting together this particular tutorial. And thank you for listening.