We're going to start module one, which moves on from the cloud introduction to an introduction to RNA sequencing. This first module, and this lecture in particular, is the one that's most focused on the wet-lab biology. It's really important to keep those details in mind and think about how differences in the way you generate your data can influence the way you analyze and interpret that data.

I was trying to keep a mental tally when everyone was doing their introductions, but I think it would be interesting to review by a show of hands. How many people here are working with human RNA-seq data? About half. What about plants? OK, three or four, so that's a bit more than usual. You have a challenge in several senses relative to the people working with human data: one, plant genomes tend to be really, really complex; another is that, partly because of that complexity, they tend not to have reference genomes that are as good. And a lot of tool development still uses human data as its reference point, so the human folks have a bit of an advantage there as well.

OK, that raises an interesting point. How many people are working in a kind of metagenomic context, with two or possibly many species? OK, so a couple. Interesting. What about reference genomes: who doesn't have a reference genome? Anyone? One, two. OK, so that's good to keep in mind. We're going to talk a little bit about that, but there are definitely specialized methods that need to be employed when you don't have a reference genome for RNA-seq.

How many people here would consider themselves a bioinformatician to some degree? Two or three? OK, four. So maybe four or five, with some maybes, or some people who are too modest. For you, you're probably going to be helping out the people beside you at some points, and a lot of this will be more of a review, so you can focus on the RNA-seq informatics aspects. For the rest of you, in addition to learning about RNA-seq informatics, this is really going to be a crash course in working at a Linux command line and doing command-line bioinformatics in general. A lot of what you learn here will be useful for kinds of analysis other than RNA-seq.

OK, so it's good to know where you're all coming from. With that, let's jump into module one: the introduction to RNA sequencing. Then we'll do a brief introduction to the first workshop, and we'll get into the hands-on tutorials after that. Again, feel free to ask questions as we go.

The learning objectives of this first module are to cover the background and rationale for sequencing RNA in the first place, and to talk about some challenges specific to RNA-seq. We often get folks in these workshops who have been doing some work with DNA sequencing or other types of sequencing, and while there are certain commonalities, there are also challenges that are really specific to RNA-seq. We'll talk about some of the general goals and themes of RNA-seq analysis workflows, and some common technical questions related to RNA-seq analysis.
We'll also point you to resources to answer a lot more of the technical questions that come up, which we couldn't possibly cover in the short amount of time here. We'll talk about how to get help outside of the course, and then we'll do that brief intro to the hands-on tutorial.

Before going any further, I find it useful to take a step back and make sure we're all on the same page about what we're actually sequencing here. Sorry, I don't have a very good pointer, but this is a very simplified and yet still kind of busy picture of the central dogma, which starts with a double-stranded genomic DNA template from which the machinery transcribes single-stranded pre-mRNA molecules. An example gene is shown here with three exons and two introns, along with some of the features that the splicing machinery recognizes. These features allow a very complicated splicing machinery to remove the introns and assemble the exons into a mature mRNA molecule. The gene in this example produces a coding mRNA that's going to be polyadenylated, but not all transcripts are coding, and not all transcripts are polyadenylated, so we need to keep that in mind. These molecules are then exported from the nucleus to the cytoplasm, where they get translated into protein sequences. In a lot of biology, the thing we really care about on some level is the protein, which has been post-translationally modified. If we could sequence proteins and measure their abundance in a massively high-throughput fashion, cheaply, we would probably do that. So a lot of experiments use RNA as a kind of proxy for those protein sequences.

But I think it's really important at this point to think about what we're actually sequencing in RNA-seq. This mature mRNA sequence is often the subject of RNA-seq, but we're not really sequencing these things directly. First of all, we're not sequencing RNA directly: RNA gets converted into cDNA, which could be single-stranded or double-stranded. And in most of the sequencing that we're doing, the available read lengths are not nearly sufficient to cover the average length of a transcript in most species. In human, for example, transcripts are on the order of 1,000 to 2,000 bases or longer, and RNA-seq reads tend to be the ends of fragments in the 300 to 400 base pair range. So we're sequencing fragments of cDNAs that have been created from RNAs.

And then when we think about reads that have been aligned to a genome: we're looking at these alignments, thinking about their positions, and comparing those positions, perhaps, to known annotated transcripts. We're inferring, in many cases, that the reads aligning to and overlapping with a transcript came from that transcript being transcribed in a cell. But that isn't, strictly speaking, necessarily true. Those reads could be many other things. Can anyone think of somewhere else, other than a mature mRNA, that a read we're seeing in our alignments could have come from? Yeah? Genomic DNA? Great, excellent point. We're busting open cells and isolating RNA, and of course there's genomic DNA in there as well. We may do a DNase treatment to try to reduce the amount of genomic DNA, or we may do a column cleanup, which indirectly washes some of the genomic DNA away, but inevitably, these procedures are not absolute.
There is always going to be some level of genomic DNA contamination in our RNA. So if we see a read that aligns inside an exon, we're tempted to think, oh, that read must have come from a transcript containing that exon, but it could have come from genomic DNA. And when we see that smattering of reads throughout the genome that don't overlap with known transcripts, those, of course, could also be from genomic DNA.

Any other examples of where these reads could be coming from? Yeah? It could just be a bad mapping to the reference, and the read actually belongs somewhere else? Excellent, good point. It could be that the read we're seeing has been misplaced and belongs somewhere else. Perhaps there's something missing in the reference genome that causes the read to align here, but basically it's a false alignment.

Anything else? Isoforms? So, definitely the reads we get will correspond to isoforms, and we might not know exactly which isoform a particular read came from just by looking at its alignment. But when you raise the topic of isoforms, it also raises the topic of splicing, which points to another major category of reads that aren't necessarily what we're after: unprocessed RNA. There could be, and almost always will be, some amount of molecules in our RNA-seq data that came from unprocessed RNA rather than mature transcripts. So the introns are there. A read in an intron could be there because the intron is retained in an alternative isoform, but it could also be that the RNA just hadn't been processed yet, and it's an immature RNA that we're sequencing. So we'll get some amount of noisy signal in the introns of every gene, added on top of the noise signal we get from genomic DNA.

Those are the major categories. I guess another one we could mention is that the transcriptional machinery itself can be noisy. So the reads you see somewhere could really have come from a transcription event, but it could have been a noisy, meaningless, not biologically significant transcription event. There's a subtle distinction there as well.

So this is what an RNA-seq workflow looks like at a really, really high level, starting with a very simplified experiment where we have two conditions of interest and we want to do some comparison between them. We have tumor and normal in this example, and we isolate RNA from some of those cells; there are some RNAs depicted here with little poly-A tails on them. These get converted to cDNA, and they get fragmented. The fragmentation can be done at the RNA level or at the cDNA level, depending on the procedure you're using. Then often there will be a size-selection step: you're selecting for fragments in a certain size range, or perhaps just throwing away the small ones. Then you add sequencing adapters to these fragments, depicted here in blue and red, and you flow the fragments across a flow cell; here's a depiction of a flow cell in the hands of someone in our group. That produces sequence data in the form of many sequential images of the surface of an Illumina sequencing machine, which get converted into sequence data. The sequence data is depicted here as paired-end reads: you have read one corresponding to the blue adapter and read two corresponding to the red adapter.
We basically sequence from the ends of these fragments inwards. You can see that sometimes, if the insert size was short, the reads will cross over each other and meet in the middle, and you've basically sequenced the whole fragment. But this isn't usually the case in paired-end data. Usually there's some amount of sequence in the middle that remains unknown. So we have these two reads that came from the same fragment; when they align correctly, they align close to each other, but there's some sequence in the center that's unknown. You can infer what it might be, but you haven't actually sequenced it. (There's a quick sketch at the end of this passage showing how to check the insert-size distribution in your own data.)

These reads get aligned, often against a reference genome, or against reference transcript sequences, or some combination of the two, and you wind up with an alignment file where reads are basically stacked up against the reference. These BAM files then get fed into downstream analysis. We're really going to dig into the details of all the input files, the BAM files that result, and the downstream analysis on those BAM files.

So this is a background and rationale slide: why would you sequence RNA? You're all taking an RNA-seq workshop, so you're obviously convinced it's a useful thing to be doing, but just to review some of the things people think about. There are functional studies where it doesn't necessarily make sense to sequence the genome, because the genome is constant, but some experimental condition is having a pronounced effect on gene expression; this might be a drug-treated versus untreated cell line, for example. There's transcriptome annotation: for many years, people attempted to predict what transcript sequences would be created just by looking at the genome sequence and the features in it. This is extremely hard to do well, and gene annotation has been basically revolutionized by RNA-seq, because we can now directly sequence transcripts, align them back to the genome, and resolve the exon-intron boundaries with a high degree of accuracy. Certain molecular features can really only be observed at the RNA level anyway: alternative isoforms, which were mentioned already, fusion transcripts, and RNA editing, which you can only see at the RNA level. In the cancer field, RNA-seq can be really useful for interpreting mutations that don't have an obvious effect on the protein sequence. A mutation that isn't a missense mutation perhaps occurs in your gene in some regulatory context, and RNA-seq can help tease that out. And then another cancer-specific application we're interested in is prioritizing protein-coding somatic mutations. These are mutations with some predicted effect on the amino acid sequence, and we want to use RNA-seq data to bin them into different categories: this is a mutation in a gene that's expressed, and we're actually seeing expression of the mutant allele; or perhaps the mutation causes nonsense-mediated decay, and we don't see as much expression as we might expect; and so on.

There are a number of challenges that are peculiar to, or stronger in, RNA-seq experiments, and a lot of these relate to the sample.
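Before we get into those challenges, here's that quick aside on insert size. This is a minimal sketch, not a step from the course: it assumes samtools is installed, and the BAM file name is hypothetical.

    # Summarize the alignments; the SN section of the output includes
    # the insert size average and standard deviation for paired reads.
    samtools stats sample.sorted.bam | grep '^SN' | grep 'insert size'

If the average insert size is close to twice your read length, many read pairs will overlap in the middle; if it's much larger, there will be an unsequenced gap between the mates, as described above.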
So with any sample you can have purity problems and quantity problems, but RNA in particular is subject to degradation, so quality is something we spend a lot of time thinking about. RNA quality can vary quite a lot from sample to sample, and this can have pretty important consequences for analysis if sample quality varies across the conditions you wish to compare.

Then there are the informatics challenges. RNAs generally consist of small exons that, in eukaryotes, are separated by large introns. How many people here are not dealing with eukaryotes, that is, dealing with a prokaryote? Okay, five or six of you. You have a bit of an advantage in terms of the complexity of transcript structure, most likely. But in a lot of species you're dealing with this problem where you have small exons separated by huge introns, and we have these little reads that we're trying to map to the exons. The alignments need to span across those big introns correctly and get the boundaries just right, and that creates a sequence alignment challenge, informatically, that's much harder than aligning whole genome or exome data to a reference genome.

Another thing that's different between the genome and the transcriptome is that the relative abundance of RNAs varies widely. When you're thinking about a diploid genome, under normal conditions you have approximately the same number of copies of each chromosome, so when you shotgun sequence the whole genome, you generally expect approximately even coverage. Of course that's an oversimplification, but for the most part you get, say, 30x coverage of the genome, and it's pretty even, at least in all the places where you can map reads well. In RNA, we don't have this expectation at all. We know there are transcripts expressed at a very, very low level, whose biological function requires only a few copies per cell perhaps, and there are other transcripts whose normal biological function requires thousands or even tens of thousands of copies. It's thought that the range from the lowest expressed transcripts that are functional to the highest is on the order of one to 10^5 or even 10^7, so five to seven orders of magnitude. You have stuff that's really, really common and stuff that's really, really rare, all mixed together, and you're kind of stuck with that mixture. There are some things you can do to tweak the conditions a bit, but this is a fundamental challenge of RNA sequencing. The reason it's a challenge is that RNA-seq works by random sampling: when you start pulling reads randomly out of a hat, you tend to sequence the most highly expressed things over and over, and if you happen to be interested in a gene that isn't a highly expressed housekeeping gene, you're going to have to sequence quite deeply, maybe more deeply than you'd expect, to get at the transcripts you care about. And this of course raises the topic of ribosomal and mitochondrial RNA, which tend to really dominate a lot of RNA-seq data sets.

Another challenge particular to RNA is that the molecules come in a wide range of sizes, so generally you can't sequence the entire transcriptome in one shot.
The way it's shaken out is that the field has divided into two camps. Small RNA sequencing, which covers anything in the range of, say, 15 or 20 up to maybe 100 bases, is generally processed in a quite different way from everything over 100 to 150 bases, and the latter is what we're referring to as RNA-seq here. Small RNA sequencing has its own tools, its own lab methods and so forth, and a lot of people will, at the outset of their experiment, divide their sample into these two paths and try both approaches.

The fact that RNAs have different sizes has various influences on the downstream analysis. One comes back to this idea of random sampling: if you think about how many reads you observe for each of your genes, you're going to observe more reads for large genes just because the gene is larger, and smaller genes are going to be harder to sequence robustly, perhaps, because there are fewer ways to generate a read from that sequence.

Another aspect where you should really think about the size of the RNAs is poly-A selection: you should be really careful not to do poly-A selection on RNA that's at all degraded. The reason is that you can introduce a three-prime end bias, which causes a couple of problems. Your RNAs are in solution and perhaps get a bit degraded, and then you do a poly-A selection: you're holding on to the three-prime end of all of your transcripts and washing everything else away to get rid of the non-polyadenylated material. Well, if your RNAs are broken, you're also washing away the five-prime ends of your transcripts. So when you sequence that, most of your sequence data piles up at the three-prime end of each gene. There are various tools out there that will help you quantify how even the coverage is across each of the transcripts, and there are metrics that give you a five-prime to three-prime ratio, to give you a sense of how much of this bias is present; there's a small sketch of one such tool below. You should really assess your RNA quality carefully before you decide to go down the poly-A selection route. It's particularly problematic if you have a range: if some samples are heavily degraded and some are really intact, comparing those is going to be potentially challenging. And of course, this all stems from RNA being quite fragile compared to DNA, so it's easily degraded.
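Here's that sketch of quantifying coverage evenness. This is a hedged example, not the course's prescribed QC step: it assumes you have Picard available, and the BAM and refFlat annotation file names are hypothetical.

    # CollectRnaSeqMetrics reports per-transcript coverage statistics
    # from an aligned BAM, including a median 5' to 3' bias metric.
    java -jar picard.jar CollectRnaSeqMetrics \
      I=sample.sorted.bam \
      O=sample.rna_metrics.txt \
      REF_FLAT=refFlat.txt \
      STRAND=NONE
    # Look for MEDIAN_5PRIME_TO_3PRIME_BIAS in the output; values far
    # from 1 suggest the kind of three-prime pileup described above.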
So this is a review of an Agilent trace. How many people have worked with the Agilent Nano platform? About six or seven or eight, maybe. This is a really, really common way of assessing RNA quality. If you're working with RNA-seq data, whether you get it from a collaborator, from a sequencing core, or from running your own instrument, you definitely need to be thinking about this, and you'll often see these traces. You run a small sample of your RNA; you're effectively running a gel through a capillary and then reading off that gel over time. Along the X axis here you have time, and a ladder of molecules of known size is run alongside and used to convert that time scale into a size scale. The things that come out first are the smallest, and progressively larger molecules come out over time.

If you look at the example on the right, this is a sample of RNA that I isolated from a cell line that was growing happily in the incubator one minute and the next minute was doused in TRIzol. The RNA was isolated very quickly, and the result was very, very intact RNA, which consists of 95 to 98% ribosomal RNA. So what we're seeing here are the two ribosomal RNA peaks at their expected sizes. Basically, what this assay does is look at those peaks, their relative size, the area under the curves, and come up with an estimate of the integrity of the RNA based on that. This sample was given a score of 10 out of 10 by the assay, which is the best. As the RNA gets degraded, the molecules at these expected sizes get broken into smaller and smaller pieces, so you get secondary peaks to the left of these positions and a kind of smearing effect; on a gel, this would look more and more like a smear and less and less like two distinct bands. Again, using this assay, you can quantify the degree of degradation based on the drop in the relative area within the two expected ribosomal RNA peaks. In this case, it gave a score of six.

These scores will often be reported along with RNA that you get from external collaborators or elsewhere, and a lot of sequencing cores will have some kind of cutoff or recommendation about the minimum RIN score before doing an RNA-seq experiment. For a lot of places, that number seems to be eight, which of course is arbitrary. But if the profiles look very different across your samples, that might be a problem. And we can talk about strategies for dealing with heavily degraded RNA if you want.

All of these things relate to design considerations: how should you do your RNA-seq experiment in the first place? This is really getting out of date, and I haven't seen an update to it or much else that has gone down the same road, but there's a document produced by the ENCODE consortium, which you can download from the course wiki. It discusses things like: what metadata should you supply with your samples? How many replicates should you do? What sequencing depth is needed in an RNA-seq experiment? What kind of controls should you include? How should you report the data, and what standards have been proposed? It's quite lengthy, basically a list of all the things you probably should do when you set up an RNA-seq experiment. Almost no one does all of them, but most of them are really useful things to at least consider. And a lot of them have to do with the complexity of how you make RNA-seq data in the first place.

And that is a challenge: over the last five years, there have been many, many RNA-seq library construction strategies. It's a rapidly evolving field. New kits are constantly coming out, and people are coming up with different ways of enriching RNAs and preparing libraries to deal with different edge cases, like degraded RNA and small input, and to generally do a better job of RNA-seq. One of the first big decisions is whether to use total RNA or poly-A RNA. You can either start with total RNA and do a ribo reduction.
You basically can't just sequence total RNA, at least in species like human, because all you would be doing is sequencing ribosomal RNA over and over again. So you need to do some kind of enrichment for everything that isn't ribosomal RNA. You can do a negative selection, where you try to subtract out the ribosomal sequences, or a positive selection, where you try to select for the polyadenylated RNAs. And you're really going down two different paths there. If you're interested only in RNAs that are polyadenylated and your RNA is very intact, you can get really nice data by doing a poly-A selection: you don't burn a lot of your data sequencing ribosomal or non-polyadenylated sequences. But it's a less complete representation of the transcriptome; you're really focusing on the polyadenylated RNAs. With ribo reduction, there's a challenge with the stoichiometry: the thing you're trying to capture and throw away is actually 95% to 98% of your sample. That's a hard thing to do, so it's difficult to get rid of all the ribosomal RNA, and it's common to have a pretty high amount left when you're done; there's a quick sketch below of how you might estimate that fraction in your own data.

So how many people here would say they've gone down the ribo-minus road versus the poly-A road? Ribo-minus first. OK. And what about poly-A? OK, and the rest are maybe still trying to decide or don't know yet. So it's about half and half.

[A participant asks whether there's any published work on how ribo reduction biases the remaining transcripts, and what that bias looks like.] It definitely is introducing a bias of some kind. And there are a lot of people interested in non-coding RNAs, many of which are not polyadenylated, so with poly-A selection you're definitely getting rid of a lot of that, or at least reducing its proportion. Neither of these approaches is absolute, so it's not that those molecules won't be there; it's that their ratios will be shifted. Again, it's fine if you at least do it consistently, so that you can compare things without a lot of systematic biases. If you're doing meta-analysis, I think it's going to be really challenging to combine data sets that were made in these two different ways. I'm not aware, offhand, of a reference that digs into exactly that question, but we have a resource table that covers a lot of these things, with a lot of references, and one of them might speak to it. I can check on that at the next break. Any other questions?

Another thing that's common and has some variation in it is size selection. We already talked about the difference between really small RNA sequencing and regular RNA-seq, but even within RNA-seq, some people do a very narrow fragment size selection: they either use an instrument, or they actually run their sample on a PAGE gel and cut out a very tiny slice, so the fragments are all a very consistent size. Other people don't like that strategy and just use a cleanup method that gets rid of the small things and keeps everything else. These give you two quite different size distributions, which I'll show on the next slide.
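Here's that sketch of estimating residual ribosomal content. This isn't a step from the course; it's a hedged example assuming bowtie2 is installed and that you've put your species' rRNA sequences into a FASTA file yourself (all file names hypothetical).

    # Build an index of rRNA sequences only, align a subsample of reads
    # against it, and read the rRNA fraction off the alignment-rate
    # summary that bowtie2 prints to stderr.
    bowtie2-build rRNA.fa rRNA_index
    bowtie2 -p 4 -x rRNA_index -U subsample.fastq.gz \
      -S /dev/null 2> rrna_summary.txt
    grep 'overall alignment rate' rrna_summary.txt

A high alignment rate here means the ribo-depletion left a lot of ribosomal material behind, and that fraction of your sequencing is effectively wasted.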
There are also some kits that use a linear amplification: if you have a really small amount of total RNA, you can add an additional adapter that includes a promoter, and then use a polymerase during your RNA-seq library construction to basically increase the amount of material that's there. There are stranded versus unstranded libraries; we're going to talk about that specifically. There are kits that help you do basically an exome capture of your RNA-seq library before you sequence it. This is another way of enriching for sequences you'd like to sequence and getting rid of sequences you're not interested in sequencing. And then there are kits that do library normalization, using wet-lab techniques to normalize the library. What I mean by normalize here is trying to reduce the scenario where you have a huge amount of your most highly expressed genes and a really small amount of your most lowly expressed genes; you're trying to shift things back so that every gene is on a more even playing field. That, of course, introduces bias in the expression output, but it could be useful if you're just interested in figuring out what the sequences of your RNAs are and you're not yet concerned with accurately estimating their abundance.

If you have data that varies in any of these criteria and you're trying to compare it, that can create a problem. You have systematic differences in the data, and you're trying to make biological conclusions based on the differences you observe, but the differences could be down to the way the data was created and not the biology you're trying to study. That's really the main point of all of this.

This next slide is a depiction, really small here, mostly as a reference you can go through on your own, but it covers a lot of what I just described: RNA being isolated from some tissue, assessed in this case on a gel, and how that lines up with examples of these Agilent traces. You have very intact RNA, then partially degraded RNA, then heavily degraded RNA, and then basically completely degraded RNA, and you can see how the distribution shifts continually to the left as the RNA becomes more degraded. Assuming your sample is okay and you go ahead with library construction, you go through the fragmentation step: you start with these large cDNAs and break them into pieces of a much more even size. Often you'll also use the Agilent instrument to assess the size range of the libraries. In this case we have a relatively tight distribution of sizes, and here's an example where there's more of a tail: instead of a really tight distribution, you've got this tail of larger fragments that will be sequenced. After a size selection, you wind up with fragments in a certain size range, you add your sequencing adapters, and those go on to be sequenced. But in this step, you're losing small RNAs that were below some threshold; they're not totally lost, but they're going to be mostly lost.

And then this slide depicts, in cartoon form, some of the major enrichment strategies. You start with total RNA, where you're basically just going to have tons and tons of ribosomal RNA.
Then there are three main arms of enrichment: ribosomal RNA reduction, where you try to pull out the ribosomal RNAs and keep everything that's left; poly-A selection, where you try to capture the polyadenylated RNAs and throw away everything else; and cDNA capture, where you try to capture specific sequences corresponding to known transcripts or genes.

And I mentioned stranded versus unstranded data. In the last couple of years, strand-specific library kits have really come along. Previously, when you sequenced RNAs from double-stranded cDNA, at the end of it you didn't know for sure which strand your sequence came from. You could align it back to the genome and say, okay, it lines up. There are some depictions here of little reads aligned back to the genome: these reads line up against an exon, so you can infer that they came from that exon, which is being transcribed from left to right in this example. But you don't actually know which strand each of those reads came from; that's just an inference, and there will be an approximately 50/50 mix. If you look at this data in IGV, you'll see the comparable red-green or red-blue coloring for all of your reads. But now we have library construction methods that allow the strand information to be maintained, so that when we align the reads back, we have an extra piece of information that says which strand each fragment was transcribed from. You can use that, in addition to the alignment to known transcripts, to give you extra confidence that you're actually measuring that transcript and not another transcript that happens to be on the opposite strand, transcribed in the other direction. It basically gives you more power to say which transcription event each read corresponds to. If you can get a stranded library, I highly recommend it.
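If you inherit a data set and you're not sure whether the library was stranded, you can usually infer it from the alignments. A hedged sketch, assuming the RSeQC package is installed (the BAM and gene-model BED file names are hypothetical):

    # infer_experiment.py samples the alignments and reports the
    # fraction of reads consistent with each strand configuration.
    infer_experiment.py -i sample.sorted.bam -r gene_models.bed
    # A roughly 50/50 split between the two configurations suggests an
    # unstranded library; a strong skew suggests a stranded one.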
Okay, and then the last few slides go through some of the common questions that come up. One has to do with replicates. There are a lot of different kinds of replicates one could think about in an RNA-seq experiment, starting with the most technical. A technical replicate could be something like multiple instances of sequence generation on the same instrument or the same flow cell. If you want a lot of data for your RNA-seq experiment and you can't get enough from one lane, so you sequence two lanes, do you need to worry that lane one is systematically different from lane two and that they need to be normalized to each other or something? Similarly, what if I produce some data this week on this flow cell, and the next week I take the same library, put it on another flow cell, and produce some more data? This graph shows an example of exactly that kind of experiment, where the same library has been re-sequenced, showing the correlation in the expression values that come out, and they're very, very good. It's now generally accepted in the field that the Illumina platform is robust enough that these purely technical replicates are probably not necessary: one lane is approximately equivalent to another lane as long as the library is the same. So people are not generally doing those kinds of technical replicates.

But of course there are experimental replicates. I would consider a more experimental replicate to be going through the whole procedure, starting from RNA: all of the molecular biology steps, the size selection, the conversion to cDNA. All of these things can introduce variability, and measuring that variability can be important. And then of course biological replicates are always important. RNA-seq does not make biological variability magically go away; you definitely need a properly designed experiment, just like you would for any other expression measurement platform.

How many people here are working with the Illumina platform? Is anyone working on a different platform? Anything? Everyone's Illumina, okay. A while back, when we started doing this, some people had Ion Torrent data or other platforms, but Illumina has so completely dominated that now, in a room of 40 people, there's literally no one using anything else. That just makes this aspect of things simpler.

Okay, so some common analysis goals; we're going to go through some of these. There's a huge list of things you can do with RNA-seq data, which is one of the really exciting things about it: you have this one starting point, the BAM file or the raw reads in FASTQ format, and you can ask all of these different questions of the data. Unfortunately, there's no push-button tool, at least not that I'm aware of, that will ask all of the questions you could possibly ask of the data in every way; there tend to be a lot of different methods. Some of the common analysis goals are things like gene expression and differential expression, which is really what we're going to talk about the most. We're also going to talk about alternative expression analysis: alternative splicing, alternative transcription start site usage, and so forth. There's transcript discovery: novel gene discovery, discovering novel transcripts of known genes, or annotating a genome. Perhaps you've just sequenced the first reference genome for a species and you want to figure out what things are transcribed and what the structures of those things are; RNA-seq is extremely useful for that application. Allele-specific expression, which could relate to common polymorphisms, SNPs, or to somatic mutations in tumors. Also in tumors, some people do de novo mutation discovery using RNA-seq data, fusion detection, RNA editing. You can also look for viral sequences in your RNA-seq data, and the list goes on and on. There's a table in the wiki that covers all of these applications and many more, and lists tools relevant to each of them, because we simply can't cover all of them in two days, and frankly we would need more experts to do so.

Although there are a lot of different applications with different goals in mind, they have a lot of common themes, which helps from a training and education perspective when it comes to analysis. Each of these does have distinct requirements and challenges, but they generally all involve a flow that starts with raw data; maybe you have to convert your files from one format to another, then you align or assemble the reads, and then you process your alignment files with a tool that's specific to each goal. So this is often where the branching really happens.
You align your reads, you have a BAM file, and you feed that BAM file into an allele-specific expression tool, or a gene annotation tool, or a gene expression abundance tool, or a gene fusion detection tool. You run the tool, and it gives you what are often very peculiar custom file formats with a lot of esoteric information in them, and that feeds into some kind of downstream statistical analysis using a package like R or MATLAB. And then you summarize and visualize: creating your gene lists, prioritizing candidates, producing visualizations, reviewing the data in a genome browser, this kind of thing. We're going to go through this whole procedure in the hands-on tutorials for a couple of the applications, and when you go on to do additional applications, you'll find that a lot of the same kinds of steps are involved, so hopefully you'll be able to use what you learn here for each of them.

So we usually do this Biostar exercise. Michelle already introduced the Biostar site. How many people have a Biostar account or have used Biostars before? Maybe four or five of you. I think it would be useful to take a few minutes to go to biostars.org and log in. You have to create an account, but you can most likely log in with an existing login that you already have, such as your Google account, and there are a few other options. We usually stop here for, say, five minutes and ask you to create a Biostars account, find a question that seems useful to you and vote it up, or search for something particular to your area of interest. Or if you really have a question, I guess you could ask one, although please try to search for that question first rather than asking something that's been asked a million times before. So yeah, we'll continue in five minutes.

One of the things about this site: RNA-seq is such a big topic that it gets its own tab at the top here. If you click the RNA-seq tab at the top of the homepage, it will filter down to things that have been tagged with RNA-seq. When you ask a question, you're asked to create these tags. Someone asked this: "How do I interpret Cufflinks output to detect novel isoforms?" Very relevant to this class. And they've tagged it with these tags: novel, cufflinks, isoforms, RNA-seq, and cuffdiff. And then hopefully people will come along and answer your question. Some of the people involved in these courses are active participants on Biostars; for example, if you click on that question, you'll see that this guy Malachi attempted to answer this person's problem. I didn't hear a response, and I also didn't get any votes. Maybe I'll vote this question up. You should only vote an answer up if you think it's a good answer, or, even better, do a better job of answering the question, which is actually not a simple question. There are quite a few people who have basically made Biostars the way they answer the questions they're commonly asked.
So usually, if someone asks me a question that's interesting and that I'm going to devote time to answering, I say: let's make this discussion public, so other people can comment on it, and the next time someone asks, we can point them to it.

We're running a few minutes behind, so I think we'll continue with just a few of the common questions that get asked over and over again in these courses, which we address up front. There are many other common questions, and a couple of ways to get answers: there's Biostars, and we're also going to show you a very lengthy question-and-answer table built from questions that have come up. And of course, we're always pleased if you can think of a question we haven't been asked before; that's always very interesting.

One of the common questions that comes up very early in the RNA-seq processing pipeline is: should I remove duplicates from RNA-seq data? A lot of people have been doing some NGS sequencing, maybe ChIP-seq, or whole genome or exome sequencing, and in some of those applications, removing duplicates is very, very common. The idea is that because making a library involves PCR amplification, you can introduce amplification bias. When you're shotgun sequencing a whole genome, the chance of two fragments being exactly the same is very, very low, so when you see two aligned fragments that start and end at exactly the same spot, you make the assumption that this is likely an amplification artifact: those aren't two independent molecules, they shouldn't be counted twice, and they should be collapsed to one read instead of two. So people who have been doing that for a while get their first RNA-seq data set and, almost out of routine, say: of course I'll remove the duplicates. It's NGS data, there's PCR amplification involved, we must remove duplicates.

But for RNA-seq, it's really not that simple, and the reason relates to what we already discussed about the peculiarities of RNA-seq data: RNAs vary in size, and their abundances vary from very, very highly expressed to very, very lowly expressed. The problem with removing duplicates, to take the extreme scenario, is that if you have a small transcript that's extremely highly expressed, there aren't that many unique fragments you can make from it. Say your fragment length is around 300 to 350 and you have a transcript that's 500 bases, with many, many copies of that 500-base transcript in your cells: there just aren't that many distinct 350-base fragments you can make out of that small transcript. So if you remove the duplicates, you may be underestimating the actual abundance of that transcript. Generally, in RNA-seq, for this reason, duplicate removal or marking is not recommended. The simple answer is: don't do it, because it can reduce the dynamic range of your expression estimates. That being said, there are certain applications where it might still be okay. For example, if you're doing mutation detection from RNA-seq data, people do seem to still want to do duplicate removal.
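For that use case, here's a hedged sketch of what duplicate marking looks like, assuming Picard (file names hypothetical); again, for expression quantification you would generally skip this step entirely.

    # Mark (rather than remove) duplicate read pairs so downstream
    # variant callers can ignore them.
    java -jar picard.jar MarkDuplicates \
      I=sample.sorted.bam \
      O=sample.dedup.bam \
      M=sample.duplicate_metrics.txt \
      REMOVE_DUPLICATES=false
    samtools index sample.dedup.bam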
And it kind of makes sense in that context, because you're more concerned about sequence identity than about abundance levels.

How much library depth is needed? Everyone wants to know how much data to generate. It's still expensive to generate RNA-seq data; if it were free, you wouldn't worry about this so much, although you'd still need to analyze the data, which takes time, and you'd use a lot of disk space storing it. So a lot of people want to sequence just the minimum they need to get the answer they want, and no more. A lot of people ask us to give them a hard number, which we pretty much refuse to do, because it depends on so many factors. I realize that's probably a very unsatisfying response, but I think we should think about it and discuss it. One of the big factors that matters a lot is what you're going to ask of the data. If you just want gene expression outputs, say the kind of result you would get from a classic gene expression microarray, that application places the least demand on the amount of data. If all you care about is one relative abundance per gene, you can get a pretty good readout of gene expression values from as little as maybe 25 million reads. There's actually a reference cited in the wiki where someone has thought about this, and they make quite a good argument that, if that's all you're interested in, you're better off spending your money on additional replicates: you'll get more value, statistically, out of biological replicates than you will by sequencing a smaller number of samples really, really deeply. But of course, the caveat is that you've decided up front that all you care about is one reasonably robust estimate of gene expression per locus, and that's a very narrow question compared to all of the questions you could possibly ask of RNA-seq data.

This is a common problem for someone who analyzes RNA-seq data. When the experiment is being designed and the budget considered, the PI says: all I care about is gene expression, and I want it to be really cheap, so let's pool eight samples in every lane. Then the data lands on my desk, or someone else's desk, and about a week later the collaborator forgets that the plan was just to look at gene expression and gets really excited about mutation discovery, or fusion detection, or discovering novel isoforms, and then you realize: oh shoot, this data is actually really sparse for those other applications. Things like alternative expression analysis place much, much greater demands on depth. To have a chance of resolving the structure of a transcript from the first exon to the tenth, and understanding how all the exons in between are connected, you need to cover the whole transcript, with good coverage across the exon-exon junctions. That demands a lot more than just getting a sense of how many reads hit a gene somewhere; those are two very different things. So instead of giving one number, we recommend really thinking about what your analysis goals are.

One strategy you can take is to identify publications with a similar analysis strategy or similar goals, just in a different species, and use them as a benchmark: this is what they did, I'm doing something very similar, maybe my genome is a bit smaller or there are other factors I can adjust a little, or I'll just reproduce what they did. Even better than that is to do a pilot experiment: sequence one or two of your samples, or a subset of them, very deeply, and then do a downsampling experiment where you say, okay, I have three lanes of data for every sample, and this is what I'm hoping to get out of the analysis, this is the kind of readout I'm looking for; how much data can I get away with? That's going to be the most real-world answer you can get.
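The downsampling part of that pilot is straightforward to sketch. This is a hedged example, assuming the seqtk utility, which isn't one of the course tools, and hypothetical file names; using the same seed for both files keeps the read pairs in sync.

    # Subsample a deeply sequenced pilot library at several depths,
    # then rerun the analysis on each to see where results stabilize.
    for n in 5000000 10000000 25000000; do
      seqtk sample -s100 pilot_R1.fastq.gz $n | gzip > pilot_${n}_R1.fastq.gz
      seqtk sample -s100 pilot_R2.fastq.gz $n | gzip > pilot_${n}_R2.fastq.gz
    done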
The good news is that the throughput of the Illumina instruments is incredible and improving all the time, and for most of the applications we're looking at, in most scenarios, one to two lanes per sample will be pretty darn good. But that's still not cheap, so you may very well not be willing to pay that much for each of your samples, and then you really need to think carefully about what your expectations are and try to make an informed decision. We can talk about specific projects during the breaks, if you want, and brainstorm about what the demands might be for each of your projects.

Then there's mapping strategy, something that's gradually becoming more standard. There are still some people who, again for cost reasons, want to produce really short reads; you can make a single Illumina kit stretch further by sequencing short reads. And again, if you just want a straight gene expression output, you can get away with reads that are quite short: 35 or 40 bases, sometimes 50. But if you have those short, bite-sized reads, you really need to take a different alignment strategy than if you have longer reads. You'll need to align against sequences where the introns have already been removed, so you'll want to align directly to transcript sequences, or to an exon-exon junction database in combination with the reference genome, and there are a bunch of published methods describing specific pipelines that take that approach. If your reads are longer than 50 bases, and especially if they're 75 or 100 or longer, then you have enough length to do gapped alignment. You can use a splice-aware aligner like TopHat, which is what we're going to use, or STAR, or others. Then you won't necessarily need to align against the transcripts; you'll align directly against the reference genome and let the reads tell you where the exon-exon boundaries are. That's a much less biased approach, and it gives you the opportunity to discover novel splicing events and novel isoforms and to quantify specific transcripts more accurately.
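To make that concrete, here's a hedged sketch of a splice-aware alignment with TopHat, the aligner just mentioned; the index and file names are hypothetical, and the exact options used in the tutorial may differ.

    # Gapped, splice-aware alignment of paired-end reads to a genome,
    # optionally guided by known gene models in a GTF file.
    tophat2 -p 4 -G genes.gtf -o tophat_out \
      genome_index sample_R1.fastq.gz sample_R2.fastq.gz
    # The gapped alignments end up in tophat_out/accepted_hits.bam.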
We're going to spend quite a bit of time visualizing RNA-seq data over the next two days, using the Integrative Genomics Viewer, IGV. There are other browsers out there that are quite good, but generally this is your first introduction to it. This is what an example set of data looks like. In this example we actually have three tracks, and only one of them is RNA-seq data. The first two tracks here are whole-genome data from a tumor case: this is the normal blood sample, and this is the tumor (I can't remember what kind of tumor it is; it doesn't matter).

What we're showing here is that a somatic mutation was detected: a mutation present in the tumor but not in the normal sample, and it happens to sit at the splice acceptor site of a known transcript, at the right edge of this intron, or the left edge of this exon. And then we have our RNA-seq data track here, where we're seeing reads that have been aligned with a gapped, splice-aware aligner. We have reads starting in one exon and then spanning: for example, this read aligns to part of this exon, and then the alignment continues over here, and when we compare that alignment to known transcripts, we can see that the gap aligns perfectly with the expected intron. That gives us a lot of confidence that this read actually came from this transcript. What these stars are marking are three reads whose alignments start at this exon and continue past the next exon into the third exon here. These are basically reads showing evidence of exon skipping, and it's precisely the exon skipping we would expect to see when there's a mutation at this splice acceptor site. That's a very simplified example of how you can integrate DNA and RNA-seq data to understand splicing patterns that result from a splicing mutation.

The last question, I believe, is: how reliable is RNA-seq? This is quite old data now, but it's still true to this day, actually even more true than it was then, that RNA-seq data can be quite accurate if the analysis is done well. One of the questions we can ask is: are those exon-exon junctions real? It seems pleasing that they align with the boundaries of known transcripts, but we can of course validate those predictions: what proportion of these junctions will validate if we use a more traditional approach like RT-PCR and Sanger sequencing? And similarly, for the abundance question: for exons that we predict to be differentially or alternatively expressed, can we confirm those calls with a gold-standard qPCR experiment? This is an experiment I did in my PhD, where I took about 400 predictions from RNA-seq and validated them by qPCR, RT-PCR, and Sanger sequencing. You can look at the publication; if there's a more recent publication that's done this kind of validation, I'd be interested to hear about it, because this data is now quite out of date, and RNA-seq data quality is much higher now than it was then. But the bottom line is that the result was very good. The overall validation rate was about 85%: 85% of the time that we predicted an exon being skipped, we went and captured that junction in that sample, sequenced it by Sanger sequencing, and saw exactly the skipping event that was predicted. And that's probably really a lower bound, because the data has improved, and of course this assay isn't perfect either; it will have false negatives as well. Similarly, with the quantitative assay, we got about 88% validation for specific exons that were called as differentially expressed from the RNA-seq data. So if you go back with qPCR and measure those same exons in the two conditions, you get a very comparable result.
What we're showing here is the RNA-seq readout for differentially expressed exons on the Y axis and the qPCR readout for the differential expression of those same exons on the X axis. We get a very nice correlation, and the things called differentially expressed by a statistical test in both cases have a very large overlap, giving us a validation rate of 88%. That would probably be much higher now if you repeated this experiment with new data.

Oh, and the last thing: what do I do if I don't have a reference genome? We decided, for practical reasons, that this would be outside the scope of this workshop, so we're not going to work through a specific example where you don't have a reference genome, but we can certainly talk about some of the options, and we can refer you to some of the extra, optional resources we've created that discuss this issue. The good news is that a lot of the skills you learn here will be totally applicable to that.

[A participant asks about covering de novo assembly in the workshop.] Yeah, I mean, it would be great. I think we'd need a third day to do that, and we definitely have other things we'd like to cover in a third and possibly a fourth day. We teach a version of this workshop at Cold Spring Harbor every year that involves a separate de novo transcriptome assembly module; I think it's a whole day just for that. It's not taught by us; another faculty member, from the Broad Institute, comes and does a demonstration of the Trinity application. So we're trying to add more and more online materials, at least, to help with that aspect of things. And I think it's very interesting: de novo transcriptome assembly has a lot of potential uses. Even when you have a reference genome, it would be nice to always do a de novo transcriptome assembly, because there are things you can discover when you're not biasing yourself toward what we currently know about the structure of the reference genome and the transcripts that have been annotated on it. So it's definitely an area of interest.

[Another participant asks about draft versus closed genomes, where not enough sequencing has been done to finish the genome.] Yeah, I see. So the question is basically: to what degree does it matter how finished your genome is? And of course, the more finished, the better. In terms of the analysis tools, thankfully, it isn't a big deal. Whether you have a really highly polished reference genome, like human or mouse, or quite a draft one (many, indeed most, of them are really quite drafty, and some are really rough drafts), the good news is that they all tend to use the same file formats and conventions for storing the data, whatever its current state. And the pipeline we're going to walk you through is quite agnostic to how finished the genome is. As long as you have some notion of chromosomes or contigs in a FASTA reference file, you can align to them. It will perhaps increase the complexity of the interpretation downstream, because you'll open IGV and, instead of having chromosomes 1 to 22, each read will be on one of 500 or 1,000 contigs. But everything about the analysis will still work. Any other questions? Okay.

We will start right at the beginning, with raw RNA-seq data in FASTQ format, which is basically how it comes off the machine now, and a reference genome in FASTA format, and we will make the BAM files.
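For reference, this is what one record of a FASTQ file looks like: four lines per read, with a read ID, the sequence, a separator, and a quality string encoding one base-quality score per base. This record is a generic illustration, not from the course data.

    @READ_ID_1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

FASTA, by comparison, is just a header line starting with '>' followed by sequence lines, with no quality information.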
Sorry, you're asking about draft versus closed genomes? They haven't generated enough... yeah, I see. So the question is basically: to what degree does it matter how finished your genome is? And of course, the more finished, the better. In terms of the analysis tools, thankfully it isn't a big deal. Whether you have a really highly polished reference genome like human or mouse, say, or quite a draft one, and most reference genomes are really drafts, some of them rough drafts, the good news is that they all tend to use the same file formats and conventions for storing that data, whatever its current state. And the pipeline we're going to walk you through is quite agnostic to how finished the genome is. As long as you have some notion of chromosomes or contigs in a FASTA reference file, you're going to align to them. It will perhaps increase the complexity of the interpretation downstream, because you'll open IGV and, instead of having chromosomes 1 to 22, each read could be on one of 500 or 1,000 contigs. But everything about the analysis will still work. Any other questions? Okay.

We will start right at the beginning with FASTQ files, raw RNA-seq data in FASTQ format, which is basically how it comes off the machine now, plus a reference genome in FASTA format, and we will make the BAM files. But we're going to start even before that with: how do I install the tools I need? The idea is to not assume anything; we're going to try to start from the beginning as much as possible.

So that's it. We're going to go into the tutorial now, but here's just a brief introduction to what we're going to cover, so that when you're going through the commands, you have a bit of context and an overview. We're going to keep showing this figure as we work our way through the hands-on tutorial. This is module one, and we're going to start with what was just discussed: what is the input to all of this, and what do these file formats mean? We're going to start with our FASTQ raw sequence data files. We're going to have a reference genome, and we're going to talk about where we get these things. We're going to have a gene annotation file, and we're going to talk about what that file format is and how it works. Then we're going to start to put these pieces together and run our data through this pipeline, which will create alignments and then try to compile or predict transcripts from there. Then we're going to estimate the abundance of those transcripts, do differential expression analysis and alternative expression analysis, and then do some visualization and summarization of the final results.

So then we're going to obtain a reference genome and get some gene and transcript annotations. We're going to talk about the GTF file format. We're going to index our reference genome files for use with the aligners. This is a really common theme of alignment: you have a reference genome and you have reads you want to align to it, and there's an intermediate step where you need to index your reference genome. You can think of indexing kind of like an index in a library. Maybe some of you are too young to remember, but there are these index cards in libraries. It's a sort of lookup table, so it helps you more quickly find where you want to look in a very large reference genome, which makes the alignment happen much faster. And then we're going to get our raw sequence data, and we're going to look at the FASTA and FASTQ formats for storing read data.
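To make that concrete: a FASTQ record is four lines, a read identifier, the base calls, a '+' separator, and the Phred quality scores encoded as ASCII characters, one per base. The record below is invented for illustration:

    @EXAMPLE-INSTRUMENT:1:1101:1224:2158/1
    GATTACAGATTACAGATTACAGATTACA
    +
    IIIIIIIIIHHHHHHHGGGGGFFFFEEE

FASTA, by comparison, carries only a '>' header line and the sequence itself, with no quality information.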
All of this is going to be done on example data. The example data is human data, but it's been heavily downsampled so that when we run commands and do alignments, it doesn't take hours or days. Things are a little bit idealized: we've selected, in advance, reads that align to a single chromosome, and we picked a small one, chromosome 22. So instead of aligning to the whole reference genome, we're aligning to a sort of mini version of it, and that allows us to work through the analysis much more quickly. If we analyzed the full datasets from which this downsampled data was drawn, some of the steps would take several hours, and that wouldn't really work with the tutorial being hands-on and live.

Now, some common problems that are encountered while working on these tutorials. A common criticism of this workflow is that we're going to be asking you to copy and paste from the tutorials. Some of these commands are really, really long. For the short ones, it's okay to type them out, but for the long ones it just takes too long, because you have these really long paths. So it's kind of unsatisfying to be copying and pasting, but it's really the only way for us to work through even this toy example. Of course, you can slow down, type certain things out, and see what happens when you make mistakes. A common problem when copying is that you don't always get all of the command. Some commands span multiple lines, or you get to the end of a line and forget to hit the enter key, so the command is just sitting there, and then you paste another command and it gets joined on, making a garbled command, and that causes a problem. Everything here is very sequential, so if something gets messed up at one stage, it tends to mess up everything downstream; everything depends on the upstream step happening correctly. Another really common thing is just to be in the wrong directory at the wrong time. If you follow the tutorial through, it basically tells you where to be at all times, but if you start exploring around, say you go back up a directory to look in another directory we created a minute ago, and then you try to continue the tutorial without going back to where you were, bad things can happen.

Not having this RNA_HOME variable set, I think, is not really a problem anymore, because we've now basically hard-coded it, but we'll double-check. This is basically a shortcut: we've defined a variable called RNA_HOME so that you don't always have to type out the full path, something like /home/ubuntu/workspace/rnaseq, just to make some of the commands more compact. So when we go home, can we do this on our own machines and set anything to be RNA_HOME? Yes, exactly. And one of the nice things about using this little tilde here is that it means your home directory on most Linux boxes, if not all. So you could create it the same way and that would also be fine, but yeah, you could change that. What we've done is add this to your .bashrc file. The .bashrc file in Linux is basically a place where you store configurations that are loaded every time you log into a machine, so this has been set up for you; when you log in, it's already set.
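The setup that was done for you amounts to something like the following (the exact path is whatever the course environment uses; ~/workspace/rnaseq is an assumption here):

    # Define a shortcut variable for the tutorial working directory (path is an assumption)
    export RNA_HOME=~/workspace/rnaseq

    # Append it to ~/.bashrc so it is set automatically at every login
    echo 'export RNA_HOME=~/workspace/rnaseq' >> ~/.bashrc

    # Then commands can refer to the variable instead of the full path
    cd $RNA_HOME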
So, the wiki. We're going to go to the RNA-seq wiki and start going through commands. Just a few notes about that. Lines that start with the hash symbol are comments; if you paste those in, nothing will happen. They're just there as a reference point to help explain each of the commands. If there's an individual command, or part of a command, that you don't understand and it isn't explained, you can ask questions and we'll walk through more of the details. All the other lines, the ones that don't start with a hash mark, are commands meant to be executed in the terminal, in order. We are assuming a very basic familiarity with Linux, so you should know that mkdir means make a directory and cd means change directory, and there are a few reference materials on Linux there. These are the tools we're going to install; that's just there as a reference.

We're going to obtain a reference genome from the iGenomes project. Actually, that URL needs to be updated, I'm just realizing right now. The Cufflinks author has moved labs and changed where everything lives, which has been a bit of a hassle; I think we've updated it almost everywhere except here. But basically this step is going to download a reference genome, and again, we're just using a single chromosome here. We're going to similarly obtain known transcripts, also from the iGenomes project, but there are lots of ways you can get gene annotations, and it varies to some degree by species. You need to get the reference genome from wherever it's made available for your species. There are some common places, like Ensembl and UCSC, that aggregate these, but sometimes they don't have your species. Same thing for annotations: there are many sources, but as long as you get your annotations in the accepted GTF format, they should work with what's being described here. There's a description of the GTF file format there, and we're going to talk about it in the tutorial as well.

Before sequences can be mapped to the genome, it must be indexed, as we talked about. We're going to use Bowtie to index the genome for TopHat alignments. We're also going to use another aligner called STAR, and it has its own way of indexing the genome; see the sketch below for what these indexing commands look like. This is a common theme of alignment: when you index a reference genome, the index is tied to that aligner, and often to that version of that aligner. So if you're getting an error in your alignment commands, it could be that you're attempting to use an out-of-date index from a previous version of the aligner. It's generally a good idea, when you update your alignment software, to create a newly indexed version of the reference genome.

We're going to get a sample set of RNA-seq data. The test data is two samples, all for demonstration purposes, so these were commercially purchased RNAs. One is called the Universal Human Reference, which we're going to abbreviate as UHR; the other is the Human Brain Reference, which we're going to abbreviate as HBR. UHR is a mix of several different tissues, and HBR is a mix of brain samples from multiple individuals or areas of the brain. We've also added spike-ins to these two samples. There's mix one and mix two of this spike-in reagent that you can order; these are control molecules, and we spiked one mix into the UHR sample and the other into the HBR sample. We're going to talk about analysis of the spike-in data as we go through the rest of the analysis. And then we made replicates of these samples: each RNA was used to make three different libraries, and those three libraries were each sequenced independently. So you're going to see this pattern everywhere: we align the reads for UHR replicate one, then for replicates two and three, and then the same thing for the HBR samples. This is just a more detailed description of the samples and the spike-ins.

We're also going to talk a fair amount about QC of the data, and we're going to divide it into two parts. First is QC'ing the data when you just have raw data in FASTQ format, and then, once we've done the alignments, we'll do another round of QC on the alignment data.
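To give a flavour of the indexing commands mentioned above, here's a minimal sketch; the file names (a chromosome 22 FASTA and GTF) are illustrative, and exact parameters depend on your read length and aligner versions:

    # Build a Bowtie2 index of the chr22 reference, for use with TopHat2
    # (file names are illustrative)
    bowtie2-build chr22.fa chr22

    # Build a STAR index; STAR can also take known junctions from a GTF
    # (--sjdbOverhang is typically read length minus 1)
    STAR --runMode genomeGenerate \
         --genomeDir star_index_chr22 \
         --genomeFastaFiles chr22.fa \
         --sjdbGTFfile chr22.gtf \
         --sjdbOverhang 99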
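And for those two rounds of QC, one common starting point, not named here but widely used, is FastQC for the raw data and samtools for the alignments (the file names are again hypothetical):

    # Generate an HTML QC report for each raw FASTQ file
    fastqc UHR_Rep1_R1.fastq.gz UHR_Rep1_R2.fastq.gz

    # After alignment, a quick sanity check of mapping statistics on the BAM
    samtools flagstat UHR_Rep1.bam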