We're going to start with the intro lecture. This is really the background to RNA-seq, which many of you probably know already. Please feel free to ask questions as we go, and that goes for the whole course, if you haven't already been told that. We're going to talk a little bit about the rationale behind RNA-seq and some of its fundamental assumptions. This lecture is meant to be 30 to 40 minutes, but it often runs longer as people ask questions that relate to their area of work. That's great; it's good to hear different people's perspectives and the areas they're wondering about. This lecture is a little more on the lab end of things: where does the data come from? And that'll be pretty much the end of the wet biology side. Everything beyond this lecture will be much more focused on the analysis and the hands-on tutorials. After this lecture, we'll do our first hands-on tutorial. It's going to start very gently, at the basics: where does the data come from, what are the reference files we need, and how do those things work? But before we get started, I'd like to do a little survey and ask a few questions about you. We had profiles of you, but it was a long time ago that we reviewed your applications, and of course they were in a much larger pile from which you were selected as attendees. So, how many people have RNA-seq data already, or have some plan in their group to generate it in the near future? Okay, wow. This is common; it's often the case that at least half to two thirds of the class is in that boat. How many of you are working with human? Wow, lots of humans. So who's non-human? Any plants? There's always one plant person, or sometimes two. Which plant? Okay. All right. So that's not too bad.
You've got some resources. Yes, you have a reference, and it's one of the more well-annotated plants as far as they go. Plants are famous for being really hard. Any prokaryotes? Okay, so everyone's at least eukaryotic. Fungi? No? Okay. So what are the other organisms, the non-model or non-human ones? Mice? Okay, that's a common one too. Anything else? Rats? Pigs. Drosophila. Okay, so some classic stuff. Pig is unusual; that's a little bit rare. I don't know if we've had an Arabidopsis person before either, so that's interesting. So everyone has a reference genome. We're still going to talk a little about working without a reference. There's some value to it in certain circumstances, and you never know when you might be working with something that doesn't have a well-sequenced and annotated reference genome in the future. What about background? Most people either started out computational and moved towards biology, or started out in biology and moved towards computational work. Who started out computational and is moving towards biology? Okay, just a couple, and everyone else would say they're biologists moving towards computational work. That's good; I would say this course is aimed a little more at that angle. We assume you're often already an expert in some kind of molecular biology, in the biology of the species you're studying or a particular disease area, that your lab has generated some data, and that now you're looking for experience, tips, guidance, and a starting point to get going on the analysis. So we tend to start from the beginning from that perspective. That was really helpful for me, and it might influence the pacing of certain sections a little as we go through the next few days. So this is the first module; as I said, we're going to do a little intro. It's one of four. There are really five modules now, but one of them is quite brief.
These are the four main modules, and they fit into broad categories: an introduction to RNA-seq analysis and data; alignment and visualization; expression and differential expression, which is probably the longest section, with a lot of content, exercises, and tutorials; and isoform discovery and alternative expression, which is quite a brief introduction to that very complicated topic. Then we're going to add on a sort of half module, which is reference-free alignment and expression estimation. Each of these modules has one or more tutorials provided along with it. The goal of the tutorials is really to provide a working example of an RNA-seq pipeline that you could set up on your own compute systems back at home, or on the cloud, or whatever form of compute you use. We're going to do everything on the cloud, so if you're going to do that, it'll be very similar. We wanted these things to run in a reasonable amount of time with modest compute resources. I think you saw this yesterday with Jared, and probably with the other instructors: sometimes you have to do certain things to make that work in a classroom setting, like using a smaller data set, or a data set that's a little bit contrived somehow. The main way we do it here is very simple: we just limit the analysis to a particular part of the genome. So we say, let's pretend we're analyzing the whole genome, but we're really analyzing just chromosome 22, which is a very small chromosome. That allows things to move much more quickly, so we're not waiting 15 minutes for the results of each command to complete. We've been doing this for maybe five or six years now, and the tools have actually gotten faster, and we continue to update to newer versions of the aligners and newer expression estimation tools.
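The chromosome 22 trick described above is usually done with standard tools (for the reference FASTA, `samtools faidx` can pull out a single chromosome), but the idea is simple enough to sketch in a few lines. This is a minimal illustration using a made-up in-memory FASTA, not the tutorial's actual commands:

```python
# Minimal sketch of restricting a dataset to one chromosome:
# pull a single record (e.g. chr22) out of a reference FASTA.
# The tiny in-memory FASTA below is invented for illustration.

def extract_chromosome(fasta_lines, chrom):
    """Yield the header and sequence lines for one FASTA record."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Headers may carry descriptions, so match only the first word.
            keep = line[1:].split()[0] == chrom
        if keep:
            yield line

fasta = [">chr21 toy", "ACGT", ">chr22 toy", "GGCC", "TTAA", ">chrX", "AAAA"]
chr22 = list(extract_chromosome(fasta, "chr22"))  # header plus two sequence lines
```

The annotation files (GTF/GFF) get the same treatment: keep only the lines whose first column is the chromosome of interest.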
So we've generally been able to use a little more data, and things still run even faster than when we initially started this course, which has been very helpful. We also want these tutorials to be self-contained, ideally self-explanatory, and portable. By self-contained, I mean that all of the dependencies you need to make things run should not be hidden from you. If a tool relies on another tool, we don't just set up a magic environment where it magically works, and then when you go home and try to install that tool, it doesn't work because you don't have the magic environment. We try our best to explain everything about the way things are configured, so you really should have the instructions you need to set it up and make it work yourself. One of the common things in bioinformatics is dealing with many tools, chains of tools being run in sequence and having to interoperate with each other. Everyone doing bioinformatics has to be a little bit of a system administrator to some degree, so we'll give you a little flavor of that. Self-explanatory: we try to document in the wiki what each command is doing and what the options mean. If there are places where that's really not clear, sometimes it's not obvious to us because we don't have that outside perspective, so please let us know and we'll make a note to improve it in the future. And portable again comes back to this idea of being self-contained, with all of the dependencies defined. You should be able to take everything you're seeing here and run it in a different environment. There may be a little configuration required, but it should be possible; it shouldn't require some kind of platform you don't have access to. Okay, so in this first module we're going to go through six general areas. First, the rationale for RNA sequencing, where we're probably preaching to the choir.
You're already doing it or interested in doing it, so you obviously feel it has some value. Then some of the challenges that are specific to RNA-seq. Up to this point, you've been primarily focused on DNA data; there are some nuances and differences with RNA-seq that are worth discussing and that influence the analysis. We'll talk about some of the general goals and themes of all RNA-seq analysis workflows. Going through the workflow we cover here should be helpful when you go back to your lab and want to do a different kind of analysis that wasn't covered, because these workflows have a lot of common themes: a lot of the same kinds of skills are needed, and they follow a certain pattern. We'll talk a little about how you get help outside this course. And at the end of this lecture, we'll do a brief introduction to the hands-on tutorial, and then we'll go through the first tutorial. Just to make sure we're all on the same page, and to talk about how this relates to RNA-seq, I usually show this central dogma figure that I created for a book chapter some years ago, showing the classic flow of information from a gene structure in the genome to a protein that has been folded and carries its post-translational modifications. We're starting at the top with a double-stranded genomic DNA template, and it has various features on it, a non-exhaustive list of which is shown here: things like promoter regions, transcript initiation sites, and so on. In this example, we have a gene with three exons and two introns. For human, this is probably not even close to scale; the introns would be much, much larger than this relative to the size of the exons.
This gets transcribed into a single-stranded pre-mRNA molecule, which again has a series of features that help the splicing machinery determine how to remove the introns and stitch the exons together. These are sequence features recognized by various RNA and protein molecules that bind to them and instruct the splicing machinery on where the exons are, where the introns are, and how to stitch together the final mature mRNA product, which is then capped and polyadenylated and exported to the cytoplasm. Then it gets translated into a protein, which gets folded and often has various post-translational modifications attached to it. The reason we show all of this is to remind ourselves what we care about and what we're analyzing here. For many of us, what we actually care about is the protein sequences, unless you're specifically studying RNA genes that function at the RNA level, that end at that point and don't become a protein. But of course lots of genes, and many important ones, perhaps most of them, do become proteins. So if we could sequence those proteins and profile their abundance in a massively high-throughput manner, we would probably do that. If we could do the equivalent of RNA-seq at the protein level, a lot of people would probably skip the RNA and go straight to that. But we can't, so we're using the RNA as a kind of proxy, a way of trying to get at what's happening at the protein level, but one step upstream, one step removed. And in RNA-seq we're not really sequencing these molecules directly. This figure is depicting an RNA molecule, but we don't sequence RNA, we sequence DNA. So we have to convert it from RNA to cDNA, and there are a lot of different tweaks to the way that's done. We're also sequencing pieces of RNA, because of the nature of the sequencing technology.
Generally, we're sequencing fragments that are, say, in the range of two to four hundred base pairs long, which is generally shorter than your average RNA molecule, particularly in humans, where many of them will be 1,500 to 2,500 bases long. So to get a representation like this, we would need to sequence many pieces of it and then try to stitch that information back together to get a sense of what the full length looked like. But we're not sequencing the full length, so there's inference that needs to happen: a step where we infer, from the fragments that we did sequence, what the full-length transcript and its structure might look like. There's always going to be a certain amount of educated guesswork in that procedure. To get at those molecules, this is what a typical library workup looks like. We start with some samples of interest, say condition one and condition two. We're going to use a few example data sets that are like this, where there's a pair: a tumor and a normal, or a brain tissue and another tissue, and we'll do some comparisons of those. From these samples, you isolate RNAs. At this step, they're essentially full length if they haven't been degraded; we'll talk a bit more about that. Then you generate cDNA fragments, size select, and add linkers, though not necessarily always exactly in that order. Sometimes the fragmentation happens at the RNA level and then you make cDNA from the fragments, but all the protocols end with adding linkers and size selection. There are also some potential differences in the way size selection is done with RNA libraries compared to DNA libraries. Then we're going to sequence these fragments, but often we're not even sequencing the full fragment, we're just sequencing the ends of it. So we're going to flow these across a flow cell. This is a somewhat old representation of what flow cells look like.
They're starting to look a little different now, but it's the same idea. You wind up with reads, often paired reads, where you've sequenced a little bit from the left side of a fragment and a little bit from the right side, depicted in blue and red here. The dark blue part is the adapter sequence that specifies read one, and the dark red is the adapter sequence that specifies read two; that's where the sequencing is initiated from. You often wind up with some unsequenced portion in the middle, but not always. Sometimes, if the reads are long enough and the fragment is small enough, they'll meet in the middle and overlap a bit. If the fragment is really small, the overlap might be quite substantial; or there might be quite a large insert. So you'll have a distribution of fragment sizes with different numbers of unsequenced bases in the middle, anywhere from zero to, say, a few hundred. We're going to align all of these reads back against the reference genome and then feed the results into various downstream analyses. A main difference here, which we'll talk about more, is how we deal with introns when we're doing the alignment. So why would we sequence RNA versus DNA in the first place? There's a functional component to it. The genome is a relatively constant thing; the RNA gives you a readout of what is happening in response to the environment of the cell, of the organism, et cetera. It's a way of examining what functions are happening. Predicting transcript sequences from the genome alone is really difficult. There used to be a whole subfield of bioinformatics devoted to sequencing genomes and then just looking at the sequence and saying, that thing looks like an exon, this looks like an intron, and it looks like maybe there are three exons here.
So maybe those three exons get assembled into a transcript, and maybe this region has these three transcripts, and that's a gene locus. That has been largely made unnecessary by RNA-seq, because directly interrogating the sequences that are transcribed is just so much easier, and gives you so much better data, than trying to infer the structure of genes by looking at the genomic sequence alone. Of course, some molecular features can really only be observed at the RNA level. Things like alternative isoforms, fusion transcripts, and RNA editing events happen at the RNA level. In the cancer-specific domain, sequencing the RNA may help you better interpret somatic mutations found in the genome, either by showing which mutations are expressed or not, which is the next point, or by identifying potentially regulatory mutations. Then there's this idea of prioritizing protein-coding mutations. You often have a heterozygous mutation in a tumor; you can see whether there's any allelic bias between the expression of the wild type and the mutant. You can confirm that the gene is actually expressed, and that therefore there might actually be a mutant protein. If you're trying to predict a neoepitope for a cancer vaccine, this might be important, because you want things that are actually expressed, and so forth. Okay, so there are a number of challenges, some of which are particular to RNA and not as common with DNA. Sample purity, of course, matters for both DNA and RNA. Quantity: in some cases it can be a little easier to get a decent quantity of RNA than a decent quantity of DNA from a sample. But quality is often something that people working with genomic DNA don't have to worry about nearly as much. RNA is just quite a lot more fragile.
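A minimal sketch of the allelic-bias check just mentioned, assuming you already have RNA-seq read counts for the two alleles at a heterozygous site. The counts and the crude two-sided binomial test here are illustrative, not any particular tool's method:

```python
# Illustrative allelic-bias check at a heterozygous site: given RNA-seq
# read counts supporting the wild-type and mutant alleles, is the mutant
# expressed at all, and is its fraction far from the ~50% you'd naively
# expect for a heterozygous site? Counts below are made up.
from math import comb

def allele_fraction(ref_reads, alt_reads):
    total = ref_reads + alt_reads
    return alt_reads / total if total else 0.0

def binomial_two_sided_p(alt, total, p=0.5):
    """Crude two-sided binomial test against a 50/50 expectation."""
    probs = [comb(total, k) * p**k * (1 - p) ** (total - k) for k in range(total + 1)]
    observed = probs[alt]
    return min(1.0, sum(q for q in probs if q <= observed))

vaf = allele_fraction(90, 10)         # 0.1: strong bias toward the wild type
pval = binomial_two_sided_p(10, 100)  # tiny: unlikely under balanced expression
```

In practice the expectation is not exactly 50% (mapping bias, copy number changes in tumors), which is part of why real tools model this more carefully.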
So you're much more likely to encounter problems in your data, and in the construction of your library, that relate to degradation of RNA than you are to find problems like that with DNA. As I mentioned, RNAs consist of relatively small exons that might be separated by really large introns, so mapping reads to the genome is challenging for a lot more reads than it is for DNA. Yesterday, or the day before, you talked about SV detection, where you're trying to identify structural variants: cases where there's a big deletion or a translocation in the genome, where you'll have a read that spans across a breakpoint and you basically have one piece of a chromosome attached to another piece of a chromosome, which is a challenging mapping and interpretation problem; or you might have large deletions or insertions where the alignment is discontinuous, where 50 bases align here and the next 50 bases align 50 kb upstream because there's a big deletion there. This is happening all the time in RNA, because every time a read spans across the edge of an exon, the next exon might not be for 1,000 bases, 10,000 bases, sometimes even 50 or 100,000 bases. So the alignment algorithm has to be able to look for patches of alignment, and then align the shorter piece of read that's left, potentially quite far away. Sometimes that can be ambiguous, and it's also more computationally expensive. So you have different aligners: generally you'll use a different aligner for your DNA data and your RNA data, even if you intend to analyze them in an integrated fashion, and sometimes that can create certain types of artifacts. The relative abundance of RNAs can also vary widely, and this is quite different from DNA.
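Coming back to the discontinuous alignments just described: in SAM/BAM output, a spliced RNA-seq aligner records the jump over an intron as an `N` operation in the read's CIGAR string. A small sketch of reading that information back out (the example CIGAR is made up):

```python
# Spliced alignments in SAM/BAM use an N CIGAR operation for skipped
# reference bases, which for RNA-seq usually means an intron.
import re

def reference_span_and_introns(cigar):
    """Return (reference bases covered by the alignment, intron lengths)."""
    span, introns = 0, []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "MDN=X":   # operations that consume the reference
            span += length
        if op == "N":       # skipped region of the reference
            introns.append(length)
    return span, introns

# A 100 bp read split across a 10 kb intron: 50 aligned bases, a 10,000
# base jump in the reference, then the remaining 50 aligned bases.
span, introns = reference_span_and_introns("50M10000N50M")  # (10100, [10000])
```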
In most circumstances, when you're sequencing genomic DNA from some cells, there's a known ploidy for your sample: human samples will be diploid, some plants may be hexaploid or something else, but it's a fixed copy number that you expect. So in whole genome sequencing, you're randomly shotgun sequencing the whole genome, and you expect to sequence until you have about 30x coverage, which means on average you have about 30 reads piling up everywhere. There'll be some waviness to that pattern as you look across each chromosome, but generally it sticks to that 30x, plus or minus say five or ten, just due to random sampling. That reflects the fact that there are two copies of each chromosome in each cell. But for RNA we don't have this expectation, because from that genomic template there can be a lot of expression of RNA molecules or a very small amount, and depending on the gene, that might be exactly what you expect. One gene might be fully functional in the biology of the cell with one or two copies per cell; telomerase is a famous example, where just a few copies of the RNA can make a few telomerase molecules, and only a few of those protein molecules are needed to maintain the telomeres in the cell. Whereas another gene might be involved in assembling the structure of the cell itself, so you might need tens of thousands, or even hundreds of thousands, of copies of it at any one time. So you have this huge dynamic range between things that are lowly expressed and things that are highly expressed, but we're still sequencing by random sampling; we're doing shotgun sequencing of everything we extracted from the cell.
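That random-sampling process can be illustrated with a toy simulation. The gene names are real, but the copy numbers per cell and the read count are invented for illustration:

```python
# Toy simulation of shotgun sampling from a transcript pool with a wide
# dynamic range. Copy numbers per cell and the read count are made up.
import random

random.seed(42)

copies_per_cell = {"TERT": 5, "GAPDH": 50_000, "ACTB": 100_000}
pool = [g for g, n in copies_per_cell.items() for _ in range(n)]

reads = [random.choice(pool) for _ in range(10_000)]  # sequence 10k reads
counts = {g: reads.count(g) for g in copies_per_cell}
# TERT is ~0.003% of the pool, so even 10,000 reads will usually catch it
# zero times or a handful of times, while ACTB soaks up most of the data.
```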
So we have this problem that naturally results in an overrepresentation of things that are highly expressed, and things that are lowly expressed can be hard to sequence, because we're always sequencing the highly expressed stuff over and over again. So the amount of data you have to produce is a bit more than you might expect if you just looked at the fraction of the genome that consists of exons and thought, oh, I'll just have to sequence 2% of what I would for whole genome sequencing. It doesn't turn out like that, because of this problem. Similarly, RNAs come in a wide range of sizes. With genomic DNA, the chromosomes are, for all intents and purposes, absolutely massive compared to the reads, so you're not really thinking about their sizes: they're these huge things, we break them into little pieces, we sequence all of those pieces, and we're not really trying to infer what the structure of the entire chromosome looked like. That would be pretty awesome, and you can get partway there by combining Illumina data with some kind of data that allows you to phase SNPs, but you're really not sequencing a whole chromosome. For RNAs, though, we really are trying to get a sense of what the full-length structure looks like, and the wide range of sizes complicates that part of the problem. Some RNAs are very, very small: microRNAs are just a few tens of bases long. Other RNAs can be 100 kb or longer in their RNA form. This can produce a few problems. If you do a polyA selection of your RNA, it can result in three prime end bias, and the reason is that you're holding on to the polyA tail, the three prime end, and then for any breaks that occur in that long sequence, the five prime end just gets lost or washed away during one of the cleanup steps.
All of this really comes back to the RNA being fragile compared to the DNA. Is anyone here doing a polyA selection? Are there people doing mRNA-seq versus total RNA-seq? Who's doing polyA? Both? Okay. Yeah, so some people are doing it. It's become a little less common. It used to be that almost everyone was doing mRNA-seq, and I think that was mostly due to problems with library construction from total RNA; we've gotten better at doing the ribodepletion, and we've gotten better at creating a reasonable library from total RNA. There's also generally a desire to produce as holistic a sampling of the transcriptome as possible, so that you have coding and non-coding RNAs, things that are polyadenylated and things that aren't. But polyA selection is still a good way to really focus your data on protein-coding regions, if that's what you want. It just introduces this potential complication if your RNA is degraded, and if there's a varying amount of degradation from one sample to the next, it can complicate the comparisons between your samples. Anyone who's submitted an RNA library or RNA material to a core that's going to produce their RNA-seq library for them is probably familiar with these Agilent traces. You'll commonly be asked to run this assay, or they'll run it and tell you the RNA integrity number. If the score is too low, it indicates that your RNA is degraded, and if it's degraded below some level, they may recommend that you not do RNA-seq on that sample, or they may warn you that you may have some data quality issues when you get your data. What is this level? Because people here extract RNA from slides, from fixed material, and it's not fresh frozen RNA, so they have this issue.
So, I mean, I guess there are different answers to that, but what would be a really bad RNA sample that you would not do RNA-seq on? Yeah, that's a good question. The number that a lot of the cores will tell you is that if the RIN score is below eight, they generally don't like it, which is quite stringent. If it's above eight, it's pretty good. If it's below eight, there are probably many times where that will be just fine; it may just influence how you proceed a little bit. For example, like you said, in tumor work we have this situation where we have formalin-fixed samples that are often very heavily degraded; they've been sitting on a shelf for five years. So the RNA you get out can already be fragmented down to 100 to 300 bases at the most. One of the things we do to compensate for that is an exome capture, a cDNA capture, as part of the library workup. It used to be that we would do it with a custom design, and I think we still do, but you can now get a kit from Illumina for this purpose, because it's become quite popular for degraded samples. You may also increase the amount of data you produce a little for those samples, and you may reduce your expectations for the amount or the kinds of analysis you can do with that sample. It may work out okay for gene expression estimation, or for differential gene expression, maybe. You might spend a little more time thinking about the data normalization step and batch effects and things like this, being a little more worried that there are systematic differences between your samples that are going to confound the analysis. But generally you can get away with quite a lot of degradation. You might also alter your fragmentation step: sometimes we consider those samples pre-fragmented and don't even bother fragmenting them, because they're already so broken into pieces.
We almost don't want to introduce another step that says, okay, let's break these things up, because we're already worried that they're too broken. So yeah, basically just things like that. So would that one get a RIN of 10, or close to it? Okay, so maybe I should explain how this works. Basically, it's effectively like running your RNA on a gel, and both of these traces are total RNA, without any fragmentation or anything having happened yet. These are two samples that I isolated: this one was from a cell line that was actively growing in culture, and this one was from a fresh frozen tumor sample from a surgical resection, which had probably been sitting out for a little while before it was frozen, which is why there's so much degradation. So you're running your RNA on a gel, but instead of visualizing it the usual way, you're running it through a gel in a capillary. You feed the RNA through this capillary, and there's a detector on it; the smallest things come out first and the largest things come out last, and there's a dye that fluoresces when RNA is present. So you get a peak when there's a lot of RNA at a particular time point. On the x-axis here we've basically got time, and there's a ladder with RNA molecules of known size that's used to establish the relationship between the time it takes to come out and the size of the RNA molecule. That allows you to convert this axis to actual nucleotide size. The y-axis is fluorescence units: how much fluorescence, how much material, you're seeing at that particular time point. This gives you peaks, and these two peaks correspond to the human ribosomal RNA species, which make up the overwhelming majority of all RNA in a human sample.
And so if your sample is very intact, for human you expect to see a pattern like this, where you have two large bright peaks that correspond to the sizes of the two ribosomal RNAs. Based on the area under those curves, and the amount of other noisy peaks around them, the RIN score is calculated. So this is basically what perfectly intact RNA looks like. As the RNA gets broken into pieces by degradation, you start to get secondary peaks, where the ribosomal RNA has been broken into fragments smaller than the expected size, and it starts to smear. On a gel, instead of two distinct bands you'd get this smearing, and the smear would get darker and darker until you couldn't really distinguish between where the two bands were and where the smear of all the different sizes of RNA was. As you go from this pattern to totally fragmented, the RIN score goes down. Here you can see that the areas under the curves for the 18S and 28S peaks have gotten smaller relative to the area of all these other peaks. As degradation continues, the peaks move further and further to the left, as the RNA gets degraded into smaller and smaller pieces, until you basically have a hump over here in the 25 to 200 base range. The link I have here has a PDF of examples that go from essentially perfect down to completely degraded, so you can get a sense of what all those different patterns look like. Some of them are from cell lines, some from fresh frozen material, and some from FFPE material, so you can get a reference point for what those things look like.
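The actual RIN calculation is a proprietary model inside the Agilent software, but a crude intuition for it is the fraction of total signal that still sits under the 18S and 28S ribosomal peaks. The traces and window positions below are made up for illustration:

```python
# Crude degradation indicator, NOT the real RIN algorithm: what fraction
# of the fluorescence signal falls under the 18S/28S rRNA peak windows?
# Toy traces and window positions are invented for illustration.

def ribosomal_fraction(trace, windows):
    """trace: list of (size_in_nt, fluorescence); windows: [(lo, hi), ...]."""
    total = sum(f for _, f in trace)
    in_peaks = sum(f for size, f in trace
                   if any(lo <= size <= hi for lo, hi in windows))
    return in_peaks / total if total else 0.0

# Intact RNA concentrates signal near the 18S (~1.9 kb) and 28S (~5 kb)
# peaks; degraded RNA smears it into small fragments.
intact = [(100, 1), (1900, 40), (4700, 80), (6000, 1)]
degraded = [(100, 60), (300, 40), (1900, 8), (4700, 6)]
windows = [(1700, 2100), (4300, 5100)]

good = ribosomal_fraction(intact, windows)    # close to 1
bad = ribosomal_fraction(degraded, windows)   # much lower
```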
Okay, so if you haven't already got your RNA-seq data and you're in the experimental design stage, here are a couple of resources. We also have a resources section in the wiki with some documents discussing how you should go about setting up an RNA-seq experiment. They have recommendations about how much depth you should produce, what kind of metadata you should store with your RNA-seq samples, how many replicates you should use, and so forth. This one is quite out of date, but the concepts don't really change: a good experiment still has replicates, and those principles don't expire. Most people don't do all of these things, but some of them are becoming more common. For example, they recommend including spike-in controls with your RNA, which has gradually become more common practice. The example data we're going to use is RNA-seq data generated at the Genome Institute, and we included spike-ins in it, so we can look at what that QC analysis looks like. Yeah, there was a question at the back. [Question about whether similar design guidelines exist for single-cell experiments.] Oh yeah, that's a good question. Not that I'm aware of. How many cells do you use, and from how many individuals or conditions? I would think that good design is probably even more important in that context, and you probably want more replicates, because the data from single-cell experiments is just fundamentally noisier; you're pushing the system as hard as possible in terms of low amounts of input, and everything is barely working. But I'm not aware of any guidelines; I'll think about it and let you know if anything comes up. One of the general recommendations we give is to look at what other people have done, but sometimes in an area that's quite new, there isn't an established best practice that's really been proven.
Often in those cases there's a variety of approaches that have been tried and are probably all insufficient. Usually in the early days everything is insufficient, because people were just trying to do it at all; it's expensive when you first start doing things like this, so you're proving the principle but probably not doing it the way it should be done in a production setting. Okay, so this is just reference material for you. Library construction strategy: we've talked a little bit about some of these things already. You'll see quite a few of these variables depending on which dataset you're looking at. It might be total RNA versus polyA RNA; we've talked about that already. If it's not polyA selection, then they'll probably do a ribosomal RNA reduction. There's usually some kind of size selection or size exclusion. There used to be some commonly used kits that did a linear amplification; I haven't seen as much of that recently, but there may be some applications where it's still used, where the amount of material you have is just vanishingly small. Stranded versus unstranded libraries: it used to be that for a lot of RNA-seq data, the strand information, meaning which strand the RNA was expressed from, was generally lost, and you would infer it from the alignments. Now some of the most common RNA-seq library kits maintain the strand information. We mentioned this idea of doing an exome capture of your cDNA prior to sequencing it; that's another approach you'll hear about. Library normalization: there's a variety of approaches to try to even out the highly expressed genes against the lowly expressed ones so that everything is brought onto more of a level playing field. We'll refer you to a few papers where you can look into that if it's something you think would make sense for your application.
So there are a lot of details here, and of course you want to know as much as possible about how your data was generated. But the general take-home of all of this is to really think about whether any of these things vary between the samples you intend to include in your analysis. If any of these things are done differently for half of your samples than for the other half, and then you're hoping to compare those samples, that may cause problems; you may see differences. For example, in a differential expression analysis, you don't want the differences to be about the different ways the libraries were made. You want the differences to be about the biology, unless you're studying library construction methods, which I doubt any of you are, or maybe one. Just a little bit more about fragmentation and size selection. This visualizes what I was talking about with the gel versus the electropherogram. Here are a couple more examples of RNA at varying levels of degradation: totally intact total RNA, partially degraded, really heavily degraded, where you can see the peaks all really shifting to the left here. And when it's completely degraded, you just get this big hump here. There seems to be a point at which the degradation really slows down once the RNA is down to the range of under 100 bases or so, and you'll see that for samples that are really, really old. The simulated gel version of this would look something like this, where you've got your two bands and it gets gradually more smeared as the degradation continues. So these things can influence what the starting RNA looks like. You're going to isolate your total RNA from some cells and assess its quality, but then at some point we're going to continue on with making a library out of it.
Generally, we'll do a DNase treatment, then fragment the RNA into smaller pieces if needed, and then do cDNA synthesis; there are a couple of different ways that's done. There may also be a size selection or exclusion step here where you throw away really small RNAs. There's usually at least one branch where people doing a small RNA or microRNA experiment will produce one library specifically for that, and then they might produce a different library for classic RNA-seq. It's difficult to do all of it at once, so there's usually at least that division: really small stuff versus everything above a certain size. With that in mind, a lot of places will do a size exclusion step where they use some kind of bead cleanup that throws away all of the RNAs below, say, 100 or 150 bases, and you're left with everything that's some amount larger than that. That's what's being described here; you basically lose your small RNAs. From this step forward, you're going to add your sequencing adapters onto your cDNA molecules, and then we move on to sequencing. I mentioned there were different selection or depletion strategies, which is depicted here. This is kind of an elaborate figure, but I'll just go through it quickly. In cartoon form, total RNA is depicted in the top left, where basically the whole sample is dominated by ribosomal RNAs. If we do a ribosomal RNA reduction, we're basically grabbing the ribosomal RNAs and holding on to them, and then eluting everything through that we want to sequence; we're keeping everything but the ribosomal RNAs. PolyA selection is kind of the inverse of that: we're holding on to the things we care about and washing away the stuff we don't want. And those produce subtly different outcomes.
The main one being that polyA selection of course selects for things that had a polyA tail, while ribosomal RNA reduction lets everything through except the ribosomal RNAs. Neither of these is perfect; they're just pushing things in a direction. With a polyA selection, you still lose some of your polyadenylated things, and you still keep some things that are not polyadenylated. Similarly, ribosomal RNA reduction doesn't get rid of 100% of the ribosomal RNAs; it just enriches for non-ribosomal RNAs. And then cDNA capture is another, more elaborate selection method where, instead of selecting for RNAs that have a polyA tail, you're actually selecting for known RNAs by their exon sequences. You basically hybridize against a library of probes that correspond to all of the known exons of the human genome, or whatever genome, hold on to those, and wash everything else away, so you enrich for data that corresponds to known transcripts. Of course, that strategy is limited by your knowledge of the known exons. If your species is not very well annotated, or there's a particularly new or rare human gene that you're interested in, it might not get captured, because it won't be in the design. Stranded versus unstranded: I mentioned this. A lot of RNA-seq data from three or four years ago would look like this, where you have reads that pile up against your exons but you can't tell for sure what direction of transcription they came from. You could often infer, from the transcript they seem to be aligning to, that they came from that transcript, but you didn't have that extra piece of information telling you that, yes, it's also from the expected strand. That made it difficult to do things like look at sense/antisense expression, where you have exons that actually overlap each other on opposite strands.
You often couldn't tell which of those transcripts the RNA reads came from. The newer library construction methods maintain this information, so you now have the ability to see: okay, all of this stuff was expressed off of the positive strand, and these other reads were from the negative strand. We'll look at an example in IGV, but this is basically the view you can produce there: we have a gene here being transcribed left to right and another being transcribed right to left, and you can see that the reads for the most part really correspond to the expected strand according to their coloring by strand. Replicates: we talked a little bit about this. There are many different kinds of replicates; I like to think of them in three broad categories: technical, experimental, and biological. Technical replicates in the context of an Illumina experiment might be something like running the same sample, or the same library, on multiple flow cells or multiple lanes of a flow cell, just to make sure the instrument itself isn't somehow producing variability in the results when it's run this week versus next week, or on this part of the flow cell versus that part. For the most part we don't bother; I don't know if anyone bothers with this anymore. The Illumina platform is pretty robust and reproducible at this point. One flow cell will generally be pretty darn consistent with the flow cell you order the next week. Of course, sometimes things go wrong and then it becomes the exception to the rule, but for the most part it's not cost-effective to worry about those kinds of technical replicates. Experimental replicates and biological replicates, of course, are as important as they are in any scientific experiment, so you should do them as makes sense in your experiment. Common analysis goals of RNA-seq.
So we're gonna cover a few of these in some detail, but we don't have time to go through everything. We're going to focus mostly on gene expression and differential expression, and a little bit on transcript discovery and annotation. We'll do some exercises that hint at how you would study allele-specific expression, and we're going to do some alternative expression analysis, but there are other things, like mutation discovery, fusion detection, RNA editing, and much more, that we won't get to. We do have some supplementary materials throughout the wiki that point you at starting points we think are useful for some of these other applications. I mentioned general themes of RNA-seq workflows. Each of the analysis goals I showed on the previous slide generally has its own tools and algorithms; different labs specialize in these areas, so there are a lot of different pipelines and workflows for each of these applications. But they do have some general themes, and they each follow this general format: you start by obtaining your raw data; you do some kind of alignment or, in some cases, assembly; you then process that alignment with a specific tool aimed at your analysis goal. For example, for fusion detection you would use a fusion detection tool that starts with a BAM file as input; for expression analysis you would start with a different tool that takes that same BAM file as its input. And then there's almost always some kind of post-processing. The tool gives you some crazy output file, or a series of complicated output files; it rarely writes your Nature or Science paper for you, so there's something that has to be done afterwards to summarize and visualize the outcome and to interpret what it means in terms of biology.
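That shared shape can be sketched in a few lines; the stage descriptions here are placeholders, not specific commands, and the point is simply that only the goal-specific step changes between applications:

```python
# Sketch of the shared shape of RNA-seq workflows described above:
# raw data -> alignment -> goal-specific tool -> post-processing.
# Stage descriptions are illustrative placeholders, not real commands.

def workflow(goal_specific_step):
    """Return the four ordered stages, swapping in one goal-specific tool."""
    return [
        "obtain raw data (FASTQ) and QC it",
        "align reads to the reference (or assemble them)",
        goal_specific_step,  # the only stage that changes per analysis goal
        "post-process: summarize, visualize, interpret the biology",
    ]

expression = workflow("run an expression estimation tool on the BAM file")
fusions = workflow("run a fusion detection tool on the BAM file")

# Stages 1, 2, and 4 are identical across goals; only stage 3 differs:
assert expression[0] == fusions[0] and expression[3] == fusions[3]
assert expression[2] != fusions[2]
```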
Getting help after you leave this course: we usually like to do a shout-out for the Biostars website. If you haven't used it before to ask a bioinformatics or analysis question, there are a lot of common questions related to analysis of sequence data that have already been asked and answered there multiple times, so it's often the case that if you go there and search for your question, you'll find a decent answer; and if you don't, you can ask the question and someone will often provide an answer. The website has been around for a while, so often if you just Google what you're looking for, you'll wind up at a Biostars post anyway. That's been really useful. Some common questions, though, just to get them out of the way, because these come up every time we give this course. One is: should I remove duplicates for RNA-seq? The short answer is no, even though you mark or remove duplicates for DNA and that's completely routine; a lot of pipelines just assume it should be done. Generally it's not recommended in RNA-seq. The good news is you can probably get away with using the same or a very similar alignment workflow, even one involving marking duplicates with, for example, Picard's MarkDuplicates tool, because a lot of the downstream tools will just ignore that marking anyway; but generally you can just skip it. The reason you don't want to remove duplicates comes down to the two aspects of RNA-seq I mentioned that are particular to RNA and not the same as DNA: RNAs come in different sizes, and they come in different abundance levels. It's entirely possible that the RNA you care about is relatively small and very, very highly expressed in one or more of your samples. If it's only, say, 250 bases long and very highly expressed, there just aren't that many unique fragments you can make from that RNA molecule.
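To put rough numbers on that argument (this is just back-of-envelope arithmetic, not any particular tool; the 100 to 200 base fragment range is an assumed, typical library insert size):

```python
# How many distinct fragments (unique start/end pairs) can a molecule of a
# given length yield, for fragment sizes between 100 and 200 bases?

def distinct_fragments(molecule_len, min_frag=100, max_frag=200):
    # For each fragment length L, there are (molecule_len - L + 1)
    # possible start positions along the molecule.
    return sum(molecule_len - L + 1
               for L in range(min_frag, max_frag + 1)
               if L <= molecule_len)

small_rna = distinct_fragments(250)          # a short, highly expressed transcript
chr1 = distinct_fragments(249_000_000)       # roughly human chromosome 1

print(small_rna)  # 10201: sequence this transcript deeply and duplicates are inevitable
print(chr1)       # ~2.5e10: here duplicates mostly imply PCR artifacts, not biology
```

So for the 250-base transcript there are only about ten thousand possible fragments; any deeply sequenced, highly expressed short RNA will produce legitimate duplicates.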
So there will naturally be a lot of duplicates, and that isn't true for chromosome 1 at the genomic DNA level. In that case you don't expect duplicates other than PCR amplification artifacts, because the chromosome is so massive that there are a huge number of possible fragments that can come out of it at any one place. You can sequence it to 5,000 or 6,000x depth and still not get that many duplicates, but that's not the case for RNA. So we generally don't recommend that you remove duplicates for RNA-seq. How much library depth is needed? It really depends on what you're doing. If you just want gene expression estimates, there have been a couple of papers that tried to figure out the most cost-effective way to profile the maximum number of samples. If you really are just trying to get a readout of the abundance of each gene across 100 or 500 samples, then you want to maximize the number of samples you can do, because that's where your statistical power is going to come from: not from having a lot of data for each sample, but from the ability to profile all the samples you care about. In those cases you can push the number of reads down; I think the estimates at the low end are that 10 to 15 million reads is probably sufficient per sample, which nowadays means you can do quite a lot. You can refer to the papers we link to in the wiki to see whether you agree with their reasoning, or at least give them a quick glance before you make that decision. The good news is you can fit a lot of RNA-seq samples on a single lane if you only care about gene expression analysis. You just have to remember the caveat that you may be giving up the ability to do mutation calling from your RNA, or to really profile the structure of alternative transcripts, or to get robust signals for very lowly expressed transcripts that are only present in a few copies per cell.
You basically may just miss out on all of that. Again, a common strategy is to identify a publication with similar goals and use that as a starting point when you're designing your experiment. Do a pilot experiment, and maybe do some downsampling, where you actually see, in your lab with your samples: if I do an experiment with five or ten samples that I sequence quite deep, and then I downsample that data and look at what the results look like, how well is it working? Use that as a starting point to decide how you move forward with the rest of the 100 samples, or however many there are. The throughput of the HiSeq platform continues to increase, so a single flow cell will allow you to do a remarkable amount of RNA sequencing nowadays. Is there anyone doing sequencing on another platform? I'm basically assuming everyone is using Illumina; I don't even know what the alternative would be, but no one's using Ion Torrent anymore? Probably one, okay. It's becoming more rare. So it's generally a safe assumption in this course that we're doing Illumina sequencing, but the same general principles apply to Ion Torrent, and there are some other platforms as well that are just a little less common. Mapping strategy: we're going to use this aligner called HISAT, which is one of the more recent aligners that's come out. If you're still, for some reason, producing really short reads, and this could be the case; again, it comes back to cost-effectiveness. You might deliberately decide to produce really short reads, or single-end reads instead of paired reads, so that a lot more samples can be sequenced. If you decide to do that, I wouldn't recommend having reads shorter than 75 bases.
That's kind of a good minimum length, where you can still do most things, but if you do go shorter than 75, then you may need to adjust your alignment strategy, and there are a couple of notes there on how you might do that. Once reads are greater than 50 to 75 bases, you can just proceed as you would for any RNA-seq library, and there are quite a few aligner options. We started out in this course using Bowtie/TopHat, which was for a while a very popular aligner. Then we switched to STAR, because it produced very similar alignments much more quickly, and then more recently we switched to HISAT and then HISAT2, and I'm sure in a year we'll be on something else. They're constantly trying to improve the performance of these alignment algorithms. [Question about whether the improvements are in accuracy or speed.] I would say both. The alignment problem is hard, so I think there is room for incremental improvement in how it handles the edges of exons and how it handles reads that map ambiguously to multiple places. But a lot of it is also performance in terms of run time. Producing an equally good alignment with less computational resources is something people continue to work on, and there are different strategies they've taken; a lot of it has to do with the way the reference genome is indexed, so a lot of the performance of the aligner builds on that part of it. And the alignments that come out of Illumina data do have problems. There are areas of the genome that are very difficult to align to, and sometimes it's difficult to overcome those problems by improving the aligner. There may be other things you can do: you may benefit from longer reads, or from larger fragments sequenced as pairs instead of as singletons, and sometimes improvements in the reference genome can be the best way to improve your alignment. So also try to keep on top of new releases of your reference genome.
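For reference, an HISAT2 alignment step is typically assembled along these lines; the file names and index prefix here are hypothetical, and this sketch just builds and prints the command strings rather than executing the tools:

```python
# Sketch: assembling a typical HISAT2 paired-end alignment command, piped
# into samtools to produce a sorted BAM. Index prefix and FASTQ names are
# hypothetical; the --dta flag is commonly used when StringTie follows.
import shlex

index_prefix = "grch38/genome"  # built beforehand with hisat2-build
r1, r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

hisat2_cmd = f"hisat2 -p 4 --dta -x {index_prefix} -1 {r1} -2 {r2}"
sort_cmd = "samtools sort -o sample.sorted.bam -"

pipeline = f"{hisat2_cmd} | {sort_cmd}"
print(pipeline)

# shlex.split shows the tokenized form you might hand to subprocess:
tokens = shlex.split(hisat2_cmd)
```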
Everyone here has a reference genome, so I guess this question doesn't matter: what do I do if I don't have a reference genome? It's obviously more challenging when you don't have one. The reference is an incredibly powerful tool; if you don't have it, you're going to have to work around that a bit. But generally there are a lot more species that at least have a draft reference genome now, and sometimes a quite good one. And if you don't, there are things you can do, and we cover some of those in the tutorial now. There's a large table with more of these kinds of common questions, some of the technical questions that come up during analysis, that's available on the wiki here. There's a published version, but we try to periodically update the wiki version with new papers that have come out, so I would just go to the online version. There are maybe 30 or 40 common questions, and attempts to answer them, in there. And then the Biostars tutorial is another great place to look. So we're going to go to, I believe we have a brief coffee break at 10, right? So we're very close to on time, and when we come back we're going to start the tutorial, and we're going to keep showing you this roadmap, which gives you a basic description of what we're going to work our way through. We're going to start with the actual data files: we'll look at some raw sequence data in FASTQ format. We're going to do alignment with HISAT2 against a reference genome, and we'll talk about the reference genome a bit more. Then we're going to do transcript compilation with StringTie, and we'll talk about the gene annotation files that are used for that step. Expression estimation is also going to be done with StringTie, then differential expression with Ballgown, and at this point we'll diverge a bit, because there are sort of two camps of expression estimation in RNA-seq.
There's the Cufflinks/StringTie/Ballgown crowd, where you're trying to estimate what the transcripts look like and then how many reads aligned to those transcripts, with a probabilistic model that explains the transcriptome you're seeing in the RNA-seq data. And then there's another camp that says: we don't need all that complication. We don't need to correct for GC bias; transcript length is all just nonsense. All we need is read counts, because we're comparing between sample A and sample B, they were both processed in the same way, and they both have the same genes, so all of these differences that you're trying to account for will just come out in the wash when we do our comparison. Let's focus instead on having the most robust statistics for dealing with those counts. So we're going to do some analysis from both of those camps, which is not depicted here. We're going to use HTSeq-count and edgeR for the count-based part. And then we're going to do some visualization and produce some graphs, figures, and stats using various R packages.
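As a minimal sketch of the core idea in that count-based camp (made-up counts, with plain Python standing in for what edgeR does far more rigorously), counts-per-million scaling is what makes raw counts comparable across libraries sequenced to different depths:

```python
# Minimal illustration of counts-per-million (CPM) normalization, the
# simplest form of the library-size scaling underlying count-based tools.
# The gene counts below are invented for demonstration purposes.

counts = {
    "sample_A": {"TP53": 500, "GAPDH": 20_000, "MYC": 100},
    "sample_B": {"TP53": 1_000, "GAPDH": 41_000, "MYC": 190},
}

def cpm(sample_counts):
    """Scale each gene's count by the sample's total, per million reads."""
    total = sum(sample_counts.values())
    return {g: c / total * 1_000_000 for g, c in sample_counts.items()}

cpm_a = cpm(counts["sample_A"])
cpm_b = cpm(counts["sample_B"])

# After scaling, TP53 looks comparable across the two samples even though
# sample_B was sequenced about twice as deep as sample_A:
print(round(cpm_a["TP53"]), round(cpm_b["TP53"]))
```

Tools like edgeR build on this idea with more careful normalization factors and a statistical model for the counts themselves, which is where the robust statistics the second camp emphasizes come in.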