Basically, we'll do a brief introduction to RNA sequencing technology, then a brief introduction to the first hands-on tutorial, and then we'll go back into your Amazon cloud instances that you've now all successfully connected to and do the first hands-on exercise. You're free to ask questions on the little digital discussion board, but also feel free to just shout out your questions or put your hand up. Like I said, this is a broad introduction lecture, so it tends to raise high-level questions about experimental design or variations on how you generate your RNA-seq data in the first place. This module is one of four. There is a fifth module as well now that's not listed here, a bite-sized one at the end that we will try to get to if we have enough time. The first one is really an introduction to RNA-seq. The overall goal of the tutorials in this section is to provide a working example of an RNA-seq analysis pipeline. We want this thing to run in a reasonable amount of time, so in the early tutorials we're going to introduce reference genomes, the file formats involved in this kind of analysis, and a test dataset that we generated at WashU. All of this has been crafted a little bit to work in an educational setting like this: the file sizes are carefully managed so that they're small, so that when we do things like alignment or expression estimation we're not waiting for hours for each command to complete. It all happens quite snappily, and it's very representative of what it will be like to run the same commands on a full-size dataset, which would just take much longer.
And I think OV mentioned this: our goal with the hands-on component is that the resource be pretty self-contained, self-explanatory, and portable, so hopefully you'll be able to rerun it again after this workshop ends. The idea is that if you can do that, then you can easily swap in your own data files and get the commands to work without there being very much of a black box to it. So in the first module we're just going to review the theory and practice of RNA sequencing at a high level, including a general rationale for why we would do sequencing of RNA in the first place, and some of the challenges that are specific to RNA-seq. We often have people come to these workshops who maybe have experience with ChIP-seq or whole genome or exome sequencing, and there are some particulars to RNA-seq that influence the way we think about analysis and interpretation of the data, so we'll talk a little bit about that. We're going to go over some general goals and themes of RNA-seq analysis workflows. There are many, many tools and workflows out there, but they do have a theme to them, so we're going to try to illustrate that so that later, when you go to try a different workflow, the experience you had here will help you pick it up more quickly. We'll go through a few common technical questions related to RNA-seq analysis and point you to some additional materials for digging into the weeds of some of those technical questions on your own. Anne already talked about getting help outside the course; we'll just briefly review some options there, and then, as I said, we'll do the introduction to the hands-on tutorial. Do I have a pointer here, by chance? Does that work? Okay, thank you.
So, just to review — we did the little survey of who is a wet lab person versus a dry lab person, so the wet lab people are probably quite familiar with this high-level diagram of the central dogma, but we use this slide to put the course in that context. Starting at the top, we have a very cartoonish depiction of a double-stranded genomic DNA template. We have an example gene here with three exons and two introns, an upstream region and a downstream region. These exons are not to scale — well, depending on the species; in human this is definitely not to scale, since most human genes have much larger introns than this relative to the size of the exons. A yeast gene might look like this, though; yeast genes have relatively much smaller introns. So we have a promoter region here, where transcription factors bind and transcription is initiated, there's a polyadenylation site, and we're also showing the translation initiation start point and the translation termination codon here. This thing gets transcribed into a single-stranded pre-mRNA molecule, where the introns are still in place, and now we have a whole other set of regulatory features that control how splicing is going to happen to remove the introns and stitch the exons together to give us a mature mRNA sequence. Then this mRNA gets polyadenylated and capped and exported from the nucleus to the cytoplasm, where translation occurs and we go from an mRNA molecule to a protein sequence, which is then folded, with various post-translational modifications that may be attached. For many biological questions, if we could somehow sequence the protein sequences directly — interrogate their sequence and structure in a high-throughput fashion — we would probably just do that. That's generally not feasible.
It's very low-throughput and expensive compared to our ability to profile DNA and RNA. So RNA-seq is really focused on this part, and for many people it's a proxy for the protein. We're looking at RNA structure and abundance, and from that we're often trying to infer things that are happening at the protein level. Of course, there are many exceptions. Yes? Yeah, depending on what you're doing, especially if you're studying the process of transcription and splicing, that might be interesting. There are a lot of variations on the standard, run-of-the-mill RNA-seq experiment, and some of them involve manipulations like this, where we're trying to isolate nuclei, or specifically enrich for RNAs that are in the cytoplasm, or perhaps bound by ribosomes, which indicates they're actively being translated. So depending on where your interest lies: if you're really focused on the protein-coding part of the problem, that might influence you to choose one of those techniques; if you're really interested in the actual splicing and transcriptional machinery, you might do something like looking at immature RNAs from the nucleus. But generally, the standard RNA-seq protocol that a lot of data is being generated with makes some attempt to enrich for mature mRNAs, or to deplete the ribosomal RNAs, in an attempt to concentrate your data onto exonic regions so that you're not spending a lot of time sequencing introns or other things. Any other questions? So usually at this point I point out that we have this depiction here of a hypothetical RNA sequence where the exons have been stitched together. And just like it would be nice if we could sequence protein directly, it would actually be nice if we could just sequence these entire full-length RNA species. Unfortunately, RNA-seq isn't really directly sequencing these things. First of all, it's not even sequencing RNA.
The RNA is being converted to cDNA, and then we're sequencing cDNA. And generally, we're not sequencing very long cDNAs either. There's almost always a fragmentation step involved in the creation of your RNA-seq data, where this thing — which may be 1 kb or 5 kb or 10 kb or even longer — is being fragmented into small pieces in the range of, say, 250 to 400 bases, and those pieces are being sequenced. So we're sequencing pieces of RNA that were converted to cDNA, and there are several molecular biology steps there that we need to keep in mind when we're trying to interpret the data. In particular, we're trying to think about full-length transcripts and map these things back to what we think the full-length RNA and the full-length protein will look like, but we're doing that from small pieces of the puzzle. There's a lot of inference involved, and that creates uncertainty. We should always keep that uncertainty in mind: we're inferring what we think the full-length sequence looks like from the fragments that we're sequencing. This is a typical RNA-seq workflow at a very high level. Imagine that we have some samples of interest, condition one and condition two. It might be a tumor and a normal, or our example dataset, where we've arbitrarily picked a couple of quite different samples that we expect will produce a lot of differences to look at when we do the analysis. From each of these samples of interest, we isolate RNA, generate cDNA from that RNA, fragment it, size select it, and add sequencing linkers. That's what's depicted there with the blue and the yellow: small fragments made from the larger RNAs. Then these fragments get flowed across a sequencing flow cell. By far, Illumina is the most common platform for this.
Is anyone here working with data generated on a platform other than Illumina? Yeah, which one? PacBio. Okay. So PacBio is capable of producing much longer reads, and that's the niche it fills — long-read technology. So there's potential there to characterize the structure of longer RNAs with a lot more accuracy, without having to do as much of this inference and piecing things together. But it sounds like everyone else is working with Illumina data. For a while there was quite a lot of Ion Torrent data, but that has fallen by the wayside, and Illumina is really dominating the market right now. So for most of you, your fragments are being flowed across an Illumina sequencing flow cell, and you're generating potentially hundreds of millions of paired-end reads off of that flow cell, where each fragment is being sequenced from the left side and from the right side off of the sequencing adapters. If the fragment is big enough, there will be some space in the middle that remains unsequenced. If your reads are long enough and your fragment is a little bit shorter, then the two reads might meet in the middle, where you've effectively sequenced the entire fragment. And you'll usually have a mixture of those two scenarios, because you have a range of fragment sizes: sometimes your reads meet in the middle and sometimes they don't. Then the analysis really starts from the point of having these raw sequence reads, usually in FASTQ format, where we have a read 1 file and a read 2 file, and we start aligning these things against reference transcripts and reference genomes and doing the downstream analysis. And that downstream analysis part is really everything we're going to be talking about in this course. So why would you sequence RNA versus DNA? I probably don't have to convince anyone in this room, since you're already interested.
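As a quick sketch of that fragment geometry (a toy calculation added here for illustration, not part of the course materials): given a fragment length and a read length, you can work out whether the paired reads leave an unsequenced gap or meet in the middle.

```python
# Toy paired-end geometry (illustrative only): for a fragment sequenced from
# both ends, the "inner distance" is the unsequenced stretch between read 1
# and read 2; a negative value means the two reads overlap in the middle.
def inner_distance(fragment_len: int, read_len: int) -> int:
    """Bases between the 3' ends of R1 and R2; negative means overlap."""
    return fragment_len - 2 * read_len

def reads_meet(fragment_len: int, read_len: int) -> bool:
    """True when the pair covers the entire fragment (reads touch or overlap)."""
    return inner_distance(fragment_len, read_len) <= 0

# A 400 bp fragment sequenced 2 x 150 leaves a 100 bp unsequenced middle,
# while a 250 bp fragment sequenced 2 x 150 has a 50 bp overlap.
print(inner_distance(400, 150), reads_meet(400, 150))  # 100 False
print(inner_distance(250, 150), reads_meet(250, 150))  # -50 True
```

In a real library you have a distribution of fragment sizes, so a single run contains a mix of both cases.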
But generally, a lot of people are working on functional studies. These are cases where the genome may be a fixed factor, but the transcriptome is varying in response to, say, drug treatment. We heard about some people working on experiments that involve drug resistance or mechanisms of drug response. You might be working in model organisms where you've genetically manipulated the genome, say by creating a knockout mouse. I heard someone mention gene annotation. RNA-seq has really revolutionized the field of genome annotation. It used to be a lot of predictive work: we would sequence the reference genome, the DNA, and then we would look at the sequence and try to predict where the genes are, what looks like an exon, what looks like an intron, and how those exons might get stitched together, just by looking at the sequence, its conservation, and other features. There was a whole field of bioinformatics focused on this problem of how you look at a reference genome sequence and predict what the gene structure actually will be. RNA-seq has really made that approach kind of irrelevant, in a way, because now we can just shotgun sequence vast amounts of transcriptome data, align it against a reference genome, and let the data tell us what the exon-intron structure is, where transcription is happening, in what conditions, and at what levels of abundance. There are some molecular features that simply can only be observed at the RNA level — things like alternative isoforms, fusion transcripts (which were mentioned), RNA editing, and other features. For people in the cancer space, RNA-seq is commonly done to help interpret mutations that don't have an obvious effect on protein sequence.
So it's a way to try to interpret potentially regulatory mutations that don't affect a protein sequence but may have a regulatory consequence, or to prioritize protein-coding somatic mutations — to figure out which of them are actually being expressed. Is there evidence for haploinsufficiency? Is there evidence for allele-specific expression of a mutation? And so on. Now, some of the challenges that are particular to RNA-seq — things that I'm sure have come up in each of your experiments to some degree — starting with the sample. Sample purity is often a problem in disease studies, where you have some tissue state that represents a disease, like a tumor, and it may not be purely tumor cells; there may be normal cells mixed in. Or if you're studying a particular cell lineage and you isolate it, you have to sort it or have some other way of enriching, and those things are never perfect. Sample quantity is always a challenge too; there are people studying things in mice where they're isolating a very particular part of the mouse, and you just don't have that many cells to deal with. RNA quality is often a big issue — we'll talk a bit more about that. RNA is much more fragile than DNA, so it tends to degrade, and the degradation can lead to problems both in generating your data and in analyzing, interpreting, and comparing across samples. Another challenge particular to RNA-seq is that, in many species, RNAs consist of small exons separated by very large introns. This creates a read alignment challenge relative to, say, whole genome sequencing, where for most reads there's an expectation that the read will align against the reference as one contiguous block — that it won't be spanning across an intron.
So RNA-seq aligners have this additional challenge of looking for reads where part of the read maps to the edge of an exon, then there could be 50 kb of intron, and then the rest of the read aligns to the next exon over. That's quite a challenging thing computationally — figuring out where those pieces of reads align — and it increases the uncertainty in alignments for RNA-seq relative to DNA sequencing technologies. The relative abundance of RNAs also varies wildly. Again, comparing to DNA: if you sequence the genome of a critter, you have this basic assumption going in that you'll see approximately equal representation of each of the chromosomes. If it's a diploid state, you'll see two copies of everything, and you can target your sequencing with the size of the genome in mind and say, I want to get 30x coverage of the whole genome. When you look at the data, if you generate enough of it, you'll see approximately even coverage, which wavers up and down a little bit because of GC content and random sampling, but for the most part you get nice uniform coverage across your genome. In the transcriptome, of course, we don't expect that, because we have different RNA species being expressed at different levels that are functional. You have some genes that are functional and expressed at just a few copies per cell, and other genes that are expressed at tens of thousands or hundreds of thousands of copies per cell — and that's just the normal, expected state. So we have this wide range; estimates vary, but it's something like five to seven orders of magnitude, from the most lowly expressed things at around one copy up to maybe 10 to the 7.
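To make the spliced-alignment idea concrete, here's a small illustrative parser (my own sketch, not one of the course tools) for the way split alignments are typically recorded in SAM/BAM CIGAR strings, where an 'N' operation marks a skipped intron:

```python
import re

# Hedged sketch: a spliced (split) alignment is recorded in a SAM/BAM CIGAR
# string with 'N' operations for the skipped intron. This toy parser reports
# the exonic blocks and intron gaps implied by a CIGAR, handling only the
# common M/I/D/N/S operations.
def spliced_segments(start: int, cigar: str):
    """Return ([(block_start, block_end), ...], [intron_len, ...]), 0-based half-open."""
    exons, introns = [], []
    pos, block_start = start, start
    for length, op in re.findall(r"(\d+)([MIDNS])", cigar):
        length = int(length)
        if op in "MD":            # M and D consume the reference within a block
            pos += length
        elif op == "N":           # intron: close the current block, skip the gap
            exons.append((block_start, pos))
            introns.append(length)
            pos += length
            block_start = pos
        # I and S consume the read only, not the reference
    exons.append((block_start, pos))
    return exons, introns

# A 100 bp read whose first 40 bases hit one exon and whose last 60 bases
# hit the next exon, 50 kb away:
print(spliced_segments(1000, "40M50000N60M"))
# ([(1000, 1040), (51040, 51100)], [50000])
```

A DNA aligner only ever has to produce the one-block case; the spliced case is the extra search space that makes RNA-seq alignment harder.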
And since RNA sequencing works by random sampling, we have this problem that the most highly expressed things tend to be sequenced very readily, while things that are lowly expressed are harder to sequence — you've got to reach your hand into the bag a lot of times to randomly pull out reads that correspond to those rare transcripts. Ribosomal and mitochondrial genes tend to be classes of genes with really, really high abundance, so we tend to see a lot of data corresponding to those kinds of genes in our RNA-seq libraries, and relative to protein-coding genes they can sometimes drown out the stuff we're more interested in. Similar to the relative abundance challenge, RNAs also come in a wide range of sizes. Again, comparing to genomic DNA: for all intents and purposes, all of the chromosomes are massive compared to the size of the fragments we're sequencing — arbitrarily large, usually megabases, possibly tens or hundreds of megabases. But RNAs are expressed and functional at a much smaller and wider range of sizes. You have some RNAs that are 20 bases long and other RNAs that are 100 kb long, and they're all expressed together, and we're trying to characterize them all together. This can introduce some bias. We have the potential to miss out on really small RNAs that we might care about, because the way we select our library fragments may throw away RNAs that are smaller than a certain size. And there may be a tendency to overrepresent the large RNAs, because in some sense it's a little bit easier to sequence large RNAs than small ones — though in other senses, the small RNAs can dominate cluster generation on a flow cell. So it introduces bias either way; we would like to get a true representation of the relative abundance of transcripts in the transcriptome.
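The random sampling point can be illustrated with a tiny simulation (made-up abundances, purely for illustration): draw reads in proportion to transcript abundance and see how rarely a lowly expressed transcript shows up next to a ribosomal-RNA-like one.

```python
import random

# Hedged illustration of the dynamic-range problem: RNA-seq samples reads
# roughly in proportion to transcript abundance, so a transcript at 1 copy
# per cell is rarely seen next to one at 100,000 copies. All numbers here
# are invented for the example.
random.seed(42)
abundance = {"rRNA_like": 100_000, "housekeeping": 1_000, "rare_TF": 1}
genes = list(abundance)
weights = list(abundance.values())

n_reads = 50_000
counts = {g: 0 for g in genes}
for g in random.choices(genes, weights=weights, k=n_reads):
    counts[g] += 1

# Nearly every read goes to the most abundant species; the rare transcript
# may get zero reads at this sequencing depth.
print(counts)
```

This is why depleting the super-abundant species, or simply sequencing deeper, is needed before you can say much about the rare end of the distribution.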
But this relative size issue complicates that a little bit. I mentioned already that RNA is very fragile compared to DNA, so it gets easily degraded. Many of you have probably seen these Agilent Bioanalyzer traces, where you run a small sample of your RNA, usually at the total RNA stage, on this sort of lab-on-a-chip. Effectively you're running a gel, but you're running it through a capillary — it's capillary electrophoresis — and you get a readout that's like a trace, where the peaks represent the abundance of RNA coming out over time as the RNA runs past a detector. The smallest RNAs come out first and the largest RNAs take longer to come through, so over time you get this profile that's a series of spikes. This is a human sample. In human total RNA, you expect something like 95 to 98% of the RNA to correspond to two ribosomal RNA species. So with that expectation in mind, we hope to see the two very big peaks that we're seeing here on the right of this slide, and based on how intact those two peaks are, we can estimate the quality and intactness of the RNA sample. This is an example of some RNA isolated a long time ago from a cell line that's almost perfect quality — it got an RNA integrity number of 10, which is a perfect score. You see two really strong peaks corresponding to the 28S and 18S ribosomal RNAs, and you don't see a lot of anything else: a marker down here, and not much else. As the RNA gets degraded, the RNAs at these sizes start to break down into smaller pieces. As the pieces get smaller, you start to see additional bands on your gel, or spikes on your electropherogram — this choppy pattern of additional spikes to the left of your two expected peaks. And the more degraded the sample is, the harder it becomes to see the 18S and 28S peaks.
Based on this pattern of degradation, you can estimate an RNA quality score — this is a RIN score of six. A lot of sequencing cores will do this assay, and they'll have a cutoff that says if the RIN score is below some level, we don't want to sequence your sample, or there will be some kind of caveat: if things go wrong, you still have to pay us even if the data is bad, something like that. Have any of you encountered that scenario with RNA quality issues? Okay. Yeah. For people who aren't used to looking at these traces, we provide a link to a PDF here with a whole bunch of examples of traces from RNA isolated from different types of samples — frozen tissue, FFPE samples, cell lines — showing you a broad range of scenarios, everything from perfectly intact RNA to completely degraded, where there's basically hardly anything left, and everything in between. So you can put your own traces in perspective if you haven't looked at a lot of these yet; over time, you get a sense of what they look like and when things have potentially gone wrong or stand out from the norm. Now, some design considerations. Are any of you in the RNA-seq experiment design stage, or does everyone have their data already? How many people have data in hand already? Okay, quite a few of you, and others are thinking about an RNA-seq experiment that may be planned in the next six months or so. Okay, great — that's pretty typical for this course. The course comes around and there are some people in the design consideration stage. So there are a couple of references here to useful guidelines. Even if you already have your data, it's still useful to think about these things — there's always the next experiment.
And thinking about some of the best practices and recommendations for how to design an RNA-seq experiment may also help you analyze your current data. We link to this standards and guidelines document generated by the ENCODE Consortium. It's from a little while ago, but these are really fundamental things that don't change. It talks about how many replicates you should use, what kind of sequencing depth you should target, and what kinds of control experiments and reporting standards you should think about including. In our example data, we're going to talk a bit about this and do some QC analysis, and we're also going to analyze some spike-in data. In our data generation, we included some QC spike-ins to help assess the quality of the RNA-seq library construction and sequencing. There are several of these large-scale consortium sequencing projects that, in their early stages, think hard about how they want to design their experiments and then release guidelines, and these can be really useful even for people doing a much more modest experiment — "I have three or five conditions and a much more focused biology experiment." There are a lot of RNA-seq library construction strategies. We talked a little bit about sequencing nuclear RNA versus cytoplasmic or ribosome-bound RNA. There are a whole bunch of factors that you see being varied in the RNA-seq data generation steps. There used to be a lot of poly-A RNA-seq data being generated, and that has shifted a little bit toward this total RNA approach. So, is anyone here doing an mRNA isolation or purification before generating their RNA-seq data? Okay, a couple.
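As a sketch of what a spike-in QC check looks like in principle (hypothetical control names and counts — not the actual spike-in data from our dataset): compare the known input amount of each synthetic control transcript to its observed read count on a log scale, and check that they correlate well.

```python
import math

# Hedged sketch with invented numbers: spike-in QC compares the known input
# concentration of each synthetic control transcript with its observed read
# count. A high correlation on the log-log scale suggests library
# construction roughly preserved relative abundance.
known_mix = {"CTRL-A": 1.0, "CTRL-B": 10.0, "CTRL-C": 100.0, "CTRL-D": 1000.0}
observed = {"CTRL-A": 12, "CTRL-B": 95, "CTRL-C": 1100, "CTRL-D": 9800}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

logs_in = [math.log10(known_mix[k]) for k in known_mix]
logs_out = [math.log10(observed[k]) for k in known_mix]
r = pearson(logs_in, logs_out)
print(f"log-log spike-in correlation r = {r:.3f}")
```

In practice you'd also look at where the observed counts fall off at the low end, which tells you something about the sensitivity floor of the library.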
So this used to be much more common, because it was a really great way of enriching for the actual protein-coding RNAs — things that are polyadenylated tend to be mature RNAs that have already been spliced and are likely to correspond to protein-coding genes. But in the last several years, some kits have really improved that allow a somewhat more holistic representation of the transcriptome, where you just sequence total RNA and, instead of enriching for the polyadenylated RNAs, you do ribosomal reduction, which is mentioned on the next line here. Size selection is sometimes done before cDNA synthesis, so you're actually fragmenting the RNA, and sometimes after cDNA synthesis, where you're fragmenting the cDNA. Even though I mentioned that there's a holistic approach to sequencing the whole transcriptome, it's not really true — there are a bunch of caveats to the most common TruSeq Illumina RNA-seq kit, and one of them is that it's still doing a pretty substantial size selection. Small RNAs are pretty much deliberately getting tossed, so you're not going to get good representation of microRNAs, tRNAs, or snoRNAs; a lot of that stuff is gone. If you're really interested in snoRNA biology, it's likely that the standard RNA-seq approach a core service will have on its list is not suitable for that. There used to be some kits that were quite popular that involved linear amplification steps — something you might do if you have really, really limited material, if you're studying stem cells or really rare tumor cells or something; some people are still doing that. There also used to be a mix of stranded versus unstranded data. Stranded data is where you've encoded the data in such a way that you can figure out, for each read, which strand it was likely transcribed from.
It used to be that you didn't have that information, so you would just align the reads against the reference and you didn't actually know which strand each read was transcribed from. You would infer that based on the way it aligned to a known gene, or, if it encompassed a splicing event, you could get a pretty strong inference of which strand it came from. But now that's built into library construction, and most people nowadays are working with stranded libraries. There are still quite a few scenarios where people are doing an exome capture of their RNA-seq library. This is where you take your RNA, make an RNA-seq library essentially as you would normally, and then hybridize it against a probe set that corresponds to all of the known exons of your genome. This enriches for fragments that actually correspond to known genes, and it's commonly used as a way of rescuing samples that are quite degraded, where some of the other enrichment steps don't work as well. Library normalization is an attempt to deal with the problem of having really, really highly expressed species of RNA alongside really lowly expressed ones — an attempt to even things out a little bit. So there are a lot of details here, and in the wiki we'll point to a whole bunch of materials where you can really get into the weeds and find things relevant to your particular experimental setup. But the main point is that these details can really affect the analysis strategy. If you have a set of RNA-seq libraries that you're going to try to compare to each other, and they vary on any of these factors, that could cause a problem: you could be seeing batch effects instead of biological effects, if you're comparing two conditions where it's not just the condition that's varying but also the way the RNA-seq library was constructed.
This figure is just an overview, with a little bit more detail, of how you actually make an RNA-seq library. You start with a tissue and isolate total RNA; usually at this step we assess RNA quality, for example by running it on a gel, though most people don't do this anymore. These are kind of synthetic gels showing a couple of examples. You have totally intact RNA here, and this is what it would look like on the Agilent electropherogram: two peaks, two bands. As the RNA gets more degraded, you go to a pattern that looks more like this — partially degraded total RNA, where you can still see the 18S and 28S peaks a little bit. And as the RNA degrades further, it looks more and more like a smear: smaller and smaller bands on the gel until eventually we just have an indistinguishable smear of small things. This is what an Agilent electropherogram looks like when the RNA has been pretty much degraded to completion. In general, you still have small RNA fragments left; there seems to be a bit of a floor where the RNA stops degrading, or starts to degrade more slowly, once it's all in fragments of, say, 100 to 150 bases in size. That's kind of convenient, because it means you can often still do some profiling of RNA that's pretty heavily degraded. Once you've assessed the quality, it's typical with your total RNA to do a DNase treatment to try to remove genomic DNA that may still be in your sample, and then potentially some kind of enrichment — we talked about some of those enrichment strategies on the previous slide. Then you do cDNA synthesis, and you take those cDNAs and often do a size selection and add sequencing adapters. At this point your small RNAs are typically lost, because you're selecting for fragments above some size. The size selection is not perfect, so it'll be somewhat incomplete.
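The size selection step can be caricatured like this (a toy model with made-up numbers — real size selection is a physical process, not a clean threshold): fragments below a cutoff are mostly lost, with some leakage because the selection is imperfect.

```python
import random

# Hedged toy model (illustrative numbers only) of size selection: fragments
# below a cutoff are mostly discarded, which is why small RNAs like miRNAs
# drop out, and the selection is imperfect, so a small fraction of short
# fragments slips through anyway.
random.seed(0)
fragments = [random.randint(20, 600) for _ in range(10_000)]  # lengths in bases

def size_select(frags, cutoff=250, leak=0.05):
    """Keep fragments >= cutoff; a small random fraction below it leaks through."""
    return [f for f in frags if f >= cutoff or random.random() < leak]

kept = size_select(fragments)
short_before = sum(f < 250 for f in fragments)
short_after = sum(f < 250 for f in kept)
print(f"short fragments: {short_before} before selection, {short_after} after")
```

The "leak" parameter is the invented stand-in for the incompleteness mentioned above: the short stuff is heavily depleted but never entirely gone.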
You wind up with your fragments with the sequencing adapters attached, and this library of fragments is what goes onto the machine for sequencing. I mentioned enrichment strategies — there are several that I've already touched upon, starting with doing no enrichment at all. No one does this; basically no one ever sequences straight total RNA, because all you would do is sequence the same ribosomal RNAs over and over and over again. But that's what's depicted here in A, where we've got this pool of complex RNAs and it's all represented in the total RNA pool. Probably the most common strategy is to do a ribosomal RNA reduction step. This is where you have a bunch of probes that correspond to ribosomal RNA sequences; you take your total RNA, hybridize it with those probes, and basically try to pull out the ribosomal RNA species and enrich for everything else that's left. The main alternative to that is poly-A selection, where instead of trying to remove the ribosomal RNAs, we try to grab onto and hold the things that are polyadenylated and wash everything else away. One of the main caveats of that approach is that if any of your RNAs are a bit degraded and you grab onto the poly-A tail, which is at the 3' end, you're potentially losing the 5' ends — anywhere the RNA was broken, that piece is going to be lost when you grab onto the 3' end of the transcript. And then cDNA capture is kind of an orthogonal approach, where you're not selecting for polyadenylated species and you're not trying to pull out the ribosomal RNAs; you're just directly targeting known exon sequences and enriching for those. I would say all three of these are still in relatively common use, depending on your experimental scenario. Any questions on that?
I briefly mentioned stranded versus unstranded libraries. This is a cartoon depiction where we've taken some reads from a stranded library and an unstranded library, aligned them against the reference genome, and colored them according to their sequencing strand. You can see with the unstranded library here on the top that you just get a random mixture of reads from either strand, and you don't know which strand was actually being transcribed. But with the stranded library you now have information that basically tells you, okay, it looks like this read came from the positive strand or the negative strand in terms of where it was transcribed from. On the right-hand side we're showing an IGV screenshot of some actual data produced with one of these stranded libraries, where the coloring of the reads is based on the strand reported for each sequencing read. You can see it's pretty close to perfect in terms of identifying reads that came from the positive strand for this gene going one way, and from the negative strand for this gene that happens to be going the other way. So we've got two genes arranged in a sort of head-to-tail fashion, and the strand assignments seem to be matching up pretty well. Replicates are a common question that comes up: how many replicates should we do? There are generally different kinds of replicates to think about: technical, experimental, and biological. I would typically think of technical replicates as being at the instrument level: do we need to worry about different flow cells producing variability, or different lanes on the same flow cell? Is there run-to-run variability produced by the sequencing instrument itself? The answer is generally no.
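The per-read strand assignment shown in that IGV screenshot can be sketched in a few lines. This is a toy illustration under an assumed protocol (a dUTP-style "RF" / fr-firststrand library, where read 1 aligns antisense to the transcript), not the logic of any particular tool:

```python
# Toy sketch of strand assignment for a stranded paired-end library.
# Assumption (mine, not from the lecture): a dUTP-style "RF" / fr-firststrand
# protocol, where read 1 aligns antisense to the transcript. Under that
# assumption, the transcribed strand is simply the opposite of read 1's
# alignment strand.

def transcribed_strand(read1_strand: str) -> str:
    """Infer the transcribed strand from read 1's alignment strand (RF protocol)."""
    if read1_strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return "-" if read1_strand == "+" else "+"

# Read 1 aligned to the minus strand implies a plus-strand transcript.
print(transcribed_strand("-"))  # +
print(transcribed_strand("+"))  # -
```

In an unstranded library this information simply isn't recoverable from the reads themselves, which is exactly the difference the cartoon is showing.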
The Illumina platform has become pretty robust: you can take the same library and sequence it on one flow cell and then on another, and if you look at those two datasets they will generally correlate extremely well. That's what's depicted here on the right, just an example where we've got two lanes generated from the same library, and they're very, very highly correlated. Of course, experimental replicates and biological replicates are still important, and it's very difficult to say how many replicates you need without really thinking about your biological condition and how much variability there is, but they are as important as they always will be in biological experiments. Some of the common analysis goals of RNA-seq, things we can ask of RNA-seq data, many of which we're going to cover in this course: we're really going to focus on gene expression and differential expression, with a bit on alternative expression analysis, and then, on the second and third days, transcript discovery and annotation; we're going to talk about that a fair bit. We're going to briefly touch on the concept of allele-specific expression, relating either to common polymorphisms or, in the case of cancer, to somatic mutations. People are doing mutation discovery in RNA-seq data for various reasons and scenarios, and fusion detection similarly. RNA editing we don't touch on, but the tools you would need to start identifying RNA edits will be made available to you over the next couple of days. All of these questions we ask of RNA-seq data generally have a particular set of tools that you chain together into a workflow, to go from raw data to some more interpretable output that you can do a final, human-readable analysis on.
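The lane-to-lane comparison described above is usually just a scatter plot and a correlation of per-gene counts between the two lanes. A minimal sketch with simulated data (the abundance distribution and noise model here are invented purely for illustration):

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Simulate per-gene counts for one library sequenced on two lanes:
# the same underlying abundances, with independent sampling noise per lane.
random.seed(0)
truth = [random.expovariate(1 / 500) for _ in range(2000)]   # "true" abundances
lane1 = [t + random.gauss(0, math.sqrt(t + 1)) for t in truth]
lane2 = [t + random.gauss(0, math.sqrt(t + 1)) for t in truth]
print(round(pearson(lane1, lane2), 3))  # very close to 1
```

Because the sampling noise is small relative to the spread of gene abundances, the two lanes correlate almost perfectly, which matches what real same-library lane comparisons look like.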
So there are many, many tools and many workflows, but they have general themes, and they all kind of follow this pattern: start with raw data, perhaps convert it from one file format to another, align or assemble the reads, and then process the alignment or assembly with a tool that's specific to the goal, for example Cufflinks or StringTie for expression analysis, or deFuse or ChimeraScan for fusion detection. Then, after you run those question-specific bioinformatics tools, there will usually be some kind of post-processing: the tool will output some crazy, usually custom file format that its authors invented and only they fully understand, and there will be some cleanup, filtering, or further munging to pull out actual observations that you can follow up on. And then there's a summarization and visualization step, where you're actually creating your figures or viewing the data in some kind of interactive browser. We already talked about Biostar, so I think we'll just skip this exercise; if you haven't used Biostar, I definitely recommend it as a place to check out, sign up, and ask a question if you have one. Some of the common questions that typically come up in this course: should I remove duplicates from RNA-seq data? The answer is generally no. The reason we specifically cover this is that so many workflows do involve duplicate marking, basically any workflow involving DNA analysis. Any time you have a fragment, with a read 1 and read 2, and two fragments seem to start and end at exactly the same position, the default assumption in most DNA sequencing experiments is that those are potentially amplification artifacts and should be collapsed down to one observation, because it wouldn't happen very often by chance that two fragments start and end at exactly the same place.
And if you do simulations for whole-genome data, for example sequencing with fragments that range in size from say 200 to 350 while targeting 30 or 40x average coverage of the whole genome, there's an extremely low probability that two fragments start and end at exactly the same place. So it doesn't do you any harm to remove them, and you get to remove all of these amplification artifacts. Unfortunately, in RNA-seq the situation is quite different. We're not sequencing chunks of huge chromosomes, we're sequencing RNA fragments, and as I mentioned, some of those RNAs are expressed at really, really high levels in each cell and some of them are quite small. So you can have a gene of relatively modest size, say only 300 or 400 bases long, that's expressed at tens of thousands of copies per cell. Now, when we're sequencing that RNA species, there actually is a pretty high chance that we'll get two identical fragments just by chance, because there just aren't that many distinct fragments you can get out of something that small. For that reason we generally don't mark duplicates in RNA-seq experiments, and even if you do mark them, most of the downstream tools will just ignore the marking; in that sense it's convenient that it generally won't do any harm if you run one of the typical duplicate-marking steps. How much library depth do you need? Of course this depends on many factors, and how much data you generate will influence how rich a question you can ask of the dataset.
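The fragment-collision argument above can be made concrete with a tiny simulation. This is a toy model (fixed fragment length, uniform fragment starts, and numbers invented for illustration) comparing a short, highly expressed transcript against a whole-genome-like setting:

```python
import random

def duplicate_fraction(n_fragments, n_start_positions, seed=0):
    """Fraction of fragments whose start position was already seen, assuming
    uniform fragment starts and a fixed fragment length (so start determines
    the whole (start, end) pair)."""
    rng = random.Random(seed)
    seen, dupes = set(), 0
    for _ in range(n_fragments):
        start = rng.randrange(n_start_positions)
        if start in seen:
            dupes += 1
        else:
            seen.add(start)
    return dupes / n_fragments

# A short, highly expressed transcript: with a ~400 bp gene and ~300 bp
# fragments there are only ~100 distinct start positions, so positional
# "duplicates" are inevitable at high read counts.
print(duplicate_fraction(10_000, 100))          # at least 0.99

# A whole-genome-like setting: tens of millions of possible starts,
# modest depth, so positional collisions are rare.
print(duplicate_fraction(10_000, 50_000_000))   # near 0
```

This is exactly why position-based duplicate marking is safe for whole-genome data but would wrongly discard real observations in RNA-seq.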
If you just want gene expression estimates, so if you've been doing microarray experiments and you just want to reproduce what you would get from a microarray, abundances at the gene level for a set of known genes, you can get away with a relatively small amount of RNA-seq data. You can really multiplex, and in that scenario you might benefit from having more samples, each with a smaller amount of data, to increase your statistical power. There are a number of papers we reference on the wiki that more formally address this question of the minimal amount of data, or the most efficient way to design an experiment when you have a finite amount of money, which is always the case, and can either choose more samples or deeper sequencing per sample; where is the right balance? I think the estimates come out to something like 20 to 25 million reads per sample being plenty if you just care about gene expression estimation. But of course, for many RNA-seq experiments we care about a lot more than that: just telling the approximate abundance of a gene is not nearly as hard as characterizing the exon-intron structure of that gene, or doing mutation calling, so you really have to think about what you want to do with the data when you make this decision. Other things like the tissue type, the RNA preparation, the quality of the input RNA, or the library construction method may also influence the amount of data you need to generate; if the RNA is degraded, or there's some impurity to it, you may need to generate more data. The read length, and whether your reads are paired or not, may also matter.
So one recommendation we usually make is to find a publication with an experimental design similar to what you're thinking about, and use that as a starting point. Even better, do a pilot experiment where you deliberately overshoot on a small number of samples; then you can look at the data, do some down-sampling experiments, and figure out where the sweet spot is under your conditions, in your lab, with whatever peculiar factors you have in play. The good news is that the amount of data you can get from a HiSeq instrument now is so spectacular that one lane, or half a lane, or maybe even a third of a lane, is sufficient for most of these purposes. What mapping strategy should you use? This used to be much more of a common issue, when we had quite a range of read lengths. Does anyone here work with reads shorter than 50 bases? We still see this sometimes. What kind of lengths? How many people have something like 2 x 100 reads? Okay, a few of you. Anything else? Seventy-five, or? Seventy-five to three hundred. Okay, so if your reads are seventy-five bases or longer, you have enough information to attempt alignment across the intron boundaries, so you might as well use a splice-aware aligner.
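The down-sampling experiment mentioned above is usually done with a dedicated tool (seqtk, for example), but the core idea fits in a few lines. A sketch with toy in-memory records (real files would be streamed; the record contents here are made up):

```python
import random

def subsample_fastq(records, fraction, seed=42):
    """Randomly keep ~`fraction` of FASTQ records (each record = 4 lines).
    A fixed seed makes the subsample reproducible; applying the same seed to
    the read-1 and read-2 files keeps pairs together, provided the records
    appear in the same order in both files."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]

# Toy "file": 1,000 single-end records (header, sequence, '+', qualities).
records = [(f"@read{i}", "ACGT", "+", "IIII") for i in range(1000)]
sub = subsample_fastq(records, 0.25)
print(len(sub))  # roughly 250
```

Running the full analysis on a series of such subsamples (say 10%, 25%, 50%, 100%) and watching when your results stop changing is a simple way to find the depth sweet spot for your own data.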
If you have really short reads, and sometimes people are doing this to save money, again if they want to profile a huge number of samples, say you have a thousand samples and it's all about comparing a bunch of conditions and you just want gene expression abundance estimates out of them, you might choose to sequence single-end or maybe paired 50-mers, or even shorter. You still see this because it's cheaper and you can process more samples that way. In that scenario you might not want to use a splice-aware aligner, so you might go back to something like Bowtie. The aligners are generally getting better at handling both scenarios, but it's something to keep in mind.

The last question here is: what if I don't have a reference genome? This also influences some of the previous questions: which genome you're actually talking about matters, and whether you have a reference genome may matter as well. Some people have a great reference genome, some have a very draft reference genome, some don't have one at all, and some have a decent set of reference transcripts but not much in terms of actual reference genome sequence. So usually at this point I do a bit of a survey. How many people are working with data that came from prokaryotes? A couple, okay, so no splicing for you, generally. The rest of you are eukaryotes, I guess, is that true? Okay, how many people with human data? Okay, so the human people are lucky in a way, because so much money and so many resources have been spent on producing the best reference genome. Considering the size of the human genome, the quality of the reference is really good, and millions of dollars have been spent annotating it and producing reference transcript datasets, high-quality full-length cDNA datasets, and on and on, all of this characterization. There are almost too many resources: you're overwhelmed by the number of databases, datasets, annotation tools, and sets of transcripts out there for human, and the challenge is almost figuring out which ones to use, or what's best for your scenario. Other species, not so much. What about yeast, anyone with yeast? No? Fungi, any mushrooms? Plants, any plant people? One, two, three, four: well, I think that's a record, I don't think we've ever had four plant people. What plants? Okay: potatoes, tomatoes, tobacco, and wheat. So how many of you have reference genomes? About half of you. Is there anyone who doesn't have a reference genome for their species? At the back, what are those? Okay, no salmon reference genome, and the other one was sea lice, and there's not a reference at all for the sea lice, okay.

Right, so we have a really wide mix, which is also typical for this course: people whose species have very little, who are on the frontier of characterizing that species, and at the other end of the spectrum, people for whom billions of dollars have been spent preparing resources that will be helpful to them, and a bunch of stuff in between. There will be several points during this course where some things will apply to some of you and not all of you, or where there's a bit of nuance to how you would apply it to your species, and I would encourage you to ask questions, or talk to any of us individually about your particular scenario, and we'll see if we can learn what in particular would be helpful to you. Some of the tools we're going to talk about are quite reference-free, so they'll work just on the raw data. Usually it's good if you at least have a reference transcriptome; if you have a transcriptome and a genome, that's even better, and all options are open to you. If you don't have those things, you may need to do some preliminary work to characterize your species before you go further. If you don't have a reference genome at all, it's tempting to ask how you can get one generated, because it's just such a useful tool for studying that species, both for the genome itself and of course the transcriptome as well. But we can talk more about that with you individually, for your particular critters.

There's a reference here to the wiki, where we have more common questions and their answers; when I go through the RNA-seq wiki I'll point out specifically where you can access that. So now I'm going to jump straight into the tutorial introduction, and I'll probably blaze through this pretty quickly, since we're going to start the first hands-on tutorial in just a minute. There are four, or five, modules, and each of them has a hands-on component; this is going to be the first one. We're going to follow this pattern where we do a bit of an intro lecture; this lecture was the longest one, I think, so the following lectures will get shorter and shorter, and you'll spend more and more of your time at the command line, running commands and thinking about how to do the analysis. The first one starts out fairly basic: the very first goal is to practice installing a bunch of commonly used RNA-seq tools. We wanted this not to be a black box, where you felt like you needed a special environment to have been created for you in order to run the analysis that you're going to do over the next two days. So we're going to show basically how all of the tools we're going to use were installed, so that later, if you want to run this on your own compute cluster, you'll have a reference point for how to install these kinds of tools. And for people who are just
getting into bioinformatics at the command line: installing, updating, and maintaining tools and all of their versions is one of the most tedious, painful, hair-pulling aspects of bioinformatics, but you kind of just have to pull the Band-Aid off and get used to it. So this is an attempt to show some examples of what that looks like, so that it becomes a little less foreign in the future. Then we're going to move into the RNA-seq-specific material. We're going to obtain a reference genome; we've created a sort of bite-sized version of the human reference genome to work with here. We're going to obtain gene and transcript annotations, and talk about where you get those kinds of annotations, using human as the example. Then we're going to dig into the GTF file format, which is one of the most commonly used formats for representing transcript annotations; it will be used for many of your species, where someone will have created a GTF file somewhere. We're going to index the reference genome files for use with the aligners, and talk a bit about why we do indexing, and then we're going to obtain the raw sequence data that we'll use for the downstream analysis. So we have three main components here: the reference genome, the reference transcriptome, and the actual raw RNA-seq data, and we're going to talk about the file formats for each of those.

I think Obi briefly talked about some of the common gotchas or problems that come up while working through the tutorials. Some of these commands are just really long, so in the interest of everyone being able to get through the full workflow, from raw data to final interpretation, the longer commands are simply provided to you, and you're encouraged to copy and paste them. Short commands you can type carefully, but you should figure out the logistics of how to copy and paste from the wiki into your terminal. Sprinkled throughout the whole workshop there are also practical exercises where you're not given the commands; we'll do those on the side, so that you really have to type everything out yourself and think about how to construct these complex commands. We're trying to balance two things here: one is wanting you to actually learn how to construct these commands on your own, and the other is being able to get through a relatively complicated workflow where each step depends on a previous step, so everything needs to work for it all to come together at the end. Watch out for copy-and-paste errors. Sometimes you'll copy something from the wiki onto the command line and be missing the end of it; or, another typical thing, you'll copy a set of two or three lines, the first two execute, and the last one just sits there because you didn't hit Enter. So when you paste, it's probably a good idea to hit Enter a couple of times, just to make sure you're actually executing all of those commands; otherwise you'll paste in the next step, it will get appended onto the previous one, and it will create a garbled thing that just results in an error. Being in the wrong directory at the wrong time is another one: everything flows from one step to the next, so if you navigate away, poke around, and then continue at a step you hadn't gotten to yet, sometimes you'll be in the wrong place at the wrong time, so keep an eye out for that. There are also some environment variables that need to be set, but we pretty much have that sorted out so that you don't need to worry about it. I'm just going to briefly describe the tutorial steps here; the wiki has much more complete instructions and commentary on what all of the commands do. If you see a command and there's
something in it that seems like jargon, or isn't explained, please ask or let us know and we'll try to make it more self-explanatory. Any lines that begin with the hash symbol are comments; you can go ahead and paste those into the command line, but nothing will happen, since they're commented out. All the other lines you see in the command boxes on the wiki are meant to be executed. As I said, each command is annotated with basic commentary. We've provided some reference materials for Linux; actually, the wiki has a whole bunch of resources for learning Linux and the command line. These are some of the tools we're going to be using; this is just provided as a reference, with links to the installation documents for each of them, if you want to refer back to them later.

We're going to obtain a reference genome. In our case we're getting our reference files from Ensembl, a European organization that helps to organize reference genomes and their annotations. This analysis is based on GRCh38, the latest build of the human reference genome, which is in part actually created at Washington University, where we work; we're one of the main centers still maintaining the reference genome, trying to improve it, fill the remaining gaps, and fix errors. For this tutorial we're using just a single chromosome. We've picked chromosome 22 because it's one of the smallest chromosomes, which allows the analysis to happen more quickly, and the data has been paired with it as well, but we provide instructions for downloading the full reference genome too. The reference annotations are also from Ensembl: we're going to download a GTF file from Ensembl, again cut down to cover just the genes on chromosome 22, so that it matches. We're going to create an indexed reference genome. This is typically done for almost every aligner: there's a step where you download the aligner, download the reference genome, and use a tool that comes with the aligner to create an index of the reference genome. This is basically like creating a lookup table that allows the aligner to quickly find places in this massive reference genome space, and that's a big part of how alignment is able to happen as quickly as it does. One needs to be pretty careful with these indexes: they tend to be particular to each alignment tool, and sometimes even to versions of that tool. There are many different RNA-seq aligners out there, and you can't really mix and match their indexes; if you're going to use the HISAT aligner, you need to index your genome with the HISAT indexing tool, so that's something to watch out for. The RNA-seq data has also been pre-filtered, so that all of the reads there are ones we already know will map to chromosome 22. Again, this is just for efficiency's sake: even though we've cut our reference genome down to chromosome 22, if we aligned random reads against it, most of them just wouldn't happen to hit it, so we'd spend a lot of time searching and not getting many alignments. The test data comes from two RNA sources: one is called the Universal Human Reference and the other is the Human Brain Reference. The Universal Human Reference is a collection of different cell lines, and the Human Brain Reference is brain tissue from, I think, 20 individuals that has just been pooled together. Both of these are somewhat arbitrary, not very biologically meaningful RNA sets, and the comparison isn't really that much more meaningful either, but we expect them to be quite different and to represent a lot of transcription events: we have multiple people covered, and multiple tissues, so we should see quite a rich representation of the transcriptome, and we expect to see a lot of differences between just
a mixture of random cell lines and pooled human brain tissue. Each sample has also been spiked with a control reagent, the ERCC RNA spike-in mix, which comes in two versions. The idea is kind of like a ladder on a gel: you have a bunch of RNA sequences that were constructed artificially to be very unique and not match anything in the human genome, and they've been spiked into the mix at different concentrations, some at really high concentration, some medium, low, and very low, with the ratios known. So we have a prior expectation for these 90 or so transcripts: we expect this one to be really rare, this one a little higher, and this one higher and higher, so we know in advance what their distribution should look like. Similarly, there's a mix 1 and a mix 2 in which the relative ratios have been swapped around, so if we put mix 1 in one of our samples and mix 2 in the other, we again have an expectation for certain fold changes to come out. So we can compare both the differential expression analysis and the abundance estimates against these prior expectations for the spike-in controls, and this allows us to do a pretty robust QC of the data generation and analysis; we'll know if anything has gone totally nuts in the way the data is generated or analyzed. The input data is in FASTQ format, and I've got a reference there to how the FASTQ format works. This is just a little more detail on each of those sources of our data, with links to even more details about each of these samples. In this case we're going to have some replicates, to allow us to do a replicated experiment in the differential expression analysis: we have three replicates of the UHR library and three replicates of the HBR library, so six samples in total. And then for each of those samples, depending on the step, we may have two files, for read 1 and read 2, so our raw input data is basically going to be 12 files: read 1 and read 2 for all six of our samples. We're also going to play around a little bit with a pre-alignment QC tool called FastQC.
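To make the FASTQ format concrete before the hands-on session, here is a minimal sketch (with made-up example records, and assuming the usual Phred+33 quality encoding) that parses FASTQ records and computes the mean quality per read position, a tiny slice of the kind of summary FastQC produces:

```python
from io import StringIO

# A toy two-record FASTQ "file" (each record is 4 lines:
# @name, sequence, '+', per-base quality characters).
FASTQ = StringIO(
    "@read1\nACGTT\n+\nIIIII\n"
    "@read2\nTTGCA\n+\n!!III\n"
)

def read_fastq(handle):
    """Yield (name, sequence, qualities) tuples from a FASTQ handle."""
    while True:
        header = handle.readline().strip()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()                      # the '+' separator line
        qual = handle.readline().strip()
        yield header[1:], seq, qual

def mean_quality_per_position(records):
    """Mean Phred quality at each read position (Phred+33: Q = ord(char) - 33)."""
    sums, counts = [], []
    for _, _, qual in records:
        for i, ch in enumerate(qual):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - 33
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

print(mean_quality_per_position(read_fastq(FASTQ)))
# [20.0, 20.0, 40.0, 40.0, 40.0]
```

'I' encodes quality 40 and '!' encodes quality 0, so the low-quality start of the second read drags down the mean at the first two positions, which is exactly the kind of pattern you'll look for in the FastQC per-base quality plot.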