So, this will be the most, I guess I would call it, biological of the lectures, since it's starting at the beginning: an introduction to the rest of the lectures, but also to general concepts of RNA-seq. And this one tends to have quite a bit of discussion, so that everyone hears from people with different areas of expertise. Their projects are all different. They have different perspectives on the different flavors of RNA-seq. So we're going to go through some of the nuances of what it means to do an RNA-seq experiment and some of the major flavors of it. And if you guys have comments or questions as we go, please speak up. It's fine and good to have a bit of a discussion while we're going through these lectures, to make them more interactive and interesting. So before I start with the first slide, I want to do something similar to what Obi did, which is get a bit of a sense of the specifics of what you guys are doing. So how many people here are doing RNA-seq, or interested in doing RNA-seq, for human? So that's me. Whoa. Okay. So I guess about half of you are working on human. The rest of you, well, we're going to find out. Okay, what are the rest of you? How about some other eukaryote, like a mouse or a yeast or whatever? Okay. So you guys are still interested in the splicing, intron aspect of RNA-seq. Any plants? They're eukaryotes, but in a special category for other reasons. Okay. So one, two, three. Wow. Okay. So, yeah, that sucks for you guys, depending on which plant it is. Plants are infamously nasty genomes, very complicated. Some of them are just huge and polyploid and all kinds of craziness going on. So what are the plants, quickly? Just those three people? [Audience: I'm actually looking at a polyploid.] Yeah. So you're definitely screwed. [Audience: Fireweed.] Yeah. Oh, cool. That's really neat. [Audience response partly inaudible.]
[Audience response partly inaudible.] Okay. And the third one? [Audience: Green algae.] Green algae. Okay. Interesting. Okay. So quite a mix. So the human people definitely have a reference genome, but for the non-human people, do you all have reference genomes? Does anyone not have a reference genome for their species? One? Two? Okay. Kind of. [Audience response inaudible.] Right. Okay. And what would be the closest reference for fireweed? So not having a reference genome generally has quite an effect on the way you approach the analysis of RNA-seq data, because there are quite a lot of tools and pipelines that assume you have a reference genome, or at least one from a closely related species. And so at points we're going to talk about some of the things you would do if you didn't have a reference genome, but really this course is quite focused on the scenario where you do have a reference genome, though there are a few pieces we've added here and there for reference-free analysis. How many people have their data already? Okay. And then there are some people that are in the planning phases, or they're going to be getting data. So possibly libraries are still being made, or you're deciding how to make your libraries, this kind of thing? Okay. So this is going to be a potentially helpful exercise for you, because we're going to talk about some of these things. So this is just a review of where we're at: we just did the introduction to cloud computing, and now we're going to go through an introduction to RNA sequencing itself. This is the second of five modules, which are numbered zero through four in a very computer science kind of way. Each lecture will be accompanied by a tutorial. All of the tutorials are going to be done using the wiki at RNAseq.wiki, and we've tried to create these tutorials with a few basic principles in mind.
One is to provide a working example of a functional RNA-seq analysis pipeline that's more or less end to end. So the emphasis there is to make it real world, not some sort of imaginary pipeline, but a pipeline that you might actually use. But one caveat is that the data we run it on can't be full-scale real data, because then it wouldn't run in a reasonable amount of time for a teaching setting. So we want to execute a command that says, hey, I'm going to align my reads to a reference genome, and I don't want that to take eight hours, because we don't have time to stand here waiting eight hours, and then the course wouldn't get done. So we've created these sort of idealized, or dumbed-down, data sets that allow us to get through a workflow using real data files, executing all the same commands that you would otherwise execute with the big files; it's just that everything is going to happen much faster. But you have to keep that in mind when you're looking at the results and interpreting them: it's an idealized scenario where the data set is very, very small, and we've deliberately selected reads that only map to a small chromosome. So it's going to look a little bit different than your full data set will. It's going to look a little bit cleaner and smaller and simpler, potentially, than a real-world large data set. And then, as much as possible, we've also tried to make the tutorials and all the files associated with them fairly self-contained, hopefully self-explanatory and documented in place, so that when you go back, you can of course listen to these recordings again, but the hope is that everything you need to understand what's going on is in place. And in cases where that's not clear, we'd really appreciate your feedback to make it clearer in the future.
And by self-contained, we mean that we're going to try to explain everything about the setup of the tools, the environment, etc., that you would need. In the past we've had issues where, at workshops, a lot of things were kind of done for you. So tools were installed and configured, and there was this elaborate machine that was built for you, and someone sat you in front of that machine and said, okay, here's how you run RNA-seq: you press this button and all this amazing stuff happens. But you didn't know how to build that machine. So you'd go back to your lab and it's like, well, I can't even recreate what we were working on. So we've tried to provide the instructions to recreate the environment that is being used here, and to actually show you, to some degree, what it's like to create that environment in the first place. So this module specifically is really going to talk about the theory and practice of RNA sequencing. We're going to do a little bit of rationale for sequencing RNA. Most of you are already on board with the concept of RNA sequencing; that's why you're here. But it's still useful to think about the reasons we do RNA sequencing. We're going to really spend some time talking about the challenges that are specific to RNA-seq, as opposed to other types of sequencing. So there's an issue that sometimes comes up where people have done some DNA sequencing and some analysis, and then they try to apply the same approaches to RNA-seq. And generally that is fine, but there are a few caveats or nuances to RNA-seq that you really want to keep in mind that are distinct from other types of sequence analysis. We're going to review the general goals and themes of RNA-seq analysis workflows, using our example as a sort of proof of principle.
We're going to talk about some of the common technical questions related to RNA-seq analysis and point you to some resources to help you get help outside of this course, or for questions that we may not get a chance to cover. And then we'll do a brief introduction to the first tutorial, to get used to what these tutorials are going to be like before we jump into running commands at the command line with actual data, tools, etc. So this is really the basic biological introduction to RNA-seq, just to make sure we're all on the same page. This is of course a eukaryotic example, where we have a hypothetical gene model. Sorry, I can't point at both of these screens at the same time. We're starting with a hypothetical gene model at the top, which is a double-stranded genomic DNA template, and a number of features are enumerated in this pictograph, where we have, in this case, a gene with three exons and two introns. These introns are not to scale for humans; in humans the introns would be much, much larger relative to the exons. Something like yeast might have a gene that looks kind of like this. And there's a promoter region where transcription factors will sit down and transcription will start, and then there are features that signal when transcription should stop and where a poly-A tail should be added, if it's a polyadenylated transcript. So a transcription complex will come along and make a single-stranded pre-mRNA molecule from this DNA template. So now we have this single-stranded RNA molecule where the introns are still in place, and in the second panel we have some of the splicing regulatory features enumerated. These are the sequence features that allow the splicing machinery to tell where the exons end and the introns start, so that the introns can be spliced out and the exons can be assembled into a mature mRNA molecule, which is capped and polyadenylated, as shown in the third panel. So now we have this mature mRNA.
This is getting closer to the subject of RNA sequencing, these RNA molecules, and we'll often do something to enrich for the mature mRNA molecules. But it's important to remember that in RNA-seq this is not, strictly speaking, the thing that is being sequenced directly. First of all, we're not actually sequencing RNA, we're sequencing cDNA; it's going to get converted to cDNA, usually double-stranded cDNA. And second, the next generation sequencing platforms produce short reads that are much, much shorter than the average transcript in most species. So there's almost always a fragmentation step where full-length RNA molecules are broken into pieces, and it's those pieces that we're sequencing. And then we're generally not even sequencing those entire pieces: we're usually sequencing a little piece on the end of one of those fragments, or we're sequencing part of both ends of the fragment, but we're not really sequencing the entire fragment. But that's what we're imagining we're doing, or pretending we're going to do, and we're going to use the analysis to try to make it seem as if we just sequenced the RNA molecules that were there. And then, of course, many of these RNA molecules encode a protein, and often that's what we really care about, and so just to complete the picture we show a protein sequence here, which then gets folded and modified in various ways with post-translational modifications. So that's a very important piece too: in a lot of cases we're using RNA as a proxy for studying actual protein function, and we're doing a lot of inference. So you can see there are various forms of inference going on here, and it's useful to keep those in mind when you're analyzing and interpreting the data. This is a really high-level view of what an RNA sequencing workflow looks like, from the lab to where analysis starts. So we have some samples of interest.
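To make the fragmentation and end-sequencing idea concrete, here's a toy Python sketch. Everything in it is invented for illustration (the function name, the parameter values, the random fragment model); real library prep and base calling are of course far more involved. It fragments a cDNA sequence and then "sequences" only a short read from each end of every fragment:

```python
import random

COMP = str.maketrans("ACGT", "TGCA")

def paired_end_reads(cdna, read_len=75, frag_min=200, frag_max=400, n_frags=3, seed=42):
    """Toy model of paired-end sequencing: break a cDNA into random
    fragments, then read read_len bases from each end of each fragment.
    The middle of the fragment is never observed directly."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_frags):
        frag_len = rng.randint(frag_min, min(frag_max, len(cdna)))
        start = rng.randint(0, len(cdna) - frag_len)
        frag = cdna[start:start + frag_len]
        read1 = frag[:read_len]                         # forward read from one end
        read2 = frag[-read_len:].translate(COMP)[::-1]  # reverse-complement read from the other end
        pairs.append((read1, read2))
    return pairs

# A made-up 1 kb "transcript"
transcript = "".join(random.Random(0).choice("ACGT") for _ in range(1000))
pairs = paired_end_reads(transcript)
```

Each pair covers only the two ends of a 200-400 bp fragment, which is why the downstream analysis has to infer the full transcript rather than read it off directly.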
So for example we might have a tumor and a normal; this is a lot of what Obi and I do. In our exercises we're actually going to be comparing two sort of arbitrarily different tissues. One is a universal human reference RNA sample, which is a mix of RNAs collected from various human tissues, and the other is a brain reference sample, which is a mix of different brain tissues that were combined together. So this is a very kind of spurious comparison, which we're just doing for the purposes of demonstrating the concepts of RNA-seq and differential expression analysis and so on. But whatever your samples are, you're going to isolate RNA from them, and then you're generally going to do some kind of library construction procedure that involves, not necessarily in this order, generating cDNA, fragmenting, selecting those fragments to be within a certain size range, adding linkers onto the ends of those cDNA fragments, and then sequencing those fragments on an Illumina flow cell. It used to be there were various platforms; generally everyone now is doing Illumina sequencing. So for the people that have data now, is there anyone who's using a platform other than Illumina, like Ion Torrent or PacBio? Okay, so PacBio is quite different. PacBio produces much, much longer reads, and the way it actually works is fairly distinct from the Illumina sequencing-by-synthesis. Anyone else? Okay. And for the people that have Illumina data, are they generally paired-end or single-end reads? Does anyone have single-end, or is it all paired? It's all paired. And what kind of read length? [Audience: I have both.] Yeah, both, okay. And then read length, does anyone have reads that are, say, shorter than a hundred bases? Okay, two? 75. 75. Okay. So that's pretty representative of what you get if you ask ten people that have RNA-seq data. You'll see a bit of variation in how long the reads are. Sometimes there are paired reads.
Sometimes there's a mix of paired-end and single-end reads. Sometimes there are other minor variations, but we're converging on a fairly standardized approach of roughly 75 to 125 base pair reads, and paired-end reads are very popular. And by paired-end I mean that you have some fragment where you're sequencing a little bit of both ends, which is what's being depicted here at the bottom with the blue and red sequences. So we're going to produce this pile, often a very, very massive pile, of these paired-end sequences, and then those are going to get aligned back to some kind of reference sequence, which can be a reference genome, or reference transcriptome sequences, or some combination of those things, and there are other different games that people play; and then the downstream analysis is going to continue from that point. So why would you sequence RNA? There are lots of contexts in which people would be sequencing RNA. A lot of it is functional work, where you may have the sequence of a genome, but the transcriptome gives you much more of a functional readout of the effect of environmental changes. So you could have something like drug-treated versus untreated cell lines, or a wild type versus a knockout mouse. Predicting transcript sequences from the genome is really difficult. This used to be kind of a whole field of bioinformatics: to sequence a reference genome and then to try to predict what gene structures looked like based on the genome sequence. You'd try to predict, okay, what looks like an exon, how might those exons be assembled into an RNA transcript, and then what would the protein look like, but doing all of this just by looking at the reference genome sequence.
So this is something that was fundamentally revolutionized by the advent of RNA-seq: it's just actually much easier to sequence the transcripts directly than to try to infer what they would look like from the genome, which is a very challenging approach. Of course, some molecular features can only be observed at the RNA level, things like alternative isoforms, fusion transcripts, and RNA editing. In the cancer realm, interpreting mutations that don't have an obvious effect at the protein sequence level can be aided by RNA-seq, where you can look for regulatory mutations, and you can prioritize mutations that are in exons according to whether they're expressed or not; that's a niche example that's quite specific to cancer genomics. There are a number of challenges that are quite particular to RNA-seq compared to other sequencing like ChIP-seq, whole genome or exome sequencing, et cetera. And then there are other things that are quite common. So there are things relating to the sample, like purity of the sample. If it's contaminated with something else, using the tumor example, it may be contaminated with normal tissue, and that of course influences interpretation of the result. RNA quantities can be limiting, and RNA is infamously fragile. So some amount of RNA degradation is part of the life of anyone who works with RNA, with the possible exception of people who work exclusively in cell lines. So how many people here would say that RNA quality is a serious challenge in their work? Okay, so several. And there are a lot of reasons for that. Often, simply because of the type of experiment being done, it's just not possible to get really beautiful intact RNA, and so you're left dealing with the challenges that introduces. Another thing about RNAs is that they often consist of small exons that are separated by large introns.
So this makes quite a difference in the analysis when you compare to, say, whole genome sequencing. When you're mapping your read, which was generated from a cDNA fragment, which came from an RNA molecule, back to a reference genome, the aligner needs to account for the possibility that that read may span across an intron. So instead of trying to find the place in the reference genome that these 100 bases match more accurately than anywhere else, it has to consider the possibility that 25 of those bases might match here and 75 of them might match 10 kb downstream, where the next exon continues. And that's a big challenge for the alignment algorithms. It's much, much harder than just saying, okay, what's the one place that this thing goes contiguously, when you have to say, well, I expect that in a significant proportion of my alignments the alignment will actually be discontiguous, and there will be this potentially very large gap; introns can be huge in humans, they can be up to 100 kb. So there's a big search space, basically. And you can imagine that if the overlap across the intron into the next exon is really small, it gets harder and harder to place that little bit that spilled over into the next exon. So that's one of the challenges that the RNA-seq-specific aligners try to address. Something that is often glossed over or forgotten is that the relative abundance of RNAs varies wildly within a sample. Estimates vary, but in human you'll often see estimates of something like five to seven orders of magnitude difference in abundance between the most highly expressed RNA and the most lowly expressed RNA that is still biologically functional in that tissue. So this is very, very different from DNA sequencing, where you have some complement of chromosomes and you expect approximately equal representation of each of those chromosomes. So in human you have two copies of each chromosome.
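The intron-spanning alignment problem described above can be sketched with a deliberately naive brute force. Real spliced aligners (HISAT2, STAR, and the like) use genome indexes and splice-site models rather than anything like this; the function and the toy sequences below are invented purely to illustrate why the search space blows up:

```python
def naive_spliced_align(read, genome, min_anchor=10, max_intron=100_000):
    """Deliberately naive spliced alignment: try every split of the read
    into a prefix and a suffix, and look for two genome positions where
    the pieces match exactly, separated by at most max_intron bases.
    Returns (prefix_pos, split_point, suffix_pos) or None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        p = genome.find(prefix)
        while p != -1:
            # The suffix must land downstream, within a plausible intron length.
            q = genome.find(suffix, p + split)
            if q != -1 and q - (p + split) <= max_intron:
                return (p, split, q)
            p = genome.find(prefix, p + 1)
    return None

# Made-up example: two 25-base exons separated by a 100-base "intron"
exon1 = "ATGGCGTACCTGAAGCTGTCCAAGT"
intron = "GT" + "C" * 96 + "AG"          # canonical GT...AG intron ends
exon2 = "ACCGATTAGGCATGAAGGTTACCAT"
genome = exon1 + intron + exon2
read = exon1 + exon2                     # a read spanning the splice junction
hit = naive_spliced_align(read, genome)  # → (0, 25, 125)
```

Even in this toy, every split point of the read multiplies the number of candidate placements, and shrinking `min_anchor` (the little bit that "spills over" into the next exon) makes spurious matches ever more likely, which is exactly the trade-off real spliced aligners have to manage.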
And so when you sequence, you randomly shotgun sequence reads, and when you align them back to the reference genome you expect coverage to be approximately even across the reference genome. In the transcriptome, this is not the situation at all. You expect some things to be very, very heavily sequenced and other things to be very lowly sequenced, hopefully in proportion to how abundant they are in the sample. And this is a challenge, because RNA sequencing works by random sampling. So you have the situation that a small fraction of really highly expressed genes consume a lot of the reads that you have. You have this fixed pool of, say, 50 million reads, because that's what you can afford to generate, and 40 million of them are going to get consumed by the most highly expressed genes. So if your gene of interest happens to be a gene that's not so highly expressed, you may not get good coverage. You may not sample it very well. So this really comes into estimating how much data you need to produce, and it's something that's quite different from DNA sequencing. And then you have things like ribosomal and mitochondrial genes, where there's often some attempt in the lab to actually filter these out and enrich for other things, to reduce the problem that you don't want to just sequence ribosomal RNA to death and not really learn anything about the rest of the transcriptome. Another issue that's again quite different from DNA sequencing is that RNAs come in a wide range of sizes, and they function in those size ranges. So you can have really, really small RNAs that are important, like microRNAs, that are just a few tens of bases long. And then you can have mRNAs that are very, very large; they can be 30, 40, 50 kb in size, and some of those produce massive proteins. And again, this is quite different from DNA sequencing, where generally the chromosomes are all quite large in terms of working with molecules and molecular biology.
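You can see this "reads are a fixed budget" effect with a tiny random-sampling simulation. The gene names and abundances below are invented, and the read count is scaled down from the 50 million in the example; the point is only the qualitative behavior:

```python
import random

def sample_reads(abundance, n_reads=50_000, seed=1):
    """Toy model of RNA-seq as random sampling: each read is drawn from a
    gene with probability proportional to that gene's transcript abundance."""
    rng = random.Random(seed)
    genes = list(abundance)
    counts = dict.fromkeys(genes, 0)
    for g in rng.choices(genes, weights=[abundance[g] for g in genes], k=n_reads):
        counts[g] += 1
    return counts

# One hugely abundant species (think ribosomal RNA) versus a rare transcript
abundance = {"rRNA": 1_000_000, "GAPDH": 10_000, "rare_gene": 10}
counts = sample_reads(abundance)
# Nearly the whole 50,000-read budget goes to rRNA; rare_gene gets almost none
```

This is why ribosomal reduction in the lab, and depth calculations at the design stage, matter so much more for RNA-seq than for DNA sequencing.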
And then this introduces some complexities in the RNA-seq library construction procedure, where you have to handle very small RNAs differently than the rest of the RNAs. And it gives you a representation challenge: just as there are RNAs that are very highly abundant and other RNAs that are very rare in your sample, which gives them different probabilities of being hit by a random RNA-seq read, the same principle applies to small versus large RNAs. It's easier to capture and represent at least some amount of a large RNA than a small RNA, because there are more fragments that you can derive from that large RNA. So it's just at an advantage, fundamentally, in the data generation. And then we already talked about the fragility of RNA, which is easily degraded. Something you'll see a lot in RNA-seq experiments are these Agilent traces. So, does everyone use Agilent? How many people are familiar with the Agilent assay for assessing RNA quality? Okay, so maybe like a third of you. So this is a really common tool, the Agilent Bioanalyzer, that's used to assess how intact, and also how abundant, your RNA sample is. So you do RNA isolation, and it's very common to run it on an Agilent, which is kind of like running it on a gel, except you're running it through capillary gel electrophoresis, and you're getting a readout over time as the sample runs through the capillary. The smallest RNAs come out first and the largest RNAs come out later, and over time it gives you this trace that has a sort of spiky pattern to it. And when the RNA is total RNA, this is a human example, you'll have peaks corresponding to the ribosomal RNAs, which account for about 95% of all RNA in your sample. And you can use the strength of those two peaks as an indication of how intact the RNA is.
So since we know to expect that about 95% of the RNA corresponds to molecules of about these two sizes, if the RNA is intact, we should see two nice big peaks corresponding to those sizes. If the RNA is degraded, we'll start to see more and more peaks, as we basically have evidence that those RNAs have been broken into smaller and smaller pieces. So the image you're seeing on the right is a sort of perfect RNA sample that I isolated from a cell line. The one on the left is a much more challenging RNA sample that I isolated from an actual human tumor. And you can see the difference between completely intact and partially degraded RNA. And this Agilent assay will give you a score, a RIN score (RNA Integrity Number), based on, among other things, the area under these two peaks. So you'll commonly hear sequencing cores talk about these RIN scores, and they may have some sort of minimum cutoff where they won't accept your sample if it's below a certain quality level. A RIN of 10 indicates completely intact, perfect RNA, and as you get closer and closer to zero, the RNA is more and more degraded. And I've provided a link here to 50 or 100 RNA samples that I've worked with over the years, covering a wide gamut of what these traces look like, from really, really intact to completely degraded, from different types of preparations: from FFPE material, from fresh frozen material, from cell lines, et cetera. So it gives you an idea of what the landscape looks like. A lot of cores have somewhat arbitrarily chosen a RIN score of eight as a cutoff: if the RIN score is below eight, they'll want you to either give them a different sample or at least sign off that the data quality may be negatively affected, so you're not going to come back to them complaining that your data is inadequate. General design considerations.
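As a trivial sketch of how a core's cutoff might be applied in practice, here's a toy QC gate. The sample names and RIN values are hypothetical, and the cutoff of 8 just mirrors the arbitrary convention mentioned above, not a universal standard:

```python
def rin_gate(samples, cutoff=8.0):
    """Split samples into those meeting a minimum RIN score and those
    flagged for replacement or sign-off. `samples` maps name -> RIN."""
    accepted = sorted(s for s, rin in samples.items() if rin >= cutoff)
    flagged = sorted(s for s, rin in samples.items() if rin < cutoff)
    return accepted, flagged

# Hypothetical samples: an intact cell line, a fresh tumor, an FFPE block
accepted, flagged = rin_gate({"cell_line": 10.0, "tumor": 6.5, "ffpe": 2.3})
# accepted == ["cell_line"], flagged == ["ffpe", "tumor"]
```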
So we included this link to the ENCODE RNA-seq standards document that was developed a number of years ago. It has not been updated to my knowledge, but it's still a pretty good resource for generally introducing you to some best practices and things to think about with regard to what kind of metadata to store in your RNA-seq experiments, the importance of replicates, how much sequencing depth to target, control experiments to include, standards for reporting your results, and so forth. And there have been a number of additional initiatives that have gotten underway over the last couple of years to develop standards and best practices. These include things like the Sequencing Quality Control (SEQC) consortium; the Roadmap Epigenomics Consortium, which also produced a number of guidelines; the Beta Cell Biology Consortium; and there are others. And on the wiki we provide links to some of these resources. If you're still in the design phase, where you're thinking about how you're going to do your RNA-seq, now is a good time to think about these things, like: how many replicates do I want to use? Should I use spike-in controls? And we're going to talk about some of these things in reference to the tutorials, where we have included a spike-in control and we walk you through doing a QC analysis based on it. I will say that these guidelines are very idealized. So there's a list of everything that you should do, some of which, for practical reasons or cost reasons, you may not be able to do, but these are definitely things to really think about. So I would encourage you to spend some time poring over those. As I've mentioned, there are a number of RNA-seq library construction strategies. These are choices that people make, and you should think seriously about what choice you want to make, because you're kind of going down a road, and to some degree, with regard to comparing data sets, you can't really go back.
So there are choices like: should I just sequence total RNA that's been ribo-reduced, where we've tried to remove some of the abundant ribosomal RNAs, or should I actively select for RNAs that are polyadenylated? How aggressively should I size select, and should I do that before or after cDNA synthesis? If I'm not interested in small RNAs, and I do a size selection that basically throws away everything below a certain size, I have to accept the caveat that I'm missing out on that slice of the transcriptome, but there are advantages to focusing on a narrow fragment size distribution. So there are pros and cons there that need to be considered. If you have very limited material, so sometimes in some human experiments you may have, say, biopsy material where the amounts of sample are very precious, you may want to do some kind of amplification. There are kits available for linear amplification of RNA-seq libraries. Stranded versus unstranded libraries: it used to be that most RNA-seq data was unstranded; now we're shifting towards most experiments using stranded libraries. That's a really good idea if you can do it; it's great to have that strand information. We sometimes capture RNA-seq libraries and enrich for RNA-seq fragments that actually correspond to known exons. This is a way to focus your data onto the things that you care about if you're really interested in protein coding space. It's also commonly used to rescue very problematic samples. So, for example, a lot of people working with FFPE tumor samples will do capture sequencing like this, where they produce a normal RNA-seq library, and then they basically hybridize it to an exome reagent and sequence what comes off of that. There are also some library normalization strategies that try to deal with this problem of having a really big difference between the most abundant RNAs and the most rare RNAs.
So if you're not as interested in the relative abundance of RNAs, you just want to sequence the transcriptome and see what structures are there, you might want to consider one of these approaches. All of these details can affect the analysis and interpretation strategy. And if any of these things vary between the things that you intend to directly compare to each other, that can introduce bias or problems in your comparisons, where you're seeing what you think, or hope, are biological differences, but really they're down to molecular biology differences that happened upstream of the analysis. So the libraries were made in a different way, and they're basically giving you spurious differences that you need to normalize for. Yes? [Audience question about the exome capture approach, partly inaudible.] So there were a number of groups that did this in sort of a tech dev phase, including us at WashU. And it used to be done in individual groups, using whatever exome reagent they liked to use. Now Illumina actually offers a kit specifically for this purpose; I think they call it SureSelect or something. I think the name of the kit that people use for this purpose is on the wiki. But the idea is basically that with the FFPE material, because your RNA is so heavily degraded, and often also in limiting quantities, you tend to get an output where there's a lot of noise in the data. So there are a lot of reads piling up in introns and intergenic space, and you wind up having to sequence more deeply to get a good readout on the actual RNA sequences, the reads that actually correspond to known transcripts. And by taking that cDNA RNA-seq library and hybridizing it to an exome reagent, you're really concentrating all of your fragments towards things that actually correspond to known exons. And so there are of course caveats to this.
One is that you're enriching for the things you already know about, so the exome reagent can only capture the things it was designed to capture. It may be missing genes. It may be missing an important exon of the gene that you care about. And also, to some degree, you are interfering with the natural abundance readout of the RNA-seq experiment. So normally you get a lot of reads for highly abundant RNAs because they're highly abundant, and that's what you actually want: you want the readout to be a representation of the expression levels of all of the genes. When you hybridize your library to an exome reagent, to some degree you're pulling up the things that were really low and pushing down the things that were really high. And that happens because the probes in the exome reagent that you're hybridizing to become, to some degree, saturated. They can only capture so much stuff. And so that results in this enrichment for rare things and de-enrichment of very abundant things. We haven't actually found that effect to be very pronounced, though. We've done experiments where we did standard RNA-seq, and then we did a cDNA capture RNA-seq, and then we compared those samples with regard to how highly expressed the genes were, and for a pair of samples we did comparative differential expression analysis. And we found that, yes, you do compress the dynamic range of readout from highly to lowly expressed transcripts, but you actually still get a pretty good correlation to the standard RNA-seq. The things that were the highest are still the highest, the things that were the lowest are still generally the lowest, and everything in between. And the reason for that is that it's actually very difficult to saturate the probes that you're capturing with. They're designed to be nearly impossible to saturate, by being present at a very high molarity.
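One way to picture the compression effect described above is a toy model where partial probe saturation shrinks the dynamic range but leaves the rank order intact. The power-law form and the exponent here are invented for illustration, not a model of real capture chemistry:

```python
def capture_compress(expr, power=0.7):
    """Toy model of dynamic-range compression under capture: high values
    are pulled down more than low values (a made-up power law), but the
    ordering of genes by expression is preserved."""
    return {g: v ** power for g, v in expr.items()}

expr = {"high": 10_000.0, "mid": 100.0, "low": 1.0}
captured = capture_compress(expr)
# Ranks unchanged: high > mid > low, but the high/low ratio shrinks
```

This mirrors the observation in the lecture: compressed dynamic range, yet still a good correlation with standard RNA-seq.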
This is a graphic depiction of some of the concepts we were just talking about, including example readouts of Agilent Bioanalyzer assays at various points. At the top left is a series of RNA samples run on a kind of hypothetical gel, showing everything from intact total RNA to partially and heavily degraded total RNA. And if you have done an mRNA enrichment, you now see this range of RNA sizes instead of being dominated by the ribosomal RNA peaks. Then it just walks through the steps: starting from RNA, converting it to cDNA via cDNA synthesis, fragmenting, selecting for fragments of a certain size, adding your adapters to those sequences, and then sending that library off to actually be sequenced on an instrument. This next figure depicts some of the differences between the different styles of enrichment. Starting with total RNA, you can see this depiction of ribosomal RNAs being very, very abundant. Below that you have ribosomal RNA reduction, where you basically have probes that capture the ribosomal sequences: you hold on to all the ribosomal RNAs and wash through everything that isn't a ribosomal RNA, effectively reducing the ribosomal content of your RNA sample. The alternate approach is to actively select for polyadenylated RNAs and wash everything else away, which washes away the ribosomal RNAs but also washes away other RNAs that you may care about — there are interesting RNAs that are not polyadenylated. And that's generally why the field has moved towards sequencing total RNA that's been ribo-reduced: it gives you this more complete or holistic representation of the transcriptome. And then the cDNA capture is depicted as well, where you're selecting for the exons of known genes. Okay, so we're just gonna continue on.
A number of people came up to either myself or one of the other instructors during the break to talk about stuff that you're doing, details of your experiment. In case it wasn't totally obvious, that is obviously a good idea. If you want to brainstorm about some peculiarity of what you're doing, you should absolutely avail yourself of that opportunity. No guarantees that we'll know anything about what you're doing, but it's always interesting to hear about different kinds of RNA-seq analysis, and I think you'll find that we all benefit from that kind of thing. One thing I was just talking to someone about is stranded versus unstranded libraries. If you're having your sequencing done by a core at your center, or you're sending it off somewhere else, it's a good thing to figure out whether the method they're using produces stranded or unstranded data. What that means is depicted here: you're gonna produce some RNA-seq reads and then align them back to the reference genome. If your library is unstranded, you won't know which strand each of those fragments came from — whether it was the strand that was actually being transcribed or the strand complementary to that — because we're sequencing double-stranded cDNA and it can be either. If the read aligns to a known exon, or it aligns across an exon-exon junction, you can start to get pretty confident that it probably came from the strand where that known transcript is. But you don't actually know that; you're just inferring it. Stranded libraries attempt to actually maintain the information about which strand the original single-stranded RNA came from. And depending on how the library is made, there's different molecular biology for how they make that work.
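As a toy illustration of how strandedness can be checked after alignment (tools like RSeQC's infer_experiment.py do a real version of this against a BAM file and gene annotations), here's a hypothetical sketch: tally how often a read's alignment strand agrees with the strand of the annotated gene it overlaps. The alignment tuples here are invented for the example.

```python
def strand_agreement(alignments):
    """Fraction of reads whose alignment strand matches the strand of
    the annotated gene they overlap. Near 1.0 (or 0.0, depending on
    the protocol) suggests a stranded library; near 0.5 suggests
    unstranded."""
    matches = sum(1 for read_strand, gene_strand in alignments
                  if read_strand == gene_strand)
    return matches / len(alignments)

# Hypothetical toy alignments: (strand the read aligned to, strand of
# the gene it overlaps).
stranded_like   = [('+', '+')] * 96 + [('-', '+')] * 4
unstranded_like = [('+', '+')] * 52 + [('-', '+')] * 48

print(strand_agreement(stranded_like))    # 0.96 -> looks stranded
print(strand_agreement(unstranded_like))  # 0.52 -> looks unstranded
```

Running a check like this on a small sample of your own alignments is a quick way to confirm what the sequencing core actually delivered.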
But the important piece is that when you get your data in the end, there's an expectation, when it's aligned, of which strand it came from, and that information can be marked in the alignment and visualized in your browser. You'll see an example of this when we look in IGV: with stranded libraries, the reads tend to really line up with the strand you would expect for the gene they're aligning over. And if you don't have a stranded library, you'll see an equal mix of reads corresponding to the two strands. If you're interested in understanding, for example, sense–antisense transcription patterns, you will only be able to do that if you have a stranded library. Most of the people generating RNA-seq data today are selecting a stranded approach. It simply wasn't very practical before, because the kits were very experimental and had problems, but now the wrinkles have been ironed out. Another issue that comes up commonly is replicates. Should I use replicates? How many replicates? What kind of replicates, and so on? There are a couple of different kinds of replicates you might want to think about. Technical replicates would be, I would say, a replicate of the process — the sequencing process. Do I need to worry that I have some data from lane one and some data from lane two? Or I sequenced a sample and later decided I wanted a bit more data, so I get more data from another flow cell that came a week or a month later? Generally the Illumina platform has become quite robust and reproducible, and these kinds of technical replicates are not needed to control for lane-to-lane or flow-cell-to-flow-cell differences. It's pretty darn consistent.
We periodically circle back and do an experiment where we sequence the same sample on two lanes of a flow cell, or on two consecutive flow cells, and you get really, really stellar correlations. You can basically treat that part of it as a commodity now. But biological replicates are of course a completely different story, and it really depends on what you're doing. You absolutely want them. How many you want will depend on how much variability there is in the system you're studying, and in some cases it may be difficult to achieve the number of replicates you would want by statistical principles, just because of the nature of the biology you're studying. But you should try your best to have replicates, and the example analysis we're gonna do will assume the possibility of having replicate data. Some common goals of RNA-seq analysis — what can you ask of the data? The answer to a lot of technical questions depends on what your goals are. So when people ask how much data they need, or how many replicates, it to some degree depends on what you're hoping to get out of the experiment. A lot of people are doing gene expression or differential expression analysis, or alternative expression analysis. Maybe it's a transcript discovery and annotation exercise. You're looking for allele-specific expression, or you're trying to discover mutations in cancer. There are a lot of people doing fusion detection, RNA editing, and there's quite a long list of these applications. There tend to be tools tailored to each of these areas, which gives you a lot of tools. You're not generally gonna find a one-tool-does-it-all situation in the RNA-seq field; there are a lot of tools geared towards particular types of analysis.
That's kind of unfortunate in a way, because it makes the analysis landscape quite complex. To do what you might consider a comprehensive analysis of your RNA-seq data, where you ask all of the obvious questions — or what seem like obvious questions biologically — you may need to use quite a few tools, and they're all developed by different labs, in different programming languages, and documented to varying degrees of crappiness and so forth. So that makes it a lot of work. That's why you kind of need to become a bioinformatician to some degree. But the good news is that they generally share a theme in their workflow: they all follow the pattern of obtaining raw data, maybe doing some basic quality assessment of that raw data, then either aligning or assembling the reads — those are different things, but conceptually similar — and then you process that alignment or assembly with a tool specific to your goal. This is the point where things really start to diverge: a tool for expression analysis, a tool for splicing analysis, a tool for allele-specific expression analysis, a tool for fusion detection, and so on. And then there will be some kind of post-processing — the tool isn't going to just write your paper for you. It will produce output files in its own peculiar format, and you'll be hunting through documentation trying to understand how to interpret those files. Then you'll probably be feeding them into some kind of downstream statistical or visualization software like R, MATLAB, or Cytoscape, and we're gonna show some examples of that as well. Finally, you'll be summarizing and visualizing what comes out of those finishing analysis platforms, creating prioritized gene lists, designing validation experiments, and so on.
So Ann mentioned Biostars. How many people here have used Biostars before? Okay, so about a quarter to a third of you, maybe. We just have this little exercise to quickly, in five minutes, check out the Biostars website so you all have a familiarity with it. If you don't have an account, I would encourage you to actually create one — you will need it in order to ask a question down the road. It's really easy. If you already have a Google or Yahoo or whatever account, you just click a button and authorize Biostars to use that system for its authentication, so you don't really need to create a new account in that sense. Then just spend a few minutes — we'll literally do it for just four or five minutes — try to search for a question that seems useful to you and give it a vote or something, just to see how the interface works. Generally when these courses end, we often get questions from students, and we encourage the question to be asked on Biostars so we can answer it there. It's a way of publicly answering that question so that other people can benefit from it, and so that we don't keep repeating the same question-and-answer sessions. It's also a good way to get feedback from the community, and potentially updated answers to common questions over time. And it's by far the most popular bioinformatics question-and-answer forum out there, I would say. Okay, so maybe we'll just go through the last few slides and get closer to where we can actually do the first hands-on exercise. You've just seen a forum where you can ask and answer questions. The last few slides just go through some of the most common questions, because from experience we've been asked these questions over and over again.
So it's good to have a brief discussion about each of them, and then if you have any follow-up questions you can ask them and everyone can participate in that discussion. The first is: should I remove duplicates for RNA-seq? For people who've done whole genome, exome, ChIP-seq, or almost any of the other types of -seq, it's very, very standard practice to mark duplicates after you align your reads. What this means is: you align your reads to the reference genome, and then any read alignment that starts and ends at the same position — or even just starts at the same position — is marked as a duplicate, under the assumption that that alignment may actually be a PCR amplification artifact and not a unique observation of a distinct fragment from your sample. And the reason you can do this in whole genome sequencing is that you're sequencing the genome to, say, 30 to 40x coverage across the reference, and each of those reads is a paired-end fragment where you have, say, 100 bases that aligned here, 100 bases that aligned there, and maybe 100 bases in between — so a 300-base fragment. The probability that two independent fragments would start and end at the same place, with the exact same insert, everything identical, is very low in terms of the random sampling of molecules. So when you do see duplicates like that, it's more likely that they were introduced during amplification in library construction. So you mark those, choose one as a representative, and ignore the rest.
And so you'll see a lot of workflows out there — "best practices" based on, say, the Picard toolkit or others — where, without really even thinking about it, they just say: remove your duplicates, mark your duplicates, it's something of course you would do, it's very important, everyone just do it. But the one place where that's not really a good idea is RNA-seq. It's a more complicated question for RNA-seq than it is for DNA, and people who aren't doing RNA sequencing don't necessarily appreciate that. The concern is that, because of some of the unique characteristics of RNA sequencing, reads that you think are potentially PCR amplification duplicates could actually be real duplicates that just happened by chance. Think about a really highly expressed short gene. Say you have a gene that's only 400 bases long; it makes a very important but short protein, and it's very, very highly expressed — almost ubiquitously expressed in the cells you're studying. If you think about that 400-base RNA molecule, there are not actually that many ways to make a unique fragment out of it, because it's short, and because there are so, so many copies in the cell, you actually expect duplicates to happen just because of its abundance and its short length. Anything that's really abundant, you start to expect duplicates simply because you're sampling it very deeply — and you're sampling it deeply because it really is abundant. In that scenario, if you mark those duplicates and ignore them, you're underestimating the abundance of that RNA. You're basically putting a ceiling on the dynamic range of expression readout you can get from your experiment, which prevents you from accurately representing the difference between lowly and highly expressed things. For that reason, people generally don't mark duplicates in RNA-seq data.
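A back-of-the-envelope way to see why chance duplicates are expected for short, abundant transcripts: model fragment starts as uniform random draws over the possible start positions. This is a simplified sketch with hypothetical numbers — not a model any particular tool uses — but it captures why the 400-base gene above is essentially guaranteed to produce duplicates by chance.

```python
def expected_natural_duplicates(num_reads, num_positions):
    """Expected number of duplicate reads if fragment start positions
    are drawn uniformly at random from `num_positions` possible starts
    — i.e. duplicates that arise purely by chance, with no PCR
    artifact at all."""
    expected_distinct = num_positions * (1 - (1 - 1 / num_positions) ** num_reads)
    return num_reads - expected_distinct

# A short (~400 bp), highly expressed transcript: few possible
# fragment starts, many reads, so chance duplicates are guaranteed.
short_hot = expected_natural_duplicates(num_reads=50_000, num_positions=300)
# A long, modestly expressed transcript: chance duplicates are rare.
long_cool = expected_natural_duplicates(num_reads=200, num_positions=10_000)
print(f"short/abundant: {short_hot:,.0f} of 50,000 reads duplicated by chance")
print(f"long/modest:    {long_cool:.1f} of 200 reads duplicated by chance")
```

In the first case nearly all reads would be flagged as "duplicates" even though every one is a genuine, independent observation — which is exactly the abundance information a duplicate-marking step would throw away.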
That comes with the caveat that you may have amplification artifacts making their way through, but that's probably the lesser evil. Another common question is: how much library depth do we need? This is probably the most common question, and there are good reasons for asking it, the most obvious being that it influences cost: Illumina data costs more for more data. Generally you want to balance the need for deep sequencing against the need to have more replicates, or to just be able to afford the experiment at all, or to do other things with that money in your lab. But to answer how much you actually need — it depends on a number of factors. For example, it depends on what question you're asking of the data. If you're doing gene expression analysis, that's probably one of the applications that places the least demand on sequencing depth. The reason is that you're just trying to get a readout of the relative and absolute abundance of transcripts. A lot of these transcripts are fairly large, so there are a lot of fragments that can be derived from them, and you can get a readout of the relative abundance of transcripts without completely plastering every transcript from end to end with 50x or greater coverage. But when you start to do things like alternative expression analysis or identifying individual point mutations, you don't just want a bunch of reads to hit each transcript — you want every transcript to be covered nice and deeply from end to end, even lowly expressed transcripts. And that means the total amount of data you need to generate goes up. Yeah. Well, you change the depth simply by sequencing more. In practice, you'll often pool three or four or ten samples into a single lane of Illumina, so the amount of pooling you do influences the total amount of data you get out.
And if you want even more data than you would get in one lane, you might decide to do two lanes of sequencing, or a whole flow cell, for one sample. That's how you adjust the total amount of data you get. In terms of assessing it, you can of course just count the number of reads that were generated. But a lot of people will also align those reads to their reference genome and then assess how many mapped reads there were, or how many spliced mapped reads, depending on the kind of question they're asking. We're gonna do that kind of basic QC in the hands-on exercises. Yeah. You can definitely know how much data you're gonna get from a lane, because that's published knowledge based on the instrument. It depends a little bit on which Illumina platform you're using — whether it's a HiSeq 4000, 2000, or 2500, or a MiSeq — but that information is readily available. There's a resources section in the RNA-seq Wiki that we're gonna go through, and there are some papers that have specifically addressed this question: okay, I mostly care about gene expression, how many reads should I target, and what is the balance between more depth on one sample versus having more replicates to compare, based on some basic assumptions. The kinds of numbers that have come out of those experiments are things like: you generally want 20 to 40 million reads per sample for simple gene expression analysis, assuming you don't have any major sample quality issues. That's aligned reads — reads that actually successfully aligned to the genome, which will often be about 95%. Yes, yeah, that's assuming you've gone all the way through. But of course the caveat is that that's for a fairly focused question — gene expression estimation.
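As a quick worked example — assuming the roughly 95% alignment rate mentioned above, which will vary by sample and species — the 20-to-40-million aligned-read guideline translates into slightly more raw reads to request from the sequencer:

```python
def raw_reads_needed(target_aligned, alignment_rate=0.95):
    """Raw reads to request so that the expected number of *aligned*
    reads hits the target (e.g. the 20-40M guideline for simple gene
    expression analysis)."""
    return target_aligned / alignment_rate

low = raw_reads_needed(20_000_000)    # ~21M raw reads
high = raw_reads_needed(40_000_000)   # ~42M raw reads
print(f"{low:,.0f} to {high:,.0f} raw reads per sample")
```

If your samples are degraded (FFPE, say) and a larger fraction of reads fall in introns and intergenic space, the effective alignment-to-transcripts rate drops and the raw target should rise accordingly.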
The trade-off with replicates also assumes that you have more samples. If you have a hundred samples in the freezer, people will generally argue that you're better off sequencing some amount of all hundred of those samples rather than getting really great coverage of 20 of them — unless that pushes you below that 20-to-40-million-read threshold, and assuming you're interested in just gene expression analysis, where you're hoping to get almost a microarray-style expression readout. But of course, as soon as you start asking other questions of the data, you may find that that level is inadequate. I can tell you that for our precision-medicine-style experiments with patient tumors, we still generally sequence either a whole lane or maybe half a lane on today's instruments. The slide here says one to two lanes, but the capacity has increased to the point where we're often doing two samples per lane — one or two, depending on RNA quality. If everything works out well, that gives you the ability to do a lot with that data. You can do gene expression analysis, but you can also assemble a transcriptome, do splicing analysis, and have pretty good confidence of detecting whether a particular SNP was expressed in a particular gene, unless that gene is very, very lowly expressed. You can do most of the kinds of things we've talked about wanting to do with RNA-seq with about half a lane of data now on the latest instrument, which is still a non-trivial expense — that'll cost you, I don't know, $600, $700, $800 maybe. Next question: can you detect small RNAs in RNA-seq data? The answer is yes, but it depends. Really small RNAs are usually processed completely differently. People interested in microRNA analysis will generally make a microRNA library, which is made quite differently from a standard RNA-seq library.
Not crazy complicated or anything, but it's pretty typical in a normal RNA-seq library to do a size selection step after library construction, which will actually remove a lot of small RNAs. This is done on purpose, to remove a bunch of small things that are abundant, and people interested in protein-coding genes are generally willing to just toss all that stuff away because it improves the quality of data for the rest of the transcriptome. It also makes it easier to load the instrument. The Illumina instruments actually work best with a relatively tight size distribution of fragments. If the fragments are all around, say, 250 to 300 bases, it's easier to balance the cluster density on the instrument so you can put the optimal amount of fragments onto the flow cell while still getting good quality data. If you overload the flow cell so that fragments are just everywhere, the clusters start to overlap each other, and this actually interferes with the running of the instrument — it introduces chimeric reads, sequencing errors, et cetera. So you want the flow cell to be, to some degree, sparsely populated. I showed a picture of a flow cell — you can kind of see in the background here, each of these dots is a cluster of molecules on the physical flow cell being imaged from above while the sequencing-by-synthesis reaction is happening. If those points get too close together, it starts to interfere with the sequencing quality, and if you have a mix of fragment sizes, it's harder to get that balance right. There's a bias towards smaller fragments forming clusters more easily, so you actually wind up disfavoring the larger fragments, which generally produce higher quality reads. So almost everyone makes that division: if you're interested in small RNA, you do a small RNA library.
If you're interested in the rest of the RNA, you do a standard RNA-seq library, which involves removing the small things. If you're interested in both, you do both of those things — not in one shot; you split the experiment into two. A chimeric read: normally with a read pair, the first read corresponds to the beginning of your physical cDNA fragment and the second read corresponds to the end of the same fragment. A chimeric read is where the two reads are actually from different fragments but have been erroneously paired together during the sequencing. So it looks like read one corresponds to one gene and read two corresponds to a different gene in a completely different area, maybe on a different chromosome. That could really be from a translocation event, but you also get a bunch of artifactual chimeric reads like that when certain things go wrong during library construction or the running of the instrument. You can also have partially chimeric reads, where part of the read corresponds to one gene and the other part corresponds to another gene, and it's not from real biology — it's from something that went wrong during library construction or sequencing. And that makes your data very not good. Any other questions about that? Yes. [Question: if samples are pooled, one sample may end up with 30 million reads and another with 40 million — should the depth be equal across samples?] So say you have three samples for condition A and three samples for condition B; the simplest experimental design is of course to do everything the same. If it works out that you wind up with 30 million for one, 40 million for another, and then 35 million, the analysis methods are gonna account for those differences in library size, and generally it won't make a big difference.
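The kind of library-size adjustment the analysis methods perform can be sketched in its simplest form as counts-per-million scaling. Real differential expression tools use more sophisticated normalizations (TMM in edgeR, median-of-ratios in DESeq2), but the idea is the same; the counts below are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw gene counts by library size so samples sequenced to
    different depths (30M vs 40M reads, say) become comparable."""
    total = sum(counts)
    return [c / total * 1_000_000 for c in counts]

# Hypothetical raw counts for the same three genes in two samples of
# different depth. After CPM scaling, the relative values line up.
sample_a = [300, 600, 99_100]     # shallower library (100K reads total)
sample_b = [400, 800, 132_100]    # deeper library (133.3K reads total)
print(counts_per_million(sample_a))
print(counts_per_million(sample_b))
```

The raw counts differ by a third between the two samples, but after scaling, each gene's value is nearly identical — which is why moderate depth differences between pooled samples aren't a problem.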
If you systematically have less data for one of your conditions than the other, then you will in a sense have differential sensitivity: you may be able to discover things better in one condition than the other, and that's not optimal, but it won't really prevent you from doing the analysis. It's not the end of the world, as long as it's not too extreme. You wouldn't want a scenario where one of your samples didn't have sufficient depth to really do the experiment properly, and you ideally don't want a systematic difference between the things you're comparing, because that will potentially introduce bias that needs to be normalized. But if it's a little bit variable here and there, it's fine. You don't need to throw away reads, for example, to make sure they're balanced or anything like that. Usually you can just deal with the minor differences during the analysis, and it'll happen automatically with a lot of these tools. Any other depth-related questions? Yes. Well, unless there is some problem with the complexity of that sample, you can make a very, very strong statistical argument that were it there, it would have been seen — unless something has really gone wrong. If you have 100 million reads for your normal and 100 million reads for your tumor, and nothing seems to have gone wrong — globally you see a comparable readout, generally the same genes being expressed — but in a particular gene you see none of it in the normal and lots of it in the tumor, then unless something has gone wrong, that should not happen unless that gene really was being suppressed in that sample. And then of course you can do a validation experiment if you're not convinced.
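That statistical argument can be sketched with a simple Poisson calculation — a simplification, since real tools also model overdispersion across replicates — but it shows the shape of the reasoning: if matched depth implies you'd expect some number of reads from a transcript, the chance of seeing literally zero falls off exponentially.

```python
import math

def prob_zero_reads(expected_reads):
    """Poisson probability of observing zero reads for a transcript
    whose expression level would, on average, yield `expected_reads`
    reads at the sequencing depth used."""
    return math.exp(-expected_reads)

# If the tumor's expression of a gene implies you'd expect even 20
# reads in the normal at matched depth, observing none is essentially
# impossible by chance, so absence is strong evidence of suppression.
for mu in (1, 5, 20):
    print(f"expected {mu:>2} reads -> P(zero by chance) = {prob_zero_reads(mu):.2e}")
```

This is also why the argument weakens for lowly expressed genes: at an expected count of 1, zero observed reads is entirely unremarkable.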
But RNA-seq has become so robust and so widely used that even a lot of manuscript reviewers are not really asking for qPCR validation anymore. We actually used to have two slides that were basically "should I believe any of this?", and we took them out this year because everyone believes it now. But there is a reference to a paper where we did a bunch of these validation experiments to try to convince the skeptics, and that's also referenced in the Wiki if you encounter that problem. There are a few weird scenarios where you can get this happening because of something that went wrong in the analysis, so skepticism is good. I mean, usually this... Yeah, so the statistical argument gets weaker and weaker the smaller the sampling is. But the methods that do these analyses are supposed to be taking that into account, and they are generally quite conservative, actually. That's an interesting question — sort of like the housekeeping-gene idea, except with genes that are reliably but lowly expressed. That's a good idea. I haven't seen someone who has assembled a set like that — like, hey, use these genes. We have used this concept in a kind of anecdotal way by looking at the telomerase gene. Telomerase is kind of famous. It's a little bit complicated in tumor analysis because you expect telomerase maybe to be upregulated, but even when upregulated it tends to be expressed at a very low level compared to other genes, and in a normal cell it's there at a few copies per cell. So it's a good example of a gene where you have an a priori expectation that it will be functional and important, but at a relatively low copy number. If you see decent coverage of it, that gives you confidence that you're actually detecting low-copy-number transcripts robustly. But it would be nice if it were a little less anecdotal.
If you had 50 or 100 genes like that, you could get a better sense. Another thing we're gonna walk through is the concept of a spike-in. You can spike in a series of artificially constructed sequences, including ones at a very, very low level, to give you a sense of: okay, I know this thing was in my pool at a very low level relative to everything else, and I'm still able to detect it. That makes me more confident that I know where the sensitivity drop-off is for my experiment, and that's one of the great things about doing those experiments. In terms of analysis artifacts: if you don't see a gene, or you see it at very, very low copy number, and you want to make yourself a little more confident that it isn't just an artifact of a tool breaking or something, one thing you can do is visually inspect the alignments and make sure you see some alignments in that region and not literally zero. Because the thing about RNA-seq data is that it's fundamentally noisy — the transcriptional machinery is itself noisy — so you generally expect a low-level scattering of reads everywhere. If you see a big desert, you might worry: maybe the alignments are not working here, or there's something wrong with my reference genome, or reads somehow got filtered out in some automated step, and that's giving me a spurious zero expression value that's not real. And that can happen; we'll talk about a specific example of that for Cufflinks, actually. [Comment:] I can say you can look into public databases to see whether the gene is expressed in normal tissues or not. The best data that I've found is the GTEx data, which has 17 different tissues from human samples. — What data? GTEx? — GTEx. So GTEx is a consortium that's building a kind of normal tissue atlas of RNA-seq expression.
The sequencing is being done at the Broad, and they are depositing it in dbGaP, and some of the information is available directly on their project website. So you can download the compendium of RNA-seq across a number of individuals from a number of tissues, and they have corresponding genome sequence data for those same individuals, so if you want to understand some relationship between alleles and expression values, you can start to do that kind of analysis. It's a large data set, but it's probably worth applying for access. And I think you can just get FPKM values — that simple analysis has been pre-calculated, so you don't actually have to run it all through a pipeline. [Comment:] The FPKM values are freely available, so you don't have to apply for access for those. Yeah. You could use that data to select your set of 50 commonly but lowly expressed genes, or to play around with that idea. Yeah, that's a good point. Any other questions? Mapping strategy. So I asked earlier about the length of the reads. It used to be that we had a mix of really short reads and longer reads, and that would influence the alignment strategy we would take, because of this challenge I mentioned about aligning reads across large introns. But this problem is kind of going away, as everyone has consolidated on reads that are at least, say, 75 to 100 bases long. That's enough to give you a pretty decent chance of aligning across large introns a reasonable amount of the time. So generally we're gonna focus on spliced aligners. There are a number of spliced aligner options with different advantages and disadvantages, but we're not gonna explore the earlier concept of using a non-spliced aligner anymore, because most people just don't have that kind of data. But if you are interested in that, we can talk about it as well. Yeah. [Question:] Our reads are 101 bases, but after trimming that's not what the data looks like.
Some of them, some of the reads might be shorter, like 30, because if you are trimming the reads, then whatever is removed shortens them. So the reads range from 30 to 101. So in a case like that, it's not equally distributed. Mm-hmm. I guess in that case it depends on what that distribution looks like. So if after trimming most of your reads look like, like they all started at 100 and most of them are below 50 after trimming, I would wonder what happened there first of all. It's possible the size of the fragments that were fed into the experiment in the first place was problematic, or maybe small things were being targeted on purpose. In which case, yeah, once you get into that range where say over half of your reads are 50 bases or shorter, you're probably gonna wanna start thinking about a non-spliced alignment option. If say more than half of your reads are over, say, 75 to 100 bases, then I would still stick with a spliced alignment option. And if you're kind of in a gray zone, you might try a pilot analysis where you did it both ways and you kind of compared the results. Yeah, that's a good point. Trimming sort of potentially changes the landscape of things from your raw data to your trimmed data. And we're gonna talk a little bit about trimming in the hands-on exercises as well. Everyone, almost everyone, had a reference genome. If you don't have a reference genome, it does introduce a challenge. You should encourage the powers that be to sequence the genome of your critter or plant or whatever, or do it yourself. Sometimes that's just really, really hard or very expensive and you just have to get by without it. But there are a number of tools out there that are reference-free. And some of them were designed by people who actually didn't have a reference. In other cases, it's just sort of luck that someone designed an approach that doesn't use the reference for performance reasons.
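To make that spliced-versus-unspliced call concretely, you can look at the post-trimming read-length distribution. A minimal sketch, using a two-read toy FASTQ (on real data you'd point the awk one-liner at your trimmed FASTQ file instead):

```shell
# Toy trimmed FASTQ: in FASTQ, the sequence is every 4th line starting at line 2.
cat > trimmed.fq <<'EOF'
@read1
ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@read2
ACGTACGTACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIII
EOF

# Fraction of reads that are still at least 50 bases after trimming.
# If this fraction is low (most reads below ~50), a non-spliced aligner
# may be the safer choice; if it's high, stick with a spliced aligner.
awk 'NR % 4 == 2 { n++; if (length($0) >= 50) long++ }
     END { printf "%.2f\n", long / n }' trimmed.fq
```

Here one of the two toy reads is 60 bases and one is 28, so the script reports 0.50; a real trimmed library would give you the actual picture of how much trimming reshaped your data.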
And we're gonna show an example of that, a tool called Kallisto, which doesn't rely on the reference genome and is able to sort of simplify the problem by not aligning to a reference genome, and produce an RNA-seq analysis output in a much, much shorter time than the workflow that we're gonna show you. It doesn't cover all of the same analysis space that this workflow does, but if you just care about gene expression estimation, it's a pretty good option. And it may be your only option if you don't have a reference genome. This is on, so, I guess you guys have a printout, so you can't click on the link. But all of these slides are available on the wiki, and the wiki also includes a list of the supplementary tables. The supplementary tables refer to a manuscript that we wrote last year, actually, sort of to describe this workshop. And that's what the supplementary tables are referring to. And on the RNA-seq wiki, there's a citation page that includes links to all the supplementary tables. Okay, so each of the hands-on exercises or tutorials has a corresponding slide deck. And these are kind of like reference materials to sort of refer back to while you're doing the workshop. I'm just gonna really blast through it quickly because I feel like we've already had so many slides and it would be more fun to get into actually doing stuff at the command line. But the first one has a bit of sort of just generic information, a summary of the learning objectives here. So what we're gonna do is walk through the exercise of installing a bunch of bioinformatics tools, and there's sort of an art to that. You'll see kind of how that goes. And then we're gonna start obtaining sort of the prerequisite input materials to an RNA-seq analysis. So we're gonna obtain a reference genome sequence. We're going to download gene transcript annotations and talk about the file formats that are involved here.
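For orientation, the Kallisto workflow mentioned above is essentially two commands: build an index from transcript sequences (note: a transcriptome FASTA, not the genome), then quantify each sample against it. This is just a command outline, not runnable as-is; the file names here are placeholders for your own transcriptome FASTA and read files:

```shell
# Build a Kallisto index from a transcriptome FASTA (placeholder file name).
kallisto index -i transcripts.idx transcripts.fa

# Quantify a paired-end sample against that index; abundance estimates
# land in kallisto_out/abundance.tsv.
kallisto quant -i transcripts.idx -o kallisto_out reads_1.fastq reads_2.fastq
```

Because it skips full alignment, this typically finishes in minutes where the alignment-based workflow takes hours, which is why it's attractive when expression estimates are all you need.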
We're gonna index the reference genome to be used with an aligner. So this is a way of basically organizing the information that's in the reference genome in a way that allows alignment to happen very efficiently. The index is very much like an index in a library. And then we're gonna download the sequence data that we're gonna use for the exercises. So we have some demonstration data that's been set up for you guys to play around with. Some common problems that you may encounter while working on these tutorials. So all of this is gonna happen at the command line. We'd encourage you to type the short commands yourselves if you like, but we've learned from the past that in order to get through all the steps smoothly, you may wanna really copy and paste some of the long commands, just because they're so long. And at strategic points, we're also gonna have sort of practical exercises where you're not given the commands at all and you're gonna be required to figure them out and type them out on your own. But to get through the sort of demonstration of this whole working pipeline, where every step relies on the previous steps, for these big, long, multi-line commands, it's really probably easiest, to avoid headaches, to just copy and paste them. A common, so there are errors related to the copying and pasting that sometimes happen, where you don't quite select the whole thing when you copy it, and then you paste part of it, and generally that doesn't work out too well. So try to make sure that you're copying the entire command when you do that. Being in the wrong directory at the wrong time. So a lot of these commands are assuming you're gonna operate on some files in the current directory. You're gonna run an alignment command and then you're gonna summarize those files.
If you move out of that directory between steps when the tutorial didn't ask you to do that, the command might fail, because it's sort of built with the expectation that you're in a particular location at a particular time. So just be cognizant of that if you're kind of navigating around and thinking, okay, I wanna look at the results that are in this directory. If you weren't asked to do that, try to remember to go back to where you were before moving on to the next step. And you'll see in the tutorials there are a lot of places where there's a change directory command that's intended to kind of put you back where you need to be in case you have kind of wandered off. We're gonna set this RNA_HOME variable. Sometimes that isn't set, so that can cause problems. We'll show you how to do that. This presentation provides a really brief description of some of the steps, but the wiki has more complete instructions. And then in the wiki, when you see this hashtag or pound sign, those are comments. So they're interpreted at the command line just as a comment and nothing is executed. So you can safely paste those in, or you can just read them and ignore them. But all the other lines that are in these sort of code blocks that you'll see in the wiki are things that need to be executed in order to work your way through the hands-on exercises. And then we've tried to annotate commands with sort of comments that explain what's going on, but a basic familiarity with Linux is assumed. However, if you see a command that has options you don't understand, we're happy to sort of walk through what each command means and what each of its individual components are. These are some of the tools we're gonna install, just for reference. There are some links there in case you wanna read more of the documentation on each of them. We're gonna obtain a reference genome.
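Setting that RNA_HOME variable looks like this (the exact path below is just a common convention for the course workspace; substitute wherever your files actually live):

```shell
# Point RNA_HOME at your working directory and make sure it exists.
export RNA_HOME=~/workspace/rnaseq
mkdir -p "$RNA_HOME"

# Many tutorial steps begin by jumping back to a known location like this,
# which rescues you if you've wandered off into another directory.
cd "$RNA_HOME"
pwd
```

Putting the `export` line in your `~/.bashrc` means the variable survives logging out and back in, which avoids the "sometimes that isn't set" class of problems.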
And then for this exercise, we're just gonna use a single chromosome, but nothing about what we're doing would really change if we were using the whole reference genome. It would just be slower, but all the commands would basically stay the same. So this is sort of a reasonable demonstration. We're gonna obtain known transcript annotations. And again, these will be genes that are annotated on the same chromosome that we're focusing our analysis on. And we've chosen a small chromosome to also make the analysis go faster, so we're not waiting a long time for commands. We're gonna get gene annotations from a particular place, the iGenomes project, but there are lots of sources of these annotations, and to some degree it will vary depending on what species you're studying. We're gonna talk about the file formats that are used to describe these gene annotations. We're gonna create an indexed reference genome. As I mentioned, the one sort of thing to remember about these indexes is that they're generally particular to the alignment algorithm. So each alignment algorithm builds one of these indexes and they're not interoperable, generally. So if you use different aligners, you'll generally see this pattern of producing an index for each aligner. And sometimes even with different versions of the same aligner, you may need to produce a new index of your reference genome to work with the new version. The RNA-seq data, so I mentioned this briefly before, we have sort of two RNA sources for the RNA-seq data. So these are samples that we sequenced at WashU for the purposes of having data to play around with at this course. One is the universal human reference, which will be abbreviated UHR throughout, and the other is the human brain reference, which will be abbreviated HBR. And then each of these samples has a spike-in. So this is the series of control sequences that I mentioned.
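To give a flavor of the annotation file format ahead of time: GTF files are tab-separated with nine columns (chromosome, source, feature type, start, end, score, strand, frame, and a free-form attributes column). The single-exon record below is invented for illustration; real files from a source like iGenomes have millions of such lines:

```shell
# A made-up single-exon GTF record.
cat > genes.gtf <<'EOF'
chr22	ensembl	exon	100	500	.	+	.	gene_id "GENE_A"; transcript_id "TX_A1";
EOF

# Pull the gene_id out of the attributes column (column 9):
# splitting on double quotes puts the first quoted value in a[2].
awk -F'\t' '$3 == "exon" { split($9, a, "\""); print a[2] }' genes.gtf
```

The attributes column is where most of the useful metadata lives (gene and transcript IDs, gene names), so little extractions like this come up constantly when working with annotations.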
So there are about a hundred sequences that are, I believe, bacterial, and they're spiked in as a set, but they've been designed so that they sort of cover a range of abundances. So there are sequences that are in there at very high copy number, there are other sequences at very low copy number, and a bunch of steps in between. And the ratio is known. Because it was created artificially, we know the ratio of all of these molecules to each other. And then there are two mixes, mix one and mix two. And between mix one and mix two, they've adjusted the sort of order. So something that's highly abundant in mix one, you have the same sequence in mix two, but they put it at a lower abundance, and vice versa. So this allows you to do both a sort of assessment of the absolute expression readout across the series of sort of standards, as well as a differential comparison, where in both cases you have a prior expectation about which sequences are the most abundant and the least abundant. And you also have an expectation, when you do this comparison between the two mixes, that there will be expected differential expression fold-change values between the two mixes. So there's just more background information on the sort of biology and source of these samples that are used for the example data. And we're gonna do some pre-alignment QC. So this is something we get a lot of questions about, sort of how do I know whether my data is good or not. So we're gonna try to sort of walk through some of those points as well in the first exercise. And that's it.
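As a tiny illustration of the mix-one-versus-mix-two expectation described above, here is a sketch with invented spike-in concentrations (real spike-in kits publish tables of the actual known concentrations, which you would use instead):

```shell
# Invented spike-in table: sequence, concentration in mix 1, concentration in mix 2.
cat > spikes.tsv <<'EOF'
SPIKE_1	4000	1000
SPIKE_2	500	500
SPIKE_3	250	1000
EOF

# Expected mix1/mix2 fold change for each spike; the observed ratios in
# your UHR-versus-HBR style comparison should track these expectations
# if the experiment and analysis behaved.
awk -F'\t' '{ printf "%s\t%.2f\n", $1, $2 / $3 }' spikes.tsv
```

Plotting observed versus expected fold changes for all of the spikes is a quick global check that your differential expression pipeline is well calibrated.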