Yeah, so I'm Malachi, and this is Obi, and Fuad. Obi and I are from Washington University School of Medicine in St. Louis, but we used to be at the genome center in Vancouver, so we're from two of the other genome centers in North America besides this one, which all have a lot of similarities. And Fuad is in Toronto at OICR. Is that good enough? Okay, so the usual preamble. I guess Francise or Michelle probably went over the sort of Creative Commons philosophy of these courses. So this workshop that you're going to do over the next two days is actually part of quite a large set of materials that are all available online. So we have the course wiki, which provides basic information about this class and the schedule and things. This RNA-seq workshop has its own wiki on GitHub, and everything that you'll see, and much, much more, is all available there for you to peruse later. It's also all available under a Creative Commons license, and we'll go through that wiki and give you an introduction, orienting you to all the materials that are available there. So we're going to skip what we're calling module zero, which is an introduction to cloud computing, because you guys have already been introduced to the cloud and you've been on AWS, and we're just going to jump straight into module one, which is the introduction to RNA sequencing. The learning objectives of this module are basically to go over the theory and practice of RNA-seq analysis, and that includes some background rationale for why you would do RNA sequencing in the first place. So who here, just a quick show of hands, is doing RNA-seq or planning RNA-seq in the near future? Wow, everyone? Okay. That's what we're doing. I see. You're also doing DNA-seq and things too, so the other modules were also, I assume, very relevant to some of you? Yep. Who is only interested in the RNA-seq?
Kind of. I mean, you don't have to, like, okay. So that's interesting. Okay. So everyone does RNA-seq, so the rationale will be rather pointless for this crowd. You guys are obviously all on board with it. But I guess you also have to face some of the challenges that are specific to RNA-seq, some of which I'm sure you're very familiar with. We're going to go over some of the general goals and themes of RNA-seq analysis workflows. You'll probably get a sense by the end of these two days that there's a ridiculous number of bioinformatics tools and proposed workflows or pipelines for doing RNA-seq analysis. We are just going to show you one example, which is a popular one, but it is chosen from among many possibilities that each have their merits and disadvantages. And the hope is that the skills you learn will be generally applicable to developing your own analysis strategy when you get back to your own labs. We're also going to talk a little bit about where to get help outside this course. So use this time wisely and ask us all the questions that you've been dying to ask about RNA-seq, or specific problems that you have with your type of data or whatever species you're working on, because you only get the two days. So we'll talk about some things that you can do after those two days. And at the end of this introduction, we'll talk about the hands-on tutorial a little bit, which we'll start right after the intro lecture. So this is probably all review. Sorry, that doesn't display that great. It's okay. But just to orient ourselves by reviewing the central dogma here. So this is a depiction of a simple gene model with three exons and two introns, a promoter region, and a polyadenylation site. So this is genomic DNA. This thing gets transcribed into a single-stranded pre-mRNA molecule that still has the introns intact, but now we have an RNA molecule.
And there are various features on this molecule that govern the way the splicing machinery removes the introns and stitches the exons together into a mature mRNA. This thing is really the subject of a lot of RNA-seq analysis. Either mature mRNAs, or small RNAs that won't be polyadenylated, or ribosomal RNAs, which may also not be polyadenylated. But a lot of RNA-seq is focused on these mRNAs. It's important to remember, though, that we're never really sequencing these things exactly in RNA-seq. We're converting RNA to cDNA. And then for most species, the typical length of a transcript is much larger than what can actually be covered by a single sequencing read. So we're usually fragmenting cDNAs and actually sequencing little pieces of these things, and then trying to assemble or align them in such a way that we can infer what the full-length structure of each RNA looked like. And that is a very difficult problem. And then, of course, a lot of this has to do with the actual protein, which, if we could sequence it directly in a high-throughput fashion, we might do that, but there just simply isn't a way to do that at high throughput. So in a lot of experiments RNA-seq is kind of a proxy, a window onto what is happening at the protein level. Some people are specifically studying RNA biology, and then they don't have that extra layer of inference or projection. So this is just a simple depiction of an RNA-seq workflow at a really high level. We're going to start with some samples of interest. It could be a tumor-normal pair, or developmental tissue stages, or drug-treated versus untreated, whatever it is that you're studying. And we're going to isolate RNA from those. And then often we'll do a poly-A selection, but not always. There are different enrichment strategies that try to enrich for the RNAs that we'd like to be sequencing, depending on what our strategy is.
And then we're going to generate cDNA from that RNA and fragment it. Sometimes the order of those things is reversed. There will be some size selection step to get fragments that are in a range that makes sense for RNA sequencing, then we add linkers or sequencing adapters, and then basically sequence the ends of these fragments from each end inwards. And that produces these RNA-seq reads that are shown here in a paired form. So there's a blue read one and a red read two. And then there's often some little piece in the middle of the fragment that we didn't actually sequence, because the two reads didn't meet in the middle. So there's some amount of unknown insert sequence. And the size of these fragments usually ranges a fair bit, so they might be in the range of, say, 250 to 500 bases. So sometimes the two reads will meet in the middle and sometimes they won't. And sometimes there'll be a different strategy where you only sequence from one end, and the length of the reads can vary depending on how you run the instrument and so forth. But at the end of the day you get basically a FASTQ file, which has your raw read data in it. And that gets aligned against a reference genome or a reference transcriptome or some combination of those things. And the alignments feed into a lot of downstream analysis. Any questions on that? Yes? Okay. So that's a good question. The question was whether random hexamer priming was a better alternative to poly-A selection if you're looking at small RNAs. I would kind of separate those two things a little bit. When you generate these cDNAs, people generally use random hexamers to do the cDNA synthesis reaction. And that's separate from whether or not they've done a poly-A selection of the RNA upstream of that. But the bigger Pandora's box that you opened with your question relates to small RNAs in general.
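The fragment geometry described above can be sketched in a few lines. This is just an illustration of the arithmetic, with made-up fragment and read lengths: the unsequenced middle ("inner distance") is the fragment length minus the two read lengths, and a negative value means the paired reads overlap in the middle.

```python
# Sketch of paired-end fragment geometry: a cDNA fragment is sequenced
# with read 1 and read 2 from each end inward, and whatever is left in
# the middle is the unknown insert sequence.

def inner_distance(fragment_length, read_length):
    """Bases of the fragment left unsequenced between a read pair."""
    return fragment_length - 2 * read_length

for frag in (250, 300, 500):                     # illustrative fragment sizes
    gap = inner_distance(frag, read_length=150)  # assuming 2 x 150 bp reads
    status = "reads overlap" if gap < 0 else "unsequenced gap"
    print(f"{frag} bp fragment, 2 x 150 bp reads: {gap:+d} bp ({status})")
```

This is also why a tight size selection helps downstream: with a consistent fragment length, the inner distance distribution is narrow, which aligners can exploit when placing read pairs.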
And there's quite a different strategy for creating a library for small RNAs compared to mRNA-seq, and also a whole separate suite of analysis strategies and tools. And we're not going to talk very specifically about small RNAs, unfortunately, though we would love to add a component that really dug into that. But generally, if you're really interested in microRNAs and other small RNAs, you should think right from the beginning about a customized strategy for creating the library and doing the analysis. And the cut point that people tend to use is that anything in the range of 100 to 150 bases or smaller is considered a special small RNA-seq library, and everything over 150 to 200 bases is regular RNA-seq. And those strategies actually do a really great job of capturing everything from 150 or so up to as big as they get. And you will still get some signal from smaller RNAs; it's almost impossible to eliminate them completely. But there's usually this split path there. And some people are actually taking it further and having a really small RNA-seq library, like the microRNA libraries, and then a sort of medium RNA-seq library strategy that's in the 50 or 75 to 150 base range. So there's a small, medium, and large strategy. And whether you need to do that is, I think, still an area that's being experimented with. Because there is a feeling that we've gone down this road of optimizing for microRNA sequencing and optimizing for mRNA sequencing, and there are some orphaned medium-to-small sized RNAs that are getting lost. But that area I think is still quite novel in terms of capturing those things effectively; there are not well established protocols yet. Does that answer your question? Okay, so why sequence RNA versus DNA? I originally created this slide because I spend a lot of time talking to DNA people at genome centers that really like to sequence the genome and learn everything from that.
But there are some things that are a lot easier to learn by studying the transcriptome, and of course functional genomics would be right at the top of that category. So there's a lot of biology happening at the level of RNAs where the genome may be constant, but some experimental condition is resulting in a change in gene expression. This could be things like a drug-treated versus untreated cell line and so forth. Another thing that's really great about RNA-seq is that predicting transcript sequences from the genome itself is very difficult. This used to be a whole field of bioinformatics: trying to look at the genome and predict what the transcripts would look like. And now we can basically just sequence the transcriptome directly and align it back to the genome, and that is just way, way easier. We don't really understand, entirely or even very well at all, how transcription is regulated by the features of the genome. So it's a lot more effective to just sequence the transcriptome, align it back, and then think about how regulation is occurring, rather than trying to anticipate or predict what is happening based on our current incomplete picture of how transcription regulation works. And then there are some molecular features that really can only be observed at the RNA level. Things like alternative isoforms, fusion transcripts in tumors, for example, and RNA editing, of course, can only be observed at the RNA level. Another application that is starting to become more popular is interpreting mutations that don't have an effect on the protein sequence, basically looking at regulatory mutations. If you sequence a genome and you see that there are mutations there, and you also sequence the transcriptome, it may actually help you interpret the functional relevance of those mutations in the genome. And then a related but simpler application is prioritizing the protein-coding somatic mutations.
This is really a cancer application where you find a bunch of mutations and they're inside exons of known genes, so you can do some kind of interpretation as to what their effect might be. And if you overlay transcriptome information on that, you can also tell whether the mutation is actually expressed in your tissue or not. And sometimes that can have important implications for the relevance of that mutation to, perhaps, a disease that you're studying. So there are a number of challenges to RNA-seq that generally are much less of a problem for people doing DNA sequencing. These include purity; that issue could apply to DNA as well. Quantity, of course, is always a problem, but really it's quality where people seem to encounter problems with RNA compared to DNA, just because RNA is so much more fragile than DNA. Another problem is that transcripts consist of small exons, at least in eukaryotes, that may be separated by large introns, and this creates an alignment challenge. So when you're sequencing reads that were derived from the genome, you have a little piece of the genome, and when you align it back to the genome, you expect it, for the most part, to align as a single contiguous sequence. We don't have this expectation for RNA. With an RNA sequence, many times a read will span across two exons, and then when we align it back to a reference genome, we have to resolve that exon-intron-exon structure to figure out how it fits against the reference sequence. So I guess that leads me to another question. How many people here are working on, say, human as the species they're doing most of their research in? Any other eukaryote? And a non-eukaryote, like bacteria? Well, yeast wouldn't count, but bacteria, OK. Any plant people? Sorry, too bad for you. Do you have genomes? The plant people have reference genomes? OK, well, that's good. Anyone not have a reference genome? One, OK.
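The exon-spanning alignment problem just described shows up concretely in how spliced aligners report their results: in SAM/BAM output, an intron skipped by a read is recorded as an 'N' operation in the CIGAR string. As a toy illustration, here is a minimal parser that recovers the genomic blocks (exon segments) a read covers; the coordinates and CIGAR string are invented for the example.

```python
import re

def aligned_blocks(start, cigar):
    """Return genomic (start, end) intervals covered by an aligned read.

    M/=/X consume reference bases; D (deletion) consumes reference but
    does not split the block; N (intron skip) closes the current block
    and jumps ahead; I/S/H/P consume no reference bases.
    """
    blocks, pos, block_start = [], start, start
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X" or op == "D":
            pos += length
        elif op == "N":                    # intron: split into a new block
            blocks.append((block_start, pos))
            pos += length
            block_start = pos
    blocks.append((block_start, pos))
    return blocks

# A 100 bp read crossing a 2000 bp intron, starting at position 10000:
print(aligned_blocks(10000, "50M2000N50M"))
# -> [(10000, 10050), (12050, 12100)]
```

So a contiguous genomic read has a single block, while an exon-spanning RNA-seq read resolves into two or more blocks separated by the intron the aligner had to discover.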
So I'm sure we'll be talking about that at some point. OK, so another problem that's somewhat particular to RNA sequencing is that the relative abundance of RNAs varies very widely. Again, when you're sequencing the genome, you have some number of chromosomes in your species, and they're there in some expected copy number. In human, the chromosomes are diploid, so you expect two copies of each. So when you shotgun sequence a human genome, you have this prior expectation that everything will be covered at an approximately equal level, because there are two copies of all of the things that you're sequencing. In the transcriptome, of course, this is not at all true. There are some transcripts that are present at many, many copies, thousands or tens of thousands of copies per cell, and there are other transcripts that are biologically functional but are at just a few copies per cell. And that creates this wide dynamic range. There are different estimates of what that is, but it's in the range of 10 to the 5 to 10 to the 7, orders of magnitude between the most lowly expressed gene and the most highly expressed gene. And that creates a huge sampling problem for us. It's a problem because RNA-seq works by random sampling. We're not directing the sequencing in any real way; we're just taking RNA out of cells and shotgun sequencing it. We're basically pulling reads randomly out of a hat. And the problem that this creates is that when we keep pulling these reads randomly out, we tend to keep getting the most highly expressed things over and over and over again. So if our gene of interest is something that's really highly expressed, this is great. We don't need that much data to get good coverage of that transcript.
But if we really want to characterize the whole transcriptome, or we have particular interest in certain lowly expressed genes, then this is a big problem, because it's going to take a lot of sequencing depth to get good coverage on those transcripts. And people are usually surprised at first at how bad this problem is. The transcriptome on its face seems like a small space; in human, for example, it only occupies 1 or 2% of the genome. So you might think, oh, it won't take that much data to cover the transcriptome. But this sampling issue basically kills that idea; it's almost like sequencing a genome because of it. Is it also your assumption that when you compare disease versus control, or tumor versus normal tissue, the amount of expression is proportional to the number of reads you detect? Yeah, because it works by random sampling, you do assume that. And I think you're edging towards an area of analysis that's called data normalization, which deals with potential biases between your two samples that aren't related to the biology that you're studying and that you want to correct, so that when you do compare them, you can see differences that are related to biology. And we're going to talk a little bit about that, but it's also in itself a big topic. Speaking of which, where do you place, for instance, enrichment procedures? cDNA subtraction, for example. If you don't need quantification, you can get at rare transcripts more easily by subtracting out the abundant ones. Yeah, so that's a really great question and I'll paraphrase it. There are a number of wet lab strategies that attempt to address this problem by basically normalizing the varying levels, trying to pull down the really highly expressed things and pull up the lowly expressed things, to put them on a more even playing field and even out the number of copies of each.
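The "pulling reads out of a hat" problem above is easy to demonstrate with a quick simulation. The copy numbers here are invented for illustration: one transcript at 10,000 copies per cell, one at 2 copies, and a hundred genes at a middling 50 copies; shotgun sequencing draws reads roughly in proportion to copy number.

```python
import random

# Simulate random sampling of reads from a transcriptome with a wide
# dynamic range of expression. The highly expressed transcript soaks up
# most of the reads, while the lowly expressed one is barely sampled.

random.seed(1)
copies = {"high": 10_000, "low": 2, **{f"mid{i}": 50 for i in range(100)}}
genes, weights = zip(*copies.items())

counts = dict.fromkeys(genes, 0)
for gene in random.choices(genes, weights=weights, k=100_000):
    counts[gene] += 1

print("reads from 'high':", counts["high"])   # roughly two thirds of all reads
print("reads from 'low': ", counts["low"])    # a handful at best
```

Even with 100,000 reads, the 2-copy transcript gets only a dozen or so of them, which is why deep sequencing (or wet lab normalization) is needed to cover the low end of the dynamic range.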
And there are quite a number of strategies that have been used over the years. This goes back to EST sequencing and cDNA library construction, where different strategies involving different types of hybridization would be used. And there are a bunch of enrichment strategies that also get at this. The most prominent one: ribosomal RNAs occupy something like 95% of all transcripts, so of course almost every RNA-seq library construction strategy involves trying to remove the ribosomal RNAs. But even then you're still left with really, really highly expressed genes and really lowly expressed genes, and there are additional strategies that try to further even that out. And if your interest is just annotating the transcriptome, or just detecting presence or absence of certain transcripts, that might be a really good strategy, because you can get broader coverage across the transcriptome without as much sequencing. The reason people don't always want to do that, the fear, is that you'll lose information about how highly expressed each transcript is relative to other transcripts, or that you'll introduce bias. And that is a real concern. Probably even more of a concern is that it's actually really, really hard to do a good job of normalizing the library. It's very difficult to get to a point where you have even representation of all your transcripts. Usually you've just pushed it a little bit in that direction and you still have quite a wide range. Another issue that's particular to RNAs is that they come in a wide range of sizes. So we've already talked about small RNAs a little bit.
But just generally, you're measuring these things all at once by the same technique, but they each have different sizes, and because you're randomly sampling them, there's a bias towards the things that are bigger, because it's easier to derive a read from a big sequence than from a small sequence. So you have to take that into account when you're thinking about the relative abundance of transcripts. And it also introduces some wet lab issues. For example, to get rid of ribosomal RNAs, as mentioned here, a lot of people will select for polyadenylated species with a poly-A selection. So they basically hybridize their RNA to oligo-dT on beads or on a column, and then wash away everything else, hopefully mostly ribosomal RNAs. And then you're left with these polyadenylated species, enriched for messenger RNAs, on your column, and you make your library from that. One of the problems with that is that the longer the RNA, the greater the chances that it got broken at some point during the isolation or handling of it. And then when you grab onto the 3' end of it, the poly-A tail, you basically wash away the 5' ends of transcripts. And the bigger the transcript is, the more likely that is to occur. So it's really typical to see in mRNA-seq libraries this 3' end bias, where you tend to have better coverage at the 3' end of genes than at the 5' end, and coverage generally tails off as you get to the 5' end. And the bigger the transcript is, the more of an issue that problem is. So speaking of quality, an industry standard way of interrogating the quality of RNA for RNA-seq libraries is to use this Agilent Bioanalyzer Nano assay, which produces an RNA integrity number, or RIN, on a range from 1 to 10.
So who here is familiar with these traces, or electropherograms? Okay, so about half of you, I guess. We've provided a large number of examples that you can download, to give you a reference point. But basically the idea is you're effectively running a gel, but through a capillary, and then you're reading out the fluorescence of nucleotides as they pass a detector over time. So you get this readout where small things come out first and then, over time, larger and larger things, until you get up to very large, and then you stop the assay, and you look at this trace and use it to estimate the amount and the quality of your RNA. Shown on the right here is some RNA that I isolated from a cell line, which is basically perfect. What you see when you run a total RNA that's very high quality, totally intact, on this assay is two large peaks that correspond to the ribosomal RNA peaks for your species. And based on the height and relative size of these two peaks, you get a prediction of the quality of your RNA. So this is a perfect 10 out of 10. And then as your RNA starts to degrade, you start to see secondary peaks. Basically, where the RNA has been broken into pieces, you start to get peaks on your gel, so to speak, at smaller sizes, because the RNAs are being broken down into smaller and smaller pieces. So you start to get what would look like a smear on a gel, but here it looks like all of these extra peaks. And this gets translated into a score that tells you that there's some amount of degradation. And as the degradation continues, this trace keeps moving further and further to the left, until it all piles up at the size where RNA degradation starts to not happen as much, which is in the range of 100 to 200 bases. The Y axis is basically just the intensity of fluorescence.
So you basically have a dye bound to the RNA molecules that's fluorescing, and it tells you how much RNA there was at each size range, over time. Yeah, so that's a good point that I think we're going to show a picture of: one of the steps after you isolate RNA is often to fragment it anyway, so maybe we don't care about degradation. And that, to some degree, is true. There are a few gotchas. One is the one I just mentioned about doing a poly-A selection: if you have degraded RNA, then you probably really don't want to do the poly-A selection, because you're going to introduce this bias towards things that are close to the polyadenylated end of transcripts. The other is a worry that the degradation happening in your sample is non-random, and that will perhaps introduce some bias. But generally, as long as it's not too degraded, not degraded below the size range at which you would ideally want to make your library, it can be not a big deal. You will generally have some loss, though: RNAs get broken into pieces that are below the size range you're selecting, and when you make your library you may have a size exclusion step that throws away small stuff, so you may lose some signal from certain transcripts. But yeah, it's not the end of the world, potentially. Would you be concerned sequencing something with a RIN value of 6? Yeah, so the question was whether you should be concerned sequencing a sample that had a RIN value of 6, like in this example. People are definitely doing it. A lot of core services will set a cutoff that they feel allows them to robustly make a library that ultimately meets their quality metrics, and that cutoff is typically 8, which is pretty conservative, I would say. I think that this is probably fine as long as you're making the library the right way.
So as long as you're doing random hexamer cDNA synthesis and you're not doing a poly-A selection, you will probably be able to get pretty good data out of libraries that are degraded to this level. Another thing to really think about, though, is that yes, you can make a good library, or one that's reasonable, but if you have a project that has 10 or 50 or 100 samples and there's a lot of range, so some of them look like this and some of them have a RIN of four and some of them have a RIN of 10, that's probably a problem. But if they're all six? If they're all six, you'll probably be fine. Wouldn't stop me from doing my experiment, that's for sure. Yeah, you first. Okay: will it be a concern to do poly-A selection if you're looking for differentially expressed genes? I mean, most of the people that are doing mRNA-seq are often looking for differential expression. I think as long as you do it consistently across the samples that you want to compare, then it should be fine. But, related to my last comment, you probably don't want to do it on some of your samples and not others and then try to compare them. It needs to be consistent. But I think if you have good quality, intact total RNA, and you're interested in messenger RNAs, things that are going to be protein coding, a lot of people will still do the poly-A selection just because it reduces the amount of sequencing that you need to do. It's really the most effective way of removing the ribosomal RNAs. There are a lot of other strategies that work okay, but you do wind up having to sequence a bit deeper, so it's a cost consideration. Actually, the question comes from one additional approach I read about, where they just sequence the first few nucleotides starting from the 5' end or the 3' end, and that's it. Yeah, so this...
Yeah, so you're referring to something that goes by a lot of different names: GIS, SAGE, CAGE, various types of tag sequencing. The idea is that if you're only interested in differential gene expression, you can make the cost of sequencing a lot lower by really just sequencing one tag, or one index, per transcript. And so there are different strategies to basically create a library that just has, say, the 5'-most 25 bases. And you can sequence those on an Illumina instrument with a very short read length. So you make a library and then just sequence 35 or 36 or 42 or 50 bases. And that allows you to save reagents and to get basically more sequence out of each kit that you buy from Illumina. And then you can multiplex a lot of samples. So if you have a lot of samples or a lot of biological replicates and you really just care about differential gene expression, then you could do one of those strategies. The one downside is that you lose the ability to do all of the other analysis applications beyond differential gene expression. And you might think that you don't care about those things now, but you might change your mind later. Definitely, as someone who does analysis all the time, I usually advise collaborators and others against doing those kinds of end-sequencing strategies, because it's usually about five minutes after I get the data and start looking at the results that they immediately ask, oh, well, how about alternative splicing, or can you tell me what the expression level of this SNP or this mutation is, or can you look for gene fusions? Basically, they immediately start asking a lot of questions that can't be asked because of the way the data was generated.
So one of the great things about RNA-seq is that it's this really hypothesis-free, unbiased shotgun of the transcriptome, and there's just this galaxy of questions that you can ask of the data after the fact, starting with the same data. All of the information is there and it's just a matter of mining it out, which is why it's so much better than SAGE and other techniques that have been around for much longer. You had a question? No, there is no known bias to degradation like this. Yeah, I haven't heard of one. I mean, I guess it would depend on why or how the degradation was occurring, perhaps, but it's probably random. I mean, some of the RNA fragmentation strategies that you deliberately apply to the RNA involve the use of RNases, which are the very thing you would worry about degrading your RNA naturally. So there are obviously people that are not worried about bias in the way RNases break things down. Okay, so we've actually talked around these issues quite a lot now: design considerations. Of course, the beginning of your project is the good time to think about how you want to design your RNA-seq experiment, and there are many, many considerations. We're going to talk about a lot of those over the next few days, and we've talked about a few of them already. So there are things like: how many replicates should I include? How much data should I generate? Control experiments, spike-ins, reporting standards, et cetera. There have been a few organizations that have addressed this, and there's a resources section on the wiki that covers more of these. Probably the most comprehensive one that I've seen is this ENCODE RNA-seq standards document that was produced.
It was quite a while ago now, three or four years, and it is a very extensive list of the kinds of things you should think about when you're setting up your RNA-seq experiment in the first place. And it's basically a list of everything you should do but that no one actually does. You should think about doing some of those things that are recommended, because you may thank yourself later for including spike-in standards, or sequencing to a certain standard depth, and so forth. Okay, so we've also talked quite a bit about this: different RNA library construction strategies. So we've talked about starting with total RNA and either continuing on with some kind of total RNA-seq strategy or doing a poly-A selection, and related to that is whether you do a ribo-reduction. You usually do one or the other. If you're doing a poly-A selection, you might not do that, but if you're not doing a poly-A selection, you would definitely want to do some kind of ribo-reduction strategy. Size selection is pretty common, and this can be done to create a small RNA library, or just to constrain the fragment size of your cDNAs before making a library out of them. There are several reasons why you might want to do that, but one of the main ones is that when you flow molecules across an Illumina flow cell, it's easier to optimize the density of clusters on that flow cell if the fragment size is relatively consistent. So some people will select a rather tight size range for their RNA-seq library; they'll say, okay, it has to be between 300 and 400 bases, and I'm just going to cut everything else away that's either too big or too small. For people doing things like tumor sequencing, sometimes you have so little material that you're doing some kind of amplification, so one strategy is a linear amplification. Yeah, question? Just a quick question: what are some scenarios in which you would choose ribo-reduction over poly-A selection?
So the most obvious biological one is where your biological question involves non-polyadenylated RNAs; there are a lot of RNAs that have biological significance that aren't polyadenylated. Bacterial genomes. Yeah, bacterial genomes. There's also the degradation issue that we already talked about: if your RNA is degraded, you don't want to do poly-A selection because it'll introduce 3' end bias. But generally the philosophy is just that it's a more complete view of the transcriptome. You're removing the ribosomal RNAs because they take up too much of the sequence, 95 to 98% in human, but otherwise you want this idea of sequencing the whole transcriptome, whatever is there, and analyzing it after the fact, rather than imposing the upfront limitation that the RNAs have to be polyadenylated. There's a cost trade-off in there: you're taking a wider view of the transcriptome, so it takes more data to cover that wide view well. Stranded: we're going to talk a little bit about stranded versus unstranded libraries. It's becoming very common to use a stranded library, and basically what this means is that you can tell which strand of the genome the transcript was transcribed from. For the first three or four years of RNA-seq, pretty much all of the libraries were unstranded, where you didn't actually know the strand that each read came from. You could infer it, often with high accuracy, but there wasn't any molecular biology that told you which strand was being transcribed. You were basically sequencing both strands, aligning back to the reference genome, and then trying to infer which strand each read was likely being expressed from. But now this is built into the library construction, so the reads are encoded with which strand they came from. So this relates to the comment that was raised earlier about normalization. There are a couple of different normalization strategies.
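To put that 95 to 98% ribosomal figure in perspective, here's a hedged back-of-envelope sketch of how many informative reads survive if you sequence total RNA with no depletion at all (the lane size is a hypothetical round number, not an instrument spec):

```python
def usable_reads(total_reads, rrna_fraction):
    """Reads left over for the rest of the transcriptome once
    ribosomal RNA has taken its (undepleted) share."""
    return int(total_reads * (1.0 - rrna_fraction))

lane = 200_000_000  # hypothetical total reads from one lane

# At 95% rRNA only 1 read in 20 is informative; at 98%, 1 in 50.
print(usable_reads(lane, 0.95))  # 10000000
print(usable_reads(lane, 0.98))  # 4000000
```

This is why some form of ribo-reduction or poly-A selection is effectively mandatory for total RNA at these fractions: without it, most of your sequencing budget buys rRNA reads.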
One that's becoming fairly common is to actually create your RNA-seq library and then hybridize it to an exome reagent, using that to enrich for fragments that correspond to the exons, which basically focuses your reads onto real genes that are actually being transcribed. It can be a way of cleaning up RNA-seq libraries that were made from a really poor quality sample, so this is a common strategy for FFPE material, for example. And then there are other library normalization strategies. There's a lot of detail in here, but the broad, important thing to think about is that all of these things can affect your analysis strategy, especially if you're comparing between libraries. So, as much as possible, you don't want to vary any of these things between the libraries that you're ultimately going to be comparing to each other. This next slide, which doesn't display very well even though it's a PDF, shows a depiction of some of these ideas. What's shown here is a gel electrophoresis of RNA, the gel version of that trace we saw a few slides ago, where you have your two ribosomal RNA bands, and then the idea that as degradation occurs you get these additional, smaller bands, and the worse the degradation gets, the smaller they get. And when you select polyadenylated RNA from the total RNA, say from intact RNA, you get this mRNA smear that tends to look kind of like this. There are a lot of steps in making the library, and it's typical to keep evaluating with these Bioanalyzer traces.
So starting with the total RNA, you look at the quality of your total RNA, then you often do a DNase treatment and some kind of enrichment strategy where you try to enrich for mRNAs or remove the ribosomal RNAs. Then a cDNA synthesis step happens and you still have a mix of different-sized molecules, and then a size selection or exclusion will often happen where you're basically picking a tighter size range. Two strategies are common: one is to just throw away all of the really small stuff, and the other is to run your library on a gel, or use an instrument that simulates that, and cut out a very tight size range, and that's what's depicted here. You have the size selection or just size exclusion, and the main difference is that you get a bit more of a tail of large things when you just exclude the small stuff versus actually picking a band of a specific size. You then wind up with the subject of your RNA-seq library, which will have linkers added to it and will get sequenced, and at this step you're usually losing your small RNA. So if you want to do small RNA sequencing, you're going to have to capture those molecules and produce separate libraries from the really small RNAs. And this is just a depiction of some of the consequences of different enrichment or depletion strategies. You start with your tissue, again isolate RNA, and you have this mixture of RNA types. Total RNA is the "we have everything" option, but the problem is that we have a lot of ribosomal RNA, so our libraries are dominated by reads that stack up on that ribosomal RNA, for example. Then you can do things like ribo-reduction, where you remove that over-represented ribosomal RNA, and it allows you to focus in on the mRNA transcripts that you're interested in.
A poly-A selection does that even more, and it tends to give you a cleaner result: more of your reads are focused on the exons, with fewer intronic reads and fewer intergenic reads. And then cDNA capture has a similar effect but allows you to achieve that without actually doing the poly-A selection, so you don't have to worry about it introducing end bias. And then there's the depiction of stranded versus unstranded libraries. At the top here we're seeing an unstranded library where the inferred strand is indicated by the color, so red is positive and blue is negative, for example. This is what RNA-seq data has looked like for many years: you have this mixture of reads from either strand, but because they're all aligning to known genes or transcripts, in this case, that have known transcription directions, we can infer what direction those reads probably came from. But now, with strand-specific libraries, you actually get that information encoded. We're going to visualize some RNA-seq libraries, and if you activate the right view in a genome viewer like IGV, it'll basically tell you how each read was tagged, that is, which strand the read "believes" it came from, and those will again be colored blue and red. And you can see that they work out pretty consistently: the reads colored blue tend to correspond to a transcript being transcribed in one direction, and the reads colored red overlap perfectly with a transcript transcribed in the other direction. Okay, so now I'm just going to go through a few common questions, some of which we might have already covered, and if so I'll just skip them. One is the idea of replicates: how many replicates should I do? This doesn't really have anything to do with RNA-seq per se.
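In a stranded BAM, the tag a viewer like IGV colors by ultimately comes down to each alignment's FLAG bits plus the library protocol. As an illustration only, here is the decoding logic for one common convention, the dUTP "fr-firststrand" protocol; that protocol choice is an assumption on my part, and other kits reverse it, which is why aligners and counting tools make you declare a strandedness setting:

```python
def transcript_strand(flag):
    """Infer the source transcript's strand from a SAM FLAG,
    assuming a dUTP-style 'fr-firststrand' paired-end library
    in which read 2 aligns in the transcript's orientation.
    (Other kits use the opposite convention.)"""
    is_reverse = bool(flag & 0x10)  # read mapped to the reverse strand
    is_read1 = bool(flag & 0x40)    # first read in the pair
    if is_read1:
        return '+' if is_reverse else '-'
    return '-' if is_reverse else '+'

print(transcript_strand(0x50))  # read 1, reverse strand -> '+'
print(transcript_strand(0x80))  # read 2, forward strand -> '+'
print(transcript_strand(0x40))  # read 1, forward strand -> '-'
```

With unstranded data there is no such rule, which is why strand there has to be inferred from overlap with annotated transcripts instead.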
It's more of just a broad biology thing: of course you should include biological replicates; there's no way around that, RNA-seq doesn't make biological variability go away. But the idea of technical replicates sometimes comes up. If I have RNA-seq data from one lane and RNA-seq data from another lane for the same sample, are those things equivalent, or do I need to worry that there's somehow some bias related to those two lanes? Or say I produce some data on one flow cell this week, and next week I realize I want more data from the same library, so I sequence another lane's worth on a different instrument or a different flow cell; should I trust that the data can be pooled together safely? The short answer is yes. The Illumina platform in particular has at this stage reached a level of robustness such that the consistency from lane to lane or flow cell to flow cell, as long as there's no serious problem with the run, is really, really high. And this is just an example of two lanes being compared to each other with very, very high correlation, which is typical, so people don't usually do those kinds of technical replicates. So in that ENCODE document you referred to, would you still recommend the spike-ins? The spike-ins I would recommend, yes, though I would consider that different from the kind of technical replicates I was just describing. I would definitely say that if you include those spike-ins in all of your libraries, that can be a really nice way to evaluate the quality of your data over time, and you can use them for normalization. Yeah, they're a really good idea; they increase the cost of each sample a little bit, but they're highly recommended. So if you're going to use spike-ins, does that mean I don't need those technical replicates? Yeah, so I guess it comes back to what you mean by technical reproducibility or technical replicates. The technical replicates I'm talking about here are really instrumentation-related, and the spike-ins would help you evaluate that, but the spike-ins go right into your RNA, so they evaluate the reproducibility of the whole process: making the libraries, doing poly-A selection, all of those things still benefit, I think, from an evaluation of reproducibility. It's only really at the instrument level, like the two lanes I was referring to, so that's a great point: the spike-ins give you an idea of the robustness of your whole end-to-end process. So, for example, you could add one spike-in at the start of the process and another spike-in partway through processing, and that should help you detect things like a different operator making a library a different way, someone making a mistake, or batches of reagents and enzymes that vary over time, those kinds of things. But even though you're not concerned that there's a difference between flow cells and lanes, et cetera, do you still control for it in your downstream analysis in a routine kind of way, or have you just decided that you're not concerned?
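The lane-to-lane comparison on the slide is just per-gene counts from one lane plotted against the other, summarized with a correlation coefficient. A minimal sketch with invented counts (illustrative numbers, not from any real library):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up per-gene counts for the same library sequenced on two lanes
lane1 = [120, 3400, 0, 56, 780, 12000]
lane2 = [131, 3299, 1, 60, 805, 11850]
print(round(pearson(lane1, lane2), 4))
```

A healthy pair of lanes from the same library should sit very close to 1; values well below that are the kind of outlier the routine per-lane QC mentioned below is meant to catch. (In practice counts are usually log-transformed first, since a few highly expressed genes otherwise dominate the statistic.)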
Yeah, so the question is basically: even though I'm not concerned about these lane-to-lane technical replicates, do you still build into your analysis procedure some concept of evaluating quality anyway, just in case something went wrong? And I think the answer is yes, because you can automate those kinds of checks, so why not. For example, when we process RNA-seq data, we process it lane by lane and then the results get merged before moving to the next step, so it's really easy to produce a quality report at the end of each lane-by-lane alignment and summarize at the lane level. Sometimes it's even finer-grained than that, because you have multiple samples in the same lane with indexes, so then it would be at the lane/index level: you basically have a piece of instrument data that is at most one lane, or part of a lane if there was indexing, and quality gets assessed on each of those pieces before they get merged into the final result. Sometimes it's not an issue; depending on your experimental design you may mostly have just one lane for each sample, but you definitely want to think about QC of the flow cell and the instrument, in case you don't realize you're introducing batch effects to your project overall. So we're going to have a couple of QC components in the hands-on tutorial, and there is quite a wealth of tools now to help you evaluate QC, and some of the metrics we'll talk about are quite particular to RNA-seq, so we'll try to review those. A quick question on replicates: what happens when you have the undesirable situation of comparing replicates across multiple platforms, but probably from the same samples? Yeah, that's interesting; I don't think I've done Ion Torrent versus Illumina, I don't think we've done much of that. There was a question at the back? Yeah, so the question is: if you have bad samples, how do you decide to throw
it out? That is a really hard question. There seems to be a very strong tendency to not want to throw out data no matter what, and this happens at every step. So we looked at the RNA integrity numbers and things like this, and people try to say, oh, if the RIN is less than six, or less than eight, we shouldn't do it, and everyone agrees, yeah, that seems reasonable. But then when it comes to it, you've spent all of this time doing your biology or growing your cell lines, or the mouse took nine months to get that tumor, or whatever, and it's almost a foregone conclusion that you're going to try to produce the data and get some results out of it. I think most people won't throw away the data unless they really have an alternative or a backup, which is often not the case. So then I think it's really just important to worry about what the effect is and whether it's influencing your results, and to try to account for it. There is quite a science to detecting bias and batch effects and trying to correct for them, and you just have to be cognizant that there may be issues. Well, you can do a basic normalization using your spike-ins, and that may help with some problems, but if you have one sample that's fundamentally degraded, or it was a really low input amount of material and you just didn't get representation of the transcriptome compared to another sample you're comparing to, there may be no way to recover that, especially depending on what question you're asking. So if you have one of the applications that places high demands on good coverage and even representation across the transcriptome, say alternative splicing, where you want to understand the ratio between isoform A and isoform B that differ in the connections of exons, if you're not getting good representation of the coverage of those exon-exon connections in one of your
samples, you're not going to be able to rescue that; the information may simply be missing or lost. Michelle, I'm fine to keep going, but it's up to these guys more than me. I think we should finish this lecture and try to get through it as quickly as possible. So I just mentioned this concept of analysis applications. There are a lot of applications that people use RNA-seq data for, and to some degree that affects the amount of data you would want to produce and how good the quality of the data needs to be. Some that we're going to talk about in detail are gene expression and differential gene expression, alternative expression or alternative splicing analysis, and transcript discovery and genome annotation. There are a lot of other applications that we would love to have time to go through, but we would need a five- or seven-day workshop; these include, but are not limited to, allele-specific expression, mutation discovery, fusion detection, RNA editing, viral detection, and quite a slew of others as well. The good news is that all of these applications share some broad themes in the analysis workflow, so when we go through the hands-on tutorial you're going to get an example workflow for a couple of these applications, but a lot of the concepts will be cross-applicable to the other types of questions you would ask of the data. They all follow this sort of theme: you start with some raw data, which could be in FASTQ format or could already be in BAM format; you align or assemble your reads, or some combination, and at this stage you'll get a BAM file; then you process the alignments with a specific tool, and often the tools are specific to each of those applications, so there's a differential expression tool, there's a fusion detection tool, there's a
virus discovery tool, but you're going to have some tool that basically takes a BAM as input and asks one of these questions of the data. Then there'll be some post-processing: you might be importing results into downstream software for visualization or statistics or further summary, and then there'll be your final summarization and visualizations, reviewing the results, developing validation experiments, creating lists, prioritizing candidates for further study, and so forth. And then here we have this Biostars exercise. This is kind of a long lecture, so sometimes we break it up by doing a Biostars exercise, but first I would like to ask: who has used Biostars before? Okay, so maybe five or ten of you. Biostars, for those of you who haven't used it, is an online bioinformatics forum, and it really has become the most popular place to ask questions about bioinformatics analysis, and RNA-seq is no exception. There are a huge number of RNA-seq questions that have been asked and answered, to varying degrees of quality, on the Biostars forum; it's run by a bioinformatics professor, and there's a community of bioinformatics people like us who answer questions there. So when we leave the course and you have follow-up questions, if you're thinking about emailing a specific question, we generally ask that you first check whether someone has already asked and answered that question on Biostars, and if it hasn't been, ask it there. You may feel free to send us an email to give us a heads-up, but we'd rather answer your questions publicly, where we can avoid answering the same question over and over, everyone can benefit, we can have a discussion, and other people can chime in as well. So maybe just spend a few minutes going to Biostars and logging in. It's really easy to create an account, and if you already have a Google account, or an account with a few other providers, you don't need
to create a new account; you just log in with existing credentials. Just poke around for a bit and see what kind of material is there related to your particular area of RNA sequencing. Okay, I think we'll continue on; it's been about five minutes, but I definitely encourage you to keep checking out Biostars when you have questions, and to contribute if you like it. Okay, so a few common questions that we'll just get out of the way, just in case you were wondering; I'm sure it wouldn't be long before you thought of some of these things. One of them is: should I remove duplicates for RNA-seq? The reason we ask this question relates to the DNA sequencing crowd. It's become so routine to mark and/or remove duplicates in DNA sequencing that some people, when they start doing RNA-seq analysis, adopt their existing pipeline to some degree, or they're so used to that best practice that it seems obvious: of course I would mark or remove duplicates, just like I would in DNA-seq. But actually, that's generally not a good idea, and it's generally not recommended to remove duplicates in RNA-seq. The reason is that you have different expectations. In DNA sequencing you have approximately equal representation of all regions of the genome, this two-copy state for everything, and you assume that if you sequence your genome to 30 or 50 or even 100x coverage with paired-end fragments, then two fragments that are exactly the same probably came from a PCR amplification artifact, because it would be very unlikely for that to happen by chance. For a fragment that starts and ends at exactly the same place, it's extremely unlikely you would get that until you're sequencing your genome to 100,000x or even higher; you wouldn't expect to see exact duplicates, so you remove them. The problem is that in RNA-seq you don't have that same situation, so you
don't have approximately equal representation of these massive chromosomes. Instead, you have much, much smaller things, some of which are present in many, many copies. You may have tens of thousands of copies of an RNA in every cell, and the RNA may only be three or four hundred bases in length, so it's actually quite easy to get a duplicate just by chance, where the same fragment is produced exactly the same way from two independent molecules. So your duplicates there are probably more likely to come from true multiple representations of that transcript than from PCR amplification, and generally we don't mark or remove duplicates in RNA-seq. The good news is that if you do mark them, a lot of the downstream tools will just ignore the marking information in your BAM file. There'll be a flag set in the BAM file, so if you use Picard MarkDuplicates it will mark each alignment and say, oh, this is a duplicate, and a lot of the downstream RNA-seq tools will ignore that information anyway, so you probably won't do any harm, but it's generally something you don't do. It's still a good thing to think about, though. One application where you may still want to mark duplicates is if you're going to be doing mutation discovery with RNA-seq data, and then the same reasoning applies there as it would for DNA-seq. How do you deal with PCR?
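The "duplicates by chance" argument is easy to make quantitative. Under the toy assumption that fragment start positions are uniform along the transcript, the expected number of distinct starts among N fragments follows the classic occupancy formula, and for a short, highly expressed transcript almost every read collides with another one (all the numbers below are illustrative assumptions):

```python
def expected_distinct(starts, n_fragments):
    """Expected distinct fragment start positions when n_fragments
    are drawn uniformly from `starts` possible positions."""
    return starts * (1.0 - (1.0 - 1.0 / starts) ** n_fragments)

# Toy numbers: a ~400 bp transcript leaves roughly 300 possible start
# positions for a ~100 bp fragment; suppose 10,000 fragments hit it.
starts, n = 300, 10_000
chance_dupes = n - expected_distinct(starts, n)
print(round(chance_dupes))  # ~9700: nearly every read looks like a "duplicate"
```

Removing those "duplicates" would cap the measurable expression of that transcript at roughly the number of start positions, destroying the very signal an expression analysis is after, which is the core of the argument for leaving them in.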
Yeah, so of course that raises a question that I don't have a great answer to. You hope that it isn't happening that much, and we try to limit the amount of PCR amplification as much as possible. There's not a good answer, because you wouldn't want to start removing duplicates for some libraries and not for others, so I think it's just something to be aware of: you may have issues in your libraries, there's not really a good way to remove them, and you have to live with it to some degree. Another super common question, probably the most common question, is: how much data should I produce? I'm sure you're all thinking that right now. Can I ask one more question? Sure. Yeah, so the question is what I mean by "assess duplicates". First, just a point to say that if you are going to remove or mark duplicates, make sure you do it on read pairs, not on individual reads. So, how do you assess duplicates? I guess you would use one of the existing duplicate-marking strategies and compare your libraries over time, libraries that were created in a consistent way, similar to the comment that was just raised, and basically try to get a sense of what amount of duplication is typical and look for outliers. And there are other strategies that attempt to look for amplification bias in a much more localized way. When you have really bad amplification bias, it does have a signature: you don't just see more duplicates everywhere, you see certain spots where you have this huge pile-up of alignments that are all the same fragment, piled to the ceiling. You can actually see these just by looking at your data in a
genome browser: you'll see this really blocky signature on your alignments, where in periodic spots you get slammed with thousands of copies of exactly the same fragment, disproportionate to the level of coverage otherwise in that area, and those are areas that, for whatever reason, tend to over-amplify. I can't think off the top of my head of a tool that specifically looks for that pattern; there might be one that looks for this unusual spikiness in your data, but you could definitely develop an ad hoc procedure or metric for finding these things. Sometimes you'll see this signature in the QC tools as well: a lot of them will do a k-mer analysis, and you'll find certain k-mers, certain sequences, that in particular libraries are present at very, very high levels for otherwise unexplained reasons. Okay, so how much depth? Just to disappoint you right away: it's hard, you can't really say, because there are way too many factors that come into how much depth you might need. One of the big ones is what you're going to ask of the data, what kind of analysis you intend to do; we've already talked about this a bit. If all you care about is differential gene expression analysis, you just want one number that gives you the relative abundance of each gene, and that places the least demand on an RNA-seq library of any of the analysis applications, so you'll be able to get away with a lot less data. In that scenario, you might decide that having more replicates or a larger cohort of samples gives you a lot more statistical benefit than having deeper individual libraries would, and since you have a limited budget, that's what you're going to do: you're going to have a bunch of libraries that are only sequenced to, say, five million reads each. But if you want to do alternative expression
analysis, or you want to call mutations from your data, or you want to look at RNA editing, then that same calculation doesn't apply and you need to adjust your expectations. I can tell you what we do, which is basically one lane: all of our RNA-seq libraries get one lane, and we find that that is pretty much suitable for most analysis applications. That's one sample per lane on a HiSeq 2000 or 2500, so it's 250, 300, 350 million reads, that kind of range, and that's for human. So again, I can give the recommendation for what we do, but we do almost exclusively human work; for yeast that would probably be way overkill. It depends on your genome size, the transcriptome size, and the transcriptome complexity. Yeast is an interesting example: it's a eukaryote and it has introns that are spliced, but in comparison to something like human or mouse the splicing is way simpler, so it wouldn't place as many demands on the amount of data as doing splicing analysis in human would. All of these things need to be considered together, so it's very difficult to come up with an answer; I can probably come up with a guess or an instinct value if you describe your particular situation, but for some areas that I really know nothing about, it would just be a complete guess. So if you have, as somebody else asked, one sample that's degraded and you don't have enough data but you still have the sample, is it common practice to add another lane and aggregate the data so you have enough? People do tend to compensate for bad sample quality by producing more data. One thing you'll see with heavily degraded RNA libraries, or libraries that come from FFPE material, is that they tend to be much noisier: you have a lot more reads in the introns and in intergenic space that don't seem to have anything to do with transcription. But you are still getting signal from the transcripts; it's just buried in more noise, and so you can compensate to
some degree by just producing more data, and a lot of people will do that. That was also part of where our one-lane-for-human standard came from: a fair amount of the time we're actually making RNA libraries from FFPE material, and we found that one lane was still generally sufficient for that kind of input, so it was a robust, this-will-work-90%-of-the-time kind of calculation. We do quite a few of these applications, so yes, if we just wanted differential gene expression it would probably still be overkill, for sure. But a lot of our sequencing is tumor sequencing, so we're often in the position where we've identified somatic mutations in the DNA and we want to assess the expression status of every somatic mutation in the RNA. Some of those transcripts just aren't highly expressed genes, but they could be really important genes, so we want pretty good coverage across the transcriptome to give us a chance of detecting every mutation in every gene, and we don't want to miss that for lack of coverage. So, this 40 million read figure: we reference this paper in the online wiki as well, and they evaluated that analysis only with respect to one application, gene expression. So it's good for gene expression; basically, if you want to recreate what you used to do with arrays, if you want a microarray-style output, it's probably fine. I think Michelle was telling us to move along, but maybe one last question. Yeah, so no, it's not for any particular prep, though the prep may influence it: if you have sufficient RNA quality to do the poly-A selection, that will allow you to generate less data and still get a sufficient result, but for total RNA-seq you might need more; we found that one lane is fine for our total RNA-seq libraries.
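As a rough sanity check on that one-lane figure, you can do the depth arithmetic yourself. The sketch below assumes ~300 million 100 bp reads per lane (the range just quoted) and a ~100 Mb expressed human transcriptome; both are ballpark assumptions, and real coverage is wildly non-uniform because expression levels span orders of magnitude:

```python
def mean_transcriptome_coverage(reads, read_len_bp, transcriptome_bp):
    """Average fold-coverage if reads fell uniformly across the
    expressed transcriptome (they never do; expression is skewed)."""
    return reads * read_len_bp / transcriptome_bp

# ~300M x 100 bp reads over a ~100 Mb transcriptome
print(round(mean_transcriptome_coverage(300e6, 100, 100e6)))  # 300
```

The same arithmetic at 5 million reads gives only ~5x average coverage, which is why that budget is defensible for gene-level counting but not for splicing analysis or mutation detection, where rarely expressed transcripts need real depth.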
These two recommendations are really what you should do: think about what your analysis goals are, and look at what other people have done and see how well it worked for them. Even better than that, do a pilot experiment: sequence a few samples deeper than you might like to, get a sense of what the result looks like, and then try to dial it back to where you feel it makes sense. Mapping strategy: we're going to talk about this more in the tutorial, so I'll go through it really quickly. It used to depend a bit on read length, because there used to be quite a lot of libraries being sequenced with reads shorter than 50 base pairs, like 36-mer libraries and 42-mer libraries. That's a lot less common now, but if you do have short reads, the kind of aligner you would choose would be different. We're going to be dealing with paired 100 base pair reads, and we're going to use the Bowtie/TopHat aligner, which does a splice-aware alignment: it takes each read, and if the read spans across an exon boundary, it tries to resolve where the intron is and where the two parts align to the exons. Doing that with really short reads is really, really hard, so it's basically not recommended; if your reads are shorter than, say, 50 to 75 bases, you would use a different aligner. For the most part it's pretty standard now to generate 100 base pair or even longer reads, but if you have decided to make short-read libraries, you would want to think about a different aligner, and we can talk about specifics if you want. And then here's a visual depiction of this, showing an IGV screenshot. In this case we've done DNA sequencing of a normal sample and a tumor sample, so this is blood and this is tumor, and we've identified a splice site mutation here. We've also done RNA-seq on
the same tumor sample, and we've aligned those RNA-seq reads to the reference genome. Those reads span across the exon-intron boundaries, so you have reads that align, for example, to this half of this exon, and then the rest of the alignment continues across to the other exon. This is the kind of alignment that's difficult to do if you have really short reads. And what I'm showing here is that you're basically seeing evidence in the RNA-seq data of exon skipping caused by this somatic acceptor site mutation.

These guys? Pre-mRNA, yeah. This is a really interesting question to think about: what are these things, and what could they be? Does anyone want to guess? Pre-mRNA, yes: pre-mRNA molecules, where the intron hasn't actually been removed by the splicing machinery yet. When we isolate RNA from a sample, we're sort of catching the transcriptome in the act; everything is happening all at once, so there will be some RNAs in there that were in the middle of being processed and still have their introns in place. That's one source. Any other ideas? It could actually be real transcription, so basically an alternative isoform that has a different exon-intron boundary; that's another possibility. Anything else? Long non-coding RNA: it could be another transcript that just happens to be transcribed there, possibly on the other strand. This is unstranded data, so we don't actually know which strand each of these reads corresponds to; it could be that these reads are aligning to this transcript going in one direction, and there is an antisense RNA transcribed in the other direction, and we're seeing some signal from that. Another possibility is genomic DNA contamination that made it all the way through and wasn't effectively removed by the various upstream steps that attempt to remove it. Or just transcriptional noise. There are a lot of possible sources of this noise, and of course it could also be an alignment artifact: maybe those
reads don't really go there; maybe they belong somewhere else and have been misaligned to this region.

I think this is the last common question. People are less worried about this than they used to be. When I started doing RNA-seq, a lot of people were very skeptical about it because it was a new technology, and they wanted some sense of how accurate it was and whether it was comparable to more accepted gold standards like RT-PCR or qPCR. The short answer is that it's very, very accurate if you do the analysis right and the data are of good quality. We did an experiment where we took 400 candidate events from an RNA-seq analysis. In the first example, there were 200 cases where we'd observed an exon skipping event, so we had reads that spanned from, say, exon 1 to exon 3, mixed with reads supporting the alternate isoform that includes exon 2. We designed validation experiments to amplify those two isoforms, which give you a large band on a gel where the exon has been included and a small band where it has been skipped. You can cut those bands out of the gel, sequence them, and compare the sequence you get from RT-PCR and Sanger sequencing back to what the RNA-seq predicted, and the predictions are very good: basically 95 to 98% of the time that the RNA-seq says an exon is being skipped, you can validate that by a more conventional RT-PCR and Sanger sequencing strategy. The same holds at the quantitative level of expression: in this case, for differential expression of alternative exons, or alternative parts of existing exons, if you compare the readout from RNA-seq on the y-axis to qPCR on the x-axis, you get a very, very good correlation, and you can apply a statistical test to tell you whether both platforms said the exon was differentially expressed. And this was based on what is now quite an old RNA-seq analysis and
RNA-seq library strategy, and sequence length; all of these things are better now, so this is really a lower-bound estimate. You can get very, very good predictions out of RNA-seq.

The last question I have on here is: what do I do if I don't have a reference genome? Since we only have one person without a reference genome, maybe we can have a more detailed conversation with that person directly offline, but the short answer is: one, have you considered sequencing the genome of your species? 100 gigs? 100 gigs, yeah. Usually the plant people have the best excuses for not sequencing their genome, because those genomes are so big, so complicated, and so full of repeats; they have all kinds of problems, so they have a legitimate need to work around that. And there are a lot of strategies. Sometimes you may not have a reference genome, but you may have some notion of a reference transcriptome, perhaps from cDNA libraries; then you shift your analysis strategy to aligning to cDNAs and doing a more directed comparison against the transcriptome, instead of going to the reference genome and then back to a transcript-level interpretation of those alignments. There are other strategies too, like de novo assembly of your RNA-seq reads instead of aligning against a reference genome, and we'll show you later some resources that go through tool recommendations and strategies for some of these edge cases where you don't have a reference genome or a reference transcriptome.

And that's it; we're going to move on to the hands-on tutorials. We're going to log into Amazon again and start going through some of the basics of command-line RNA-seq analysis. We're going to install some RNA-seq tools so you get a sense of what it's like to create the analysis environment, and then we're going to go through alignments and other things. There's a brief series of slides that outlines what will
happen in the hands-on tutorial, but maybe we can go through that when we come back from a bio break, and then we'll start with some analysis.
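Since the tutorial will be looking at spliced alignments in IGV and in the aligner's BAM output, here is a small illustrative sketch (mine, not from the course materials) of what a splice-aware aligner like TopHat actually records for a read that spans an intron: the skipped intronic bases appear as an `N` operation in the SAM CIGAR string, and the exon-anchored segments of the alignment can be recovered from it.

```python
import re

def spliced_blocks(pos, cigar):
    """Return the reference blocks covered by an alignment, splitting on
    'N' (skipped region, i.e. intron) CIGAR operations.
    `pos` is the 1-based leftmost reference position, as in SAM."""
    blocks, start, ref = [], pos, pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=XD":          # these ops consume reference within a block
            ref += length
        elif op == "N":           # intron: close the current exon-anchored block
            blocks.append((start, ref - 1))
            ref += length
            start = ref
        # I, S, H and P consume no reference bases
    blocks.append((start, ref - 1))
    return blocks

# A 100 bp read spanning a 500 bp intron: 60 bp in one exon, 40 bp in the next
print(spliced_blocks(1000, "60M500N40M"))  # -> [(1000, 1059), (1560, 1599)]
```

This is also why short reads make splice-aware alignment so hard: with, say, a 36-mer, one of the two blocks can be only a handful of bases, which matches many places in the genome by chance.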