 But before we dive into that, we're going to review sort of a little bit of the background behind the next phase of the workshop which is really about alignment and visualization. So we've kind of set ourselves up with the data and the reference genome and the annotations to do some alignments and that's what we're going to do in the second lab tutorial. So we're just going to talk a little bit about the background of that before diving into that. So we're on module two now, which as I said is RNA-seq alignment and visualization. The learning objectives of this module are to why is this so small? Perform alignment or learn about alignment challenges and common questions, discuss a couple different alignment strategies. We're going to review very briefly bowtie top hat, which is the aligner we're using in this course in these particular lectures. We're going to introduce the BAM embed formats and do some or talk about some basic manipulation of BAMs. We're going to visualize RNA-seq alignments in the integrative genomics viewer and then we're going to do some BAM read counting and look at variance allele expression status in some real data. So there are a number of challenges with RNA-seq alignments. One of them is computational cost. So we've been playing with a data set that has, I'm not sure, maybe tens of thousands of reads in it, but your real data is going to have hundreds of millions of reads. So when you do your alignments, you'll see that even with this tiny data set, it takes a few minutes. It's a complete alignment. With real data, you can expect it to take hours or days. So there's definitely a real computational cost. Unlike genome alignments, RNA can and often does have splicing. So the alignments are spliced. So we're going to talk about the distinction between spliced versus unspliced alignments. Unfortunately, you probably can't just align your data once and be done with it. So what you're going to find is that, depending on what downstream processes you're trying to do, so for example fusion detection might require a different alignment strategy than expression estimation. So not only is it computational and computationally intense, but you find yourself actually having to go through alignment again with maybe slightly different parameter setting or with a different algorithm. So is Toppat the only mapper to consider for RNA-seq data? Obviously, the question to that is no. We're going to play with two just in this lecture alone, but there are lots of others. I've included a link there to a BioStar forum post. I think we have these elsewhere throughout the lectures, or if you just go to BioStars like Malachi talked about and search through the search interface for whatever you're interested in. There's quite a lot of content there discussing the issues of RNA-seq analysis. So really in broad terms, there are three RNA-seq mapping strategies. In the top left, there's the idea of de novo assembly. So you might do this if, for example, you don't have a reference genome, or you have a reference genome, but you don't have good gene annotations, or maybe you have gene annotations, but you want to identify novel transcripts. So isoforms for your genes that have not been observed yet or annotated yet. Even in human, which is probably the most well annotated genome with the, you know, the highest quality reference and best annotations, there are still new transcripts being discovered all the time. And then you have things like fusion transcripts, where there's basically an infinite number of possible transcripts that can resolve from different genes being fused together. And a de novo assembly might be a good way to detect those kinds of events. So here you're not starting with any preconceived notion of the genome or transcriptome. You're using an assembly-based approach to construct what you believe all the transcripts are in that population of human transcripts. So that's one common approach. Probably the most common approach is what we're going to do or what we're going to focus the most on, which is aligning to a reference genome, but just being splice aware. So taking your short reads and acknowledging that they may not always align to one place in the genome, but there may be gaps and aligning like that to a reference genome. Another strategy is to align to a transcriptome. So where your reference isn't the genome, but it's a set of possible transcripts. So there you need to know what the transcripts are, and then you just align directly to them. This is probably the least common approach. So which strategy is best? Like I mentioned, a de novo assembly might be a good idea if you have a reference, don't have a reference genome for your species. If there are complex polymorphisms, mutations, haplotypes, etc. that might be missed by aligning to a reference genome. Aligning to the transcriptome, like I said, this isn't done as often anymore because we basically built splice aware aligners to avoid having to do this. We do sometimes recommend it if you have shorter reads, but most people don't have reads less than 50 base pairs these days. So most people are going to do this aligned to reference genome third strategy, which basically covers all other cases. Each of these involves different alignment or assembly tools. So this is a figure, we might have shown this already. It was in a review maybe a couple years ago basically showing when all of the different short read aligners came into use. And you can see there's a stack of, I don't know, maybe 40 or 50 different aligners broken down by application. So the ones in red are for RNA-seq specifically. You've got methylation, bisulfite sequencing, DNA, whole genome, exome type sequencing, and I can't actually read what's in green. MicroRNA, small RNA approaches. There are some algorithms that have been designed specifically for that. So a question is should I use a splice aware or unspliced mapper? As we talked about RNA-seq reads might span large introns. Or they will typically span introns. And the fragments being sequenced in RNA-seq represents messenger RNA, therefore the introns have been removed. So we're usually aligning these reads back to the reference genome, so you pretty much need to use a splice aware aligner. Again, if your reads are really short, you should use a splice aware aligner. And some examples are listed there, top hat, star, map splice. There's more options. So I think maybe in the early days this was a question, but for now I think it's pretty much resolved. The one that we're going to use is called bowtie top hat. I really wish I could see something on this screen. So bowtie top hat is one of these splice aware aligners. It requires a reference genome. If the way it works is it breaks the reads into little pieces, it uses the bowtie aligner to align the short pieces, and then it extends those alignments from these short alignments and dissolves the axon edges, which is why it's done. And that's kind of directly related. One of the questions we often hear is should I allow multi-mapped reads? So this is basically reads that align to more than one place with equal or near equal quality. So this could happen for a number of reasons. You could have let's say a gene family, repetitive regions, that sort of thing. In DNA analysis, it's fairly common to use a mapper that just randomly selects alignments from a series of equally good alignments. So if the alignment matches to multiple places equally good, it will actually just arbitrarily assign to one of them. That's less common in RNA analysis. But you might want to do that if you're doing variant calling. But you should probably allow multi-mapped reads for expression analysis with top hat-and-cuff links and certainly for fusion discovery, because those methods are starting with the assumption that some reads will map to multiple places and they've included in their models. They're taking a probabilistic approach and saying, okay, if this read maps to multiple places, I'm going to assign a probability that aligns to here versus here, and they basically take that into account. So you shouldn't exclude multi-mapped reads if you're using, for example, top hat-and-cuff links. So when you run bowtie top hat in a few minutes here, the output that you're going to get is a SAM file or a BAM file. SAM BAM files are basically the same thing. They stand for a sequence alignment map format file and BAM is just the binary or compressed version of the SAM file. And that's done for really space reasons. So these alignment files, as big as the raw sequence files are, the alignment files can get even bigger. And so to make it practical, to move these files around and store them cheaply, it's been necessary to compress them. So the BAM format was invented to basically make SAM more efficient. But unfortunately, that means that these compressed files require a little bit of extra special handling compared to plain text files. So you can't just open one and read it because it'll just be gibberish because it's been compressed. So that means you have to use separate or special tools to view the contents of a BAM file. Whereas a SAM file, you could technically just open it in a Word Editor. In many cases, the file would be so huge that it wouldn't be practical to open it in a Word Editor. But you could do things just straight manipulation of it, like add a command line without any sort of special tools. And there's a link to BIOS stars about how to convert BAM to SAM. This is a very common problem. You'll have a SAM file and you need it in BAM format or vice versa. And there are a lot of tools that help you go back and forth between those. Then I think we'll show a few examples during the exercises of how to do that. So this is an example of a BAM file. It's broken into two sections. So in the top section, you have a header. This you'll find at the top of every SAM file. It might be just a few lines or up to a few hundred lines. And it gives you information about the format or the version of the file, what kind of a liner was used, what regroups are contained in that file. All kinds of just general information about the data and the alignment that we've done. Nothing that's really specifically necessary. After that header, there's the name alignment section of the BAM file, where each line represents the alignment of one read from your data file that was aligned. So here I'm showing 10 alignments. Now we're going to zoom in on one of those and kind of break it down into its components. So it's basically, I think one line is sort of like here. A bunch of different fields of information and eventually the sequence evolves. Similar to a BAM file that we looked at before, but with more information because it's not just a sequence now, but it's how that sequence aligns to some part of the reference genome. You can kind of think of this like an alternative type of BLAST output. If you're familiar with BLAST, you've done, you know, taken a sequence, gone to NCBI BLAST and BLAST it and you get this alignment format. This is kind of like a more efficient representation of that kind of data. So you can follow the link to the SAM1 PDF to read about the specification. There's all kinds of details about how SAM BAM format works. Like I said, it's broken into two sections, the alignment and the header. It's compressed version of SAM. It's compressed using BGZF format. It's not really important. BAM files are often indexed. So just like we took our reference genome and we indexed it with bowtie, you often also index your alignment files to allow you to more quickly look up alignments in that BAM file. Because the BAM file is going to be huge. So you'll often basically sort that BAM file according to alignment position. So like chromosome 1, position 1 will come first and so on. And then you'll create this index file, which is a kind of a lookup file that allows other programs to very quickly go to the part of the BAM file where any specific alignment might be located. So it allows fast retrieval of alignments. So here's some example of the kinds of stuff you would find in that top header section. So this describes the source of the data, the reference sequence, the method of alignment, and so on. Each section begins with an apt symbol followed by a two-letter code, which tells you what kind of information it is. So for example, the SO code gives you the sorting order of alignments. The SQ refers to the reference sequence, RG to the root group, and so on. So you'll get information about the reference sequence name and length, so that could be something like chromosome 1 and it's out of the gate. The SQ sees that it's coming from and so on. And there's also information about the program. So this would be the program that was used to align your data. So it might say something along with it. And the version. Then in the main section, you had on one row 10 or 11 pieces of information separated by tabs. And the contents of that are listed here. So in the file, they're just listed one after another, but it's easier to read as a human has been able to warm up. So the first thing that's listed is the Q name, which is the query template name. So that's usually the read that was aligned. There's a flag. So a flag is a way of representing all kinds of information about that alignment. And we're going to go through what the flags are and how they work in a minute. The R name is the reference sequence name. So this would be, the Q name is the name of the read, the sequence that you're aligning, and the R name is the name of the thing that you're aligning to. So in this case, it says 1 for couple of more. Position is the position of the alignment. The last most one-day position alignment. So here's alignment starting at position 11, 16, 3 on the final one. You'll then have a map Q information, which is the mapping quality. So that's a kind of score for over the whole read. How good did it align to the genome? So here the mapping quality is 12, so it's three of this. There's then a cigar string that's just called cigar string. That's another way to represent the alignment. And we're going to go through a specific example of cigar also in a minute. You then have the R max and P max, which are the reference name and position of the main. So remember, a lot of times you've got paired data. So here it's saying P max equals, that's a symbol saying that the main pair of this read is actually aligned to the same chromosome. But it doesn't necessarily have to be. It could be that one read of a pair of max chromosomes on one, one the other read of that pair of max chromosomes on two, and then that would be indicated here two. And then P max, same thing, is telling you the position. So here the equals is indicating that they're both in front of the one, and you're seeing the two positions, 1163 and 11740, basically they're 100 bases between read one. T, length, is the observed template length. That's an estimate of the size of the fragment based on the distance between those positions for each pair. And then you have your sequence of quality spring. This is just like what you saw in the Fast and Furious. The sequence of your read and then the quality of each base call. So remember, the second thing is a flag. So this is a really kind of confusing concept. Oh, I have a pointer. How am I? You can read all about flags at this URL. There's an 11-bit wise flag describing the alignment. These are stored as a binary string of length 11 instead of a column of 11. It's basically a really efficient way to say something about a whole bunch of different properties all in one number. So, for example, the value of one indicates the flag is set to something besides zero. So here you've got 11 positions and each of these can be either zero or one. And all those combinations of zeros or ones add up to 2047 different possible codes. And each of those codes maps to a single number, which you can put in your file and it basically maps back to this complicated code, which tells you something about all these different properties. There's a website that helps you kind of understand these. Actually, this might be the website. So just to go through like, is that going to work? Yeah. So just like it says, it tries to explain SAM flags in plain English. So the number one, which corresponds probably to 1-0-0-0-0-0-0 for that string of 11, if we have it explained, it means that the read is paired. There's another string of zeros or ones, 11 long, that corresponds to the number two. That means the read is mapped in a proper pair and so on. So you can pick any number from one to 2137 or whatever it was, and it'll give you some combination of these properties. So you can give a lot of information. Basically, any of these binary states can be set and described by this one single number. So the BAM format is really all about efficiency, right? Like giving you all this information. Otherwise, you'd have to have like a different field in the BAM file for each of these. You'd have to have a mate is unmapped, yes, no, and read is a PCR or optical duplicate, yes, no, and so on. So if you see those flags and you want to know, after a while you sort of start to know a few of them, but it allows you to basically like filter your data. You can say, I want all of the reads that are not duplicates and have a read pair mapped. You can look up the code for that, and then you can quickly filter your file for everything that has that code. To make it even more confusing in the specification, they don't list the simple number, like one, or this binary string, but they list it in hexadecimal format, which is like a third way of representing this single concept. So cigar strings, that's the other thing that we kind of put aside for the moment in the BAM file. So number six is the cigar string, which tells you something about the alignment. Yeah, so when you do your primary alignments, those will all be set to not duplicate or unknown or something, and then when you do run a separate tool like Picard mark duplicates, it'll go back and it'll edit the BAM file and change those flags wherever appropriate. I don't know. I haven't noticed one, but I wouldn't be surprised if there was. That does mark duplicates basically during alignment. I mean, the problem is you need to do all the alignments and then go back through the file, and actually the marked duplicates, they even usually do a two-pass strategy, where they go through all of the alignments, look for duplicates, and then they go back through it again and choose the best read out of all the duplicates to remain and mark everything else as a duplicate. So incorporating that into your primary alignment would maybe make the alignment slower, and they do it on the fly, but they still do it by taking the alignment and piping it through another step, right? Like it's still a separate tool, but it's done. It's right. Yeah, exactly. And they don't do that two-pass strategy to make this possible. So the cigar string, there's a number of different symbols. Here's an example of a cigar string. It's basically just a string of numbers and letters that in a very compact way tells you a lot about the alignment. The numbers represent numbers of basis, and the letters represent alignment states or statuses. So it tells you basically which basis aligns the reference, either a match or a mismatch, which are deleted, inserted, represent introns, and so on. So in this example, you have 100 base pair read consisting of 81 bases that match to the reference genome, then 859 bases that are spliced or skipped, and then another 19 bases that match. So that's a relatively simple cigar string indicating basically a perfect match of 81 bases, an intron, or some gap, and then another 19 bases of perfect matching. Does that kind of make sense? So you can see by using like this combination of symbols, you can describe basically an arbitrarily complex alignment for your one read. And then other software can look at that and say, like, again, you can apply filters and say, give me only reads that have no mismatches, or only reads that have no gaps, or only reads that do have gaps, and maybe those are all junctions, things like that. So understanding these cigar strings and flags and the layout of a BAM format is really important, especially when you're troubleshooting things and you're trying to understand what's gone wrong with your alignment or your data quality or so on. Another really common format is the bed format. When you're working with BAM files, it's really common to want to focus on a subset of the reference genome, let's say all the exons of a particular gene. And the bed file is just a really simple way of specifying that. So a lot of BAM manipulation tools accept regions of interest in the bed format. So you have your BAM file, which has all your reads in it and all your alignments, and then you have this other bed file that's, say, in a really simple example, just one gene with the positions laid out for all the exons. And if you wanted to extract, let's say, all the alignments that aligned to exons of your gene, you could use a tool like bed tools, which we asked you to try to install as extra practice, to basically intersect that BAM file with that bed file to extract just the part of the BAM file that relates to that one gene or the exons of that gene. The bed format is not compressed, it's just a simple text, plain text, tab separated that has chromosome name, start position, and end position. One thing you have to watch out for is that coordinates in bed format are zero based, whereas coordinates in a lot of other applications are one based. Does anyone know what I mean by the zero base versus one base distinction? So the idea is like, just to give a really simple example, if you have a reference, let's say chromosome one, it's not real, but let's say it starts with sequence like this. In the zero based convention, you number between bases. So you don't number the bases directly, but you say this is position zero, this is position one, this is position two, this is position three, this is position four, and that's a zero based format. One base, you're numbering directly on the bases, one, two, three, four. So this matters because if I want to tell you to go and look up a base in zero base, I have to say to find this t, go to chromosome one, one dash two, for example, or you need to know a convention to always look at the position after the number that I say, something like that. Whereas with one base, you just go to two. It's kind of, this one is probably a little bit more human and intuitive. You just number things and this t corresponds to two. Zero base has certain advantages in other situations though. So they're both, I would say almost equally used and there's always a total mix of them no matter what you're doing. And so you're always having to think like, am I in zero base or am I in one base? Is this deletion or insertion represented in zero base or one base? If it is, how do I convert it from one to the other? If you search for this topic in bio stars, you'll see I have problems with this all the time. So I like made myself a tutorial and put it in bio stars to remind myself how to go between zero and one base. And you can find it there. But just remember that bed files are zero base, BAM files are one base. So manipulating salmon BAM files and bed files, there's lots of tools. We're using Sam tools a few places in this set of tutorials. We're also using Picard to mark duplicates. We don't use BAM tools, but it's a common popular alternative to Sam tools. There's a lot of the same things, but on BAM files instead of Sam files, although they both work on both salmon BAM. So bed files probably the most popular and common is bed tools, which was the thing we asked you to try and install. But there's also bed ops and others. So how should I sort my Sam or BAM file? So sorting is something you also come across all the time. As soon as you start to try to do something with your BAM file, whether it's align it or put it through a fusion detection algorithm or whatever, almost the first thing that will happen is you'll get an error saying it's not sorted properly because it expects it to be sorted one way versus another way. This is often done by position. So you put the alignments for chromosome one, position one first, and so on. That's usually for performance reasons. So when you sort it by position, you can index it, and then it's just easier to get to arbitrary positions in a huge BAM file. But there are other applications where you want things read sorted, and the usual reason for that is if you read sort, then the read one and read two of the pairs come one after the other because they have exactly the same name except for that one slash one or slash two. So it'll be like read A pair one, read A pair two, read B pair one, read B pair two, and so on. So those are the two most common kinds of sorting, and you'll find you have to sort one way or the other just depending on what you're trying to do. The common way to, as a human, visualize alignments is in one of these viewers. So the IGV browser is the one that we use most commonly. I think it's probably the most popular, but there are other alternatives. This is the equivalent of the UCSC genome browser for next-gen sequence data. So has everyone used the UCSC browser? Are most people familiar with that? Ensemble browser? The NCBI browser? These are all just different kinds of genome browsers. And in those, you can get a representation of your alignments and you could load them into, for example, the UCSC browser and see maybe you've done it before with like, I don't know, a set of primers that you designed. You want to make sure that they map to the, you know, part of the genome that you expected. So you're visualizing them in UCSC genome browser. Same idea, but it can handle these huge data files. So if you try to upload a BAM file with 100 million RNA secrets to UCSC, first of all, UCSC would probably shut off your connection. That's happened to me before because you're just using too much of their bandwidth. Second of all, it would take forever. It just wouldn't be practical. So the integrative genomics viewer is a good alternative. You can fire it up. We're going to do this later and load in your BAM file. It usually needs to be indexed and you need to have your reference genome. And here what we're showing is a single chromosome. So you've got an ideogram. You can select the region here. The viewer position. So the portion of the genome that we're at is marked here with this little red mark. So we're looking at just this little sliver of the genome. You have a track. This thin track along the top gives you an idea of coverage. So basically in your alignments, how many reads are stacking up in each of these positions. So you can see spikes of coverage, which are occurring over exons. And the reason for that is that this is RNA-seq alignment. And we're mostly just getting reads aligning to the transcribed portions of this gene. Each of these little marks here or bars indicates a single read and where it aligned. So you can see there are some reads that align just within an exon. And there are others that have this little thin line between that indicates they're mapping from one exon to another exon. So those are reads that are actually spanning an exon exon junction. The bottom track is a gene track. So you can compare your alignments to what's known about this locus in the genome that you're studying. Has anyone used the IGV before? Yeah. Sorry, which track? Splicing signs. Yeah. You can put almost anything into IGV just like you can with the UCC browser. And in a similar way, you would make like for example a bed file that represents known junctions or splice sites or genes that aren't annotated or any arbitrary definitions you want. And you can load them in and they will appear here usually at the bottom as additional tracks of information much like with the UCC browser. Any other questions? So we're going to check this out in the lab tutorial. There are some alternatives like I mentioned. There's a couple bios to our posts talking about the alternatives. These are ones that we know of, Artemis, BamView, Chipster, Gbrows2, GenoViewer, MagicViewer, Savants, Tablet, and TView. The only ones I've used other than IGV or Savant, and it's very similar. It's a very good product as well. Subtle differences really you could use either I think and be fairly happy. So here I'm just using a screenshot of a view in IGV to illustrate the concept of variant allele expression. So I don't know if you can make it out very easily, but in IGV the reads here are color blue and basically the way it works is this is a reference you know indicated at the bottom. You only see bases indicated at color when they're different from the reference. So pretty much everywhere here you can see the alignments correspond perfectly to the reference you know, but you've got this stack here of differences. Now those could be errors like these three guys here are probably just random errors, but the fact that you've got basically like 50 percent of the reads representing a T here instead of the reference C probably indicates this is a real variant. So a variant of C to T is observed in 12 out of 25 of the reads covering that position. And from that you can actually calculate a variant allele frequency. So 12 out of 25 is 48 percent. So about 50 percent of the reads are supporting one allele and the other 50 percent are supporting the other allele. This makes sense for a heterozygous variant with no allele-specific expression. So again I'm actually showing are any C alignments here. It's possible that for whatever reason one allele is expressed preferentially or maybe the other allele is deleted in which case you might see 100 percent of reads supporting the T. So you can learn information about the variant allele frequency and you can also just visualize variants in IDV. And we're going to do some examples of that as well. But a lot of times we don't want to have to look in IDV to calculate a variant allele frequency. But we do want to know whether there are variants, what the read counts supporting those variants are, what the variant allele frequency is. So we're going to introduce you to a simple tool called BAM recount which basically for any arbitrary position that you give it will go and just count. So if you say tell me what's happening at position whatever 1,000,223 or whatever the position is it'll go there and just count and say there's zero C's, well in this case 13 C's, 12 T's, zero G's and zero A's. So that's the last part of that tutorial. It's a BAM recounting. So just to kind of remind you where we're at in this whole workflow we're going to do module two which is really the start of it which is getting our RNA seek reads and performing the read alignment. And we're going to use bowtie and top hat to do that. So do we have a break schedule now Michelle? Okay. So we might as well get started.