 Okay, this is under the same Creative Commons share-share-like license. We're going to look at alignment and visualization. So the learning objectives of this module are to talk about alignment challenges and common questions, to go over some of the alignment strategies. We're going to review very briefly bow tie top hat. We're going to introduce you to the BAM and bed formats. We're going to do some basic manipulation of BAMs in the tutorials. We're actually going to visualize some RNA-seq alignments in IGV. We're going to try and count some reads in a BAM file and look at variant alleles to see if they're expressed or not in our RNA data. These are kind of common things you might be interested in doing with your RNA-seq data. So the challenges of RNA-seq alignment, a big one is computational cost. So we've already talked about this. We're kind of doing a lot of backflips to make something that actually runs in a practical amount of time for this tutorial because if we gave you regular size data sets with hundreds of millions of reads, we'd have to start the alignments and then come back the next day or something like that. Introns, so in comparison to DNA alignments, you have quite often spliced alignments, especially in eukaryotic organisms. So the aligner has to have some ability to handle the fact that the read might suddenly jump from one part of the genome to another part of the genome tens of thousands of bases away and know what to do with that. The question that you might ask is can I just align my data once using one approach and be done with it? Unfortunately the answer is probably not. So what you're going to find is depending on what you're trying to accomplish, like whether you're looking at small RNAs versus regular RNAs or looking for fusions versus just gene level expression or splicing, you may need to run alignments with different software or different parameters and there's lots of details about that that you can read in the literature and in the BioSTARs forum and so on. We're going to use Top Hat and also optionally Star as the mapper for RNA-seq data. This is not the only mapper to consider for RNA-seq data for sure and I've provided a BioSTAR link there so you can read a discussion about this topic there. We've actually... Yeah, I think that's pretty reasonable. Like what we're doing here would be reasonable to pull out a random down-sampled list of reads from your data and see if your pipeline is working as expected. That would be generally reasonable depending on... You might have some issues like, for example, if you're trying to predict fusions and you're just using a toy data site like this, you probably won't have any fusion-supporting reads in the 80,000 reads because fusions are relatively rare and then you won't actually be testing all of the capabilities of the fusion-detecting software. So it won't always work, but starting with a toy data site is not a bad idea just so that you're not waiting days to find out whether the pipeline is working. So basically you suggest to use it and see if it's for the platform launch? That's not a... I don't know if I'm suggesting that. It's not a terrible idea to try more than one approach. I would say more that you should do some research and talk to people and think about what you're trying to accomplish and decide what pipeline makes the most sense for you. We're going to run two different kinds of aligners. We're going to do two different kinds of expression estimation and we're going to show you two different methods of differential expression. And you're going to see that there's similarities between them, but there's also differences. And it's not clear which one is better. So maybe you want to do more than one approach. At this point there isn't really standard practices yet or there are more than one standard practice without either one necessarily being 100% correct. So I think you already talked a little bit about this. Different mapping strategies. DeNovo assembly is probably going to be most useful if you do not have a reference genome. Aligning to the transcriptome. We talked about this this morning in detail. This is not done very often, but it can make sense in some circumstances. And then the third approach is what we're going to be doing which is aligning to the reference genome. So which alignment strategy out of those three is best or which should I use? DeNovo assembly, you're going to have to use this if your reference genome doesn't exist like we talked about. So you don't have an option really. If there are complex polymorphisms, mutations, haplotypes that might be missed by comparing to the reference genome an assembly-based approach might be a good idea. Aligning to the transcriptome. You don't see this being done as much anymore. It was something we recommended for short reads like less than 50 base pairs which you don't usually see nowadays. Aligning to the reference genome is kind of the default position. If you have a reference genome, that's probably what you're going to do. Which read aligner should I use? So this was a figure from a paper that came out a couple years ago and actually each of these rows in this figure represents a different aligner. So you can see there's a huge number of different aligners that have been developed for different applications. The ones in red are the RNA targeted aligners. So there's star up there which is one of the more recent ones and top hat somewhere in here, there. And there's some others. And there are also aligners that are geared towards microRNAs or DNA-based aligners or bisulfite aligners. So you can go here, I think they actually try to keep this up to date and you can kind of see what all the different options are and read about each one. An obvious question is should I use a splice-aware or an unspliced mapper? So a non-splice-aware mapper or aligner would be something like just straight BWA. It's not going to attempt to align your reads across large gaps like tens of thousands of bases which you will have in spliced RNA data. That's just the reality of RNA-seq data that they might span and generally will span large introns because the fragments being sequenced in your RNA-seq library represent messenger RNA. Typically. And therefore the introns have been removed. So because we usually are aligning these reads back to the reference genome basically the answer to this question is yes, you should use a splice-aware aligner. We did have this caveat that if your reads are really short you might not want to use a splice-aware aligner or it might not work that well but hopefully this is not going to be a problem for you. So here's some examples of ones, top hat and star which we're going to try in this course. Map splice, splice map. There's a whole bunch of other ones on that last figure that you could consider using. Probably for most purposes top hat or star will work for you though. Bow tie and top hat. So top hat is actually kind of a wrapper for bow tie which is not a splice-aware aligner. So top hat kind of takes bow tie alignments and then makes them splice-aware. It requires a reference genome. It basically breaks the reads into little pieces. It uses bow tie to align those little pieces and then it extends those alignments from seeds and resolves exon edges or splice junctions. And you can read the Trapnel paper from 2009 which goes into great detail and you can follow the explanation of how the guts of bow tie work. Another common question is should I allow multi-mapped reads? So these are reads that basically map unambiguously or ambiguously to more than one location in the genome. In DNA analysis it's common for mappers to sort of randomly select alignments from a series of equally good alignments but this is not really a common or ideal practice in RNA. So if you're doing variant calling you might just allow multi-mapped reads where the same read maps equally well to multiple places in the genome. But if you're using the pipeline we're talking about today which is the tuxedo pipeline with top hat and cuff links you should definitely allow multi-mapped reads because it's basically assuming that that's the kind of alignments that you're going to have. And also for gene fusion discovery you want to allow multi-mapped reads. So when you run this aligner on your FASTQ data we've looked at just the raw sequence data right where you just have your sequence nucleotides at the top and then your quality string. We're going to run that through the aligner and we're going to get out a new format of file and it's quite a bit more complicated than the FASTQ file. And that is a SAM or BAM file. SAM stands for sequence alignment map formats and BAM is just the binary or compressed version of a SAM file. So when you uncompress and look at the contents of a BAM file it looks exactly like a SAM file because it is a SAM file it's just been compressed. And they compress it because of space. So the SAM files and even the BAM files are quite huge. You're taking all your sequence data and you're just adding even more information about how it aligns the genome, maybe it aligns multiple places, maybe the alignment is really complicated. So to represent all that information the BAM format can be quite complicated. And if you find yourself with a BAM file and you want a SAM file or vice versa there are simple tools that allow you to convert these back and forth and there's a BIOSAR question again that you can go and read about that. So this is just an example of what a SAM BAM file looks like. And I've broken it into two parts. You have the header, so every SAM BAM file will have some general information at the head or top of the file. And then it gets into the alignment part of the file where basically per row you have sequence alignments. And we're going to go through each of these sections in a little bit more detail. If you really want to understand the SAM format you should read the specification. So there's a detailed description of the SAM format, what every column stands for, how things are represented. These few slides that we're going over here is really just a high level introduction to the SAM format. There's a lot of details and complexity that can be encoded in a SAM file. So I encourage you to read the documentation if you want to really understand your aligned data. So like I said, there's two sections, the header section and the alignment section. The header section does things like describe the source of the data, what your reference sequence was, what your method of alignment was. Like it'll have maybe the name of the program you use to align with. Kind of keeps a record of how that alignment file as a whole came into existence. Then there's the alignment section which describes the reads, the quality of the reads and the nature of alignments of those reads to the particular region of the genome that they aligned to. BAM, like we talked about, is a compressed version of SAM, compressed using the lossless BGZF format. There are other BAM compression strategies that are being developed. So there's a new format coming that you might have to deal with in the next couple years called CRAM, which is just an even more compressed version of BAM files to, again, save more space. Because as sequencing gets cheaper, these sequence files just keep getting bigger and bigger. And like we talked about in the very first lecture, the disk storage space isn't coming down fast enough to accommodate it. So we're having to think of more clever ways of compressing the data to more efficiently store it. So that's the CRAM format that's under development. BAM files are often indexed. So just like you took your reference genome and we created an index for it, you can take your BAM file and create an index, which just gives you a faster way of looking up alignments in the file. So this BAM file is going to be huge, and if you want a particular alignment, say, from chromosome 10, rather than having to, like, search through the whole file for all of the places where the alignment is to chromosome 10, there's a kind of lookup file, which is the index that tells you how to get to that part of the file quickly. So you'll see sometimes like a .bai file next to the BAM file, and that's a BAM index file. And we're actually going to create some of those because we'll need them to look at the data in IGV. So going into more detail about the header section, remember this is the top part of the BAM file. As I said, this is used to describe the data, the reference sequence, the method of alignments. Each section begins with the app character, followed by a two-letter code, which are followed by two-letter tags and values. So you have, for example, the at-HD, which tells you the header line, and that will give you things like the format version. So what version of the BAM spec you're looking at. The sorting order of alignments. So there's different ways that you can sort a SAM or BAM file, and this will tell you what way this particular file has been ordered. The reference sequence dictionary. So this section tells you what the reference sequence names are. In most cases, these will be like the names of your chromosomes. Regroup. So there's lots of different information you can encapsulate in these regroups about, for example, what library the reads are coming from, what sequencing center, what sample name. All kinds of sample details can be included in a regroup. So that if you have a BAM file that includes, say, alignments from more than one different sample, you have ways of separating out the reads. Program name. This could be information, again, about the aligner. So this might say like top hat version 2.1.2 or whatever. So that's the header section. Then there's the alignment section. So remember the alignment section is kind of hard to make out, but it's basically one row in the file just tab separated a whole bunch of different fields. I've broken those down into more of a tabular format, so they're a little bit easier to read. So the first column is the Q name, which is the query template name. So this would be basically like the read. So your read name. You then have this flag, which is a bitwise flag. We're going to go through this in detail. It's basically a short number that encapsulates a whole bunch of information about the alignment in a really space-efficient way. You have the reference sequence name. So this is your reference sequence. It could be a chromosome name. You have basically the start coordinate, the one base leftmost mapping position, the mapping quality. So this is a string of mapping quality symbols. You have a cigar string. Again, we're going to go through the cigar string in detail, but it's kind of like this short series of what looks like kind of gibberish until you learn how to interpret it, that again in a very space-efficient way tells you details about how the alignment, what happened with the alignment. Is it a perfect alignment? Is it a spliced alignment? Are there mismatches? Things like that. Then you have r-next and p-next. So this is the name of the reference sequence and position of the mate. So if you have paired-end reads, you have the start position of read one, and then it will tell you where did read two go, what chromosome and what start position was that on. The template length, the sequence itself, so that's just reproducing the actual nucleotides, and then the quality string. Sorry, so the quality string is just reproducing again the per-base quality, and the mapping quality or map queue is giving a score for the overall alignment for that read. So remember there was this concept of a flag. These are really hard to understand until you've kind of played around with them. So there's this web resource that's a really nice tool for understanding them, and we'll click on that in a second and go and play around with some examples. But basically it's an 11-bit-wise flag describing the alignments. They're stored as a binary string of length 11. This is comparable to having like 11 columns of data, where for each column you had a one or a zero that indicates something about the sequence alignment. A value of one indicates the flag is set. So here's just an example, 001, 0000, is a particular bit-wise combination that corresponds to one flag ID, which tells you a specific thing about that read. Each of these numbers can have a 001, and if you consider all combinations, you basically have zero to 2047 possible combinations of zeros and ones here, and these zero to 2047 are the numbers that basically go in that bit-wise flag spot in the band file. And you can use that to specify that something's required or to filter flags using SAM tools. So for example, you have a BAM file and you want to extract everything from that BAM file that has certain attributes like, I don't know, the segment is mapped or unmapped. You can determine what the flag is for that and then filter on that flag. Say, give me only alignments that have this flag or don't have this flag. So if we go to this explain flags website. Okay. So if you go to that link explaining SAM flags in plain English, you can put in any number between one and 2037 and it will tell you what combination of features are true about that particular flag. So for example, just one means that the read is paired. And 200, the mate is unmapped. It's the first in pair and the second in pair, whatever that means. So you have to think about the logic of all these combinations. So if you see one of those numbers in your BAM file, you can always go to this website, put in the number and try to learn what it is about that alignment that that flag is supposed to be telling you. So we're going to do some exercises later where we filter some BAM files based on flags and hopefully that will make a little bit of sense that worked. Okay. So to make things extra complicated, not only can you represent these flags as the string of zeros and ones or as a number from zero to 2047, but there's also these like hexadecimal representations just to maximize confusion. The other thing, so I highlighted two of the columns here, flag, which we just talked about and cigar strings. So just to briefly go over cigar strings. Similar to the bitwise flag where we're just trying to represent information about the alignment in great detail with as little space consumed as possible in the file. So the cigar string is a sequence of base lengths and sort of operations that are used to indicate which bases align to the reference, either match or mismatch are deleted, inserted, represent introns and so on. So for example, a cigar string might look like this. It might be 81m, 85, 859, and 19m. And when you look at these operations, you can interpret that as 100 base pair read consisting of 81 bases that match of alignment to the reference, 859 bases that are skipped as an intron, and then 19 bases of further alignment. So you could visualize that alignment in something like BLAST and you'd probably understand what it means. This is just like a very efficient texturing way of representing that alignment. Does that make sense? So there's going to be one of these for every alignment in every row of your BAM file. A simpler and equally important format, a file to know about is the bed format. So we're going to use this very briefly as well. When you're working with BAM files, it's very common to want to extract or examine some subset of the reference genome, for example, the exons of a particular gene. And a common way to do that is to create a bed file that specifies the coordinates for those regions that you're interested in and put it in bed format. And then there are tools that allow you to manipulate BAM files using regions of interest in bed format. The basic bed format is, again, a tab separated file. It's very simple. It's just chromosome name, start and end. And then there's a few other files which are more or less optional. It's important to remember that coordinates in the bed format are zero based and not one based. Does everyone know what that means? Does anyone not know what I mean by that? So there's two common, I would say, almost equally used representations of coordinate systems for sequence. As soon as I draw it, you'll say, oh, that makes sense. But it causes you all kinds of problems in real life. And you have to be really aware of it. So you have a sequence of basis, right? So in zero based coordinate systems, the numbering starts at zero, and it doesn't line up with the basis, it lines up between the basis, right? The one based system, which I'm going to put on the top instead, lines up with the basis and starts at one. So instead of saying A is at position between zero and one, we just say A is at position one, T is at position two, so on. So you see how each of these are kind of equally valid ways of representing the coordinates of a base, but they're obviously different, and they have consequences. So for example, if I want to specify a deletion at this position, I would say something like the deletion is at A from two to three, and then indicates a zero or something to indicate it's been deleted. Whereas using a one based system, I would just say, maybe sometimes it's convention to put the start and end, but they'll be the same, position three, three is deleted. There's a BioStar post about this with diagrams that I created. So if you search like in BioStar for zero based versus one based, there's a whole post just diagramming this out and explaining the consequences of representing a single base, a range of bases, an insertion, deletion, and so on. Just trying to indicate the numbers between the bases. So for me, just as a human brain, I find this one based system easier, like okay, this is the first base, this is the second base. I have the number, I know which base it is. But for a lot of mathematical operations, it turns out that it's much easier in coding to have the zero based system. But they're about equally used. So for example, the UCSC genome files are usually zero based. I think the UCSC browser displays things in one based, or no, actually I think it may be a zero based. IGV is one based. Ensembl tends to be one based. It just depends. You have to be really aware, like if you've got some data, like a variant file, it could be one based or it could be zero based. And occasionally it'll be a mix because people have kind of bunged things up and then it's really kind of annoying. Bed files should by definition be zero based. That's part of the bed file spec. So usually if you're getting a bed file, if you download a bed file from UCSC, you can be confident that it's in a zero based coordinate system. But you may want to intersect it with another file that's not zero based, so you might have to do a conversion. And there are tools to help you do that. So manipulating SAM and BAM files and bed files, there are quite a few tools to help you do this. Probably the most popular SAM and BAM file manipulators are SAM tools, which we installed, BAM tools, and the card which we installed. And then for manipulating bed files, you have bed tools and bed ops are two popular ones. And bed tools was what we had as a practical exercise to install. And these things, these tools can just do all kinds of things for you. Like let's say you want to add some flank to some sequences. Maybe you have a bed file representing all of the exons in a gene or all the primers you designed for an experiment, and you want to add some flank to it, or you want to see if they overlap with some other coordinates. There will be all kinds of bed tools commands that let you do that sort of thing. So how should I sort my SAM or BAM file? Sorting is something that comes up all the time. Most of the tools that take a SAM or BAM file as input will reflect that your BAM file will be sorted in one way or another. The most common is probably sorted by position. So they'll be sorted by chromosome and then start and then end position. So you'll have all the chromosome one alignments first and then the early chromosome one alignments before the later chromosome one alignments. That is typically for performance reasons. So remember when I said we usually want a BAM index file. Before you can create an index file, you usually have to sort it. So you would sort an index. That's the order of operations. But some tools require the BAM to be sorted by read name. So basically it's sort of alphabetically by the read name and it's always read one then read two of the pairs of reads. This is usually done, for example, for things like SV or fusion detection where we're really interested in how the read pair alignments correspond to each other. Like are the two reads aligning sort of as expected within the distance of the fragment size you expect? Or is one read aligning to this chromosome and the other read aligning to another chromosome or somewhere really far down the chromosome much farther than you would expect that might possibly indicate some kind of like large scale deletion or something like that. So that's why you would usually use a read name sorted BAM file. Visualizing. So we're going to do this later. We're going to pull up IGV. So when you guys are trying out IGV, we're going to do our alignments and we're going to load our alignments into the browser. And this is what you're going to see if it works, which it probably will. So just to orient you to the way that IGV works. Has anyone used IGV before? One or two people? Not many. Okay. So IGV, it's similar to the UCSC browser or the ensemble browser if you've used a genome browser. I guess lots of you probably have used UCSC browser, right? Has anyone used that? Well, still some people. Okay. It will have a representation of the chromosome at the top. For human there's an ideogram. I don't know about all other species, but other species probably have ideograms as well. There's some controls along the top that let you do things like prevent automatic pop-ups or control how pop-ups appear. There will be a little red tick that shows you where in this chromosome your view is currently centered. And you'll have maybe a track that indicates coverage, which is an estimate of the depth of sequence coverage or the number of alignments you have at any given position. And then you have this read track. So this is showing your reads, basically all your short 100 base pair, two-by-100 base pair reads, and where they aligned and how they aligned in the genome. So IGV is really a browser designed for next-gen sequence data for short read alignments. At the bottom you have your gene track. So this is just your typical representation of a gene. You have your UTR and then your exons and introns interspersed. And you can see, because this is RNA-seq data, that for the most part you're getting alignments with the exons of this gene. There are the odd alignments that do appear in the introns. And there are also these thin lines separating alignments indicating that that's one alignment that's actually spanning from one exon to the other. So these are exon-exon junction spanning reads or spliced reads. The colors can indicate different things depending on the option so you can right-click on this track and say color by, and there's a whole bunch of different options. Right now they're colored by strand. So this is actually stranded data. So all of the reads for this transcript are on one strand and there are some other reads for some unannotated transcripts on the other strand. IGV is not the only viewer like this. There's actually quite a few competitors. Again, there are BioStar questions talking about them. In some of the CBW courses we've sometimes covered Savant, which is a very similar browser to IGV. It seems to perform about the same, but there are a bunch of others. We're just going to review IGV in this course. And the last thing we're going to try and do is actually count reads for a particular variant allele. So again, this is an IGV screenshot. We're zoomed in a bit more so you can see each read much more closely. You can't actually see the whole read for a lot of these. So this is part of a read and that's spilling off the screen. And the way IGV works is each bar represents a read and they'll be colored gray or a solid color like blue or red depending on how you have the coloring. And where there are individual bases that are contrasted is where you have non-reference bases. So here there's no A, Cs or Gs or Ts represented because these reads are perfectly matching the reference genome at all these positions except for this one position where about half the reads have a T instead of a C. So this is maybe a mutation if this is a tumor sample or maybe it's just a common polymorphism in the human genome and a common thing to want to know is basically well how many of the reads support the T and how many of them support the C. You can learn things from that like about the purity of your sample whether you think it's a heterozygous, polymorphism versus homozygous all kinds of things can be learned from knowing the frequency of the different alleles. So we're going to show you how you would go about counting reads in an automated way and that's basically just a software way of doing what you could do manually by eye which is say okay I have a variant here it's a C to T there are 25 reads covering this position 12 of them represent the C and 13 represent the T therefore I have a variant allele frequency of 12 over 25 which is 48%. And so maybe that's a heterozygous allele or variant. Obviously you can't do that manually so we helped you install BAM read count which is just one of the different software tools you could use to automatically count. So module 2 we are now starting this first phase of this pipeline which is now that we have our sequence data we need to do our alignments. So we're going to align with bow tie top hat to the reference genome and then optionally you can also run the star aligner.