So, module two is all about alignment and visualization of alignments. What's being shown in this picture that's getting reused for each title slide is a cartoon representation of reads, in this case single-end reads, being aligned against an mRNA where the exons have already been spliced together and the introns removed. Really, though, what we're going to be doing is aligning against the genome and then inferring what that transcript looks like. Just to review our learning objectives for the course: we're now in module two, alignment and visualization, so we've already done our intro to RNA sequencing and our input data, and now we're going to take our input data and talk about aligning it and visualizing the alignments themselves. For this module in particular, the objectives are to discuss some of the alignment challenges and common questions, talk about some different alignment strategies, and give a brief introduction to Bowtie and TopHat, which is the aligner we're going to spend most of our time using. Then, unfortunately, we're going to spend some time going over the BAM and BED formats, and I'll warn you in advance that the BAM format in particular is very esoteric. I didn't invent it, so don't shoot the messenger. But you're really stuck with these files; they're very ubiquitous in next-generation sequence analysis, so you might as well get used to them and try to understand them as best we can. We're going to visualize some of our alignments using a genome browser called the Integrative Genomics Viewer (IGV), which is developed by the Broad Institute. It's by far the most popular genome viewer for next-gen sequence data, but there are multiple alternatives, and we'll provide resources to help you find some of those.
Sometimes there might be one that's better for a particular application, or there might be something about it that you like. Then we're going to briefly introduce this concept of BAM read counting. In particular, we're going to use it as a way to determine variant allele expression status. This is for when you want to dig into your alignments and find out, at a particular position, whether there is a mutation or a common polymorphism being expressed, and what that expression looks like. So, some of the challenges of RNA-seq alignment specifically. Probably the easiest one to understand is computational cost: we're starting with hundreds of millions of reads covering the transcriptome, and all of them have to be placed onto the genome to try to understand what transcript they might have come from. There's a pretty high computational cost associated with that. It's common for anyone analyzing any serious amount of RNA-seq data to want to use a cluster or, in our case, the cloud, where we have access to hefty computers; if we had lots of data we could easily fire up many instances on the cloud to churn through large amounts of data all at once, as long as we're willing to pay for it. I've already alluded to this: alignments in RNA-seq in particular are complicated by the fact that the alignment can be spliced. A particular read can span two exons with a big intervening intron sequence in between, so the aligner often has to divide the read into pieces and figure out exactly where those boundaries are, and sometimes there's a big search space to look in, because introns, particularly in species like human, can be very large. This is a common question that gets asked about alignment, with all of these different aligners out there: can I just align my data once using one approach and be done with it once and for all? Unfortunately, the answer is probably not.
There are certain alignment strategies that are more suited to certain downstream applications. For example, when you're doing expression analysis you might want to align a certain way; when you're aligning to find RNA fusions you may need to align in a different way. It's just a very difficult problem to create one alignment that is equally good for all purposes, so unfortunately we sometimes have to pay this computational cost multiple times if we're analyzing the data in many contexts. One follow-on to that: is TopHat the only mapper to consider for RNA-seq data? The answer is of course no. There are many aligners, and on your own time you can check out this Biostars question where this has been discussed in some detail and different people pipe in with which aligners they like for which application. So you can skip having to do those evaluations yourself and just learn from what the community has already learned about how different aligners behave in different contexts. There's a variety of ways of thinking about the RNA-seq mapping strategies that people employ. I like this particular depiction because it's fairly straightforward; it's taken from a review article in Nature Methods a couple of years ago by Nicole Cloonan and Sean Grimmond. There are three strategies that we'll sometimes talk about. The first is de novo assembly. This isn't really alignment per se; this is where you take your reads and don't even think about a reference genome at first. All you're doing is looking at your reads, trying to identify reads that have similarity to each other, and building a consensus out of them: figuring out which reads look like they came from the same sequences, which have overlapping homologous regions, and building up a contig where you've assembled all of these reads into a larger piece.
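To make the overlap-and-merge idea concrete, here's a toy Python sketch. This is only an illustration of the concept under idealized assumptions (error-free reads, exact overlaps); real assemblers use k-mer or overlap graphs and handle sequencing errors. The function names here are invented for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that exactly matches a prefix of read b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the largest overlap into a longer contig."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if i is None:
            break  # no remaining pair overlaps enough; stop merging
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

# Three short error-free "reads" assemble into one contig:
print(greedy_assemble(["ACGTAC", "GTACGT", "CGTGA"]))  # ['ACGTACGTGA']
```

With a reference genome in hand, the resulting contig would then be mapped back to the genome; without one, you'd go straight to downstream analysis such as ORF finding, as described above.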
So you're starting with 100-mers and you might assemble a contig that's thousands of bases long, and hopefully that corresponds to a transcript. Once you've created that contig you can map the whole thing back to a reference genome, if you have one. If you don't have a reference genome, then you would do some other downstream analysis; you might skip right to analyzing those sequences for evidence of open reading frames, to try to understand what proteins might be translated from that RNA sequence. This is a common strategy people use if they don't have a reference genome, or perhaps if they're looking for really novel structures where relying on a reference genome might bias what they find. Another strategy is to align to the transcriptome directly: instead of aligning your reads to the genome and then figuring out what transcripts are annotated in that part of the genome, you align directly to transcript sequences. This is a much smaller search space. For example, in human only about one to two percent of all bases correspond to the transcriptome, so you can collapse things down to just the transcript sequences, with the introns already removed, and start aligning your reads directly to transcripts. The disadvantage of this approach is that it relies on what you already know about the transcriptome. You're biasing your result to some degree towards the known transcripts, and you may miss things: novel isoforms, novel fusions; you may not even notice that there are novel genes there, and so forth.
The strategy that the community has really settled on as the most common is aligning to the reference genome, and this can only be done if your reads are sufficiently long to allow these spliced alignments, which is what's depicted here: you've got enough sequence on either end to anchor on one exon and then figure out where the intron boundaries are and where the next exon continues on. This is the strategy we're going to use, and we'll just talk about the other ones a little bit. I think I basically covered all of this while describing the last slide, but the three use cases are these. You're going to do a de novo assembly if you don't have a reference genome, or if you're looking for complex variation that might be missed by relying on the structure of the reference genome. For example, in the human genome we know there are regions that occur in some humans that are not currently represented in the human reference genome, just because of the people that happened to have been selected for the human reference genome sequencing project. So if you want a really unbiased view of a particular transcriptome, you might be worried that the corresponding part of the genome your transcript is coming from isn't even in the reference, and you might want to start with a de novo assembly for that reason. You would align to the transcriptome mostly if you have very short reads, where you're not able to do spliced alignments, and then you would align to the reference genome in basically all other scenarios. Each of these has a different suite of tools that goes along with it. [Question:] If you're using RNA-seq for a very narrow purpose, just gene expression analysis, could you then use just the transcriptome?
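The decision logic for those three use cases can be summarized in a tiny sketch. The thresholds here are the rough rules of thumb from this lecture (short reads under about 50 bases make spliced alignment unreliable), not hard limits, and the function is purely illustrative.

```python
def choose_strategy(has_reference, read_length, want_unbiased_novelty=False):
    """Rule-of-thumb mapping-strategy chooser, following the lecture's three use cases."""
    if not has_reference or want_unbiased_novelty:
        # No reference, or worried the reference itself is missing your region
        return "de novo assembly"
    if read_length < 50:
        # Too short for reliable spliced alignment across introns
        return "align to transcriptome"
    return "spliced alignment to reference genome"
```

For example, `choose_strategy(True, 100)` falls into the default case used in this course: spliced alignment to the reference genome.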
You can, and some people do that; it's not as computationally expensive, and it works if you're confident that your transcriptome is fairly complete and comprehensive. Human, for example, has a very well annotated transcriptome; all of the major genes have been found by now after decades of work, and it's a lot easier to align to than the genome. But there are some analytical concerns. One of the challenges with that strategy is how to handle alternative transcripts from the same gene. A particular gene may have ten transcripts that use slightly different combinations of exons, or have different 3' ends or polyadenylation sites, and when you align to all of those together you have the issue of reads aligning equally well to multiple transcripts, and there's a counting problem: how do you decide which transcript each read came from? If you're doing a purely gene-level analysis, you can just collapse everything down to the exon content of each gene and align against that. You just have to think about how you create your database of transcripts before you use this strategy, but it can be a quick and simple way to get gene expression estimates, as long as you're willing to accept all of the caveats. [Questions about which transcript database to use, and about 50 base reads:] 50 is right on the cusp; you'll definitely be able to use the tutorial that we're doing, it will work just fine.
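The counting problem just described can be illustrated with a naive even-split scheme: a read that hits N transcripts equally well contributes 1/N to each of them. This is only a sketch of the idea, with made-up input names; tools like Cufflinks use a statistical model rather than a blind even split.

```python
from collections import defaultdict

def fractional_counts(read_hits):
    """read_hits maps each read name to the list of transcripts it aligns to
    equally well. Each read contributes 1/N of a count to each of its N hits."""
    counts = defaultdict(float)
    for read, transcripts in read_hits.items():
        if not transcripts:
            continue  # unaligned read contributes nothing
        w = 1.0 / len(transcripts)
        for t in transcripts:
            counts[t] += w
    return dict(counts)

# r1 hits only T1; r2 hits two transcripts equally well and is split 50/50:
print(fractional_counts({"r1": ["T1"], "r2": ["T1", "T2"]}))  # {'T1': 1.5, 'T2': 0.5}
```

Collapsing to gene-level exon content, as mentioned above, sidesteps much of this ambiguity, because most reads then hit only one "gene" sequence.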
You will probably not get as good results with respect to transcript discovery and alternative splicing, just because 50 base reads are a little bit on the short side. They really recommend more than 50, and really more than 75, before you start to really trust the junction mappings, the reads that span across junctions. But you can still get good gene expression estimates, and you can run the tools exactly the same way; it will automatically detect that your reads are shorter and try to do the best it can. That's one of the nice things about this TopHat suite we're using: a lot of the tweaks to the way things are done are actually built into the tool. It basically examines your data and tries to do what it thinks is best based on the data that you give it. [Question: should you use paired-end reads? Is single-end as good?] There are a number of reasons people tend to recommend paired-end reads. One is just practical: that's what most people are doing, so a lot of people are designing tools with paired-end reads in mind. Another is that the pairing information allows you to take some relatively simple approaches for finding gene fusions, at least conceptually simple. If you want to find fusion genes, and this is a big interest for cancer researchers, you can look for cases where one read maps to one chromosome and the other read maps to another chromosome. It's a very simple concept, and having the paired reads allows you to do that. You also improve mapping efficiency substantially with paired reads: you have more than twice as much information, because instead of 100 bases you have 200 bases, and you also have a prior expectation of how far apart the two reads should be, and all of that information increases the probability that you've placed the fragment in the genome properly when you're mapping it.
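As a sketch of that fusion idea, here's a naive screen over SAM alignment lines for pairs whose mates mapped to different chromosomes. In the SAM format, field 7 (RNEXT) names the mate's chromosome and is '=' when the mate is on the same chromosome as the read. A real fusion caller does far more filtering (mapping quality, duplicate removal, clustering of supporting pairs); this only shows the concept, and the input lines are made up.

```python
def fusion_candidates(sam_lines):
    """Report (read name, chromosome, mate chromosome) for read pairs whose
    mates aligned to different chromosomes. Expects tab-separated SAM
    alignment lines (no header), one line per mate."""
    candidates = []
    for line in sam_lines:
        fields = line.rstrip("\n").split("\t")
        qname, rname, rnext = fields[0], fields[2], fields[6]
        # '=' means mate on same chromosome; '*' means mate unmapped/unknown
        if rnext not in ("=", "*") and rnext != rname:
            candidates.append((qname, rname, rnext))
    return candidates

# r1's mate landed on a different chromosome (a fusion candidate); r2's did not:
pair1 = "\t".join(["r1", "97", "chr1", "100", "60", "100M", "chr5", "500", "0", "ACGT", "IIII"])
pair2 = "\t".join(["r2", "99", "chr2", "200", "60", "100M", "=", "400", "300", "ACGT", "IIII"])
print(fusion_candidates([pair1, pair2]))  # [('r1', 'chr1', 'chr5')]
```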
As for the difference between two paired 100-mers and one single 200-mer, depending on exactly what those lengths are (is it 2 x 50 versus 1 x 100?), you'd probably have to do some experiments to figure out what the trade-offs are. I think in certain circumstances, say for splicing analysis, if you can get a nice long read, even if it's single-end, that might be really advantageous; it might be better than a shorter paired-end read. It just depends. Thankfully, the tutorial we're going to go through requires only fairly mild modifications to work on single-end versus paired-end data, and a lot of it will work for either, I think. The fusion detection will not work, though. There's no reason you can't detect fusions with single-end reads if they're long enough; it just requires a different strategy and different tools. You might have to make your own strategy, which could be quite complicated. And it's not that no one has done this before: people used to do fusion detection with ESTs, for example, which were much longer and were single-end. Okay, so I mentioned that there are lots of aligners, and there really are a lot. This is a summary created by someone at the EBI that places them on a timeline from 2001 to 2013, and each of these is the name of a different aligner, many developed by different groups all over the world, and there are more coming out all the time; this hasn't stopped, it just continues on.
They're color coded according to the kind of data they expect as input: the RNA-seq aligners are pink, blue is DNA, so you can see there's a bias towards DNA; there have been a lot of DNA aligners created, and then a few for microRNA. The one we're going to use, TopHat, is here; it looks like they put TopHat separately and don't have TopHat2 yet. Another one we're going to compare to is STAR, one of the more recent ones. You could spend a lot of time evaluating these aligners; again, I would urge you to probably not spend your time evaluating aligners until you've searched around for other people that have done so. For example, there's a paper that corresponds to this figure, so you can go read all about aligners, and there's a bunch of information on the wiki and Biostars about different aligners you might want to try for different applications. So, I mentioned this briefly: should I use a splice-aware or unspliced mapper? As I've been saying, RNA-seq reads span large introns in some cases, and the fragments being sequenced in RNA-seq represent messenger RNAs, where the introns have been removed, so when we're aligning these reads back to the genome we have to account for the introns that need to be spliced out of the alignment. Generally, if your reads are short, I would say less than 50, and definitely if they're in the 30 to 40 range, you're probably not going to want to use a splice-aware aligner; it's just not going to work that well. If they're greater than 50, then you can go to splice-aware aligners like the ones we're going to use: TopHat, STAR, MapSplice, and so on. So this is just a high-level description of how the Bowtie/TopHat aligner works. This aligner is really two blocks of code. One is the original Bowtie (Bowtie1 or Bowtie2) aligner, which was originally designed for DNA, and then TopHat is kind of a layer on top of that that
repurposes some of the functionality of Bowtie for doing spliced alignments of RNA-seq data to the genome. Basically, what it's doing is breaking reads into pieces, aligning those with Bowtie against the genome, and then interrogating all of those many alignments and trying to piece together what the spliced alignment actually looks like; that's what the TopHat layer does, it piles on top of Bowtie. It starts by mapping all these little pieces of reads against the genome; it looks for islands and tries to define the edges of exons. As it gathers this information it tries to figure out what the splice sites the reads correspond to might look like, and it considers the information for your species: it tries to identify reads that span across exon-exon boundaries and look like what exon-exon boundaries should look like, and it pieces all of this together to resolve exon edges, and you get your splice junctions. Then Cufflinks is going to come along and use that information to try to assemble transcripts out of the alignments. So, some of the common questions that come up when doing these alignments. Should I allow multi-mapped reads is a common one. A multi-mapped read is a read that can be placed in multiple places in the genome, either equally well or almost equally well. There are many regions in the genome that are similar to each other, for example gene families: you may have ten olfactory genes that share a lot of sequence identity, and there are going to be parts of those where a 100-mer is actually identical or nearly identical in one gene versus another. So when you align all your RNA-seq reads against the genome, some subset of reads can't be assigned unambiguously to one single spot; they could go equally well to two, or three, or ten, or a hundred, or a thousand spots. In particular, reads that correspond to repeat elements in
the genome could map to many, many places. In DNA alignment, probably the most common strategy is to randomly choose one of those sites. The most commonly used aligner for DNA reads is BWA, and it will take all of those possible equally good alignments and just pick one randomly, assuming that if you sequence deeply enough and have even coverage you'll get an even representation and it will all basically come out in the wash. For some other applications, particularly certain RNA-seq applications, you don't want to do that; you want to maintain some of that multi-mapped information, because some of the tools may take advantage of it. TopHat is going to allow reads to map to multiple places if they go to multiple places equally well, and then the downstream tools are going to use that information as a measure of uncertainty in the placement of those reads, and they're going to weight that information according to how much multi-mapping occurred. Basically, if a read could go to one of ten places, instead of giving each of those places a count of one, it gives each of those places a count of 0.1, and that's going to affect your expression estimates. There have been some papers showing that with TopHat and Cufflinks, the tool we're going to use to generate expression estimates, you lose information if you don't allow multi-mapped reads, and actually the more multi-mapped information you include, the better the statistical model that Cufflinks uses works; you get more accurate expression estimates if you allow multi-mapped alignments in your alignment file. For multi-mapped reads you actually have to set a threshold, an integer: by default it allows up to 20 alignments for a single read, and it stores that information in the alignment file so that
downstream programs know whether each alignment is unique or whether there are many equally good alignments. But it won't tell you whether any of them are correct. One of the weaknesses of TopHat is that it doesn't really have a great mapping quality score; there is a mapping quality field, but the information encoded in it is not really known for being very useful, not in the sense that BWA's is, I would say. There is a way to access that information by pulling up the detailed read information for an alignment: it should tell you whether the alignment is primary or secondary. I don't know if you can view in the viewer how many alignments there are; that's something we would have to check. The BAM specification document will explain how multi-mapping is encoded, and I think there's a tag that tells you how many multi-mappings were observed for a particular read, but I'm not sure if we can actually get at that information in IGV; it's something we can look into, though. [Question about STAR:] Good question. It is, and they do: they have the tools downstream of TopHat in mind for STAR, so they have recommended running guidelines to produce output that can be fed into Cufflinks, so that you can use STAR alignments instead of TopHat alignments. It definitely does allow multi-mapped reads, but I'm not sure if the default behavior is the same, or whether they're represented the same way in the BAM file. But we're going to generate STAR and TopHat alignments on the same data, so we can take a look at that. It's difficult to compare; I haven't done so systematically. STAR is quite recent, so we have only started using it quite recently, so the stuff you're doing in this tutorial is fairly cutting-edge in that regard. The only anecdote I can really give, other than the two papers' claims (of course they each claim their own greatness), is that I saw a poster at AGBT this spring where an author of
another aligner, I think from Rome, had done his own independent evaluation of STAR, TopHat, and five or six other RNA-seq aligners, and even though he was the author of a third aligner, he said STAR was the best. That was fairly telling, I thought, but I think it will probably be context-dependent to some degree; I doubt there's going to be an easy answer. What does seem clear is that you get very similar alignments in a much shorter amount of time than TopHat. It's definitely faster, at the expense of using much more memory: you consume about 10 times as much RAM, but you get your alignments up to 10 times faster, and the alignments themselves appear very similar. We're going to do some comparisons of the output expression values and things from both, and you'll see that they're quite comparable. [On filtering genes:] There are different strategies. It's fairly common when using Cufflinks to provide a list of genes, in the form of a GTF file, that you want to disregard when it comes to expression estimation. We do that at WashU, and it includes not just pseudogenes but ribosomal and mitochondrial genes as well. It all depends on how much you want to limit your output to particular genes versus taking a global picture. Okay, so now we're going to get into the nitty-gritty details of what the output of these aligners looks like. You're going to hear these words a lot: SAM and BAM. What you get out of these aligners, and almost all aligners use this format now, is the SAM format, which stands for Sequence Alignment/Map format. BAM is basically the same thing; it's just a compressed, binary version of the SAM file. Keep in mind that when you have a BAM file you can't just view it in a text viewer. These files are going to be massive; don't ever try to open them in WordPad or something, you'll be disappointed with the outcome. But there are lots of tools for handling the compressed versions of them, and it's common to need to convert between
SAM and BAM, because some tools support the compressed version and some tools don't. If you want to learn about that, I've posted a useful Biostars posting all about the topic, but for now let's get straight into looking at the file format itself. This is not meant to actually be read; it's just to give you a quick taste of what we're going to be getting into. It's basically a screenshot of a SAM file. What I'm showing here is the header of the file, which has various entries describing the sequences the reads were aligned to and so forth, and then the alignment section. Each SAM file has a header section that explains details about the file, and then the main body of the file is the alignment section containing the actual alignments, stored in this very crazy-looking format. You can't really read it here, but we're going to look at the pieces of those alignment records in much more detail. So if you want to read the full specification, every time you want to know some really nitty-gritty detail about what the abbreviations etc.
in these files mean, you can look up the specification: a document devoted entirely to explaining what these files are all about. As I said, a SAM file consists of two sections. There's a header section that describes the source of the data, the reference sequence you aligned to, the method of alignment, and so forth. If you get data from someone else that's already been aligned, you can learn a lot about where that data came from, what the data type is, and how it was aligned just by examining the header. Then there's the alignment section, which is used to describe the reads, the quality of each read, the way a read aligns to a region of the genome, whether there are any variants in the read relative to the genome, and so on. As I said, BAM is a compressed version of SAM. The compression is the lossless BGZF format: when you compress you save a lot of space, and when you decompress you have all of the information you started with. There is no information loss at all, as there is with, say, an mp3 compressed audio file. There are, however, other BAM compression strategies that aren't necessarily lossless; this is a whole field of research, trying to make these files more efficient by compressing them and sacrificing some information in order to reduce the file size. A lot of computational tasks dealing with these files are not limited by how fast your CPUs are; they're limited by how quickly you can read and write to disk, just because of the massive size of these files, so there's a lot of interest in whether we can reduce the size of the file without sacrificing too much of the information that's there. Just as with our reference genome file, we usually index these files because they're so big: we want to create an index that helps us find things in these massive files without having to scan through from beginning to end every time, and this allows us to rapidly retrieve alignments that overlap a specified region, for example, without going through the whole file. In order to do this indexing, you have to first sort the BAM, and the sorting is done according to the position of the alignments; I'm going to talk more about BAM sorting in a bit. The header section, to give a few more details: as I said, it's used to describe the source of the data, the reference sequences, and the method of alignment. Each header line begins with an @ symbol. There's a header line that tells you things like the sorting order of the alignments; there's a reference sequence dictionary that describes all the sequences you're aligning to, usually the chromosomes of the genome you've aligned all your reads to; there's read group information, since you may have reads belonging to different samples that have been mixed together, or different libraries from the same sample, or reads aligned with different parameters and so forth, basically a way of identifying subgroups of reads within the file; and then there's a PG section that describes the program that was used to do the alignments. The alignment section itself consists of 11 basic fields, starting with the query template name; there's the flag, the reference name, and so on. We're going to look at some of these in detail, so I won't spend a lot of time on this, but here are the most important pieces of information. You have a read name; you have some flags that describe the alignment, and we're going to talk about those in detail; the reference sequence name, like chromosome 1 or chromosome 2; the position on that chromosome, so for example a read aligned at position 1,520 of chromosome 1, and that would be read one of a read pair; a mapping quality value, and how useful this is will depend on which aligner you used; a CIGAR string that describes the nature of the alignment, which we're going to talk about on a separate slide; and then there's information about the mate if you've got
paired-end reads: where the mate aligned, and what the length of your fragment was likely to be. And then you've got the actual sequence of the read and the base qualities of the read, so basically all the information about the read itself is still maintained in this file. It's not just the alignments; it actually contains the raw read sequence data, and for that reason some people are actually stopping using FASTQ files at all and are just storing all of their read information in BAM format, even if their alignments are not there. So you can have a BAM file that contains reads but actually has no alignments in it; it still follows the BAM file specification, which allows it to be compressed, and this has just become a currency for moving and sharing read data as well as alignments. Many aligners will actually take a BAM file as input and spit out a BAM file as output. What I'm showing here is just example values for each of these fields. Here's a read with its name; the flags I'll have to explain separately, they're quite confusing; it aligned to chromosome 1 at position 11,623 with a mapping quality of 3; the CIGAR string is 100M, and I'll explain what that means; this equals sign means that both reads of the pair mapped to the same chromosome, and the second read, as you can see, mapped just a little bit further downstream; this tells us that our fragment, the distance from the beginning of read one to the end of read two, is about 220 bases; and then we've got the sequence and the quality string. Okay, so what are these flags all about? The best way to wrap your head around these flags is actually to visit this website; you can search for "SAM flags explained" or "BAM flags explained", and this will usually be the top hit if you Google it. Basically, there are eleven flags that
describe various features of the alignment, and at each of these bit positions a value of one indicates that the flag is set, that the thing is true, and a value of zero indicates that it's not. You can take all combinations of zeros and ones and represent them as a single number, from 0 to 2047, that is, up to 2^11 minus 1; it's a binary format for storing a set of yes/no values in one number. These flags are used on every line in the SAM/BAM file, and when you're viewing these files you can filter: you can require that certain flags be set, or you can filter out alignments that have certain flags set, and this allows you to identify reads that meet certain criteria. For example, you might want to grab all of the reads in your BAM file that are unmapped for some reason, and you can do that by specifying the flag that corresponds to "segment unmapped". We're going to do a demonstration of how to go through that, and how to use this website, in the tutorial. Okay, so the other thing that requires extra explanation is these CIGAR strings. I showed an example that was 100M: you have a 100-mer and the CIGAR string was 100M, which basically means that there were 100 bases that were either matches or mismatches, one contiguous alignment of 100 bases. The CIGAR string is basically a way of describing deviations from a simple alignment: if there are insertions or deletions, or there's an intron, it's a compact format for representing what that alignment looks like in a simple string. So what this one (81M859N19M) says is that there were 81 bases aligning as a match to the genome, then 859 with an N, where N means skipped, so 859 skipped bases, and then the alignment continues on for another 19 bases. Really what this means is that there are 81 bases matching, probably up to the edge of an exon, then there's an intron of 859 bases, and
then the last 19 bases of the alignment continue on. So this is one of those spliced alignments that we've seen in some of the cartoon depictions, and these can get very hairy when there are insertions or deletions relative to the reference; all of that can be described in a CIGAR string, and they can get very long. There are many tools that parse these strings and use the information in them to do various things. The M is basically trying to say that a sequence can align even if the actual sequence is slightly different: say I have 50 bases and they align to a region in the genome, but in the middle of that the reference might be a T while my read has a G, so there's actually a mismatch there. The M actually means match or mismatch; there are 50 bases and the alignment is 50 bases long, with no insertions or deletions, it's just that one of those bases differs from what we expect. And that is the cause of great confusion with these CIGAR strings, particularly when people are looking for mutations or variants, because you think 100M must mean match, which would mean all the bases match the reference, but it doesn't; there can be mismatches. Okay, so one additional file format that we need to know something about, and thankfully this is a much simpler one, is the BED format. The reason we're introducing it is that when we're working with BAM files it's very common to want to examine a focused subset of the reference genome. For example, say we want to look at the exons of our favorite gene, EGFR, and we want to pull out information from the BAM file for just those exons. A good way to do that is to create a BED file that represents the coordinates of those exons. The format is explained here, and all of these BAM manipulation tools will accept a file of regions of interest in BED format, which allows us to pull out information just for those regions. The basic format is very simple: you just have
chromosome name, like chromosome 1, and then start and end position; these are the coordinates of the start and end of the region you're interested in. The one thing to remember about BED format is that it's zero-based, so the first base of the chromosome is 0, not 1; otherwise it's fairly straightforward. There are a number of tools that are commonly used to manipulate SAM/BAM and BED files; these are some of the most popular, and we're going to play around with a few of them, so I won't go into more detail now. Sorting, another common question: how should I sort my SAM/BAM file? You have all of these alignments, and they're generally not in random order as you go through the file; they're usually sorted according to the name of the read or the position of the alignment, and you'll see sorting done in those two ways depending on what you're doing. Generally they're sorted by position for performance, because this allows us to scan through a BAM file quickly and find a location very rapidly, to identify duplicate alignments very efficiently, and so forth. But certain tools require that the BAM be sorted by read name, usually when we want to be able to easily identify both reads of a pair; we want them close to each other in the file, for example when we're looking for reads that might be involved in fusions. If one read of a pair maps to chromosome 1 and the other maps to chromosome 3, and the file is sorted by position, those two reads will be very far apart in the file; if we sort by name, they come back to being right beside each other. So that's really about computational performance when scanning through these big files. Here's a visualization in IGV of some RNA-seq data. We're going to be using this browser ourselves and playing around with it, but this is just for reference, and also to give you a sneak preview of what some of the features of this browser are. So
shown at the top here is an ideogram of a chromosome with the banding pattern. As you mouse over various regions of the BAM file, information will pop up; sometimes this can be annoying, and this little button allows you to toggle that pop-up on or off. This little red bar indicates where on the chromosome you're currently viewing the alignments; we're looking at a very small region, actually a single gene, and that gene is located right there. Often you have two areas to each track: a pile-up area and an individual-reads area. The pile-up area is basically a histogram, where the higher the gray bars, the more reads are piling up there; in the other panel you're seeing individual reads depicted one by one, though it's sometimes hard to see them because there are so many packed in together. Here are some examples of single reads that are not spliced, just aligning to a large exon, and these reads here with the little blue bars connecting them are examples of spliced reads. You can color the reads in different modes, but in this mode reads that came from the positive strand are pink and those that came from the negative strand are blue. This is only relevant information in a library where the strand information was maintained; remember we talked about library strategies where strand information is maintained and others where it's not. In an unstranded library you will see just a mixture of pink and blue reads scattered all over the place in approximately equal proportions; if the library is stranded, you'll see them all appearing to come from a single strand. (This may actually be captured data. Yeah, that's a good observation.) The coloring can represent different things; sometimes low-quality bases or mismatching bases are colored, and those will appear as darker or different colors. At the bottom here we have a gene track. This
tiny little thing tells you the coverage scale, so you sometimes want to pay attention to that scale to know how much depth is actually here, and I think that's everything. So, I mentioned there were alternate viewers; there's a whole bunch of them, and this is just provided as a reference if you want to try some of the others out. There are a couple of Biostar questions there that maintain fairly current lists of the alternatives. And then the last thing that I want to talk about is this idea of BAM read counting, again just as a primer for the tutorial. What I'm showing here is a really zoomed-in view in IGV of some reads from an RNA-seq library, where each of these blue bars is an individual read and we're looking at, say, a 50-base window of the genome. The reference genome bases are shown along the bottom here, so it's CCAAGCGG and so on, and within the field where the reads are shown, by default the bases are not labeled unless they differ from the reference genome. So what's being shown here is that you've got several reads that have a T where the reference genome has a C at that position; some of them are labeled T, and the others are not labeled, meaning those are the same as the reference genome. Basically you have a mixture of C's and T's being reported at this position, and in this case that is actually a mutation. This is a tumor sample, this RNA came from a tumor, and you have a mutant position in the genome that's heterozygous; that's why we're seeing approximately 50% T bases and 50% reference bases. At the top here you have this colored bar; it's hard to see, but it's sort of half blue, half red, and that indicates it's about a 50% count for one versus the other. In this case you've actually got 12 variant reads out of 25, so you get really close to a 50% variant allele frequency. So this is sort of an intuitive way of thinking about counting the allele observations for a particular position in a BAM file, but of course this would not be practical to do at thousands
of positions in the genome, so we need a way to do this automatically, and bam-readcount is a tool that we're going to use to do this kind of counting without having to look in the IGV viewer; we're just going to extract this information automatically. Okay, so now we're going to move on to the tutorial. Again, I'm going to show this summary: in the last section we covered our input formats, and now we're going to cover the sequence data, specifically focusing on this read alignment step for module two. Barcodes are usually separated very early, usually before you even have a FASTQ file; this has already happened on the instrument, and even the computer that's attached to the instrument will often do the demultiplexing. Yeah, no, TopHat will assume that your data has been demultiplexed already, and that is really a fundamental next-gen sequencing question that should be addressed by the software that gives you your data off the instrument; you really shouldn't have to handle that problem yourself. And if you're downloading data from the internet or from a public repository, it's highly unlikely that it will still have indexes; usually that information is removed upstream. Okay, so again we're just going to briefly go over what we're going to do in this tutorial. We're going to run Bowtie/TopHat2 or STAR with parameters suitable for gene expression analysis. We're going to use samtools to demonstrate the features of the SAM/BAM format and basic manipulation of these alignment files, so we're going to do things like view, sort, index, and filter. We're going to use IGV to look at some RNA-seq alignments and view a variant position, just like that screenshot I showed you. We're going to determine BAM read counts at a variant position without having to manually look in IGV. And we're going to try to understand the flags and the quality of BAM files by using tools like samtools flagstat and SAMStat.
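As an aside for these written notes (my own sketch, not part of the course materials), the flag bits described above can be decoded with a few lines of Python; this is essentially the lookup that the "Explain SAM flags" website performs:

```python
# Decode a SAM flag integer into its named bits (per the SAM specification).
# Illustrative only; samtools and the "Explain SAM flags" site do this for you.

SAM_FLAGS = [
    (0x1,   "paired"),
    (0x2,   "proper_pair"),
    (0x4,   "unmapped"),
    (0x8,   "mate_unmapped"),
    (0x10,  "reverse_strand"),
    (0x20,  "mate_reverse_strand"),
    (0x40,  "first_in_pair"),
    (0x80,  "second_in_pair"),
    (0x100, "secondary_alignment"),
    (0x200, "fails_qc"),
    (0x400, "duplicate"),
]

def decode_flag(flag):
    """Return the names of the flag bits that are set in `flag`."""
    return [name for bit, name in SAM_FLAGS if flag & bit]

# Example: flag 99 = 0x1 + 0x2 + 0x20 + 0x40, a properly paired
# first-of-pair read whose mate is on the reverse strand.
print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
```

Because each bit is independent, any combination of these eleven flags packs into a single integer from 0 to 2047, which is what you see in column two of a SAM file.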
We already did FastQC on the raw data, so maybe we'll just repeat it on the alignments. Okay, so basically what I just said: these are the sections, and this is the file that we're going to be working from. We're going to start by aligning reads with TopHat (actually this should say four libraries, not eight). Again, we're going to use TopHat for the alignment, we're going to supply a GTF file that we obtained in the previous tutorial, and we're going to supply Bowtie with the indexed genome that we also created in step four. We're going to use the -G option to tell TopHat to look for exon-exon junctions of known transcripts, and on top of that it's still going to search for novel exon-exon junctions as well, so that we can potentially learn about alternative splicing. Since there are (again, this should say four) four libraries in the test data set, four alignment commands are run; we dropped this from eight down to four just to make it go faster, and these alignments take a couple of minutes each, so we'll be able to spend some time watching them do their thing. We already talked about the output files that we're going to get: some SAM and BAM files. Then we're going to repeat this process with STAR, which is conceptually very similar but is an independent program, and we're going to try to compare the runtime to TopHat, along with any additional steps that were needed to do the alignments. Once we have alignments we're going to visualize them, so we're going to create indexed versions of our BAM files, because IGV needs these to allow us to view them efficiently. We're going to try to visualize some spliced alignments, we'll look for some exon-exon junction supporting reads, we'll look at a couple of examples of differentially expressed genes, we'll try to compare the TopHat and STAR alignments, and then we're going to try to find a few variant positions in our RNA-seq data, and then
we're going to test a few tools to create a pileup from the BAM file, and we'll explain what a pileup looks like, but it's basically a representation of what bases were observed at a particular position. Yep. No, they shouldn't be; I mean, if you created a library that allowed non-coding RNAs. Yeah, so that's a good question. The feature file will just be used as a guide: basically what TopHat is going to do is first try to align reads to known transcripts, then align to the junctions of exons from those transcripts, but reads that don't align to those things will still be aligned to the whole genome, and they could correspond to completely novel genes, genes that weren't accurately represented in your GTF. It's just a little helping hand to make the alignment more efficient. So, here's another screenshot of the post-alignment visualization in IGV; in this case we're showing two different samples, a tumor and a normal I believe, and again it's sort of the same features as what I showed before. We're going to use a variety of things to do post-alignment QC: we're going to use samtools view to see the format of the SAM/BAM file, look at some flags, and filter out some types of alignments; we're going to use samtools flagstat to get an overall summary of the alignments; then we're going to run SAMStat on our tumor and normal BAMs and review the resulting report in our browser; and then we're going to use FastQC again, this time to perform QC on the actual alignment files instead of the FASTQ files. Here's an example of the post-alignment QC output from SAMStat; I won't go into that because we're going to look at it first hand. And that's it, I think we're ready to go with the tutorial.
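A written-notes aside (my own illustration, not from the course slides): the CIGAR arithmetic from the 81M859N19M example can be checked in Python. The operator classification follows the SAM specification: M, D, N, =, and X consume reference bases, while I and S do not.

```python
import re

def cigar_reference_span(cigar):
    """Sum the CIGAR operations that consume reference bases (M, D, N, =, X)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in ops if op in "MDN=X")

# The spliced alignment from the slides: 81 matching bases, an 859-base
# intron (N = skipped region), then 19 more matching bases.
print(cigar_reference_span("81M859N19M"))  # 959 reference bases
print(cigar_reference_span("100M"))        # 100: one contiguous alignment
```

Note that an insertion like 50M2I48M still spans only 98 reference bases, because the inserted bases exist in the read but not in the reference; that is exactly the kind of deviation a CIGAR string is designed to express compactly.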
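One more sketch for the notes, on the zero-based BED coordinates mentioned above. BED intervals are zero-based and half-open (the end position is not included), so converting from the one-based, inclusive coordinates you see in IGV or a GTF means subtracting one from the start only; the function and coordinates below are just illustrative:

```python
def to_bed_line(chrom, start_1based, end_1based):
    """Convert a 1-based, inclusive interval (GTF/IGV style) to a
    0-based, half-open BED line: subtract 1 from the start only."""
    return f"{chrom}\t{start_1based - 1}\t{end_1based}"

# The first 100 bases of a chromosome are bases 1-100 in 1-based
# coordinates, but 0-100 in BED (start 0, end exclusive):
print(to_bed_line("chr1", 1, 100))  # chr1    0       100
```

A handy sanity check is that the interval length is always simply end minus start in BED, with no off-by-one correction needed.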
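And a final toy sketch (mine, not the actual bam-readcount interface) of the allele-counting idea from the IGV screenshot: given per-base read counts at a position, the variant allele fraction is just the variant count over the total.

```python
def variant_allele_fraction(base_counts, variant_base):
    """Fraction of reads supporting the variant base at one position."""
    total = sum(base_counts.values())
    return base_counts.get(variant_base, 0) / total if total else 0.0

# The heterozygous site from the slides: 12 T reads out of 25 total,
# with the reference base C making up the rest.
print(variant_allele_fraction({"C": 13, "T": 12}, "T"))  # 0.48
```

This is the computation you do by eye in IGV for one position; a read-counting tool simply repeats it at every position of interest across the BAM file.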